Designing Scalable Large-Scale Systems

Oct 8, 2024

Notes on Designing Large Scale Systems

Key Challenges

  • Ensuring system responsiveness under heavy load.
  • Handling massive data volumes and concurrent user requests.
  • Database as the primary bottleneck.

Techniques to Address Challenges

  1. Database Replication
    • Creating multiple copies (replicas) of a database distributed across servers.
    • Ensures high availability and uninterrupted service.
    • Scales read capacity.

Types of Replication

  • Leader-Follower Replication

    • One database acts as the leader (master), others as followers (slaves).
    • Write operations directed to the leader; changes propagated to followers.
    • Read operations distributed across leader and followers to enhance capacity.
  • Leader-Leader Replication

    • Multiple databases act as leaders, each can accept read and write operations.
    • Potential for conflicts between leaders requires conflict resolution mechanisms.

Replication Methods

  • Asynchronous Replication

    • Changes propagated to replicas in the background.
    • Faster but risks temporary data inconsistencies.
  • Synchronous Replication

    • Changes committed to both leader and followers simultaneously.
    • Guarantees consistency but may impact write performance.

Conflict Resolution Techniques for Leader-Leader Replication

  • Timestamp-Based Resolution

    • Most recent update wins.
  • Last Write Wins (LWW)

    • Most recent write regardless of timestamp wins.
  • Custom Conflict Resolution

    • Application-specific rules for resolving conflicts.

Database Sharding

  • Distributes data across multiple servers for horizontal scaling.
  • Splits large tables, locating parts of data in each shard.

Sharding Criteria

  • Data can be sharded based on criteria like:
    • Customer IDs: e.g., users 1-1000 in shard 1, 1001-2000 in shard 2.
    • Geographical Region: e.g., US users in one shard.

Shard Keys

  • Determine data assignment to shards.
    • Range-Based Sharding: Data partitioned based on ranges of the shard key.
    • Hash-Based Sharding: Uses a hash function applied to the shard key for even data distribution, but less efficient for range queries.

Differences in Sharding Implementation

  • SQL Databases: Typically lack out-of-the-box sharding; require custom logic.
  • NoSQL Databases: Many (e.g., MongoDB) provide built-in sharding support.

Summary

  • Replication: Ensures high availability; scales read capacity.
  • Sharding: Enables horizontal scaling; distributes data.
  • Common to use both techniques in combination for optimal performance.