
Lecture Notes on Sharding and Partitioning in Databases

Jul 28, 2024

Notes on Sharding and Partitioning for Database Scaling

Key Concepts

  • Sharding: Distributing data across multiple machines (at the database level).
  • Partitioning: Splitting data into smaller pieces (at the data level).
  • Both techniques are used to improve throughput and availability in database systems.

Course Introduction

  • Cohort-based course on system design, running since March 2021.
  • Encourages collaborative learning through problem statements and brainstorming sessions.
  • Aims to cover trade-offs in design decisions.
  • Course highlights:
    • Week 1: Core foundations and design of online/offline indicators.
    • Week 2: Database logging and scalability examples.
    • Week 3: Distributed systems, such as designing load balancers.
    • Week 4: Designing social networks.
    • Week 5: Building storage engines.
    • Week 6: High throughput systems.
    • Week 7: Information retrieval systems and designing message brokers and task schedulers.
    • Week 8: Advanced algorithms for system efficacy.

Understanding Scaling in Databases

  1. Initial Setup
    • Begin with a small server and MySQL database.
    • Serve a limited number of writes per second (e.g., 100 writes/sec).
  2. Vertical Scaling
    • Increase server resource capacity (CPU, RAM, Disk) as traffic grows (e.g., to 200 writes/sec).
    • Add read replicas to absorb growing read traffic; writes still go to the single primary, so its write capacity is unaffected.
  3. Limitations of Vertical Scaling
    • A single machine can only be upgraded so far; hardware limits eventually cap performance (e.g., unable to exceed 1000 writes/sec).
  4. Horizontal Scaling
    • Distribute load across multiple servers (sharding) to handle higher throughput (e.g., 1500 writes/sec).
    • Each shard can handle a portion of the data, allowing for parallel processing of requests.
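
The idea that each shard handles a portion of the data can be sketched with a small router. This is a minimal illustration, not a method from the lecture: the shard names, the `shard_for` and `route_writes` helpers, and the choice of hash-modulo routing are all assumptions made for the example.

```python
import hashlib

# Hypothetical shard names; a real deployment would hold connection details here.
SHARDS = ["shard-0", "shard-1", "shard-2"]


def shard_for(key: str, shards: list[str] = SHARDS) -> str:
    """Pick a shard for a key using a stable hash.

    hashlib is used instead of the built-in hash() because the built-in
    is randomized per process for strings, which would break routing
    across restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def route_writes(keys: list[str]) -> dict[str, list[str]]:
    """Group keys by the shard that would receive each write."""
    routed: dict[str, list[str]] = {name: [] for name in SHARDS}
    for key in keys:
        routed[shard_for(key)].append(key)
    return routed
```

Because every key maps to exactly one shard, writes for different keys can be processed in parallel on different servers. One known trade-off of plain modulo routing is that changing the number of shards remaps most keys.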

Sharding vs. Partitioning

  • Sharding: Refers to distributing the database across multiple servers.
  • Partitioning: Refers to splitting data into segments.

Example: Managing Large Data Sets

  • Split 100 GB of data into smaller partitions (e.g., 30 GB, 20 GB, and so on) so that each piece is easier to handle.
  • Partitions can reside on different shards or the same shard.
  • Logical partitions enhance data management and allow for load balancing.
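
To make the placement of partitions on shards concrete, here is a rough sketch that assigns partitions to whichever shard currently holds the least data. The partition names, the 25 GB sizes that complete the 100 GB total, the shard names, and the greedy strategy itself are assumptions for illustration only.

```python
def place_partitions(
    partition_sizes_gb: dict[str, int], shard_names: list[str]
) -> dict[str, list[str]]:
    """Greedy placement: largest partition first, onto the least-loaded shard."""
    placement: dict[str, list[str]] = {name: [] for name in shard_names}
    load_gb = {name: 0 for name in shard_names}

    by_size = sorted(partition_sizes_gb.items(), key=lambda kv: kv[1], reverse=True)
    for partition, size in by_size:
        target = min(shard_names, key=lambda name: load_gb[name])
        placement[target].append(partition)
        load_gb[target] += size
    return placement


# Hypothetical split of a 100 GB data set into four partitions.
PARTITIONS = {"p0": 30, "p1": 20, "p2": 25, "p3": 25}

two_shards = place_partitions(PARTITIONS, ["shard-a", "shard-b"])
one_shard = place_partitions(PARTITIONS, ["shard-a"])
```

With two shards the partitions are spread out (50 GB each in this example); with one shard they all live on the same server, which is still a valid partitioned layout.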

Strategies for Partitioning

  • Horizontal Partitioning: Split data by rows/documents.
  • Vertical Partitioning: Split data by columns/tables.
  • Choose the partitioning strategy based on access patterns and load, and keep it deterministic so that the same key always maps to the same partition.
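
The difference between the two strategies can be shown on a tiny in-memory table. This is only a sketch: the sample rows, the column names, and the `id % num_partitions` rule are assumptions chosen for the example.

```python
# Hypothetical sample table, represented as a list of row dictionaries.
ROWS = [
    {"id": 1, "name": "asha", "bio": "writes about databases"},
    {"id": 2, "name": "ben", "bio": "runs the ops team"},
    {"id": 3, "name": "chen", "bio": "builds storage engines"},
    {"id": 4, "name": "dana", "bio": "designs schedulers"},
]


def horizontal_partition(rows: list[dict], num_partitions: int) -> list[list[dict]]:
    """Split by rows: each partition holds whole rows, chosen by id."""
    parts: list[list[dict]] = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row["id"] % num_partitions].append(row)
    return parts


def vertical_partition(rows: list[dict], column_groups: list[list[str]]) -> list[list[dict]]:
    """Split by columns: each partition keeps the key plus a subset of columns."""
    return [
        [{"id": row["id"], **{col: row[col] for col in cols}} for row in rows]
        for cols in column_groups
    ]


by_rows = horizontal_partition(ROWS, 2)
by_columns = vertical_partition(ROWS, [["name"], ["bio"]])
```

Here `by_rows` contains two partitions with two complete rows each, while `by_columns` contains two partitions that both cover all four ids but carry different columns.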

Visualization of Sharding and Partitioning

  • No Sharding/No Partitioning: Basic local database setup.
  • Partitioned with No Sharding: Logical partitioning on a single database server.
  • Sharded with No Partitioning: Multiple servers that each hold the full, unpartitioned data set, as with read replicas.
  • Sharded and Partitioned: Both strategies applied for efficient data handling across multiple servers.

Advantages of Sharding and Partitioning

  • Improved Throughput: More servers can handle more reads and writes in total.
  • Increased Storage Capacity: Storing large amounts of data across multiple servers.
  • High Availability: Backup servers can take over when one goes down.
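
The throughput benefit can be estimated with simple arithmetic, using the illustrative figures from the scaling walkthrough (a per-server ceiling of 1000 writes/sec and a target of 1500 writes/sec). The `shards_needed` helper is hypothetical and assumes writes spread evenly across shards, which real workloads rarely do perfectly.

```python
import math


def shards_needed(target_writes_per_sec: int, per_server_cap: int) -> int:
    """Minimum number of shards, assuming load is spread evenly."""
    if per_server_cap <= 0:
        raise ValueError("per_server_cap must be positive")
    return max(1, math.ceil(target_writes_per_sec / per_server_cap))


# Figures taken from the notes: one server tops out at 1000 writes/sec,
# and the system must sustain 1500 writes/sec.
needed = shards_needed(1500, 1000)
```

In this case `needed` is 2: a single server cannot reach the target, but two shards each carrying roughly 750 writes/sec can.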

Disadvantages of Sharding and Partitioning

  • Operational Complexity: Management of multiple servers can be complicated.
  • Cross-Shard Queries: Joining tables across shards can be slow and resource-intensive, affecting throughput and increasing latency.
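
The cost of cross-shard queries can be seen in a small scatter-gather sketch. The in-memory `SHARD_DATA`, the `user_id % shard count` routing rule, and both query helpers are assumptions for illustration; a real system would issue network calls to separate database servers.

```python
# Hypothetical data, already distributed by user_id % 2.
SHARD_DATA = {
    "shard-0": [{"user_id": 2, "amount": 40}, {"user_id": 4, "amount": 10}],
    "shard-1": [{"user_id": 1, "amount": 25}, {"user_id": 3, "amount": 5}],
}


def shard_name_for(user_id: int) -> str:
    """Route a user to the shard that owns their rows."""
    return f"shard-{user_id % len(SHARD_DATA)}"


def orders_for_user(user_id: int) -> tuple[list[dict], int]:
    """Single-shard query: returns matching rows and shards touched (1)."""
    rows = SHARD_DATA[shard_name_for(user_id)]
    return [row for row in rows if row["user_id"] == user_id], 1


def total_amount() -> tuple[int, int]:
    """Cross-shard query: scatter to every shard, then gather partial sums."""
    partial_sums = [sum(row["amount"] for row in rows) for rows in SHARD_DATA.values()]
    return sum(partial_sums), len(SHARD_DATA)
```

A lookup by `user_id` touches one shard, whereas the aggregate has to visit every shard and wait for the slowest one, which is why such queries add latency as the shard count grows.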

Conclusion

  • Understanding these concepts is crucial for practical applications in real-world systems.
  • In practice, sharding and partitioning are usually combined to build scalable and efficient database architectures.
