Understanding Scalability in System Design

Oct 28, 2024

Introduction to Scalability

  • Scalability is crucial for applications that may experience sudden traffic surges.
  • A scalable system can handle increased loads by adding resources without compromising performance.

Definition of Scalability

  • Core Concept: A system's ability to manage increased load efficiently.
  • Scalability is about cost-effectively extending a system's capabilities, not merely surviving increased demand.

Key Considerations for Scalability

  • Coordination: Adding resources (processors/servers) requires effective coordination.
  • Performance Overhead: Must consider if coordination overhead negates performance gains.

Comparing Scalability

  • Use response vs. demand curves to compare system scalability.
    • X-axis: Demand
    • Y-axis: Response time
    • A more scalable system has a less steep curve.

Limits of Scalability

  • No system is infinitely scalable; each has limits.
  • Tipping point appears as a knee in the response vs. demand curve where performance degrades.
  • Goal: Push the tipping point as far right as possible.

Causes of Scaling Bottlenecks

  • Centralized Components: e.g., a single database server caps the number of simultaneous requests the system can serve.
  • High Latency Operations: Time-consuming tasks that slow down overall response time.

Building Scalable Systems

Key Principles

  1. Statelessness

    • Servers do not hold client-specific data between requests.
    • Enhances horizontal scaling and fault tolerance.
    • For stateful applications, externalize state to distributed caches/databases.
  2. Loose Coupling

    • Design system components to operate independently with minimal dependencies.
    • Allows for specific parts to be scaled without affecting the entire system.
  3. Asynchronous Processing

    • Use event-driven architecture for non-blocking operations.
    • Reduces tight coupling and risk of cascading failures, but increases complexity in error handling.
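
The statelessness principle above can be sketched in Python. This is a minimal illustration, not a production pattern: `SessionStore` is a hypothetical in-memory stand-in for an external distributed cache such as Redis, and the handler and key names are invented for the example.

```python
# Statelessness sketch: session state lives in an external store, not in
# the web server's memory, so any server instance can serve any request.

class SessionStore:
    """Hypothetical stand-in for a shared distributed cache (e.g. Redis)."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key, {})

    def put(self, key, value):
        self._data[key] = value


store = SessionStore()  # in production, shared by all server instances


def handle_request(session_id, item):
    """Stateless handler: state is fetched from, and written back to,
    the external store on every request."""
    session = store.get(session_id)
    cart = session.get("cart", [])
    cart.append(item)
    store.put(session_id, {"cart": cart})
    return cart


handle_request("user-42", "book")
print(handle_request("user-42", "pen"))  # ['book', 'pen']
```

Because no handler instance keeps per-client state between requests, adding or removing server instances (horizontal scaling) requires no session migration.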

Scaling Strategies

Vertical vs. Horizontal Scaling

  • Vertical Scaling (Scaling Up)

    • Involves increasing the capacity of a single machine.
    • Simple to adopt, but constrained by hardware ceilings and increasingly expensive at the high end.
  • Horizontal Scaling (Scaling Out)

    • Adding more machines to share workload.
    • Better fault tolerance and often more cost-effective for large-scale systems.
    • Challenges include data consistency and complexity in distributed systems.

Techniques for Building Scalable Systems

  1. Load Balancing

    • Distributes incoming requests across a pool of servers, directing each to the most capable one.
    • Uses algorithms like round-robin, least connections, or performance-based methods.
  2. Caching

    • Stores frequently accessed data to reduce latency and backend load.
    • Implementing a content delivery network can improve response times.
  3. Sharding

    • Splitting large datasets for parallel processing across servers.
    • Choose sharding strategies based on data access patterns.
  4. Avoid Centralized Resources

    • Centralized components become bottlenecks.
    • Use multiple queues, break long tasks into smaller ones, and apply design patterns that distribute workloads.
  5. Modularity

    • Create loosely coupled independent modules with defined interfaces.
    • Avoid monolithic architectures to maintain scalability and flexibility.
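
The round-robin and least-connections algorithms from technique 1 can be sketched in a few lines of Python. Server names and connection counts here are illustrative, not from the notes.

```python
# Two load-balancing strategies: round-robin (rotate through servers)
# and least-connections (pick the server with the fewest active requests).
import itertools


class RoundRobinBalancer:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.connections = {s: 0 for s in servers}

    def pick(self):
        server = min(self.connections, key=self.connections.get)
        self.connections[server] += 1
        return server

    def release(self, server):
        self.connections[server] -= 1


rr = RoundRobinBalancer(["a", "b", "c"])
print([rr.pick() for _ in range(4)])  # ['a', 'b', 'c', 'a']
```

Round-robin is trivially simple but ignores server load; least-connections adapts when requests vary in cost, at the price of tracking state per server.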
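
Caching (technique 2) can be sketched as a read-through cache with time-to-live (TTL) expiry. The `slow_fetch` backend below is a hypothetical stand-in for a database call; the TTL and key names are illustrative.

```python
# Read-through TTL cache sketch: serve from the cache on a hit, fall
# through to the backend on a miss, and expire entries after a TTL.
import time


class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expires_at)

    def get_or_fetch(self, key, fetch):
        entry = self._entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        value = fetch(key)                       # cache miss: hit the backend
        self._entries[key] = (value, time.monotonic() + self.ttl)
        return value


calls = 0


def slow_fetch(key):
    """Stand-in for an expensive backend lookup (e.g. a database query)."""
    global calls
    calls += 1
    return f"value-for-{key}"


cache = TTLCache(ttl_seconds=60)
cache.get_or_fetch("user:1", slow_fetch)
cache.get_or_fetch("user:1", slow_fetch)   # served from cache
print(calls)  # 1
```

The second lookup never reaches the backend, which is exactly the latency and load reduction described above; the TTL bounds how stale served data can be.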
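
Hash-based sharding (one common strategy for technique 3) can be sketched as follows; the shard count and user IDs are invented for illustration, and real systems often prefer consistent hashing so that resizing the cluster remaps fewer keys.

```python
# Hash-based sharding sketch: a stable hash maps each key to a shard,
# spreading data (and load) across servers.
import hashlib

NUM_SHARDS = 4


def shard_for(key):
    """Deterministically map a key to a shard via a stable hash.
    (Python's built-in hash() is randomized per process, so we use
    sha256 to get the same answer on every server.)"""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Route each record to its shard.
shards = {i: [] for i in range(NUM_SHARDS)}
for user_id in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user_id)].append(user_id)
```

As the notes say, the right strategy depends on access patterns: hashing spreads point lookups evenly, but range queries would favor range-based sharding instead.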

Ongoing Process of Scalability

  • Monitoring is essential: CPU, memory, network bandwidth, response times, throughput.
  • Continuously reassess design decisions and adapt architecture to evolving needs.