System Design Concepts for Beginners

Introduction

Definition: Availability is the percentage of time a system is operational and performing its intended function.
Real-World Impact: Downtime can lead to significant revenue losses and other negative consequences, e.g., Facebook's 6-hour outage in October 2021 led to an estimated $60 million loss in ad revenue.
Different Levels of Availability: The availability a system needs depends on what downtime costs.
- Example: An air traffic control system must stay up almost continuously, while a restaurant reservation system can tolerate occasional outages.
Measuring Availability: Often measured in "nines" (9's):
- 2 Nines (99%): about 3.65 days downtime/year
- 3 Nines (99.9%): about 8.76 hours downtime/year
- 4 Nines (99.99%): about 52.6 minutes downtime/year
- 5 Nines (99.999%): about 5.26 minutes downtime/year (challenging but achievable for some systems)
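
The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function name `downtime_minutes_per_year` is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [("2 nines", 99.0), ("3 nines", 99.9),
              ("4 nines", 99.99), ("5 nines", 99.999)]
    for label, pct in levels:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label} ({pct}%): {minutes:.2f} minutes/year")
```

For example, 99.9% leaves 0.1% of 525,600 minutes, which is about 525.6 minutes or 8.76 hours per year.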

Factors Affecting Availability
- Hardware failures, power outages, natural disasters
- Resource exhaustion (e.g., full disks, traffic overload)
- Software bugs (e.g., null pointer dereferences, memory leaks)
Design Strategy: Accept that failures are inevitable, and design the system so that a localized failure is masked rather than visible to users.

Eliminate Single Points of Failure
- Example: If the only application server goes down, the whole website is down
- Solution: Redundancy (e.g., multiple app servers sharing the load)
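
To see why redundancy helps, here is a rough sketch of the availability of several redundant servers where any one of them is enough to serve traffic. It assumes the servers fail independently, which is an idealization: shared power, shared networks, or a common software bug can take down several replicas at once. The function name `combined_availability` is illustrative.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant servers, any one of which suffices.

    Assumes independent failures: the group is down only when every
    replica is down at the same time.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    all_down = (1 - single) ** replicas
    return 1 - all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} server(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

Under this assumption, two servers that are each 99% available give roughly 99.99% combined, which is why adding a second server is often the first step away from a single point of failure.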

Load Balancer
- Receives client requests and distributes them across the available servers
- Can itself fail, so it also needs redundancy (a backup setup or multiple active load balancers)
- Backup Setup: A passive secondary load balancer takes over the floating IP address if the primary fails
- Active-Active Setup: Both load balancers serve traffic, with DNS records pointing at both for redundancy
- DNS Issue: DNS does not detect a failed load balancer on its own, and cached records can keep sending clients to it; a separate monitoring service is needed to update the DNS records
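
The core idea of distributing requests only to available servers can be sketched as a round-robin picker that skips servers marked unhealthy. This is a simplified illustration, not how any particular load balancer product works; the class `RoundRobinBalancer`, its methods, and the server names are all hypothetical, and real systems would drive `mark_down` and `mark_up` from periodic health checks.

```python
class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any that are marked down."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        """Record that a health check for `server` failed."""
        self._healthy.discard(server)

    def mark_up(self, server):
        """Record that `server` has recovered."""
        if server in self._servers:
            self._healthy.add(server)

    def pick(self):
        """Return the next healthy server, or raise if none are left."""
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    lb.mark_down("app-2")  # simulate a failed health check
    print([lb.pick() for _ in range(4)])
```

Because failed servers are skipped, the loss of one application server reduces capacity but does not cause downtime. The balancer itself remains a single point of failure until it is made redundant as described above.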

Geographic Redundancy
- Distribute servers across multiple geographic regions so a regional outage does not take down the whole system
- Reduces latency by serving each request from the nearest server
- Trade-offs: Increased complexity and cost
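
As a rough illustration of how geographic redundancy serves both goals, the sketch below routes a client to the lowest-latency region that is currently healthy and falls back to the next-nearest one during a regional outage. The function `pick_region`, the region names, and the latency figures are made up for the example; in practice this decision is usually made by DNS-based or anycast routing rather than application code.

```python
def pick_region(latencies_ms: dict, healthy: set) -> str:
    """Return the healthy region with the lowest measured latency.

    `latencies_ms` maps region name to the client's latency in milliseconds;
    `healthy` is the set of regions currently passing health checks.
    """
    candidates = {region: ms for region, ms in latencies_ms.items()
                  if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies as seen by one client.
    latencies = {"us-east": 20.0, "eu-west": 95.0, "ap-south": 210.0}
    print(pick_region(latencies, {"us-east", "eu-west", "ap-south"}))
    # Simulate a regional outage in the nearest region.
    print(pick_region(latencies, {"eu-west", "ap-south"}))
```

The fallback keeps the service reachable during the outage, at the price of higher latency for affected clients, which is one concrete form of the complexity and cost trade-off noted above.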

Start Simple: Begin with the simplest design, then optimize
Consider Trade-offs: More components increase complexity and the potential for failure
Add Redundancy Only Where Justified: Add redundancy only where it is critical for system function and data integrity