System Design: Availability

Jul 12, 2024

System Design Concepts for Beginners

Introduction

  • Speaker: Iran
  • Series: System Design Concepts for Beginners
  • Goal: To simplify and explain basic system design concepts for beginners.

Importance of Availability

  • Definition: Availability is the percentage of time a system is operational and performing its intended function.
  • Real-World Impact: Downtime can cause significant revenue loss and other negative consequences. For example, Facebook's 6-hour outage in October 2021 cost an estimated $60 million in ad revenue.
  • Different Levels of Availability: The required level of availability depends on the system.
    • Example: An air traffic control system needs far higher availability than a restaurant reservation system.
  • Measuring Availability: Often measured in "nines" (9's):
    • 2 Nines (99%): about 3.65 days downtime/year
    • 3 Nines (99.9%): about 8.76 hours downtime/year
    • 4 Nines (99.99%): about 52.6 minutes downtime/year
    • 5 Nines (99.999%): about 5.26 minutes downtime/year (challenging but achievable for some systems)
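
The downtime figures in the list above follow from multiplying the length of a year by the fraction of time the system is allowed to be down. A minimal Python sketch of that arithmetic, assuming a 365-day year (the function name is illustrative and not from the video):

```python
# Yearly downtime budget for a given availability target.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the allowed downtime per year, in minutes."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Each extra nine divides the downtime budget by ten, which is why moving from four to five nines is so much harder than moving from two to three.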

Achieving High Availability

  • Factors Affecting Availability
    • Hardware failures, power outages, natural disasters
    • Resource exhaustion (e.g., disk space, overloads)
    • Software bugs (e.g., null pointers, memory leaks)
  • Design Strategy: Accept that failures are inevitable and design the system to mask localized failures.
  • Eliminate Single Points of Failure
    • Example: A single application server goes down -> downtime for the website
    • Solution: Redundancy (e.g., multiple app servers sharing the load)
  • Load Balancer
    • Manages client requests and distributes them to available servers
    • Can fail itself -> use redundancy (backup setup or multiple active load balancers)
    • Backup Setup: A passive secondary load balancer takes over the floating IP if the primary fails
    • Active-Active Setup: Both load balancers serve traffic at the same time, with DNS spreading clients across them for redundancy
    • DNS Issue: DNS servers may not be immediately aware of a load balancer failure; an additional monitoring service is required to update DNS
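
To make the load balancer's role concrete, here is a toy Python sketch of round-robin distribution that skips servers marked as down. The class, method names, and server names are hypothetical and chosen for illustration; a real load balancer would also run its own health checks rather than rely on manual marking.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy load balancer: rotates across servers, skipping unhealthy ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._rotation = cycle(self.servers)

    def mark_down(self, server):
        """Record that a server failed (e.g. after a failed health check)."""
        self.healthy.discard(server)

    def mark_up(self, server):
        """Record that a known server has recovered."""
        if server in self.servers:
            self.healthy.add(server)

    def next_server(self):
        """Return the next healthy server in rotation."""
        # Try each server at most once per call so the loop always terminates.
        for _ in range(len(self.servers)):
            candidate = next(self._rotation)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    lb.mark_down("app-2")  # simulate one app server failing
    print([lb.next_server() for _ in range(4)])
```

With "app-2" marked down, requests alternate between "app-1" and "app-3", so the failure of one server is masked from clients. The balancer object itself is still a single point of failure, which is the problem the backup and active-active setups address.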

Geographic Considerations

  • Geographic Redundancy
    • Distribute servers globally to mitigate regional outages
    • Improves latency by serving requests from the nearest server
    • Trade-offs: Increased complexity and cost
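
As a rough illustration of both benefits, the sketch below picks the lowest-latency region that is currently healthy and falls back to the next nearest one during a regional outage. The region names and latency numbers are made up for the example, and the function is hypothetical rather than any particular provider's API.

```python
def pick_region(latency_ms, healthy):
    """Return the healthy region with the lowest latency, or None if all are down."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical latencies measured from one client, in milliseconds.
    latencies = {"eu-west": 20, "us-east": 90, "ap-south": 180}

    print(pick_region(latencies, {"eu-west", "us-east", "ap-south"}))
    # Simulate a regional outage: the nearest region is unavailable.
    print(pick_region(latencies, {"us-east", "ap-south"}))
```

In normal operation the client is served from the nearest region; when that region is down, it is still served, only with higher latency.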

General Advice

  • Start Simple: Begin with the simplest design, then optimize
  • Consider Trade-offs: More components increase complexity and the potential for failure
  • Add Redundancy Where Justified: Only add redundancy where it is critical for system function and data integrity
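
The trade-off between adding components and adding redundancy can be put into numbers. The sketch below uses the standard formulas for components in series (all must work) and redundant replicas (at least one must work). It assumes failures are independent, which real systems only approximate, and the function names are illustrative.

```python
from math import prod


def serial_availability(components):
    """Availability of a chain where every component must be up."""
    return prod(components)


def redundant_availability(availability, copies):
    """Availability when at least one of `copies` identical replicas must be up.

    Assumption: replicas fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Three components in series, each 99.9% available.
    print(f"{serial_availability([0.999, 0.999, 0.999]):.6f}")
    # Two redundant replicas, each 99% available.
    print(f"{redundant_availability(0.99, 2):.6f}")
```

Three 99.9% components chained together yield only about 99.7% overall, while two redundant 99% replicas reach about 99.99%. Every component added in series lowers availability, so redundancy is worth its cost mainly on the parts whose failure would take the whole system down.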

Conclusion

  • Video aims to make system design concepts simple and accessible
  • Future videos will cover more topics like load balancing and DNS in detail