Transcript for:
System Design: Availability

Hey guys, it's Iran, and I'm super excited to announce that this is the first video in a series about system design concepts for beginners. If you've watched my previous video about preparing for system design interviews, you'll know how important this type of interview is, and the first step in getting ready for it is understanding the basic concepts. And again, if you've watched that video, you'll know that I'm not a system design expert, but I did learn a lot this past year, and I wanted to share what I learned in a way that is hopefully simple, digestible, and really geared towards beginners. So let's get straight into it.

This first video is about availability. Availability is an extremely important topic in system design: you can design a super efficient, low-latency system, but if it's down 30% of the time, then what's the point? So what is it exactly? Availability is the percentage of time in a given period during which your system is up and performing its intended function. The importance of availability is obvious: any minute the system is down can cause huge revenue losses and other consequences, depending on the type and scale of the service you're providing. In October of 2021, Facebook and its entire family of apps went down for around six hours. This downtime had a major impact on communications and businesses around the world, and according to some reports Facebook lost at least 60 million dollars in ad revenue.

Availability always matters; it's always something to consider when designing a system. But there are levels to it: some systems require a higher level of availability than others. As you can imagine, an air traffic control system requires a much higher level of availability than a system that manages reservations for a restaurant, for example. But most services require a very high percentage of availability. Even an availability of 90%, which sounds high, means that the system is down for 36 days of the year, so that's not great either.

Because we're normally dealing with very high percentages when it comes to availability, it's very common to measure it in terms of nines. If a system is available 99% of the time, we say it has two nines of availability; that translates to about 3.6 days of downtime a year. If the system is available 99.9% of the time, it has three nines of availability, which translates to about 8.7 hours of downtime a year. 99.99% is four nines, which is about 52 minutes of downtime a year. And the gold standard of availability is five nines, which means less than six minutes of downtime a year. That's a pretty challenging bar to reach, and I'm going to explain why in just a minute, but some systems do require it, and for them it's worth the investment.

So the billion dollar question is: how do I achieve high availability for my system? Well, there are many factors that affect the availability of a system, and many reasons why a service can fail. The obvious causes are hardware failures: servers sometimes crash, power outages happen, parts wear out, and natural disasters can strike. Resources can also run out: your service can become overloaded with requests, or maybe run out of disk space and fail to process new requests. Software bugs can also cause failures in various ways. Software engineers are not perfect; we will dereference null pointers from time to time, we will cause memory leaks. Stuff happens.

So designing a system for high availability is not really about fixing those mistakes or avoiding these localized failures altogether, but rather about accepting that failures are inevitable and masking localized failures in a way that keeps the system at large operational. The number one way to do that is by eliminating single points of failure. What do we mean by that? Let's say we have this very simple design for a website, food.com: we have our clients, and then we have a single application server that runs all the business logic.
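Before we go on, a quick aside: the downtime figures quoted above for each level of nines are just arithmetic, and you can sanity-check them yourself. Here's a small Python sketch (the values in the comments are my own rounding; the video's figures are approximate):

```python
# Back-of-the-envelope check of the "nines" table: availability percentage
# to expected downtime per year. Plain arithmetic, nothing system-specific.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a (non-leap) year

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Expected yearly downtime, in minutes, at the given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (90.0,     # ~36.5 days/year
            99.0,     # two nines:   ~3.65 days/year
            99.9,     # three nines: ~8.76 hours/year
            99.99,    # four nines:  ~52.6 minutes/year
            99.999):  # five nines:  ~5.3 minutes/year
    print(f"{pct}% available -> {downtime_minutes_per_year(pct):.1f} min/year down")
```

Notice how each extra nine shrinks the downtime budget by a factor of ten; at five nines there is essentially no room for manual intervention, which is part of why that bar is so hard to reach.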
Now, obviously, if this app server goes down, the system will not be able to function, and we will experience downtime until it recovers. That makes the application server a single point of failure. Also, as scale increases, this one app server will most likely become a bottleneck, because even powerful machines have a limit to how many requests they can serve per second. So in order to scale the system and eliminate the single point of failure, we can add more app servers that do the exact same thing. This is called redundancy. Redundancy means duplicating parts of the system: if one server goes down, the others pick up the slack and the system stays available. And of course, having multiple servers processing requests in parallel means we can handle more requests per second.

But now that we have all these options, how would a client know which server to connect to? And how would it know not to connect to a dead server? This is where we add a load balancer. Client requests first reach the load balancer, and the load balancer decides where each request should go. It does that based on some distribution strategy, but let's not get into that now; I will make a full video on load balancing later in the series. For now, all we need to know is that the load balancer decides where to send each request, and it is also aware of the state of the servers: if a server dies, the load balancer becomes aware of that and stops sending requests to the dead server until it recovers.

Okay, so that's great and all; the application server is no longer a single point of failure. But now we find ourselves with a very similar problem: the load balancer is now a single point of failure. If the load balancer dies, all the client requests basically go nowhere, and the system becomes unavailable. The solution for that is, again, redundancy. There are generally two ways we can go about this.
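To make the load-balancing idea concrete, here's a toy sketch of a load balancer that round-robins requests across app servers and skips any it believes are dead. The class and method names are made up for illustration; this is not any real load balancer's API:

```python
# Toy load balancer: round-robin over app servers, skipping any marked dead.
import itertools

class LoadBalancer:
    def __init__(self, servers):
        self.healthy = {s: True for s in servers}   # server -> alive?
        self._rotation = itertools.cycle(servers)

    def mark_down(self, server):
        """Health checks failed: stop routing to this server."""
        self.healthy[server] = False

    def mark_up(self, server):
        """Server recovered: put it back into rotation."""
        self.healthy[server] = True

    def route(self):
        """Return the next healthy server in round-robin order."""
        for _ in range(len(self.healthy)):
            server = next(self._rotation)
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy servers: the system is unavailable")

lb = LoadBalancer(["app1", "app2", "app3"])
lb.mark_down("app2")                        # pretend app2 just crashed
print([lb.route() for _ in range(4)])       # -> ['app1', 'app3', 'app1', 'app3']
```

The key point is the last line of `route()`: only if every server is down does the system as a whole become unavailable, which is exactly what redundancy buys us.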
The first option: if the load balancer is not itself a bottleneck and we just need to protect the system in case of failure, we can go for a backup setup. We add a second load balancer, which remains passive, not accepting any requests; only in case of failure does it snap into action and take over. We can implement this failover from primary to secondary using a floating IP and a health check. The floating IP is a type of virtual IP that is mapped to one of these nodes and acts as the gateway to our service. In a healthy state it is mapped to the primary load balancer, which means that as long as the primary load balancer is alive, it takes all the requests. The two load balancers monitor each other's health by sending heartbeat messages back and forth. If the primary load balancer dies, the health check fails, and that is how the secondary knows it needs to take over: it does something like running a script that reassigns the floating IP to itself, so that subsequent requests reach the secondary load balancer instead of the primary. When the primary load balancer comes back to life, it reassigns the IP to itself, and the secondary goes back to being passive.

The second way to eliminate the load balancer as a single point of failure is to do something similar to what we did in the app server layer: we make the load balancers equal. There is no concept of primary and secondary; they share the load and handle requests in parallel. In order to decide which load balancer a particular client request should go to, we use a DNS server. A DNS server translates domain names like youtube.com into IP addresses. DNS is, again, a topic that deserves its own video, and it will get one later in the series, but for simplicity we can think of it as a huge dictionary that maps domain names to IP addresses; and domain names don't have to map to a single IP address. In our example, these two load balancers act as the gateway to our website, food.com.
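Going back to the active-passive setup for a moment, the watchdog loop the secondary load balancer runs can be sketched roughly like this. Note that `ping_primary` and `takeover` here are placeholder callables I made up: in a real deployment the first would be a heartbeat or health probe, and the second would run the script that reassigns the floating IP to the secondary node:

```python
# Sketch of the watchdog loop the passive (secondary) load balancer runs.
import time

def secondary_watchdog(ping_primary, takeover, missed_limit=3, interval_s=1.0):
    """Poll the primary's heartbeat; after `missed_limit` consecutive
    failed checks, fire takeover() and stop (we are now the primary)."""
    missed = 0
    while True:
        missed = 0 if ping_primary() else missed + 1
        if missed >= missed_limit:
            takeover()
            return
        time.sleep(interval_s)

# Simulated run: the primary answers two heartbeats, then dies.
heartbeats = iter([True, True, False, False, False])
events = []
secondary_watchdog(lambda: next(heartbeats),
                   lambda: events.append("reassigned floating IP"),
                   missed_limit=3, interval_s=0)
print(events)  # -> ['reassigned floating IP']
```

Requiring several consecutive missed heartbeats (rather than one) is a common way to avoid failing over on a single dropped packet.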
We then want the DNS server to map the hostname food.com to a list containing both of their IPs. When a client wants to access our website, it makes a request for the IP of food.com, and the DNS server sends back one of these two IPs. This is how the client knows which load balancer to connect to. The problem is that the DNS server is not aware of the state of our load balancers: if one of them fails, the DNS server will not know about it and will continue sending requests to the dead server. So in order to get the failover functionality we want, we need some service that monitors the state of these load balancers; in case of failure, it tells the DNS server to remove the dead node's IP from the list. That way we stop sending requests to the dead server. Unfortunately, this DNS update does not take effect instantly; it only takes effect once the time-to-live (TTL) for our domain name record expires. But again, I don't want to get into too much detail regarding DNS; we can talk about that in a different video.

As always, there are advantages and disadvantages to each of these approaches. The first approach is more self-contained, and failover can be very quick, because we do not depend on the DNS record's time-to-live. The second approach, on the other hand, offers increased scalability in case your load balancers become a bottleneck, because both nodes take requests in parallel. So you should choose the option that is a better fit for your use case.

Another thing to consider when it comes to availability is geography. If there is a region-wide power outage or some natural disaster and you keep all your servers in that area, then everything goes down. The solution is to distribute your servers across multiple locations globally; it's preferable to even duplicate the entire application stack, so the application can run independently at each location. If one location goes down, the others can continue running without interruption. Doing that will also improve latency.
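Circling back to the DNS-based setup for a second, here is a toy model of what was just described: a hostname mapping to the IPs of both load balancers, resolution rotating through the list, and a health monitor pulling a dead IP out of it. The DNS "server" is just a Python dict, and real DNS clients keep using a cached answer until the record's TTL expires, which this sketch does not model:

```python
# Toy model of DNS round-robin over two load-balancer IPs, plus a health
# monitor that removes an IP once its load balancer is found dead.

dns_table = {"food.com": ["198.51.100.1", "198.51.100.2"]}
_next_index = {}  # hostname -> rotating cursor

def resolve(hostname: str) -> str:
    """Return one registered IP for the hostname, round-robin style."""
    ips = dns_table[hostname]
    i = _next_index.get(hostname, 0) % len(ips)
    _next_index[hostname] = i + 1
    return ips[i]

def report_dead(hostname: str, ip: str) -> None:
    """Called by the health-check service when a load balancer stops responding."""
    dns_table[hostname].remove(ip)

print(resolve("food.com"), resolve("food.com"))  # -> 198.51.100.1 198.51.100.2
report_dead("food.com", "198.51.100.1")          # monitor noticed LB1 died
print(resolve("food.com"))                       # -> 198.51.100.2
```

In the real system, the gap between `report_dead` and clients actually seeing the updated list is exactly the TTL delay mentioned above, which is why the floating-IP approach can fail over faster.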
If a client request from India can be served by a server that is physically located in India, that will be much faster than sending it all the way to a server in the US. But let me be perfectly clear: doing this is not easy. In a system design interview it seems easy to add more servers and duplicate your entire system over five regions; it's all just little drawings on a piece of paper, and you don't actually have to pay for it, implement it, or maintain it. But in real life, achieving high availability takes significant investment, and you have to be aware of the trade-offs. Adding more components to your system doesn't just cost more money; it also increases the system's complexity, which in turn increases the odds that something, somewhere, will break. It also poses data consistency challenges that we will touch on later in the series. So you should always try to be conscious of your decisions and how they would work in the real world. Don't add 500 servers if you can easily handle your target load with 10. Generally, try to keep things simple, and only add complexity that you can clearly justify. In an interview, I suggest you start with the simplest design possible and then optimize: look at each component of your system and ask, what would happen if this part fails? Will my system no longer be able to function? Will I lose critical data? Those are really good reasons to add redundancy, while again acknowledging the trade-offs.

Okay, so that is it for this one. I hope you enjoyed this video. Let me know in the comments if there are any specific concepts you would like me to cover. Thank you for watching, and I will see you next time!