Design Netflix

Interviewer: Hi everyone, we're here today for another Exponent mock interview with Andreas. For those of you who aren't familiar with Exponent: Exponent helps you get your dream tech career with our online courses, expert coaching, peer-to-peer mock interviewing platform, and interview question database. Check it out at tryexponent.com. All right, thanks so much for being here with us today, Andreas. Would you quickly introduce yourself to our viewers?

Andreas: Yeah, happy to. Hey everyone, my name is Andreas. I'm a software engineer currently at Modern Health, which is a mental health tech startup.

Interviewer: Awesome. All right, so let's get right into it. The question today is: design Netflix.

Andreas: So Netflix is composed of things common to a lot of tech platforms: you have users, they can find some sort of video content, in this case movies and TV shows, and then they want to watch that content. Including those things, are we designing the high-level system to include everything from subscriptions, users, search, and videos, or would you like me to focus only on a particular part?

Interviewer: Let's focus on the core product for now, so that would be the videos and the users that you described. You can ignore other features like search or subscriptions.

Andreas: Okay, awesome. So when you say users, and I'll write that down here, users and videos: when you say users, do we want to worry about authentication, or just users in relation to the videos?

Interviewer: You specifically need to focus on the user activity data. That's the data that would go into their video recommendation engine, so you'll need to find a way to both aggregate and process the user activity data that's on the website.

Andreas: Okay, great. So given those things, will I need to come up with some sort of algorithm for this recommendation engine?

Interviewer: No, in this case that will be outside the scope of this question, but I'd like you to focus on how the data will be gathered and processed for those recommendations.
Andreas: Okay, great. So we have this video service and we have a recommendation engine. For the video streaming service, I'm guessing it'll be important that we have high availability, even with a trade-off of maybe not as great, but still decent, consistency, because we'll want fast, low latencies around the world. Is that correct?

Interviewer: Yes, that's correct. You should definitely account for that in your design.

Andreas: Okay, awesome. So: low latency, and we'll also say global. Great. Now that we have all these things, for the recommendation engine, do you think we can run it as a background asynchronous job?

Interviewer: Yeah, that sounds great.

Andreas: Okay, awesome. So I'm building a list of requirements here. Beyond these, a larger question, as we transition to thinking about scale, is how many users can we assume Netflix has?

Interviewer: Maybe about 200 million.

Andreas: Okay, cool, 200 million users. Based on this number, I'd like to take us through some data storage estimations, and then we'll come organically to some of the architectural decisions along the way, if that sounds good to you.

Interviewer: Sure.

Andreas: Perfect, okay, awesome. So we have a few different types of data that we're going to need. First, the video content itself. Then the static content tied to each video: things like the names of the movies, descriptions, thumbnails, and maybe a cast list, since we're doing movies and shows. We'll also need user metadata: things like whether they have watched a show and the last-watched timestamp, so we can return them to the place in the movie they were watching, and maybe likes on a movie, since we can thumbs-up or thumbs-down a movie on Netflix. Then the last bit that we talked about: activity logs. So we have the user metadata, the video static content, the video content itself, and then these activity logs.
Although we might actually have some overlap with the user metadata, in the activity logs we could track all sorts of events: clicks, impressions, scrolls. We can fine-tune that a little more based on what we're looking to optimize for in the recommendations, but since that is probably a lot larger in scope, for now I'll keep it out of scope, if that's okay with you. If we have extra time, we can go into it.

Interviewer: Yeah, that makes sense.

Andreas: Great. So, in terms of the size of some of this data, let's start with the video content. We have 200 million users; how many movies and shows can we estimate that Netflix has?

Interviewer: Let's say about 10,000.

Andreas: Okay, great, so 10,000 videos. Given that movies are on average two hours and each episode of a show is maybe 20 to 40 minutes, can we say a video is on average about an hour long overall?

Interviewer: That would be assuming it's roughly half movies and half TV shows, right?

Andreas: Yes.

Interviewer: Okay, that's an all-right assumption.

Andreas: Okay, cool. Netflix also supports different levels of definition, so we have standard definition and high definition. Let's say that standard definition is 10 gigabytes an hour and high definition is 20 gigabytes an hour. So here I'll also note an hour average per video, and then we have 30 gigabytes per hour of video. Given all of that, we can estimate around 300,000 gigabytes, or 300 terabytes.

Interviewer: Yep, awesome.

Andreas: This is actually a lot smaller than many other video stores, luckily, because it's not user-generated content like YouTube or maybe Facebook video, so we have a ceiling on the amount of video being hosted. Given that assumption, we could probably use something like blob storage, such as Amazon S3 or Google Cloud Storage, and just store and replicate the content there, if that sounds okay.
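The storage estimate above can be checked with a quick back-of-the-envelope calculation, sketched here in Python. All inputs are the interview's stated assumptions, not real Netflix figures.

```python
# Back-of-the-envelope estimate of total video storage, using the
# interview's assumptions: 10,000 titles, one hour average length,
# and both an SD (10 GB/hr) and an HD (20 GB/hr) copy per title.
NUM_VIDEOS = 10_000
AVG_HOURS_PER_VIDEO = 1
SD_GB_PER_HOUR = 10
HD_GB_PER_HOUR = 20

gb_per_video = AVG_HOURS_PER_VIDEO * (SD_GB_PER_HOUR + HD_GB_PER_HOUR)
total_gb = NUM_VIDEOS * gb_per_video
total_tb = total_gb / 1_000

print(f"{gb_per_video} GB per video, {total_tb:,.0f} TB total")
# -> 30 GB per video, 300 TB total
```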
Interviewer: Yeah. So why is blob storage in particular better for this type of data?

Andreas: In this case, rather than using an RDBMS, for example, I think it makes sense to put the videos in a blob store. If we had ever-increasing amounts of data, ever-growing blob stores would not be great for us to sort through, find things in, and index, but since we have a fixed upper bound on the amount of content, I think it's okay for us to search across all of it. Also, we don't want to store the videos inline in a database, because they're pretty large files individually.

Interviewer: Okay, yeah, that makes sense.

Andreas: Awesome, great. As we're going through this, I'll also start sketching some of the design. So we have this blob store here, and we can assume we're adding content to it somehow, either through an API or a client that lets us upload the videos into the blob store. If that's all right, we won't go too deep into that; we can just assume that on our side we can upload videos into our system.

Interviewer: Sure.

Andreas: Okay, cool. So now we have this blob store. Next, we'll probably want to look at the static content, since we have these videos stored. For the static content, I'll write it down here: we'll need to store the things we talked about, like titles, descriptions, and cast lists. That data will be correlated with the number of videos, because per video we're just storing this metadata about the video. In that case, we can probably store it in a standard relational database, or I guess we could also use a document store, but in this case I think I would go with something like Postgres.
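To make the blob-store idea concrete, here is a minimal sketch of how video files might be keyed in object storage. The bucket layout, key scheme, and function name are illustrative assumptions, not Netflix's real design.

```python
# Hypothetical object-key scheme for encoded video files in a blob
# store such as S3. A deterministic layout lets the upload pipeline
# and downstream consumers agree on where each file lives.
def video_object_key(video_id: str, definition: str, part: int) -> str:
    if definition not in ("sd", "hd"):
        raise ValueError("definition must be 'sd' or 'hd'")
    return f"videos/{video_id}/{definition}/part-{part:04d}.mp4"

# With a client like boto3, an upload might then look like:
#   s3.upload_file(local_path, "video-content-bucket",
#                  video_object_key(video_id, "hd", 0))
print(video_object_key("bridgerton-s01e01", "hd", 3))
# -> videos/bridgerton-s01e01/hd/part-0003.mp4
```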
Then we'll probably want to add a cache to it, just so that we can make reads faster for the client; it's not too much data. Does that sound okay to you?

Interviewer: Sure. A brief question about this: what would you be caching?

Andreas: In this case, we'd cache the most frequently accessed content. For a given user, let's say they access a certain series a lot, like a show they've been watching. We'd cache that as they go to watch it, so that on subsequent views we don't have to re-fetch all of it.

Interviewer: Ah, okay, yeah, that makes sense. So it'll make their user experience faster then.

Andreas: Yes, 100%, like the latency we were talking about: we're trying to increase the speed of all these parts of the service. Cool. I'll add that to the diagram here. So we'll have Postgres here, which I'll label static content, and then we'll have a bunch of API services. For these API services, we can assume the users are making calls to the APIs from the client, and because we're building a distributed system, we'd want horizontal scaling, so we can just add more of these services as we go. We'll also have a cache, maybe with a standard write-through policy: as a call comes in, we'll update the cache and then go to Postgres.

Interviewer: Cool.

Andreas: Great, so we have that. Now, the final type of data we talked about: we have the video content right here, we have the video static content, and then finally the user metadata. This user metadata will be all the things we discussed, like what a user has watched, last-watched timestamps, and likes. In this case, the metadata is tied to the number of videos as well.
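The cached read path described earlier, where the cache is checked first and Postgres is only hit on a miss, can be sketched as follows. A plain dict stands in for Redis or Memcached, and `fetch_from_postgres` is a hypothetical stand-in for a real SQL query.

```python
# Read path for video static content: check the in-memory cache first,
# fall back to Postgres on a miss, and populate the cache so that
# subsequent reads skip the database round-trip.
cache: dict[str, dict] = {}

def fetch_from_postgres(video_id: str) -> dict:
    # Stand-in for: SELECT title, description FROM videos WHERE id = ...
    return {"id": video_id, "title": "Bridgerton", "description": "..."}

def get_static_content(video_id: str) -> dict:
    if video_id not in cache:               # miss: query the database
        cache[video_id] = fetch_from_postgres(video_id)
    return cache[video_id]                  # hit: served from memory
```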
The number of videos will be the smaller factor, though: the metadata is tied to both the number of videos and the number of users, and the number of users is going to keep increasing at a higher rate than the number of videos, or at least that's what we hope for our service. We have 200 million users in this case, which is a lot more than 10,000 videos. Based on 200 million users, let's assume maybe a thousand videos watched per user over the lifetime they're on our application, which is about 10% of the total content we assumed earlier. And per video, for all of the information we're storing, the watches, likes, and things like that, let's say it's about 100 bytes. Do those assumptions all seem okay so far?

Interviewer: Okay, great.

Andreas: From there, that puts us at 100 kilobytes per user, times 200 million users, which is about 20 terabytes. So we have 20 terabytes of data here. Our video content is actually more than this, which is great; it means we don't have to do anything too crazy. But we will need to be able to query this data, whereas for the video content we just needed to pull it from a store. So I don't think we need to store it in an S3 bucket; I think we can go with something like Postgres, since we're already using that right here for the video static content. We will need it to be highly available, though, because like we talked about, we want low latency. I guess we have the option of going NoSQL with something like Cassandra, but I think in this case we should use an RDBMS, and if we're going that route, we definitely need to shard the system. We could shard it based on the user metadata; in this case, indexing it based off of user ID I think would make sense, since the data is so user-focused.
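The user-metadata sizing above follows the same back-of-the-envelope arithmetic; here it is sketched in Python, with all inputs being the interview's assumptions.

```python
# Estimate of total user-metadata storage from the interview's
# assumptions: ~100 bytes of watch/like data per video, ~1,000 videos
# watched per user over their lifetime, and 200 million users.
BYTES_PER_VIDEO_RECORD = 100
VIDEOS_WATCHED_PER_USER = 1_000
NUM_USERS = 200_000_000

bytes_per_user = BYTES_PER_VIDEO_RECORD * VIDEOS_WATCHED_PER_USER
total_tb = bytes_per_user * NUM_USERS / 10**12

print(f"{bytes_per_user // 1_000} KB per user, {total_tb:.0f} TB total")
# -> 100 KB per user, 20 TB total
```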
Each shard, we can say, holds one to ten terabytes of data, and that way we can have really quick reads and writes on the related data.

Interviewer: To recap, that means each shard is actually assigned to a range of user IDs, right?

Andreas: Yes, exactly.

Interviewer: Gotcha, okay, great point.

Andreas: Yeah, not an individual user ID per shard; that would be a lot of shards.

Interviewer: Okay, so that makes a lot of sense. Is there a particular reason why you chose to use Postgres instead of NoSQL?

Andreas: Yeah. Although it's more complex for us to set up a sharded system, and NoSQL would give us a lot of this out of the box, we want to be able to make complex queries on this large set of data, and the background jobs we're running will want that too. That's, in my opinion, the advantage of using Postgres over NoSQL here.

Interviewer: Right, okay, that makes sense. Cool.

Andreas: All right, awesome. So now I'll also connect this up here to our API services. And we can use an in-memory cache for this, maybe Redis or Memcached; I don't know if I mentioned that before, but we'll cache both of these pieces of content. So this one is the static content, this one is the user metadata, and this is Postgres as well. Okay, awesome. Actually, I'm realizing as we go through this, one thing I wanted to mention: since we have these horizontally distributed services, we'll want a load balancer to sit in front of them to distribute the load evenly between all of them.

Interviewer: And this would be the load from users, right? Not from internal servers?

Andreas: Yes, great point. Since we'll have asynchronous jobs, I imagine we won't need to balance load from those. However, I could see that if we were uploading a lot of data, we might want to consider a load balancer on the back-end side for the calls that upload videos.
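The range-based sharding described above might look like the following sketch. The shard count and the assumption of a dense integer user-ID space are illustrative; 20 TB at roughly 2 TB per shard suggests about ten shards.

```python
# Range-based shard routing for user metadata: each Postgres shard owns
# a contiguous slice of user IDs, so all of one user's watch history
# lives on a single shard and related reads/writes stay local.
NUM_SHARDS = 10
MAX_USER_ID = 200_000_000
IDS_PER_SHARD = MAX_USER_ID // NUM_SHARDS  # 20 million IDs per shard

def shard_for_user(user_id: int) -> int:
    if not 0 <= user_id < MAX_USER_ID:
        raise ValueError("user_id out of range")
    return user_id // IDS_PER_SHARD

print(shard_for_user(0), shard_for_user(25_000_000), shard_for_user(199_999_999))
# -> 0 1 9
```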
Interviewer: I think that's a great point. Okay, great. Are there any ways we can improve the speed of the requests?

Andreas: Yes. As I talked about, we have this Redis cache, and that's where it comes in. Because we're using write-through caching, we'll have fast reads on both of these: the user metadata as well as the static content.

Interviewer: Okay, great. So for each user, you'd basically cache their most recent metadata?

Andreas: Yes, exactly. Keeping that hopefully also increases our availability.

Interviewer: Gotcha, okay.

Andreas: Now that we have all this put together, there's probably one last consideration when we're talking about speed and lowering latency: the global nature of these videos. Since we're serving them all around the world, how can we make this fast? Right now we were assuming maybe one data center, but we have users all over the world. In this case, we can add a CDN between our blob store over here and the end client. We can run a job, which I'll call the CDN populator, that connects to the CDN itself. From the CDN, we'll serve the most relevant content to users around the world based on where they're geographically located. For example, Bridgerton came out in the US and a bunch of people were watching it, so we'd probably want to cache it there. The populator will take care of that geographic population of each of the individual CDN nodes.

Interviewer: Okay, thanks so much, Andreas. That was really great.
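The write-through caching discussed in the design above, where every update goes to both the cache and the database together so that cached reads are never stale, can be sketched as follows. Plain dicts stand in for Redis and the sharded Postgres.

```python
# Write-through cache for user metadata: every write updates the cache
# and the backing store in the same operation, so reads served from
# the cache are never stale.
cache: dict[tuple[int, str], int] = {}
database: dict[tuple[int, str], int] = {}

def record_watch_position(user_id: int, video_id: str, seconds: int) -> None:
    key = (user_id, video_id)
    cache[key] = seconds     # write-through: update the cache...
    database[key] = seconds  # ...and persist to the database together

def get_watch_position(user_id: int, video_id: str):
    key = (user_id, video_id)
    if key in cache:          # fast path: in-memory read
        return cache[key]
    return database.get(key)  # fallback if the entry was evicted
```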
I think now would be a good time to debrief a little bit. What do you think went well, and what would you want to improve on in the future?

Andreas: Yeah, I definitely think it was cool walking through how Netflix works. A lot of times we don't think about everything that goes on behind the scenes, because it's so seamless when we're using it, or hopefully it is. I would say there were a lot of calculations, so for keeping my head on straight through those, I'll give myself a pat on the back. If I were to improve on something, it would probably be the manner in which I went through all of this. Doing the calculations and the design concurrently felt a little jarring, even for me, going back and forth between them. Instead, I would summarize all the calculations and thoughts I had, and then go to the board to design it. That's not the only improvement, but it's what immediately comes to mind.

Interviewer: Okay, yeah, I definitely agree with a lot of those points, actually. I particularly thought you did a really good job of considering what the user experience would be like and incorporating those considerations into your design. For example, you thought about how the user might want really low latency, and they want their videos to load really quickly, so you did a lot of caching to try to ensure that. You also talked about how we're going to have users all over the world, and the media they watch might vary depending on where in the world they are, so a globally distributed CDN helps a lot with that. Those are both really great considerations. You also did a fantastic job of considering trade-offs. I always think that being able to analyze a system using the CAP theorem is a great way to decide what to optimize for in a design versus what is okay to sacrifice on a little bit.
You did a fantastic job of that. And I agree: sometimes it can be a little confusing to start out by trying to list all of the detailed calculations. As you work through the higher-level design, knowing how the different components connect to each other can help you work out the details a little bit more. So once you've gathered the requirements and asked all of your clarifying questions, I think it's okay to start with some of the higher-level components if it helps you better conceptualize the idea in your head. How does that all sound?

Andreas: Yeah, I completely agree. Thanks for the feedback.

Interviewer: Otherwise, this was a really interesting interview to do with you, so well done.

Andreas: Thank you.

Interviewer: And thanks, everybody who's watching, and good luck on your upcoming interviews. Thanks so much for watching. Don't forget to hit the like and subscribe buttons below to let us know this video was valuable for you, and of course check out hundreds more videos just like this at tryexponent.com. Thanks for watching, and good luck on your upcoming interview.