Hello everyone and welcome back to the channel. It is currently a weekday after work, which means I am extremely tired, trying to grind out some videos before I go away for the weekend. I'm going to Indiana for the Indy 500, so if you happen to be watching this on a Sunday, I'm probably blacked out in the middle of a field somewhere. But in the meantime, let's learn some systems design. Today we're going to be talking about video group chats, specifically something like Zoom or Skype. Without further ado, let's jump into it.

The last time I made this video, I said I thought this would be kind of an unrealistic question to ask in a systems design interview. Making it again, I'm actually starting to go back on that. There are probably a couple of aspects of this video that you couldn't possibly be expected to know, but if you had to extrapolate out this problem just using some reasoning, and you were asked less specifically about actually transmitting video over the internet, it's not too unreasonable a question. So let's get into things.

If you're not familiar with Zoom or Skype because you've been under a rock for the last few years, here is an example of what it might look like. Typically you'll have some sort of video call, as I've displayed on the iPad. Here's an example of me at my beach house mansion, sitting at my desk with a nice infinity pool slash ocean outside, and we've got Karina Cof also on the call with Alexander D'Ario and Anad Armos. I can't show what their screens are displaying because it would not be fair to them; obviously this was a screenshot taken off of my real computer, and I would not want to violate their privacy. Okay, so let's actually formalize some problem requirements.
Number one: we can support both one-on-one and group video chats. Number two: we can support pretty large calls, up to 100 people at a time, which, if you're a working software engineer like myself, you'll know do happen pretty frequently. Number three: we should be able to record all of these video calls from some sort of centralized server on the back end. It shouldn't be my computer doing the recording, since that can overload my client, and we should be able to upload those recordings to the cloud for later access.

The first thing we're going to do is a quick refresher on networking. If you watched the video game video I made last week, you should have this pretty fresh in your mind, but if you didn't because you don't like watching these videos in order, here's the gist: we've got TCP and UDP. TCP has all sorts of protections involved with sending packets over the network. Number one, it tries to deliver them in order: if a client receives a packet, it can say "oh, I'm actually missing a sequence number that I should have, can you send me that one first before I apply this guy." Number two, dropped packets get resent, which is another nice protection, so things don't just get dropped and never sent again. Number three, we initially establish a connection via a three-way handshake, so clients aren't receiving packets from anyone they haven't agreed to receive them from. And number four, we have congestion control and flow control: congestion control means that when I'm sending packets, I limit how many I send; flow control means the receiver gives me back information saying "whoa there, don't send any more, I'm already having trouble processing these."

As far as video chats are concerned, though, we don't actually care about getting every single packet. Imagine every packet is a frame or something like that: it's okay for me to drop a couple of frames here and there and not get them back. If I'm on a video chat and I lose a frame and then get the next frame, I don't want the old frame anymore; it's done, I don't care. So generally speaking, what we actually want here is UDP. UDP is going to be a little bit faster because it doesn't have all of these protections that TCP has, and we don't really need them for video chatting anyway. So UDP is probably the way to go, at least for sending the video and audio itself; for other stuff, perhaps not.

Next, let's talk about peer-to-peer communication. You might think that for something like a video chat, or with networking in general, the way to make things as fast as possible is to have the two computers communicate directly: you want as little in the middle as possible, because every middleman adds overhead processing those packets and sending them out again. So ideally, if I'm video chatting with you, we want to be sending each other packets directly. The question is whether that's always going to be the case. Maybe it works for the two of us, but if it's us two plus your entire company, now all of a sudden we're all sending one another packets, and that can add a lot of load on our client devices. That's a problem right there. Also, what if we wanted to record this video footage? Now whoever is doing that recording probably needs to be ingesting video footage from all of these different places, potentially up to 100 of them, and then compiling it all together, and that could be a very expensive operation too. So peer-to-peer may not be something we want to do here.
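To make the TCP-versus-UDP point concrete, here's a minimal sketch of what the application-level receive logic looks like once you choose UDP: frames carry their own sequence numbers, and anything that shows up late simply gets dropped instead of re-requested. The header layout and function names here are made up for illustration, not from any real protocol.

```python
import struct

# App-level header: 4-byte sequence number, 8-byte capture timestamp (ms).
HEADER = struct.Struct("!IQ")

def pack_frame(seq, ts_ms, payload):
    """Prepend a tiny header to a video frame before sending it as a UDP datagram."""
    return HEADER.pack(seq, ts_ms) + payload

def unpack_frame(datagram):
    """Split a received datagram back into (seq, timestamp_ms, payload)."""
    seq, ts_ms = HEADER.unpack_from(datagram)
    return seq, ts_ms, datagram[HEADER.size:]

def should_render(last_rendered_seq, incoming_seq):
    """UDP gives no ordering guarantee, so old frames can arrive after newer ones.
    For live video we just drop them rather than asking for a resend like TCP would."""
    return incoming_seq > last_rendered_seq
```

A receiver would sit in a loop on a `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, call `unpack_frame` on each datagram, and skip anything where `should_render` is false.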
That being said, for the sake of a one-on-one video chat, peer-to-peer communication does seem ideal. So say it's a one-on-one call: how would we actually do peer-to-peer communication? To cover a topic like this, I think a good prerequisite is understanding something in modern networking known as network address translation, or NAT for short. The gist is that every device connected to the network has some sort of IP address. You might think your laptop and your smart TV on your Wi-Fi have separate IP addresses, and technically that's true; in reality, though, when they communicate with the public network, which is basically anyone not on your Wi-Fi, they're both abstracted away under one IP address known as your public IP address. You have some sort of NAT device keeping a mapping from B to A, where B is your public IP address and A is your private IP address. All the devices on your local network can see each other's private IP addresses, but we mask the private IP address publicly, for two reasons. Number one, it adds extra security, so no one has direct access to your IP address. Number two, also pretty important: with IPv4 (we're really getting into the weeds here, probably not that important for your standard systems design interview) we have a limited number of possible addresses, so we have to make the public IP address space smaller somehow, and the way we do that is by saying every local network gets just one public IP address.

So hopefully that makes sense, but the gist is: we've got a NAT device, our client device has one IP address A, the NAT device exposes it as some other IP address B with a certain port, and it contacts the server saying "hey, I'm reaching you from IP address B." The server replies to B, the NAT device maps that back to A, and the reply goes right back to the client. The reason we call it network address translation is that the NAT device is doing that translating from public address back to private address.

Let's continue talking about NAT a little longer, because this is important for how two clients can actually directly communicate with one another. There's another concept here known as the STUN server; the N in STUN does stand for NAT, but I can't remember the full acronym, and frankly it's not important. I'm going on a bit of a tangent here, but I think it's relevant for the rest of the video; it's just that if I were an interviewer asking this question, I wouldn't expect you to conjure up the concept of NAT unless you'd been working with networks for a while. The gist is that both clients can reach out to the STUN server, figure out the public IP address and port that their NAT is exposing them under, exchange those with the other party, and all of a sudden they can communicate directly. The problem is this doesn't always work: depending on the NAT device you're behind, that mapping may not be usable from outside, and you may always have to communicate through a relay in the middle, in which case your latency is always going to go up because there's an extra hop on the network.

Regardless, let's go talk about an actual centralized chat server, because like I mentioned, when we have a group chat, the feasibility of direct peer-to-peer communication tends to go down quite a bit.
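Before we get to the centralized server, here's a toy model of the NAT translation just described. The class, the port-allocation scheme, and all the addresses are invented purely for illustration:

```python
import itertools

class Nat:
    """Toy NAT: rewrites outbound (private_ip, private_port) pairs to
    (public_ip, allocated_port) and translates replies back via the stored mapping."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self._ports = itertools.count(40000)  # arbitrary allocation scheme
        self._out = {}   # (private_ip, private_port) -> public_port
        self._back = {}  # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """A local device sends a packet out: allocate (or reuse) a public port."""
        key = (private_ip, private_port)
        if key not in self._out:
            port = next(self._ports)
            self._out[key] = port
            self._back[port] = key
        return self.public_ip, self._out[key]

    def inbound(self, public_port):
        """A packet arrives at (public_ip, public_port): forward it to the
        private address that originally opened the mapping, if any."""
        return self._back.get(public_port)
```

In this model, a STUN server's whole job is to tell the client what `outbound` returned for it, since the client sitting behind the NAT can't see that mapping itself.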
So what is the main benefit of the centralized chat server? Like I mentioned, each client only has to listen to and send video to one place, and this is pretty crucial: if we had to listen to 100 other clients and send video to 100 other clients, and we were doing this on something like a mobile phone, you could expect that battery to die pretty darn quickly. What's the con? Our server is going to be under a bunch of load. If we've got 100 different people all sending, say, 24 frames per second of 1080p video to this server, that server is going to get pretty overloaded. So how do we want to handle video on the centralized server so that we maximize its efficiency and minimize the load on it?

Option number one: the centralized server takes in all the streams from all the clients and composites them down into one single video. We take all of these frames, merge them together somehow, and export that out to all the clients on the call. There are a few cons to this approach, which is ultimately why we're not going to go with it. Number one, doing this compositing on the server is itself going to use a lot of CPU, and that's not ideal. Number two, every single client has to watch the same stream. Something like that might be acceptable for Twitch, where you're really only looking at one stream at a time, but for video chatting we want customized experiences: if a teacher is presenting, maybe I'm a student and I want the whiteboard as the big screen rather than the teacher's face, while someone else, if the teacher is a baddie, might want to see that teacher's face. That's how it goes sometimes.

So how can we improve on this? Now we're going to talk about something known as selective forwarding, which is definitely something we see quite a bit in actual video chat applications. Option two gives us a customizable experience per user, because now we only send clients the streams they actually care about, in the resolutions they care about. Let's make an example. Say our Zoom call has all the best world leaders on it: Trump, Kim Jong-un, Putin, and Stalin (he came back from the dead). In my particular view, I've got Trump as the big guy, so over a websocket I tell the centralized server: on my view I want Trump as an HD stream, and Kim Jong-un, Putin, and Stalin as SD streams. If I were getting HD feeds for all of these guys, that would just overload the network unnecessarily, and my client would have to downsample all of those streams to a lower resolution anyway, so it doesn't help me at all; I just want the lower-resolution variant. If for whatever reason I put Stalin on the big screen and Trump goes back to a tile, we change it over so Stalin is HD and Trump is SD, just by having the websocket between my client and the main server reflect that; the server caches it, so as new frames come in, it knows what to send me.

Let's keep talking about selective forwarding, because there's actually a bit more nuance here. Suppose the server can send me multiple different types of streams: multiple resolutions, multiple encodings, anything like that. I don't really want to get into video specifics here, partly because I don't know that much about them and partly because I don't expect your interviewer would expect you to either.
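The websocket subscription flow described above might look something like this; the message shape, quality labels, and user IDs are all invented for illustration, not any real Zoom or WebRTC API:

```python
import json

# Hypothetical message a client pushes over its websocket whenever its layout
# changes; the server caches it per viewer.
subscription = {
    "chat_id": "world-leaders-call",
    "wants": {"trump": "hd", "kim": "sd", "putin": "sd", "stalin": "sd"},
}

def frame_to_forward(cached_subscription, sender, available_variants):
    """Given one viewer's cached subscription and a sender's encoded variants,
    pick the single variant this viewer should receive (or None if the viewer
    isn't subscribed to that sender at all)."""
    quality = cached_subscription["wants"].get(sender)
    return available_variants.get(quality) if quality else None
```

The point of caching the subscription server-side is that this lookup runs on every incoming frame, so it has to be a cheap dictionary hit rather than a round trip back to the client.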
So the question is: how is the client actually going to send the central server all the data it needs to do selective forwarding? Option number one: the client sends a single stream to the central server, and the central server converts it into all the other formats and resolutions needed, then sends those out to all the clients that care. This is good because the client only has to send one stream. At the same time, it's not so good because the server has to do a bunch more work, and we really don't want that, because the server is the failure scenario: if our server is going down all the time, we can't have a video chat in the first place, regardless of how much load is on the client.

That leads us to option number two: the client still only sends one stream, but it goes through an intermediary server first, which does all the processing to convert that stream into multiple resolutions and then forwards those to the central server. In theory this seems like a really great deal: the client only sends one stream, and the central server doesn't have to do any extra processing, so this idea of a proxy server seems to work nicely. At the same time, introducing yet another intermediary server is just more latency, more time for all of these frames to reach the central server and get sent out, and so more lag in your video chats.

So in reality, what most of these video chat applications do is take option number three: the client still does all the encoding, but it does it locally. Instead of sending out one stream, it sends out, say, three. The client has to do a little more work, which is unfortunate, but ideally not too much, because it's only encoding its own stream. We send these multiple streams to our central server, which means a bit more work for the client device but no extra encoding load on the central server, and the server has all the information it needs right off the bat to forward to all the clients. So this is the approach that often gets taken. Basically everything I've described so far is known as WebRTC, a technology that abstracts all of this away from us, but again, I think requiring that knowledge in an actual systems design interview would be pretty unnecessary; the whole point is to interview generalists and see if they can recognize the patterns here.

Cool, now let's talk about partitioning and replication, because obviously this is going to be important any time we have a centralized server. The main idea is that if we have a bunch of different video calls, ideally we want one centralized server per call. So what do we do? We shard on chat ID; that should be pretty simple, and we can use consistent hashing to make sure every server gets a relatively even amount of load. It is possible, I guess, that one of these servers for a given chat gets so much load that we have to partition it out even further, and if we really needed to, we could use two or three chat servers, spread out which users send video footage to which server, and then every user listens to two or three servers instead of just one. You do a little more work on the read path, but the work on the write path stays the same, and this way each central server only has to keep track of the configuration it needs for a subset of the users at a time.
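The "shard on chat ID with consistent hashing" step above can be sketched like this. A real deployment would tune the number of virtual nodes and use real server addresses; the names here are made up.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map chat IDs onto chat servers so each call lands on exactly one server,
    and adding or removing a server only moves a fraction of the calls."""

    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` points on the ring to smooth out load.
        self._ring = []
        for server in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, chat_id):
        # First ring point clockwise from the chat ID's hash owns the call.
        idx = bisect.bisect(self._keys, self._hash(chat_id)) % len(self._keys)
        return self._ring[idx][1]
```

The load balancer in front of the chat servers would run this lookup on every incoming connection, so every participant of a given call deterministically lands on the same server.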
The next aspect of this is replication: how do we make sure that if our central server goes down, our video chat doesn't completely crash and burn? The gist is we can run centralized servers in an active-passive configuration, as we've seen with load balancers in the past. That means we have some passive backup that just sits there doing nothing, and a ZooKeeper cluster that's constantly getting heartbeats from our main node. If the main central server goes down, the backup says to ZooKeeper "I'm actually going to be the primary now," and everyone re-establishes their websocket connection with the backup, because they'll also hear from ZooKeeper that the primary has just changed for this particular node, and then all of a sudden we're good to go. That should be simple enough; we've seen this pattern plenty on this channel. Another thing to note is that we don't really need to keep any state on the backup, because all of this is pretty much stateless: we're basically just forwarding UDP streams. You could argue that the state of which streams each user wants could be pre-cached on the backup for a more seamless transition, but on re-establishing those websockets, this isn't a particularly large data model, and each user can just tell the backup which streams it wants for selective forwarding.
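In real life the backup would watch an ephemeral node in ZooKeeper (via a client library), but the active-passive handoff just described can be simulated in memory. Every class and name below is invented purely to illustrate the shape of the protocol:

```python
class Coordinator:
    """Stand-in for ZooKeeper: tracks which server holds the leader 'znode'
    and notifies watchers (clients, the load balancer) when it changes."""

    def __init__(self):
        self.leader = None
        self._watchers = []

    def try_claim(self, server):
        """Create the leader node if it's free; only one claimant can win."""
        if self.leader is None:
            self.leader = server
            for notify in self._watchers:
                notify(server)
            return True
        return False

    def watch(self, callback):
        self._watchers.append(callback)

    def leader_died(self):
        # The ephemeral node disappears when heartbeats from the leader stop.
        self.leader = None


def failover(coordinator, backup):
    """What the passive backup does when it notices the leader is gone:
    claim leadership, which in turn notifies everyone to reconnect."""
    coordinator.leader_died()
    coordinator.try_claim(backup)
    return coordinator.leader
```

The important property this models is that clients never pick a server themselves; they react to the coordinator's notification, so everyone converges on the same new primary.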
Okay, the final aspect of this problem, like I mentioned, is video recording. The first thing to note is that we don't want to do this on the central server. In the last version of this video, I said I'd do it on the central server, maybe on another thread, and I do regret saying that, so let's change it up. The gist is we can have other servers that subscribe to the video call, perform the encoding we need to actually make the recording, and put it in S3. Say we just need one recording: one version where the active speaker is always the big frame, and that's all we have access to after the fact. That's fine; we can do that easily enough by having our central server here, our recording server there, and the recording server takes in all those frames, does some work on them on another thread, and uploads the result to S3.

However, sometimes you don't just want one version of the Zoom call; you want full high-definition views of every single user's stream for the duration of the call. If that were the case, this one recording server could get overloaded: instead of selective forwarding, where it's one HD stream and maybe a couple of SD streams, you might need 100 full-on HD streams, and that could overload this guy, which would be bad. So, since getting all user streams on one box could be too expensive, why don't we distribute this thing? To do that, we'd have multiple recording servers still doing basically the same thing as clients: they establish, via websocket, which streams they actually want. Maybe this one gets the stream for user one in HD, that one gets user two in HD, and another gets user three in HD. They cache those frames and throw them in Kafka (that's right, Jordan's back, and he's using Kafka). We shard this queue on the chat ID, because we need all the frames from the same chat to be together so they can eventually be combined. Then we've got some stateful stream consumer. I say stateful because if this guy were to go down, it would be bad to have to restart reading all the frames from Kafka; it would be great to just pull a checkpoint from S3 and continue encoding from there. The stateful stream consumer will, in turn, sync the final recordings over to S3.

One other thing to note: because all of these frames are coming from different sources, if one source is a little slow to consume its footage, the frames might not be perfectly aligned in Kafka; this guy might upload maybe five seconds before that guy down there. So how does the stream consumer actually know which frames line up? Fortunately, every frame gets a timestamp from the central server; the central server exports that timestamp with the frame, and the stream consumer can use the timestamp of each frame to properly align all of the footage.

Cool, so let's look at our full diagram. We've got a call client on the left; the first thing he does is go to the load balancer with his chat ID (the chat ID is probably encoded in the actual link he clicks to open the Zoom). Once he does that, the load balancer, which listens to ZooKeeper, uses the consistent hashing policy to map him over to our chat server.
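That timestamp-based alignment in the stream consumer is essentially a k-way merge: each recording server's frames arrive already ordered by the central server's capture timestamp, and the consumer interleaves them into one global timeline. The frame tuple shape here is invented for illustration:

```python
import heapq

def align_streams(*per_source_frames):
    """Merge per-recording-server frame lists, each already sorted by the
    timestamp the central server stamped on every frame, into one globally
    time-ordered sequence. Each frame is (timestamp_ms, user_id, payload)."""
    return list(heapq.merge(*per_source_frames, key=lambda frame: frame[0]))
```

Because `heapq.merge` only ever holds one frame per source in memory, the consumer can do this alignment incrementally as frames stream out of Kafka rather than buffering whole recordings.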
Once we have a chat server, we're going to do two things. First, we establish a websocket connection, which is for the selective forwarding, because the chat server needs to know which streams I want. Second, we start subscribing to the UDP feed, which is the incoming frames and audio of the other users. Simultaneously, our user also sends a few variants of their own stream over to the chat server, maybe 1080p, 720p, and 480p, as three different UDP streams that the chat server can ingest. In addition, we've got our passive chat server over here, which again is listening to ZooKeeper, basically just waiting for the active chat server to go down; if it does, it takes over, claims its place in ZooKeeper, and we proceed as normal.

Finally, as far as recording goes, let's imagine we did in fact want to process so much video footage that we needed multiple nodes to take in the recording data. So imagine we have stream ingress servers one and two. They're both listening to the UDP feed, and they're both connected via websocket to say exactly which streams they want. If we really want to get into the weeds here, they can also use some sort of consistent hashing policy so that server one is responsible for one range of users' footage and server two for another. The gist is they take in all this footage and throw it into Kafka with its timestamp, partitioned on the chat ID, and then over here our stateful consumer (use Flink, use whatever you want; I love Flink, sorry) caches all of this stuff, combines it however we want, and finally syncs it to S3 for later access.

Well guys, I hope you enjoyed this video. I personally am falling asleep, so I'm sorry for stuttering a little bit, but hopefully it did actually make sense this time around. Anyways, enjoy your night and have a great day.