hey everyone, welcome back for the second system design problem breakdown. The first one was Ticketmaster, and the reception to that was really, really positive, which was encouraging; to be honest, I didn't know if I wanted to make these YouTube videos, but you all really enjoyed that first one and there were some great comments, so I'm going to try to keep these rolling. I can't make any commitments, but maybe once every two weeks or so, at least in the near term, you should see a new one. The one that we're going to do today is design Uber. This is the hardest, or the most interesting, in my opinion, of the proximity search problems; if you can do this one, then you can probably do Find My Friends, Yelp, etc. It has a lot of the same common pieces, and this is asked a ton at most of the top companies, notably Google and Meta, so I think it's a great one for us to practice. Now, a bit about me as a refresher: I spent five years at Meta, where I was an interviewer and a staff engineer, and I'm now the co-founder of Hello Interview. Hello Interview is a site that helps candidates prepare for upcoming interviews, largely via mocks with senior FAANG engineers and managers like myself. I've asked this question well over 50 times, and I know exactly where candidates of all levels, from mid-level up through senior, staff, and principal, do well and where they tend to get tripped up. So we're going to walk through this as if it's a real interview, but I will periodically interject with some tips or frameworks, really sharing that perspective from a senior interviewer. Lastly, if videos aren't your thing, I've written a detailed breakdown of this problem on hellointerview.com, so you can come over (I'll link it in the comments) and read that if you prefer. We also have a handful of other breakdowns of common problems, and the list is continuing to grow, so come over and check it out. All right, without
further ado, let's get into it. Okay, first things first, let's touch on the roadmap that we're going to follow throughout this video, which is also the roadmap that I recommend you follow for any of your system design interviews, particularly those where you're designing a user-facing product like design Uber. The first thing we're going to do is go over the requirements of the system: both the functional requirements, the features of the system, as well as the non-functional requirements, the qualities of the system. We're then going to touch on the core entities and the API. Core entities are similar to what you may be thinking of in terms of a data model, but it's really: what are the objects that are exchanged and persisted in my system? The API will then use those objects in order to expose the user-facing APIs that allow you to satisfy your functional requirements, the features of the system. Next, we'll sketch out a high-level design. The goal of this high-level design is simply to satisfy the core functional requirements of the system; it should be simple, just enough to satisfy those core features. And then lastly, we'll go into detail in some deep dives, where the goal is to make sure that our system satisfies all of our non-functional requirements. So this is the roadmap that we're going to follow. All right, the very first thing that you're going to do is write down the functional requirements of the system. The functional requirements, as I mentioned a moment ago, are the core features of the system, so these are often "users should be able to..." statements. Now, ideally the interviewer has chosen a system that you've used before and know pretty well; that would be amazing and would make this a lot easier. In the case that they didn't, you'll need to ask a handful of questions to try to understand that system. I'm going to assume that most of us know Uber pretty well, and it's certainly the case that I do,
so the core requirements of the system are that users should be able to input a start location and a destination and get an estimated fare. This is what happens at the very beginning when you open up your Uber app: you say where you want to go, it knows where you are, and it gives you a bunch of price estimates. It says that Uber X will cost you $32, Uber XL will cost you XYZ. Now, I'm actually going to mark multiple car types out of scope, and we're just going to do Uber X, for example, so just one estimate. The next thing, once you get that estimate, is that users should be able to request a ride and be matched with a nearby available driver in real time. So: users should be able to request a ride based on an estimate. Cool. And then the last thing is that drivers need to be able to accept that ride request, then navigate to the user's location, and finally to the user's destination. So: drivers should be able to accept or deny a request and navigate to pickup/dropoff. Now, there are different versions of an Uber system design question. In some versions maybe you don't have the estimated fare part; maybe you focus more on the actual navigation. This is typically where I lead candidates, and it's what I think is the most interesting, but know that it's not all-inclusive, so maybe you want to take a few moments to think about some additional functional requirements and how you'd include those in the high-level design that we'll get to later. You saw that I added one thing out of scope here; it's good to clarify some additional features that you might think of but are clearly not going to do. It's really important that you stay focused in a system design interview; they move really quickly, so I recommend that you focus on at most three core features of a system, and everything else you put out of scope. For example, some other things that I could think of
are ratings for drivers and riders; I'm not going to handle that, so it's out of scope. Another is something like scheduling a ride in advance. This shows my product thinking as a candidate, while showing clearly that I can prioritize and focus on what matters. So we'll keep it there. The next thing we do is our non-functional requirements. The non-functional requirements are the qualities of the system; these are typically the "-ilities": scalability, availability, integrity, these sorts of things. Now, most candidates make the mistake of (a) rushing through their non-functional requirements and (b) just writing down some of those buzzwords, and I can tell you right now that's both not useful and doesn't help you with the design later on. For all candidates, but particularly senior and staff candidates, the non-functional requirements are incredibly important; this is what's going to dictate your deep dives later on. So you want to make sure that you (a) identify the qualities that are unique and relevant to this system (what makes this system challenging, what makes it difficult), (b) put them in the context of this system, and (c) ideally quantify them if possible. For example, what do we care about in this system? Well, we care about low latency, as we do in most systems, but in the context of Uber it's low-latency matching, and we can quantify that: we can say maybe less than one minute to match or fail, where failure would just be giving the user a message saying that there are no available drivers in the area. Now, what else is important? We can consider CAP theorem here: do we prefer consistency or availability, given that partition tolerance is a must? In this case we actually really care about consistency, but specifically consistency of matching. We want to make sure that any given ride is only ever matched to one driver; ride to driver is one-to-one. We want
to make sure that there are never multiple drivers who get the same ride, and for that reason consistency is going to be incredibly important. Now, on the other hand, we do still care about availability; we care about availability for basically everything outside of matching, so: highly available outside of matching. This means that we're going to minimize downtime; the system is going to be consistently available and reliable, able to process requests 24/7. This is going to be an important component of our system. And then lastly, we should be able to handle high throughput, but again, putting that in the context of our system, this means surges during peak hours or special events. If, say, a football game or a Taylor Swift concert just got out, you're going to have massive surges, and so we can try to quantify what our surges might be. Given that we're talking about stadiums or New Year's Eve, surges are probably on the order of hundreds of thousands of requests within a given region. All right, perfect. Then, additionally, we can consider some out-of-scope items here as well. Some things that are always worth considering are GDPR and user privacy, but we're not going to go into them in this design. Resilience and handling system failures: frankly, we may touch on that a bit, but it's not going to be of primary importance to the design. Monitoring is always a good thing to have in any design, but we're not going to have it in ours: no monitoring, logging, alerting, etc. And then deployment and CI/CD pipelines are other things that would be great in any design but aren't going to be the focus of our system. So now we have a really clear understanding of the system that we're going to build: both the core features (being able to get fare estimates, request a ride, match it, and allow drivers to accept or deny those requests) as well as the non-functional requirements: the importance of low-latency matching, consistency of matching, high availability
outside of the matching context, and the ability to handle high throughput. From here we can move on to our core entities and our API. All right, before we move on to the core entities, there's one thing I want to call out, and that's that some viewers might have realized that we didn't do back-of-the-envelope estimation, and you're right. My opinion here, having conducted so many of these, is that back-of-the-envelope estimations, particularly early on in the design like right now, are almost always a waste of time. Candidates estimate daily active users, bandwidth, and storage, and at the end of those estimations they look at their numbers and go, "okay, so it's a lot," and then they move on. At most, they came to the conclusion that it's a scaled system that's going to need to be horizontally scaled, going to need to use things like maybe S3, whatever it may be. But they knew that before they started the calculations. So I learned nothing about the candidate, and they burned valuable time in what is usually just a 35-minute interview once you exclude pleasantries and time for questions at the end. My opinion here is that candidates should forgo the back-of-the-envelope estimations immediately after the non-functional requirements and only do them if the result of the calculations is going to directly influence the design. Here's how I suggest you approach that. When you finish non-functional requirements, say this to the interviewer: "I know a lot of candidates do back-of-the-envelope estimations at this point. It's my preference to forgo them for now and instead do estimations during my high-level design if there are calculations whose result will have a direct influence on my design. Is that okay with you?" And 99.9% of interviewers will say, "yes, that's great." You might have the 1% of weirdos who say no, in which case, no worries, you'll do your estimations then to satisfy them. But that's my strong
recommendation when it comes to back-of-the-envelope estimations. Now, monologue aside, the next thing we do here is jump into the core entities. You may be wondering why I keep saying "core entities" as opposed to "data schema" or "data model," as you're probably more accustomed to. The reason is that I've found most candidates, and myself when I'm interviewing, don't know the full data schema at this point. I don't know all the fields and all of the columns; it's far too early in the design. But I can sketch out what my core entities are: what are the entities that are being persisted? This usually maps one-to-one with the tables or collections that you're going to have. Oftentimes I'll document the entire schema with the full columns as I go through my high-level design, as you'll see me do here today, and I'll start here instead by just bulleting those core entities. These core entities are going to be useful to inform my API; we'll see that in a sec. So what are the core entities for something like Uber? Well, I'm going to have a ride object, of course; I'm going to have a driver; I'm going to have a rider; and then, this one's maybe less obvious, I'm going to have a location entity. This location entity is going to be responsible for holding the most up-to-date location of all of my drivers so that I can quickly match them based on proximity, and we'll see how that works in the high-level design. And so now we move on to the API, and the API's goal is really simple; people kind of overthink this or make it more complicated than it needs to be. We're going to use our core entities in order to go one by one, satisfying our functional requirements. We might have one, we might have many APIs per requirement, but we're going to look at those one by one and make sure that we have a full set of APIs to meet them. For example, the very first one that we have here is to get a fare estimate, and what this is going to
do is request the ETA and the price of a ride and persist those in our database. Instead of using a separate entity, I'm going to choose to just create a ride entity with an ETA and a price, and that ride entity is something we can then continue to update if they ultimately decide to request the ride, get matched, etc. If they don't, we have some excess data lying around, but it'll be great for analytics to know what people are requesting and ultimately not choosing. So this API is going to be a POST, because we're going to be creating new data: POST /ride/fare-estimate, and it's going to return a partial ride. I use Partial here; this is just TypeScript notation, not a requirement for the interview, so you can use whatever you want. It basically just says we're going to return part of the ride entity, specifically the parts that I'm interested in: the ride ID, the ETA, and the price. I might mention that verbally in the interview, or you might even write it down. The things that are going to be passed in the body here are just the source and the destination: basically, where am I coming from and where do I want to go? So we're exchanging our core entities in order to satisfy our first functional requirement. Easy. Now let's move on to the next functional requirement, which was that users should be able to request a ride based on an estimate. So now, given that estimate that was just returned to us, we should be able to request a ride. I'm going to use PATCH, since we're going to be updating that same row. This could also be PUT; it doesn't matter too much, and I wouldn't get caught up here. So I'm going to request a ride, and that's going to take in the ride ID that was returned here: that ride that we persisted, that we made an estimate for, that already has a source and a destination. I'm now going to request that I get matched with a driver because I actually want to go on that ride, and it's going to take in a ride ID.
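To make these two endpoint shapes concrete, here's a small TypeScript sketch of the request/response types and a toy in-memory handler. The endpoint paths, field names, enum values, and the pricing multiple are all illustrative assumptions of mine, not anything Uber actually does; a real service would call a mapping provider for the ETA and persist the ride in a database, and the rider's identity would come off the session/JWT rather than a parameter.

```typescript
// Illustrative shapes for the two ride endpoints discussed above.
// All names and values here are assumptions for the sketch.
type LatLong = { lat: number; long: number };

interface Ride {
  id: string;
  riderId: string;
  source: LatLong;
  destination: LatLong;
  eta: number; // minutes
  fare: number; // dollars
  status: "FARE_ESTIMATED" | "REQUESTED" | "MATCHED" | "IN_RIDE" | "COMPLETED";
}

// POST /ride/fare-estimate -> just the fields the client needs
type FareEstimateRequest = { source: LatLong; destination: LatLong };
type FareEstimateResponse = Pick<Ride, "id" | "eta" | "fare">;

// PATCH /ride/request -> matching is async, so we only acknowledge receipt
type RequestRideRequest = { rideId: string };

// Toy in-memory store standing in for the primary DB.
const rides = new Map<string, Ride>();

function postFareEstimate(req: FareEstimateRequest, riderId: string): FareEstimateResponse {
  const eta = 10; // stand-in for a third-party mapping call
  const ride: Ride = {
    id: `ride-${rides.size + 1}`,
    riderId,
    source: req.source,
    destination: req.destination,
    eta,
    fare: eta * 2, // toy pricing: a simple multiple of the ETA
    status: "FARE_ESTIMATED",
  };
  rides.set(ride.id, ride);
  return { id: ride.id, eta: ride.eta, fare: ride.fare };
}

function patchRequestRide(req: RequestRideRequest): number {
  const ride = rides.get(req.rideId);
  if (!ride || ride.status !== "FARE_ESTIMATED") return 400;
  ride.status = "REQUESTED"; // matching kicks off asynchronously from here
  return 200;
}
```

The point is just that the second call updates the same row the first call created, which is why PATCH (or PUT) fits.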
And since this operation is going to be asynchronous (we talked about it taking up to a minute to match with a driver), this is just going to return a 200 status code, or maybe a 400 in case of failure, and the matching is going to happen asynchronously. Great. Now, the other thing to consider is that the actual matching process is going to require us to know the location of drivers; we mentioned that. So how do we know the location of drivers? Well, we're going to have to have some additional endpoint here. Maybe it's a location update endpoint that drivers call every N seconds with their lat/long to let us know where they are. That's going to be important, and note that a lot of candidates miss this up front in the API section. That's totally fine; there's no expectation that you get every single API right out of the gate. As you're designing, you might find more, and you'll come back and add to this list. That's totally great. Of course, I've done this so many times that I have foresight you might not have in an interview, so I'm going to add it here immediately, but just know this is kind of a moving model that can be updated as you go. And so now we've satisfied the second functional requirement. The third one is that drivers should be able to accept or deny a request and then navigate to pickup and dropoff. To be able to accept or deny a request, we're going to go back to our ride API. This is going to be driver accept, and they'll pass in that ride ID as well as true or false, depending on whether they accept or deny it: given this ride ID, do I accept or deny the match? And again, I chose PATCH here because we're going to be updating that same row. Realistically, this could also be a POST, because it probably will need to create a new row or update a different status; we could come back and change that, but I think PATCH is going to be fine for now. And then the last thing is that drivers are going
to be navigating first to the pickup location and then to the destination, and they need to be able to update us: to tell us, "I've gotten to the initial pickup location; it's time for me to go to the destination." So I'm going to create an endpoint for that. It's going to be something like ride driver update, and what it's going to take is, of course, that ride ID again, and then some status; maybe this status is going to be "picked up" or "dropped off," something like that. What it's going to return is maybe the lat/long of the next place they need to go, or null if that's not relevant. So in the case that they picked the rider up from the pickup location, we'll update that in the database and send them the lat/long of the destination they need to navigate to. On the other hand, if they tell us that they dropped off the rider, then we'll just give them a 200 and return null here; there's nowhere further for them to go. Great. So this is a complete set of our APIs; they use our core entities in order to satisfy our functional requirements. A couple of things I'll note quickly, beyond the fact that you can come back and amend these as I already mentioned. People may be noticing that I don't put any data types here, and why is that? Now, if you're a mid-level candidate or a junior candidate, go ahead, you can put data types, that's fine. Senior candidates, please don't. I find that this is a massive waste of time. I see candidates sitting here typing out "number" and "number," and it's borderline insulting to the interviewer. Maybe I'm being too flippant in saying that, but it's inferred: you're a senior candidate, you're a senior-plus candidate. We know that you know that a latitude is a number; we know that you know that a ride ID is either a string or a number; we know that accept is a boolean. We know these
basic types, so unless you're going to introduce something unique, like I did here in the case of an enum, then maybe it's worth spelling out, but otherwise your interviewer is smart, they're following along, they've done this plenty of times. Save yourself the time; again, 35 minutes is going to go quick. The other thing to mention is that you'll notice I don't put anything like a user ID in any of these requests. For example, driver accept: how do we know which driver sent this? We're going to get that off of the session or the JWT that's in the request header. This is important for security reasons; I suggest you don't put something like a user ID here unless you're explicit with your interviewer that you don't actually mean you're putting it in the body, but rather that it's in the header. Because if we did put it in the body, then any user could, say, request a ride on another user's behalf just by making this request with the other user's ID in the request body. So, good API security: we're going to use either JWT tokens or session tokens, and they're going to be in the request header. This is something I would speak to during the interview; maybe I would write it as a bullet, but it's not necessary. Awesome. All right, so we've got our core entities and our API down; let's keep moving on to the fun stuff. We've laid a really strong foundation; we've gone over our requirements, the core entities, and the API. At this point, ideally you're about 15 minutes into the interview. If you're a mid-level candidate, maybe it's taken you a little longer than that, 15 to 20; if you're a senior or staff candidate, ideally 15, maybe even a little less. This gives you plenty of time now to go into the high-level design, which we'll do now, while leaving room at the end for deep dives. We're going to continue this theme of each section relying on the one before it. For the high-level design, the goal, as we mentioned, is to be able to
satisfy our core functional requirements, the features of the system, and the way we're going to do that is by going one by one through our APIs and using them as the input flow of data to design a system that satisfies each of them. If we satisfy our APIs, then we've satisfied our functional requirements, because of course we built our APIs based on our functional requirements. So hopefully that makes sense. Let's start right away with our first API endpoint, which again is POST fare estimate. It takes in a source and a destination and returns to the user a fare estimate as well as an ETA for a particular ride request. So what do we need in order to make that a reality? Well, the first thing we're going to need is a client, and we're going to have a client for a driver; this can be an iOS app or an Android app, and we won't support web. And this driver, excuse me, I said driver, I mean rider, this rider is going to make a request, and that request is first going to hit our AWS-managed API Gateway. This guy is going to be responsible for a couple of things: load balancing and, most importantly, routing. It's going to take each incoming API request and route it to the correct microservice. It can also do some additional auxiliary things like authentication, SSL termination, rate limiting, etc. It's nice to jot some of these things down; it shows well for the technical excellence axis that Meta evaluates under, and most of the major companies have something to that effect, but don't let it distract you; it's not the most important thing. So, important: I've got an AWS-managed API Gateway here, and it's simply going to be responsible for routing to each of my respective microservices. Now, in the case of
this request, this is going to be our get-fare-estimate request. We're going to hit this ride service that we'll create here, and the ride service is going to handle fare estimates. The real Uber is going to have some complex ML model responsible for these estimations, and in the interview you have a couple of options: you could black-box it and say there's going to be some complex model here that we're going to use, or you can oversimplify. I'm going to take the route of oversimplifying, and I'm going to over-communicate that to my interviewer: I'm going to use a third-party mapping service. This could be something like Google Maps, and this is just going to be an API call that we make in order to find out, given current traffic, the current ETA to get from our source to our destination. Then we're just going to use that ETA in order to estimate a price, maybe via a simple multiplication from there. So our rider is going to make a request to get a fare estimate, which hits our ride service, which uses a third-party mapping service to actually get that estimate, and then we need to persist that estimate in our primary database here. I'm just going to label this "primary DB" for now. This is where, as I mentioned earlier, we forgo the actual data schema and data models until we know more during the high-level design, and this is the appropriate time to fill them in. Now we're persisting one of those ride objects, and given that I'm this deep into the flow of data, I know exactly what needs to be persisted and what should exist on this ride object. For example, I know it's going to have an ID, that's pretty basic; I know it's also going to have a rider ID; it's going to have the fare that we just estimated; it's going to have an ETA; it's going to have a source; it's going to have a destination; and then it's going to
have a status, and that status is going to start right now as "fare estimated." We'll have some more things that we add in a moment, so this isn't complete; there are more fields here, and we're going to come back and add to it. Each time one of our arrows lands on our database, we'll ask: is there another field that should be written here? And then the other thing we'll add, since we added this rider ID, is that we know we have a foreign key to some rider object. This is less interesting: a rider is going to have an ID, and they're going to have some metadata, probably their location, some of their payment information, etc. It's not the most interesting thing for this design, so I'm just going to "dot dot dot" it, which is kind of a great tactic to ensure you don't get distracted. So we've satisfied our first API endpoint. Just to be super explicit about the flow of data: the rider makes a request to get that fare estimate, we calculate the estimate, we persist it in our DB with these fields, super easy, and then we return back to the user the ID, the fare, and the ETA. Fantastic. Now, the next API endpoint we need to satisfy is when the user, having gotten that fare estimate, clicks on it in their client to say, "I like the estimate, I want to request a ride." This is what starts the actual matching process, and this is what ultimately matches them to a nearby driver based on proximity. To handle that, I'm going to introduce an additional microservice, and this is going to be the ride matching service. The reason I'm using a separate service here comes down to a handful of reasons. One, the things that these two do are incredibly different; this one's going to be more computationally expensive, and it's also going to be an asynchronous process. So by separating them, I allow them to scale independently, and I allow them also to be
maintained by separate teams, and that's going to be effective for this design. So now the rider is going to make a request, request ride, and it's going to hit our ride matching service, which is of course responsible for matching drivers and riders. The first thing this service needs to do is know the location of the drivers, so we're going to have some location DB; remember, we had a location entity. This location DB is going to store just the current location of the drivers, and we can make a request here to get driver locations, probably within some radius. Maybe we say: get all the drivers within one mile, two miles, three miles; get all drivers within N miles of the current rider's location. (The rider/driver thing really trips me up; anyway, hopefully you're all still following.) This naturally begs the question: how do we get locations into that database in the first place? That's where this endpoint came in, location update. This is the API endpoint that drivers are going to call in order to make sure that this location DB is up to date. So we'll introduce a driver client, and that driver client is going to make periodic calls, let's say every 5 seconds just to be easy, to a location service that is responsible simply for updating the location of drivers. So this is going to be update location; it's going to take a driver location and store it in the location DB so that the ride matching service can query it. The ride matching service then has the eligible drivers, but it's going to need to make sure that none of those drivers are currently in a ride or offline. We can say get status by driver ID here, so this service can make a call to query our database for a driver, and a driver is going to have an ID;
it's also going to have some metadata like their car, their license plate, their image, all that sort of stuff, but importantly, it's going to have a status, and that status might be "in ride," "offline," or "available." Maybe there are some additional ones too, but we'll stick with those for now. So the ride matching service got all of the drivers within close proximity, filtered out those with a status other than "available," and then it starts sending out requests to them saying, "do you want to accept this ride?" The way it's going to do that is a bit abstracted in our design, but I'm going to introduce this notification service. A notification service is itself an interview question, so I'm going to black-box it for now, but this is going to use something like Apple push notifications, and I believe Android uses Firebase (am I making that up? don't quote me on that); regardless, we're going to use the native push notifications for each platform respectively. The arrow now is going to go like this, and we're going to send a push notification to the driver's device saying, "do you want to accept this ride?" If they choose to accept, they're going to make an API call when they click on it; they're going to be calling that driver accept endpoint. So that's going to be this accept, it's going to take in a boolean, it's going to hit our ride service, and then we're going to update the status of the ride accordingly. One additional thing that should be on the ride, of course, is the driver ID; we'll call it optional because it doesn't exist from the start, it only exists once there's a match. Then we would update our status to maybe "matched" or "in ride," or whatever it may be, something along those lines. So that's what happens when the driver accepts the ride. And then the very last thing that happens is that the rider needs to know where to go, or excuse me, the driver needs to know where to go, and so we created an endpoint called
update, where they tell us that the status has changed and we give them the next place to be. The driver calls update, which updates their status (maybe they got to the source, maybe they got to the destination, any of those additional statuses), and then we optionally return the lat/long of where they need to go next. So if they just got to the source, then maybe their status updates to "picked up," and we send back the destination. Okay, so at this point we have a fairly simple design, and it should satisfy all of our functional requirements. We should be able to go back, look at it, and confirm that yes, we can request fare estimates; yes, we can request a ride based on an estimate; and yes, drivers can accept or deny a request and then navigate to the pickup and dropoff locations. Now, this design is simple, and it doesn't go into detail in a lot of places where we will need to: how does this location DB work? How does the ride matching service ensure consistency such that there's no double-matching of drivers to rides? These are all things that we're going to get into in our deep dives, but for now we should be able to communicate to our interviewer: "I have a high-level design that satisfies my functional requirements; next up, I'm going to go deep in a couple of places to satisfy my non-functional requirements." All right, so our high-level design is in place, and we're going to jump into those deep dives now. If you're a senior or staff candidate in particular, these deep dives are really where you're going to earn your keep; this is where you're going to show off that you have level-appropriate knowledge. If you're a mid-level candidate, then what we have on the board right now is pretty damn close to a passing interview. Now, it's probably not quite enough; your interviewer is going to probe you in some places, maybe as it pertains to the matching algorithm and consistency, or speeding up this location DB, maybe sharding, and your primary DB
scaling, and so on. You should be able to answer some of those questions fairly competently, to the best of your ability, and that's probably around good enough for a mid-level hire. For senior, that's certainly not the case: you now need to go deep to satisfy your non-functional requirements, and you should be able to go deep in a couple of places, aim for two or so. For staff, the expectation is that you can go even deeper, maybe in more places, say three, and the depth itself is increased. It's often even the case with staff candidates that you teach your interviewer something. That isn't strictly a requirement, but it's usually a really good sign: if your interviewer walks away having learned that this staff candidate had a lot more knowledge about, say, DynamoDB, and they were able to go deep into some properties of DynamoDB and how they pertain to this particular system, leveraging their own personal experience, that's a really fantastic sign. So we're going to show a couple of places where you could go deep and show off that depth. We'll do more than is needed in frankly even a staff interview, but I want to show the range of places where candidates could go deep here.

As I mentioned, to inform where we go deep, we come back to our non-functional requirements, and we can really just go one by one through them. The first is that we need to make sure our system has low-latency matching. That's going to matter in a couple of places, but the place I want to focus on first is our location service, because this is a really pivotal and interesting part of our system. As it pertains to efficiency and cost, we need this query on the location DB for nearby drivers to be really quick, so note that. The additional thing along this path is that there are going to be a lot of updates, so this path needs to handle the number of transactions per second we're getting from driver location updates. That's actually where I'm going to start: I'm going to estimate that. We talked about in the beginning how you shouldn't do math for the sake of doing math; you can do it later in your design when it makes sense, and that's exactly what I'm doing here. I know that Uber has about 6 million drivers, and I'm going to estimate that maybe 3 million of those are active at any given time; for what it's worth, that's probably an overestimate, but we'll go with it anyway. We also said earlier that drivers send a location update every 5 seconds. That means we're going to have 600,000 updates coming into the system every second, so whatever database and service we have here need to handle 600,000 transactions per second. We'll talk later about how, realistically, there are ways to trim this down, but let's keep that higher estimate for now.

Awesome, so let's talk about this location DB: how can we store the location of all of our drivers and make it incredibly quick and efficient to look up drivers within some radius? The most obvious thing we could do is use some SQL database, say Postgres. We could have a column each for latitude and longitude, and then we'd just issue a range query to find any drivers within a lat and long upper and lower bound; really four total bounds, upper, lower, left, and right. Now this sucks for a handful of reasons. The first is that the indexes we would build on lat and long are B-trees, and B-trees are optimized for one-dimensional data, while lat/long, as we just mentioned, is inherently two-dimensional.
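To make that naive approach concrete, here's a toy sketch with SQLite standing in for Postgres; the table, columns, and numbers are all made up for illustration, and the four-bound query is exactly the shape being critiqued:

```python
import sqlite3

# Illustrative stand-in for the naive design: plain lat/long columns
# plus a four-bound range query. Schema and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drivers (id TEXT, lat REAL, lng REAL)")
conn.executemany(
    "INSERT INTO drivers VALUES (?, ?, ?)",
    [("d1", 40.75, -73.99), ("d2", 40.76, -73.98), ("d3", 34.05, -118.24)],
)

# "Nearby" = inside a bounding box around the rider. A B-tree index on
# lat (or lng) only narrows one dimension; the other bound is a scan.
rider_lat, rider_lng, delta = 40.75, -73.98, 0.05
rows = conn.execute(
    "SELECT id FROM drivers WHERE lat BETWEEN ? AND ? AND lng BETWEEN ? AND ?",
    (rider_lat - delta, rider_lat + delta,
     rider_lng - delta, rider_lng + delta),
).fetchall()
print([r[0] for r in rows])  # the two Manhattan drivers, not the LA one
```

This runs, but only because the table is tiny; at real scale the one-dimensional index problem described above makes it fall over.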
So these queries are not going to be well optimized; they'll be slow, non-performant queries. It's also going to require scanning a large number of rows for wider queries: if we have a region of, say, 20 miles, we now have to scan a ton of rows to get this data, and that's also bad. Lastly, and why Postgres is just a terrible answer here, is that Postgres can handle maybe 2,000 to 4,000 transactions per second. That can certainly change depending on write replicas, optimized hardware, and so on, but let's estimate it's around there. If we can only do 2,000 to 4,000 transactions per second, we're way off the 600K we need to be at, so we can consider this out of the question; it's a bad option. If a candidate proposed this, I would probe deeper even if they were mid-level; it's not a great answer. And if I probed and pointed out those limitations, a candidate might naturally realize two things: one, if I'm going to use Postgres, then I need to do these proximity searches much more efficiently; and two, I need to somehow reduce my 600K TPS to something closer to 4K TPS.

To handle the first one, you would introduce a geospatial index, which is something that makes querying spatial data more efficient. In this case we could use a quad tree, and Postgres specifically has an extension called PostGIS. The way quad trees work, and let me zoom over to a little diagram I can show, is that they take a map, imagine this is a map of the world, and recursively split it into four regions. So imagine this first split here, and then we make it a tree: here's our root node, this region is our red guy here, then green, blue, and orange, as you can see. And there's some K value; in our case I've arbitrarily chosen five. This K value determines whether we need to recursively split within any given cell: in this case we only have three drivers in this region of the world, so we don't need to split further since three is less than our K value of five; same with the two here and the two here. However, in this region you can see there are a lot of blue dots, a lot of drivers, so we recursively split again into fourths, as we do here, and add four additional branches to our tree. Then we look at each of those: this one has four, so we don't go further; three; three; but this one had four, five, six, seven, which is greater than five, so we recursively split it again. When you want to query, you simply walk down this tree, and any given leaf node holds the list of all the drivers within that cell. So that's how a quad tree works; it's a really popular geospatial index and one we can have supported by Postgres via PostGIS. That's one option, and its pros and cons will be linked at the bottom of the YouTube video so you can read about them on your own time.

So that's great, we've now made the queries more efficient. That's wonderful, but how do we handle the high transactions per second? Well, pretty much our only option in this setup is to add a queue here. This queue would be responsible for absorbing all of the incoming updates and then batching them, writing in batches so we can get the 600K down to 4K, and as you can see, these would be some pretty large batches. The obvious downsides: one, we've introduced a significant amount of latency, so our location DB is now going to be mildly inaccurate, because we had to wait until enough locations accumulated into a large enough batch before writing them to the database. That's not good. And then additionally, when we write that batch to our database, because we're using a quad tree, we need to re-index the whole structure.
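To make the structure concrete before moving on, here's a minimal sketch of the quad tree just described: a cell splits once it holds more than K drivers. This is a toy, the capacity and names are illustrative, and real PostGIS indexing is far more involved:

```python
class QuadTree:
    """Minimal point quad tree: a cell splits into four children once it
    holds more than `capacity` drivers (the K value from the diagram)."""

    def __init__(self, x0, y0, x1, y1, capacity=5):
        self.bounds = (x0, y0, x1, y1)  # half-open cell [x0,x1) x [y0,y1)
        self.capacity = capacity
        self.points = []                # [(driver_id, x, y)] while a leaf
        self.children = None            # four sub-cells once split

    def insert(self, driver_id, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False                # point isn't in this cell
        if self.children is None:
            self.points.append((driver_id, x, y))
            if len(self.points) > self.capacity:
                self._split()
            return True
        return any(c.insert(driver_id, x, y) for c in self.children)

    def _split(self):
        # Recursively split this cell into fourths and push points down.
        # (Many identical points would recurse forever; ignored in this toy.)
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        for p in self.points:
            for c in self.children:
                if c.insert(*p):
                    break
        self.points = []

    def query(self, qx0, qy0, qx1, qy1):
        """Walk down the tree collecting drivers inside the query box."""
        x0, y0, x1, y1 = self.bounds
        if qx1 < x0 or qx0 >= x1 or qy1 < y0 or qy0 >= y1:
            return []                   # cell doesn't overlap the box
        if self.children is None:
            return [d for d, x, y in self.points
                    if qx0 <= x <= qx1 and qy0 <= y <= qy1]
        out = []
        for c in self.children:
            out.extend(c.query(qx0, qy0, qx1, qy1))
        return out
```

Notice the re-indexing pain point: every insert can cascade splits, which is exactly why this structure dislikes a high write rate.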
All this new data has to be merged in and the tree shifts around, which is expensive, and the tree also takes up a lot of space in memory. So there are some additional downsides, and this is an okay option but not a great one. If a mid-level candidate or even a senior candidate landed here, it would be okay: mid-level certainly passing; for senior I'd probably point out some of these limitations and hope they could go further. For a staff candidate I have no expectation that you have a deep understanding of these geospatial indexes, but I should be able to point out the limitations and have you recognize them and adjust accordingly. So this is not optimal, but it's an okay answer.

Now, there's a better answer, and it's that in order to handle this high TPS we can make this an in-memory data store, specifically something like Redis, which can handle anywhere from 100K to, if well optimized, a million transactions per second in a Redis cluster. That's great; it handles that issue. And from a geospatial index perspective, Redis supports something called geohashing. Geohashing is a different algorithm for geospatial queries, similar in spirit to quad trees, but it doesn't require maintaining that additional tree data structure. The way geohashing works, let me paste in a quick example here next to this one, is like this: again we split the region into fourths; you can imagine this is a map of the world. Then we just keep recursively splitting each cell into fourths. We don't make any determination based on K, so it's not density-dependent; we simply continue to split until we reach a precision we're happy with. So at the first layer we have cells 0, 1, 2, and 3; if we split cell 2 again, we end up with 20, 21, 22, and 23 at layer two. We could then go again to a layer three, splitting, say, cell 20 into 200, 201, 202, and 203, and so on; you get the picture, and this can continue recursively. What you end up with is a base-32 encoded string that looks something like this. The longer the string, the more precise it is, because that final layer is zoomed into a smaller region; the shorter the string, the less precise, of course. And you can do things like look at just the first one, two, or three characters to understand the larger region a point sits in before you continue to zoom in. So that's how geohashing works. Redis supports geohashing out of the box, and the benefits of geohashing, which are great for this situation, are that a geohash is relatively cheap to calculate, it's easy to store since it's just a string, and it requires no additional data structure.

So when you're in an interview having to decide between quad trees and geohashing, which might sound like pretty obscure knowledge, keep in mind that for these proximity questions, Yelp, Find My Friends, Uber, this is where you spend a lot of time and focus, and the question is often quad tree versus geohash. The takeaway I want you to have is that a quad tree is great when you have an uneven distribution, an uneven density of locations, something like Yelp, where you have a really dense New York City but nobody in the Atlantic Ocean or the middle of Kansas. But it also requires that you don't have a high frequency of updates, because we don't want to re-index that tree on every single update; that's expensive. Geohashing, on the other hand, is indiscriminate as it pertains to density; the world is split into evenly precise boxes. So it's less good when you have an uneven density of locations, but really good when you have a high frequency of writes.
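For what it's worth, a real geohash interleaves longitude and latitude bits rather than labeling quadrants the way my diagram does, but the spirit is the same recursive halving down to a chosen precision. A minimal encoder looks like this:

```python
# Standard geohash base-32 alphabet (note: no "a", "i", "l", "o").
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=6):
    """Encode a lat/long to a geohash string of `precision` characters.
    Each character packs 5 bits; bits alternate longitude, latitude."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    even = True  # even bit positions refine longitude
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)   # keep the upper half
            rng[0] = mid
        else:
            bits.append(0)   # keep the lower half
            rng[1] = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b
        chars.append(BASE32[n])
    return "".join(chars)
```

The prefix property falls out of the construction: a 4-character geohash is exactly the first 4 characters of the same point's 8-character geohash, which is what makes "look at the first few characters for the coarse region" work. Redis's GEO commands use a 52-bit variant of this encoding internally.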
So in this case, even though our Uber drivers' locations have an uneven density, the fact that we have such a high frequency of writes makes geohashing the optimal answer, and that's what we'll go with here: Redis and geohashing. It's worth noting there are a couple of other geospatial indexing strategies out there. Uber actually uses something close to geohashing, but instead of squares the cells are hexagons (their in-house H3 library), which addresses the problem that the distance from the middle of a square to its corner isn't the same as to the middle of an edge; hexagons make the cells more circular in that way. So a couple of options exist, but quad trees and geohashes are the two main ones. If you know these two in an interview, you're totally set, and if you can weigh the trade-offs between them at the depth I just described, you're in a really good spot. So now we've solved our problem, and we don't need a queue anymore; a queue here would be over-engineering. We can easily handle the transactions per second into our Redis cluster, using geohashing to do the proximity searches. Awesome, hopefully that makes sense.

The other thing I want to mention before we move away from location is that your interviewer might ask: 600K transactions per second is really high; how can we make that smaller? Do we need to load the system like that in the first place? This is a great question, and we don't have to. There's no reason to send an update every 5 seconds; we could make some product trade-offs on accuracy and send one every 10, 15, or 20 seconds, even up to a minute. A more sophisticated answer is that we could do this dynamically on the client. The client could take in some information and send adaptive location updates based on things like the driver's status (are they currently accepting rides? if not, we don't need another location), their speed (are they parked? have they been parked for 20 minutes? if so, their location isn't changing and there's no reason to send updates), or their proximity to ride requests or hot areas (if they're out in the boonies, maybe we don't need their location as frequently because we don't need as high precision). We can put this logic here on the driver client and call it dynamic location updates, and it will make this TPS go way down. I'd still stick with the setup we have, but maybe we've now brought the 600K down to 100K or even smaller. All right, that should cover that deep dive: low-latency matching under one minute, and optimizing our location service and location DB.

The next one we can look into is consistency of matching. This is where we want to make sure that no driver is assigned more than one ride at a time and no ride is matched to more than one driver. There are two things to consider here; let me paste those in to save the typing. For consistency of matching: one, we can't send out more than one request at a time for a given ride; and two, we don't send any driver more than one request at a time. We'll handle these separately. Our ride matching service is going to get that list of eligible drivers, maybe the top 10 or so, and handling the first one, making sure a ride doesn't go out to more than one driver at a time, is really easy; it can just be logic within our ride matching service. We can simply have a while loop that says: while we still don't have an accepting driver, go to the next one, send them a request, wait our 10 seconds, and then if we still haven't
matched, go to the next one, send the request, and so on. This works because a single ride that needs matching is only ever handled by a single one of these ride matching service instances, a single server, so we can keep that logic within that server, and it satisfies number one pretty simply. I'll paste in some pseudocode in a moment that should make this even clearer, because my description wasn't as clear as it could have been.

So one is the easy one; two is the hard one, and it's hard because you have to remember that these servers are horizontally scaled: we have multiple instances of this ride matching service. Imagine a bunch of people just got out of a concert and the closest driver is the same for a lot of them, and those ride requests got picked up by different servers. Each of those servers asked, who's the closest? and in each case got back a list that looks like driver 1, then driver 2. So each of those ride requests sent a notification to driver 1, and in the case of a Taylor Swift concert with 100,000 people, driver 1 is sitting there and their phone just goes ding, ding, ding; they got 100 requests all at once. That's obviously bad; we don't want that to happen. Given these multiple instances, there's nothing we can do on a single box; we need coordination between the N instances, so they all see a shared view of which drivers are currently occupied.

So what we could do is update a status. We already had a status field on our driver, and we can add an additional status, maybe request sent, something like that. When we get to driver 1 in our list, we'd update that status and then send the notification, so that any other instance wanting to send to driver 1 first looks in our database and checks, was a request already sent? This check-and-set would of course need to be atomic. If the request was already sent, we'd move on, maybe to driver 2, because we know driver 1 is occupied. This is very similar to Ticket Master, actually, and how we handled no double-booking, for anybody who watched that video; if you haven't, go check it out. And if you did watch it, you'll see the issue with this approach too: we only want a request to be outstanding for a certain period of time, say 5 or 10 seconds; a driver has about 5 seconds to accept. If you've driven for Uber you've probably seen this in the app: a popup appears with a bar that fills over about 5 seconds, and the driver has to tap the button before it runs out, or else we send the ride request to somebody else. So if we set the status to request sent and the driver doesn't respond, they don't accept or deny, they're just driving and not touching their phone, then this status is stuck as request sent indefinitely, and nobody can ever send driver 1 another request. We need some way to reset this status. One thing many candidates think of is introducing a cron job that runs every N seconds or minutes and looks for drivers stuck in request sent. We'd need a status-updated-at timestamp, and we'd basically query the driver table for anybody in a request sent status to check whether it's been 10 seconds since that status was
updated, and if it has been, flip it back to available. This would absolutely work, but the issue, just like with Ticket Master and maybe even worse here, is that you're introducing a delta between the time a driver should have become available and the time the cron job runs. If the cron job runs every minute and our window is 5 seconds, meaning we only want a driver locked for 5 seconds, then it could take up to 55 extra seconds before we actually unlock you when you should have been available to accept rides. That isn't optimal, and when things are really busy we want to match riders to the best driver so users have the best experience possible. So this is an option; it's fine for a mid-level candidate, and it may even be passing for senior depending on how well you do elsewhere in the design, but I would ask follow-up questions, and it's probably not good enough for staff; I'd hope a staff candidate identifies the limitations here.

A better option, just like with Ticket Master, is to skip the cron job and use a distributed lock instead. This could be Redis; we could even reuse the same instance, and I'll talk in a moment about how it could live in the primary database, but I want to keep it separate for now to make the idea clear. So we have a Redis instance here, and when we want to lock a driver, all we do is set a key of the driver ID with some value, maybe true, and a TTL of 5 seconds. It's just a key-value pair we put in Redis, used as a distributed lock that says this driver is currently locked. The value doesn't matter; I put true, who cares. We set the key to the driver ID with a TTL of 5 seconds, so after 5 seconds the entry is automatically removed from the cache. Now anybody trying to send a request to a driver first checks whether that driver is locked. The check is really quick, basically a hashmap lookup: does this driver ID exist in my lock store? If it does, I go to the next driver, because this one is currently locked. And if the driver doesn't respond, the entry is automatically removed when the TTL expires, so the next instance that reads won't see the driver ID there and can send them a request. So this is fantastic: a distributed lock, very similar to what we did in Ticket Master. It gives all of our ride matching services a consistent global view of which drivers are available to request, and it's wicked fast, snappy, great.

Now you might point out that we have two Redis instances; why not put this in the same one? You totally could. Another thing you might do, which might even be smarter, is put it in the primary database; I think DynamoDB would be a good choice here, since there aren't a ton of relations, high availability is important, and we can scale it more or less indefinitely. We'd introduce another table, a driver lock table with just the driver ID, and DynamoDB lets you set TTLs on rows. So you could do something like that, and then you don't need a separate Redis instance to manage; you just use the same database. Good consolidation, no need to introduce an additional technology. So that's how we handle consistency of matching, and just to make sure this is crystal clear, let me spell out roughly what the matching loop looks like.
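Something like this, a sketch under the assumption of an in-memory stand-in for the Redis/DynamoDB TTL lock; `send_offer`, the ranked driver list, and all the names here are illustrative:

```python
import time

class DriverLock:
    """In-memory stand-in for Redis `SET driver:{id} 1 NX EX 5`:
    acquire fails if the driver is already locked, and locks lapse
    on their own after `ttl` seconds."""

    def __init__(self, ttl=5.0):
        self.ttl = ttl
        self._expiry = {}  # driver_id -> time the lock lapses

    def acquire(self, driver_id, now=None):
        now = time.monotonic() if now is None else now
        held_until = self._expiry.get(driver_id)
        if held_until is not None and held_until > now:
            return False            # NX semantics: someone else holds it
        self._expiry[driver_id] = now + self.ttl
        return True

    def release(self, driver_id):
        self._expiry.pop(driver_id, None)

def match_ride(ride_id, ranked_drivers, lock, send_offer):
    """Walk the ranked driver list; skip drivers another matcher
    instance already locked; stop at the first driver who accepts."""
    for driver_id in ranked_drivers:
        if not lock.acquire(driver_id):
            continue                # locked by a sibling instance: move on
        if send_offer(driver_id, ride_id):  # blocks ~10s for a response
            return driver_id        # accepted: ride service flips status
        lock.release(driver_id)     # declined or timed out: free them up
    return None                     # exhausted the list without a match
```

Note that if the driver never responds we never call `release`; the TTL expiry does it for us, which is exactly the behavior the cron-job approach approximated badly.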
While we don't have a match: take the next driver from the list we got from our location DB, lock that driver by writing the driver lock in our database, send them a notification asking whether they'd like to accept the ride, and then wait our 10 seconds. The driver might accept or deny; their response comes through this update path, it updates our primary database, and their status changes to in-ride. At that point "no match" becomes false, because there is a match; they're in the ride, the loop ends, and everything is good. Of course, we'll also have some way to notify the rider, by a notification, polling, or otherwise, but I'm not going to go into a tremendous amount of detail there.

Great, those are certainly the two most important deep dives for this interview, but let's keep going to round things out; I recognize the video is getting a little long as well. The next one I want to talk about is handling high-throughput surges for popular events. This one's relatively straightforward: we want to make sure we can scale dynamically to meet demand. If we had only N instances here and then all of a sudden a flood of people want rides, it might be too much for the instances we currently have, and we can't horizontally scale them quickly enough. The natural solution, let's see if I can fit this elegantly into our diagram, is to introduce a queue. So we have a ride request queue here, apologies that this is a bit tighter than I'd like. When rides come in, they sit on this queue, and our ride matching service pulls off of it in order to match drivers to riders
for a given ride. So if we have a huge surge, no problem: it all waits in our queue, and we pull work off as we're ready. Our queue is likely going to be partitioned by region, and here's why. One issue you run into with a simple FIFO queue is that a request that came in first might be much harder to match, maybe it's out in the boonies with no drivers around, and then other people get stuck behind it, not getting matched despite being in the middle of New York City with drivers everywhere. That's not optimal. What we can do is partition the queue by location, by regions; within New York City you'd have fairly fine-grained regions, probably even finer-grained than the boroughs. If you partition your queue that way, you don't have the problem of somebody sitting behind a request that's taking a long time to match. An added benefit is fault tolerance: remember these matchers are fairly expensive workers, they're waiting out those 10-second offers and can take a while per match, so any of them could go down in the minute we're trying to find a match. If that happens, no problem: the ride just gets pulled off the ride request queue again, since it was never acknowledged as succeeded, and we continue matching that user. So this works pretty nicely; the ride request queue is hopefully relatively straightforward, and I won't go into a tremendous amount of detail, but that's how we handle those big surges. Next, high availability outside of matching: this is the fairly traditional answer; we can horizontally scale each of these services, and these will have their
own load balancers. The primary database was DynamoDB, so it largely scales indefinitely, and we already talked about scaling Redis. The last thing is that we can split the system by region: realistically we're not going to have a single instance of this full setup. Maybe what you're looking at here is the Northeast region, and we'd have a copy of this entire thing out in California for the Southwest region, the same up in Seattle for the Northwest, located in different data centers. There are wrinkles, of course, like users who travel; you can go from New York to California, so every data center might hold a read replica or a copy of any given user table from the primary database. But splitting by region is an additional way to increase availability, reduce strain on the system, and make things like the queue partitioning more streamlined and obvious.

All right, that just about wraps it up. Congratulations if you made it all the way through, and apologies that this video ended up a bit longer than I intended. As I mentioned, I went into more detail than would be required in an interview, so I don't want anyone to feel overwhelmed. Just to be crystal clear: mid-level candidates, if you got that high-level design and could answer questions from your interviewer about some of the deep dives, you're doing great. Senior candidates, I'd hope you could go deep in at least a couple of these places, the location service and ride matching; even if you didn't land on the optimal answer, the distributed lock with a TTL, but landed on the cron job, or you suggested the queue with PostGIS and quad trees without knowing the geospatial index details, that's probably still very likely a hire. And staff candidates, we do have that higher bar: I'd expect you to show more depth in places, but maybe not the places I went deep; maybe it's places you understand even better than I do, and that would be great as well. So don't feel overwhelmed; I wanted this to be relatively comprehensive so you feel well equipped going into your interview.

All right, with that said, thanks everybody for watching. If you have questions, comments, concerns, or anything stupid I did that you want to call out, please put them in the comments; I'll be checking them and responding frequently. This was the second of these we've done; again, check out Ticket Master if you haven't seen that one yet, and I'll try to do another one in the coming handful of weeks. Thanks everybody for watching, and best of luck with your upcoming interviews. Take care.