Hey everyone, this is Anshul Sadariya, Software Engineer 3 at Google. In the past we have done many system design videos, like the ones for Twitter and Tinder, and it's always amazing to see how large-scale products like these work. In this video as well, I am going to cover the system design of one such large-scale product, one we use on a regular basis for our day-to-day commute. Yes, you guessed it right: we are going to cover the system design of Uber. And when I say Uber, I also mean Ola, Lyft, Careem, and all the other applications that provide us cab services. But before we dive deep into the video, make sure you check out Scaler's free masterclasses by industry-leading experts, only on Scaler's events page; the link is in the description below. Now let's get started.

Before we start the system design of any application, the first step is to understand the customer user journeys, which is how the customer interacts with that particular product or application. We will refer to customer user journeys as CUJs. In this case we have two different types of customers: a normal user, or rider, and a driver.

Let's first note down how a driver will interact with the application. They will first create their profile. Then they will mark their availability on the application: "Hey, I am available to take rides, please send me rides if there are requests in a nearby location." Next, whenever they get a notification that some user in a nearby location is asking for a cab, they have the option to accept or reject the ride. If they reject the ride, they are back to square one and the same flow repeats over and over again. If they accept it, they navigate to the pickup location of that particular user. Once they have reached the pickup location, they start the ride as soon as the rider provides the OTP or some similar authentication mechanism. The rider will have specified a pickup location and a drop location, so once the ride starts, the navigation toward the drop location begins. Once they have reached the drop location, the driver can end the ride and accept the payment. Optionally, we can also give drivers the facility to rate the rider, and the same facility can be provided to the rider to rate the driver; this is an optional requirement we can add if we want.

Now let's see the customer user journeys for a rider. Same thing: they will also create their profile. Then, based on the rider's location, the application shows the cabs available nearby; if you open the Uber application, it shows the cabs available in your vicinity based on your current location. So: show cabs nearby. Next, I am given an option to enter the start location and the end location for the cab service: add start/end location. Based on the start and end locations, I am shown the estimated time to reach the end location, and also the approximate price for the cab service. It is not the exact price, but it gives an estimate: Uber Moto at this particular price, Uber Auto at that price, and so on. Most likely the final fare will stay close to it, but depending on traffic conditions, or if the driver needs to change the route because of an obstruction, the price may increase or decrease. So: show ETA and approximate price. Once the ETA and the approximate price are shown, the application starts finding cabs for that particular person, so "book cab" is the next user journey: book a cab from the start location to the end location. Right now, with the ETA and price, we have not yet started to book the cab; we are just seeing what the prices are, what the timing is going to be, and so on. The remaining user journeys are on the driver's side: once the driver accepts the ride, the driver navigates to the user, the user sits in the cab, the driver starts the trip, the driver navigates to the end location, and then drops the user. Lastly, the user makes the payment and rates the driver; that last part is also an optional requirement, but let's add "make payment" over here.

So these are the customer user journeys on the side of both the driver and the rider. One important customer journey we have not added in the typical flow is the ability to see past trips or the payment history, so that is something we can add over here as well; let's just change the color. So: see past trips. This can be done on the side of both rider and driver. Similarly, you can see your current trip as well, because a lot of the time the driver is ten minutes away and we keep looking at the trip status. So we can look at the past trips, get the details of one particular past trip, or get the current trip status. The benefit of understanding all of these customer user journeys is that each of them gets translated into a functional requirement, the functionality the app should provide. When I talk about customer user journeys or functional requirements, I mean one and the same thing: all the customer user journeys map to some sort of functional requirement.

Now let's take a look at some non-functional requirements as well, which I have covered in detail in the Twitter system design video too. A non-functional requirement doesn't typically change the customer user journey, but it affects the user experience. For example, if it takes a lot of time for you to check the current trip status, then it's a bad user experience; so one non-functional requirement is that latency should be as minimal as possible. Now when I talk about minimal latency, I don't mean the super-low latency required by high-frequency trading firms. For some of the services I am fine with some delay: if I'm making a payment, it's fine if it takes a bit of time for the payment to go through, and if I want to check some of my past trips, it's okay if it takes some time to show me the list. But some of the customer user journeys should have as little latency as possible.

The next one is high availability. It is not possible to imagine Uber being down in a city like Bangalore, Mumbai, or Delhi, and I'm just talking about India over here; in the US and all the other countries as well, it is very, very important that services like Uber stay highly available 24/7.
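Stepping back to the driver-side journey we noted down earlier (available, accept/reject, navigate, start, end): it is essentially a small state machine. Here is a minimal sketch of it in Python; the state names, event names, and the `next_state` helper are my own labels for illustration, not anything from the video or from Uber's actual API:

```python
from enum import Enum, auto

class DriverState(Enum):
    AVAILABLE = auto()           # driver has marked availability
    RIDE_OFFERED = auto()        # a nearby ride request was pinged to them
    EN_ROUTE_TO_PICKUP = auto()  # ride accepted, navigating to the rider
    ON_TRIP = auto()             # OTP verified, ride started

# Allowed transitions; note "reject" puts the driver back to square one.
TRANSITIONS = {
    (DriverState.AVAILABLE, "ride_request"): DriverState.RIDE_OFFERED,
    (DriverState.RIDE_OFFERED, "accept"): DriverState.EN_ROUTE_TO_PICKUP,
    (DriverState.RIDE_OFFERED, "reject"): DriverState.AVAILABLE,
    (DriverState.EN_ROUTE_TO_PICKUP, "start_ride"): DriverState.ON_TRIP,
    (DriverState.ON_TRIP, "end_ride"): DriverState.AVAILABLE,
}

def next_state(state: DriverState, event: str) -> DriverState:
    """Advance the driver lifecycle; anything not in the table is illegal."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal event {event!r} in state {state}")
```

Modeling the journey this way is useful later, because several services (notifications, the cab finder, the trip service) only make sense in particular states.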
So high availability is an important non-functional requirement.

The next one is durability. By durable I mean the data should persist forever. For example, say a driver takes ten trips in a day and at the end of the day the entire trip record for the day vanishes altogether; or I made a payment to the driver and even after three to four days it still shows the payment as not done, because the payment history is gone altogether. Such an experience is not expected from an application like Uber. Be it trip history, the location data of the driver, or the payment history, the information should be durable in the database.

The next one is that it should be highly scalable. Every day there are peak hours and off-peak hours, and if the system is not able to scale up for the peaks, for example around 9 AM or 5 to 7 PM in the evening, then the user experience will be really bad. And if we want to add the service to new cities, to more tier-two and tier-three cities and so on, the application should easily be able to scale to such an extent that we can cover almost all the cities of the world.

Lastly, it should be highly consistent, or, to put it in better technical words, strongly consistent. Now you must be wondering: we want our service to be highly available as well, and we also want it to be strongly consistent, but the CAP theorem says that one service cannot be both at the same time. That is absolutely true, but here's the catch: we don't want the same features, the same set of microservices, to be both highly available and strongly consistent. For example, if there is a cab finder service, we want it to be highly available; we don't want it to go down at all. But for a trip service or a payment service, where I want to check some past payments or whether a payment has gone through, we don't need strong consistency; we are fine with eventual consistency for such microservices. It's all right if I made the payment and it takes five or ten minutes for it to get reflected in my application. On the other hand, say I have booked a cab and some driver has accepted that ride: I cannot accept the experience where other drivers are still being shown that this person is asking for a cab. In such situations strong consistency is really required. Also, if I start the trip and the driver's application still doesn't show that the trip has started, that's a really bad user experience too. So some of these features need to be strongly consistent, whereas for others we are fine with eventual consistency; the set of features that are highly available and the set that are strongly consistent are mutually exclusive and do not overlap. This is the set of non-functional requirements our application must have.

From the non-functional requirements, let's try to understand the scale at which an application like Uber works. Let's crunch some numbers so that we have an understanding of what the read QPS and write QPS can be. First of all, I am following the data published by Uber's blog and Uber's website itself, so it's public data; I'm not making the numbers up. There are almost 100 million monthly active users (100 million MAU), which means that over a duration of 30 or 31 days there are almost 100 million active users. On a daily basis you can expect around 10 million rides, so there are 10 million rides per day. All across the globe we can also expect around 1 million drivers to be active on a daily basis: since there are 10 million rides in a day and one driver typically has 5 to 10 rides in a day, roughly 1 million drivers are active on a day-to-day basis.

Now, 10 million rides in a day means how many rides are getting booked or requested per second? Let's do some computation. A day has 86,400 seconds, which I am going to round to one lakh (10^5) seconds just for the sake of computational ease. So 10^7 rides divided by 10^5 seconds is almost equal to 100 rides getting requested or booked per second. This is a huge load, because a lot goes on behind the scenes whenever a cab is booked: a radius is created around the user's location, the application tries to identify which cabs are in that area, pings are sent to all the drivers in that radius, and then a ride is booked. So 100 QPS is our operational load. We will cover how numbers like 1 million daily drivers and 100 million monthly active users influence the system design of Uber when I talk about the important tech concepts.

Now let's get started with the tech concepts that I feel are really, really crucial to understanding the overall system design of Uber, starting with the most basic one: how are the driver and the rider able to communicate with the Uber servers? I'll just create a superficial diagram of this. When I say rider or user, I mean one and the same thing, so I might use these terms interchangeably during the course of the video. Ideally, the driver and rider are going to communicate with each other only via the server, and there are two ways of communication.

The first one is the synchronous way of communication, using HTTP requests. What does an HTTP request mean? You send a request for some data and you receive a response in return; that is the HTTP protocol as we popularly know it. Let's say the rider wants to request a cab from the Uber application, so it sends an HTTP request to the server saying, "Hey, I am at this particular latitude and longitude, please send me an Uber." When I say book an Uber, I mean Uber Go, Auto, Moto, or anything at all; we are not going to differentiate between these different modes of cab service. If you want to include different types of cab services, it's a plug-and-play thing and not a big deal for the purposes of this system design video, so when I say Uber or a cab service, I mean one and the same thing. All right, so now the server knows it wants to book a cab for this particular rider at a particular location. But how does it communicate with the drivers? Does it send all the drivers an HTTP request asking for their locations, and then, based on those locations and the nearby vicinity, send the cab request suggestion to those few drivers only? This is one way we can perform the communication between the driver and the server, but as we know, it is an inefficient way of communication.

The other way of communication is using WebSocket servers. What does a WebSocket server mean? When we create a WebSocket connection, it creates an open connection between two clients. In HTTP there is a concept of client and server, whereas when we talk about WebSocket connections there is effectively a P2P, peer-to-peer connection: there is no discrimination between client and server, and both peers can send each other messages asynchronously. So HTTP is a synchronous way of communication and a WebSocket connection is a peer-to-peer, asynchronous way of communication. Using the WebSocket servers, the drivers can keep sending location pings to the server on a regular basis; it can be every four seconds, five seconds, or ten seconds, depending on what write QPS is sufficient for our particular use case.

Now let's try to understand how the WebSocket connection is created. Consider a server sitting over here and a user at the other end; when I say user here, I mean both the driver and the rider, I don't mind discriminating between them. A socket connection is created between them: whenever a driver becomes available, it sends a ping to the server saying, "Hey, I am available," and that immediately creates a WebSocket connection between the server and the client. The server records that this connection ID corresponds to this particular user ID (and again, this user can be a driver as well as a rider), so the server knows the link between a connection ID and a user ID. For the user's part, the user only knows which connection to put the message on; as soon as the message is put on that connection, it travels to the server, which identifies what the request or message is. A message can be just a location ping from some driver ("Hey, I am at this particular location"), whereas a request can be "I am accepting the ride" or "I am rejecting the ride." In this way, the WebSocket servers make communication between two different clients, two different peers, efficient. So for the use case of an Uber application, we are going to use WebSocket servers, which establish connections between the riders and the server as well as between the drivers and the server, so that messages keep transmitting between all the parties.

Now that we understand how communication in an Uber-like system behaves, let's move on to the next technical concept: how do we find the cabs near a user who is requesting a cab service? Let's create a table for the location data, based on which we are going to find the nearby cab drivers. There is going to be some sort of ID related to the driver, and we are going to store both the latitude and the longitude. In between I might shift from latitude/longitude to X/Y axes, so I really hope you keep in mind that by X and Y I also mean latitude and longitude. Assume for the time being that we have some sort of fast-retrieval database storing the latest location data of the drivers; I am going to explain in detail how we maintain it as well. We know for a fact that there are 1 million active drivers in a day, so at any moment when we go to find the nearby drivers, we can assume that estimate holds.

Let's say this rectangle is the map of the world, and the user is located at some point over here. Ideally we want to find the drivers within a nearby radius of a few kilometers, say one to five kilometers. The naive way, the most trivial way, to do this is to iterate through the entire location table and try to identify the drivers who are near me. Another important thing to keep in mind is that we have a QPS of 100 for cab search, so if we traverse the entire table of 1 million drivers at 100 QPS, it is not going to be efficient. One way to get some improvement in performance is to create an index on latitude or longitude; we obviously cannot create a primary index on both dimensions. So let's say we create a primary index on latitude, and call the user's latitude and longitude x and y: around the latitude x, we want the range x-2 to x+2, and that is the range in which we look for drivers. Even after optimizing on one index, this can still be very inefficient, because in some cases there can be a lot of drivers within a particular range of latitude (or longitude). Another way to improve the efficiency further is to create two different tables, one with a primary index on latitude and one with a primary index on longitude. Based on those primary indices we find all the drivers in the latitude range and all the drivers in the longitude range: a pool of drivers belonging to the latitude range and a pool belonging to the longitude range. After finding those drivers, we need to find the intersection between both sets. But with a QPS as high as 100, this too is going to be very, very inefficient. So we need a way to convert this two-dimensional data into one-dimensional data. I covered some of these topics when we discussed the system design video for Tinder, so do go and take a look; here I am going to mention some of the ways in which we can convert the two-dimensional data into one-dimensional data.
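To make the cost of that naive full-table scan concrete, here is a minimal sketch in Python. The driver IDs, coordinates, table layout, and function names are all invented for illustration; a real system would hold millions of rows, which is exactly why this O(N)-per-query approach breaks down at 100 QPS:

```python
import math

# Hypothetical in-memory stand-in for the drivers' latest-location table.
location_table = {
    "d1": (12.9716, 77.5946),   # driver_id -> (lat, lng), near the user
    "d2": (12.9750, 77.6000),   # also nearby
    "d3": (13.0827, 80.2707),   # a different city, ~290 km away
}

def haversine_km(a, b):
    """Great-circle ("aerial") distance in km between two (lat, lng) points."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearby_drivers(user_pos, radius_km):
    """Naive approach: scan every row. O(N) per search; with ~1M drivers
    and 100 searches/second, this is the bottleneck described above."""
    return [driver for driver, pos in location_table.items()
            if haversine_km(user_pos, pos) <= radius_km]
```

For example, `nearby_drivers((12.9720, 77.5950), 5)` returns only `d1` and `d2`. The per-query work scales with the whole table, not with the number of drivers actually nearby, which is the problem the following 1-D encodings solve.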
The first such way is by using a geohash. I am not going to cover in detail what a geohash means, but basically the entire world map is divided into chunks, and each of these chunks corresponds to a particular geohash. Another way to do this is using quadtrees, which is also something I talked about in Tinder's system design video. If you want me to cover this in detail, I can also prepare a system design video on Google Maps; add it in the comment section if you want one, and there I can cover all of these topics in much more detail.

Now, what Uber actually uses in its backend for the map service, in order to identify where a particular driver is, to find the ETA between two different points, and also for navigation purposes, is the Google S2 library. What the S2 library does is somewhat similar to what geohashing and quadtrees do: it divides the entire spherical globe, which is a sort of three-dimensional data, into chunks. Over here, just for the sake of clarity and ease of explanation, I have created a rectangular map and divided the entire map into smaller chunks. Each of these chunks is now some form of one-dimensional data: we can simply name and order the chunks, for example this chunk's ID can be one, then two, three, four, five, six, and so on. In this way we have converted the two-dimensional data that was causing us an issue during the driver fetch into one-dimensional data. Now, whenever a user is pinging from a particular location, we know the latitude and longitude of the user, and every latitude/longitude pair is going to be part of some particular chunk; let's say it's part of chunk five.
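The real S2 decomposition is more sophisticated (it projects the sphere onto a cube and orders cells along a Hilbert curve), but the core idea of mapping a 2-D point to a single 1-D chunk ID can be sketched with a plain fixed-size grid. The 0.05-degree cell size and the function names here are assumptions for illustration only:

```python
CELL_DEG = 0.05  # assumed cell size: ~5.5 km of latitude per cell

def cell_id(lat, lng, cell_deg=CELL_DEG):
    """Map a (lat, lng) pair to a single integer chunk ID."""
    row = int((lat + 90) // cell_deg)    # shift so everything is non-negative
    col = int((lng + 180) // cell_deg)
    cols = round(360 / cell_deg)         # cells per row of the grid
    return row * cols + col              # flatten the 2-D grid into a 1-D ID

def neighbor_cells(lat, lng, ring=1, cell_deg=CELL_DEG):
    """Chunk IDs in the (2*ring+1) x (2*ring+1) block around a point:
    this is the reduced search space for the nearby-driver lookup."""
    return {cell_id(lat + dr * cell_deg, lng + dc * cell_deg, cell_deg)
            for dr in range(-ring, ring + 1)
            for dc in range(-ring, ring + 1)}
```

Two points a few hundred meters apart land in the same chunk, so "find drivers near me" becomes "look up the driver sets for these nine chunk IDs" instead of scanning every driver.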
okay so now we ideally want to find the drivers who are in that particular chunk or in the chunks nearby it so now this is the responsibility of the of the map service that hey given this latitude and longitude let me know what chunk does the user belong to now this user can be both Rider and the driver so when we put in the latitude and longitude to the map service it tells us that okay this particular user belongs to chunk five and then what the map service does is based on a user's location it will create a radius around the user's location it can be a two kilometer radius which will have a four kilometer diameter or any XYZ particular radius okay now based on this radius it will cover a set of some chunks we will get what all chunks are part of the subset in which we want to find the drivers from so over here we can see that chunk ID is one to nine is the search space wherein we want to find the Riders wherein we want to find the drivers my bad so ideally we will also have a service which will store that in a particular chunk C what all drivers are there so we can directly communicate with those drivers and send them a ping that hey a user is requesting for this service do you want to accept the ride or not so in this way we just find what like what chunk does the user belong to and based on the nearby chunks we send the Ping to the drivers who are in the vicinity of the user now what we have done is we have isolated the maps micro service which is responsible for converting the pair of latitude and longitude or the X Y Dimensions into the chunk ID so let's keep making a note of what each of the micro services are doing so map service it is going to convert a pair of latitude and longitude to chunk ID now this map service itself has dissected the entire map into different chunks So based on the chunk ID of the user it can identify what all different chunks are there nearby and we talked about some service let's keep it a question mark service over here we will 
name it ahead like I will talk about the service in detail in the coming section so now what this does is it has a mapping from chunk ID to the set of drivers who belong to that particular chunk now ideally using that map service what we have done is we have reduced the search space of the drivers to whom we want to send the pings that hey there is a user do you want to accept the ride for that particular user or not so now let's try and understand what all different things the map service can do first of all the ETA computation let me also create the map of a city to explain it in a better way now this is just a map of some City I am going to divide the map into smaller chunks which is what the map service or the Google S2 service is going to do now these are the chunks let's say the user belongs to this location and Google S2 service Google S2 library is going to create a radius around it to find what all chunks the other drivers are in now let's say we found some drivers in the nearby chunks there is a driver located over here here here and here okay we found out four drivers in the nearby chunks based on the radius that we set and the user's location now we want to find the ETA that okay given a start point and an end point what is the ETA for first of all what is the ETF or the driver to reach to the pickup location and ultimately what is going to be the ETA to reach the destination so now based on the chunk IDs if you remember we also had a location table which contained the latest locations of each of the drivers so now if we have isolated four driver IDs or four user IDs who are in the probability radius then it is easy to get the exact locations exact latest locations of each of these four drivers now based on those locations we have two choices with us first of all is just to compute the distance which is the euclidean distance based on a set of two different points but that might not be the ideal distance because we know that for any Inner City between 
point and point B we follow a road like structure so the distance is if this distance if the aerial distance is one kilometer then the actual distance might be something else all together so this can be a three kilometer roll distance so we cannot rely on the radial distance we cannot rely on the aerial distance as well what we'll do is we will use the map service which is the Google Maps service or any map service which is feasible for the particular application and then based on that map service we'll just let it know that okay this is the driver location and this is the user location we have isolated some drivers around and we have the locations let us know what the ETA is for the driver to reach the user now based on that we'll just compute what the distance might be for each of the drivers and what the ETA will be for the driver to pick the user up now that is added to the ETA for the pickup location to the drop location so in this way we can compute the ETA now this ETA or ultimately what we are doing is we are finding the distance probabilistic distance and the time taken to cover the distance now both of these things can be passed on to the payment service which will then calculate what can be the estimated price for that particular ride so in this way in collaboration with the map service we will get the locations using the location service we will use the payment Service as well to compute the ETA and also to compute the estimated or approximate price now let's see for example if a driver is moving the driver will keep on sending the location pings to the server so how is it going to affect the flow for example the driver is situated in chunk one for example all right now let's say the driver is moving in the right direction so at some point the driver is going to change the chunk idea he or she is going to change the Chunk in the map okay so let's say this chunk is chunk two now on a regular basis the updated location pings of the driver are sent to the 
server and what the server will do is that it will use the map service and inside the map service the Google S2 Library will let us know that okay this is the latest location and as per the latest location either the chunk ID has changed or it has not changed and if the chunk ID has changed let's say for example it was one earlier and it changed to two if you remember we had a question mark service right so that question mark service it will have that okay earlier the chunk ID had driver D1 but now since the chunk has changed we will get rid of the driver one from chunk id1 and then we will move the driver to chunk ID now let's say the chunk ID earlier was one and then based on the location change we got to know from the map service that okay now based on the late test latitude and longitude pair the chunk ID is two so ideally if you remember there was a question mark service okay let me name it right now it can be the dispatch service or the cab finder service let's call it the cap finder service because it keeps the note of what all drivers are belonging to a particular chunk ID so earlier driver D1 belong to chunk id1 and as soon as the location changed based on the information we retrieved from the map service we will remove D1 from chunk id1 and then it will add it to chunk id2 now we will understand the cap finder service in a lot more detail but this is what I wanted to explain that okay this is how the user or the driver keeps on sending the location pings to the server based on the location pings that we are getting from the driver the map service will identify that okay this is the chunk ID which it belongs to so now let's start categorizing what all different Services we have discussed till now first one is the map service the map service is responsible for converting the lat long to the chunks it belongs to it also performs the calculations from point A to point B that okay this can be the potential route this can be the approximate time taken to reach 
the destination this is the approximate price which will be required for the cap service so these all things are done by the map service now we kept on talking about receiving the location pings so let's try and cover the location service in detail as well so the location service is responsible for all the location data related to the user as well as the driver like user I mean the rider so let's say there is a driver and let's like let's continue this discussion only for the driver because the driver is continuously going to send the location pings to the server and what the server will do what the location service will do is that it will store the latest data into the table now we will have a table as we discussed earlier which will contain the driver ID or the user ID I have to also also contain the let and long of the particular driver along with this we also want the entire history of the location pings that the driver or the user has been sending to us why do we need it we potentially might be needing it because we want to perform some machine learning thing or we want to perform some analysis on top of it let's say a trip has ended and the rider has challenged that okay the driver took a bad route and that is the reason why the price has increased and I don't accept this particular price or this particular fare so what the application will do in the back end is that it will check the entire location history and identify if this was the optimal route suggested by the map service or not and then based on it it will retrace the entire trip and identify if there was any Miss happening which happened during the particular trip so there are a lot of use cases and for the safety we are going to store the entire location data for both the users and the driver now naturally if we are going to store the latitude longitude and the user ID let's do some calculations to see how much data is it going to store so for the user ID it is going to be 8 bytes 64-bit now for 
the latitude and longitude, let us assume 8 bytes each as well. So one row for one driver consumes 24 bytes — or, for computational ease, let's call it 25 bytes. Now, there are 1 million drivers active every day, so the entire capacity of the table is going to be 25 × 1 million bytes, which is 25,000 kilobytes, or 25 megabytes. 25 MB is quite small compared to the resource capacities we have nowadays, so this entire table can be stored in a Redis cluster as a cache, so that retrieval is fast. The entire location history of a user, on the other hand, is going to consume a lot more space, and that data grows over time, so it does not make sense to store it in a SQL database; it goes into a NoSQL cluster — a Cassandra database. So we have two things: the location service holds the latest location data, and we also store the historical location data. The latest data is kept in the Redis cluster for quick access whenever we need it — for example, the map service may need quick access to the latest locations of certain drivers, so the location service will transmit that data to the map service, or to any other service which requires location data. The location service is the source of truth: for any user, other services ask it, "here is the user ID — can you share the latest location, or the historical locations, for this user?" The history lives in Cassandra. I hope it made sense how the table size motivated us to use a Redis cluster and how that gives us faster access to the location data. Now let us beautify this entire system design diagram for ease of understanding, and then we will talk in detail about the cab finder service. We started from the map service and went down to how we are receiving locations in the location service.
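As a rough sketch of the two-tier storage just described — the latest location per driver in a fast in-memory store (standing in for the Redis cluster) and the full ping history in an append-only store (standing in for Cassandra) — with all names being my own illustrative assumptions:

```python
import time

class LocationService:
    def __init__(self):
        self.latest = {}    # driver_id -> (lat, lng)   -- the "Redis" tier
        self.history = []   # append-only ping log       -- the "Cassandra" tier

    def record_ping(self, driver_id, lat, lng, ts=None):
        # Overwrite the latest position; always append to the history.
        self.latest[driver_id] = (lat, lng)
        self.history.append((driver_id, lat, lng, ts if ts is not None else time.time()))

    def latest_location(self, driver_id):
        return self.latest.get(driver_id)

    def location_history(self, driver_id):
        return [p for p in self.history if p[0] == driver_id]

svc = LocationService()
svc.record_ping("D1", 12.9716, 77.5946, ts=1)
svc.record_ping("D1", 12.9720, 77.5950, ts=2)
print(svc.latest_location("D1"))          # (12.972, 77.595)

# Back-of-the-envelope from the text: 8 (ID) + 8 + 8 (lat/long) = 24 ≈ 25 bytes
# per driver, times 1 million active drivers = 25 MB for the latest-location table.
print(25 * 1_000_000 / 1_000_000, "MB")   # 25.0 MB
```

In a real deployment the `latest` dict would be a Redis hash keyed by driver ID and `history` a time-partitioned Cassandra table; the sketch only shows the read/write split.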
Now let's take one step back and understand, when a user requests a cab, what the entire flow looks like. First of all, let me make a better diagram. We know that the user communicates with the server through a WebSocket, so this is a WebSocket server, and in front of the WebSocket servers we have a layer of load balancing. Why do we need load balancers? Because, as we noted, there are 100 million monthly active users, and with this many users we receive a heavy stream of traffic, so ideally we want to balance that traffic among the WebSocket servers. It's not just one WebSocket server — there is a WebSocket manager who handles all the traffic. The WebSocket manager knows that a particular connection ID belongs to a particular user ID, and that this connection is catered to by a particular WebSocket server. So we have a distributed layer of WebSocket servers, and behind this layer we have a routing service. Based on each request, the routing service distributes the traffic between the different services: constant pings of location data go to the location service, a user requesting a cab goes to the cab finder service, and so on. We have talked about the map service and the location service, so now let's discuss the cab finder service in detail. What the cab finder service does is match the demand created by the users with the supply of drivers in that particular radius, or among different chunks — it tries to match supply with demand. Some people might call it a dispatch optimizer, a dispatch service, or a supply-demand service; I am going to use the simpler term and call it the cab finder service. Now let's go into detail and
understand what the cab finder service is and how it caters to a user's request for a cab. Whenever a request comes in, it looks something like this: "I have this user ID, this is my latitude, this is my longitude — go and perform the magic." The cab finder service calls the map service to identify which chunk this location belongs to. The cab finder service itself looks something like a hash ring — I will explain in detail what the hashing means — but to put it in simpler words, it is a distributed service with multiple servers, where each server is responsible for a set of chunk IDs. For example, say we have server 1, server 2, server 3, 4, and 5. In reality the number of chunks across the globe is going to be in the range of millions, if not more, but for the sake of clarity let's assume there are just 10 chunks, so that I can explain what the cab finder service looks like. Each of these servers is associated with a set of chunk IDs: server 1 might be responsible for chunks 1 and 2, and the others for 3-4, 5-6, 7-8, and 9-10. So we have assigned a responsibility to each of these servers, and there is also a server manager who takes control of which request is sent to which server. This particular service leverages the benefits of consistent hashing. What does that buy us? Say we want to scale the service up because the number of chunk IDs increased — for example, say this is chunk ID 5, and all of a sudden, around 5 PM to 7 PM, the traffic inside this chunk spikes. What the Google S2 service will do is break the chunk into multiple smaller chunks, so that the traffic inside each chunk stays in a similar range and we don't have a heavy skew in any
of the chunk IDs. So chunk ID 5 gets converted into something like 11, 12, 13, and 14 — I'm just giving them random numbers, but the point is that chunk ID 5 went away and 11, 12, 13, and 14 came in (or we leave chunk ID 5 where it is and just add the others). The number of chunks increased, so the responsibility of the servers also grows. Earlier it was server 3's responsibility to cater to the traffic belonging to chunk ID 5; after the split, the new chunks might be allocated to any of the servers — we might assign the responsibility for 11, 12, and 13 in any particular order. What consistent hashing gives us is that whenever the scale increases it is easy to add servers, and whenever the scale decreases we can simply remove some servers, because a bunch of chunk IDs might get merged into one. A hash ring has the structure of a ring — that is why we call it a hash ring — with some nodes placed on it: 1, 2, 11; then 3, 4, 12; then 5, 6, 13; then 7, 8; then 9, 10. The whole concept of the hash ring is that whatever chunks sit just before a particular server on the ring are catered to by that server. For example, between S5 and S1 there are three chunk IDs, so it is S1's responsibility to cater the traffic in those chunks; likewise, 3, 4, and 12 sit between S1 and S2, so S2 is responsible for them. Now say we want to shed some capacity and remove one of the servers — say we remove S5. What happens? The hash ring remains as it is, and S1, which earlier had only 1, 2, and 11, will now also cater the traffic for 9 and 10.
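This behaviour falls out naturally from a consistent-hash ring. Here is a minimal sketch (using MD5 as the position hash is my assumption — any uniform hash works): each chunk ID is owned by the first server clockwise from its position, and removing a server only hands that server's chunks to its clockwise neighbour:

```python
import bisect
import hashlib

def _hash(key):
    # Position of a key on the ring.
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers):
        self.ring = sorted((_hash(s), s) for s in servers)

    def owner(self, chunk_id):
        # First server clockwise from the chunk's position (wrapping around).
        positions = [h for h, _ in self.ring]
        i = bisect.bisect_right(positions, _hash(chunk_id)) % len(self.ring)
        return self.ring[i][1]

    def remove(self, server):
        self.ring = [(h, s) for h, s in self.ring if s != server]

ring = HashRing(["S1", "S2", "S3", "S4", "S5"])
before = {c: ring.owner(c) for c in range(1, 11)}
ring.remove("S5")
after = {c: ring.owner(c) for c in range(1, 11)}
# Only the chunks S5 used to own move (to its clockwise neighbour);
# every other chunk keeps the same server.
moved = [c for c in before if before[c] != after[c]]
assert all(before[c] == "S5" for c in moved)
```

Production rings usually add virtual nodes (several positions per server) to even out the load; the sketch omits that for clarity.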
So just the removal of a particular server doesn't require a lot of changes — it obviously doesn't affect S4, S3, and S2; they remain as they were — and the server manager we talked about earlier is notified that S1 is now also catering the traffic for 9 and 10. There is another protocol used by this matching algorithm, or cab finder service, known as the gossip protocol. Whenever the responsibilities of any server change, it gossips to the other servers: "I have now got the responsibility for chunk IDs 9 and 10 as well, so if there are any requests for those chunk IDs, assign them to me." Let me get rid of some of the clutter here and take an example: say we receive a request from a user belonging to chunk ID 6. The first thing the cab finder service does is identify which server is catering to that chunk ID — here it identifies that S3 is catering to chunk ID 6. Then it communicates with the map service to identify which chunks are near chunk 6. In our case I'm not sure exactly which chunks those are, so let's just call them 3, 7, and 12. Through the gossip protocol, S3 knows that 3 belongs to S2, 7 belongs to S4, and 12 also belongs to S2, so it sends requests to both S2 and S4: "I am server 3, and I want to know which drivers belong to 3 and 12."
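The fan-out just described — grouping the nearby chunks by the server that owns them (known via gossip) and sending one request per peer — might look like this sketch, with the ownership table and driver lists hard-coded to match the example:

```python
from collections import defaultdict

# Ownership learnt via gossip: chunk ID -> owning server (matching the example).
chunk_owner = {3: "S2", 7: "S4", 12: "S2"}
# Each server's local view: chunk ID -> drivers currently inside it (made up).
drivers_in_chunk = {3: ["D3"], 7: ["D7a", "D7b"], 12: ["D12"]}

def find_nearby_drivers(nearby_chunks):
    # Group the chunks by owning server so we send one request per peer server.
    requests = defaultdict(list)
    for chunk in nearby_chunks:
        requests[chunk_owner[chunk]].append(chunk)
    # "Send" each request and merge the driver lists that come back.
    drivers = []
    for server, chunks in sorted(requests.items()):
        for chunk in chunks:
            drivers.extend(drivers_in_chunk.get(chunk, []))
    return dict(requests), drivers

requests, drivers = find_nearby_drivers([3, 7, 12])
print(requests)   # {'S2': [3, 12], 'S4': [7]}
print(drivers)    # ['D3', 'D12', 'D7a', 'D7b']
```

Note that S2 receives a single request covering both 3 and 12 rather than two separate ones — batching per peer is the point of grouping first.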
So S2 gets a ping asking for the list of drivers who belong to chunks 3 and 12, and S4 gets a ping asking for the drivers belonging to chunk 7. Then, based on all the drivers returned, we send pings to each of those drivers: "there is a request for a cab service — do you want to accept it or not?" In this way the cab finder service finds the drivers in the vicinity and sends them pings through the WebSocket manager. Let's move on to the next concepts — I hope it is clear what the Google Maps or map service does, what the location service does, and now what the cab finder service does. We have covered three major services which are going to be part of our Uber application: the map service, the location service, and the cab finder service. Now let's check the functional requirements to see which ones are yet to be implemented. One important piece that is missing is the trip service — we haven't yet stored the information related to a trip: when it is accepted, when it starts, and when it ends. We also haven't looked at the payment service. So let's make a note of what is remaining. "Create profile" is remaining for both riders and drivers; I think that is self-explanatory — we will have a profile service which exposes some CRUD APIs like create profile, update profile, get profile, and delete profile, so I'm not going to explain it in detail. "Show availability" for the driver goes through the WebSocket server: whenever the driver goes live, the app starts sending pings to the location service. "Show cabs nearby" is done; start and end location, show ETA, book a cab — those are covered too. The payment service is remaining. Rating the driver can be done through the profile service itself, so for rating we will
expose a separate API, but we are not going to have a whole rating service just to rate the driver or the user — so "rate the driver" is also done. What remains is the trip service and possibly a payment service as well. All the navigation for the driver is going to be handled by the map service, and "start ride" and "end ride" are handled by the trip service. So let's go and understand what the trip service and the payment service look like. This page is empty, so we can make use of it. Whenever a trip is accepted, the trip service comes in — first we have the entire chunk of WebSocket servers, a load balancer, the routing service, and so on, and ultimately the trip service. Whenever a ride is accepted by the driver, a new trip is created and added to the trips database or trips table — we'll discuss how exactly to store it, but let's look back at the numbers first. We had 10 million rides in a day, and there are 24 hours in a day, so roughly 100 rides were getting created every second — about 100 QPS. We can divide the trip service's database into two parts. One part stores all the inactive trips — they can be rejected trips as well as completed trips. That database keeps increasing with time; it is never going to stay the same or saturate down the line. So, as we discussed for the location service, we want that data in a Cassandra or NoSQL cluster, because it grows on and on as time passes. We'll have a Cassandra cluster storing the data for all the inactive trips. For the active trips, let's calculate how many there can be at any point in time: we have 100 QPS, and we can assume that one trip lasts for one hour
approximately — or say 100 minutes, just to make the calculation quicker. So 100 QPS times 100 minutes, with 60 seconds in each minute, puts the number of active trips in the range of 6 × 10⁵. This data can be stored in a SQL cluster, because it has a high probability of being actively retrieved by users, and we want the read latency for that information to be as low as possible. The Cassandra store will mostly be hit when we ask for past trips or payment information — that part is covered by the payment service — but we keep a live corpus in a SQL database storing the information for all the live trips. We can also have a trip archiver — a cron job. Every six hours, or every X hours, the archiver runs and transfers the inactive trips from the SQL database to the Cassandra cluster. In this way the entire trip service can be configured. Coming to the payment service, it is quite self-explanatory — not as complicated as the location service, the map service, or the cab finder service. Whenever a trip ends — which the trip service notifies — it sends a notification to the payment service to calculate the amount for that particular ride based on the distance covered. Now, the trip service will have the entire stream of location pings from both the user and the driver. Why do we need both when they're sitting in the same cab? Because in some cases there can be fraudulent claims raised by the user against the driver — say, that the driver dropped the user midway. We can handle all of those challenges and complaints by having the entire history of location pings, for both the user and the driver, during the course of a trip. So the trip service data is going to be quite heavy, because we are sending location pings every few seconds from both of them.
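A sketch of that trip archiver cron job, with plain dicts and lists standing in for the SQL and Cassandra stores (the status names are assumptions, taken from the statuses discussed in this design):

```python
# Trips in these statuses stay in the live "SQL" store; everything else is moved.
ACTIVE_STATUSES = {"accepted", "started"}

def archive_inactive_trips(live_trips, archive):
    """Move ended/rejected trips out of the live store; return how many moved."""
    moved = 0
    for trip_id in list(live_trips):          # list() so we can pop while iterating
        if live_trips[trip_id]["status"] not in ACTIVE_STATUSES:
            archive.append(live_trips.pop(trip_id))
            moved += 1
    return moved

live = {
    "t1": {"id": "t1", "status": "started"},
    "t2": {"id": "t2", "status": "ended"},
    "t3": {"id": "t3", "status": "rejected"},
}
cassandra = []
archive_inactive_trips(live, cassandra)
print(sorted(live))     # ['t1']  -- only the active trip remains in "SQL"
print(len(cassandra))   # 2      -- two trips moved to "Cassandra"
```

In a real system the move would be a batched read-insert-delete with idempotency (so a crashed run can be retried safely); the sketch only shows the selection logic.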
The trip service will have the location data, the start and end locations, the start and end times, and all the information necessary for a particular trip. We'll also have a trip status — accepted, started, and ended can be the three statuses. Whenever the trip ends, it sends a ping to the payment service: "this trip ID has ended — based on the location data, calculate the kilometres travelled; based on the start and end times, calculate the time taken; and from all of this information work out the price of the ride." So the payment service is also almost trivial — not as trivial as the profile service, but trivial nonetheless — and we will configure payments through it. Looking at the entire diagram now, we have added three more services, if I'm not wrong: the trip service; the payment service, which will also be connected to a payment gateway — it can be Stripe, Razorpay, or any other payment gateway feasible for the application; and lastly the profile service. Now let us try to cover some ad hoc technical concepts. For example, whenever you are in a mall, you must have seen this: instead of an arbitrary pickup location, there is a preferred access point. Say this is a campus — let me take the example of my hostel itself. I was at IIIT Hyderabad, and say this is the map: there were multiple access points for the cab drivers — the boys' hostel over here, the girls' hostel somewhere there, the main entry gate, and the academic block. At each of these four locations, preferred access points were given. How does Uber compute them? If you notice, whenever you start a ride or end a ride, there is a location ping associated with it.
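Before moving on, the fare computation just described — kilometres from consecutive location pings, minutes from the start and end timestamps — could look like the sketch below. The rate constants are invented for illustration, and the haversine distance stands in for whatever road-distance calculation the map service would really provide:

```python
import math

BASE_FARE = 50.0     # flat fee (illustrative)
PER_KM = 12.0        # per-kilometre rate (illustrative)
PER_MINUTE = 1.5     # per-minute rate (illustrative)

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lng) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def compute_fare(pings, start_ts, end_ts):
    # Distance = sum of leg distances between consecutive pings.
    km = sum(haversine_km(pings[i], pings[i + 1]) for i in range(len(pings) - 1))
    minutes = (end_ts - start_ts) / 60
    return round(BASE_FARE + PER_KM * km + PER_MINUTE * minutes, 2)

# A short trip: two pings roughly 1.1 km apart, 10 minutes long.
fare = compute_fare([(12.9716, 77.5946), (12.9816, 77.5946)], 0, 600)
print(fare)
```

Summing over the actual ping trail (rather than the straight line from start to end) is also what lets the service re-check a disputed route later.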
Based on the aggregation of these start and end pings over a historical time period — it can be a few days, it can be a few months — we will realize that these are a few of the preferred access points inside the campus boundary, and then proper naming is done for all of them: this is the entry gate, this is the academic block pickup, this is the boys' hostel, this is the girls' hostel, and so on. If you remember, we stored all the location data and wondered what different things we could do with it — this is one of them: we can perform some machine-learning-style aggregation on top of it. We take all the data present in the Cassandra clusters — it can be the trip data, the location data, and so on — pass it into Kafka queues, and have it consumed by different services. It can be consumed by a machine-learning fraud service (I will talk about that too); it can be consumed by a Spark/Storm service, which performs analytics on top of the data — the preferred access point computation is one such job; and it can be consumed by the map service as well — for example, if the traffic is increasing or decreasing, it can lead to surge pricing. All of these niche concepts of a cab service application can be built on top of the data we store across our different services — and that is how preferred access points are covered. Another example: when we are finding the potential drivers to whom we will send pings to book a cab, there is one more possibility — let me first show you what I mean. Here's the user, denoted by the red cross, and we have some drivers who are waiting for ride requests. There can be one such driver who is going to end their current trip at a location near the user.
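Circling back for a moment to the preferred access point computation: the aggregation could be as simple as snapping every historical start/end ping to a coarse grid cell and keeping the busiest cells. A real pipeline would run proper clustering as a Spark batch job; this sketch only shows the idea, and the cell size and threshold are assumptions:

```python
from collections import Counter

def preferred_access_points(pings, cell=0.001, min_count=3):
    # Snap each ping to a grid cell of ~100 m (0.001 degrees of latitude)
    # and count how many ride starts/ends fall into each cell.
    cells = Counter((round(lat / cell) * cell, round(lng / cell) * cell)
                    for lat, lng in pings)
    # Keep only the cells that see enough pings to be a real access point.
    return {c for c, n in cells.items() if n >= min_count}

# Five rides started near the main gate, only two near another campus spot.
history = [(17.4451, 78.3489)] * 5 + [(17.4460, 78.3500)] * 2
print(len(preferred_access_points(history)))   # 1
```

The surviving cells would then get human-readable names ("main gate", "academic block pickup") before being shown in the app.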
Ideally, we would want this driver, who is about to complete his or her trip at a nearby location, to also be considered among the potential candidates we send pings to. All of this is done on top of machine learning and the data we store with the trip service: the trip service knows that the trip with this particular trip ID and driver ID is about to end, so we can also send a ping to this driver — "your trip is about to end; do you want to accept a new ride or not?" So there is a lot going on behind the scenes when we are searching for drivers to book a cab, when we are routing through a particular city, and so on, and a lot of things are done in batch mode after the trip has ended as well — the preferred access point computation was one such thing. I also talked about a machine-learning fraud mechanism. For example, a lot of drivers do this: say there is an incentive that if a driver completes 10 trips, he or she gets a thousand rupees extra. If the driver has completed only nine trips, they can just book a ride from another mobile phone of theirs and complete that ride themselves, just to get the additional incentive. All of the payments data and all of the location data we hold is aggregated and then used to detect any fraud if possible. So now we have covered a lot of topics and technical concepts which are important for the overall understanding of Uber's system design; let's just go through them once again. These are the six services we talked about: map service, location service, cab finder service, profile service, payment service, and trip service. We know the responsibilities of each of these services, so let's also go through the APIs each of them will expose. I didn't cover these APIs in detail because they felt a bit trivial after I explained all of those technical
concepts and the responsibilities of each of these services, so I think you can just pause the video at this moment to have a look at the different APIs. One important thing I want to highlight for this API part: say I want to delete a profile. Here I am just sending the user ID of the profile I want to delete — and, quite intuitively, I should not be able to delete the profile of some other user. So how does the service know that the user ID I mentioned and my own user ID belong to the same person, the same profile? For all of these services, there is some additional context we pass along with each request, and that can be in the form of a user token or some sort of API key. Both drivers and riders have this user token in order to authenticate themselves: if you want to fetch a trip with a particular user ID and trip ID, that user ID should belong to you. I deliberately left this part out of the API slide so that you also get an idea of the authentication mechanism. Whenever a user logs into the application, a token or API key is created for them; an API key can persist for a longer time, whereas a user token gets generated every time the application is opened. Next, this is the entity relationship diagram — the tables we can store in our database. We already talked about the location table, which persists all the ping information. We have the trips table, as discussed, with the start and end locations and the status, and a location ID associated with the locations table. It's a repeated entity, you know — every four or five seconds some location ping comes in, and we keep adding that location data against the trip in the trips table.
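That token check could be sketched as follows — an opaque token issued at login and mapped to a user ID server-side. The scheme here is my assumption for illustration; real systems often use signed tokens such as JWTs instead of a server-side map:

```python
import secrets

class AuthService:
    def __init__(self):
        self._token_to_user = {}   # opaque token -> user ID

    def login(self, user_id):
        """Issue a fresh opaque token for this user (e.g. on app open)."""
        token = secrets.token_hex(16)
        self._token_to_user[token] = user_id
        return token

    def authorize(self, token, claimed_user_id):
        """True only if the token really belongs to the claimed user ID."""
        return self._token_to_user.get(token) == claimed_user_id

auth = AuthService()
token = auth.login("user-42")
print(auth.authorize(token, "user-42"))   # True  -- may delete own profile
print(auth.authorize(token, "user-99"))   # False -- cannot touch another user
```

Every service behind the routing layer would run this check before acting on the user ID in the request body, which is exactly the delete-profile scenario above.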
Now, we have covered all the different services; we covered fraud detection, maps, machine learning, and driver profiling as well — based on the fraud detection results and the ratings the users have given to the driver, the driver's profile standing can change. All of this data can be transmitted live through the Kafka queues, or passed asynchronously in batch mode. Along with this, just for the sake of completeness: all of these services emit some logs — they can be error logs or just debugging logs. All of these logs are consumed by a Kafka queue, and then on top of the ELK stack — Elasticsearch, Logstash, Kibana — we can perform log analysis and create visualizations as well: what was the availability, what went wrong at a particular time, and lots of important graphs for monitoring and alerting can be created on top of it. I think this is a good point to end the video — we have covered a lot of services, we have talked about a lot of technical concepts, and we have also touched on some machine learning concepts which can be applied on top of the information transmitted between the different services. I really hope you enjoyed watching this video, where we talked about some of the most important technical concepts behind an Uber-like large-scale application. If you found this content insightful, make sure you give it a thumbs up, add your opinions and queries in the comment section below, and lastly, don't forget to subscribe to Scaler's YouTube channel and hit the bell icon so that you never miss out on insightful content. Bye!