Transcript for:
Overview of Twitter's System Design

Hello everyone, my name is Naren, and in this session let's talk about system design for Twitter. I'm not going to talk much about which technologies Twitter specifically uses; instead I'm going to talk about the core concepts: how to generate the user timeline, how to generate the home timeline, how to calculate the trending hashtags for different locations around the world, and how the data flows through the system.

What are the different features we need to support in this system design? The first one is tweeting: a user should be able to tweet to their followers as fast as possible. Sometimes a particular user will have millions of followers, and they should still be able to send the tweet to all of their followers within a few seconds.

The second feature is timelines. Basically everything you see in Twitter is called a timeline, and there are different types of timelines. First is the home timeline, which contains the tweets from all the people or pages you are following; whenever you open Twitter, by default you land on the home timeline. Next is the user timeline, which is specific to one user's profile: all the tweets and retweets that user has made. Whenever you open your profile you see all the tweets you made, and that is called the user timeline. There is also the search timeline, in which you can search for some keyword on Twitter and then see all the tweets related to that particular keyword. And the next feature is trends: you should also be able to see all the trending hashtags or topics, to know what's going on in your location or around the world.

Before architecting any system, you need to understand the characteristics of that particular application. Twitter has about 300 million plus users as of today, which means we should expect a lot of reads and writes. Twitter gets about 600 tweets per second of writes, meaning there are people tweeting about 600 tweets every second, and it also gets about 600,000 read queries per second from users. If you look at the ratio of writes to reads, it is almost 1:1000; for every tweet generated, there are on average about a thousand people reading that particular tweet. This tells us that Twitter is read heavy, which means we should build a system that is very good at reading tweets and can support about 600,000 queries per second.

Also, Twitter is not like a banking application where you need strict data consistency; it's okay for tweets to be eventually consistent. Say for example I tweet: it's not a big deal if my followers see that particular tweet delayed by one or two seconds. So it's okay to have eventual consistency. And if we talk about storage, Twitter as of today has a limit of about 140 characters per tweet, which is very little data, and data storage costs are a lot lower these days, so we don't need to worry much about storage cost. We just need to worry about how we scale this storage and how we cope with this much read traffic from the users.
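To make those numbers concrete, here is a quick back-of-envelope calculation in Python using the rough figures quoted above; the per-tweet metadata size is my own illustrative assumption, not a Twitter number:

```python
# Back-of-envelope estimation with the rough numbers from this talk.
WRITES_PER_SEC = 600        # new tweets per second
READS_PER_SEC = 600_000     # timeline/search reads per second

ratio = READS_PER_SEC / WRITES_PER_SEC
print(f"read:write ratio is about {ratio:.0f}:1")  # ~1000:1, i.e. read heavy

# Daily storage: ~140 bytes of text plus an assumed ~160 bytes of metadata.
BYTES_PER_TWEET = 140 + 160
per_day_gb = WRITES_PER_SEC * 86_400 * BYTES_PER_TWEET / 1e9
print(f"roughly {per_day_gb:.1f} GB of new tweet data per day")  # ~15.6 GB
```

A thousand reads for every write is exactly why the design below leans on caching rather than on the database.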
I know you are curious to see how the system design diagram looks, and here is a sneak peek. As we already learned, Twitter is read heavy, which means we need a system that allows us to read tweets, or any data, fast without putting heavy load on the system, and it should also scale horizontally. What is that system? We can think of using Redis, because Redis is much faster, as all the data is stored in memory, and it can also scale horizontally; we can build clusters out of it.

But we can't just depend solely on Redis; we have to keep a copy of all the data in the DB as well. The most important database tables we need are the user table, the tweet table, and the followers table. Whenever a user creates a profile, we have an entry in the user table, and whenever they follow a page or a person, we record that in the followers table, which has a one-to-many relationship between users and followers. As the user makes tweets, all the tweets are saved in the tweet table, which also has a one-to-many relationship with users. Retweets are also considered tweets here, so for retweets you can either make a one-to-many relationship with the tweets table itself, or you can save a copy of the tweet in the user-to-tweet relationship; that way, if the source tweet is deleted, we still have a copy of the tweet mapped to the user who retweeted it.

We also need to understand how we are going to save the data in Redis; in Redis we need to design the keys under which we save the data. Obviously we need the user ID in the key, plus some kind of unique identifier. A key like "<user_id>_tweets" gives out all the tweets which that user has made so far; we can use Redis's own list data structure to save all the tweet IDs, and with that we get all the different operations we can do on top of a list. Also, to have a mapping of a user's followers in the cache, we can have a key like "<user_id>_followers" which holds the list of all that user's followers; these IDs are the IDs in the user table. Similarly, the tweet IDs are the IDs in the tweet table.

So how do we get the data to generate the user timeline? Say someone visits a user's timeline. The naive way to get that data is to go to the user table by the user ID, get all the tweets and retweets along with any referenced tweets, sort by date and time, and then return the data. This won't scale, because we know that Twitter is read heavy; there are about 600,000 reads coming in per second, and we can't keep loading the DB with all these queries. So what we need to do is add one more layer, a caching layer, and save the data for this user timeline query in Redis. For that, whenever a user tweets, we also keep saving the tweet into Redis. Then, when someone visits a user's timeline, we just get all the tweets that user has made by user ID, and this data is already present in Redis, which is much faster because it's all in memory. Once we have all the tweet IDs for the user from the "<user_id>_tweets" key, we then query Redis by tweet ID, with keys like "1_tweet" and "2_tweet", to get the exact tweet; those entries can also hold other metadata. With all of that data we can just build the user timeline.
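As a minimal sketch of this key design, here is how the write and read paths might look with the redis-py client; the key names ("<user_id>_tweets", "<tweet_id>_tweet") follow the scheme above but are otherwise my own hypothetical choices:

```python
import redis  # redis-py client; assumes a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_tweet(user_id: int, tweet_id: int, body: str) -> None:
    # Store the tweet body under "<tweet_id>_tweet" and push its ID onto
    # the author's "<user_id>_tweets" list, newest first.
    r.set(f"{tweet_id}_tweet", body)
    r.lpush(f"{user_id}_tweets", tweet_id)

def user_timeline(user_id: int, count: int = 20) -> list[str]:
    # Read the newest tweet IDs from the list, then resolve each body.
    tweet_ids = r.lrange(f"{user_id}_tweets", 0, count - 1)
    return [r.get(f"{tid}_tweet") for tid in tweet_ids]
```

In production you would write the durable copy to the DB first and treat these Redis entries purely as a cache.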
Now let's see how to get the data needed to show the home timeline. It looks something like this: it contains all the latest tweets of the people or pages you are following. How do we get this data? The naive way is very simple: first get everyone you are following, then for each of them get their latest tweets, merge all these tweets, sort by date and time, and display the result on the home page; that shows all the latest tweets made by everyone that particular user is following.

Does this work? If you look at it, we have to fetch the followees first; hypothetically, the average user follows about 100 to 150 people or pages. That means we have to go to the user table for around 150 accounts and then get their latest tweets. These queries are heavy on the database. You can argue that you'll shard, or use NoSQL with multiple nodes so the nodes share those queries, but trust me, making 150 queries per page load, or even 10 queries, is not good; you can't return the data needed for the home timeline fast enough. You may have noticed that the Twitter home page loads very fast; the APIs that serve the home timeline respond within about 200 milliseconds or even less. How do they do it? For that we need a different strategy, so let's discuss it.

Let's see how we get the data required to compute the home timeline efficiently. The approach is called fan-out. Fanning out means spreading out in different directions from a single point; that's the definition. Let's see how we use it: whenever we get a tweet, we do the pre-processing up front and distribute the data into different users' home timelines. That way we reduce the queries we have to make when there is a request to compute the home timeline for a particular user.

Say for example a user is followed by three different people. This user has a cached timeline entry, his user timeline, and all the followers also have user timeline and home timeline entries in the cache; we also have the DB. Now consider this user makes a tweet. First, we add an entry into the DB, so the tweet is saved and persistent. Next, we add the same tweet into his user timeline cache, as we already discussed, so if you query his user timeline, this tweet and all his other historical tweets are already there. Then we fan out this tweet to all the followers: we send the tweet to every user who follows this person, and for each of them it gets appended to their home timeline entry in the cache. Each of these home timeline entries is a Redis entry; consider it a Redis list. If a user follows more than one person, then whenever any of those people tweet, that tweet is also appended to the same list; the tweets from everyone he follows accumulate in his home timeline entry.
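A minimal sketch of this fan-out on write, continuing with the same hypothetical key scheme ("<user_id>_followers", "<user_id>_home"):

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def fan_out(author_id: int, tweet_id: int) -> None:
    # Push the new tweet ID into every follower's cached home timeline.
    follower_ids = r.lrange(f"{author_id}_followers", 0, -1)
    pipe = r.pipeline()  # batch the writes into one round trip
    for fid in follower_ids:
        pipe.lpush(f"{fid}_home", tweet_id)
        pipe.ltrim(f"{fid}_home", 0, 799)  # cap the cached timeline length
    pipe.execute()

def home_timeline(user_id: int, count: int = 20) -> list[str]:
    # Reading the home timeline is now a single cache lookup, no DB joins.
    tweet_ids = r.lrange(f"{user_id}_home", 0, count - 1)
    return [r.get(f"{tid}_tweet") for tid in tweet_ids]
```

The 800-entry cap is an arbitrary illustrative limit; the point is that the cached list stays bounded no matter how active a user's followees are.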
So whenever one of these users visits his home page, or home timeline, he just needs to get the data from this cache; this list in Redis doesn't contain only one person's tweets but the recent tweets made by all the people and pages he follows. That way you don't need to make any database queries at all: just go to the cache by user ID, access the home timeline data in Redis, which as you know is much faster because it's in memory, get the list, and show those tweets in descending order by date and time on the home timeline. That's it; we have reduced the DB queries completely, and we are using Redis to get the home timeline data.

But does this model always work? It doesn't, sometimes. Consider that one of these users is a celebrity, someone like Taylor Swift, with about 30 million followers on Twitter. All those followers are following this one celebrity. When this celebrity tweets, is it an easy task to update that tweet into all 30 million people's home timelines? It easily takes minutes, even on Redis. So we need to follow a different approach for distributing the tweets made by celebrities.

Let's see how we handle the tweets made by celebrities. Consider a celebrity; we'll name him C. When C tweets, we obviously still save the tweet into the tweets table in the DB, and then add the tweet into his user timeline cache. If he were not a celebrity, we would then fan out the tweet, updating all the followers' home timelines; but since he has millions of followers, we can't update each and every one; it's too much. So here is the trick: just do those two steps and stay quiet.

When a follower tries to load his home timeline, as we already saw, his cached home timeline already has the tweets updated by fan-out from all the non-celebrity people he follows, and all we need to do is return this data back to the front end or the application. But before doing that, let's check if there are any celebrity tweets to add. How do we do that? We also maintain, in this user's cache, the list of celebrities he is following; since it's in the cache, this operation is fast. Say he follows five different celebrities. All we need to do is go to the first celebrity's user timeline in the cache, check if there is a recently made tweet, and if so, get the tweet and add it to the response. Say the response already had three tweets from fan-out; we add one celebrity tweet, and now there are four. Then go to the next celebrity and keep checking. Since all of these are in-memory Redis operations, this shouldn't take much time, and it's efficient. That way we have included all the fanned-out home timeline tweets and merged in all the celebrity tweets into the final tweet list for the home timeline, in real time; we get all the tweets that are supposed to show in the home timeline of this particular person. Now we have optimized both the user timeline and the home timeline efficiently using the Redis cache.
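Here is a small sketch of that merge-at-read-time step, again with a hypothetical key name ("<user_id>_celebs" for the cached list of followed celebrities) and assuming tweet IDs are roughly time-ordered (snowflake-style) so they can be sorted numerically:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def home_timeline_with_celebrities(user_id: int, count: int = 20) -> list[str]:
    # Start from the pre-computed (fanned-out) home timeline...
    tweet_ids = r.lrange(f"{user_id}_home", 0, count - 1)
    # ...then merge in the few most recent tweets from each followed
    # celebrity's own user timeline, instead of fanning those out.
    for celeb_id in r.lrange(f"{user_id}_celebs", 0, -1):
        tweet_ids.extend(r.lrange(f"{celeb_id}_tweets", 0, 4))
    # Assumes IDs increase with time, so a numeric sort gives newest-first.
    tweet_ids = sorted(set(tweet_ids), key=int, reverse=True)[:count]
    return [r.get(f"{tid}_tweet") for tid in tweet_ids]
```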
Let's also optimize a little bit more. Do we really need to calculate, or pre-compute, all these home timelines and user timelines for users who aren't even logging into Twitter recently? It's unnecessary, right? We can simply skip computing the home timeline and user timeline for people who haven't logged into Twitter for, say, more than 15 days; that means we don't need to compute the timelines for those users at all, and we save a lot of memory in Redis and a lot of computation as well.

Now let's learn how trends are calculated. Twitter considers the volume of tweets and also the time taken to generate that volume of tweets; based on this data, it determines whether a particular hashtag is trending or not. Say for example a thousand tweets in five minutes is much more interesting than 10,000 tweets generated in one month. The first means that in a short period of time people are talking a lot about something; 10,000 tweets in one month could be something pretty normal, because the time window is much bigger. The triggering events are things like election results, where people talk a lot about the different parties, or a cricket match going on, or a movie that just released, with people talking about its reviews; these are the hashtags that trend.

How does Twitter process that kind of data? Obviously it needs to go through each and every tweet posted on the platform and then compute which hashtags are trending. What framework does Twitter use? It uses stream processing frameworks like Apache Storm and Heron. Here is the system design for stream processing of the tweets to figure out the trending hashtags or keywords. You can build this system using a couple of different frameworks: you can use Apache Storm to build the whole thing, or you can use Kafka Streams to build this kind of system, and there are many other stream processing frameworks which you can try out.

Before proceeding, remember that all of this happens in real time; when a person tweets on Twitter, that tweet actually goes through this kind of pipeline in real time. You can think of these boxes as operators; each has its own designated function. For example, one is a filter operator, and its only job is filtering tweets based on certain criteria (I'll explain that in a moment); another one's job is parsing. Every operator here has its own designated function, and it does that only.

How do we scale this kind of system, given that we have coupled these operators together? Say the filter is fast and the parser is fast, but geolocation tagging is a little slow. In that case we only need to scale that one operator, so that more instances perform that operation, while the rest can stay at one instance each. If everything ran on a single instance per operator, this one slow operation, say geo-tagging, would hold everything up; that is why we need to be able to scale just that one thing. All of this data flows through some kind of buffer or queue between the operators; what exactly, depends on which technology or framework you are using.
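To make the operator idea concrete, here is a toy, in-process stand-in for a Storm/Heron-style topology using plain Python queues and threads; real frameworks distribute these stages across machines, and the stage functions here are trivial placeholders:

```python
import queue
import threading

def operator(inbox: queue.Queue, outbox: queue.Queue, fn) -> None:
    # Each operator loops forever: take a tweet, apply its one function,
    # pass the result downstream.
    while True:
        tweet = inbox.get()
        outbox.put(fn(tweet))
        inbox.task_done()

raw, filtered, parsed = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=operator, args=(raw, filtered, str.strip), daemon=True).start()
# The "slow" stage gets two workers reading the same queue; this is the
# scale-only-the-slow-operator trick described above.
for _ in range(2):
    threading.Thread(target=operator, args=(filtered, parsed, str.lower), daemon=True).start()

raw.put("  Hello #Trending  ")
print(parsed.get())  # -> "hello #trending"
```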
For example, if you are using Kafka Streams, these different operators are connected by queues, that is, Kafka topics. So what happens is: when a person tweets, the message enters the filter through a queue, which could be Kafka or any other queue; the tweet is pushed to the Kafka queue, and the filter operator picks that message up from it.

What does this filter operator do? The Twitter team maintains a set of filters (that part is a somewhat manual job) of the most common, frequent hashtags that don't qualify as trending hashtags, like #fun, #travel, #food, #quotes. These are evergreen hashtags, right? We don't consider them hashtags that can be part of a trending list, so we have to filter all of these hashtags out of the tweets, if there are any, and this filter operator does that. It also runs violation checks, to see whether a particular tweet crosses a line: does it contain violent content, does it violate any copyrights, things like that. If it violates, then that tweet is dropped and not processed further. Once this processing happens, the message, with all its metadata, is put into the next queue, and the parse operator picks it up. The advantage of having queues between these operators is that when we scale one operator out to many instances, each instance picks up whatever message is available, based on which one is free; it all works asynchronously.

What the parse operator does is parse the tweet. Not every user mentions a hashtag in a particular tweet; in that case Twitter needs to figure out an appropriate hashtag by doing some kind of natural language processing, and then assign a hashtag to it. This makes sure that at least one hashtag is associated with any given tweet. While parsing, we also have to remove all the stop words; stop words are the ordinary words that don't carry meaning on their own, like "to", "the", "as", "of", "now", and so on. Remove all of those, pick the important words in the tweet, and based on that figure out the appropriate hashtag. From here onwards the tweet gets distributed into two different pipelines, one for the geo processing and another for the trending processing; a copy of the tweet is passed to each pipeline. I'll explain the geo side in a minute; let's consider the trending side first.

So the tweet is here now. As I already mentioned, to compute the trending hashtags we need to consider two things: the time taken to generate those tweets and the number of tweets made. That means we need to do some kind of windowed operation on all the tweets received so far: say a one-minute window, a ten-minute window, or a one-hour window. Based on that we get the count of any given hashtag, essentially the rate at which each hashtag is being produced on the Twitter platform. Once we have the windowed counts, that data is forwarded to the rank operator. The rank operator assigns a rank to each and every hashtag the windowing operator has seen, and based on the ranks we know which hashtags are trending.
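A toy sketch of that windowing and ranking step: keep (timestamp, hashtag) events, drop the ones that fall out of the window, and rank whatever remains. The five-minute window length is just an illustrative choice:

```python
import time
from collections import Counter, deque

WINDOW_SECONDS = 300  # a 5-minute sliding window, chosen for illustration
events: deque = deque()  # (timestamp, hashtag) pairs, oldest first

def record(hashtag: str, now: float | None = None) -> None:
    events.append((now or time.time(), hashtag))

def trending(top_n: int = 10, now: float | None = None) -> list[tuple[str, int]]:
    now = now or time.time()
    while events and events[0][0] < now - WINDOW_SECONDS:
        events.popleft()  # expire events older than the window
    # Rank: count occurrences inside the window, highest first.
    return Counter(tag for _, tag in events).most_common(top_n)

record("#election"); record("#election"); record("#cricket")
print(trending(3))  # -> [('#election', 2), ('#cricket', 1)]
```

A real rank operator would weight the rate of change rather than raw counts, but the window-then-rank shape is the same.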
But Twitter is present almost everywhere, in most of the countries in the world, except a few like China, so just figuring out a global hashtag doesn't make sense. Say an election is happening only in the United States; the hashtag #election might not be suitable for a different country like Australia. Maybe that's a bad example, because the whole world watches US elections; so say a rugby match is happening in the US, in which case the hashtag #rugby is not appropriate in a country like Pakistan. So we also have to map these hashtags to a geolocation, or location. The same tweet is therefore passed to the geolocation operator; this particular operator is responsible for figuring out an appropriate location or country for any given tweet. Then the same thing happens: some window operation builds a mapping of location to hashtags, by region or country, and figures out which particular hashtags belong to which particular location. All of this data is then fed into Redis; from Redis, any API can read the data, and it is consumed by the Twitter front end or application. So that's how it works; I'm not going to explain it in more depth here. To understand more about how these streaming frameworks work, please go and read a little about Kafka Streams and Apache Storm; then this will make a lot more sense to you.

Now let's learn a little about how the Twitter search timeline works; let's see how Twitter handles searching over its tweets and hashtags. Twitter uses something called Earlybird internally. What it does is an inverted full-text indexing operation. In simple words, here is how an inverted full-text index works: every time a tweet is inserted into Twitter, or updated, that tweet is broken into its different words and also into hashtags, and then all of these words are mapped, or indexed, back to all the tweets that contain them. Say a tweet is broken into four different words, "hi", "time", "super", and "movie", plus its hashtags; all of these words get indexed. How does that happen? There is a big table, a distributed table, in which each word has references to all the tweets that contain that particular word. When I search for some word or hashtag, all we need to do is look it up in this table, find that word or hashtag, collect the references to all the tweets in the system that contain it, and give out that complete result set.

I also said it is distributed, because you can't have just one system, one big table, to handle all of this data; that would be a single point of failure, so obviously we need to go for distributed computation, and Twitter does exactly that. There is a strategy called scatter and gather: you have different nodes scattered across different data centers. When you get a query, you send the query to all the different nodes or data centers; each data center looks for all the tweets it knows for that given hashtag or keyword; then all these nodes or data centers give back their results, and the system collates all the results it got and returns the output, the list of all the matching tweets, ordered by popularity.
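Here is a toy sketch of both ideas together, an inverted index in the spirit of Earlybird plus a scatter/gather over shards; in reality the shards are separate machines and the gather step also ranks by popularity, which I skip here:

```python
from collections import defaultdict

class IndexShard:
    """One node's slice of the inverted index: term -> set of tweet IDs."""

    def __init__(self) -> None:
        self.index: defaultdict = defaultdict(set)

    def add(self, tweet_id: int, text: str) -> None:
        for term in text.lower().split():
            self.index[term].add(tweet_id)

    def search(self, term: str) -> set:
        return self.index.get(term.lower(), set())

shards = [IndexShard(), IndexShard()]
shards[0].add(1, "hi time super #movie")
shards[1].add(2, "another #movie tonight")

# Scatter the query to every shard, then gather and merge the results.
hits = set().union(*(shard.search("#movie") for shard in shards))
print(sorted(hits))  # -> [1, 2]
```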
So far we learned how search works, how the timelines are computed, and how trending hashtags are calculated using stream processing. Now it's time to learn how data flows through the system, the overall system design for Twitter, and here is the system design. I have already explained most of the important components in it, so now I'll explain a little about how the data flows through the system.

Say a person tweets: that API call hits the load balancer, and then the call hits the tweet writer, which is the API service responsible for handling incoming tweets. What it does, as I explained earlier, is update the tweet into different components. First, a copy of the tweet is saved into the database. A copy of the tweet is also sent to the Storm/Heron pipeline for the trending hashtag computation, to figure out the trending hashtags. Another copy of the tweet is sent to the fan-out service; as I already explained, fan-out is the service responsible for updating the tweet into all the followers' timelines, both the home timelines and the user timeline. Once it gets the tweet, it updates all the home timelines of the followers of that person. And a copy of the tweet is also sent to the search service, since we need to break the tweets down into different words and index those words to make them searchable; that is handled by the search service, which, as mentioned, is called Earlybird inside Twitter.

When a user searches for some hashtag, word, or keyword from the application, the search request also first hits the load balancer, and the request is forwarded to the timeline service; the timeline service then transfers the call to the search service. Here the search service scatters the request: it sends the same query to all the different nodes or data centers, then collects all the results from those data centers and sends them back to the application.

A few more scenarios: when the user requests the home timeline or user timeline, those API calls also hit the load balancer, and the request is forwarded to the timeline service. The timeline service talks directly to Redis; it finds the appropriate timeline for that user, the user timeline or home timeline, in Redis, which is much faster, retrieves it, and gives it back to the application in the form of JSON.

Here you can also see an HTTP push or WebSocket component, which is responsible for handling the real-time, persistent connections with the mobile applications, whether Android or iOS; it handles the WebSocket connections, and this service should be able to handle millions of connections at any given point in time.

And here you can see Zookeeper. Zookeeper is the coordination service for the distributed components. For example, Twitter runs thousands of nodes in a given cluster just for Redis, because most of the data is present in Redis, with a durable copy permanently in the DB; so we need a big Redis cluster, and when we have a big cluster we also need to coordinate between these nodes, and we need a master responsible for coordinating with the other nodes. All of this work is done by Zookeeper. Basically, Zookeeper helps you maintain the configuration for each and every node in the cluster, helps you elect a master, can be used to name the servers automatically, and keeps track of which servers are online at any given point in time and which servers are offline; based on that, it coordinates with all the nodes in the cluster.
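As a small illustration of that liveness tracking, here is how ephemeral znodes are typically used with the kazoo Python client for ZooKeeper; the paths and addresses are made up for the example:

```python
from kazoo.client import KazooClient  # kazoo: a Python ZooKeeper client

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
zk.ensure_path("/redis-cluster/nodes")

# Each cache node registers an ephemeral znode; if the node's session
# dies, ZooKeeper deletes the znode, so membership is always current.
zk.create("/redis-cluster/nodes/node-1", b"10.0.0.1:6379", ephemeral=True)

@zk.ChildrenWatch("/redis-cluster/nodes")
def on_membership_change(children):
    # Fires on every join or leave; this is how the cluster learns which
    # servers are online or offline at any given point in time.
    print("live nodes:", children)
```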
I think I have explained all the components here. Finally, the DB: Twitter uses something called Gizzard, which is built on top of MySQL and is again a distributed storage service. Since most queries are not going directly to the database, we don't need to worry too much about the DB, but still, at Twitter scale, we need to make sure the DB scales as more and more users sign up to Twitter. Twitter also uses Cassandra for some of the other computations, for example for analytics and other kinds of services where there is big data and some kind of analysis needs to run on it.

I think that's about it; I have covered all the major components and services which are needed to design Twitter. If you liked this video, please subscribe to my channel, and also tell your friends about the channel. Please leave me suggestions on how I can improve these videos, and if you want any specific topic covered, please drop a comment on any of the videos; I am going to reply to each and every comment. Thank you.