Transcript for:
Insights on DynamoDB Architecture and Performance

Let's dissect the DynamoDB paper today. Amazon has published two papers: one called Dynamo, which was not really about DynamoDB, and a second one that describes the actual DynamoDB implementation. We'll dissect that second paper and understand its nitty-gritties. For folks living under a rock: DynamoDB is, to be honest, the world's most popular non-relational database, and its core highlight is that it is known for providing consistent performance at any scale. A lot of databases give consistent performance but typically falter at humongous scale; DynamoDB gives you consistent performance no matter how huge the scale is. That's the beauty of it. To share some numbers from the paper itself: during the 66-hour Prime Day sale in 2021, Amazon systems made trillions of API calls to DynamoDB, with a peak of 89.2 million requests per second. Just imagine the sheer scale handled at that time, and for that entire duration DynamoDB didn't go down; it stayed highly available with single-digit-millisecond performance. How amazing is that? And apart from us, the external customers, almost all the major internal services at Amazon use DynamoDB to store metadata and to power their actual products. So what we'll do is dissect this paper, understand what it is all about, understand its building blocks, and understand how it guarantees such beautiful, consistent performance.
First thing: what is the goal of the DynamoDB design, and how did it all start? DynamoDB started with just one goal: provide consistent performance at scale. Day zero was basically "whatever you guarantee, guarantee it no matter how small or large the data is and how low or high the traffic is; the performance must be consistent." Then they stretched it up a notch and said this consistent performance would be low single-digit milliseconds, in most cases less than five milliseconds. Insane, right? This is the goal they started with, and the entire design revolves around exactly this.
Okay, so let's understand the workload pattern, because we cannot just start designing a system; we need to understand what kinds of things we will and will not support. The workload pattern Amazon started with: first, they wanted multi-tenancy, the most important thing at scale. This DynamoDB infrastructure would be used not just by internal services but by external customers, and many of the internal services are critical, so they have to ensure that the load of one service does not impact another. That's multi-tenancy. Second, they wanted very high resource utilization: they could not just run a bloated infrastructure and say "to guarantee single-digit milliseconds we'll put everything in memory"; they wanted high resource utilization to keep the infrastructure expenditure to a bare minimum.
Third, they wanted boundless scale for tables: there should be no limit on how big a table can be; you could have millions, billions, even trillions of rows if you wanted. Fourth, they wanted predictable performance, be it megabytes of data or terabytes. Fifth, the system had to be highly available, which leads directly to fast recovery and replication. Sixth, they wanted flexible use-case support, which means not restricting users to a particular schema; if you support a schema-less model with only basic enforcement, you can support a large number of use cases. This is the workload pattern that suited Amazon the most, and with these key goals in mind DynamoDB was architected.
Okay, let's go into the architecture. To give you a brief: DynamoDB has multiple tables, each table is a collection of items, and each item is uniquely identified by a primary key. The primary key can have two parts: a partition key and a sort key. The sort key is optional; the partition key is mandatory. The partition key, as the name suggests, helps determine which partition your data goes into. If you don't provide a sort key, the partition key alone is the primary key; if you do provide one, the combination of partition key and sort key becomes the primary key, but you have to provide the partition key no matter what. The sort key determines the order in which items are stored within the partition.
The word "partition key" clearly indicates that the data is partitioned and stored across nodes, which is exactly what happens. Given an item to store, the partition key is passed through some function, and I'm deliberately not just saying hash function, because it could be hash-based routing or range-based routing (we'll take a look at which one DynamoDB uses), and that function determines which partition the item goes to. The item goes to that partition, and if you provided a sort key, the items within the partition are ordered by it.
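To make this concrete, here's a small sketch, not from the talk, of creating such a table through boto3, the AWS SDK for Python; the table name, attribute names and capacity numbers are placeholders I picked for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

table = dynamodb.create_table(
    TableName="user_posts",
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},      # partition key (mandatory)
        {"AttributeName": "created_at", "KeyType": "RANGE"},  # sort key (optional)
    ],
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "created_at", "AttributeType": "S"},
    ],
    # The provisioned read/write capacity units we'll come back to later in the talk.
    ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 100},
)
table.wait_until_exists()
```

With both key parts defined, the combination (user_id, created_at) uniquely identifies an item, and items sharing a user_id are stored ordered by created_at.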
Now, with this, there is another feature DynamoDB supports: secondary indexes. So far we've only talked about the primary key, but what if your primary key is the user ID, you store user details against it, and now you have to query on the age of the user? You cannot do a full table scan every time, which is why DynamoDB also offers secondary indexes. Secondary indexes are essentially like DynamoDB tables with two (or more) columns, where the first column is the indexed value and the second column is the primary key. There are other options as well, go through the DynamoDB documentation to understand them, but the core idea of a secondary index is to store the mapping, at a bare minimum, from the indexed value, in our case the age of the user, to the primary key. DynamoDB manages this internally, so you don't really have to worry about it, and most databases manage this for you, but the point is that it supports secondary indexes. Now you can fire efficient queries like "give me all the users whose age is 10," and it would give you that, because it stores that information as rows: (10, 1), (10, 2), (10, 91), meaning users with IDs 1, 2 and 91 have age 10. That's a simplistic representation of it.
So, to recap: DynamoDB supports a primary key; the primary key has a partition key and an optional sort key, and the partition key is mandatory. If you provide both, their combination is treated as unique; if you provide just the partition key, that alone is treated as unique. The partition key determines which partition your data goes into, i.e. data ownership. DynamoDB also has support for secondary indexes.
Okay, given this, let's go a step deeper: how is a DynamoDB table split into partitions? We know the table is divided into partitions, but understand this: each partition is disjoint, which means no two partitions share data. Each partition is a subset of the entire data, and it holds contiguous key ranges. For example, say I have the words apple, banana, cat, dog, elephant, fan, grapes and house. You would see a partitioning that looks something like this: apple, banana and cat go to partition one; dog and elephant go to partition two; fan, grapes and house go to partition three. Sorted, contiguous key ranges belong to a particular partition. How it decides where to split is a different problem that we'll tackle later, but understand what a partition is here: apple is present in partition one, which means it is not present in partition two or partition three. Partitions are disjoint, mutually exclusive, and each partition holds a subset of the data: your entire data set is apple through house, and each one holds a contiguous key range.
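Here's a minimal sketch of that contiguous-key-range routing, purely my own illustration with made-up split points, showing how a key can be mapped to its partition with a binary search:

```python
import bisect

# Each partition owns a contiguous key range; the split points mark where
# one partition ends and the next begins (made-up values for illustration).
SPLIT_POINTS = ["dog", "fan"]        # keys >= "dog" leave partition 1, keys >= "fan" leave partition 2
PARTITIONS = ["partition-1", "partition-2", "partition-3"]

def partition_for(key: str) -> str:
    """Return the partition that owns `key` via binary search over the split points."""
    idx = bisect.bisect_right(SPLIT_POINTS, key)
    return PARTITIONS[idx]

assert partition_for("apple") == "partition-1"
assert partition_for("elephant") == "partition-2"
assert partition_for("house") == "partition-3"
```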
Now, a partition is a very nice abstraction, and on top of it DynamoDB ensures that your data is durable and highly available, so that even if one node goes down things still work. What they do is take a partition and replicate it across multiple nodes with some replication factor. Say I take a replication factor of two (they typically do three): I have three partitions and three availability zones; think of an availability zone roughly as a data center. I take each partition, replicate it twice in this example (three times in reality), and spread the copies across availability zones; for example, the yellow partition sits in zones one and three, the pink one in zones one and two, and the blue one in zones two and three. If one availability zone goes down for any reason, say there is a flood in a data center, you can still serve the traffic from the other copies. That's the beauty of it: to ensure high availability you need to make your data redundant, and this is how they do it. Two data centers going down together is highly unlikely, but if one goes down you still have the others serving the exact same data.
These copies are called replicas, and this is different from read replicas. They're called replicas because each one is a replica of the partition's data; the copies hold the exact same data. It is not a read replica, and I'm making this extremely clear: read replicas in the MySQL or Postgres world are a different concept. Simple: you have the complete data, the data is split into partitions, each partition has multiple replicas, and those replicas are distributed across availability zones for high availability.
Now, when you have multiple copies of data, one of them needs to be the leader and the others followers. Say partition P1 is placed on data node A, data node B and data node C; out of these three, one data node is the leader for this partition and the other two are followers. These three copies are kept in sync using Multi-Paxos; that's what DynamoDB uses for consensus. Discussing Paxos itself is a separate topic you can read about; it's a consensus algorithm that helps you achieve a lot of this, and Raft is an alternative, so just go through one of them in case you are unaware. And because you have leaders and followers, there is obviously some form of leader election: if a follower goes down, a new follower is brought up; if the leader goes down, one of the followers becomes the new leader. When a replica detects that the leader is down, it initiates a leader election, one of the replicas becomes the leader, and it starts serving requests. Understand this: these are partitions, not the entire data set, so this is happening at the partition level, not the table or node level. There can be many data nodes, each data node hosts multiple partitions, and every partition's replicas form their own consensus group (a Multi-Paxos group here; Raft would be an alternative).
Okay, so given that we have a leader, what's its job? Note that leadership is per partition: data node C can be the leader for partition one while data node B is the leader for partition two. For a given partition, the leader takes care of serving the writes: all writes coming to that partition are handled by its leader. The leader's job is also to serve not just the writes but the strongly consistent reads; because the writes go through this node, it is in the best position to serve a strongly consistent read if required. In DynamoDB, when you fire a read request you have two options: either you say "I want a strongly consistent read," or you say it's okay to get slightly stale data. If you request strongly consistent data, the read is served by the same node that takes the writes, the leader of the corresponding partition. Eventually consistent reads can be served by the other replicas, like data nodes A and B for partition one, but strongly consistent reads are served only by the leader, data node C.
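As a concrete illustration (my own, not from the talk), this is how the two read modes look from a client's point of view with boto3; the table and key names are the placeholders from the earlier sketch:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("user_posts")   # hypothetical table from the earlier sketch

# Eventually consistent read (the default): may be served by any replica
# of the partition and can return slightly stale data.
resp = table.get_item(Key={"user_id": "42", "created_at": "2021-06-21T10:00:00Z"})

# Strongly consistent read: served by the partition leader, reflects all
# acknowledged writes, at the cost of higher read-capacity consumption.
resp = table.get_item(
    Key={"user_id": "42", "created_at": "2021-06-21T10:00:00Z"},
    ConsistentRead=True,
)
item = resp.get("Item")
```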
Okay, so what happens when a write comes to the leader? The leader accepts the write and records it in the write-ahead log file on its machine. So if my write is for partition one, it comes to data node C, because that node holds the leader for partition one. Data node C writes it to its write-ahead log and then sends the write to the other two nodes that hold replicas of the same partition, data node A and data node B. Those nodes write it to their own write-ahead logs. They are not making changes to the partition's table data itself yet; they are just registering the incoming write in a write-ahead log file, basically for durability purposes. Instead of waiting for them to apply the change to their copy of the table, it is better that they just make an entry for that write and acknowledge it: "we have made an entry in our write-ahead log file." Once data node C, the leader of this partition, gets acknowledgments from enough replicas, i.e. it attains the quorum, it responds back to the client that the write is successful. All of this happens synchronously: the request came in to data node C for partition P1, which is the leader; it made an entry in its write-ahead log file; it sent the request to data nodes A and B because they own replicas of P1; your client is still waiting for a response; they made entries in their corresponding write-ahead log files and responded "done"; and then data node C responded to the client that the write is successful.
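Here's a minimal sketch of that synchronous write path, leader WAL append, replication to the followers' WALs, and a quorum of acknowledgments before answering the client; the classes, names and quorum size are assumptions for illustration, not DynamoDB's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    wal: list = field(default_factory=list)   # write-ahead log entries
    healthy: bool = True

    def append_wal(self, entry: dict) -> bool:
        """Register the write in the WAL and acknowledge (no B-tree apply yet)."""
        if not self.healthy:
            return False
        self.wal.append(entry)
        return True

def leader_write(leader: Replica, followers: list, entry: dict, quorum: int = 2) -> bool:
    """Leader appends to its own WAL, replicates synchronously, waits for a quorum of acks."""
    acks = 1 if leader.append_wal(entry) else 0          # the leader's own append counts
    acks += sum(f.append_wal(entry) for f in followers)  # synchronous replication to followers
    return acks >= quorum                                # only then tell the client "success"

leader = Replica("node-C")
followers = [Replica("node-A"), Replica("node-B", healthy=False)]
ok = leader_write(leader, followers, {"op": "put", "key": "dog", "value": 1})
print(ok)   # True: 2 of 3 WAL appends succeeded, quorum reached despite one replica being down
```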
Then, asynchronously, these data nodes take the write and apply it to the actual table data; changes to the table itself are done afterwards. When the write request reaches the followers, they simply register it in the write-ahead log file and then apply it asynchronously to the actual table data. The reason is that applying the change to the actual data might require, say, rebalancing the B+ tree, if that's what is storing it, so they don't wait for that. They simply register the incoming write and acknowledge it; this way you get a very quick acknowledgment while still ensuring durability, and you can immediately respond to the client. In the write-ahead log file on each of these instances you register all the operations you receive: put a key, delete a key, whatever the operation is. All incoming write operations are registered in the write-ahead log on each instance and then asynchronously applied to the actual table data. This is how a write works in DynamoDB.
Now, we've been talking so much about partitions; let's look at why this abstraction is so important. Why did DynamoDB introduce partitions at all? Why can't it just store the entire table on one data node? Why split the table logically, and in this case even physically, into partitions distributed across the cluster; what advantage does that give? Say I have partitions one, two and three, and for each of them I have two replicas, R1 and R2: partition one replica one, partition one replica two, and so on. The table is split into partitions, each partition is replicated n times, two, three or five, and the copies are distributed across the nodes. The first advantage is high availability and fault tolerance; we all agree on that: because you have multiple copies, even if one node goes down you can serve from another. But beyond this, the best part is that because of this abstraction, if a partition is becoming hot you can split it in half, or into whatever fraction you want, and then move the pieces across different physical servers, different data nodes. So if a partition on node one is getting piping hot, you split it in half, keep the first half on node one and move the second half to node two. This way you can handle the load on a partition very easily: split it and distribute it. How you split, exactly in half, three-fourths, one-fourth, and on which key, depends on your use case and on the incoming workload; there is no hard and fast rule. The beauty is that your table can now scale elastically without any problem: when a partition becomes hot, you break it in half and distribute the workload. Simple. This abstraction is so important for DynamoDB, and it gives you that nice flexibility, because you don't expect the entire data of a table to live on one node.
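As a rough sketch of that idea, and only a sketch with made-up names, splitting a hot partition is essentially cutting its key range at some point and assigning the new half to another node:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    name: str
    start_key: str   # inclusive lower bound of the contiguous key range
    end_key: str     # exclusive upper bound
    node: str        # data node currently hosting the (leader) replica

def split_partition(hot: Partition, split_key: str, target_node: str):
    """Split a hot partition at `split_key`; the upper half moves to another node."""
    assert hot.start_key < split_key < hot.end_key, "split point must fall inside the range"
    lower = Partition(hot.name + "-1", hot.start_key, split_key, hot.node)
    upper = Partition(hot.name + "-2", split_key, hot.end_key, target_node)
    return lower, upper

p2 = Partition("partition-2", start_key="dog", end_key="fan", node="node-1")
lower, upper = split_partition(p2, split_key="elephant", target_node="node-2")
print(lower)
print(upper)
```

After the split, the routing metadata is updated so keys below the split point keep going to the old node while keys above it go to the new one.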
Because you split the table into partitions, you can distribute it across the cluster and leverage the network bandwidth of each of those machines very easily. Apart from that, DynamoDB also takes care of partition rebalancing. Every single data node has a lot of observability set up, measuring CPU, memory, disk and so on, with plenty of metrics being collected, so DynamoDB proactively sees "this data node is getting hot, let me take some partitions off it and move them elsewhere." It does this seamless migration of partitions here and there, and because of this one simple abstraction they have made things so elastic, which is exactly what you need from a truly horizontally scalable service.
But does this mean you keep breaking partitions forever? What if a partition shrinks down to one row? You typically don't go that deep; in some cases it is futile to split the partition further. I'll give an example: say there is one key in one table that is piping hot, accepting a huge number of incoming writes. For example, a social network stores its posts in DynamoDB, a celebrity posted something, everyone started liking it, and a huge number of writes hit DynamoDB incrementing that like count. If you look carefully, in that entire partition only this one celebrity's row is piping hot, nothing else. So why would you split down to the level of one row? It does not make sense, and DynamoDB does not just keep splitting forever. There should be some minimum amount of data that qualifies a partition to be a partition, and if it's not the whole partition but just one key that is hot, it does not make sense to create a partition just to hold that key, because after some time that load will go away. Understanding when to make that split is really important.
Okay, now let's go a little deeper and see what a storage node actually holds; so far we've just been drawing boxes saying "this is a partition, this is the write-ahead log." Each storage node hosts multiple partitions, and each partition is a combination of a B-tree and a write-ahead log file. The B-tree is where the actual data is stored. A B-tree is a really good data structure for databases because it gives you O(log n) lookups and consistent performance; you can read more about the advantages of using B-trees to store data. DynamoDB stores the partition's data in a B-tree and keeps a write-ahead log for the obvious durability purposes. When a write is received for a particular partition, the operation is first registered in that partition's log file and then the change is applied to the B-tree holding the actual data. So each storage replica looks something like this: the partition's B-tree plus its write-ahead log file.
Those are storage replicas. In DynamoDB there is also something called a log replica. A log replica is a storage replica without the data: it contains only the write-ahead log file. It's typically there to help you recover quickly; recovering a full data replica takes time, especially copying the actual data, so you quickly add a log replica so that writes can keep being accepted, distributed and registered, and you get fault tolerance back out of the box. So: a storage replica contains the actual data plus the write-ahead log file, while a log replica contains only the log file for that partition. We'll touch on log replicas again when we talk about high availability, so remember them.
Now let's go a level up. We'll take a look at the different microservices that make up DynamoDB and their responsibilities. It's going to be a huge section; we'll talk about every microservice DynamoDB has and how it helps ensure consistent performance at scale. The classic high-level diagram for DynamoDB looks something like this: you have an authentication and authorization service, a metadata service, a storage layer, a router layer, and then the end user or client. Let's understand what each service does and then talk about each one in depth. The first is the metadata service, one of the most critical services for DynamoDB, because it holds the metadata: which partition is present on which storage nodes, i.e. the routing information. Your router receives the request from the client, takes help from the metadata service to understand where to route the request, routes it to the corresponding storage node, and the storage node writes to the write-ahead log, does the quorum of replica acknowledgments, and responds back to the user. This is what happens on every single write: the client always talks to the router, the router takes the metadata service's help to know where to forward the request, forwards it to the storage node owning that partition, which writes it to the write-ahead log, replicates it synchronously across the partition's replicas in the cluster, and then updates the data. That front-facing piece is the request routing service, where all requests come in. The storage service is the one that takes care of everything on the storage side, across storage nodes and replication, ensuring that the replication factor for each partition is met.
One of the most critical services of DynamoDB is the auto admin service. Think of auto admin as your orchestrator, the central nervous system of DynamoDB. As the name suggests, it's an administrator service in autopilot mode: it keeps checking whether everyone is doing their work properly, whether the fleet is healthy, whether each partition is healthy,
whether a partition configured with a replication factor of three actually has three replicas, whether a node is getting hot and a partition needs to be split and redistributed, and so on. It takes care of auto recovery from failures, load balancing on the storage side, fleet health and everything around it; it's the master orchestrator for your entire system. On your read and write path the request comes to the router service, goes to storage, gets the response and sends it back; that's the bare minimum. But keeping everything up and running is the job of the auto admin service. Apart from that there are a few other services: one for transaction management, one for lock management, which takes care of the locks needed for transactions, backup and restore, and DynamoDB Streams, where all the updates are pushed onto a stream (to Kinesis, for example) so that you can consume from it and reapply changes wherever you want. All the fancy features are each their own corresponding service.
Now let's go deeper into the implementation of each of them, starting with a big one: the metadata service. What happens is the request coming from the user goes to the router; the router consults the metadata service to understand "where do I need to forward this request?"; the metadata service responds "please talk to storage node 2"; the router forwards the request to storage node 2, which does the processing and sends a response back to the router; and the router sends the response back to the user. Here you see the criticality of the metadata service: it owns the metadata, the data about the data. It holds the most critical mapping: this is a table, the table has these partitions, this partition owns this particular key range, this key range is present on these storage nodes, these are the replicas, and out of them this one is the leader. All of that information lives in the metadata service.
Now, the router is not storing any data; so far it uses the metadata service to know where to forward each request. But if you think carefully, if the router has to consult the metadata service for every incoming read or write, you will never get consistent, high performance, because every request now involves an extra network call just to find out where to go. So what would you do? Exactly what DynamoDB does: cache the routing information locally on the router. When a router receives a request, it checks whether it has the routing information for the corresponding partition; if yes, it uses that to forward the request to the corresponding storage node; if not, it goes to the metadata service, downloads the routing information for that table, keeps it cached locally on the routing server, and then uses it to forward the request.
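A minimal sketch of that lazy routing cache, with a stand-in MetadataService class and made-up names, might look like this:

```python
class MetadataService:
    """Stand-in for the real metadata service: the authoritative partition -> node map."""
    def __init__(self, mapping):
        self.mapping = mapping

    def lookup(self, table, partition):
        return self.mapping[(table, partition)]

class Router:
    def __init__(self, metadata_service):
        self.metadata_service = metadata_service
        self.routing_cache = {}                             # (table, partition) -> storage node

    def route(self, table, partition):
        node = self.routing_cache.get((table, partition))
        if node is None:                                    # cache miss: ask the metadata service once
            node = self.metadata_service.lookup(table, partition)
            self.routing_cache[(table, partition)] = node   # ...then serve future requests locally
        return node

meta = MetadataService({("user_posts", "partition-1"): "storage-node-2"})
router = Router(meta)
print(router.route("user_posts", "partition-1"))   # first call hits metadata, later calls hit the cache
```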
That's the bare minimum you would think of doing, and that's exactly what DynamoDB did: it started caching the table routing information on the router, lazily. When a request comes in, the router populates the cache, and there is eviction, so after some time entries get deleted; all of that is handled. Now, if you think about it, routing information does not change frequently; how often does a partition move from one node to another? Not very often. So DynamoDB observed a cache hit ratio of 99.75%, which means most requests that came to a router were served using locally held routing information. To be clear, the router is not caching the data; it is caching the routing information, i.e. where to forward the request. The request comes to the router, the router already knows where to go, it forwards the request to the corresponding storage node, gets the data and sends it back to the user. The router caches the routing information, not the data; data caching is a different thing.
So they did this, and you see a very high cache hit ratio. But what happens if that cache gets invalidated? What if, say, the router service restarts on all nodes at once? Incoming requests arrive and the routers have nothing cached; the fallback is that they all call the metadata service to get the routing information. So if the router service reboots, or there is a huge cache miss, everything falls back to the metadata service, spiking its traffic. That spike could crash the metadata service, and if the metadata service crashes there is an outage, because the router is heavily reliant on it to know where to forward requests. So how do you prevent this from happening? This is a huge pain: if the router service restarts or the cache invalidates, there's a huge spike on the metadata service and that would break things. The solution Amazon built is something called MemDS, a distributed in-memory data store optimized for range queries, for example: given this key, find which partition holds it, and from that partition you know which storage node the partition is present on. MemDS stores the routing data in memory; it is not persistent storage. So the request comes to the router, the router still has some routing data cached locally, but if it does not, it goes to MemDS. MemDS's job is to hold the routing information, and it is optimized for range queries: less than this key, greater than this key, floor and ceiling style lookups. It is implemented using a hybrid of a Patricia trie and a Merkle tree.
A Merkle tree helps you detect whether anything changed, and a Patricia trie has its own set of properties; together these structures let you fire efficient range queries. So, given a key, say I want to find the value for key "cat": where do I have to go? To node one. But how do I know that? MemDS, the distributed in-memory data store optimized for range queries, answers that: you fire the query at MemDS and it tells you which node you should go to. That's its job. This way, instead of the router relying on the metadata service to populate its cache, it calls MemDS. That's a huge advantage, because now you are not directly dependent on the metadata service for that information; you rely on another component, and MemDS, being in-memory and optimized for range queries, answers very quickly, which is exactly what you need. The router still caches things locally, so it doesn't have to go to MemDS synchronously for every request.
Now, one key thing: I just said the router does not always go to MemDS, but that's not quite true. When you're building a highly available system you have to ensure that at no point does it go down, and a system typically goes down when it sees very high load it is not provisioned to handle. Suppose that, because you were caching things locally, you thought "let me keep only two instances of MemDS; why do I need twenty?" If the local caches are invalidated for any reason, a huge amount of load suddenly hits MemDS, and MemDS goes down because it had two instances when it actually needed twenty; the same problem we had with the metadata service now happens with MemDS. So why add MemDS at all? The reason is that MemDS is not provisioned for the bare-minimum load; it is provisioned for the actual, full load. Whenever a request comes to the router service and the routing entry is not cached, the router goes to MemDS, gets the data, caches it locally and serves the request. But even when the entry is cached, the router still makes an asynchronous call to MemDS. So all requests effectively flow to MemDS, just not synchronously; you don't do anything with the asynchronous response, it's about keeping MemDS prepared for the high load that would come in if the caches ever go away. This is a really important high-level design practice for building a highly available system: you are not relying on MemDS in the synchronous request-response path, but you still fire asynchronous requests at it just to keep it warm and correctly sized. It's like soldiers during peacetime: if they don't train when there is no war, they won't be able to win when the war comes. Same thing here.
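Here's a minimal sketch of that pattern, serve from the local cache, but still fire an asynchronous call to MemDS on every request so it keeps seeing the full load; the classes and the thread-based async call are my own simplification, not DynamoDB's actual mechanism:

```python
import threading

class MemDS:
    """Stand-in for MemDS: always holds the authoritative routing answer."""
    def __init__(self, mapping):
        self.mapping = mapping

    def lookup(self, key):
        return self.mapping[key]

class Router:
    def __init__(self, memds):
        self.memds = memds
        self.cache = {}

    def route(self, key):
        node = self.cache.get(key)
        if node is not None:
            # Cache hit: answer locally, but still send an async refresh to MemDS
            # so MemDS constantly sees (and stays provisioned for) the full request rate.
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
            return node
        # Cache miss: synchronous lookup, then cache the answer.
        node = self.memds.lookup(key)
        self.cache[key] = node
        return node

    def _refresh(self, key):
        self.cache[key] = self.memds.lookup(key)

memds = MemDS({"cat": "node-1"})
router = Router(memds)
print(router.route("cat"))   # miss: synchronous MemDS lookup
print(router.route("cat"))   # hit: served locally, async refresh fired in the background
```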
Even during peacetime, when the cache is hitting and there is no real need to call MemDS, you are still asynchronously calling it, so the MemDS cluster stays well provisioned and used to handling that much request volume. And because those calls are asynchronous, if MemDS is down or facing issues you also find out when you need to scale it. But who updates the information in MemDS? The storage nodes update MemDS directly: "I am this storage node, I own this partition," and that information gets indexed in MemDS. MemDS is transient, it stores everything in memory, so what if it goes down? You have the metadata service to serve that; it would have some persistent layer. The paper does not mention an explicit persistence layer for the metadata service, but there should be one that durably records which node owns which partition and where the replicas lie. This is such a beautiful high-level pattern: you are ensuring there are no cascading failures, no matter what.
Now let's take a look at the edge cases. If all the router nodes go down, DynamoDB stops, so you keep routers behind a load balancer with monitoring; if one node goes down another takes its place. If a router's cache goes away, its requests go to MemDS. If, say, all the local caches of all the routers go away, all the requests go to MemDS, but MemDS is already provisioned to serve that much load. That's the beauty of it: because MemDS is provisioned for the full load, even if every local cache disappears there will not be any cascading failure, ever. Just by over-provisioning MemDS (and keeping it exercised) you solve the problem of cascading failures; such a beautiful piece of distributed systems design. You still keep the local cache so that you don't have to synchronously call MemDS for routing information on every request: once it is cached locally you forward the request directly, while asynchronously pinging MemDS. This is how you ensure high availability at all times, keeping yourself ready for wartime. That's the metadata service; I told you it would be deep.
The next one is storage admission control, which is really important. Storage admission control is about how much load one storage node can handle. We get a lot of requests, and in the end they have to be served by the storage nodes, but you cannot bombard a storage node with unlimited requests or it will go down. How do you rate-limit that? That's the concept of storage admission control: it ensures that your storage nodes are not overloaded and the requests are rate-limited. But how? On each storage node you have a small component running for admission control, and the first thing to understand is that one storage node can host partitions from multiple different tables.
Say I have three partitions A, B and C on a node; it is very possible that these three partitions are not from the same table, they could be from three different tables. For each table in DynamoDB you provision something called capacity units, read capacity units and write capacity units; think of it, in simple terms, as how many reads and writes per second you want the table to support. So you can configure: on this table I want 500 writes per second, on this one 200 writes per second, on this one 50. You configure it; that's the idea. Now, once a partition hits its share of that limit, further requests are not processed. Rate limiting is essential, because the customer is paying for 500 writes per second; why should they get 1,000? It is DynamoDB's responsibility to block and throttle them, and that's exactly what storage admission control does: it ensures that a storage node does not handle more requests than it is provisioned for, because the day the node gets overwhelmed it crashes and your system has an outage. You don't want that to happen, which is why rate limiting is essential.
But remember, every single thing in the world has a limit. One physical storage node has a limit; say one server can handle at most 1,000 requests per second. That is a physical limit: whether you host one partition on it or a million, the limit is at the server level, so you cannot serve more than that many requests from it. That is the global limit of a storage node. So say a storage node has a limit of 300 requests per second, and within it you have three partitions, each allowed 50 requests per second: 50 plus 50 plus 50 is 150 against 300, which means there is room to host three more partitions of 50 requests per second each if you wanted. There is a global limit for the storage node that you need to adhere to, and there is a local limit for each partition depending on what the customer has provisioned for its table. The auto admin service, the central nervous system we talked about, ensures that at any given point in time, on any storage node, the sum of the limits of the individual partitions placed there never exceeds the node's global limit; the customer assigns capacity to the table, so DynamoDB knows how many requests each partition is supposed to handle and can enforce this.
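A tiny sketch of that placement invariant, with made-up numbers, could look like this:

```python
# Sketch of the placement invariant: a partition may be placed on a node only if
# the sum of provisioned per-partition capacities stays within the node's limit.
NODE_LIMIT_RPS = 300   # what this storage node can physically handle (made-up number)

def can_place(existing_partition_limits: list, new_partition_limit: int) -> bool:
    return sum(existing_partition_limits) + new_partition_limit <= NODE_LIMIT_RPS

print(can_place([50, 50, 50], 50))    # True:  200 <= 300, there is headroom
print(can_place([50, 50, 50], 200))   # False: 350 > 300, auto admin must pick another node
```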
For example, if the storage node's limit is 300 requests per second, I cannot place three partitions on it each with a limit of 200 requests per second, because that would be 200 plus 200 plus 200, 600, against a limit of 300. The auto admin service ensures that, no matter what, a storage node is never assigned partitions whose cumulative limit exceeds what the node can handle; otherwise, when the requests actually arrive you would have to serve more than the node physically can, which you cannot. That's the concern it addresses.
Now let's go a level deeper into how all of this works. DynamoDB has the concept of read capacity units and write capacity units. First, does the user know how many partitions a table has? No. The user just says "I have this table and it needs to handle 500 requests per second," and that's it. Whether DynamoDB splits it into five partitions, ten, fifty or a thousand, as a customer you don't care; you say "this is the table and this is the limit I want to work with." So DynamoDB does something simple: if my table has, say, a read capacity of 1,000, and DynamoDB chose to split the table into two partitions P1 and P2, it distributes the capacity proportionally. I have 1,000 read capacity units, roughly 1,000 requests per second in simple terms, and two partitions, so partition P1 gets 500 and partition P2 gets 500.
So capacity is split equally among all the partitions. Now suppose, for whatever reason, I'm seeing a lot of traffic on partition P2 and I choose to split it in half: I now have partition P1, partition P2-1 and partition P2-2. What's the capacity distribution across these? 333 each, 1,000 divided by three. In general, if my table capacity is N and I have M partitions, each partition gets N/M of the capacity; DynamoDB has refinements on top, but that's the starting point, so absorb this first. What DynamoDB assumed on day zero is that all keys are equally likely to be accessed. That's not true in reality: some keys are far more likely to be accessed than others. For example, if I'm storing social media posts in a DynamoDB table, more recent posts are far more likely to be reacted to, liked, commented on, shared, so they see high read and write request rates. That's a problem, because you end up with hot partitions: the posts that have just been published get huge read and write traffic. Say a post by some celebrity lives in partition two. Partition two had 500 capacity units, because you had 1,000 split two ways. That partition became hot because of all the reads and writes on that post, so you split it in half into P2-1 and P2-2; the hot post lands in one of them, say P2-2. But now the distribution is 333, 333, 333, and the problem is that the same post that used to sit in a partition with 500 provisioned capacity now sits in a partition with only 333. This problem is called throughput dilution: you have diluted the throughput by horizontally scaling. That's not what the customer expects; the customer expects that because the hot partition was split, performance should improve, but you just cut the available throughput. And "that's how my system works" is not an answer: the customer says "I'm paying you money, I want performance, I don't care." So how do you address the throughput dilution problem that this splitting causes?
The first way of handling throughput dilution is called bursting. In the real world the pattern looks like this: if a post is recent, a huge amount of traffic comes in; after some time, nobody cares about that post anymore. So if you look carefully at a particular storage node, not the table, the node, each of its partitions has some capacity allocated, but not all partitions will be using their allocated throughput at the same time.
For example, say each partition on a node is provisioned for 50 requests per second; what are the chances that all of them are simultaneously receiving 50 requests per second? Not high, because there is no uniform query load across all your partitions, which means we have some breathing space. The idea is that, given that not all partitions will be using their allocated throughput simultaneously, we can let a partition tap into the unused throughput capacity of the node. Take this example: I have partitions P1, P2 and P3, each provisioned for 100 write capacity units, so each is geared up for roughly 100 writes per second. What we observe is that P2 and P3 are only getting 20 each. So if P1 is getting a somewhat higher load, we can allow P1 to go beyond its provisioned 100, a little above it. Why? Because P2 and P3 are only using 20 of their 100 each, you have 80 plus 80, 160, in buffer; you can let P1 go to 150, 160, 170, 180 or so. This is called bursting: you are allowing a particular partition on a specific node to absorb bursts of requests.
But how do you implement this? Drawing boxes is easy; thinking through the implementation is the hard part, and the implementation here is interesting. How do you allow a particular partition, for some time, to tap into the unused capacity of a node? The way DynamoDB does it is token buckets, a very simple mechanism. Each storage node has a token bucket that is replenished every second up to the node's global limit: if my node can handle 300 requests per second, every second the bucket on this server is topped up to 300 tokens. Each partition then has two buckets of its own: one for its allocated throughput, so for P1 the allocated bucket holds 100 and for P2 it holds 100, and one for burst. It's not that, because my storage node can handle 300 requests per second, one partition gets to take it all the way to 290 just because the others aren't using anything; there is a burst limit per partition. For example, if my provisioned capacity for a partition is 100, my burst allowance might be, say, 30 on top.
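Here's a minimal sketch of that token-bucket scheme, one bucket for the node plus an allocated bucket and a burst bucket per partition; the refill policy, numbers and accounting details are my assumptions for illustration, not the paper's exact algorithm:

```python
class TokenBucket:
    def __init__(self, capacity: float):
        self.capacity = capacity
        self.tokens = capacity

    def refill(self):
        self.tokens = self.capacity       # topped back up every second

    def take(self, n: float = 1) -> bool:
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

class PartitionAdmission:
    def __init__(self, allocated_rps: float, burst_rps: float):
        self.allocated = TokenBucket(allocated_rps)   # e.g. 100, the provisioned share
        self.burst = TokenBucket(burst_rps)           # e.g. 30 on top of it

class NodeAdmission:
    def __init__(self, node_rps: float):
        self.node = TokenBucket(node_rps)             # e.g. 300 for the whole node

    def admit(self, part: PartitionAdmission) -> bool:
        if not self.node.take():          # never breach what the node can physically handle
            return False
        if part.allocated.take():         # within the partition's provisioned share
            return True
        if part.burst.take():             # otherwise dip into the burst allowance
            return True
        self.node.tokens += 1             # give the node token back; request is throttled
        return False

node = NodeAdmission(node_rps=300)
p1 = PartitionAdmission(allocated_rps=100, burst_rps=30)
accepted = sum(node.admit(p1) for _ in range(200))
print(accepted)   # 130 requests admitted within this second: 100 allocated + 30 burst
```

The key detail in this sketch is that a node-level token is only spent on requests that are actually admitted, so bursting never eats into what the node can serve for its other partitions beyond the physical limit.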
So my partition can do 100 requests per second but can handle bursts up to 130; it has that limit. Now, for any incoming request you obviously know which partition it belongs to. You get the partition and check whether its allocated bucket still has tokens, i.e. whether it has used up its allocated capacity of, say, 100. If not, you take a token from the allocated bucket and you're done; you are consuming from your own quota, which is good. If the allocated bucket for that partition is empty, you check whether the partition's burst bucket has tokens, which means you are allowed to go up to that extra level, and you take one token from the burst bucket and one from the node's bucket. This way you allow 100 requests per second plus some bursting, and that is obviously configurable: ten, twenty, thirty percent of bursting is allowed per partition, for example. But no matter what, you should never breach the maximum that the storage node can handle; that's the key constraint, and this is how you handle throughput dilution via bursting. Even though a partition is provisioned for 100 requests per second, it is still allowed to go beyond that when the storage node hosting it can take more load. Such a beautiful piece of design; I just love distributed storage. And every second the token buckets are replenished. Problem solved.
Okay, the second way to handle throughput dilution. We talked about bursting, but what if your spikes are long? Bursting helps where you see an immediate spike that quickly dies down, but what if the surge is long-lived, not something that finishes in five seconds but goes on for, say, 30 minutes (a random number, don't hold me to it)? Bursting cannot handle that for sure. These are typically skewed workloads. For example, say you have a DynamoDB table in which the partition key is a date or timestamp. The data you write is partitioned by time, so the most recent data always goes to the most recent partition, while all the other partitions sit mostly vacant with no load coming in. Those are long-lived spikes on one partition; how do you handle them? This is where adaptive capacity comes in: it applies when a table is experiencing throttling even though the table-level throughput is not exceeded. As a customer of DynamoDB you would say: "I have this table, I'm partitioning it by day, I write data day by day, and I've configured my table capacity at 1,000 writes per second." If I'm writing that way and staying within my 1,000, why are you throttling me?
Okay, that was the first way. The second way to handle throughput dilution: we talked about bursting, but what if your spikes are long-lived? Bursting is for when you see an immediate spike that quickly dies down. What if the surge is not short-lived, so it does not finish in five seconds but keeps going for, say, 30 minutes? I'm just taking random numbers, don't hold me to them. Bursting cannot handle that, for sure. These are typically skewed workloads. Let me give an example: say you have a DynamoDB table whose partition key is a date. The data you are putting in is partitioned by time, so the most recent data always lands on the most recent partition. All the other partitions sit vacant with no load coming in, while every new write keeps going to that one latest partition. That is a long-lived spike on a single partition. How do you handle it? This is where adaptive capacity comes in.

The situation it addresses is: the table is experiencing throttling, but the table-level throughput is not exceeded. As a customer of DynamoDB, you would say: hey, I have this table, I'm partitioning it by day, every day's data goes under that day's key, and I have configured my table's capacity to 1000 writes per second. I'm writing within that, so why are you throttling me? I'm paying for 1000 writes per second and I'm writing to your table; I don't know and don't care how you partition it internally, I want my 1000 writes per second on my workload. A very common, and very fair, complaint.

What adaptive capacity does is adjust partition throughput in proportion to the traffic: based on the requests each partition is actually getting, the capacity of the hot partition is increased and the capacity of the others is decreased, because your workload is skewed. With the naive split we discussed first, if my table's capacity is set to 1000 and I have two partitions, each one would get 500. Adaptive capacity says: I will not split it 500/500; I will allocate the throughput in proportion to the requests each partition receives, so the light partition gets 100 while the heavy one gets 900. The global limit of the storage node still holds: the auto admin service will not let too many heavy partitions sit on the same node, because the sum of provisioned throughput of the partitions on a node must stay below what that node can handle. It still ensures that. What you are doing is distributing a table's provisioned capacity across its partitions based on the traffic each one receives over a longer period: it analyzes those metrics and assigns the provisioned capacity per partition accordingly. This way you can handle long-lived spikes with adaptive capacity. Very nice implementation, I just loved it.
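Here is a tiny sketch of the proportional-split idea behind adaptive capacity. The helper name and the observed traffic numbers are made up; this only shows the arithmetic of the idea, not DynamoDB's actual allocator.

```python
def adapt_partition_throughput(table_capacity, observed_load):
    """Split a table's provisioned capacity across its partitions in
    proportion to the traffic each partition has actually been receiving,
    instead of splitting it equally."""
    total = sum(observed_load.values())
    if total == 0:                              # no traffic yet: split evenly
        even = table_capacity / len(observed_load)
        return {p: even for p in observed_load}
    return {p: table_capacity * load / total
            for p, load in observed_load.items()}

# Table provisioned at 1000 writes/s, but almost all traffic hits P2
# (think of a table partitioned by date, where today's partition is hot).
print(adapt_partition_throughput(1000, {"P1": 50, "P2": 450}))
# -> {'P1': 100.0, 'P2': 900.0}
```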
So we have talked about two things now: bursts are handled with token buckets and some burst capacity, with the hard rule that nobody ever exceeds the total requests one node can handle; long-lived spikes are handled with adaptive capacity. The concern is that bursting only helps with short-lived spikes, and adaptive capacity is reactive, so it takes time to kick in. While it is still adjusting capacity across the partitions, the requests hitting the hot table are getting throttled, so the table observes unavailability: you made a request and it got throttled. That's a problem. The customer of DynamoDB is not happy, because the customer says: I provisioned 1000 requests per second, why are you not giving me 1000 requests per second? Do something. So while adaptive capacity kicks in, you are seeing unavailability on tables, which goes against DynamoDB's high-availability goal. Which is where the concept of global admission control, GAC, comes in. We talked about admission control at the storage node; now hear me out on the global one.

The idea is that table-level throughput is tracked centrally. Your customer says: for this table, 1000 requests per second is what I'm provisioning. So your global admission control knows this table is allowed to process 1000 requests per second. This is enforced at the request router level. The request router maintains a local token bucket and periodically gets new tokens from the global admission control, which knows, for each table, what it can handle per second. The request router fetches a batch of tokens from GAC and spends one token per request; when the tokens run out, it fetches the next batch from GAC. This way, at the very first step, your request routers accept 1000 requests per second for that table. Processing is a different matter, but at least the request routers accepted that many. And if your client fires more than the 1000 requests per second you provisioned, the excess is dropped right at the request router level. So at the request router level this problem is solved; when requests go down to the storage layer, everything we discussed on the storage side, the bursting and the adaptive capacity and so on, still holds true exactly as before. You are not limiting things only at the global level: you let the request in, and below that level DynamoDB can retry or stagger the processing as needed. The idea is that global admission control takes care of the incoming requests at the request router level; think of it as a global rate limiter that will not let a client breach its table-level provisioned throughput. The combination of these three mechanisms ensures your storage nodes are not overloaded, you get enough time to process things, and you give the right message to your user: I am letting requests in up to whatever capacity you have provisioned. Together, these three address the problem.
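A minimal sketch of the GAC idea, with hypothetical class names: a central object tracks each table's provisioned throughput, and request routers pull tokens from it in batches and spend one per request. This only illustrates the flow described above, under my own assumptions about batch size and refill.

```python
TABLE_LIMITS = {"orders": 1000}                # the customer provisioned 1000 req/s

class GlobalAdmissionControl:
    """Central view of every table's provisioned throughput; hands out
    tokens to request routers in batches and is replenished every second."""
    def __init__(self, table_limits):
        self.limits = dict(table_limits)
        self.available = dict(table_limits)

    def replenish(self):                       # would run once per second on a timer
        self.available = dict(self.limits)

    def request_tokens(self, table, want):
        granted = min(want, self.available.get(table, 0))
        self.available[table] = self.available.get(table, 0) - granted
        return granted

class RequestRouter:
    """Keeps a small local token bucket per table and tops it up from GAC."""
    BATCH = 100

    def __init__(self, gac):
        self.gac = gac
        self.local_tokens = {}

    def admit(self, table):
        if self.local_tokens.get(table, 0) == 0:
            self.local_tokens[table] = self.gac.request_tokens(table, self.BATCH)
        if self.local_tokens.get(table, 0) > 0:
            self.local_tokens[table] -= 1      # accept and forward to storage, where
            return True                        # bursting / adaptive capacity still apply
        return False                           # otherwise throttle at the router

gac = GlobalAdmissionControl(TABLE_LIMITS)
router = RequestRouter(gac)
print(router.admit("orders"))                  # True until the table's global quota is spent
```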
Okay, now the next interesting topic: durability. What is durability, just persistence? I'll store data on the disk, what's the problem? Is there more to durability? Yes, there is. Durability is not just persisting data to disk; it is much more than that. Durability is about preventing, detecting and correcting any possible data issue, because whatever can go wrong will go wrong and you have to be prepared for it. Take the first example: hardware failure. While you are writing something to the table, what if the node crashes, what if there is a hardware fault, what if the disk fails? Can you recover from that? That is the question you should ask.

What DynamoDB does is use write-ahead logs for durability. We discussed this: a write-ahead log is an append-only log; entries keep getting appended to it, and only after an entry is in the log are the changes applied to the B+ tree (or whatever B-tree variant the storage engine uses). By using a write-ahead log, you ensure the changes are first written to the log file and then applied to the data, so if the node goes down you can recover the in-flight data from the write-ahead log. That is exactly what DynamoDB does, and it is a standard procedure across databases. Now, these write-ahead logs are periodically archived to S3. That is really important, because what if this data node becomes inaccessible? You should still be able to recover.

But here is the edge case: what if I have made an entry in the write-ahead log, but the node crashes before it could be archived to S3? This is where replication comes in. Remember, a table is partitioned and each partition has multiple replicas. The write goes to the leader, the leader appends it to its write-ahead log and sends it to the replicas, the replicas append it to their write-ahead logs and respond to the leader, and then the leader responds to the client. So if a node goes down before its logs were archived to S3, the other replicas of that partition still have the data. Partitions typically have a replication factor of three (I have drawn only two here, but the idea is the same). If a node goes down, a new node is assigned its responsibility, and the auto admin service does that: it sees that the replication factor for this partition is no longer met and asks another node to immediately replicate the data, because with fewer copies you are fragile; if those go down too, you would have no copy of the data at all. So the moment a node goes down, a new node immediately takes over its responsibility and starts copying data.

But here is the clever bit: instead of the new node immediately coming up as a full storage node, it first joins as a log replica. A log replica is a storage node that holds no customer data, only the write-ahead log of the partition. Why does that work? Because of when a write is considered successful: the write comes to the leader, the leader sends it to the replicas, and the replicas simply append it to their log files and acknowledge; they are not required to have applied it to the data yet. Which means that even if we only have log replicas backing the leader, writes still succeed, and that is precisely what DynamoDB does. The first step when a node goes down is to immediately add a log replica for the affected partitions, and then take care of bringing up the full data replicas.
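Here is a toy sketch of that write path, with invented Replica and replicated_write names: a write is acknowledged once a majority of replicas have the record in their write-ahead log, and a log-only replica counts toward that majority even though it never applies the record to a B-tree. This is my own simplification of the mechanism described above.

```python
class Replica:
    """A replica of one partition. A 'log replica' keeps only the write-ahead
    log; a full storage replica also applies the log to its data structure."""
    def __init__(self, log_only=False):
        self.wal = []
        self.log_only = log_only
        self.data = {}                       # stand-in for the B-tree

    def append(self, record):
        self.wal.append(record)              # durability comes from the log
        if not self.log_only:
            key, value = record
            self.data[key] = value           # apply to the data structure
        return True                          # ack to the leader


def replicated_write(leader, followers, record, quorum=2):
    """Leader-driven write: acknowledged once a majority of replicas
    (including log-only replicas) have the record in their WAL."""
    acks = 1 if leader.append(record) else 0
    for follower in followers:
        if follower.append(record):
            acks += 1
    return acks >= quorum


# A crashed storage replica was just swapped for a log replica, so writes
# keep succeeding while the full copy of the data is rebuilt elsewhere.
leader = Replica()
followers = [Replica(), Replica(log_only=True)]
print(replicated_write(leader, followers, ("user#42", {"name": "Alice"})))  # True
```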
That is the idea behind it: you minimize the recovery time, so that if a node goes down it takes no more than a few seconds to recover. Such a beautiful thing.

Staying on durability: data can silently go corrupt, and these are called silent data errors. What are they? Errors that are, well, silent: if the customer wrote "dog", then "dog" is exactly what should end up on disk; it should not quietly turn into "cat". Why would that even happen? Bits and bytes can flip over the network, they can flip while being written to disk; there are so many places where things can go wrong, and all of them need to be handled. So across the entire path, from the customer, over the network, into your infrastructure, across the services, down to the disk where it is persisted, you verify checksums (CRCs, cyclic redundancy checks) at every single step of the request flow. You have to ensure that, no matter what, you never store incorrect data, something the customer did not intend to store. So you check checksums from the customer side all the way to the storage layer, at every single step. This is why building databases is not easy, it is really difficult: these are seemingly trivial things we typically don't think about much, but they are so important for building a resilient data store. On top of that, even the log files archived on S3 have checksums attached to them, to ensure that what was intended to be stored on S3 is exactly what gets stored on S3; in case of a checksum failure, you retry. That is how you make sure there are no silent failures in your data.

Next point: beyond handling silent failures, DynamoDB also does continuous verification. It continuously verifies the data at rest to ensure that all three replicas of a partition hold exactly the same data, again with checksums; almost anything you read about data integrity comes down to checksums. And for what is archived on S3, it periodically downloads the archived log files, recreates a replica from them, and confirms that the data matches. Such high integrity checks DynamoDB has in place. In case you have never heard of checksums, you are just a Google search away: a checksum is a very standard way of detecting that your data is corrupted; it cannot fix the corruption, but once you detect it, you can decide how to fix it depending on your use case.

The next part of durability: aggressive testing. They stress test everything at every step, the storage nodes, the API servers, everywhere; they do explicit failure injection on every part; and they use formal methods like TLA+. You may have heard of TLA+: it gives you formal proofs about distributed transactions, distributed flows, distributed systems. I don't know it in great depth, but TLA+ is a very formal way to verify the correctness of a distributed system.
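If you want to see how simple the mechanism itself is, here is a small checksum sketch using Python's standard zlib.crc32. The payload and the retry policy are made up; the point is only compute-on-write, verify-on-read.

```python
import zlib

def with_checksum(payload: bytes) -> tuple[bytes, int]:
    """Attach a CRC32 checksum to a payload before sending or archiving it."""
    return payload, zlib.crc32(payload)

def verify(payload: bytes, expected_crc: int) -> bool:
    """Recompute the checksum at the receiving or reading end; a mismatch
    means the data got corrupted along the way and should be retried."""
    return zlib.crc32(payload) == expected_crc

data, crc = with_checksum(b'{"pk": "user#42", "name": "Alice"}')
assert verify(data, crc)                       # clean transfer

corrupted = data[:-1] + b"X"                   # a single flipped/garbled byte
assert not verify(corrupted, crc)              # detected -> retry the write/upload
```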
A lot of CS researchers use it to prove the correctness of the systems they are designing. DynamoDB does all of this to ensure correctness, but we won't go deeper into that here.

Okay, the next part: availability. Really important for something as big as DynamoDB. DynamoDB tables are replicated across availability zones. One availability zone is roughly one data center, and one region has multiple data centers; say the Southeast Asia region has five data centers, the Mumbai region has three, and so on. What DynamoDB does is run tests every now and then, think of it like Chaos Monkey. They check resiliency against node failure, rack failure and availability zone failure: if a storage node or an API server goes down, if a whole rack goes down, if an entire data center goes down, the system keeps getting exercised against all of that. They also check resiliency against power outages; they own the infrastructure, so what if someone pulls the plug or the power goes off for some reason? They ensure the system remains functional even then. And resiliency against data corruption, via the integrity checks we just discussed.

Now let's go deeper into how you ensure the availability of a partition, which is where Paxos comes in. Remember, a table is split into partitions and each partition has multiple replicas; the three replicas of a partition form a Paxos group for consensus. One of them is the leader. If the leader goes down, one of the other nodes triggers a leader election, one of the remaining two becomes the leader, and it starts accepting writes and strongly consistent reads. A partition is said to be healthy if it can achieve a write quorum, which is a write majority. When a write comes in, it goes to the leader, the leader sends it to the other replicas and waits for acknowledgments; the moment a majority of the three have confirmed the write, it moves forward. So even if one of the nodes goes down, your writes are still accepted, because you still have a majority. When that node comes back up, it takes its place again: it checks what it missed, applies those changes locally, and resumes its responsibilities. This is why consensus matters so much for partition availability; this is where Paxos and Raft come in, and DynamoDB uses Multi-Paxos. There is a Paxos group per partition, which means that if you have a thousand partitions, there are a thousand Paxos groups, one for each partition and its set of replicas. Okay, that's availability through consensus.
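A tiny sketch of the write-quorum arithmetic, with made-up replica names: a partition with replication factor three keeps taking writes as long as two members of its Paxos group are up.

```python
def write_quorum(replication_factor: int) -> int:
    """A write needs acknowledgements from a majority of the replicas."""
    return replication_factor // 2 + 1

def partition_is_healthy(replica_status: dict) -> bool:
    """A partition keeps accepting writes as long as enough members of its
    Paxos group are up to form a write quorum."""
    up = sum(1 for alive in replica_status.values() if alive)
    return up >= write_quorum(len(replica_status))

# Three replicas, one down: 2 >= 2, so writes are still accepted.
print(partition_is_healthy({"replica-a": True, "replica-b": True, "replica-c": False}))   # True
# Two replicas down: 1 < 2, so the partition cannot take writes until a
# replacement (for example a log replica) is brought in.
print(partition_is_healthy({"replica-a": True, "replica-b": False, "replica-c": False}))  # False
```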
Now a really interesting part: gray network failures. Gray network failures are failures that are not really failures; you think something failed, but it did not actually fail, there was just a connectivity problem. Say a follower thinks the leader is down because it stopped receiving the leader's heartbeats; the leader is not down, there is simply a network issue between them. This is a huge problem: the follower thought the leader died, but the leader did not die, the follower's messages just could not reach it. What is the consequence? The follower initiates a leader election even though the leader is perfectly fit and fine, still accepting writes from elsewhere and getting things done. And if that follower triggers an election and even becomes the leader, you have a split-brain problem. This is a tricky, very common problem in distributed systems. How do you solve it?

The solution is very simple and very interesting. Before any follower initiates a leader election, it first checks with the other followers: do you also think the leader is down? All of this happens at the partition level, remember, not at the node level: I have three replicas for a particular partition, one leader and two followers. If one replica of that partition stops hearing from the leader, instead of immediately triggering a leader election it asks the other replica the same question. Distributed systems are very much like humans, FYI. It cross-checks with the other replicas to see whether the leader is genuinely dead; only if they also say the leader is dead is a new leader election kicked off, otherwise not. This minimizes the number of false positives, which is precisely what you want.
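Here is a rough sketch of that "ask your peers before you panic" check, with invented heartbeat counters standing in for real failure detection. The idea is simply that a follower starts an election only if the other replicas of the partition also cannot reach the leader; the names and thresholds are mine.

```python
def leader_unreachable_from(replica) -> bool:
    """Stand-in for 'this replica has missed heartbeats from the leader'.
    In a real system this would be a timeout on the leader's heartbeat."""
    return replica["missed_heartbeats"] > 3

def should_start_election(me, peers) -> bool:
    """Before triggering a leader election, ask the other replicas of the
    same partition whether they have also lost contact with the leader.
    If any peer still sees a healthy leader, this is likely a gray network
    failure local to us, so we do NOT start an election."""
    if not leader_unreachable_from(me):
        return False
    return all(leader_unreachable_from(peer) for peer in peers)

me = {"missed_heartbeats": 5}               # I cannot reach the leader...
peers = [{"missed_heartbeats": 0}]          # ...but my peer still can.
print(should_start_election(me, peers))     # False -> no spurious election
```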
Okay, next: measuring availability. There are a few very interesting things here. You do the basic monitoring of your tables, the read requests per second, the write requests per second, the usual availability measurement. But here is one very interesting thing they do. DynamoDB is proprietary and run by Amazon, it is not self-hosted, so Amazon wants to measure what the customer is actually seeing. Who is the customer of DynamoDB? Someone running on AWS infrastructure. So for every availability zone they have, they set up canary apps. A canary app is a dummy application that fires requests to DynamoDB from that availability zone. Say a customer has hosted their application in ap-southeast-1a, one particular data center; there would be a canary application running there, making calls to DynamoDB from that same availability zone. Whatever the customer sees, this canary app sees too, not for every individual use case, but one canary per availability zone. The canary keeps registering the response times, uptime and latencies it observes. So on top of DynamoDB's internal backend measurements, you are measuring what the customer observes, not by injecting anything into customer code, but by running a simple application in the same availability zone as your customer. Simple, but such a brilliant way of thinking. Being customer-obsessed is such an important part of Amazon, and we can clearly see in this design how they think about it.

One more really interesting thing: DynamoDB is not a standalone service. It depends on a lot of other services internal to Amazon, and all the services DynamoDB depends on should be more available than DynamoDB itself; otherwise, if they go down, DynamoDB goes down with them. How do you ensure that? Take AWS IAM and AWS KMS as examples: IAM is AWS's identity and access management, and KMS is the key management service for encryption and decryption keys. DynamoDB has listed every service it depends on and, for almost every one, asked: can I still function if that service is down? For example, DynamoDB should still function even if AWS IAM is down. How? Caching. Instead of calling IAM on every request, DynamoDB caches the authentication results and tokens locally for some time. So even if IAM is down for a while, DynamoDB can keep functioning until that local cache expires. Once the cache expires, it has to call IAM again, and if IAM is still down at that point you do have an outage, but until then you have minimized the outage window. DynamoDB does this for the services it depends on: yes, those services have to be more available than DynamoDB no matter what, but DynamoDB also ensures that it keeps running even when its dependencies are dead for some time; not for hours or days, but for minutes, and AWS services have very high availability anyway. So it locally caches encryption keys, authentication tokens and so on, so that it does not have to call out on every request, which is not only a performance boost but also minimizes your outage.
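A small sketch of that caching idea, with a hypothetical CachedAuthorizer and a simulated dependency outage: while a cached entry is still fresh, requests are served without touching the dependency at all, so a short outage of that dependency stays invisible. Names, TTL and data are my own assumptions.

```python
import time

class CachedAuthorizer:
    """Cache results from a dependency (think an identity or key-management
    service) so that it does not have to be called on every request."""
    def __init__(self, fetch_fn, ttl_seconds=300):
        self.fetch_fn = fetch_fn             # the real call to the dependency
        self.ttl = ttl_seconds
        self.cache = {}                      # key -> (value, fetched_at)

    def get(self, key):
        entry = self.cache.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]                  # served locally: works even if the
                                             # dependency is down right now
        value = self.fetch_fn(key)           # cache expired: the dependency must
        self.cache[key] = (value, time.monotonic())   # be reachable again
        return value

def call_identity_service(key):
    raise ConnectionError("dependency is unavailable")   # simulate an outage

auth = CachedAuthorizer(call_identity_service)
auth.cache["token-123"] = ({"allowed": True}, time.monotonic())   # previously fetched
print(auth.get("token-123"))                 # still answered from the local cache
```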
And the final thing about DynamoDB: deployments, or how DynamoDB deploys DynamoDB. Why would it need deployments? Because DynamoDB is also software: new features, bug fixes, performance optimizations. DynamoDB deploys without a maintenance window: when DynamoDB deploys DynamoDB, there is no outage, no downtime, no impact on performance, no impact on availability. The rollout happens with no hiccups whatsoever. And remember, DynamoDB is actually multiple services, the auto admin service, the request routers, the metadata service and so on, and all of them get deployed, still with no impact on performance or availability. How do they ensure that? A few very interesting things.

First: to deploy with confidence you need a strong rollback strategy. Have a plan B in case things go wrong. If you know that a deployment caused the problem, roll back immediately; how quickly you can roll back determines how small your outage will be. Next, canary deployments. A canary deployment means deploying to one server first, observing it, and only if everything looks good rolling out to the other servers; otherwise you don't. Plus automatic rollback: if you see elevated error rates in the new deployment, roll back automatically, without waiting for an on-call engineer to come and trigger the rollback. Relying on machines to do this is really important.

Now a very interesting challenge comes in: deployments are not atomic. Say you have four servers, and while deploying, the rollout succeeded on two of them but failed on the other two. Deployments are not atomic in nature, so you have to ensure that the code you write is backward compatible, meaning that even if half the machines run new code and half run old code, nothing breaks. But can you always make backward-compatible changes? Not really. Let me give an example: say you are introducing a new type of message in a message broker, so you need consumers that can consume that message type. If half your machines are running the new code and publishing this new type of message, while the other half are still on the old version and don't know how to consume it, you have a problem; the change is not backward compatible. The way DynamoDB handles this is really beautiful; they call it read-write deployments. The idea is that instead of deploying all the changes at once, you split them into read changes and write changes, and you never deploy the write changes until there is a reader available to read them. In simple terms, for the message broker example I gave: first deploy the consumers capable of handling the new type of message, and only then deploy the producers that produce it. Now you can never end up in the situation where a producer has produced a message of a new type that your consumers don't know how to consume. Simple. Splitting your deployment into reads and writes is essential: even while only some machines are sending the new version of the message, readers on all machines already know how to read that type, and the change gradually rolls out everywhere anyway. So that is what DynamoDB does: the read deployment goes first, ensuring all machines are capable of handling the new type of data, and the writers are pushed as a second deployment. This way, even when a change is not backward compatible, you still don't face any issues, any outages, any concerns at all.

And that is all about the DynamoDB research paper.
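And a toy sketch of the read-write deployment ordering, using the message-broker example with made-up message formats: the consumer that understands both versions ships first, so old and new producers can safely coexist during the rollout.

```python
def consume(message: dict) -> str:
    """Reader side, deployed FIRST: it understands both the old and the new
    message format, so it is safe no matter which producers were updated."""
    if message.get("version", 1) == 2:
        return f"handled v2 message for order {message['order_id']}"
    return f"handled v1 message for order {message['id']}"

def produce_v1(order_id: int) -> dict:
    return {"id": order_id}                       # old writers still emit this

def produce_v2(order_id: int) -> dict:
    return {"version": 2, "order_id": order_id}   # enabled only in the second
                                                  # (write) deployment

# During the rollout, old and new producers coexist; because every consumer
# was upgraded in the read deployment, both messages are handled fine.
print(consume(produce_v1(7)))
print(consume(produce_v2(7)))
```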
It was really intense, quite a fun one and a half hours straight, but such a beautiful piece of engineering. I just loved going through this paper; you learn so much by going through this one paper in depth and trying to understand every single step of what happens behind the scenes. So yeah, I hope you found it interesting and amusing, and I hope it sparked some curiosity about how distributed systems are built, and more importantly about the factors we typically don't think about much: deployments, integrity, high availability, dependencies on external services, replicas, Paxos, leader election, outages, all of this at scale. All of these things matter. Again, a superb paper; I loved it and I would highly recommend you read it whenever you find time. I tried my best to go through every single aspect of it, and I hope you got value out of it. That's all I wanted to cover in this one. I hope you found it interesting. I'll see you in the next one. Thanks!