Key Principles of Data-Intensive Applications

I've read five different system design books: Designing Data-Intensive Applications, Alex Xu's Volume 1 and Volume 2, Donne Martin's System Design Primer. I've done mock interviews too. In today's video we're going over Designing Data-Intensive Applications and what all that stuff means. I'm just going to go through the chapters and maybe draw some diagrams to explain them.

This chapter starts off with Kleppmann introducing what a system is, essentially, and the idea that data is probably the most important, or the hardest, challenge we have right now, so designing something proper is going to be important. He mentions buzzwords like AWS and NoSQL, but it's really all about trade-offs and understanding when to use what in which scenario.

There are three things you want from software: you want it to be reliable, you want it to be scalable, and you want it to be maintainable. What do these mean? Start with reliability. You want to make sure the system does what it says and that it's consistent. If somebody asks for something, you don't want it to give back random stuff, because people aren't going to want to use an application like that. There are also hardware errors and things like that, but ideally we just want to prevent errors from happening. Then there's describing the current load, and latency versus response time. I'm not sure this really matters yet; maybe in the future.

Then scalability: are we going to be able to scale this to millions of users? If we have one person who uses the website, sure, it works for that one person. So what is scalability? Scalability is the idea that if one person
accesses something, sure, it works for that one person, but what about the other 50,000 people? Will it work for them? So it's this idea of designing a system that can handle millions of users, tens of millions of users, billions of users. TikTok, you know, is getting up to billions. This is probably the hardest part of a system design interview, as I mentioned before. Reliability feels like the object-oriented programming approach, where you want to write good tests and make sure there are systems in place so that when somebody actually does something, you can roll back. But scalability is the hard thing, because a lot of problems come up when you need to build out several servers.

For most startups it's just: do reliability, then do maintainability. Maintainability is the idea that we can evolve our code over time and it's not going to be a massive hassle of spaghetti code where you make a feature and it breaks other things. We want this to be long-lasting, ideally.

Those are the three tenets of this book, although I think it's mostly focused on scalability, because that's just a much harder thing to actually do. There are a lot of books on how to write good tests and things like that, and a lot of LinkedIn gurus talking about all this stuff, because it's the low-hanging fruit, in my opinion.

Okay, cool, so let's go into number two: how do we store data? This is going to be the biggest thing. What database do we want to use? Do we want to use a database at all? If you read Chapter 2, this is what he's going over. We have relational databases versus document-based NoSQL databases. How are
you going to know what the pros and cons of each are? How do you know which one to use? The TL;DR is that you want to use relational for ACID properties, I feel like: when you have data that you really want to make sure is correct, or when you have complex joins and things like that. The document model, or NoSQL-type models, help with scalability, or fit when you just don't have those problems.

So the notes say: one-to-many relationships, and joins being hard in a document model. These are more historical, I think. Schema flexibility: with documents you don't need to enforce a schema, while in relational the schema really matters and you can't be as flexible. Then data locality and query languages.

I've heard that graph databases don't even matter that much. Apparently a graph database is just a NoSQL database with fancier syntax you can use. Then there's the property graph model. Again, I don't think these are super relevant. I don't really hear them talked about, and I also don't see them show up in designs that much.

Oh, another thing I want to talk about: an interesting thing about system design is that it's way more like biology, in that you can actually learn it just by watching lectures or reading books. Theory gets you a lot further in system design than in data structures and algorithms, which is like math or physics, where you have to practice solving problems in order to get better. That's just something I wanted to bring up.

So here we've got Chapter 3, which goes deeper into actually storing the data. What are the ways we can actually store it? We talk about SQL and NoSQL, but what even is that, underneath the hood? Here we get into the actual methods. Well, we can just store data as a file, so we, you know,
literally pop open a .txt file and just start writing records: this happened, this happened, this happened. But then what happens if we want a backup? What happens if we want to search for something individually? Now we've got to search the entire file, and that's an O(n) operation, which is obviously slow. So how can we do better?

We've got write-ahead logs, which here means an append-only log where we keep track of each record's offset and write based off of that. So this is version two: we write, write, write, but when we want to go to a given offset, we can jump there immediately, in O(1).

When I'm talking about these write-ahead logs, you kind of just have to know this stuff, and that's something you don't really see a lot in DSA. If you take notes on it and study the notes, you'll probably be fine. But you should do mock interviews for sure; that's really going to seal the deal in the end, because there's a lot of nuance in the interviews. Sometimes an interviewer wants you to talk about write-ahead logs, and other times they don't.

Then we have SSTables and LSM-trees. This is how NoSQL databases typically work. Not all NoSQL, right, it depends; NoSQL is pretty flexible. But this is how Cassandra works, which is a pretty famous one that uses SSTables. What is an SSTable? I also don't really know. I just know it's good for writes, honestly. That said, these notes probably explain it.

Then we've got B-trees, which are good for reads. MySQL uses a B-tree, and it's also what DynamoDB uses. But it depends, right: for write-heavy requests you get SSTables, and for read-heavy requests, B-trees. And then we get other data structures too.
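To make the offset idea concrete, here is a minimal Python sketch. It is not taken from the book; it is just one way the "append to a file and remember where each record starts" approach could look, roughly the append-only log plus in-memory hash index that Chapter 3 describes. The class name `AppendOnlyStore`, the tab-separated record format, and the file name are all made up for illustration.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store: an append-only log file plus an in-memory offset index."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the newest record for that key
        open(path, "ab").close()  # create the file if it does not exist yet
        self._rebuild_index()

    def _rebuild_index(self):
        # Crash recovery in this sketch: replay the whole log once.
        with open(self.path, "rb") as f:
            offset = 0
            for line in f:
                key = line.split(b"\t", 1)[0].decode("utf-8")
                self.index[key] = offset
                offset += len(line)

    def put(self, key, value):
        # Assumes keys contain no tabs and values contain no newlines.
        record = f"{key}\t{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # where this record will start
            f.write(record)
        self.index[key] = offset

    def get(self, key):
        # One dictionary lookup plus one seek, however large the file is.
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline()
        return line.rstrip(b"\n").split(b"\t", 1)[1].decode("utf-8")

    def scan(self, key):
        # The "version one" approach: read every record, last match wins. O(n).
        result = None
        with open(self.path, "rb") as f:
            for line in f:
                k, v = line.rstrip(b"\n").split(b"\t", 1)
                if k.decode("utf-8") == key:
                    result = v.decode("utf-8")
        return result


path = os.path.join(tempfile.mkdtemp(), "data.log")
store = AppendOnlyStore(path)
store.put("user:1", "alice")
store.put("user:2", "bob")
store.put("user:1", "alicia")               # an update is just another append
print(store.get("user:1"))                  # alicia
print(AppendOnlyStore(path).get("user:2"))  # bob, after rebuilding the index
```

The trade-off shows up in the sketch: writes and single-key reads are cheap, but the index has to fit in memory, and nothing here helps with reading a range of keys.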
And then we also have analytic databases. If you have to run queries on things multiple times, that's another type of database you're probably going to want. Somebody designed something that's really good at storing data for queries. Let's say you want to get all the people in the United States: you just store that data in the table, so you can query it repeatedly. I think they're called column-based tables. You can have over 100 columns; I guess that's somewhere in the book and I don't have it written here. But essentially, instead of querying rows like you normally would, you can store the queries pre-computed and keep that data as its own kind of row, so to speak. Then you don't have to check every single row, or every single column, to see who's from the United States and who has this other thing. That's probably a really bad explanation, honestly. I need to actually study back up on that.

Okay, so that's chapters one through three, and we did it in 15 minutes. Now we're going to go into more advanced stuff.

So the first chapter of Designing Data-Intensive Applications goes over reliability: what is a system, what is a good application, and how do we know? He boils it down to three concepts: reliability, scalability, and maintainability. Reliability is the idea that if your program throws an error, it's going to be some kind of consistent error, where users know what is wrong. If somebody tries to send money to their friend and it throws an error, well, one, it shouldn't really be throwing errors, but if it does, they can at least know that, and the system isn't
just going to completely die because of that one case. Number two is scalability: okay, sure, your program works for one person, but can it work for 1 million people at the same time? And then maintainability is the idea of: can it evolve over time? If you make a change to it, is there a ton of spaghetti code where you fix one thing and it breaks everything else? So that's Chapter 1.

In my opinion, the core part of his book is scalability. We talk a lot about reliability, which I misspelled, and maintainability, which I also misspelled, and I think I misspelled scalability too. Bro, what happened here? Maybe I didn't, actually. I don't know. But yeah, the hard part is scalability. Everybody already talks about the other stuff, test-driven development and all that, but nobody really talks about scalability. Most people also don't really need to know about it, so that's probably the reason.

So we go into when you should use SQL versus NoSQL. Just briefly: use SQL when you want ACID, or really strict guarantees. ACID meaning that consistent things happen: there are no dirty reads or dirty writes, where people are reading data they're not supposed to be reading, and no race conditions and things like that. Think of a banking application, where you don't want somebody to accidentally lose $1,000 just because the database had a hiccup. And then graph models.

Okay, how do you actually store the data? The third chapter is: we talked about the pros and cons of NoSQL and SQL, but what actually is that inside the database? What you notice here is that the base
thing is that we just store it as a .txt file. Now the problem is that this doesn't really scale. If we've got a lot of data, we're doing an O(n) scan whenever we want to look up some specific piece of data. If we want to know what was said here, we've got to go through the entire thing, and that's very expensive. It's just not going to work for large-scale applications where we've got lots of text files.

The next level up is that you can have a write-ahead log, a WAL: every write gets appended to the end, but we also keep track of the offsets. So if we know this record is located at, say, index 2, we can jump to index 2 directly and read it. Now it's O(1) if we know the offset for that data. This is nice. You'll see this in message queues and such, although I don't think he mentions that in his book at this point.

Then there's the question of what you do if it needs to be recovered. What happens if the server crashes while it was writing? Well, you can have copies, you can take snapshots, or you can keep track of every single thing you've written in a very long log, and when you come back up you just redo everything up to your end point.

So the notes say: appending is fast, crash recovery is simple, and old segments can be removed. On the other side, the hash table must fit in memory, and range queries are difficult. Range queries meaning: yes, it's O(1) for one entry, but what if we want multiple? We have to kind of jump the needle several times, I think, to actually retrieve them, which is expensive.

Cool, so then we move into SSTables and LSM-trees. Now again, I don't remember this fully. I truly don't remember it, but
I just know that these are slow at reading, I think because the same key can end up in multiple copies across segments, so a read may have to check several of them, or something like that. I just know they're slow when it comes to reads. B-trees are good for reads, and a B-tree is what SQL databases use. So instead of having a flat file like this, we can create some kind of tree structure and store the data like this (why am I drawing arrows?), and then when we want to retrieve something, it's O(log n). Look at that: log n.

What other structures are there? Actually, these are really the core ones. I'm sure there are others, like primary keys, and there are also column-based indices, which is another thing I wanted to talk about.

Oh, and then there's also full-text search, so Elasticsearch. Let's say we've got Twitter and a ton of documents, and you want to search for text. What you can do is use, I think it's called an inverted index, I don't remember exactly. Essentially the way it works is you say: if I'm looking for "cat", cat is in document one, document two, and document three. So whenever we want to know which documents contain "cat", we can just pull that up. You do that for every word, "cat" or "hat", and maybe "hat" is in doc zero or something. This is a faster way to search. Say we want to search all tweets that say "the man": we can go through our index and it'll tell us which documents have all those words. That's Elasticsearch, which is a very popular service for this full-text search thing. I don't think he mentions it in here, but it's just something
you can do. The second thing here is analytics-type databases. The notes say a data warehouse is typically a relational database for queries, with one central table that branches off into dimension tables. I might be wrong here, and maybe this is something I should look up, but I think it goes like this. You have a typical database with your rows, which are all the entries, and your columns. Typically, if you want to do a query, you check every single row, which is kind of big; that's probably O(n), with n being all the rows you have. What you can do instead is store those query results as their own column. So you take all of these and say, maybe these are the US people or something. I think this is the idea, and then you just store them. I think these are called column-based databases, although I should probably brush up on that. Essentially the idea is that you can speed up your analytics queries. If you want the people from the USA, the people from India, the people from Russia, then instead of constantly doing this O(n) thing when you want to run it a million times, you can pre-compute it and store the result. I believe that's how it works; that's my understanding of it.

Okay, cool, so that's chapters 1 through 3 of DDIA. Let's go into the next one.

The cool thing about this one is that I actually read it pretty recently. Chapter 4 is encoding and evolution. We talked about maintainability, and that's what he gets at here with evolution, which is the idea that when we make a change to
the code, it's not going to break stuff in the future. The other, bigger thing is encoding. Honestly, this chapter doesn't feel very important when it comes to system design, but it's kind of useful, I guess, to know how computers are even doing this stuff.

With encoding, he's talking about the idea that when you're coding something in Python or C++, the data objects you use are specific to that language. You can have some service doing one thing, but then it has to send data to another service that's written in Java. Maybe we have an application with an array in C++, or a dictionary in Python, and we have to convert that into something Java can read. That process is called encoding, and it's what allows us to send data to multiple places. We could even do it for another service, some other microservice.

There are multiple ways of encoding. JSON is the most common, I think. And there are also multiple pathways for sending the data: HTTP requests, message queues, and gRPC. These are how we send the data to other services, and over any of them we can send it as JSON, XML, or pure binary. That's how this process works when you have C++ code and want to send data to servers, or you have a JavaScript web application and want to send data to the server: you encode it and change the data into a format the other side can read. And instead of just Java, we could be sending it to a DB.
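As a rough illustration of that encode, send, decode round trip, here is a small Python sketch using the standard `json` module. The `encode` and `decode` helper names and the `order` payload are invented for the example; a real service would send `wire` over HTTP, gRPC, or a message queue rather than just passing it to a function.

```python
import json


def encode(message: dict) -> bytes:
    """Turn an in-memory Python object into bytes any language can parse."""
    return json.dumps(message, sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> dict:
    """What the receiving service would do with the bytes it got."""
    return json.loads(payload.decode("utf-8"))


# A Python-specific structure: note the tuple, which JSON has no type for.
order = {"user_id": 42, "items": ("book", "pen"), "total": 19.5}

wire = encode(order)     # bytes that could go over the network
received = decode(wire)  # the other side's view of the data

print(wire)
print(received["items"])  # ['book', 'pen'] : the tuple came back as a list
```

The tuple turning into a list is the point: the encoded form only carries what the format can express, so language-specific types don't survive the trip.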
Send the data to the DB, and then we have encoding and serialization. This is actually a LeetCode problem too, a little bit, though the real thing is probably much more difficult. We encode the data into bytes. Kind of what I was talking about: you have things like pickle and Serializable, and if you use those, I guess you're tied to one language, and you have to decode bytes that people can sneak malicious code into, which will just get run straight up. And there's no real versioning for these libraries.

So we get the JSON, XML, and binary variants. XML, JSON, CSV: these are the ways people typically store data. There are a lot of complaints about each one, but there's not really a better alternative, I guess, is what Kleppmann says in his book. He brings up some specific problems: there's ambiguity around the encoding of numbers, where you can't distinguish between ints and floating-point numbers, and you lose accuracy with large numbers. But generally it's good enough.

Then we have the binary stuff, which is going to be fast; that's the benefit, sending the data fast. BSON, BJSON, BISON, Smile: this stuff I don't care about, it's too much. He talks about how it actually works under the hood. I don't think this is relevant for a system design interview; I don't think any questions really ask about this kind of stuff, so I would just skip it. But if you're curious about how these binary encoding libraries work and how they take a byte and store it, he talks about the structures you can create.

Then we have the data flows: HTTP, RPC, message queues, and database calls. For HTTP calls we have REST versus SOAP. Then remote procedure calls: I'm not really sure why people use them, but apparently, I don't know, it feels like it's just
another thing. I truly have no idea why you would use RPC over HTTP. Most of the time I think it's just HTTP; that's how you send data between things. And then message queues. Anyway, that's the encoding section. It's not really a whole lot. It's just kind of interesting seeing how you actually send data to a database, from your Python code to your database, from your Python to your server, and things like that.

So, replication. How do you set things up so that when something happens to your system, you don't lose the data permanently? I believe that's the question replication is asking. It also helps with scalability: if you have too high a server load, you can spread it out.

One of the first things Kleppmann mentions in this chapter is the idea of leader and follower: you have one leader, and it sends the data to the followers. If the leader fails, the data isn't lost, because there are copies of it. Leader-follower is also nice because if you have a ton of read requests, clients can read from the other servers, the followers. So your only bottleneck is your write throughput. I'm not sure what the number is; I think it might be something like 20,000 writes per second, but I'm not sure how fast a DB can write.

Let me make sure this is correct. Replication is useful for when one system goes down: all the other ones are still going to work. And then the notes say the data is not going to change over time. So, one, you don't get a single point of failure, and I guess it's useful if your data is not constantly changing, although I feel like if your data is changing it's
kind of fine. I'm guessing this is talking about write-heavy stuff, maybe. I don't know, actually. So: distributed databases must handle the challenge of data changing over time. We've got three different strategies, single-leader, multi-leader, and leaderless, which each have a lot of problems.

Leaders and followers, we kind of talked about. There's synchronous versus asynchronous replication; you probably want async, would be my guess. Then you can add new followers. If your system's already set up and you want to add a new one, you're either going to have to stop the system briefly, or, I guess, you take a snapshot, start the new node from it, and have it catch up from wherever the snapshot was. So the new node clones the snapshot.

Then node outages. What are you going to do if one of these nodes crashes and dies? Now you have to make a new one, and how? Again, we keep a log of all the changes, and when you restart the node you have it read the log. If the leader fails, that's a bigger issue, because now we're no longer accepting writes, and we also have to handle swapping: if this one crashes, we have to put this other one in its place. This kind of stuff is way beyond a mid-level engineer. I don't think swapping nodes will ever be asked of you in an interview. But you have to replace the node, and then you have to hope that you didn't lose any requests, or that the requests are still in the log or something. Same thing with split brain, where two nodes both think they're the leader. So yeah, like I said, I think this is out of scope for mid-level, but you
know, it's just something you can know: what do you do if a node fails? You're probably not going to be asked it, but what do you do? You have to kill it, and you have to have some replication log to bring the replacement up to date. And what do you do with split brain, where this node looked dead temporarily because it failed the heartbeat checks, so we swapped this other one in, but we actually messed up because the first one was still alive? Now we've got two nodes that are both taking in data as if they were the leader. That's the problem with this kind of system: you have to handle all of this when you're trying to implement it on your own, and it's very annoying. This is why we have cloud providers like AWS: they just wipe away all of that stuff. That's why it's so popular, because this is an annoying thing to have to code for scalability.

Okay: implementing replication logs. Write-ahead logs, logical logs, sure.

Replication lag: we can just wait. A follower isn't always going to catch up immediately. There's always some delay when you're sending data somewhere, especially over a network. And sometimes things just get lost, so you'll have messages that are simply gone, unless you're using a message queue, maybe, but then you're introducing delays with the message queue. So you just wait: eventual consistency. Reading your own writes, monotonic reads, consistent prefix reads: I don't know any of this stuff, and I'm probably not going to bother with it.

Multi-leader: you can have multiple leaders. Maybe you have multiple data centers, or you want to reduce latency. This lets you have higher write throughput, since more than one node can accept writes. So instead of
20,000 per second you can actually go to 40,000, and if you start sharding, that can help too.

Then we have leaderless replication. Leaderless is supposed to be easier when it comes to distributed systems because you don't need consensus on who the leader is, apparently. Dynamo-style databases like Cassandra use this. So that's the idea of replication.

If you want an easier version, obviously there's Donne Martin's primer. If you go over his replication section, it's master-slave and then master-master, and (where does he have this?) his is just one page. Notice how I spent, I guess, 10 minutes talking about this stuff, but his is two paragraphs. This is what I mean when I say depth: DDIA goes way more in depth, although you don't need to know all of it. Nobody's probably going to test you on what you do when a node fails or how a replication log works. You can just go through his and say, here are the pros and cons, and then federation.

With that, let's go into Chapter 6. Although, you know what, I don't know the Donne Martin one that well; maybe I should just look at it. Leader-follower: additional logic is needed to promote a slave to a master. Then the disadvantages of replication, which are the same for both, I think. Replication is just going to be slow, probably. There's potential data loss, writes are replayed to the replicas, with a lot of writes the read replicas get bogged down, and it requires additional hardware and complexity. A disadvantage of master-master is that you need a load balancer. Then there's further reading and so on.

But yeah, Chapter 6 is about partitions. The idea behind partitions is that you can have a database with some entries, but you can query things way faster. Again, we don't have to do O(n); we don't
have to go O(n), searching everything. We can group the entries based on some kind of criteria and split them. So maybe A to C is here, and we have another partition that's D to F, and so on. Now whenever we want to search for A, it's a lot faster than O(n), because we don't have to search everything; not everything is jumbled together. So that's the idea behind partitioning: you can increase speed. In some cases you can have entire databases that are just dedicated to the letter A. Nice, I love it.

Cool. There's also the issue of hot partitions. This A, B, C thing isn't always practical, because maybe the A's take over and that partition ends up bigger; it's not evenly distributed. I think this is where the hash ring comes in.

Okay, sharding. It allows you to break up data into partitions, where each partition is essentially a small database. You can distribute a large data set across many processors, leading to faster lookup times. We're not hitting O(n) anymore; we're hitting probably O(log n), I guess, but it's much faster.

So we can do key-range partitioning: just group the keys. You can get hot spots, though, and that's another issue that's, again, out of scope for mid-level, I think, or at least for a low-bar mid-level. You can have hot keys, and then you need to figure out a way to handle that. So you can use a hash function, consistent hashing.

Then we have secondary indices. Indexing is a thing, but you can also have secondary indices, and you can also have sort keys in databases. But again, this is out of scope. You can have multiple indices.
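Here is a small Python sketch of the two ways of picking a partition mentioned above: grouping by key range, and running the key through a hash function. The boundary letters, the partition count of five, and the sample names are all invented for the example. The hash version is plain hash-modulo-N, which is simpler than a consistent hashing ring and should not be read as one.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

# Hypothetical range boundaries: a-c -> 0, d-f -> 1, g-m -> 2, n-s -> 3, t-z -> 4
RANGE_BOUNDARIES = ["d", "g", "n", "t"]


def range_partition(key: str) -> int:
    """Key-range partitioning: route by the first letter of the key."""
    return bisect_right(RANGE_BOUNDARIES, key[:1].lower())


def hash_partition(key: str, num_partitions: int = 5) -> int:
    """Hash partitioning: route by a stable hash of the whole key.

    hashlib is used because Python's built-in hash() for strings changes
    between processes, which would route the same key differently on restart.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# A skewed workload: most keys start with "a".
names = ["aaron", "abby", "adam", "alice", "amy", "andy", "bob", "zed"]

range_counts = Counter(range_partition(n) for n in names)
hash_counts = Counter(hash_partition(n) for n in names)

print("range:", dict(range_counts))  # partition 0 holds 7 of the 8 keys
print("hash: ", dict(hash_counts))   # depends on the hash, not on first letters
```

Range partitioning keeps neighboring keys together, which is good for range scans but piles the "a" names onto one partition. Hashing ignores the first letter when placing keys, at the cost of scattering ranges across partitions.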
and then there's rebalancing partitions. I apparently did not go through this part that much, so look up consistent hashing if you're curious about that kind of stuff.

Okay, cool. Now we're back to the stuff that I studied more recently. I should probably go through partitions again, but I don't know how much I want to do that. Anyway, we're going to end this chapter here. Maybe we can look at the Donne Martin one too while we're at it. Partitions... where are my partitions? Partition tables... I'm finding basically nothing on partitioning there. Anyway, that's about all you would need to know. I don't think anyone is going to ask anything more complex than that in a mid-level interview.

These other chapters are, I believe, higher than mid-level, so you don't really have to know this stuff, but if you're curious, it's good to know. We'll continue with chapter 7: transactions. This entire chapter deals with preventing race conditions and dirty reads, where you essentially have multiple states of your database. Say you have a database with all these entries, and one query is going here, here, and here, changing stuff, so now that data is dirty. Then another query goes through some of the same entries at the same time. Now your system is kind of screwed up. So how do databases prevent that, and what algorithms and strategies can they use? A lock is the obvious one: you lock it, say nobody else can access it, and everyone else has to wait. But that's very time-consuming, and your system will have worse performance if you do that. So what other ways can we fix this?

Many things can go wrong. Hardware can fail: you were writing something, it fails partway, the write never finishes, and now you have to roll things back. And like I just mentioned, multiple clients can write to the database at the same time. In order to be reliable, databases have to deal with all of this. A transaction is just a unit of several reads and writes grouped together: either it succeeds and you commit, or it fails and you abort. Almost all relational databases support transactions.

And that's ACID. One thing Kleppmann mentions is that ACID isn't even defined that well. "Consistent" means almost nothing; there are no metrics for it. It's just this word people throw out to sell things: oh, this is consistent, oh, this is eventually consistent. But what does that even mean? It means that it will eventually catch up, but is there any time limit? There's a big range of numbers you could get. And even more vague than ACID, we have BASE, which is what NoSQL does: basically available, soft state, eventual consistency. What does "basically available" mean? Can you give me actual numbers on it? Nope.

Okay, so what is ACID? Kleppmann goes through every single letter. Atomicity: it's all or nothing. If something fails partway through, the transaction is aborted and its writes are rolled back, so there are no half states. Consistency: I believe this one means that certain statements about your data, the invariants, always hold. Let's see here: if a transaction starts with a database that is valid according to those invariants, and its writes preserve them, the database stays valid. And then isolation: concurrent transactions don't step on each other, so no race conditions and no half states, and the data each one sees
ends up coherent, so nothing looks inconsistent. And then durability: once a transaction commits, the data is persistent and isn't lost if the system crashes or gets corrupted. Ideally it lasts, but there is no perfect durability, because power outages can happen and disks wear out. Cool statistic: one study found that 30 to 80% of SSDs develop at least one bad block during the first four years of operation.

Cool. So there are single-object operations, where you have multiple reads and writes on a single row of a database, and then multi-object transactions, where you want to write multiple rows, or multiple objects, which I guess is what it's called. Then we have handling errors: a key feature of a transaction is that it can be aborted.

Weak isolation levels: concurrency bugs are hard to find by testing. Serializable isolation, I guess this one's kind of interesting. There are different levels of isolation, and databases don't just go with the strongest level of "this will never, ever happen," because again there are performance trade-offs. We don't want to lock the thing and say nobody can use it until I'm finished. There are weaker levels where you allow more, for example letting everybody read at the same time, and you can set these in a database. If you have a MySQL database, you can actually say, I'm okay running this at, say, the read committed level. So there's serializable isolation, and then, to avoid its performance cost, you use a weaker isolation level. You can get concurrency bugs that way, but it's not always necessary to say, oh, let's be strongly consistent everywhere. You
can have serializable, where every transaction has to behave as if it ran one at a time, but there are a lot of applications where that doesn't matter at all. If I were going to give an example, maybe Facebook messages, where you don't really care if a post has some kind of glitch in it, because nobody's going to die, especially if it means your page loads faster. But I think most people just say, oh, it needs to be perfect, you should never have any bugs.

Okay, read committed. That's the first one. When reading, you only see data that has been committed: no dirty reads. And when writing, you only overwrite data that has been committed: no dirty writes. So if something is modifying a row and something else reads it, the reader should only see the new value once the first transaction has finished and committed. Dirty writes are the same idea: if something is writing this row and hasn't committed yet, we don't want to let another write overwrite it.

It took me three to four hours just to type up the notes on this chapter, reading it and writing it out, and I'm going through it pretty fast here.

How do you even implement something like this? He goes into that, and I believe the answer is versioning plus row-level locks. When a transaction wants to modify a row, it locks it. The problem is that if a row is locked for a write, we still want to allow reads. So what we can do is version it: if it's locked, the reader checks out the old committed version and returns that data. Now things can go faster. Now, there are some problems with
this. He goes through this Alice and Bob example where Alice transfers money. I wish I had the diagram, but I guess I don't have it. I believe it was that Alice transfers money, so one account is down $50 and the other is up $50, but there's a brief moment where someone looks at the balances and sees one side of the transfer without the other, because the transaction hadn't finished when they read the first account. That would be really confusing for that person. He calls this a non-repeatable read, or read skew. It's only for a brief second, though.

Implementing this stuff, transaction levels... this is a lot to memorize. Transactions, ACID, atomicity, consistency, isolation, durability: it's a lot, which is the reason most people won't read any of this, and probably most people won't even watch this video. But for the people who actually want to learn, I do want to give this as an option. This is the hardest kind of concept. You're saying it's very easy, but if you can learn this, bro, and put it into practice, you're making a million dollars a year, easily.

What's going to happen during a system design interview is, sure, you can draw the boxes, but then one question they could ask is: what isolation level do you think we should have? And why should we have that isolation level for this application?
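To make the read skew example a bit more concrete, here's a minimal Python sketch. Everything in it is made up for illustration: `VersionedStore`, its methods, the `transfer` helper, and the $500 starting balances are assumptions, not how any particular database implements this. It compares a reader that always takes the latest committed value with a reader that sticks to one point-in-time view, using the idea of keeping old committed versions around.

```python
class VersionedStore:
    """Toy multi-version store (hypothetical, for illustration only)."""

    def __init__(self, initial):
        self.commit_no = 0
        # key -> list of (commit number, value), oldest first
        self.versions = {key: [(0, value)] for key, value in initial.items()}

    def commit(self, writes):
        """Apply a whole transaction's writes under one new commit number."""
        self.commit_no += 1
        for key, value in writes.items():
            self.versions[key].append((self.commit_no, value))

    def read_latest(self, key):
        """Read-committed style: return the newest committed value."""
        return self.versions[key][-1][1]

    def read_as_of(self, key, snapshot_no):
        """Snapshot style: newest value committed at or before snapshot_no."""
        visible = [value for number, value in self.versions[key] if number <= snapshot_no]
        return visible[-1]


def transfer(store, amount):
    """Move `amount` from alice to bob in a single commit."""
    store.commit({
        "alice": store.read_latest("alice") - amount,
        "bob": store.read_latest("bob") + amount,
    })


# Reader 1: always takes the latest committed value.
store = VersionedStore({"alice": 500, "bob": 500})
alice_seen = store.read_latest("alice")   # read before the transfer commits
transfer(store, 50)                       # transfer commits between the two reads
bob_seen = store.read_latest("bob")       # read after the transfer commits
read_committed_total = alice_seen + bob_seen

# Reader 2: pins a snapshot number before its first read.
store = VersionedStore({"alice": 500, "bob": 500})
snapshot_no = store.commit_no
alice_snap = store.read_as_of("alice", snapshot_no)
transfer(store, 50)
bob_snap = store.read_as_of("bob", snapshot_no)
snapshot_total = alice_snap + bob_snap

print(read_committed_total, snapshot_total)
```

The first reader adds up $1,050 even though only $1,000 exists, because it saw Alice's balance from before the transfer and Bob's from after. The second reader ignores anything committed after its snapshot number, so it adds up $1,000.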
CAP theorem, yeah, CAP theorem. Okay. Snapshot isolation is the most common solution to this problem: each transaction reads from a consistent snapshot of the DB. Implementations typically use write locks to prevent dirty writes, and the database keeps several versions of an object, which is multi-version concurrency control. Then there's repeatable read and the naming confusion: what "repeatable read" means is different depending on which database you're using.

Preventing lost updates: atomic write operations, or cursor stability, where you take an exclusive lock that blocks reads until the update is finished. I think this is fine; it makes sense. Replicated databases are even more annoying, because now you have to do this and also worry about different nodes having different versions of the data. There's another one too, compare-and-set.

So that's one issue. The other one is that you can have multiple doctors who need to be on call, and you always need at least one doctor on call. If two of them try to go off call at the same time, it doesn't work, so you need to compare and set before you do that. You have two people who want to leave, and you need at least one to stay, but you execute the first one and then the second one without ever checking whether the count would hit zero. That should never happen, and now you have doctors who think they can leave but actually shouldn't. So that's the compare-and-set idea.

Then there's another problem, which is write skew and phantoms. Oh yeah, this is what I just talked about. If two doctors check out at the same time, a race condition can occur: the snapshot will tell both of them that it's fine, because there's just no check across the two transactions. Characterizing write skew: it's a generalization of the lost update problem. More examples: you can have meeting rooms where two
people book the same room when they weren't supposed to, or two people who try to claim the same username in a video game. TL;DR: this is really difficult. I keep saying this is above mid-level stuff and I'm not going to bother with it, but it does seem like a difficult problem.

Okay, serializability. We're almost done with this. This is the strongest isolation level, and it's also going to be the slowest. We can execute transactions in the order that they come in, or we can do two-phase locking, or some optimistic concurrency control, so versioning, essentially.

Actual serial execution: the simplest way is to run one transaction at a time. It's just slow to do it that way. When we say one transaction at a time, we've got all these things we want to do, so transaction one does X, Y, Z and transaction two does something else, and we keep them in order: do this one first, then the other one, and so on. A database can't just sit around waiting for users to make up their minds about whether to book an airline ticket, so instead of an interactive back-and-forth, the application must submit the entire transaction as a stored procedure. The notes here say: the database is much more performance-sensitive, stored procedures are awkward with version control, and then partitioning.

Then two-phase locking, which is an algorithm for getting serializability. When someone wants to write to an object, they need exclusive access. If transaction A has read an object and B wants to write to it, B has to wait for A to finish. If transaction A has written an object and B wants to read it, B has to wait until A finishes. So writers block readers and readers block writers. There are several databases that use this, and it's called two-phase locking because
the first phase is when locks are acquired, while the transaction executes, and the second phase is when they're all released at the end. The performance is very slow, and it's very easy to end up with a queue of transactions waiting to run. I believe that's it.

Serializable snapshot isolation: I think this one is the optimistic version. Instead of pessimistic locking, transactions run against a snapshot, and the database checks at commit time whether anything conflicted.

So that's transactions. The TL;DR of this chapter, which again was really long and took me probably four hours to go through: just know that ACID is a thing, that there are different levels of isolation even with ACID, and that there's this thing called BASE, which is basically available, soft state, eventual consistency, and is just very vague. But there are algorithms we can use, and we can set different levels depending on how disastrous it would be if we had an inconsistency in the database.

Never seen locking like this come up during system design? I mean, locking for sure comes up. It's a very popular one. If someone asks you to design a booking system, it would come up for sure, so airline tickets, for example. You know who I kind of like is the Hello Interview people. They've actually done a good job with their social media and they put out some content. Ticketmaster, you're probably going to see that one. I haven't read this article and I don't really remember it, but I'm pretty sure they're talking about locking here, just because you need to be able to lock things. Pessimistic locking, where you just lock and wait, is the thing everyone would think of, but there are other things you can do. Then again, these people aren't going over every solution, but you've got ideas. So, Ticketmaster,
there's inventory systems, well, not really. What's the other one? Hotel booking. Hotel booking systems, these reservation applications, they're using this locking type of idea. So yeah, that's transactions, and that's why we go over this stuff: it carries over to some other systems.
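For the booking case, one alternative to lock-and-wait is an optimistic, compare-and-set style check. Here's a minimal Python sketch of that idea, under some loud assumptions: `SeatInventory`, `reserve_if_unchanged`, and the seat name are hypothetical, and the internal `threading.Lock` only stands in for a database applying one conditional single-row update atomically. It isn't taken from any real ticketing system.

```python
import threading


class SeatInventory:
    """Toy seat table with a version counter per row (hypothetical)."""

    def __init__(self, seats):
        # Stand-in for the database applying one row update atomically.
        self._lock = threading.Lock()
        self._rows = {seat: {"owner": None, "version": 0} for seat in seats}

    def read(self, seat):
        """Return (owner, version) without holding anything afterwards."""
        with self._lock:
            row = self._rows[seat]
            return row["owner"], row["version"]

    def reserve_if_unchanged(self, seat, user, expected_version):
        """Compare-and-set: succeed only if the row is still at expected_version."""
        with self._lock:
            row = self._rows[seat]
            if row["version"] != expected_version or row["owner"] is not None:
                return False  # someone else got there first; caller should re-read
            row["owner"] = user
            row["version"] += 1
            return True


inventory = SeatInventory(["12A"])

# Both users look at the seat while it is still free.
_, alice_version = inventory.read("12A")
_, bob_version = inventory.read("12A")

# Nobody waits on a lock while the users make up their minds.
alice_ok = inventory.reserve_if_unchanged("12A", "alice", alice_version)
bob_ok = inventory.reserve_if_unchanged("12A", "bob", bob_version)

print(alice_ok, bob_ok, inventory.read("12A"))
```

Alice's reservation goes through and Bob's is rejected because the version he read is stale, so the seat can't be double-booked. The trade-off compared with pessimistic locking is that the loser has to re-read and retry instead of waiting in line.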
