Transcript for:
Early Morning Lecture on Database Engineering

hey there everyone it is here good morning this is five o'clock in the morning and turns out couldn't sleep uh so I thought let's record a video and this reminds me it's so much important for me that for some days I need to just get rid of all the tags so that I can get myself back and don't feel the burnout but anyways I'm about to take a break but anyways this video is another engineering side of the video and it turns out a lot of people are these days coming up not not from the formal Computer Science Background but from a different variety of backgrounds uh even some of the chefs and accountants are coming into the programming world and they don't truly appreciate the engineering side of it and that's why I'm creating these sets of videos again these videos are not a hundred percent of my own uh they're coming from a lot of books a lot of blog articles a lot of my libraries in my college as well these are common knowledge everybody knows them but a lot of people turns out don't know this this is is another video in which we are going to discuss a lot about the database I'll spark enough of the Curiosity in you so that you truly understand what it takes to make a database and why there is a database war going on in the market I'll spark enough of the question inside you that you will now question every inch of the database and you will be able to truly understand that what makes database a good database and how you can select a database for yourself although this is generalized I'm making this self here but this is a generalized database engineering side of it it doesn't means that this is a side of MySQL or postgres or mongodb or any other database Vector database that you're working probably a separate video on the difference between them but first this is a generalized side of it not only that this video will also give you enough of the details so that you in case you are curious about the blockchain and What Makes blockchain blockchain You will get a tiny bit hint of it so for all the enthusiasts in the blockchain this is the video so welcome back hope you have already hit that subscribe button because these days I'm on to a spree of making a videos and I should really stop it down I should really so let me walk you through so first of all couple of questions before we go and Embark the Journey of what's a database and how does a database system entirely Works what's the inner engineering of it what are the different components and the parts which are stitched together to make it work a couple of questions to you have you ever wondered and asked yourself a question that why should I choose a database and what's the importance that database is giving me what's the difference between choosing a MySQL and a postgres both are relational databases but what makes them different apart from claiming of the speed that they give us or why should I choose something like a mongodb instead of a MySQL okay it's a document database but what makes it a document database also have you ever asked yourself a question that hey whatever the database is doing I know that it's writing something inside the disk all right so whether you choose MySQL Auto postgres auto mongodb It ultimately writes it inside the inside the disk itself and that writing part is a little different relational database obviously writes it into a tabular format like an Excel sheet or databases like mongodb it just dumps it out into the object format but once they actually put it into the file ultimately what the file is it just is a sector on your disk whether your ssds or hard disk but it's engraved there it's magnetically written there so how does these database actually gets that update facility how they are able to go inside the disk and actually rewrite that sector because if you ever studied any computer science that's not possible once that's magnetically engraved that's it it needs to be wiped off and completely rewrite that there is no update facility in the drive so that's where the engineering side of the things actually spark and that's where I enjoy the most that hey how come this is possible and that makes you a real engineer so anyways I hope those questions are very interesting for you you let's let me take you on to my iPad and again I forgot to uh spin up my iPad and just give me a second I'll spin it up uh not really a big deal should have done that already but hey it's five o'clock in the morning you can excuse me for that part okay let's go ahead and start with that come on spin it up spin it up there we go this is our database yes I have already curated a diagram so that we can focus on uh just the Talking part of it and when I actually write it in the videos I just write it in the very fast phase ugly and writing spins off okay so let me just walk you through one by run one by one you don't have to worry too much this is all manageable you will understand a lot of it not whole of it but a lot of it so first of all this is your or import or odm part what is this in case you have heard about these days popular month stack be a murder stack developer you use a library there known as bangus Now using a mongoose is not really compulsory what Mongoose is is technically an or RM a representational a client which makes my life easier I can go ahead and directly interact with the mongodb client or the MySQL client but the interaction and those default drivers are not so easy to manage understand and they don't provide me tons of facility on top of it that's why when I want tons of facility like Mongoose and there are my SQL versions of it I use or prefer to use orm just to make my life easier but there is no compulsion like that for example if you have ever written some code in C plus plus there's a library Crow there is no such orm for the crow if you want to interact with them with the mongodb you have to use the mongodb direct client and that's also fun but it doesn't give you that kind of validations which Mongoose gives you but hey that's that's the part of it so again orm talks to the client and client talks to the database itself that's why a lot of people you'll see that if you read more articles by the way I'll probably share if you put that in the comment section I'll share more resources about where do I study a lot of these things and where do I get all these debates as well if you are interested let me know in the comments section so that's why you'll see a lot of people debating about that there is no point or no advantage of using orm probably you are using just to make another bottleneck you should be directly using a client because at least one layer is skipped so that's interesting so this is the part which is used for talking to the client this whole big gigantic thing that you see this is your database MySQL postgres mongodb they all have these components in that what's interesting here is at the bottom here that is your disk and that's where the magic happens and a lot of database are fighting on different segments of these database that's why you're seeing there is a database War right now almost a wall some are fighting on how we are going to move the data to the disk some are fighting over how we have actually changed our CEO yes database execution engine that's known as their CEO some actually says that hey we have actually re thought that how this index file works that's their whole USB is there's a lot of thing going on so let me just first walk you through what happens so once you have actually gone through with the orm and the client or the client there could be any one or there could be both so once you have done that you have written some query for that whether that query is dot find or that query is Select star from users where first name equals to hit whatever that is that's a query and that query first and foremost always and always hits this portion of the block some database tools will say that this is not two this is just one some will say this is three some will say this is four part but the whole idea is there is two segments two parts in which it is divided the first one is the parser part of it another one is Optimizer part of it the parser exists in every language because you're writing everything in the characters nobody understands select star what does that even mean so in that case we need a parser agent what this parser agent is it it converts that into ASD that's basically like saying an understandable tree structure so there's a parsing that happens if you have watched the previous video I've talked about that in the browser one in case not that's okay parsing simply means I want to convert that into tokens all over the keywords like if else select where a equal sign and Dot find all of that are tokens so I want to extract the tokens out of it then I want to run through it with the syntax and semantics whether you have written the proper syntax have you have a typo whether it's according to my structure like you cannot write find hitesh where username no no there's a sequence of it so all the semantics should be proper all the syntax should be proper and then I will convert that into a tree and that's why a lot of people say learning data structure is good because a lot of things are being converted into tree but those people are mugging the data structure so much that they forget to show you the impact of what the impact actually is of the trees so everything that you write whether that's C plus plus or your data structure queries are converted into tree now these trees are actually being thrown at the optimizer now this Optimizer is really really cool guy and it always and always talks about the cost part and this is like uh in your company there is always a accountant who says why you bought this why you are paying 3 lakhs to this guy while you're paying 30 likes to this guy yeah this guy always talks about the money part but in the database it's not literally the money the cost could be different parameter for the database for example it could be a faster execution it could be don't do execution twice or it could be I can grab the data faster but that will cost more so there's a lot of optimization that needs to be done optimization has no strict definition for some the optimization is time for some the data optimization is execution speed there's a lot of variables and this is where a lot of database actually fights yeah so database first is getting parsed and then it is being optimized now in every single database there is this execution engine this is the CEO of the database management system this decides that how the client request will come and how I will send it back as quickly as possible just like the CEO consider this as an Elon Musk of the of the entire our database he wants to do everything super fast now it will push the optimizer give me the data fast it will also go ahead and ask hey how can we retrieve the data back from file all this happens this is the master of the ship hey everything happens with me through me and I should be aware of everything that's the whole job of execution engine it doesn't literally do everything like it does some stuff but it doesn't do everything but it actually asks everybody to do things faster and faster and update me then there is a cache as well I'll come back onto the cache in a minute but you don't have to worry more on that now there is another block here which has a lot of utilities and every single database company try to increase their utility function more and more the basic one of course includes the authentication the authorization the backups metrics to track metrics to evaluate clustering every single database company is trying to use these are known as utility functions and every database has their own utility probably I'll make separate videos for my SQL and mongodb where I'll walk you through these utilities which are separate for each one of them but hey this video will serve as a base guideline now coming up now remember one thing no matter what database engine you're using you're always getting the data from the disk which is a costly operation it's not easy to invoke your uh invoke your kernel and grab some data and get that so every single database has this cache operation that hey if this operation or if date if this data is frequently being accessed let's not always call the kernel and rather put it into the cache if we have that and again the cache is exactly same as you study in your OS that hey there is a page and if we get the Page hit or a page missed exactly same so you get the cache if that is there your CEO always asks the cash first hey this is the query it is being passed it is being optimized so do we have the access in the cache if we have the access to the cache throw it back to the client don't even think for a second and on top of that this utility is always runs based on their need or not like for example whether you are authorized to ask me that question or not whether you are a like reliable client or not whether there should be a timely backup of this data or not these are utilities so we are not bothered much about it now what we are bothered about it where a lot of database are fighting these days is this block so there are two portion of how your data is going to be magnetically engraved in the database the first one being the data file and the second one is index file now your data file is actual data there is nothing in it and yes there are debates and there is competition on that for example some data files should look like tables and rows that's what the relational databases are some says the data file should be written in the document format there are vector formats even so this is where a lot of fight happens a lot of variety of the databases given to you that hey I will be able to extract the data faster if you just give me that document back and I'll do majority of stuff in my execution engine or some says hey let's just execution engine handle the part first and then go to the data file to specifically look for rows and columns and get only that data again debate good debates and the index file is actually the metadata because if you notice if you have ever gone into the mongodb atlas you'll find that there are different indexes that you can create and these are searchable like obviously you're not going to search directly from the data file there should be some metadata data about the data so that you know how to find the data which I'm looking for and that's where some people say that hey we'll have just won a primary index secondary index some says we will allow you to have each field marked as index databases databases and the whole thing this saving and all of this done is via that storage engine yep that is your guy uh junior CEO you can say which helps you to store all of this data indexes into the disk so the more powerful your data storage engine is the more powerful your database is and this storage is actually done in the different data format because it's a structure you want to store your data and that's where you learn about the B trees and B plus tree anybody who is doing their masters or the PHD once in a while in the computer science they have to study these B trees and B plus trees and even have to read the research paper of how we can optimize it more further and go beyond the B plus trees there's the hash index there's a lot more that happens now coming back now that you understand why the database are storing uh having a war on this portion as well as on this portion then comes the manager every database is trying to hire more and more managers not literally higher but you can say they are injecting more and more managers the first these are the common managers there are more actually these days the first one is the transaction manager now the transaction manager is somebody who guarantees the transaction a transaction simply means if it happens then it happens otherwise it doesn't happen partially this transaction manager doesn't work alone it actually take helps from other manager but this is a guarantee because no matter what the database is the transaction should be guaranteed like a lot of people say is in the mongodb there is no transaction guarantee no man you are wrong there absolutely wrong the transaction is a guarantee and is a feature of data base even if it's a mongodb either you save the entire data or you don't save it that is the transaction nobody wants to have a database even database will not exist a lot of people use database for their transaction history in the payment gateways as well you don't want that hey sometime it saves the data sometimes it does now if it saves it it saves it it throws an error then it throws the error so transaction simply means a guarantee that it will store the data or will do a complete rollout and will give the error that's the whole job and how it does it is actually dependent on the lock manager lock manager does two things if you want to write the data it gives you exclusive access to that portion of the disk so that you can write it and nobody bothers you during that time because databases they are trying to use your computers a lot as much as possible and if you know these days everything is happen concurrently yeah database is not like your JavaScript that it runs on single thread they try to use as much as possible of your concrete currency but it is also required that the lock manager should give you a guarantee that if you're writing something no two parallel process should come and write that that's important but if you want to read it it actually allows the shareable read so if somebody wants to read it read operation should be faster otherwise nobody will use the database so read operations are shared and all the concurrent process can actually take and share the data it does it through a log table mechanism so in case you want to learn that I wrote it accidentally I changed my pen so in case you want to understand more about it learn about log table simple process there is nothing too much to discuss and this is the main this is the interesting manager which I love a lot about it this is a recovery manager now recovery manager is something really interesting it uses a simple data structure which is append only data structure so if you have ever wondered that how the log tables are able to keep the track of all the logs that I'm making because sometimes I mess up but these logs are actually responsible to roll me back in the time yes that's exactly the recovery manager is it allows the database to have a point in time recovery that hey at this point of time everything was correct so through these logs and through this spatial data structure which is known as append only data structure you are able to go back in the time now this append only data structure is not something new it was always there in the logs and in fact the concept of how you're going to keep logs in database actually came to us only because of this append only data structure and if you remember what is an append only data structure that means you cannot update it you cannot delete it you can only just keep on appending it you can keep on adding on top of it and that's my friend is your blockchain yeah literally this is your blockchain you cannot go back in time you cannot append it you cannot delete it only the chains can be moved forward so all those enthusiasts will say I just carry about or care about the blockchain only I don't care about logs or database my friend you missed the basics it came in here from append only data structures so again this is really really fun now a couple of more things which are Super interesting here right now in the initial days a lot of people actually uh had a competition onto this all your utilities uh so all your utilities this is where a lot of people fought and said hey this is my SQL version this is my SQL version one or other than of them gave us try to give us more utilities and more control over our recovery manager that's it that was the game but these days things are on to a different scale there's a lot of debate and you'll find a couple of databases in this Niche which are just focusing on the cache redis is one of them which has like crazy cash power and again the same thing happens in the radius even there is no difference there a lot of them are fighting over this storage engine and a lot of them are fighting even what kind of data should I store and how should I store it and what metadata should I store it so that's where you see the crazy War of the database is going on all right uh quite a long video not quite long just 20 minutes again for this much of the structure it's 20 minutes again this video is super helpful for somebody who is in early first year of their college and trying to get some interest in the burns tag and finding why the database is so much powerful or somebody who is not from the traditional computer science background and trying to jump into the things this behind the engineering behind the scenes is actually useful for them and again for those people who are enjoying these videos if you're not gonna share them on LinkedIn it doesn't really motivate me that hey somebody is watching it somebody needs these kinds of videos although these are very basic for anybody who comes from a computer science engineering this is very basic but I think it serves a good purpose so share them up it motivates me and again I think now I should go to sleep it's almost six o'clock so time to sleep okay all right goodbye and I'll surely catch you up in another such video