Understanding B-Trees and Their Operations

hey everyone and welcome back sorry for the nasally voice on my end kissing lots of dudes over the weekend and it looks like it finally caught up to me uh just a quick announcement before we actually get into the content of today is that I just want to shout out one of my viewers Ty who reached out to me over LinkedIn to tell me that he got four senior offers recently primarily studying the content from this very channel so to those of you who are haters believe in this channel take your own notes be like Ty and go get to the bag anyways let's get into the content of today's video sorry again about my voice but we'll get through it we're going to be talking about bee trees how they work and what they're good for all right so I decided to do a bunch of pre-drawing so that we wouldn't have to wait for me to draw everything and my handwriting would be worse but let's go ahead and actually start talking about what a b tree is so a b tree is another type of database index so I started last episode by talking about hash indexes and how that's basically just a hash map kept in memory on the other hand the cool thing about a b tree is that it's kept entirely on disk so it doesn't have some of the downsides that hash maps have right we don't have to keep all of our keys in memory and you know we don't have to worry about the durability of our index well for the most part we'll talk about the cases where we might so basically the B tree as we can see from the iPad is this tree-like structure where it basically gives the ranges of all the keys and gives references to other locations on disk that we have to Traverse to in order to go ahead and find the row that we want so you can see written out here let's imagine we have a bunch of names and the scores that they got on a math test so basically all the rows from a to L we can see there's a reference pointer that leads us and basically says okay now follow and Traverse the reference from a to e if you want those names or if you care about the names between E and O we can go and Traverse to this Leaf node which is basically going to have something like Edward with a score of 80 and Gordon with a score of 95. so the point is we can Traverse through this tree basically going to the reference that matches up with the range of key that we want and this is really really great for a couple of reasons so one is that this is relatively Speedy for reads why because to actually go and find the key that we want all we have to actually do is follow these references right or in this case it's not this reference that we want but it's this one here typically I didn't draw it out completely but the speed tree is going to stay balanced we'll talk about how that works in a little bit and so that's really really useful for making sure that the height of the tree is relatively constant and additionally these page sizes right here are typically pretty big I can't remember the exact amount but I think it's typically something like 250 megabytes and if it's 256 megabytes or 250 or whatever it means that we can have a tree that encompasses a ton uh we can have a tree that encompasses tons of keys and values uh while also staying relatively short and that means we're jumping around on fewer places on disk because keep in mind on the actual ring that is the disc all of these references are in different locations so jumping around is actually pretty tough but if we only have to do it a couple of times that's relatively fast so for both reading and updating the general process is actually just traversing through the tree finding where it is that the key that we want to look at already exists if it's an update then we just go ahead and update it in place and if it's a read we can go ahead and read it well what about her rights this is where things get a little bit more complicated so I'm going to draw out another diagram all right so let's talk about actually writing so let's imagine that I wanted to basically this is like a you know one root two Leaf path in our B tree and let's imagine that I wanted to add the key Betty with a score of let's say 90. right so basically the fundamental thing here to remember is that every single kind of node of the B tree has a set size let's imagine it's 256 Megs and so let's imagine now that we have this kind of bottom node right here if it has extra space we're in pretty good shape right it means we can actually just add Betty right over here and we'll be fine because we don't actually have to change the structure of the B tree at all the one thing to note is that Adam Bob and Betty still needs to remain in alphabetical sorted order so we do actually have to kind of write this one particular node right we would go and make one over here with Adam Betty you know or rather Adam Betty and Bob and then we would change the reference to this page so that's how you would write if there is actually space in this child node however what if there isn't space in the leaf node then things actually become complicated so now what actually has to happen is I would have to split the child node into two so I might have one node with Adam and Betty sorry I'm trying to write here and then the other with just Bob and then how can I actually go ahead and rectify this well now I have to go ahead and split the parent node right but you might actually see here that this parent node doesn't have enough space in it right so ideally what we would want to do is basically delineate the parent node to have some boundary where what we're effectively doing is saying here's everything from A to B E and then here's everything from Bo to C right because Bob De C is over in this other page and that would refer to here and then this one would be referred to by this reference sorry it's getting a little bit messy however if there's no space in this node we basically have to keep traversing up through the B tree splitting every single node into two pieces until we actually find free space in one of the notes as you can see there is some free space in this route So eventually we'll be in good shape if the root itself doesn't have free space we actually have to split the root itself into two different pieces and then create a new root node all the way on top of it which points down to the old root node so I know that's a little bit of a convoluted explanation but you can think it out the point is by making sure to split these pages when they get too big we can guarantee that our B tree is going to remain balanced so there's actually an issue here and one of those things is what if for example you know we're updating a node and then basically B tree dies and what I mean by that is computer turns off you know mid B tree update and for example now the references are all out of whack we've created these two pages right here but we haven't actually updated the reference in time so now our B tree is in a consistent in inconsistent State computer dies well if you recall from last episode the best thing to do to ensure that you can always get your state back is to have something like a write ahead log right our sequential only right log that tells us everything that we're about to do in our database before we actually do it so every single time before we actually write to the B tree or update the B tree that's first going in the right ahead log then if the computer were to die we can actually go and replay all of these messages to recover the state of our B tree okay so let's summarize our B tree again in terms of how it compares to the actual hash index itself obviously it is going to be slower right that's because the primary amount of data in the index is kept on disk which is always going to be slower than memory that being said the B tree relatively speaking is still not bad it's relatively good in terms of reads because you only have to do a couple of traversals on disk in addition the main advantage to the B tree over something like the hash index is twofold first of all not all of the keys have to fit in memory we can now have a much bigger size in our data set and even more importantly we can do range queries so if you recall range queries are when we want keys that are similar to one another adjacent to one another in memory or on disk so that we can find them if you look at the actual child structure of a b tree we can see that the keys that are similar to one another are actually going to be next to each other in these child pages so this way if I want say all the keys between e and l i can just go down the tree reference back to this child mode fetch all of them at once I don't have to do a bunch of different reads the data is conveniently effectively sorted through this B tree so just to summarize it and write this out a b tree is the following it is a self-balancing tree on disk and that means one no limits to data set size within reason of course obviously hard drives do have bigger capacities but they're not infinite and then the second and more important point is that now we actually have support for range queries the one thing to note is that even though B trees are relatively quick with reading the rights themselves are not super fast for the reasons we discussed you may actually have to do a bunch of Rights not only to the right ahead log but to the bottom of the tree and then you may have to Traverse all the way back up the tree basically switching all of these references on disks so that can take a while we'll talk about in the next video a different type of index that tries to optimize on this and kind of improve the write speed while at least maintaining similar read speeds to the B tree and a similar ability to perform range queries alright guys have a good one and I will see you in the next video

Transcript for:Understanding B-Trees and Their Operations

Transcript for:
Understanding B-Trees and Their Operations