Transcript for:
Understanding Binary Search Trees (BST)

In our previous lesson, we talked about binary trees in general. Now, in this lesson we are going to talk about binary search tree, a special kind of binary tree which is an efficient structure to organize data for quick search as well as quick update. But before I start talking about binary search tree, I want you to think of a problem. What data structure will you use to store a modifiable collection? So, lets say you have a collection and it can be a collection of any data type, records in the collection can be of any type. Now, you want to store this collection in computers memory in some structure and then you want to be able to quickly search for a record in the collection and you also want to be able to modify the collection. You want to be able to insert an element in the collection or remove an element from the collection. So, what data structure will you use? Well, you can use an array or a linked list. These are two well known data structure in which we can store a collection. Now, what will be running time of these operations

  • search, insertion or removal, If we will use an array or a linked list. Lets first talk about arrays and for sale of simplicity, lets say we want to store integers. To store a modifiable list or collection of integers, we can create a large enough array and we can store the records in some part of the array. We can keep the end of the list marked. In this array that I am showing here, we have integers from 0 till 3. We have records from 0 till 3 and rest of the array is available space. Now to search some X in the collection, we will have to scan the array from index 0 till end and in worst case, we may have to look at all the elements in the list. If n is the number of elements in the list, time taken will be proportional to n or in other words, we can say that time complexity will be big-oh of n. Ok, now what will be the cost of insertion. Lets say we want to insert number 5 in this list. So, if there is some available space, all the cells in yellow are available, we can add one more cell by incrementing this marker 'end' and fill in the integer to be added. Time taken for this operation will be constant. Running time will not depend upon number of elements in the collection. So, we can say that time complexity will be big-oh of 1. Ok, now what about removal. Lets say we want to remove 1 from the collection. What we'll have to do is, we'll have to shift all records to the right of one by one position to the left and then we can decrement end. The cost of removal once again will be big-oh of n. In worst case, we will have to shift n-1 elements. Here, the cost of insertion will be big-oh of 1, if the array will have some available space. So, the array has to be large enough. If the array gets filled, what we can do is, we can create a new larger array, typically we create an array twice the size of the filled up array. So, we can create a new larger array and then we can copy the content of the filled up array into this new larger array. The copy operation will cost us big-oh of n. We have discussed this idea of dynamic array quite a bit in our previous lessons. So, insertion will be big-oh of 1 if array is not filled up and it will be big-oh of n if array is filled up. For now, lets just assume that the array will always be large enough. Lets now discuss the cost of these operations if we will use a linked list. If we would use a linked list, I have drawn a linked list of integers here, data type can be anything, the cost of search operation once again will be big-oh of n where n is number of records in the collection or number of nodes in the linked list. To search, in worst case, we will have to traverse the whole list. We will have to look at all the nodes. The cost of insertion in a linked list is big-oh of 1 at head and its big-oh of n at tail. We can choose to insert at head to keep the cost low. So, running time of insertion, we can say is big-oh of 1 or in other words, we will take constant time. Removal once again will be big-oh of n. We will first have to traverse the linked list and search the record and in worst case, we may have to look at all the nodes. Ok, so this is the cost of operations if we are going to use array or linked list. Insertion definitely is fast. But, how good is big-oh of n for an operation like search. What do you think? if we are searching for a record X, then in the worst case, we will have to compare this record X with all the n records in the collection. Lets say, our machine can perform a million comparisons in one second. So, we can say that machine can perform 10 the power 6 comparisons in one second. So, cost of one comparison will be 10 to the power -6 second. Machines in today's world deal with really large data. Its very much possible for real world data to have 100 million records or billion records. A lot of countries in this world have population more than 100 million. 2 countries have more than a billion people living in them. If we will have data about all the people living in a country, then it can easily be 100 million records. Ok, so if we are saying that the cost of 1 comparison is 10 the power -6 second. If n will be 100 million, time taken will be 100 seconds. 100 seconds for a search is not reasonable and search may be a frequently performed operation. Can we do something better? Can we do better than big-oh of n. Well, in an array, we can perform binary search if its sorted and the running time of binary search is big-oh of log n which is the best running time to have. I have drawn this array of integers here. Records in the array are sorted. Here the data type is integer. For some other data type, for some complex data type, we should be able to sort the collection based on some property or some key of the records. We should be able to compare the keys of records and the comparison logic will be different for different data types. For a collection of strings, for example, we may want to have the records sorted in dictionary or lexicographical order. So, we will compare and see which string will come first in dictionary order. Now this is the requirement that we have for binary search. The data structure should be an array and the records must be sorted. Ok, so the cost of search operation can be minimized if we will use a sorted array. But, in insertion or removal, we will have to make sure that the array is sorted afterwards. In this array, if I want to insert number 5 at this stage, i can't simply put 5 at index 6. what I'll have to do is, I'll first have to find the position at which I can insert 5 in the sorted list. We can find the position in order of log n time using binary search. We can perform a binary search to find the first integer greater than 5 in the list. So, we can find the position quickly. In this case, its index 2. But then, we will have to shift all the records starting this position one position to the right. And now, I can insert 5. So, even though we can find the position at which a record should be inserted quickly in big-oh of log n, this shifting in worst case will cost us big-oh of n. So, the running time overall for an insertion will be big-oh of n and similarly, the cost of removal will also be big-oh of n. We will have to shift some records. Ok, so when we are using sorted array, cost of search operation is minimized. In binary search for n records, we will have at max log n to the base 2 comparisons. So, if we can perform million comparisons in a second, then for n equal 2 to the power 31 which is greater than 2 billion, we are going to take only 31 micro-seconds. log of 2 to the power 31 to base 2 will be
  1. Ok, we are fine with search now. we will be good for any practical value of n. But what about insertion and removal. They are still big-oh of n. Can we do something better here? Well, if we will use this data structure called binary search tree, I am writing it in short - BST for binary search tree, then the cost of all these 3 operations can be big-oh of log n in average case. The cost of all the operations will be big-oh of n in worst case. But we can avoid the worst case by making sure that the tree is always balanced. We had talked about balanced binary tree in our previous lesson. Binary search tree is only a special kind of binary tree. To make sure that the cost of these operations is always big-oh of log n, we should keep the binary search tree balanced. We'll talk about this in detail later. Let's first see what a binary search tree is and how cost of these operations is minimized when we use a binary search tree. Binary search tree is a binary tree in which for each node, value of all the nodes in left sub-tree is lesser and value of all the nodes in right sub-tree is greater. I have drawn binary tree as a recursive structure here. As we know, in a binary tree, each node can have at most 2 children. We can call one of the children left child. If we will look at the tree as recursive structure, left child will be the root of left sub-tree and similarly, right child will be the root of right sub-tree. Now, for a binary tree to be called binary search tree, value of all the nodes in left sub-tree must be lesser or we can say lesser or equal to handle duplicates and the value of all the nodes in right sub-tree must be greater and this must be true for all the nodes. So, in this recursive structure here, both left and right sub-trees must also be binary search trees. I'll draw a binary search tree of integers. Now, I have drawn a binary search tree of integers here. Lets see whether this property that for each node value of all the nodes in left subtree is lesser or equal and the value of all the nodes in right sub-tree must be greater is true or not. Lets first look at the root node. Nodes in the left subtree have values 10, 8 and 12. So, they are all lesser than15 and in right subtree, we have 17, 20 and 25, they are all greater than 15. So, we are good for the root node. Now, lets look at this node with value 10. In left, we have 8 which is lesser. In right, we have 12 which is greater. So, we are good. We are good for this node too having value 20 and we don't need to bother about leave nodes because they do not have children. So, this is a binary search tree. Now, what if I change this value 12 to 16. Now, is this still a binary search tree. Well, for node with value 10, we are good. The node with value 16 is in its right. So, not a problem. But for the root node, we have a node in left sub-tree with higher value now. So, this tree is not a binary search tree. I'll revert back and make the value 12 again. Now, as we were saying we can search, insert or delete in a binary search tree in big-oh of log n time in average case. How is it really possible? Lets first talk about search. If these integers that I have here in the tree were in a sorted array, we could have performed binary search and what do we do in binary search. Lets say we want to search number 10 in this array. What we do in binary search is, we first define the complete list as our search space. The number can exist only within the search space. I'll mark search space using these two pointers
  • start and end. Now, we compare the number to be searched or the element to be searched with mid element of the search space or the median. And if the record being searched, if the element being searched is lesser, we go searching in the left half, else we go searching in the right half. In case of equality, we have found the element. In this case, 10 is lesser than 15. So, we will go searching towards left. Our search space is reduced now to half. Once again, we will compare to the mid-element and bingo, this time, we have got a match. In binary search, we start with n elements in search space and then if mid element is not the element that we are looking for, we reduce the search space to n/2 and we go on reducing the search space to half, till we either find the record that we are looking for or we get to only one element in search space and be done. In this whole reduction, if we will go from n to n/2 to n/4 to n/8 and so on, we will have log n to the base 2 steps. If we are taking k steps, then n upon 2 to the power k will be equal to 1 which will imply 2 to the power k will be equal to n and k will be equal to log n to the base 2. So, this is why running time of binary search is big-oh of log n. Now, if we will use this binary search tree to store the integers, search operation will be very similar. Lets say, we want to search for number 12. What we'll do is, we start at root and then we will compare the value to be searched, the integer to be searched with value of the root. if its equal, we are done with the search, if its lesser, we know that we need to go to the left sub-tree because in a binary search tree, all the elements in left sub-tree are lesser and all the elements in right sub-tree are greater. Now, we will go and look at the left child of node with value 15. We know that number 12 that we are looking for can exist in this sub-tree only and anything apart from the sub-tree is discarded. So, we have reduced the search space to only these 3 nodes having value 10, 8 and 12. Now, once again, we will compare 12 with 10. We are not equal. 12 is greater, so we know that we need to go looking in right sub-tree of this node with value 10. So, now our search space is reduced to just one node. Once again, we will compare the value here at this node and we have a match. So, searching an element in binary search tree is basically this traversal in which at each step, we will go either towards left or right and hence at each step, we will discard one of the sub-trees. if the tree is balanced, we call a tree balanced if for all nodes, the difference between the heights of left and right sub-trees is not greater than 1. So, if the tree is balanced, we will start with a search space of n nodes and when we will discard one of the sub-trees, we will discard n/2 nodes. So, our search space will be reduced to n/2 and then in next step, we will reduce the search space to n/4. We will go on reducing like this till we find the element or till our search space is reduced to only one node when we will be done. So, the search here is also a binary search. And that's why the name - Binary Search Tree. This tree that I am showing here is balanced. In fact this is a perfect binary tree. But with same records, we can have an unbalanced tree like this. This tree has got the same integer values as we had in the previous structure and this is also a binary search tree, but this is unbalanced. This is as good as a linked list. In this tree there is no right sub-tree for any of the nodes. Search space will be reduced by only one.at each step. From n nodes in search space, we will go to n-1 nodes and then to n-2 nodes, all the way till 1 will be n steps. In binary search tree, in average case, cost of search, insertion or deletion is big-oh of log n and in worst case, this is the worst case arrangement that I am showing you. The running time will be big-oh of n. We always try to avoid the worst case by trying to keep the binary search tree balanced. With same records in the tree, there can be multiple possible arrangements. For these integers in this tree, another arrangement is this. For all the nodes, we have nothing to discard in left sub-tree in a search. This is another arrangement. This is still balanced because for all the nodes, the difference between the heights of left and right sub-tree is not greater than 1. But this is the best arrangement when we have a perfect binary tree. At each step, we will have exactly n/2 nodes to discard. Ok, now to insert some records in binary search tree, we will first have to find the position at which we can insert and we can find the position in big-oh of log n time. Lets say we want to insert 19 in this tree, what we will do is start at the root. If the value to be inserted is lesser or equal, if there is no child, insert as left child or go left. If the value is greater and there is no right child, insert as right child or go right. In this case, 19 is greater, so we will go right. Now, we are at 20. 19 is lesser and left sub-tree is not empty. We have a left child. So, we will go left. Now, we are at 17, 19 is greater than 17. So, it should go in right of 17. There is no right child of 17. So, we will create a node with value 19 and link it to this node with value 17 as right child. Because we are using pointers or references here just like linked list, no shifting is needed like an array. Creating a link will take constant time. So, overall insertion will also cost us like search. To delete also, we will first have to search the node. Search once again will be big-oh of log n and deleting the node will only mean adjusting some links. So, removal also is going to be like search
  • big-oh of log n in average case. Binary search tree gets unbalanced during insertion and deletion. So, often during insertion and deletion, we restore the balancing. There are ways to do it and we will talk about all of this in detail in later lessons. In next lesson, we will discuss implementation of binary search tree in detail. This is it for this lesson. Thanks for watching.