good morning good morning welcome to double o six introduction to algorithms lecture two i am eric domain and i love algorithms are you with me yeah today we're not doing algorithms no we're doing data structures it's okay there's lots of algorithms in each data structure it's like multiple algorithms for free we're going to talk about sequences and sets and linked lists and dynamic arrays fairly simple data structures today this is the beginning of several data structures we'll be talking about the next few lectures uh but before we actually start with one let me tell you slash remind you of the idea the difference between an interface which you might call an api if you're a programmer or an adt if you're an ancient algorithms person like me versus a data structure these are useful distinctions uh the idea is that an interface says what you want to do a data structure says how you do it so you might call this a specification and in the context of data structures we're trying to store some data so the interface will specify what data you can store whereas the data structure will give you an actual representation and tell you how to store it this is pretty boring just storing data is really easy you just throw it in a file or something what makes it interesting is uh having operations on that data so in the interface you specify what the operations do what operations are supported and some sense what they mean and the data structure actually gives you algorithms this is where the algorithms come in for how to support those operations all right so um in this class we're going to focus on two main interfaces and various special cases of them so the idea is to separate what you want to do versus how to do it because for you can think of this as the problem statement so yesterday or last class jason talked about problems and defined what a problem was versus algorithmic solutions to the problem and this is the analogous notion for data structures where we want to maintain a data maintain some data according to various operations so the same problem can be solved by many different data structures and we're going to see that and different data structures are going to have different advantages they might support some operations faster than others and depending on what you actually use those data structures for you choose the right data structure but you can maintain the same interface we're going to think about two data structures one is called a set and one is called a sequence these are highly loaded terms set means something to a mathematician it means someone else something else to a python programmer sequence similarly i guess there's not a python sequence data type built in um the idea is we want to store n things the things will be fairly arbitrary think of them as integers or strings and on the one hand we care about their values and maybe we want to maintain them in sorted order and be able to search for a given value which we'll call a key and on the other hand we care about representing a particular sequence that we care about maybe we want to represent the numbers 5 2 9 7 in that order and store that in python you could store that in a list for example and it will keep track of that order and this is the first item the second item the last item today we're going to be focusing on this sequence data structure although at the end i'll mention the interface for sets but we're going to be actually solving sequences today and in the next several lectures we'll be bouncing back and forth between these they're closely related pretty abstract at the moment on the other hand we're gonna have two main let's call them data structure tools or approaches one is arrays and the other is pointers pointer based or linked data structures you may have seen these they're used a lot in programming of course but we're going to see both of these today i'll come back to this sort of highlight in a moment let's jump into the sequence interface which i conveniently have part of here there's a few different levels of sequences that we might care about i'm going to start with the static sequence interface this is where the item the number of items doesn't change uh though the actual items might so here we have n items i'm going to label them x 0 to x n minus 1 as in python so the number of items is n and the operations i want to support are build length iteration get and set so what do these do build is how you get started to build a data structure in this interface you call build of x exactly how you specify x isn't too important but the idea is i give you some items in some order in python this would be an iterable i'm going to want to also know its length and i want to make a new data structure of size n a new static sequence of size n that has those items in that order so that's how you build one of these because somehow we have to specify n to this data structure because n is not going to be allowed to change i'm going to give you a length method um these are methods are the object-oriented way of thinking of operations that your interface supports so length will just return this fixed value and it or sequence this is the sense in which we want to maintain the order i want to be able to output x0 through xn minus 1 in the sequence order in that specified order that they were built in or that it was changed to so uh this is going to iterate through all the items so it's going to take at least linear time to output that but more interesting is we can dynamically access anywhere in the middle of the sequence we can get the item x i given the value i and we can change x i to a given new item okay so that's called get at and set at pretty straightforward this should remind you very closely of something a data structure so this is an interface this is something i might want to solve but what is the obvious data structure that solves this problem yeah a list in python it's called the list i prefer to call it an array but to each their own we're gonna use this this could mean many things um but the uh solution to this interface problem the natural solution is as what i'll call a static array jason mentioned these in uh in lecture one uh it's a little tricky because there are no static arrays in python there are only dynamic arrays which is something we will get to but uh i want to talk about what is a static array really and this relates to our notion of our model of computation which jason also talked about which we call the word ram remember so the idea in word ram is that your memory is an array of w bit words this is a bit circular i'm going to define an array in terms of the word ram which is defined in terms of arrays but i think you know the idea so we have a big memory which goes off to infinity maybe it's divided into words each word here is w bits long this is word zero word one word two and you can access this array randomly random access memory so i can give you the number five and get zero one two three four five the fifth word in this round that's how actual memories work uh you can access any of them equally quickly okay so that's memory and so what we want to do is when we say an array we want this to be a consecutive chunk of memory let me get a color so let's say i have an array of size four and it lives here jason can't spell but i can't count so i think that's four we've got uh so the array starts here and it ends over here it's of size four and it's consecutive which means if i want to access um the array at position at index i then this is the same thing as accessing my memory array at position wherever the array starts which i'll call the address of the array in python this is id of array plus i okay this is just simple offset arithmetic if i want to know the zeroth item of the array it's right here where it starts the first item is one after that the second item is one after that so as long as i store my array consecutively in memory i can access the array in constant time i can do get at and set at as quickly as i can randomly access the memory and get a value or set a value which we're assuming is constant time so [Music] array access is constant time okay this is what allows a static array to actually solve this problem in constant time per get at and set at operation so this may seem simple but we're really going to need this model and really rely on this model increasingly as we get to more interesting data structures this is the first time we're actually needing it uh let's see length is also constant time we're just going to store that number n along with its address and build is going to take linear time iteration will take linear time okay pretty straightforward i guess one thing here when defining build i need to introduce a little bit more of our model of computation which is how do you create an array in the beginning i claim i can do it in linear time and that's just part of the model so this is called the memory allocation model there are a few possible choices here but the cleanest one is just to assume that you can allocate an array of size n in theta n time so it takes linear time to make an array of size n you could imagine this being constant it doesn't really matter much but it does take work and in particular if you just allocate some chunk of memory you have no idea whether it's initialized so initializing that array to zeros will cost linear time it won't really matter constant versus linear but a nice side effect of this model is that space if you're just allocating arrays the amount of space you use is at most the amount of time you use or i guess big o of that so that's a nice feature it's pretty weird if you imagine it's unrealistic to imagine you can allocate an array that's infinite size and then just use a few items out of it that won't be give you a good data structure so we'll assume it costs to allocate memory and okay great we solved the static sequence problem very simple kind of boring these are optimal running times now let's make it interesting make sure i didn't miss anything and talk about oh there is one thing i wanted to talk about in the word ram a side effect of this assumption that array access should take constant time and that accessing these positions in my memory should take constant time is that we need to assume w is at least log n or so w remember is the machine word size in real computers this is currently 64. um or 256 and some bizarre instructions but uh we don't usually think of the machine as getting bigger over time but you should think of the machine as getting bigger over time this is a statement that says the word side has to grow with n it might grow faster than log n but it has to grow at least as fast as log n why do i say that because if i have n things that i'm dealing with and here's the problem size maybe it's the array i'm trying to store whatever if i'm having to deal with n things in my memory at the very least i need to be able to address them i should be able to say give me the ith one and represent that number i in a word otherwise because the machine is designed to only work with w bit words in constant time if i want to be able to access the ith word in constant time i need a word size that's at least log n just to address the n things in my input so this is a totally reasonable assumption it may seem weird because you think of a real machine as having constant size but you know real machine has constant size ram also my machine has 24 gigs of ram or whatever that laptop has eight but uh you don't think of that as changing over time but of course if you wanted to process a larger input you would buy more ram so eventually when our ends get really really big we're going to have to increase w just so we can address that round that's the the intuition here but this is a way to bridge reality which are fixed machines with theory in in algorithms we care about scalability for very large n we want to know what that growth function is and ignore the lead constant factor that's what asymptotic notion notation is all about and for that we need a notion of word size also changing in this asymptotic way all right that's uh that will be more important next week when we talk about hashing and why hashing is a reasonable thing to do but let's move on to dynamic sequences which is where things get interesting so i have the update here so we start with static sequences so all of these operations are still something we want to support in a dynamic sequence but we add two dynamic operations somewhat controversial operations so exciting i want to be able to insert in the middle of my sequence and i want to be able to delete from the middle of my sequence so here's my sequence which i'm going to think of in a picture i'm going to draw it as an array but it's stored however it's stored we don't know this is an interface not an implementation so we have x0 x1 x2 x3 uh and let's say i insert at position 2 so position 2 is here so i come in with my new x and i would like x to be the new x2 but i don't want to lose any information if i did set at 2 then i would erase this and replace it with x but i want to do insert at which means all of these guys conceptually are going to shift over by one in terms of their indices so then i would get this picture that's one bigger and now i've got the new x i've got what was the old x2 which i don't hesitate to call x2 because that's its old name not its new name i'm going to draw arrows to say these guys get copied over uh these ones are definitely unchanged our new x2 which i'll call x2 prime is x and this is x3 prime x4 prime and so on i i want to be careful here and of course the new n prime is n plus one uh i want to be careful about the labeling because the key what makes insert ad interesting is that later when i call get at it's with the new indexing so previously if i did get at 2 i would get this value and afterwards if i did get at it 2 i get the new value if i did get at it 3 down here i would get i get the value that used to be x2 okay that's maybe hard to track but this is a useful conceptually very useful thing to do especially when you're inserting or deleting at the ends so we're going to define in particular insert and delete uh first and last uh these are sometimes given if if you have an insert it has an x if you do a delete it has no argument so this means insert at the beginning of the array which would be like adding it here and insert last means adding it on here so insert last doesn't change the indices of any of the old items that's a nice feature of insert last insert first changes all of them they all get incremented by one and we're also interested in the similar things here we could do get first or last or set first or last okay which are the obvious special cases of get ad and set out now these special cases are particularly interesting in an algorithms context if you were a mathematician you would say well why do i even bother this this is just shorthand for a particular call to get or set but what makes it interesting from a data structure's perspective is that we care about algorithms for supporting these operations and maybe the algorithm for supporting get first or set first or in particular insert first or insert last might be more efficient maybe we can solve this problem better than we can solve insert at so while ideally we could solve the entire dynamics sequence interface constant time per operation that's not actually possible you can prove that but special cases of it where we're just inserting and deleting from the ends say we can do that so that's why it's interesting to introduce special cases that we care about cool that's the definition of the dynamic sequence interface now we're going to actually solve it so uh our first data structure for this is called linked bliss i've taken probably you've probably seen linked lists before at some point but the main new part here is we're going to actually analyze them and see how they how efficiently they implement all these operations we might care about so first review what is a linked list we store our items in a bunch of nodes um each node has an item in it and a next field so you could think of these as class objects with two class variables the item and and a next pointer and we assemble those into this kind of structure where we store in the item fields we're going to store the actual values that we want to represent in our sequence x zero through x n minus 1 in order and then we're going to use the next pointers to link these all together in that order so the next pointers are what actually give us the order and in addition we're going to keep track of what's called the head of the list so the data structure is going to be represented by head if you wanted to you could also store length so this could be the data structure itself and it's pointing to all of these types of data structures notice we've just seen an array based data structure which is just a static array and we've seen a pointer-based data structure and we're relying on the fact that pointers can be stored in a single word which means we can de-reference them we can see what's on the other side of the pointer in constant time in our word ram model in reality each of these nodes is stored somewhere in the array of the computer so maybe each one is two words long so maybe uh one node is the first node is here maybe the second node is here the third node is here they're in some arbitrary order we're using this fact that we can allocate a an array of size n in linear time in this case we're going to have a raise of size 2. we can just say oh please give me a new array of size 2. and that will give make us one of these nodes and then we're storing pointers pointers are just indices into the giant memory array they're just what is the address of this little array okay if you've ever wondered how pointers are implemented they're just numbers that say where in memory is this thing over here and in memory they're an arbitrary order this is really nice because it's easy to manipulate the order of a linked list without actually physically moving nodes around whereas arrays are problematic maybe it's worth mentioning let's start analyzing things so we care about these dynamic sequence operations and we could try to apply it to the static array data structure or we could try to or sorry we could try to implement these operations in a static array it's possible just not going to be very good and we can try to implement it with linked lists and it's also not going to be that great let's go over here our goal is the next data structure which is dynamic arrays but linked lists and static arrays each have their advantages so let's first analyze dynamic sequence operations um first on a static array and then on a linked list so in a static array i think you all see if i try to insert at the beginning of the static array that's kind of the worst case if i insert first then everybody has to shift over if i'm going to maintain this invariant that the ith item in the array represents i guess i didn't write it anywhere here maybe here static array we're going to maintain this invariant that a of i represents x i okay if i want to maintain that at all times when i insert a new thing in the front because the indices of all the previous items change i have to spend time to copy those over you can do it in linear time but no better okay so static array insert and delete anywhere cost and time actually uh for two reasons uh reason number one is that uh if we're near the front then we have to do shifting okay what about insert or delete the last element of an array is that any easier because then if i insert the very last element none of the indices change i'm just adding a new element so i don't have to do shifting so can i do insert and delete last in constant time in a static array no because the size is constant so our model is that uh remember our allocation model is that we can allocate a static array of size n but it's just a size n i can't just say please make it bigger by one i need i need space to store this extra element and if you think about where things are in memory when you call to this memory allocator which is part of your operating system you say please give me some a chunk of memory it's going to place them in various places in memory and some of them might be next to each other so if i try to grow this array by one there might already be something there and that's not possible without first shifting so even though in the array i don't have to do any shifting in memory i might have to do shifting and that's outside the model so we're going to stick to this model of just you can allocate memory you can also de-allocate memory just to keep space usage small but the only way to get more space is to ask for a new array and that new array won't be contiguous to your old one so question uh what is a dynamic array will be the next topic so maybe we'll come back to that yeah okay so in a static array you're just not allowed to make it bigger and so you have to allocate a new array which we say takes linear time even if allocating the new array didn't take linear time you have to copy all the elements over from the old array to the new one then you can throw away the old one so just the copying from an array of size n to an array of size n plus one that will take linear time so static arrays are really bad for dynamic operations no surprise but you could do them okay that's static array now linked lists are going to be almost the opposite well almost so if we store the length okay we can compute the length of the array very quickly we can insert and delete at the front really efficiently if i want to add a new item at the begin as a new first item then what do i do i allocate a new node which i'll call x so this is insert first of x i'll allocate a new array size 2. i'm going to change let me do it in red i'm going to change this head pointer uh maybe i should do that later so i'm going to set the next pointer here to this one and then i'm going to change this head pointer to point to here and boom now i've got a linked list again we don't know anything about the order and memory of these these lists we just care about the order that's represented implicitly by following the next pointers repeatedly so now i've got a new list that has x in front and then x0 and then x1 and so on so insert and delete first at least are really efficient we won't get much more than that but a linked list insert and delete uh first our constant time so that's cool however everything else is going to be slow if i want to get the 10th item in a linked list i have to follow these pointers 10 times i go 0 0 1 2 3 and it's on follow ten next pointers and i'll get the tenth item accessing the item is gonna take order i time so uh get and set at need i time which in the worst case is theta n okay so we have sort of complementary data structures here on the one hand a static array can do constant time get at set at so it's very fast at the random access aspect because it's an array linked lists are very bad at random access but they're better at being dynamic we can insert and delete at the beginning at least in constant time now if we want to actually insert a delete at a particular position that's still hard because we have to walk to that position even inserting and deleting at the end of the list is hard although that's fixable and maybe i'll leave that for problem session or problem set but uh an easy so here's here's a small puzzle suppose you wanted to solve uh get last efficiently in a linked list how would you solve that in constant time yeah double a linked list is a good idea but actually not the right answer that's an answer to the next question i might ask yeah store pointer to the last element that's all we need here and often a doubly linked list has this i usually call this the tail head and tail and if you always just store a pointer to the last list this is what we call data structure augmentation where we add some extra information to the data structure and mean we have to keep it up to date all the time so if we do an insert last or something insert last also becomes easy because i can just add a new node here and update the pointer here delete last is trickier that's where you need a doubly linked list but whenever i add something to the end of this list i have to update the tail pointer also as long as i maintain this now suddenly get last as fast in constant time so link lists are great if you're working on the ends even dynamically arrays are great if you're doing random access and nothing dynamic nothing adding or deleting at the ends or in the middle okay so our final goal for today is to get sort of the best of both worlds with dynamic arrays we're going to try to get all the good running times of linked lists and all the good running times of static arrays we won't get quite all of them but most of them [Applause] and in some sense another way to describe what these introductory lectures are about is telling you about how python is implemented so what we're going to talk about next dynamic arrays i've alluded to many times but these are what python calls lists so you don't have to implement a dynamic array by hand because it's already built into many fancy new languages for free because they're so darn useful this lecture is about how these are actually implemented and why they're efficient and in recitation notes you'll see how to actually implement them if all you had were static arrays but luckily we have dynamic arrays so we don't have to actually implement them but in the in inside the python interpreter this is exactly what's happening okay so the idea is to relax the constraint or the invariant whatever that the size of the array we use equals n which is the number of items in the sequence okay remember in the sequence problem we're supposed to represent n items with a static array we allocated an array of size exactly n so let's relax that let's not make it exactly n let's make it roughly n how roughly you can think about for a while uh but the from an algorithm's perspective usually that when we say roughly we mean throw away constant factors and that turns out to be the right answer here it's not always the right answer but we're gonna uh enforce uh that the size of the array is theta m probably also greater than or equal to n 0.5 n would not be very helpful so it's going to be at least n and it's going to be at most some constant times n 2n 10n 1.1 times n any of these constants will work i'm going to use 2n here but there are lots of options uh and now things almost work for free there's going to be one subtlety here and i'm going to focus on we're still going to maintain that the item of the array is represents x i okay so this data structure let me draw a picture so we've got an array of some size the first few items are used to store the sequence but then there's going to be some blank ones at the end okay maybe we'll keep track of this so the data structure itself is going to have an array and it's going to have a length something like this so we're also going to keep track of the length so we know that the first length items are where their data is and the remainder are just are meaningless okay so now if i want to go and do an insert last what do i do i just go to a of length and set it to x and then i increment length boom easy constant time yeah how do you know you have enough room indeed i don't this was an incorrect algorithm but it's usually correct as long as i have extra space this is all i need to do for insert last but uh i'm also going to store the size of the array this is the actual this whole thing is size and this part is length so length is always going to be less than or equal to size and so there's a problem if length equals size then i don't have any space okay just add to the end unless n equals size i'm using n and length for the same thing uh so the length here is same as n that's our actual number of things we're trying to represent and size this is great this is the interface size this is what we're trying to represent and this is the representation size this is the size of my array these are the number of items i'm trying to store in that array okay this is the interface versus data structure here's the interface here's the data structure okay cool um what do i do in the case when n equals size i'm gonna have to make my array bigger okay this should sound just like static arrays with static arrays we made our array bigger every time we insert it and that was this linear cost of allocation we're going to do that sometimes with static rays we had to do it every single time because size equaled n now we have some flexibility we're only going to do it sometimes it's like uh cookies are a sometimes food apparently according to modern cookie monster um i don't understand but uh if n equals size we're going to uh uh allocate a new array of size any suggestions bigger i like it greater than size how much bigger twice five five things size plus five come on jason trolling me all right so there are a couple of natural choices here one is a constant factor larger you could use 1.1 or 1.01 or 2 or 5 or 10 they'll all work or you could use jason's trolling answer of size plus a constant like 5. why is this bad you'll have to do it again you'll have to resize frequently when five steps later so uh in the original static array we were reallocating every single time that's like n plus one if we do n plus five that really doesn't change things if we ignore constant factors now we'll have to spend linear time every five steps instead of linear time every one step that's still linear time per operation just we're changing the constant factor okay whereas two times size well now we have to think a little bit harder um let's just think about the case where we're inserting at the end of an array so let's say we do n uh insert last from an empty array when do we resize well at the beginning i guess i didn't say what we do for an empty array let's say size equals one so we can insert one item for free as soon as we insert the second item then we have to resize that seems bad immediately we have to resize then we insert the third item okay now let's draw a picture so we start with one item we fill it up uh then we grow to size two because that's twice one then we fill it up immediately we have to resize again but now we get start to get some benefit now we have size four and so we can insert two items before we have to resize and now we're size eight and we get to insert four items before we refill so uh this is going to resize and again resizes are expensive both because we have to pay to allocate the new array i drew it as just extending it but in fact we're creating a whole new array and then we have to copy all of the items over so there's the allocation cost and then the copying costs so it's linear either way but we're going to resize at n equals 1 2 4 8 16. you know this sequence all the powers of 2 because we're doubling that is exactly powers of 2. so we pay a linear cost so this resize cost the allocation and the copying is going to be it's linear each time so it's 1 plus 2 plus 4 plus 8 plus 16. okay really i should write this as a sum from i equals 1 to roughly log n log base 2 of n is lg of 2 to the i okay if you want a terminus here it's roughly n it's actually the next the previous power of 2 of n or something okay but uh that won't matter that'll just affect things by a constant factor what is the sum of two to the i this is a geometric series anyone know the answer two to the uh top limit uh plus one minus one yeah so this is the identity sum of two to the i uh from i equals one to k is two to the k plus one plus one minus one so the plus one's upstairs minus one's downstairs an easy way to remember this is if you think in binary as we all should we're computer scientists uh two to the i is means you set the ith bit to one so here's uh here's a bit string this is the ith bit this is 2 to the i the zeros down here if i sum them all up what that means is i'm putting ones here okay uh and if you think about what this means this is up to k from zero sorry i should do zero to be proper if i write so that's the left hand side the right hand side is 2 to the k plus 1 which is a 1 here and the rest 0s so if you know your binary arithmetic you subtract if you add one to this you get this so if you subtract one from this you get this okay this is why this identity holds um or the higher level thing is to say oh this is a geometric series so i know i you should know this i'm telling you now geometric series are dominated by the last term the biggest term if you have any series you can identify as geometric which means it's growing at least exponentially then in terms of theta notation you can just look at the last term and put a theta around it and you're done so this is theta the last term like 2 to the log n which is theta n cool linear time linear time for all of my operations i'm doing n operations here and i spend linear total time to do all the resizing that's good that's like constant each kind of the kind of is an important notion which we call amortization so i want to say an operation takes t of n amortized time if um let's say any k of those operations take at most k times t of n time this is a little bit sloppy but be good enough so the idea is here if this works for n or k to do n operations from an empty array here takes linear time which means uh i would call this constant amortized amortized means a particular kind of averaging averaging over the sequence of operations so while individual operations will be expensive one near the end when i have to resize the array is going to take linear time just for that one operation but most of the operations are cheap most of them are constant so i can think of charging that high cost to all of the other operations that made it happen and so this is averaging over the operation sequence every insert last over there uh only takes constant time on average over the the sequence of operations that we do and so it's almost constant it's not quite as good as constant worst case but it's almost as good and it's as good as you could hope to do in this dynamic array allocation model let me put this into a table uh and you'll find these in the lecture notes also so we have on the top the main operations of the sequence interface which we will revisit in lecture seven we'll see some other data structures for this get ad and set ad in the first column insert and delete first insert and delete last insert and delete at an arbitrary position so we've seen three data structures now arrays were really good at get at set at they took constant time that's the blue one we're omitting the thetas here all the other operations took linear time no matter where they were linked lists were really good at insert and delete first they took constant time but everything else took linear time in the worst case these new dynamic arrays achieve get at and set out in constant time because they maintain this invariant uh here that a of i equals x i so we can still do gets and set at quickly and we also just showed that insert last is constant amortized delete last uh you don't have to resize the array you could just decrease length and boom you've deleted the last item it's not so satisfying because if you insert n items and then delete n items you'll still have an array of size theta n even though your current value of n is zero um you can get around that with a little bit more trickery which are described in the lecture notes but it's beyond the we're only going to do very simple amortized analysis in this class to prove that that algorithm is also constant amortized which it is uh you'll see in o46 or you can find it in the clrs book uh that's it for today