Text Editor Data Structures

So, I thought something that'd be interesting to look at, but it's probably overlooked because they just seem so simple and ubiquitous, would be the humble text editor. It seems like a very straightforward problem to solve. You just put in some characters which are presumably just moving around some asy. Is that Oh, yeah. I mean, that's that's that's the really interesting thing is that actually the text editor, if you think about it, you think, well, yeah, I just get a character, I insert it in the right place, or I delete a character from uh the document that's already there. and it'd be really easy to write one. But actually, when you you start and when you start to think about what's involved, you realize that if you don't get the data structures right, it can become horribly slow and inefficient to write. And so, actually starting to see how they're written can sort of illustrate how actually the way you design your data structures in a computer program can have a massive effect on how efficient they are and how well they work. So, I think a good place to start would be to define what we mean by editing text. I mean, I've got a text editor on screen here, and we can think about the sort of task that we might want to do with a text editor. So, we might want to insert some text. Let's start. We're just putting in a comment here. Um, hello computer file. Probably don't actually want that in the middle of my source code. Well, it's not my source code, but um, the source code that's on screen. So, we want to be able to insert text, add extra characters between other characters that exist. We want to be able to delete text, remove characters that already exist. And we want to do that both in the middle of a document, at the beginning of a document to an empty document, and also at the end of the document. The other thing we would want to do um is overwrite text, where we type characters that replace ones that are already there in the document. So, we're looking at having a document that's open and being able to edit it. And we might start with an empty document and we add characters to it until we've got the finished text document that we want. And as we said, this actually turns out to be quite an interesting problem both for how we actually represent the document itself, the text itself, which is what we look at today, but also things like, well, how do you display that on screen when you're moving around? How do you update the screen quickly and efficiently? And perhaps we'll look at that in another video at some other time. But certainly, how do we represent the text in a way that's efficient for a computer to actually edit the text that we've got there? So, we might just think, well, we can just use a string to represent this. And then when we want to add a character onto the end, we can concatenate it onto it. If we want to add one in the middle, we can say, well, get me the first characters here as a separate string. Get me these characters here as a separate string. insert the character in the middle and combine them all together. So we take the first part concatenate onto that the new character concatenate onto the bit that we had. Is concatenate different to append? It's probably just a synonym for it's the same thing. Yeah. So effectively we are appending the character. Yeah. Concatenate. It's all right. I've been a computer scientist and a C programmer for too long and stra becomes concatenate comes from concatenate. So yeah. Yeah. We're going to append it to the same thing. Yeah, it's the same thing. We're joining them together uh to form a new string. Now that would work but depending on on the programming language you're using you may find strings are immutable i.e. they can't change. So every time you do that you end up creating new strings and the old ones get deleted which can cause fragmentation in memory and all sorts of interesting issues and just takes time. And so you'll often find um languages like Java has a string buffer which is effectively just an array of characters which is how C for example has always represented strings. So, probably a better way because we know we're going to be manipulating this is to represent the text as an array of characters. Let's start with that as our starting point. So, I need an array. And in true blue fashion, here's one I made earlier. So, we've got an array here. An array is just a storage where we've got an index that we can use to reference things. And we can store things in each box. And so, we can think about how we can represent our text. Here I've got some characters. So we can put a character in the first box, index zero, character in the second box. We'll put the space in the third box, index two, and so on. And we can add the characters. Adding characters or even deleting characters at the end of it becomes really really efficient. Um hopefully I can find all the characters that I need and stick them in. Now obviously the computer is a lot quicker than I am and we can represent line breaks as characters and so on. And as we've seen adding them at the end or removing them from the end is really really simple as is for example overwriting a character we can just take away and replace the one in that location and that becomes easy. But you may have noticed that I have made a mistake when typing this in. And so if I want to insert the extra character there, this is where we start to see that perhaps the array isn't the best way to represent the text here. Because to do that, unlike adding it at the end, or I can just put it in the right place, um, I actually need to make space between the N and the T to do this. So what I end up needing to do, and is move the E from the position it's in to there. move the B into that position. Move the O along one position as well. Move the T along one position. Move the spaces along, which I missed out, but we can assume I did that. And then I can put the character in the right place. And that seems a simple thing when I've got what 20 characters maximum, probably more like 15 16 if if that in the in the page at the moment. But actually, if you think about it, text documents that we might be editing are likely not to be 10 characters long, 20 characters long. They're probably not going to be a hundred characters long. They're going to be thousands of characters long, if not tens of thousands, 100 thousands, millions of characters long. Multimegabyte text files are not unheard of. And so actually if we're having to manipulate and copy all the characters from one position to to the next each time we want to insert something into it and also we need to do it every time we want to delete something in the opposite direction to put things back in the right place. Then this actually takes the computer some time. One of the things about computers is that our CPUs have got much faster over the past 40 years that I've been using computers, but actually the speed that memory's got faster hasn't increased anywhere near as as as much as the CPU has. So actually copying things around in memory takes time. And one of the other disadvantages in what we're doing is that the way we tend to represent characters um particularly if we use ASI or even if we use unioders is bytes or maybe two byte or four byte things and actually copying bite or two byte things is slower than copying larger things because the memory is not aligned with the way that the data is. We won't go into the details. Um, and so it can get slower to copy it because we have to go from one bite position to the next one along and we have to do it in the right order otherwise we'll overwrite things. And so it can actually become quite slow to do particularly if we got a big document. Um, and you would soon notice it if you were wanting to add a a character to the beginning of Hamlet, you know, written by Steven Bagley, not by Shakespeare Chappie. You would soon know that it would take a while. So how can we get around this? How can we come up with a a data structure that doesn't require us to do all this copying of text or the characters from one position to another to sort of speed things up? Now, if you've been watching computer file for a long while, hey, congratulations. Hello, faithful viewer. Um, but you may remember we've talked about arrays and link lists before, and I'm sure your alarm bells are going on thinking, maybe this is a good place for a link list. Well, you would think that. Let me just completely corrupt all my nice Shakespeare and let's think about how we could represent this as a link list. So, what we'd have is the character, the first character in our data structure, and then we would have a pointer there to the next character. And then we'd have that character, and we'd have that point at that, and then that would have a pointer to the next character, and so on. And then we've got this would come around and they can be anywhere in memory. And oops, I've made a mistake. I've missed off the E. But that's all right. We've got a link list. I can create that somewhere. And we can then update this pointer so it points to here. And then make that one point to there. And then we only got to change these two pointers to insert a character. Likewise, if we wanted to delete this space, don't know why. We could just change this pointer to point over there as well. So you think great let's represent the document as a link list. Every character becomes a single element in the list. We can update the sort of pointers between them. But then you think about how much memory that would take up. We'd still take up a character for each letter or digit or whatever it is we've got there. But now we also need space for the pointer to the next one. And on a 64-bit system that's going to be eight bytes long. Reality is we're probably actually going to have two of these because we're also going to have a previous pointer so we can have a doubly link list just because for the sort of things we want to do in a document that actually makes sense and would make more. So we've got 16 bytes here, 17 bytes. We've just increased the size of our memory requirements from one bite to 17 bytes per character. That's a 1,600% increase. for our document. That doesn't seem like it would be that efficient to do. Now, you could come up with a hybrid. You could say, well, okay, I won't put every character in the link list. I'll just put a line of text in. And and that would work. And in fact, there are text editors um to do this. Is a common text editor VI that uses a link list of lines to represent the document. Works absolutely fine. The link list means you can move around in it pretty quickly. you're only dealing with a small 80 100 characters per line usually. Um, so when you're copying things about in that, it's relatively quick to do. So you can come up with a hybrid like that, but it's still perhaps not the most efficient with memory. But let's go back to the array and think about how this was um, and when it was efficient. And I'm just going to let Sean rest his arms while I just put this back together. So, we're back at our earlier thing. And if we think about what we had here, we said that it was really easy to add characters um at the end and to remove them at the end. And the reason why that was, and this becomes crucial for how we can make an efficient data structure to do that, the reason why that was is because there was nothing afterwards in the text. There was nothing that we cared about in the memory of the computer in our array that we wanted to keep. So we could just take a character and add it into that location to insert it at that point. No problem. We didn't have to move anything about. If we want to in something here, then it's slow. But if the gap was there, then it'd be really quick to add things. I mean, if if the gap was already there and we moved all this to this location, then we could add text here. Um, do you mean if it was split in two? if it's split in two. That's the sort of thing. The way we can get around this is to actually think about how people edit a document. When we edit a text, we tend to go to somewhere, the beginning, the end, somewhere in the middle, and add a number of characters. It's very rare that we're going to go in and just add one character. We're going to go in and add a series of characters. So, what we're going to look at is a technique, a data structure called a gap buffer. A gap buffer. Um, it's not a clothing store. Um, but it's actually a way of representing a document that specifically takes into account the way people edit them and says, well, okay, if we're going to go into a particular point and add text at that point, why don't we just move the gap to that point and then we can add the characters there and then when we go to somewhere else in the document, we'll move the gap to that point and then we can add the characters there. And so what we're basically saying is, okay, we're going to have to copy the characters, but rather than doing it every time we insert a character, let's just do it once to move the gap to where we want to type text so that we're just typing text into the gap. And then if we need to type somewhere else, well, we'll move the gap to that point. It'll take some time to do it, but again, if we think about how people use the text editor, they will move the cursor there and then they'll start typing. there's a there's a gap there where you can sort of move all the characters around and they probably won't notice. If they're an efficient typist, then they will type quickly and they would notice the delay between each character. So, we can hide the fact that we've got a copy memory based if we on the way that people used it and we can just move the memory once. So, how does that work? Well, let's suppose we've got some of the document written and I'm going to have to write things backwards. So, let's say we've got to be or and we haven't typed that and we've got that is the question already in our document. Now before we would had them all following each other, but what we do is we say we're going to edit and add at this point. So what we do is we move this gap that we've got here to that point. So we move question right down to the end of the buffer followed by the. There we are. And of course we should probably move this like a computer would. Um character by character. Oops. There we are. that. And now once we've done that, we've got this big gap between the two of them. So we can add not we don't have to move anything once we've done that. Not to B. We can add that without any problem. We can continue adding there if we wanted to. But if we decided we wanted to add some text here or let's say we wanted to yeah add some text here. Then what we do is we come to this point. We'd shuffle all this back up um before the gap. Where was it? I was going to said I was going to have It was at that point, wasn't it? That is. So, the gap is now here. And we can add in some extra questions. And this is where I'm going to have to try and come about with a word. Just pulling things out there. That is There we go. G L A R E. There we are. That is glare. the question who said Shakespeare was a good writer probably everyone having who's seen this video compared to me um but we can also delete easily from this point because we can take these away and actually we don't even need to take them away all we need to do is remember where this the gap starts um is here and we've effectively deleted them because if we want to put something else in its place we can just overwrite it in memory as we go through. So, what do we need to represent a gap buffer? Well, it's actually really, really simple. Uh, let me try and not destroy my Shakespeare and just get another fresh piece of paper. So, what do we need to do? We need the start of the buffer or we need our buffer and I'm a C programmer, so I'm going to store that with a pointer to the start of the buffer. I need to know how big it is, how many characters there are. And I need to know where in this case the start of the gap is. So the gap index in this case is 28. So we can set that to be 28. And I need to know how big the gap is. So the gap length or probably a better way of actually representing that would be the end of the gap depending on what you need. In this case, the gap is well, let's actually not store that. Let's store the end of the gap and that is um 78. In this example, I numbered the squares to make this easier. So, what we need to do is well storing the buffer, the length, and also the number of characters. So, we need the size of the buffer overall. So, that's the length of the text and the size of the buffer. That's all we need to store. And then when we add a character, let's say we're going to add a character here. So we put the L in there. We can up that it date that to be 29. Next time we come in, okay, we need to implement this. So this is going to go at um that now becomes 30 and we know where they are. And then when we move things around, we copy the characters, we update these numbers and we can represent it. Now, if we want to fetch a character out of the thing, say we want to print it on screen, again, it's very, very simple. We want to get one, we'll have an index for it. If it's less than the gap index, we can just use that as an index into our buffer. If it's bigger than the gap index, well, all we need to do is subtract the gap index from the index, add on the index where the gap ends, so the first one that's not in the gap, and return that character, and we'll get the character that we need. And so because we've taken into account how people take tend to edit it and we've looked and say well okay we'll just move the gap hence gap buffer we end up with a way of building a text editor that's really rather efficient still relatively simple to implement. I did one the other week and it was a very few lines of code but we can implement it in a way that's efficient because we're always inserting and removing characters from a point where we've got a gap. If you wanted to delete a character, dead easy. We just update the number, take one off it. So, by manipulating those numbers, you basically can work this. Is this how word processes work as well, or is this is that a bit more complicated? So, that's a really good question. Um, we've been talking about a text editor. What's the difference between a text editor and a word processor? It's probably a bit of a sliding scale. Um, text editors are the sort of things we tend to use for editing source code or just raw asy text. That's what we've been talking about here. Um, when you move into a word processor, um, then you're starting to add features which make it easier. So, they might wrap lines at the end of a a line. They might allow you to center text as you move through. You might get something like where you're adding wizzywig, what you see is what you get type features where you can set fonts. um you can change the size, you can justify or so on the text there. It's it's doing more, but effectively it's still got to store the text and edit it in exactly the same way. And there'll be word processors that use gap buffers to do that. Um Emacs, for example, the text editor that uses a gap buffer. Word doesn't. Visual Studio Code, that's another text editor. Word is a WordPress obviously. Um they don't use gap buffers. They use other things. Um there are various data structures that you can use but they're all basically based around the idea of well let's try and avoid having to copy characters whenever we don't have to. Let's sort of have a system where we can just insert and remove characters and then very quickly move around it. So you get things like piece tables or ropes which are used by various text editors to do similar things. But they're all, as we said, trying to avoid having to copy characters from one position to another as we saw right at the beginning. That buffer in the middle there, or that gap, rather, is that just there while you're doing the editing, does that gets doesn't get saved? So when you save it out, what you do is you would save out the first, in this case 29 characters and then you would save out the we got 11 characters that follow that and you sort of concatenate them together into one file that only exists in memory because it makes it easier um to manipulate the text. And that again is something that you probably want to do when you're designing data structures. think well how can I represent this so it's easy to process even if it isn't how I would store it on file or on disk or and so on so for example think about pictures we talked in the video on PGs and the whatever the vulnerability in PGs a couple of years ago you'll often find that in memory when you store an image you will make the width a certain multiple of 16 or eight or so on because it's easier to process but actually when you have it on disk you don't want to waste the space so you may not store it in the same way or you may compress it um when it's on disc. So the representation in memory that we have here is different from the way even different from as we're talking about from the way we present it to the rest of the program. The program will say get me the character at this point. What are the characters between this range and this range? And when you do that you'll remove the buffer so that you can print it on screen. Uh remove the gap sorry so you can print it on screen uh and so on. It's really a case of minding the gap when you're programming. even more artifacts, right? So, it's it's sort of for robustness. Now what's happened is the app has gone in cropped the image down to a much smaller one which takes up less space in memory and then overwritten it onto this file but it's left the rest of this file intact because it's not truncated the file.

Transcript for:Text Editor Data Structures

Transcript for:
Text Editor Data Structures