Hello everyone, and welcome to yet another recreational programming session with Mr. Zin. This is going to be an offline session, because I don't anticipate it being very long: I just want to quickly check some ideas, so I don't think it's worth starting up an entire online stream. On the previous sessions we implemented an algorithm called byte pair encoding. It's actually a very simple algorithm: you have a text, you pick the most frequent pair of tokens, you replace it with an unused byte, and then you repeat the process, compressing the text until you can't compress any more, and you end up with a sort of replacement table. This kind of algorithm is used for deriving tokenizers for large language models and stuff like that. So we implemented it, and then we tried to parallelize it, because effectively you can split the text into several chunks, process those chunks in separate threads, and then merge the results of the processing together, sort of in a map-reduce style, as somebody said in the comment section of the previous video. But the problem is that I tried to apply this improvement to a big text that I have, which is the log of our Discord server; it's actually 33 megabytes. So I have the Discord logs here, 33 megabytes, and it still takes very long. I couldn't wait until the end: after several hours it was not even close to being processed, even though it was an improvement, I think roughly a 2x improvement. We can actually measure it. I have a tool called txt2bpe, and I have some prepared command line arguments here, so we can take a look at the arguments. I improved it a bit: I made a lot of parameters customizable through the command line utility and stuff like that.
So essentially, what we have to do is provide, first, the input file that we're going to be compressing, which is going to be the Discord log, and then the output directory. The reason why it's an output directory and not a file, as in the previous session, is that I implemented it so it snapshots the current state of the process every few iterations; you can customize how frequently it does the snapshot. Since the process is rather long, maybe several hours, you may want to periodically check how the current state is doing: look at the currently identified tokens, how much the text compressed, and stuff like that. So every few iterations it just dumps its current state, and while the iterations are going you can check it out and see if it's going in the right direction. Also, let's reduce the number of iterations, because otherwise it's going to go for a very long time, over a hundred thousand iterations, which would take several hours. Let's limit the number of iterations to 100, and let's try to run it. Maybe I'm going to go ahead and disable the threading mode, just to see how slowly it goes without any threading. Let's go. So this is going to be 100 iterations, and it's going to be dumping its state every 10 iterations. This is the first iteration, and essentially one iteration takes half a second, which is rather slow, because during that single iteration it needs to iterate through all of the tokens of the text, identify all of the frequent pairs, then replace them, and so on and so forth. So it's pretty slow. It's actually getting closer to 100; we're already at 60, so as soon as it reaches 100 we're pretty much done. Let's just wait a little bit more.
So my CPU is actually spinning up; I can hear the fans. And we're done. Here we have snapshots of the state for every 10th iteration, and it's the single-threaded run; it's actually marked as single-threaded. Now I'm going to enable... actually, maybe it makes sense to save the log. I didn't think about that; this is actually a very cool idea. What if we also save the log, so we can see what is going on? This is going to be discord_log_st.txt, something like that. I really like that idea. Now let's go to txt2bpe, enable threading, and see how much better it goes. It restarted, so let's wait a little bit. As you can see, now the time of a single iteration is around 170 milliseconds. With the single-threaded mode it was half a second; now it's around 150 to 170. It's an improvement, roughly 2x, and it's already finished, by the way. It is definitely an improvement, but it's not that big of an improvement; it's still very slow, because we're also deleting and recreating the hash tables over and over again, which is kind of a painful process. It would be cool to find a way to not recreate the hash table over and over again; maybe that would speed it up. Okay, so let's mark it as multi-threaded, so we have a single-threaded log and a multi-threaded log. Let's save the log as well; this is going to be discord_log_mt.txt. There we go, so we have this log too. After the session where we did the multi-threaded implementation, there was a discussion in our Discord server; let me actually find the message. Let me quickly open it up; Discord is insanely slow. So, Isel Gray suggested a pretty cool idea, I think: maybe rebuilding the frequency table is kind of a waste of time. You could, in the step
of replacing the most frequent pair (for instance, replacing BC with Z in "A B C D"), modify the hash table instead: remove an appearance of AB and of CD for each substitution, and add AZ and ZD to the hash table. This is actually a pretty cool idea. Essentially, you do not recompute the hash table over and over again: if you identified the pair that you want to replace, you update the table in place; that's all you do. But this approach is not particularly parallelizable, I think. I don't see how you can split this work into several chunks and put it on several threads. But maybe that's not even needed; maybe it's going to be faster than the multi-threaded solution anyway. You don't always need to parallelize; some problems are just not parallelizable like that, especially when the result you're currently computing depends on all of the previous results, which I think is the case here. So anyway, let's give it a try and see if it improves anything. So, txt2bpe. Let's disable the threading; I'm not really sure if it's important to disable the threading or not. Okay, so here's the single iteration. Within a single iteration, first of all we free the hash table, but now we won't have to free the hash table anymore; and then, depending on whether the threading is enabled or not, we collect the frequencies of the pairs either in multi-threaded mode or in single-threaded mode. It doesn't really matter which one is enabled here, so we might as well disable this thing. Yeah, we disabled it, so we're going to be using the single-threaded mode. We only have to do that once; that's kind of the whole point. This thing that we've been doing on each iteration we only need to do once, at the beginning of the process, so we don't have to clean anything.
All right, so we compute it once, before starting everything. Then we iterate through the hash table and find the most frequent pair; that's what we're doing here. Then, if the frequency of the most frequent pair is basically one, we stop; or actually, I recently added that you can stop the process with a custom termination pair frequency, so maybe you stop at five, and it's not going to go for too long. I haven't used it yet, but I thought maybe it's a good idea to have that option. Then you add the most frequent pair to the pair table, and then you start the replacement, and this is where we have to do all of that stuff. The problem with this algorithm, even though on paper it sounds kind of cool, is that in reality it's rather complicated, because we're iterating the pairs from left to right. That means that to the left we have things that we already replaced, and to the right we have things that we have not replaced yet. The way we replace is this: we take the input tokens (this is our array of tokens), iterate from left to right, and append whatever we want to append to the output tokens, and if we are about to replace some pair, we just replace the pair with the new token there. So, say we have something like "A B C D"; this is in the input tokens, and here we have the output tokens. We encounter A; A does not start the most frequent pair, so I just append A. Then I encounter BC; let's assume BC is the pair that I want to replace. Because of that, I have to subtract the frequency not of the pair (A, B) as it is in the input, but of the pair formed by B and whatever we have last in the output, because maybe we already replaced something there. So essentially, whenever in this entire process we look into the past, we must look into the output, but whenever we look into the future, we have to look into the input. So
we have to be super careful here. The reason why I didn't implement this algorithm right away is because of how finicky it feels: a lot of corner cases. I absolutely hate this kind of algorithm, where there are lots of small little corner cases, but unfortunately these algorithms are usually the most performant ones; the most performant algorithms are the ones that are the most annoying to implement. But maybe, if we're going to win a lot of time by using this algorithm, it's going to be worth it. Anyway, let's take a look at the different cases here, and what's funny is that we already have a bunch of cases: here's one case, here's another one, and here's the third one, so we have three different cases. What's the first case? When i + 1 reaches the number of tokens we have, we just append the current token to the output and continue. This is basically checking for the last token: are we looking at the last token? If we're looking at the last token, there's nothing to replace, because you don't have a pair there. We identify a pair by looking into the future by one token; that's how we identify it. If you're at the last token, there's nothing in the future, there's no pair, so we just append it to the output and we're done. In fact, we could have written this entire thing something like this, un-nested, and it would have made a little bit more sense, because after doing this kind of thing you're going to exit anyway. But what I did is I essentially let this condition exit the entire thing, so I'm not really sure if it's a good idea to do it like that, but let's just keep it as it is. Okay, so the else branch is when you do have some token in the future,
and there you go: here is the pair that we're looking at. Essentially we're comparing that specific pair with the most frequent one. Maybe it would be cool to factor this out; let's actually factor this entire thing out into something like max_pair. There we go: here is max_pair, and now we can do something like "if the pair that we found is equal to max_pair". This is the branch we have to modify now. If it is not the max pair, we just append the current token and move on; I don't think we need to modify anything there. So this is the situation when we do have a pair but it's not the pair we're looking for, and this other situation is when we don't have a pair at all. I think that makes sense. Anyway, let's bring this example over here; I think it's going to be very useful. We're essentially looking at this kind of situation. First of all, we need to subtract the frequency of the pair (A, B), but we have to take A from the output, not from the input. But here's an interesting thing: what if we don't have anything in the past? We need to specifically check for that case. When don't we have anything in the past? When i is equal to zero, that's one way to check, or when the count of the output tokens is equal to zero, because we didn't append anything there; those are two equivalent conditions. So here we can say the following: if you have something in the output, that means you have something in the past to look at, and then we already have a pair that we can utilize. The left token of that pair is going to be the last token of the output, so it's tokens_out.items at count minus
one. So this is the left one, and the right one is going to be the current one from the input tokens. There we go: here is our pair. Now we need to look up that pair in the frequency table, with something like hmgeti; we need some variable to hold the index of the bucket, so here's the lookup, and we provide the pair. But here's an interesting thing: will that pair exist? If it occurs in the output, that means it does exist in the table, so we can actually assert that place is greater than or equal to zero, because if it isn't, that means we made a mistake somewhere; in this particular situation the pair must exist in the table. Okay, so after that we just take this specific entry and decrement its counter. What's interesting is that since the pair exists in the frequency table, its counter has to be greater than zero, so this is another condition we can double check; again, if it's not the case, that means we did something wrong somewhere earlier in the algorithm. Since it's a very finicky algorithm, I think it is very important to put as many asserts here as possible, to make sure we don't end up in an inconsistent state. So anyway, we processed this part: we essentially subtracted the (A, B) pair. Now, since we're going to be replacing BC with Z, we need to increment the pair (A, Z), and that means we need to create a new pair here. The pair is going to be: left equal to the same A as before, and right equal to the maximum token, which is basically the count of the pairs minus one; it's this one. And essentially, we even append it into
the output here, so we can basically take this expression. Actually, maybe it makes sense to factor it out into a separate variable, honestly, because it's kind of dangerous. Let me quickly do that; I'm going to put it somewhere here and call it something like max_token. There we go: max_token. Or we can actually compute it before we append anything to the pairs and remove the minus one, because before the append it's essentially the count itself minus nothing. Yeah, I think that's even better. So this is the maximum token, and this other occurrence becomes max_token as well; that's pretty cool. And honestly, since the left one stays the same, maybe it makes sense not to set it twice: we just set left once and only modify the right one. That makes sense. Anyway, now let's look up this new pair, and this one, by the way, might be missing, because we just introduced a new token, so the hash table may not have that pair yet. That means you may have a situation where place is actually negative; there's no such entry. In that case, I think what we have to do is add a new entry into the hash table, hmput(frequencies, pair, 1), and since it's the first time we encounter that pair, its frequency is one. And if you do have that pair, because you already created it before, you just increment it: the value at place is incremented by one. In this particular case we don't have to worry whether it's greater than zero or not; we're just incrementing it if it exists. All right, so we basically processed this case, but there is a lot more: now we have to decrement the max pair itself, then we have to decrement the pair on the right, and then we have to increment the new pair on the right. There are a lot of cases
here. As you can see, that's why I didn't want to implement this algorithm: just so many freaking cases. Anyway, here we need to look up the maximum pair. We already kind of have this machinery, so, place... let's actually declare this variable somewhere up here and keep reusing it throughout the body. So: hmgeti on the frequencies with max_pair. Since it's the most frequent pair, it must exist in the frequency table, so we assert that place is greater than or equal to zero. There we go. And also, since it exists, its frequency has to be greater than zero, because we're going to be decrementing it. Maybe another thing we want to do here: if this counter becomes equal to zero, we could delete the entry, so as not to consume too much memory and so the memory can be reused. But I'm not really sure, so you know what, I'm just going to put a TODO here, because maybe it's not going to grow too much. We'll see. All right, so we decrement this entire thing; everything's Gucci. Okay, everything seems to be Tamagotchi. After that we increment the counter for the new pair, we append that Z to the output, and we advance the cursor by two. So I suppose after that we have to work on the right part of the algorithm: we have to decrement the (C, D) pair and then increment the pair (Z, D). That's what we have to do. But here's the thing: you may actually have a situation where, after advancing the cursor, you go outside of the input, and in that case you don't really have any pair in the future to process. It's a similar situation to the past case over here, but now it's in the future. So
it's kind of similar to that. So I suppose what we have to do here is first check whether i is still within tokens_in.count, and only then consider this entire thing. Every time we consider the past or the future, we first check: do we even have a past, or do we even have a future? I think that's important. So let me bring this in... man, that's a lot of code already for this small freaking thing, goddammit. And we haven't even... this is only half of the idea. It's a very small description of an idea, and this is how much code I have to write to make sure that everything is correct. And of course the AI bros are going to say "oh, just put that into an LLM, it's going to write that code for you". Anyway, so we have this situation: we know for sure that we have a pair in the future. What kind of pair do we have here? The left one is tokens_in.items[i - 1], and the right one is going to be the current one. We have to subtract it, and what's funny is that subtracting this entire thing is kind of similar: it's basically this same code. You get the pair, you check that place is greater than or equal to zero and the frequency is greater than zero, and then you decrement it, and it's the same regardless of what you're doing here. So we might as well just copy-paste this entire thing and put it here; this is the decrement. And the increment: the left one becomes the last token of the output, tokens_out.items[count - 1] (that's the Z we just appended), and the right one is going to be the same again, I think; the right one kind of stays the same.
So it's only the left one that changes. Maybe we can do something similar here: the right one stays the same, the left one keeps changing, and for this one we just have to increment it. In that specific case I can copy-paste this code. There's a lot of code here, but it's actually kind of repetitive if you think about it. Hmm, this line is not one of them; I don't know why I copy-pasted that. So, I think that's it. Did I miss anything? This is for the left part, this is for the middle part, and this is for the right part; that's what it is. And we don't do hmfree anywhere; we just keep reusing the table over and over again. Okay, let me try to compile the whole thing... and why did it compile first try? Okay, let's give it a try; maybe it's going to be completely bogus, actually. Bogus amogus. Let's start with something simple. Let me save this specific command first; I'm going to start with something much simpler: txt2bpe, where the input file is going to be a small paragraph from Wikipedia from the data directory, and the output is going to have the same name, wikipedia.bpe, something like that. Let's go. Okay, so it completely failed, instantaneously. That's very interesting: something didn't exist in here at all. Did we not append things properly? That's kind of weird, absolutely weird, because you're telling me this pair doesn't exist right away? The hmgeti and the assert on place, it's right here. Okay, let me take a look at this: what's the pair we're talking about? Let's print something like pair.l and pair.r with a newline, and let's just abort here. Let's also print the location, because the assert prints the location and I don't really want to lose that information: this is going to be the file, and this is going to be the
line, via __LINE__, like so. And what do we have here... oh yeah, there's this thing where, if the output folder you provided already exists, it tells you that you may lose some data, so you either rename it or delete it. This actually happened to me several times, so I created this sort of safety mechanism. All right, 256... and yeah, this one is kind of weird; what is going on here? Maybe something wrong happened here as well. So this is the max pair, and we do not modify it. Yeah, this is exactly what I'm talking about: it's too finicky, it's too goddamn finicky. So this is the last one, and this one we just increment. I wonder how many iterations have happened so far; we can maybe print i and see what the number of the iteration is. And of course, let's not forget to remove this afterwards. So what do we have... yeah, we went pretty far with this thing, rather far, and we still don't have this pair. Where did I mess it up? I'm not sure. At some point it simply doesn't exist. It's something stupid, I'm sure. This is why I hate this kind of algorithm: it's super easy for me to forget some small little detail, even though the entirety of the thing kind of makes sense. Let me maybe rearrange it like this; I think it's going to be a little bit easier for me to read. So this is the left one, this is the right one, and this is the current one, plus one. Okay, so if you have something in the past, we just take the last value of the output, and the right one is the current one; we query it, we make sure it exists (that's what this assert is about), and if it's greater than zero we decrement it. So now the
right one is the maximum token. Is max token the correct thing here? It's the pairs count before we appended the new pair, so that makes sense. Now we're looking it up, and if it doesn't exist we add that pair with frequency one; if it does exist, we just increment it by one. Okay. So here we're looking up the maximum pair. You know what I'm going to do: I'm going to replace this variable with something like max_pair, so that the code responsible for querying the thing and decrementing it looks the same in both places, because here we do pair and here we do max_pair, and they don't correspond to the same thing. All right, so this one is fine, this one is fine, and what we do instead of appending the original token is append the max token and skip two positions. Okay, so for the right one it's the current input token, and for the left one it's the previous input token, because that's the one we're going to be decrementing; makes sense. And for this one, the left is the one we just appended into the output tokens. And yeah... bro, I don't see the mistake. I don't see the mistake! Anyway, let me see. Maybe we could start with something goddamn simple. Let's have a small text in here; it's going to be "abcbcd", something like that. Can you just compress at least that? Debugging this kind of algorithm is just a nightmare, at least for me. Maybe there are smart people out there who can maintain a super complex mental model of things, but for me, if there are too many small corner cases and details, my brain just melts down and I can't keep track of them, unfortunately, which is why I actively try to avoid this kind of algorithm. Okay, it's actually kind of cool that we managed to reproduce the failure on such a small input, because now I can do a lot of cool things. For instance, I can maybe print the ins and
outs, the input and output tokens, on every single iteration; that's something I can do. So, on each iteration, something like for (size_t j = 0; j < tokens_in.count; ++j), and in there I just print each individual tokens_in.items[j], and after all of that we put a newline. Maybe it makes sense to put curly braces here, but I want to keep it a single line so it's super easy for me to comment out. Might as well do a printf saying this is the input we're dealing with. All right, and I need to not forget to delete all of this every single time. Okay, so it went through three iterations, but that's not really that interesting, because what we're interested in is the output. Okay, so this is the output, and here is the output print. All right, so what do we have... how do you even append here? Of course, yeah, it's a safety mechanism; it's kind of annoying sometimes, but I don't know what would be better. Okay, and then something went horribly wrong right here. Look at that: this is the most frequent pair, 98 99, because there are two of them, and it managed to successfully replace the first one, but then it failed afterwards, and it failed precisely here. That is rather interesting. On top of this, we can also print the hash table; who said we can't do that? We can just check the hash table itself. Let's do something like this, maybe on a new line: for (size_t i = 0; i < hmlen(frequencies); ++i), and here each entry is going to be freq[i].key.l, freq[i].key.r, and freq[i].value; the left and right actually have to go through the key, and the index is i.
There we go. And don't forget the newlines. All right, so it garbled some of the output, but that's because it's a ptrdiff_t and I used the wrong format specifier. And let's not forget to remove the output folder. Let's go. Okay, so what have we got? This part is reasonable. After we inserted this thing, did we increment this part? Yes; in fact, here it is: (97, 256)... (98, 99) worked beautifully, it decremented by one, that's perfect. Then (97, 256) was incremented by one, but I don't see the second one, (256, 99). This is where something went wrong, and that means the bug is somewhere here. I see... the condition has to be the opposite! Okay, goddammit. Well, I'm glad I managed to find it relatively quickly, actually. Can I now maybe turn off some of this debug output? Because it's a lot of output, and if I try to run it on a big amount of data it's going to blow up. Can I just wrap it like that and comment it out... so, #endif. Let's go. And of course we still want to decrement the whole thing. Let's go. Okay, so it was two iterations; we didn't provide a maximum number of iterations, but apart from that everything seems to be fine. Let me see: we can explore the BPE table. As far as I know, I have a utility, bpe_inspect, and I can give it the .bpe file as input. What do we have here? From 0 to 255 it's basically just ASCII, all of the ASCII plus some of the unused characters, but that doesn't matter, and the only frequent pair it found is "bc", which is fair. That's basically what you're getting here. So this is the in-place version. I would love to test this against the previous algorithm, but I don't have a way to quickly switch between them. Anyway, here's the Discord log, so we can try to run it on the Discord
log and see if it makes sense there. Okay, let me give it a try. Where is the temporary thing... and I'm just running it. Is it faster? Wait, I think it is faster, just a second. We save this entire run with the suffix "ip", for in-place, and let me compare the logs; I think the logs are kind of important here. So here we have discord_log_ip.txt; here is the log. Okay, so this is the multi-threaded version, and yeah, the in-place one is faster. Look at that: the multi-threaded version is at 0.15 seconds per iteration, and this one is below that. But just a second: we need to confirm that the result at least makes sense, because maybe it just produced garbage; maybe it literally broke the text, who knows. Let me try bpe_inspect on the final result. How many iterations do we have here? 100 iterations, that's pretty cool. So: bpe_inspect on the Discord multi-threaded snapshot at iteration 100. What do we have? Okay, it kind of makes sense: there are tokens that we identified, like "ing", things that repeat way too many times. Okay, let's try that on the in-place one; does it make sense? Okay, so there are pretty much the same tokens; it's just faster. And the fact that bpe_inspect managed to load it up means that the indices are in a correct state, because bpe_inspect actually does a bit of verification when it loads this kind of file. Let me show you: when it loads the pairs, look how much verification it does. It verifies that the size of the file is correct, it verifies the prelude (basically the entries from 0 to 255), and it makes sure that the indices within each pair point at the right thing; they don't overflow anywhere.
If anything goes wrong in here, it will actually fail immediately — there are a lot of checks in here — so the fact that it manages to load the file means it at least passes the sanity check. Now we can actually do something interesting here: we can run sha256sum on all of the snapshot files in here and just compute all of that. Might as well redirect it into a file, so we save all of the hashes separately; then we copy that file over to the in-place results and make sure it produces the same hashes. Alright — sha256sum... well, this is not what I wanted, I actually wanted to do -c, check mode. Oh well. Okay, it was fine up until... can we sort this somehow? Is it possible for me to sort the lines? Yeah, it sorted them by hash; I actually want to sort by file name, but I don't think that's super easy for me to do right now. So what is the smallest iteration where it failed? I think it failed at iteration 70 — until 70 it was kind of fine, and then it failed somewhere there. Damn. But it still makes sense, right? If we look at what kind of tokens are stored in the BPE table, they do make sense. So if I do bpe_inspect and output this stuff — discord-log-ip, the 100.bpe — into a .txt, and then take a look at the mt one too, I have these two files, the outputs of the inspection, and we can just do a textual diff between them, just to see the difference between the tokens in there. Might as well put this thing in here: this one is mt, and then ip. Oh — "e", "ec"... ah well, it's the same tokens, but just in a different — ah, wait a freaking second. Wait a freaking second. The order of iter—
I completely forgot that hash tables are unordered containers — they are not ordered containers. And one of the things we do in this thing is iterate over the hash table to find the maximum frequency. Since the multithreaded version and the in-place version work with the hash table differently — in the multithreaded one we constantly recreate it from scratch, and in the in-place one we reuse the same hash table — the order in which we iterate over them is going to be different, and you may actually have several most-frequent pairs tied, so depending on the order you'll actually pick a different one. Damn. Okay, I really need a way to switch between the multithreaded version and the in-place version, so I can check a lot of hypotheses in here. Do I have a #define somewhere... yeah, so let's introduce something like an IN_PLACE macro, which is going to be responsible for this kind of thing. Alright, so this is the thing — this is the first computation. If IN_PLACE is defined, we do this single precomputation in here and that's it; if IN_PLACE is not enabled, we redo this recomputation on every iteration. By the way, let me mark this thing as ENABLE_THREADS and this one as IN_PLACE. So this one runs when IN_PLACE is not enabled, and this one is under ENABLE_THREADS. There we go. This part is the same regardless of whether we're doing multithreaded or in-place, so all of that stuff stays shared. And in here: #ifdef IN_PLACE, we just do that; otherwise we have to do something else — so this is going to be #endif, and this one is going to be #else, also for IN_PLACE. So now I need to recover the original algorithm, but without the in-place machinery.
place shisu so uh we don't do any of that stuff we don't do any of that stuff in fact we didn't do that uh we didn't do that we didn't do that yeah so I think that was the original algorithm actually I think that was the original algorithm so now I should be able to uh recreate this kind of stuff okay so I do have in place let's disable in place so now I want to do txt to BP uh and where I'm going to save that I'm going to be saving it into mt2 uh right here right so because I want to make sure that I managed to recreate multi threadit algorithm correctly uh hopefully I did so we might as well also I I think I did because it's slow uh right so ahuh I think this is a single threaded algorithm right so let's kill this thing let's kill this thing because I disabled threading enable threads right so let's just quickly enable threads uh and I'm going to do mt2 yeah one more time one more time yeah so it's faster and it's multi thread we can take a look at b top uh it utilizes all of the threads yeah yeah so it utilizes all the threads I can see that already so that's pretty cool uh okay so the reason why I want to do that is just like I want to put like hashes into this folder and just see if it's the same so I just want to make sure that I recreated it correctly uh so this is the second one I copy copy paste sha in here so this is what we expect in terms of the State uh sha 256 all right all right all right okay okay that's cool so uh now I'm going to enable in place back uh I'm enable in place back but I'm disable also the threats so uh what do we want to do in here how can we make the order of iterating the frequency table basically predictable I have the following idea so essentially we can use when we finding a maximum finding a maximum we using frequency not only as key right we can use the key of the frequency table as also the thing that weating with like how do I say that right we we taking the maximum between the frequencies and the key right so introducing 
Essentially, you have a situation where two frequencies can be equal to each other. Right now we only check whether one is greater or not, but when they're equal we need to compare the keys in here, and for that we can actually use memcmp. So we take this thing — uh-huh, take this thing — and let me maybe put this stuff in here; we have to take a pointer in here. And if the frequencies are equal, we take the maximum between the keys: only if this key is greater than the max key do we reassign it like that. Alright, that will hopefully make it a little bit more consistent. So I'm going to remove mt2, because I actually want to rerun this stuff with the multithreaded version — I disabled multithreading, but I think I should try it one more time, because it's going to create a completely different order. It's going to create a completely different order. Ah — memcmp: too few arguments. I also have to provide the size of the value in here, I completely forgot about that; it has to be the size. Alright, so let's actually recompute all of these things. Damn, it's pretty slow compared to in-place, actually — pretty slow. Okay, so this is the second run, and if I put in the hashes from the run where we didn't compare with the keys, some of them should fail — I expect them to fail... actually, none of them failed. Huh, that's pretty surprising. Okay, so accidentally that may happen; I don't see any problem with that. Okay, so now: enabling IN_PLACE, disabling the threads, and now we're doing the same thing but saving into ip2. Okay — damn, that's much faster. It's kind of surprising, right?
I remember reading about this in CLRS — Introduction to Algorithms. Essentially, even in the intro of that book they say that algorithms are a technology, just like computers, because a better algorithm will perform better on a shoddy computer than a shoddy algorithm on a supercomputer. And they even provide numbers — it's actually shown mathematically that this can be the case: an O(n log n) algorithm on a shoddy computer is always going to beat an exponential algorithm on a supercomputer; it's just in the numbers. And there you go: a single core is beating 16 cores, just with a better algorithm that does less work. So yeah — basically computer science 101, computer science 101. People quite often forget about that, especially in the corporate world: they think they can just throw more compute at a problem, when in reality you can just have a better algorithm and beat the supercomputer — DeepSeek, by the way, just saying. So anyway, we recomputed everything in here, and I'm going to open this thing, take the hashes, run sha256sum, and verify that everything... no, it didn't — it still fails. That's bizarre. Did I do a whoopsie? I think I did. So, memcmp: "if this thing is greater, or the values are equal, and then the value..." — of course, god freaking damn it: the keys! I even said that I wanted to compare keys, but I compared the values anyway. That's why the hashes in mt2 were the same. I should have taken that as a warning sign — I should have taken it as a warning sign. Anyway — yeah, kill all of that stuff. Enable ENABLE_THREADS, don't enable IN_PLACE, and let's do mt2. There we go — multithreaded. Okay, waiting a little bit, just waiting... I've run out of tea, and I've already been streaming for almost an hour, by the way. "Quickly check an idea", "quickly check an idea" — a freaking
classic, freaking classic. Okay, so: mt2 against the hashes from mt1. So yeah, this is the one — some of them must be broken now, some of them must be broken, please, please be broken... no. Okay, again — accidentally that may happen, sure, sure. So now let's remove ip2 — rm ip2 — and then txt2bpe; we need to go here, this is in-place, so I'm enabling this kind of stuff back. So: ip2, let's go. Mhm — damn, so fast. Alright, so let's go here — yeah. I wonder: will it finish within the session if we remove the limitation? Because if you don't provide the maximum iteration count, it will try to compress as much as possible until the process stops. So let's actually see how it goes. Actually, there's something interesting: it starts at 0.7, and then as it continues it kind of speeds up. I think it literally speeds up — ah well, that kind of makes sense, because the input is shrinking, so you have to iterate over less on each iteration. So it's going to keep speeding up over time — look at that, it's slowly speeding up. It's also probably consuming a lot of my disk space; just a second, let's actually kill it, because it's dumping — yeah, every token file is like 100 megabytes, and if you take a look at the size of this entire thing, it's already... okay, let's not do that. Let's not freaking do that. Though we could dump, say, once per thousand iterations: if you take a look at the help, there is a dump frequency option — right now it's 10, and we can set the dump frequency to every thousand. You can't really see what I'm doing, but just a second: dump frequency is going to be equal to 1,000, and I'm going to put it in here. Alright, uh-huh — so it dumped once, and then it doesn't dump again; it keeps recomputing.
It's going to only dump every thousand iterations, which I think is a little bit more reasonable. So let's just wait a thousand iterations, and then we'll see. Damn, this is actually much faster than it was going when I was training it off-screen — this is much better, so I think that's the way to go. That's actually the way to go. Okay, so as you can see, at the thousandth iteration it just generated this thing, and while it's training we can already take a look at the thousand-iteration snapshot. So: bpe_inspect on discord-log, 1000.bpe — it's probably better to run it like this. Yeah, what do we have in here? There we go — we already have Discord emotes; it identified some of the Discord emotes as tokens, because they actually occur frequently within the Discord log. Let's go, let's go! And yeah, I can see that "sure" is the most frequent word in the Discord log, which makes sense — that actually makes a lot of sense, I really like that. It's very Discord-ish, very Discord-ish. Alright, so let's stop the entire process, and I guess I'm going to finish the session here. I want to compress the Discord log to the limit, until the process fully stops, and then acquire the BPE table for the entirety of the Discord logs of our server. And then we're going to try to check an interesting hypothesis that I actually stated at the beginning of this series: that this BPE table kind of infers an implicit grammar out of the text, and using that grammar you can actually generate gibberish — essentially, you can use this BPE table as a Markov chain. It's not going to be an LLM or artificial intelligence or anything, but it's kind of interesting that just the tokenizer by itself already contains a little bit of the knowledge that was extracted from the training set in an unsupervised fashion. I think that's just kind of cool and kind of interesting.
So that's basically the goal of this series. Okay, so I'm going to go compress some text, and then on the next session we're going to try to generate some gibberish. Sounds good, sounds good. Thanks everyone for watching, have a good one, and I'll see you all on the next recreational programming session, as usual. I love you.