Welcome everyone. This is CS336, Language Models from Scratch, and this is our core staff. I'm Percy, one of your instructors. I'm really excited about this class because it lets you see the whole language-model building pipeline end to end, including data, systems, and modeling. Tatsu will be co-teaching with me, so I'll let everyone introduce themselves.

Hi everyone, I'm Tatsu, one of the co-instructors. I'll be giving lectures in a few weeks. I'm really excited about this class. Percy and I spent a while being a little disgruntled, thinking about what really deep technical stuff we can teach our students today, and I think one of the key things is that you have to build it from scratch to understand it. I'm hoping that's the ethos you take away from our class.

As Tatsu mentioned, this is the second time we're teaching the class. We've grown the class by around 50%, and one big change is that we're putting all the lectures on YouTube so the world can learn how to build language models from scratch.

Okay, so why did we decide to make this course and endure all the pain? Let's ask GPT-4. If you ask it why teach a course on building language models from scratch, the reply is that such a course provides a foundational understanding of techniques, fosters innovation, and so on — the typical generic blather. So here's the real reason: we're in a bit of a crisis. Researchers are becoming more and more disconnected from the underlying technology. Eight years ago, researchers in AI would implement and train their own models. Even six years ago, you would at least take models like BERT, download them, and fine-tune them. Now many people can get away with just prompting a proprietary model.

This is not necessarily bad, because as you introduce these layers of abstraction we can all do more, and a lot of research has been unlocked by the simplicity of being able to prompt a language model — I do my share of prompting, so there's nothing wrong with that. But remember that these abstractions are leaky. In contrast to programming languages or operating systems, you don't really understand what the abstraction is — it's a string in and a string out, I guess. And there's still a lot of fundamental research to be done that requires tearing up the stack and co-designing different aspects of the data, the systems, and the model. A full understanding of this technology is necessary for fundamental research. That's why this class exists: we want to enable that fundamental research to continue, and our philosophy is that to understand it, you have to build it.

There's one small problem, and that's the industrialization of language models. GPT-4 is rumored to have 1.8 trillion parameters and to have cost $100 million to train. You have xAI building clusters with 200,000 H100s, if you can imagine that. There's supposedly an investment of over $500 billion over four years. These are pretty large numbers. Furthermore, there are no public details on how these models are built. Here's GPT-4 — this is from two years ago — where they very honestly say that due to the competitive landscape and the safety implications, they will disclose no details. So that's the state of the world right now, and in some sense frontier models are out of reach for us. If you came into this class thinking you're each going to train your own GPT-4 — sorry. We're going to build small language models, but the problem is that these might not be representative of what happens at the frontier. Here are two examples to illustrate why.

Here's a simple one. If you look at the fraction of FLOPs spent in the attention layers of a transformer versus the MLP layers, it changes quite a bit with scale. This is from a tweet by Stephen Roller from quite a few years ago, but it's still true: for small models, the FLOPs in the attention and MLP layers look roughly comparable, but if you go up to 175 billion parameters, the MLPs really dominate. Why does this matter? If you spend a lot of time at small scale optimizing attention, you might be optimizing the wrong thing, because at larger scale it gets washed out. This is a simple example because you can literally make this plot without any compute — it's napkin math.
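To make that napkin math concrete, here is one way to do the per-token, per-layer FLOPs count. This is a rough sketch under simplifying assumptions (a standard dense transformer with a 4x MLP, counting only the big matrix multiplies and the attention score/weighted-sum terms; the model sizes and context lengths below are illustrative, not the exact configurations in the plot):

```python
# Rough per-token, per-layer FLOPs for a dense transformer; accounting conventions vary.
def per_token_layer_flops(d_model: int, seq_len: int, ffw_mult: int = 4):
    attn_proj = 2 * 4 * d_model * d_model          # Q, K, V, output projections (2 FLOPs per multiply-add)
    attn_scores = 2 * 2 * seq_len * d_model        # QK^T scores plus the weighted sum over the context
    mlp = 2 * 2 * d_model * (ffw_mult * d_model)   # up-projection and down-projection
    return attn_proj + attn_scores, mlp

# Illustrative small vs. large configurations (roughly GPT-2-small-ish vs. GPT-3-ish widths).
for name, d, L in [("small", 768, 1024), ("large", 12288, 2048)]:
    attn, mlp = per_token_layer_flops(d, L)
    print(f"{name}: attention ≈ {attn:.2e}, MLP ≈ {mlp:.2e}, attention/MLP ≈ {attn / mlp:.2f}")
```

The d²-terms (especially the MLP) take over as the model grows, while the sequence-length-dependent part of attention shrinks in relative terms — which is the point of the plot.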
Here's something that's a little harder to grapple with: emergent behavior. This is from a paper by Jason Wei from 2022. The plot shows that as you increase the amount of training FLOPs and look at accuracy on a bunch of tasks, for a while it looks like nothing is happening, and then all of a sudden you get the emergence of various phenomena, like in-context learning. If you were hanging around at the small scale, you would have concluded that these language models really don't work, when in fact you had to scale up to get that behavior.

So don't despair — we can still learn something in this class — but we have to be very precise about what we're learning. There are three types of knowledge. First, the mechanics of how things work. This we can teach you: we can teach you what a transformer is (you'll implement one), and we can teach you how model parallelism leverages GPUs efficiently. These are the raw ingredients. Second, we can teach you mindset. This is more subtle and maybe sounds a little fuzzy, but it's in some ways more important: the mindset we're going to take is that we want to squeeze as much out of the hardware as possible and take scaling seriously. In some sense the mechanics — all of those ingredients — have been around for a while, but it was really the scaling mindset that OpenAI pioneered that led to this generation of AI models. So hopefully we can bang that way of thinking into you. Third is intuitions: which data and modeling decisions lead to good models. This, unfortunately, we can only partially teach you, because the architectures and datasets that work at small scale might not be the same ones that work at large scale. But you get two and a half out of three, which is pretty good bang for your buck.

Speaking of intuitions, there's a somewhat sad reality here: you can tell a lot of stories about why certain things in the transformer are the way they are, but sometimes you just do the experiments and the experiments speak.
For example, there's the Noam Shazeer paper that introduced SwiGLU, a type of nonlinearity we'll see more of in this class. The results are quite good and it got adopted, but in the conclusion there's this honest statement that they offer no explanation and attribute the success to divine benevolence. So there you go — that's the extent of our understanding.

Okay, now let's talk about the bitter lesson, which I'm sure people have heard about. I think there's a misconception that the bitter lesson means scale is all that matters, algorithms don't matter, and all you do is pump more capital into building the model and you're good to go. That couldn't be further from the truth. The right interpretation is that algorithms at scale are what matter, because at the end of the day the accuracy of your model is a product of your efficiency and the resources you put in. Efficiency, if you think about it, is way more important at larger scale: if you're spending hundreds of millions of dollars, you cannot afford to be wasteful in the same way you can when running a job on your local cluster, where you might fail, debug it, and run it again. If you look at actual utilization, I'm sure OpenAI is way more efficient than any of us right now.

Furthermore — and this point is maybe not as well appreciated in the scaling rhetoric — efficiency is a combination of hardware and algorithms, and if you just look at algorithmic efficiency, there's a nice OpenAI paper from 2020 showing that over the period 2012 to 2019 there was a 44x algorithmic efficiency improvement in the compute it took to train an ImageNet model to a certain level of accuracy. That's huge — as the abstract points out, it's faster than Moore's law. So algorithms do matter: without those efficiency gains you would be paying 44 times the cost. That result is for image models, but there are similar results for language as well.

With all that, I think the right framing or mindset is: what is the best model one can build given a certain compute and data budget? This question makes sense no matter what scale you're at, because it's about accuracy per resources. Of course, if you can raise capital and get more resources you'll get better models, but as researchers our goal is to improve the efficiency of the algorithms. Maximize efficiency — you're going to hear a lot of that.

Okay, now let me talk a little bit about the current landscape, with a bit of obligatory history. Language models have been around for a while, going back to Shannon, who used language models to estimate the entropy of English. In AI they became prominent in NLP, where they were a component of larger systems like machine translation and speech recognition. One thing that's maybe not appreciated these days is that back in 2007 Google was training fairly large n-gram models — five-gram models over two trillion tokens, which is a lot more tokens than GPT-3 saw. It's only in the last couple of years that we've gotten back to that token count.
But those were n-gram models, so they didn't exhibit any of the interesting phenomena we associate with language models today.

In the 2010s, a lot of the deep learning revolution happened and a lot of the ingredients fell into place. There was the first neural language model from Yoshua Bengio's group back in 2003. There were sequence-to-sequence models — a big deal for modeling sequences — from Ilya Sutskever and the Google folks. There's the Adam optimizer, which dates back over a decade and is still used by the majority of people. There's the attention mechanism, developed in the context of machine translation, which led up to the famous "Attention Is All You Need" — a.k.a. the Transformer — paper in 2017. People were looking at how to scale mixture of experts, and there was a lot of work in the late 2010s on model parallelism; people were figuring out how to train 100-billion-parameter models. They didn't train them for very long — these were more systems papers — but all the ingredients were in place by the time 2020 came around.

One other trend, which started in NLP, is the idea of foundation models that could be trained on a lot of text and adapted to a wide range of downstream tasks. ELMo, BERT, T5 — these models were very exciting for their time. We maybe forget how excited people were about things like BERT, but it was a big deal. Then — and this is an abbreviated history — one critical piece of the puzzle was OpenAI taking these ingredients, applying very nice engineering, and really pushing on scaling laws, embracing them. That's the mindset piece, and it led to GPT-2 and GPT-3. Google was obviously in the game and trying to compete as well.

That paved the way for another line of work. Those were all closed models — models that weren't released, that you could only access via API — but there were also open models, starting with early work by EleutherAI right after GPT-3 came out, Meta's early attempt (which didn't work quite as well), BLOOM, and then Meta, Alibaba, DeepSeek, AI2, and a few others I haven't listed have been releasing open models where the weights are available.

One other tidbit about openness that I think is important: there are many levels of openness. There are closed models like GPT-4. There are open-weight models, where the weights are available — often with a very nice paper full of architectural details but no details about the dataset. And then there are open-source models, where the weights and the data are available, with a paper that honestly tries to explain as much as it can. But of course you can't capture everything in a paper, and there's no substitute for learning how to build it except doing it yourself.

That leads to the present day, where there's a whole host of frontier models from OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, Tencent, and probably a few others that dominate the current landscape.
So we're at an interesting time. Just to reflect: a lot of the ingredients, like I said, were developed earlier, which is good because we're going to revisit some of those ingredients and trace how these techniques work, and then we're going to try to move as close as we can to the best practices of frontier models — using information from the open community and reading between the lines of what we know about the closed models.

Okay, as an interlude: what are you looking at here? This is an executable lecture. It's a program that I'm stepping through, and it delivers the content of the lecture. One thing that's interesting is that you can embed code, so you can step through code — this is a smaller screen than I'm used to, but you can look at the environment variables as you step through — which will be useful later when we start drilling down and giving code examples. You can see the hierarchical structure of the lecture (we're in this module, and you can see where it was called from main), and you can jump to definitions, like supervised fine-tuning, which we'll talk about later. And if you think this looks like a Python program — well, it is a Python program, but I've post-processed it for your viewing pleasure.

Okay, let's move on to course logistics. Actually, maybe I'll pause for questions first. Any questions about what we're learning in this class?

Question: would you expect a graduate of this class to be able to lead a team to build a frontier model, or what other skills would they need? So the question is: would I expect a graduate of this class to be able to lead a team and build a frontier model — with, of course, a billion dollars of capital? I would say it's a good step, but there are definitely many pieces missing. We've thought about how we should really teach a series of classes that eventually gets as close as we can; this is maybe the first piece of the puzzle, and I'm happy to talk offline about the rest. But I like the ambition — that's why you should be taking the class: so you can go lead teams and build frontier models.

Okay, let's talk a little bit about the course. Here's the website; everything's online. This is a five-unit class, but maybe that doesn't express the level of work as well as this quote I pulled from a course evaluation: "The entire assignment was approximately the same amount of work as all five assignments from CS224N plus the final project." And that's the first homework assignment. Not to scare you all off — just giving you some data.

So why should you endure that? I think this class is really for people who have an obsessive need to understand how things work all the way down to the atoms, so to speak. When you get through this class, you will have really leveled up in terms of research engineering, and your level of comfort in building ML systems at scale will be something else. There are also a bunch of reasons you shouldn't take the class. For example, if you want to get any research done this quarter, maybe this class isn't for you.
If you're interested in learning just about the hottest new techniques, there are many other classes that can probably deliver on that better than, say, you spending a lot of time debugging BPE. This is really a class about the primitives and about learning things bottom-up, as opposed to the latest and greatest. And if what you're interested in is building things with language models for some application X, this is probably not the first class you would take. Practically speaking, as much as I made fun of prompting: prompting is great, fine-tuning is great, and if you can do that and it works, that's absolutely what you should start with. I don't want people taking this class and thinking that for any problem the first step is to train a language model from scratch — that is not the right way to think about it.

I know some of you were not able to enroll — we did have a cap — and for the people online, you can follow along at home. All the lecture materials and assignments are online, and the lectures are recorded and will be put on YouTube, although with some number of weeks of lag. We'll also offer the class next year, so if you weren't able to take it this year, don't fret — there will be a next time.

The class has five assignments. We don't provide scaffolding code: you're literally given a blank file and you're supposed to build things up, in the spirit of learning by building from scratch. But we're not that mean — we do provide unit tests and adapter interfaces that let you check the correctness of different pieces, and the assignment write-up, if you walk through it, does a gentle job of guiding you. You're on your own, though, for making good software design decisions — figuring out what to name your functions and how to organize your code — which is a useful skill.

One strategy that applies to all the assignments: there's a piece of each assignment that is just "implement the thing and make sure it's correct," which you can mostly do locally on your laptop; you shouldn't need compute for that. Then we have a cluster you can use for benchmarking both accuracy and speed. I want everyone to embrace the idea of using as small a dataset and as few resources as possible to prototype before running large jobs — you shouldn't be debugging with one-billion-parameter models on the cluster if you can help it.

Some assignments have a leaderboard, which is usually of the form "make perplexity go down given a particular training budget." Last year it was pretty exciting for people to try different things they learned from the class or read online. And finally, about AI tools: this was less of a problem last year because Copilot wasn't as good, but Cursor is pretty good now. Our general take is that AI tools can take away from learning, because in some cases they can just solve the thing you're asked to do, but you can obviously use them judiciously. Use them at your own risk — you're responsible for your own learning experience here.
We do have a cluster — thank you, Together AI, for providing a bunch of H100s for us. There's a guide; please read it carefully to learn how to use the cluster, and start your assignments early, because the cluster will fill up toward the deadline as everyone tries to get their large runs in. Any questions about that?

Question: you mentioned it was a five-unit class — are we able to sign up for three to five units? So the question is whether you can sign up for fewer than five units. Administratively, if you have to sign up for fewer, that is possible, but it's the same class and the same workload. Any other questions?

Okay, in this part I'm going to go through the different components of the course and give a broad preview of what you're going to experience. Remember, it's all about efficiency given hardware and data: how do you train the best model given your resources? For example, if I give you a Common Crawl dump and 32 H100s for two weeks, what should you do? There are a lot of design decisions — the tokenizer, the architecture, systems optimizations, data work — and we've organized the class into five units, or pillars. I'll go through each in turn, talk about what we'll cover and what the assignment involves, and then wrap up.

The goal of the basics unit is to get a basic version of the full pipeline working. Here you implement a tokenizer, a model architecture, and training. Let me say a bit more about these components. A tokenizer converts between strings and sequences of integers. Intuitively, the integers correspond to breaking the string up into segments and mapping each segment to an integer; the sequence of integers is what goes into the actual model, which works with a fixed vocabulary. In this course we'll cover the byte pair encoding (BPE) tokenizer, which is relatively simple and still widely used. There is a promising set of tokenizer-free approaches — methods that start from the raw bytes, skip tokenization, and build an architecture that consumes bytes directly. That work is promising, but so far I haven't seen it scaled to the frontier, so we'll go with BPE for now.

Once you've tokenized your strings into sequences of integers, you define a model architecture over those sequences. The starting point is the original Transformer, which is the backbone of basically all frontier models — here's the architectural diagram. We won't go into details now, but there's an attention piece and an MLP layer with some normalization. A lot has actually happened since 2017. There's a sense in which the Transformer was invented and everyone has just been using it, and to a first approximation that's true — we're still using the same recipe — but there have been a bunch of smaller improvements that make a substantial difference when you add them all up. For example: the nonlinear activation function — SwiGLU, which we saw a bit earlier. Positional embeddings — there are rotary positional embeddings, which we'll talk about. Normalization — instead of LayerNorm we'll look at RMSNorm, which is similar but simpler, and there's also the question of where you place the normalization, which has changed since the original Transformer. For the MLP, the canonical version is a dense MLP, which you can replace with a mixture of experts. Attention has been getting a lot of attention, so to speak: there's full attention, and then sliding-window attention and linear attention, which try to prevent the quadratic blow-up, as well as lower-dimensional versions like GQA and MLA, which we'll get to in a future lecture. The most radical direction is alternatives to the Transformer, like state-space models and related architectures such as Hyena, which replace attention with some other operation — and sometimes you get the best of both worlds with hybrid models that mix these with Transformers.
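To make one of these concrete, here's roughly what RMSNorm looks like — a minimal PyTorch sketch, not the exact formulation of any particular model (details like the epsilon value, where it goes, and whether there's a bias vary across implementations):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root-mean-square of the activations.
    Unlike LayerNorm, there is no mean subtraction and (here) no bias."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learned per-dimension gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 512)       # (batch, sequence, d_model)
print(RMSNorm(512)(x).shape)      # torch.Size([2, 16, 512])
```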
Once you've defined your architecture, you need to train it. Design decisions include the optimizer — AdamW, which is basically Adam with the weight decay fixed up, is still very prominent, so we'll mostly work with that, though more recent optimizers like Muon and SOAP have shown promise — the learning rate schedule, the batch size, whether you regularize or not, and the hyperparameters. There are a lot of details here, and this is a class where the details do matter, because you can easily have an order-of-magnitude difference between a well-tuned architecture and a vanilla Transformer.

In assignment one you'll implement the BPE tokenizer — I'll warn you that this is the part that seems to be a surprising amount of work for people, so consider yourself warned — and you'll also implement the Transformer, the cross-entropy loss, the AdamW optimizer, and the training loop. So again, the whole stack. We're not making you implement PyTorch from scratch — you can use PyTorch — but you can't use, say, PyTorch's Transformer implementation; there's a small list of functions you're allowed to use, and you can only use those. We'll have the TinyStories and OpenWebText datasets for you to train on, and there will be a leaderboard to minimize OpenWebText perplexity: we give you 90 minutes on an H100 and see what you can do. Here's last year's leaderboard — the top entry is the number to beat this year.
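To give a feel for the shape of what assignment one asks for — this is not the assignment's actual interface or code, just a minimal sketch with toy stand-ins (a random "corpus" and an embedding-plus-linear "model") showing a next-token-prediction training loop with AdamW and cross-entropy:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins so the sketch runs end to end; in the assignment these are your own
# Transformer and your BPE-tokenized corpus.
vocab_size, d_model, context_len, batch_size = 1000, 64, 32, 8
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for step in range(100):
    # Random "token ids" stand in for a batch sampled from the training corpus.
    tokens = torch.randint(0, vocab_size, (batch_size, context_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]          # next-token prediction

    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The real thing swaps in your own Transformer and data loading, plus things like a learning-rate schedule and checkpointing.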
Alright, so that's the basics. After basics, in some sense you're done, right? You have the ability to train a Transformer — what else do you need? The systems unit is about how you can optimize this further: how do you get the most out of the hardware? For that we need to take a closer look at the hardware and how we can leverage it. Kernels, parallelism, and inference are the three components of this unit.

To talk about kernels, let's first look at what a GPU is. A GPU — which we'll get much more into — is basically a huge array of little units that do floating-point operations. One thing to note is that this is the GPU chip, and over here is the memory, which is actually off-chip; there's also other memory on chip, like the L2 and L1 caches. The basic idea is that compute has to happen here, but your data might be somewhere else, so how do you organize your computation to be most efficient? A quick analogy: the memory, where you store your data and model parameters, is like a warehouse, and your compute is like the factory, and what ends up being the big bottleneck is the data movement cost. So the question is how to organize the compute — even a matrix multiplication — to maximize the utilization of the GPU by minimizing data movement, and there are techniques like fusion and tiling that let you do that. We'll get into the details. To implement and leverage kernels we'll use Triton — there are other options at various levels of sophistication, but Triton, developed by OpenAI, is a popular way to write kernels.

So we'll write some kernels — that's for one GPU. In general, the big runs use thousands, if not tens of thousands, of GPUs, but even at eight it starts getting interesting: you have a bunch of GPUs connected to CPU nodes and also connected directly to each other via NVSwitch and NVLink. It's the same idea as before, only now data movement between GPUs is even slower. So we need to figure out how to place model parameters, activations, and gradients on the GPUs and do the computation while minimizing the amount of movement. We'll explore techniques like data parallelism, tensor parallelism, and so on.

Finally, inference — which we didn't cover last year except for a guest lecture — is important because inference is how you actually use a model: it's the task of generating tokens from a trained model given a prompt. It also turns out to be really useful for a bunch of other things besides chatting with your favorite model: you need it for reinforcement learning, for test-time compute, which has been very popular lately, and even for evaluating models. If you think about it globally, the cost dedicated to inference is eclipsing the cost of training, because training, despite being very intensive, is ultimately a one-time cost, while inference cost scales with every use — the more people use your model, the more you need inference to be efficient.

Inference has two phases: prefill and decode. In prefill, you take the prompt and run it through the model to get the activations. In decode, you go autoregressively, one token at a time. In prefill all the tokens are given, so you can process everything at once — this is exactly what you see at training time, and it's generally a good setting to be in because it's naturally parallel and you're mostly compute-bound. What makes inference special and difficult is the autoregressive decoding: you generate one token at a time, it's hard to saturate your GPUs, and you become memory-bound because you're constantly moving data around. We'll talk about a few ways to speed inference up: you can use a cheaper model; you can use a really cool technique called speculative decoding, where a cheaper model scouts ahead and generates multiple tokens, and if those tokens are good — by some definition of good — the full model can score and accept them all in parallel; and there's a bunch of systems optimizations you can do as well.
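Here's the napkin math for why small-batch decoding ends up memory-bound — a rough sketch with ballpark, illustrative numbers (a hypothetical 70B-parameter dense model in bf16 and approximate H100-class specs), ignoring the KV cache and attention entirely:

```python
# Why autoregressive decode tends to be memory-bound at batch size 1 (ballpark numbers only).
n_params = 70e9                       # hypothetical 70B-parameter dense model
flops_per_token = 2 * n_params        # ~2 FLOPs per parameter per generated token
bytes_per_token = 2 * n_params        # every bf16 weight (2 bytes) is streamed from HBM once per token

arithmetic_intensity = flops_per_token / bytes_per_token   # ~1 FLOP per byte moved

# Very rough H100-class numbers: ~1e15 bf16 FLOP/s of compute, ~3.35e12 bytes/s of HBM bandwidth.
needed_to_stay_busy = 1e15 / 3.35e12                       # ~300 FLOPs per byte to be compute-bound
print(f"decode does ~{arithmetic_intensity:.0f} FLOP/byte; the chip wants ~{needed_to_stay_busy:.0f}")
```

Prefill, or decoding with a large batch, reuses each weight across many tokens, which is what pushes you back toward the compute-bound regime.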
Okay, assignment two. You'll implement a kernel; you'll implement some parallelism — data parallelism is very natural, so we'll do that, and some of the model parallelism, like FSDP, turns out to be complicated to do from scratch, so we'll do a baby version of it (we'll go over the full version in class, but implementing it from scratch would be a bit much). And then, importantly, you'll get in the habit of always benchmarking and profiling. That's probably the most important thing: you can implement things, but unless you have feedback on how well your implementation is doing and where the bottlenecks are, you're just flying blind.

Unit three is scaling laws. The goal is to do experiments at small scale and use them to predict hyperparameters and loss at large scale. Here's a fundamental question: if I give you a FLOPs budget, what model size should you use? If you use a larger model, you can train on less data; if you use a smaller model, you can train on more data. So what's the right balance? This has been studied quite extensively and largely figured out by a series of papers from OpenAI and DeepMind — if you've heard the term "Chinchilla optimal," this is what it refers to. The basic idea is that for every compute budget (number of FLOPs) you vary the number of parameters of your model and measure how good the resulting model is, which gives you the optimal parameter count for each level of compute. Then you fit a curve and extrapolate: if you had, say, 1e22 FLOPs, what would the optimal parameter count be? When you plot these minima, the relationship is remarkably linear, which leads to a very simple but useful rule of thumb: take your model size N and multiply by 20 — that's the number of tokens you should train on. So a 1.4-billion-parameter model should be trained on roughly 28 billion tokens. This doesn't take inference cost into account — it's purely about training the best model regardless of how big that model is — so there are limitations, but it has nonetheless been extremely useful for model development.
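In code, the rule of thumb looks like this — a sketch that also uses the common "FLOPs ≈ 6 · parameters · tokens" approximation for training compute (an approximation, not something we derive here):

```python
# Chinchilla-style rule of thumb: tokens ≈ 20 x parameters; training FLOPs ≈ 6 x parameters x tokens.
def chinchilla_optimal_tokens(n_params: float) -> float:
    return 20 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

n = 1.4e9                                   # the 1.4B-parameter example from above
d = chinchilla_optimal_tokens(n)            # 2.8e10 tokens = 28 billion
print(f"tokens ≈ {d:.1e}, training FLOPs ≈ {train_flops(n, d):.1e}")   # ≈ 2.4e20 FLOPs
```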
In this assignment — this one is kind of fun — we define a "training API" that you can query with a particular set of hyperparameters: you specify the architecture, the batch size, and so on, and we return the loss your decisions would get you. Your job is this: you have a FLOPs budget, you figure out how to train a bunch of models and gather the data, you fit a scaling law to the gathered data, and then you submit your prediction for which hyperparameters — what model size and so on — you would choose at a larger scale. This is a case where we want to put you in a position with some stakes: it's not burning real compute, but once you run out of your FLOPs budget, that's it. So you have to be careful about how you prioritize which experiments to run, which is something the frontier labs have to do all the time. There will be a leaderboard for this too: minimize loss given your FLOPs budget.

Question: I see those are links from 2024 — if we're working ahead, should we expect the assignments to change, or are these the final versions? So the question is about the links being from 2024. The rough structure will be the same for 2025; there will be some modifications, but if you look at those you should have a pretty good idea of what to expect.

Okay, let's go into data now. Up until now you have scaling laws, you have systems, you have your Transformer implementation — you're really good to go — but data, I would say, is the key ingredient that in some sense differentiates models. The question to ask here is: what do I want this model to do? Because what the model does is mostly determined by the data. If I train on multilingual data, it will have multilingual capabilities; if I train on code, it will have code capabilities. It's very natural. And usually datasets are a conglomeration of a lot of different pieces — this is from the Pile, which is four years old now, but the same idea holds: you have data from the web (this is Common Crawl), and from Stack Exchange, Wikipedia, GitHub, and other curated sources.

In the data unit we're going to start with evaluation: given a model, how do you evaluate whether it's any good? We'll talk about perplexity-style measures; standardized testing like MMLU; how you evaluate models that generate free-form responses for instruction following; decisions like whether you ensemble or do chain of thought at test time and how that affects your evaluation; and evaluation of entire systems, not just a language model, because language models these days often get plugged into some agentic system.
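Since perplexity shows up throughout the assignments and leaderboards, here's the arithmetic — a minimal sketch assuming you have logits and target token ids from your model:

```python
import torch
import torch.nn.functional as F

# Perplexity is the exponentiated average per-token cross-entropy (in nats here).
def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))  # mean nats per token
    return torch.exp(ce).item()

logits = torch.randn(2, 10, 1000)                 # (batch, seq, vocab)
targets = torch.randint(0, 1000, (2, 10))
print(perplexity(logits, targets))                # random logits give roughly vocabulary-sized perplexity
```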
After establishing evaluation, let's look at data curation. This is a point I think people don't appreciate: I often hear people say "we're training the model on the internet," and that just doesn't make sense. Data doesn't fall from the sky; there isn't "the internet" that you can simply pipe into your model. Data always has to be actively acquired somehow. I always tell people: look at the data. So let's look at some data. This is some Common Crawl data — I'll take ten documents. The rendering is a bit off, but you can see this is a random sample of Common Crawl, and it's maybe not exactly the data you'd want. Oh, here's some actual real text — that's nice — but most of Common Crawl looks like this: a different language here, very spammy sites there. You quickly realize that a lot of the web is just trash. Maybe that's not surprising, but it's more trash than you would expect, I promise.

What I'm saying is that there's a lot of work that needs to happen around data. You can crawl the internet, take books, arXiv papers, GitHub, and there's a lot of processing that has to happen. There are also legal questions about what data you can train on, which we'll touch on. Nowadays a lot of frontier labs actually have to buy data, because the publicly accessible data on the internet turns out to be a bit limited for truly frontier performance. It's also important to remember that this scraped data isn't actually text: it's HTML, or PDFs, or in the case of code, directories. There has to be an explicit process that turns it into text. So we'll talk about the transformation from HTML to text, which is a lossy process — the trick is preserving the content and some of the structure without just keeping raw HTML. Filtering, as you can surmise, is very important, both for getting high-quality data and for removing harmful content; generally people train classifiers to do this. Deduplication is also an important step, which we'll talk about.

Assignment four is all about data. We'll give you a raw Common Crawl dump so you can see just how bad it is; you'll train classifiers and deduplicate; and there will be a leaderboard where you try to minimize perplexity given your token budget.
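As a toy illustration of the filtering-classifier idea — real pipelines typically train fastText-style classifiers on much larger labeled sets (for example, treating a curated corpus as positives and raw crawl as negatives); the documents, labels, and threshold here are made up:

```python
# A toy quality classifier: score documents and keep the ones above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "The theorem follows from a straightforward induction on the tree depth.",
    "BUY CHEAP WATCHES!!! CLICK HERE best price free shipping casino",
]
labels = [1, 0]   # 1 = keep, 0 = filter out

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(docs, labels)

candidates = ["We describe a simple algorithm for byte pair encoding.", "FREE FREE FREE click now!!!"]
scores = clf.predict_proba(candidates)[:, 1]       # probability of the "keep" class
kept = [d for d, s in zip(candidates, scores) if s > 0.5]
print(kept)
```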
So now you have the data, you've built all your fancy kernels, and you can really train models — but what you get at this point is a model that completes the next token. This is called a base model, and I think of it as a model with a lot of raw potential that needs to be aligned, or modified in some way; alignment is the process of making it useful. Alignment captures a lot of different things, but three in particular. First, getting the language model to follow instructions — completing the next token is not necessarily following the instruction; it will just continue the instruction, or whatever it thinks follows the instruction. Second, specifying the style of the generation: whether you want it long or short, whether you want bullets, whether you want it to be witty or have sass. When you play with, say, ChatGPT versus Grok, you'll see that different alignment has happened. And third, safety: it's important for these models to be able to refuse requests that could be harmful, and that's where alignment also kicks in.

There are generally two phases of alignment. The first is supervised fine-tuning, and the goal here is very simple: you gather a set of user–assistant pairs — prompt–response pairs — and do supervised learning. The idea is that the base model already has the raw potential, so fine-tuning on even a few examples is sufficient; of course, the more examples you have, the better the results, but there are papers showing that even around a thousand examples suffice to give instruction-following capabilities to a good base model. This part is very simple and not that different from pre-training: you're given text and you maximize its probability.

The second phase is more interesting from an algorithmic perspective. Even after the SFT phase you'll have a decent model — now how do you improve it? You could get more SFT data, but that's expensive, because someone has to sit down and annotate. The goal of learning from feedback is to leverage lighter forms of annotation and have the algorithms do a bit more work. One type of data you can learn from is preference data: you generate multiple responses from the model for a given prompt — say A and B — and a user rates which one is better. The data might look like: "What's the best way to train a language model?" with response A "use a large dataset" and response B "use a small dataset," and of course the preferred answer should be A. That's one unit of preference. Another type of supervision is verifiers: for some domains you're lucky enough to have a formal verifier, as in math or code, or you can use learned verifiers, where you train a language model to rate the response — which connects back to evaluation.

On the algorithms side, we're in the realm of reinforcement learning. One of the earliest algorithms applied to instruction-tuning models was PPO, proximal policy optimization. It turns out that if you just have preference data, there's a much simpler algorithm called DPO that works really well. But in general, if you want to learn from verifier data, it's not preference data, so you have to embrace RL fully. There's a method we'll cover in this class called GRPO — group relative policy optimization — which simplifies PPO and makes it more efficient by removing the value function; it was developed by DeepSeek and seems to work pretty well. So assignment five implements supervised fine-tuning, DPO, and GRPO — and, of course, evaluation.
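To make DPO a bit more concrete, here's a minimal sketch of its loss, assuming you've already computed the summed log-probabilities of the chosen and rejected responses under the policy you're training and under a frozen reference model (say, the SFT checkpoint); the variable names and the beta value are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Log-ratios of the policy relative to the frozen reference model.
    chosen = policy_chosen_logp - ref_chosen_logp
    rejected = policy_rejected_logp - ref_rejected_logp
    # Push the chosen response's log-ratio above the rejected one's, scaled by beta.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()

# Fake summed log-probabilities for a batch of 4 preference pairs, just to show the shapes.
batch = lambda: torch.randn(4)
print(dpo_loss(batch(), batch(), batch(), batch()))
```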
Question: you gave a quote from the course evaluation about assignment one — did people have similar things to say about assignments two through five? So the question is: if assignment one seems daunting, what about the others? I would say assignments one and two are definitely the heaviest and hardest; assignment three is a bit more of a breather; and assignments four and five, at least last year, were a notch below one and two — although we haven't fully worked out the details for this year. It does get better.

Okay, a recap of the different pieces. Remember that efficiency is the driving principle. There are a bunch of design decisions, and if you view everything through the lens of efficiency, a lot of things make sense. Importantly, we are currently in a compute-constrained regime — at least this class, and most people, are somewhat GPU-poor: we have a lot of data but not that much compute — so these design decisions reflect squeezing the most out of the hardware. For example: in data processing we filter fairly aggressively, because we don't want to waste precious compute on bad or irrelevant data. Tokenization: it would be nice to have a model over raw bytes — very elegant — but it's very compute-inefficient with today's architectures, so we tokenize as an efficiency gain. Model architecture: a lot of the design decisions there are essentially motivated by efficiency. Training: the fact that we mostly do a single epoch shows we're in a hurry — we'd rather see more data than spend a lot of compute on any given data point. Scaling laws are completely about efficiency: we use less compute to figure out the hyperparameters. Alignment is maybe a little different, but the connection to efficiency is that if you put resources into alignment, you can get away with smaller base models. There are two paths here: if your use case is fairly narrow, you can probably take a smaller model, align or fine-tune it, and do well; if your use cases are very broad, there might be no substitute for training a big model.

That's today. Increasingly, at least for the frontier labs, they're becoming data-constrained, which is interesting because the design decisions will presumably change. Compute will always be important, but, for example, training for one epoch doesn't really make sense if you have more compute than data — why wouldn't you take more epochs, or do something smarter? Maybe there will be different architectures, because the Transformer was really motivated by compute efficiency. Something to ponder: it's still about efficiency, but the design decisions reflect the regime you're in.

Okay, now I'm going to dive into the first unit. Before that, any questions? Do we have a Slack or Ed? The question is whether we have a Slack — we will have a Slack, and we'll send out details after class. Will students auditing the course have access to the same material? The question is whether auditors have access: yes, all the materials and assignments are online, and we'll give you access to Canvas so you can watch the lecture videos. What's the grading of the assignments? Good question. There's a set of unit tests you'll have to pass, so part of the grade is simply whether you implemented things correctly. There are also parts of the grade for whether your model achieved a certain loss or is efficient enough, and every problem part in the assignment has a number of points associated with it, so you get a fairly granular picture of what the grading looks like.
Okay, let's jump into tokenization. Andrej Karpathy has a really nice video on tokenization — in general he makes a lot of build-it-from-scratch videos that inspired a lot of this class, so you should go check them out. Tokenization, as we discussed, is the process of taking raw text, which is generally represented as a Unicode string, and turning it into a sequence of integers, where each integer represents a token. We need a procedure that encodes strings into tokens and decodes tokens back into strings, and the vocabulary size is just the number of values a token can take on — the range of the integers.

To give you an example of how tokenizers work, let's play with this nice website that lets you look at different tokenizers. Type in something like "hello world." It shows you the list of integers — the output of the tokenizer — and it also nicely maps out the decomposition of the original string into segments. A few things to note. First, the space is part of a token: unlike classical NLP, where the space just disappears, everything is accounted for, because tokenization is meant to be a reversible operation — and by convention the space usually precedes the token. Also notice that "hello" is a completely different token than " hello" with a leading space, which might make you a little squeamish — it can cause problems — but that's just how it is.

Question: is the space being leading instead of trailing intentional, or is it an artifact of the BPE process? So the question is whether the leading space is intentional. In the BPE process, which I'll talk about, you pre-tokenize and then tokenize each part, and the pre-tokenizer does put the space at the front, so it is built into the algorithm. You could put it at the end, but I think it probably makes more sense at the beginning — though I suppose it could go either way.

If you look at numbers, you'll see they get chopped into pieces — it's a little interesting that it's left to right, so it's definitely not grouping by thousands or anything semantic. Anyway, I encourage you to play with it and get a sense of what existing tokenizers look like; this one is the GPT-4o tokenizer, for example.

Now some observations using the GPT-2 tokenizer, which we'll use as a reference — let me know if this is getting too small in the back. You can take a string, apply the GPT-2 tokenizer, and get your indices — it maps strings to indices — and then you can decode to get the string back; that's just a sanity check that it round-trips. Another interesting thing to look at is the compression ratio: the number of bytes divided by the number of tokens, i.e., how many bytes each token represents. The answer here is 1.6 — every token represents about 1.6 bytes of data. So that's the GPT-2 tokenizer that OpenAI trained.
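If you want to reproduce that check yourself, here's a sketch using OpenAI's tiktoken package (assuming it's installed — in the assignment you'll build your own tokenizer rather than use this):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
text = "Hello, world! This is the GPT-2 tokenizer."
ids = enc.encode(text)

assert enc.decode(ids) == text                    # tokenization should round-trip
ratio = len(text.encode("utf-8")) / len(ids)      # bytes per token
print(ids[:8], f"compression ratio ≈ {ratio:.2f}")
```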
To motivate BPE, I want to go through a sequence of attempts: if you wanted to do tokenization, what would be the simplest thing? The simplest is probably character-based tokenization. A Unicode string is a sequence of Unicode characters, and each character can be converted into an integer called a code point: "a" maps to 97, the world emoji maps to 127,757, and you can convert back. So you can define a tokenizer that simply maps each character to its code point. What's one problem with this? The compression ratio is one issue — it's not exactly one, because a character is not a byte; here it comes out to about 1.5 bytes per token, since a character can be multiple bytes — but it's not as good as you'd want. The bigger problem is that some code points are really large, and you're allocating one slot in your vocabulary for every character uniformly, even though some characters appear far more frequently than others. So the vocabulary is huge, many characters are rare, and this is an inefficient use of the vocabulary.

On the other hand, you can do byte-based tokenization. Unicode strings can be represented as sequences of bytes — every string can be converted into bytes. "a" is just one byte, but some characters take up as many as four bytes, using the UTF-8 encoding (there are other encodings, but UTF-8 is the dominant one). So let's just convert everything into bytes. Now all the indices are between 0 and 255, because a byte has only 256 possible values by definition, so your vocabulary is very small; not all bytes are equally used, but you don't have serious sparsity problems. But what's the problem with byte-based encoding? Long sequences. In some ways I really wish byte-level encoding worked — it's the most elegant thing — but your compression ratio is exactly one byte per token, and that's terrible: your sequences will be really long, attention is naively quadratic in the sequence length, and you're just going to have a bad time in terms of efficiency.
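Here are those two baselines side by side — a quick sketch on a made-up string:

```python
# Character-based vs. byte-based tokenization on the same string, and their compression ratios.
text = "Hello, 🌍! 你好!"

char_tokens = [ord(c) for c in text]              # one Unicode code point per token
byte_tokens = list(text.encode("utf-8"))          # one byte per token, vocabulary size 256

n_bytes = len(text.encode("utf-8"))
print(char_tokens[:6], max(char_tokens))          # code points can be huge (127757 for the world emoji)
print(f"char: {n_bytes / len(char_tokens):.2f} bytes/token")   # > 1, but a wasteful vocabulary
print(f"byte: {n_bytes / len(byte_tokens):.2f} bytes/token")   # exactly 1: sequences get long
```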
So the thing you might think about next is being adaptive: we can't afford a character or a byte per token, but maybe some tokens can represent many bytes and some can represent few. One way to do this is word-based tokenization, which was very classic in NLP: you take a string and split it into a sequence of segments using a regular expression — here's a different regular expression, the one GPT-2 uses to pre-tokenize, which splits your string into a sequence of strings — and then you assign each segment an integer, and you're done. What's the problem with this? The vocabulary size is essentially unbounded — or at least you don't know how big it is — because on a new input you might get a segment you've never seen before. Word-based tokenization is actually a real pain: some real words are rare, new words have to be mapped to an UNK token, and if you're not careful about how you compute perplexity you'll just mess it up. So word-based captures the right intuition — adaptivity — but it's not exactly what we want.

Here, finally, we get to BPE, byte pair encoding. This is actually a very old algorithm, developed by Philip Gage in 1994 for data compression, and it was first introduced into NLP for neural machine translation. Before that paper, basically all of NLP, including machine translation, used word-based tokenization, and word-based was a pain, so that paper pioneered the idea of using this nice algorithm from 1994 so that tokenization round-trips and you don't have to deal with UNKs or any of that. It then entered the language modeling era through GPT-2, which was trained using a BPE tokenizer.

The basic idea is that instead of defining some preconceived notion of how to split text, we train the tokenizer on raw text. Organically, common sequences that span multiple characters get represented as a single token, and rare sequences get represented by multiple tokens. One slight detail: for efficiency, the GPT-2 paper uses a word-based tokenizer as a pre-processing step to break the text into segments and then runs BPE within each segment — which is what you'll do in this class as well. The algorithm itself is very simple: first convert the string into a sequence of bytes, as we did for byte-based tokenization, and then repeatedly merge the most common pair of adjacent tokens. The intuition is that if a pair of tokens shows up a lot, we compress it into one token — we dedicate vocabulary space to it.
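Here's a minimal, deliberately inefficient sketch of that procedure — training plus an encoder that replays the merges, and a decoder — which we'll then trace through by hand. It skips pre-tokenization and special tokens, and the training string is just a toy example:

```python
from collections import Counter

def merge(indices, pair, new_id):
    """Replace every occurrence of an adjacent pair with the new token id."""
    out, i = [], 0
    while i < len(indices):
        if i + 1 < len(indices) and (indices[i], indices[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(indices[i]); i += 1
    return out

def train_bpe(text: str, num_merges: int):
    indices = list(text.encode("utf-8"))             # start from raw bytes (tokens 0..255)
    merges = {}                                      # (token, token) -> new token id
    vocab = {i: bytes([i]) for i in range(256)}      # token id -> the bytes it represents
    for i in range(num_merges):
        counts = Counter(zip(indices, indices[1:]))  # count adjacent pairs
        if not counts:
            break
        pair = max(counts, key=counts.get)           # most frequent pair (ties broken arbitrarily)
        new_id = 256 + i
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        indices = merge(indices, pair, new_id)       # apply the merge to the training data
    return merges, vocab

def encode(text: str, merges) -> list[int]:
    indices = list(text.encode("utf-8"))
    for pair, new_id in merges.items():              # replay merges in training order
        indices = merge(indices, pair, new_id)
    return indices

def decode(indices, vocab) -> str:
    return b"".join(vocab[i] for i in indices).decode("utf-8")

merges, vocab = train_bpe("the cat in the hat", num_merges=3)   # toy corpus
ids = encode("the quick brown fox", merges)
assert decode(ids, vocab) == "the quick brown fox"
```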
Now let's walk through what this algorithm looks like, using the cat-and-hat example. We convert the string into a sequence of integers — these are the bytes — and we keep track of what we've merged: merges is a map from a pair of integers (which can be bytes or previously created tokens) to a new token, and the vocab is a handy map from index to bytes. The BPE algorithm is very simple, so I'll just step through the code. We do this some number of times — three in this case. First we count up the occurrences of pairs of adjacent tokens: we step through the sequence and see 116, 104 — increment that count; 104, 101 — increment that count; and so on through the sequence. After we have these counts, we find the pair that occurs the most; there are ties, but we break them and take 116 and 104, which occurred twice. Now we merge that pair: we create a new slot in our vocab, 256 — so far the vocab was 0 through 255, and now we're expanding it — and we say that every time we see 116 followed by 104, we replace it with 256. Then we apply that merge to our training sequence, and the two occurrences of 116, 104 become 256.

Now we just loop through the algorithm. The second time, it decides to merge 256 and 101, and we replace that in the indices. Notice that the sequence of indices shrinks — our compression ratio is getting better as we add vocabulary items and have a larger vocabulary to represent everything. We do this one more time, the third merge creates one more token, the sequence shrinks again, and then we're done.

Let's try out this tokenizer. We have the string "the quick brown fox"; we encode it into a sequence of indices and then use our BPE tokenizer to decode it. Let me step through what encode looks like — decoding isn't actually that interesting. In encode, you take the string, convert it to bytes, and replay the merges — importantly, in the order in which they were created. So I replay these merges, get my indices, and then verify that decoding gives back the original string.

That was pretty simple, and because it's simple it's also very inefficient — for example, encode loops over all the merges when it should only loop over the merges that matter — and there are some bells and whistles like special tokens and pre-tokenization. In your assignment you'll essentially take this as a starting point — or rather, implement your own from scratch — and your goal is to make the implementation fast. You can parallelize it if you want; go have fun.

So, a summary of tokenization: a tokenizer maps between strings and sequences of integers. We looked at character-based, byte-based, and word-based tokenizers, which are all highly suboptimal for various reasons. BPE is a very old algorithm from 1994 that still proves to be an effective heuristic; the important thing is that it looks at your corpus statistics to make sensible decisions about how to adaptively allocate vocabulary to represent sequences of characters. I hope that one day I won't have to give this lecture because we'll just have architectures that map directly from bytes, but until then, we'll have to deal with tokenization.

That's it for today. Next time we're going to dive into the details of PyTorch, give you the building blocks, and pay attention to resource accounting. All of you have presumably written PyTorch programs, but we're going to really look at where all the FLOPs are going.
See you next time.