Transcript for:
Lecture on Large Language Models by Sasha Rush

I'm David Parks. I'm the dean of the School of Engineering and Applied Sciences, and it is a super, super pleasure to introduce my friend, colleague in the past, and student way, way, way back, Sasha Rush. I wanted to tell you a little bit about Sasha, and then a little bit about how I was just a little bit involved in inviting him to be here as well.

Sasha, like I said, is a professor at Cornell, and he also works with Hugging Face. Hugging Face is now partnered with Google; they're working with Google, but they are maintaining their commitment to open source, democratizing access to large-model AIs. Sasha has an AB in computer science from Harvard College; that's when I first got to know him, in part as a teaching fellow in my class, 182, which we used to joke was "AI under certainty," and if that doesn't seem funny, think about it for a minute. You never reason in a world with certainty, but we were teaching a class where we were pretending that there wasn't any uncertainty in the world. He went on to get his PhD from MIT, he did a postdoc at Facebook Research, and he was an assistant professor here in computer science from 2015 to 2019. In spring 2017 I got to teach a class with him on machine learning, and I don't know if you remember this, Sasha, but, for those of you who follow the large language model world: I didn't know what he meant at the time, but he came to me one morning before class and said, there's a paper that's just come out and it's changed everything; it's changed the way my field is going to work. He said this the day the "Attention Is All You Need" paper hit the arXiv. I remember that morning when you came to class and told me that, while we were co-teaching, and this is the paper that went on to become the NeurIPS paper that is, in a true sense, the precursor to everything that's happening today in AI. I just think that says a lot about Sasha, that he noticed that moment very clearly the morning it happened.

As I'm sure many of you know, which I imagine is why we have such a large crowd, he has become known for his exposition of these incredibly complex AI artifacts, with beautiful blogs, beautiful demos, ways that we can all come into this seemingly complex, hard-to-understand world. His research is around language models, including alternate architectures for LLMs, how to scale in open-model ways, efficiency of algorithms and hardware, and security of AI systems as well. He's won more awards than I could imagine anybody winning for his work, and he's also an amazing teacher and, again I'll say, a democratizer of access to AI, in particular this large language model part of AI.

Along with Francesca Dominici, until quite recently I was the co-director of the Harvard Data Science Initiative, and we were talking and said, okay, we need somebody to talk about what's happening in the large language model world, and I immediately said we should invite Sasha to give the tutorial. We were really happy that he agreed to do that, and of course I didn't imagine at that time that I'd be here in my new role to introduce him, so this has all come together amazingly well. I'm going to sit through a few of these, and we're going to hear "Large Language Models in Five Formulas." So it's great to have you here. The session chairs are here too, nice friendly people, we promise they don't bite. So please welcome him.

Hey everyone.
Today's talk is about large language models; the talk is called "Large Language Models in Five Formulas." As David mentioned, I've been putting a lot of this material online and trying to do more of that; I put a version of this on YouTube that you can watch as well. That one is about an hour, and the talk today will be about three hours, so as you can imagine there's lots of room for questions and interaction. Given how wide this topic is and how broadly I'm going to survey it, please ask questions as you go; we can go as deep or as high-level as people are interested in.

So I want to talk today about large language models. At this point I kind of assume you've heard a lot about these. Large language models are extremely useful, they're extremely expensive, they're extremely important, and maybe extremely scary; there are all sorts of emotions they elicit, and a lot of that goes beyond the technical questions of how they work. This is really the only slide I'm going to show today that actually demonstrates them in practice. I just typed in a random hard question, and you consistently get out these incredible answers. They're so well worded, they're so consistent in tone, they have whimsy, they respond to ridiculous situations, they're able to pull up certain kinds of information, and they're able to combine information in interesting and pretty wild ways. So I'm a big fan. I use them all the time; I'm trying to get in the habit of asking a large language model first, before I type in a Google query, before I try to code something, and more and more I'm finding they just do it for me. I'm pretty impressed by the technology, but from that point of view I'm just like probably all of you: wow, this is cool, this works really well.

What I want to talk to you about today is how you reason about large language models: how do we think about these as objects of study from a computer science point of view? But I want to be extremely humble about what I can tell you, and the true answer is that we do not even understand small language models. I can make a very, very small 100-parameter language model, and I cannot tell you what or why it works. And I really want you to be cautious of people who sell this to you and say, "I know how this works, I understand what's going on inside of it," because these systems are extremely nonlinear, they have all sorts of confounding or weird factors, and we can't give you a proof that they do something in a particular way. That's not to say we shouldn't try, but for the sake of the tutorial I wanted to stick with things that I feel I have a handle on, that we can talk about, that we can get some sort of mathematical intuition for.

So here's the tutorial we're going to do: large language models in five formulas. I'm going to talk through five areas of large language models that are, I would say, pretty much orthogonal to each other and that I think I can give you good intuition for. The five formulas are called perplexity, attention, GEMM, Chinchilla, and RASP. Just out of curiosity, maybe with a show of hands, with the number of fingers: how many of these have you seen before? Okay, cool, so I see a lot of ones, some twos, no fives. That's good; if there were some fives I'd be really worried. I think you've probably seen some of them before, but I'm going to go through each of them individually. We'll do the first two, take a break, and then talk about the last three.
I think the reason I find formulas really useful is that each of them breaks apart one part of language modeling that we can think about and reason about independently of the others. The first is really going to tell us about generation; we can think about generation independently of neural networks, independently of training data, just as a kind of probabilistic system. Next we'll talk about memory. When we talk about memory we're going to dive deep into the neural network components; we're going to think about how to actually build this, how you might implement it, and what its runtime cost would be on your machine. Next we'll talk about efficiency. For this section we'll dive into the actual GPU hardware; we'll get an understanding of why these things really are fast in practice, and we'll go through one important example of how this works. If you're not a coder, don't worry, it'll be pretty high-level. If you are a coder, you probably still haven't coded for GPUs; it's pretty hard. I think there are probably only, I don't know, dozens of people in the world who do it very well. I'm not one of them, but I'm a big fan, so I appreciate the people who can. Next we'll talk about scaling. Scaling is less about the low-level parts and more about the executive decisions: like, why the heck did Mark Zuckerberg buy 600,000 GPUs? What is the reason we think that makes sense, and why might it work in practice? And then finally we'll talk about reasoning. This is one of the hardest areas and one of the least understood. We'll approach it by thinking about mathematical models that tell us not how Transformers or large language models actually work, but how they might work, giving us a kind of upper bound on what they could do for real problems. This is one of my favorite areas and one I'm really passionate about; I think it's a little less mainstream than the rest of the talk.

Cool. So, a caveat: I'm going to simplify. We're going to think about abstractions, and I'm going to leave out a lot of the details. You're not going to leave here and go get a job at OpenAI, because they care about the details, but maybe you can be like a manager at OpenAI and get a sense of how it works. I'm thinking about this like a frictionless environment; I think someone used the phrase "spherical cows," so think of this like physics with spherical cows. That's the kind of world we're going to be working in as we begin. Okay, I'm going to get started. Any questions before I start, about LLMs generally? I'm going to make you ask questions at some point; I know it's a little scary in this room.

Okay, part one: let's talk about perplexity. For this section we're going to assume that language is very nice; it's going to behave in a really nice way. The way it works is that there are going to be documents, and each document has T = 1,000 word tokens. That means every document is a New Yorker article a thousand tokens long. And our language is very simple: it has 10,000 word types, which means the dictionary has exactly 10,000 entries and you're not allowed to make up new words, ever. You can't have anything weird; it just has to be those exact 10,000 words. This distinction between tokens and types is one that often trips people up at first, so one thing to ask yourself in your head when I make a statement is: am I making a statement about tokens or about types? Because we're going to be moving back and forth between those two things.
When I talk about tokens, I'm talking about looking back in a document at, say, a previous word; when I talk about types, I'm thinking about looking in a dictionary to determine what my next word will be.

Okay, let's add some formulas. A language model is very well defined. A lot of people will lie to you and call their thing a language model; they're wrong, I'm right. This is what a language model is: it's a probabilistic model of a document. It gives us the probability of a given series of tokens existing, the probability of that document being in the world, the probability of someone writing that New Yorker article. I'm going to go through notation a little, just because I know we have people from different areas and my notation can be confusing. I'm not going to be very explicit about random variables; I'm just going to casually write x_1, ..., x_T, implicitly stating that those are random variables, and I'll make it clear when there are explicit instantiations of those variables. I'm also going to be pretty sloppy about the parameters, but when it's important I'm going to use a semicolon to split the parameters off from the probabilistic model. I do that because I don't want to talk about neural networks yet; neural networks are messy, they're their own world, and when I think about them I need to turn off my brain a little. So in this section we're mostly just going to talk about probability, because that's well defined.

So I have the joint probability of these T tokens. Next section we'll talk about theta, but for now we're going to ignore it. Mathematically we can just apply the chain rule to this joint distribution: we can write the joint as any combination of conditionals we would like, but a natural one to pick is left to right. So here I'm taking the product of each individual probability of a word conditioned on all the previous words before it. There's no inherent reason I have to do it that way, but it's convenient and it's a nice thing to model. That choice, to split it up in left-to-right order and then to explicitly parameterize the next word, is known as an autoregressive language model. Here "auto" means that you're conditioning on yourself, on your previous choices, and "regressive" refers to the fact that you're regressing, or predicting, the next thing in the chain. So we're going to parameterize the probability of the next word. Let's make that a little more tangible: when I say p(x_t) conditioned on the previous words, what I'm saying is, learn a model that looks at the previous words and produces a distribution over the next word. That object is a little bit big: it consists of 10,000 floating-point values, the probability of every word in the dictionary. So it's a prediction problem, but a prediction over a very large set of possibilities, and when I talk about it I'm going to talk about it in terms of this histogram, where we have a probability for each of the words in our dictionary.

Cool, any questions about that so far? So we have a distribution, and there are two things we'll do with it. One is to use it to assess the probability of a sequence of words. The other is to sample from it; because we're sampling from this joint distribution, we can do that by sampling from each of the individual conditionals, one at a time.
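Written out, the factorization being described is just the chain rule applied left to right; the simplex notation in the second part is my shorthand for the 10,000-entry histogram over the dictionary.

```latex
p(x_1, \ldots, x_T;\ \theta)
  \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1};\ \theta),
\qquad
p(\,\cdot \mid x_1, \ldots, x_{t-1};\ \theta) \in \Delta^{10{,}000}
```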
So when we talk about generating from a language model, we simply sample from p(x_1) to get a concrete sample x_1, feed it back into the model, sample again, and continue forward. This is roughly what ChatGPT does when it generates output. There are lots of other fancy algorithms that people have thought of, but most of the time it turns out just doing this works totally fine. So those are our two pieces of probabilistic machinery: one is to compute the probability of a document, the other is to produce a new document by step-by-step sampling.

Okay, so I've just spent ten minutes describing something you probably saw in your undergrad class. It's not a new idea, and it's weird that we're so obsessed with these sorts of models right now. For a while it felt like an old and kind of dusty idea, so it's exciting to see it come back; it's the kind of thing that used to be in the first lecture of any NLP undergrad class, and you can go back and read these papers, it's pretty neat. I was always taught that Shannon came up with this idea, but there are these papers from Markov, which people didn't translate until rather recently, that describe building these sorts of models in, I think, roughly 1917. And it's actually kind of ironic, because a lot of the research that came after Shannon was on translating language, and in particular there was a lot of funding, a lot of it at Harvard actually, on translating from Russian to English around that time. I think one of the reasons they were trying to translate from Russian to English was that they wanted to translate all these old Russian papers that had developed all this machinery. So, kind of through Shannon, you got early attempts at translating some of these papers.

Okay, now it's not entirely true that we just rebuilt the model Markov built; it's a little more complicated than that, and I want to give you two examples of ways that modern models differ from the models people were playing with back in the day. The first is an assumption we used to make, until very recently, about language models: that there is a fixed-length history. Specifically, the assumption, which is known as the Markov assumption, is that each word only depends on, let's say, just the last word. That assumption doesn't change our probabilistic model; it just says that these two quantities are roughly the same: the probability of x_t conditioned on all the previous words is roughly the same as conditioning only on the previous word. Now this is obviously wrong, but it's not as wrong as you might think. There is a lot of information about x_t in x_{t-1}, and most of the information is probably in the last couple of words, so making some sort of cutoff is not too crazy an idea. The second assumption is that people used to assume theta could be represented by just a good old-fashioned categorical distribution. What I mean by this is that, instead of learning a fancy neural network, you just assume there is one categorical distribution for every last word. So if I saw the word "the," I would just memorize the distribution over next words; if I saw the word "dog," I would memorize a different distribution over next words. The reason that was really nice is that you could estimate those distributions exactly, very efficiently: you could go over the entire internet, count up all the times you saw "dog," count up all the next words, divide, and you'd get pretty good language models very efficiently.
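Here is a minimal sketch, in plain Python, of exactly that kind of model: a Markov (bigram) language model with a categorical theta estimated by counting, plus step-by-step ancestral sampling. The toy corpus and the helper names are made up for illustration, not from the talk.

```python
import random
from collections import Counter, defaultdict

# Toy corpus standing in for "the entire internet" (illustrative only).
corpus = "the dog walked to the park . the dog ran to the lawn .".split()

# Categorical theta: one memorized next-word distribution per previous word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(prev):
    """p(x_t | x_{t-1}): count and divide (maximum-likelihood estimate)."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

def generate(start, steps=8):
    """Ancestral sampling: sample the next word, feed it back in, repeat."""
    out = [start]
    for _ in range(steps):
        dist = next_word_distribution(out[-1])
        words, probs = zip(*dist.items())
        out.append(random.choices(words, weights=probs)[0])
    return " ".join(out)

print(next_word_distribution("the"))  # e.g. {'dog': 0.5, 'park': 0.25, 'lawn': 0.25}
print(generate("the"))
```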
If you think about tools like MapReduce, the first example they would always give you is how to build this kind of language model, how to count up all the different combinations of words on the web, and this was the primary way people did it up until about 2010. So it's not too crazy an idea either.

Okay, so what happens when you do this? You get something like this. This is a sample from the language model I've described so far, from Shannon's paper in 1948. I think this paper is so cool; it's so cool that he just sits down and builds a language model, I guess by hand or something, I don't know. And he gets something that is, I don't know, really bad, but it gets roughly the order of the parts of speech of words in English: "the" seems to come before nouns. There's a lot of structure here that, I don't know, impresses me for 1948, even though it's mostly nonsense. If you go read the paper, he does this for a bunch of different models and a bunch of different tasks. Cool, I'll stop for any questions.

So this is bad, but how do we know it's bad? How do we quantify that one language model is better than another? How do we quantify this idea that it's capturing something about English but not getting English exactly? We're going to do this the same way we would for any machine learning model: we're going to have a held-out test set, we're going to give it to the language model, and then we're going to compute a metric. But one thing that's a little tricky here is that this problem is kind of unsupervised; we don't have a right answer. We're just trying to learn the distributional properties, to estimate the density of language, and so the simple metrics might not work as well as you would think.

Let's look at an example. This is going to be the main sentence we'll look at throughout the talk: "the dog walked to the ___." Yeah, it's not the most creative sentence, but we'll use it as we go. We're going to assume we give this to a language model and ask it to produce a distribution, a histogram, over the vocabulary. Here's our distribution, and the mode of this distribution is the word "park." That's in some sense our prediction of the next word, but the true answer is "lawn," so we got it wrong. If we were just assessing accuracy we would say we got zero points, but we did a little better than that: we gave some probability to "lawn," it was maybe in our top twenty words, so we captured the possibility of next words reasonably well. It's unfair that we got a zero; we want to get some points for "lawn" even though we didn't get it exactly right. And what's particularly challenging about this is that certain word predictions are much easier than others. There are some cases where it's extremely clear what the next word will be, and there are some where it's really, really hard. This example comes up a lot: something like a preposition after a verb is pretty easy, since some verbs can only be followed by one preposition; but say the last word was "the" and the next word could be a person's name. There are hundreds of people's names, so how could you possibly ever get that right?
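To see why pure accuracy is an unfair score here, here is a tiny sketch of the "park"/"lawn" example; the probability numbers are made up for illustration, but they show how scoring with the probability assigned to the true word gives partial credit where top-1 accuracy gives none.

```python
import math

# Hypothetical model output for "the dog walked to the ___" (illustrative numbers).
next_word_probs = {"park": 0.40, "store": 0.15, "lawn": 0.05, "vet": 0.03}
true_word = "lawn"

prediction = max(next_word_probs, key=next_word_probs.get)  # "park"
accuracy = 1.0 if prediction == true_word else 0.0          # 0.0: no credit at all

# Scoring by probability: -log2 p(true word) = bits needed to encode it.
bits = -math.log2(next_word_probs[true_word])               # about 4.3 bits
print(accuracy, round(bits, 2))
```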
What's really hard about this is that language has a nasty distribution to it: some words are super common and very, very likely to appear, while other words are much, much less likely to appear, yet if you sum up the whole tail of that distribution, it's a non-trivial probability. So in most New Yorker articles you're going to see a word you don't know or haven't seen before; that's a pretty common experience, even though most of the words in every article are super duper common. And so if we just did accuracy, we'd get a pretty skewed sense of how well we're doing, or what we're getting right. So we need another metric. Oh yeah, go for it.

"Can you make a relationship between statistical prediction and semantics?" Yeah, it's a great question. The models we're looking at now, which only look at the previous word or so, are much more about capturing the syntactic flow of language; they're mostly asking, is this type of word reasonable in the next spot? You'll get some semantics just through correlations, maybe of verb and preposition or verb and direct object, but primarily we're trying to get the structural properties, and that's because that's where most of the information is. The method we're using now underlies all modern large language models, and those clearly capture semantics in a very deep way; at that scale, certain semantic properties show up very clearly in these models. But drawing the line of where you move from just structural information to the semantics or meaning of the language is hard to tease out.

"You draw the distinction between Markovian models and language models, but when you said it's not categorical, maybe my question isn't helpful, but is it like an infinite-state sort of system? Because when I think about a Markovian system it's finite-state, you have transition probabilities, everything, and when it's finite and discrete you can probably find a discrete map." Really good question, so let me just make clear what my notation is, because I'm being a little idiosyncratic for simplicity. I'm making two orthogonal assumptions. One is the assumption that the probability distribution is Markovian; that says I can ignore earlier words because the distribution is independent of them. The second assumption, which I'm calling categorical (maybe that's not the right word for it), is an assumption about theta: an assumption that we have simply stored, internally, one distribution for every conditional state. We're going to relax both of those assumptions in the next section, but for now, is that distinction clear? Does it resolve some of it? Okay, but we'll get back to your question in the next section. Great. Well, I love that you stopped me on the Zipf slide, because it's my favorite as an NLP person. Any other questions while we're here?

Okay, so we need another metric for this problem, and we're going to choose a metric that also relates very closely to Shannon. The metric is going to connect the fact that we have a probability distribution to the fact that that probability distribution implicitly implies a way to compress language. The way this works is that we're going to pick a binary string to represent every word in our dictionary. This gives us a variable-length code to encode every one of the words in that dictionary.
If it's helpful, you can think of this as like a Huffman code: we're basically going to build a tree and get binary strings for each word. If that's not helpful, just ignore it and pretend we're a gambler: we have to distribute these codes to our words, and if we pick shorter codes for some words, that implies we have to pick much longer codes for other words, because we've used up that space. That assignment gives us a code for every one of the words, and the way we're going to think about the problem is that we get scored based on how short the code was for the true next word. So in this case the word was "lawn" and we gave it the code 100010, so pretty good, it took six bits. It would have been better if it had been "park"; we had a very short code for that one. But it would have been devastatingly bad if it was the one below, because we have an extremely long code: we'd basically have used all those bits telling you about this super weird word that showed up. So this is a way to take our language model and turn it into a way to compress language, with the assumption that the person on the other end has our language model as well and can read our codes. With the language model in this setting, I can pass the next word in six bits to another person.

Okay, so however long this code is, it's bounded by this formula: bounded by negative log base 2 of the probability we assign to the true next word. So if we gave that word, say, 0.4, we couldn't do any better than negative log base 2 of 0.4; that tells us the code length in that setting. This also comes from Shannon, and it gives us a way to know, in the best case, how well we could compress that output. You can think about this as just using our probability distribution to bet on codes for each word; we get rewarded based on how short those codes are. We don't actually have to produce the codes, because we have a formula for it, and the formula is pretty simple: take the probability we gave to the actual true next word and take log base 2 of it. If the probability was really low, that log base 2 will be really big; if the probability was really high, it can go very small.

For historical reasons, in language modeling we use a metric that's a slight transform of that idea. It's called perplexity, and you'll see it everywhere if you read language modeling papers. It corresponds to two to that quantity: we take two to the average number of bits per next word of the system. What's nice about this is that you can read a complex paper, get to the perplexity, and say, oh, this language model corresponds to an ability to compress language to this level; you can directly convert between those terms and think about what that means in practice.
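Written out, the two quantities just described are the per-word code length and its transform, perplexity, over a T-token held-out text, in the same notation as before:

```latex
\text{bits}(x_t) \;\ge\; -\log_2 p(x_t \mid x_1,\ldots,x_{t-1};\ \theta),
\qquad
\mathrm{PPL} \;=\; 2^{\,-\frac{1}{T}\sum_{t=1}^{T} \log_2 p(x_t \mid x_1,\ldots,x_{t-1};\ \theta)}
```

As a sanity check: a uniform model over the 10,000-type dictionary gives every word probability 1/10,000, so every code is log2(10,000), about 13.3 bits, and the perplexity is exactly 10,000, which is one of the cases discussed next.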
Let's look at some examples, because I think it's helpful to see these in practice. If you get down to perplexity one, you've won language modeling; you're great, it's done, you've solved language. What that means is that the problem and the underlying system were effectively deterministic: you need zero bits to communicate to the person you're playing this game with, because they can literally read off what the language model tells them and they'll know that that is the next word. This feels impossible, but sometimes you'll actually see things like this, where you get to perplexity one, and the reason is that sometimes you do language modeling conditioned on some side information. For example, let's say you're translating from a French sentence to an English sentence. If you condition on the French sentence, sometimes there will only be one valid English translation; if every speaker agrees that this is the next English word, always, then the perplexity of that part of the language model will be one. There's only one possibility, maybe you know what it is, and you generate it.

Okay, if the perplexity is 10,000, that implies your code is essentially uniform: you're not using the variable-length property at all, you're assigning literally the same code to every word. That means you're ignoring the fact that there's any structure to the problem; you're ignoring that language is Zipfian, you're ignoring any kind of Markov assumption, you're giving every word the same probability. That leads to all the codes being the same length, and that length corresponds to a perplexity of 10,000. And just to make it clear, perplexity can be worse than that: say you really, really, really think the next word is "pizza" and the next word is, I don't know, "business suit"; you've screwed up in a really bad way. So language models tend to be not so confident, because they never want this bad scenario where you've given a basically infinitely long code to the true next word. That can really hurt your perplexity, and it probably means you have a bug; if you're coding this up and you see these giant perplexities, you might have a bug. Cool, any questions about this slide?

"I think I'm missing something: is the perplexity evaluated against an actual distribution of language, or is it inherent to the model? You said a perplexity of 10,000 is uniform, but you're also talking about your model getting it wrong." It's a great question. You need two things to evaluate perplexity: a model and a held-out test set. The most famous one is just articles from the Wall Street Journal, so that's the test set that's held out, and then you evaluate a model's perplexity on that data. The test set tells you what the true next word is, and the model tells you what the code for that word is, which is just the log of the probability. "That's right, but it's the log of the probability from the model, the model's prediction?" Yeah, that's right. "So then aren't there lots of models that would give you a perplexity of 10,000, not just the uniform one? You said if the perplexity is 10,000 it's uniform; is that necessarily true?" It's not necessarily true; maybe what I should have said is that it's equivalent to uniform, in that it will get the right answer roughly one out of 10,000 times, so it performs as well as the uniform model. I think that's a good way of thinking about it. Any other questions?

"So do we expect it to generate exactly the target? Sometimes it can generate something different that's still really good and impressive, but it's not word-by-word the same as the article." Just to be clear, this metric is not a metric of generation quality; it's a metric of distributional match. What it's trying to do is produce codes that match relatively closely to what we see from real language. Now, it just so happens that if you do this right, almost certainly you'll do what you described, but the converse is often not true. For instance, let me give you an example: I have a model, and it always outputs, I don't know, David Remnick's New Yorker article, and nothing else; it always does that. That model will have terrible perplexity, but its generation quality is really good; it generated a really good article. So what we're trying to do in language modeling is slightly more than just generation; it's capturing the structure of language.
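Putting the last couple of answers together, here is a minimal sketch of how perplexity is computed in practice: a model is scored on a held-out token sequence it did not see. The `model_prob` stand-in and its interpolation with a uniform floor are purely illustrative, not something from the talk; any model exposing the same interface would work.

```python
import math

def model_prob(word, context):
    """Stand-in for p(x_t | x_1..x_{t-1}); a real trained model would go here.
    A tiny memorized table is mixed with a uniform floor over a 10,000-word
    dictionary so no word ever gets probability zero."""
    table = {"the": {"dog": 0.5, "park": 0.2, "lawn": 0.1}}
    vocab_size = 10_000
    p = table.get(context[-1] if context else None, {}).get(word, 0.0)
    return 0.9 * p + 0.1 / vocab_size

def perplexity(tokens):
    """PPL = 2 ** (average bits per next word) on a held-out token sequence."""
    total_bits = 0.0
    for t, word in enumerate(tokens):
        total_bits += -math.log2(model_prob(word, tokens[:t]))
    return 2 ** (total_bits / len(tokens))

held_out = "the dog walked to the lawn".split()
print(perplexity(held_out))
```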
"I'm a bit confused: when you define it, is it over all the token space, like some sort of sum?" Great question. This thing here is over the tokens you have in your test set, so it's a sum over each of the words in the article, and this here is the actual observed word at that point in the test set. You can think about this as an expectation over language, over observed language; if you want to be formal, it's just Monte Carlo over language. We don't have the true distribution, so we have to use samples. "So, to be really pedantic, given your initial notation, each article in your test set would have a perplexity score, and then your overall perplexity would be the average across them?" It gets complicated, because articles are different lengths, so you can't do it quite that way; think of it more like we concatenate all the articles in the world into one super-article. "But you conveniently said that all articles are 1,000 tokens." That's true; in my case they're equivalent, but in general they would not be. Cool.

"Because we know language is not deterministic, and assuming your held-out set and training set are from the same distribution, is there any bound on how good you could be?" Great question. Shannon figured this stuff out very early; he has a paper on estimating what's called the true perplexity of English, where he's interested in the best possible perplexity you could get, given that English is not deterministic. It's going to differ across domains, but there's a way to go about doing this. One thing that's interesting is that humans are really bad at this: you can't just ask a human to give a distribution over the next word, it's really hard. If you've ever tried estimating distributions, I'm awful at it. So you have to come up with all these tricks to get some sort of importance-sampled estimate of what their true probabilities are; anyway, it gets very messy.

More practically, though, here is the history of this dataset. This is called the Wall Street Journal dataset; it consists of a bunch of Wall Street Journal articles, and then you compute perplexity on articles you haven't seen. As I mentioned before, if you're uniform you get around 10,000; bigrams get you to around 600; trigrams get around 200; and then forty years of really, really impressive tricks with 5-gram models gets you to about 140. This is where Google was at around, I guess, 2009: they were building these massive 5-gram models of the entire internet, and you get to about 140 perplexity. Around 2000 to 2010, people were building Markov neural networks that got similar scores; some of the early non-Markov neural networks got around 100, and things got better from there.

I want to give you a high-level sense of why we care. One person asked about quality, so the first question is: does perplexity correspond to quality? I mean, you could probably show that it would be hard for it not to, but I can't think of any inherent reason why you couldn't have a model with pretty good perplexity that led to bad-quality generations, maybe if it was just much too conservative.
But in practice we find that it almost always does. This is from a paper from around 2017, showing two columns. One is perplexity; if you ever see "PPL," that means perplexity, it's in basically every LLM paper. The other column is called BLEU, which is a metric of quality for translation accuracy. In this model, a language model conditions on a French sentence and tries to produce an English sentence; the perplexity of the English sentence is here and the quality of the translation is here, and generally quality correlates almost perfectly with perplexity. So around this time a lot of people were saying, well, if perplexity is all that matters, why don't we just focus on perplexity? It seems simpler; why do we even focus on sampling at all?

Then the next year another paper came out, and it showed this really surprising idea: perplexity on general language correlates extremely well with task accuracy on totally different things. This was a really shocking result and a really cool table. It showed that you could take a model, make the perplexity go really far down, and then apply it to something totally different. So this is general English Wikipedia perplexity, and this is movie reviews, and it's like, why are those things even related? In the previous paper, at least, they were training on translation and testing on translation; in this paper they're training on translation and testing on something like question answering, a totally different underlying task, and the correlation is almost exact. This was really exciting. I thought this was the most important paper I would see published in my lifetime; it was basically all of NLP up to that point in a single result. And basically the last five years has just been riding this result to the moon.

So this is a graph from a paper that came out last year called Llama 2. From my perspective they didn't even need to write a paper; they could have just put this graph online, and I'd get the idea. The graph shows perplexity, really low numbers here, stupidly low numbers here, and then billions of dollars here, and it's just: oh, you spend billions of dollars and it goes down. It's really helpful; I know exactly where they're at, they're at almost knowing language, and they, I don't know, spent billions of dollars. Great. I don't have that, but I like the graph. This other one is probably only hundreds of millions of dollars, but it's way worse; it's, I don't know, 3 perplexity worse. But this one's really good. So that's great: now I know the story, now I know what's going on.

Okay, so here's where we're at on the Wall Street Journal. Actually, give me one second and then I'll stop for questions. GPT-3, we think, is at 20.5 on the Wall Street Journal. That's pretty impressive. I don't know what the true perplexity is, but that's capturing a lot of what's going on. Maybe this is a bad way of thinking about it, but there are a lot of words in a newspaper article that no one would be able to predict; so most of the time it's probably getting the word right, and some of the time there are maybe a hundred possible choices it could be looking at.

Cool. Now, this is all good, but I haven't actually told you the punchline yet. I've just told you... well, I guess I told you the punchline but not the joke.
So: we saw Shannon-style models, we saw the evaluation, but I haven't shown you how to actually make these models that good. You can guess where we're going: we have to remove those assumptions, we have to move from categorical distributions to neural networks, and we have to consider all the previous words. But let me stop here for questions before we do the next formula.

"You started the lecture with comments saying that we can't prove anything, that we don't understand exactly how it works. We've now spent quite a bit of time creating metrics to evaluate accuracy, quality, and other things. Could you compare our inability to understand it with genuine metrics that we can apparently believe measure the quality, or some kind of value, of the output?" I think the point I wanted to make is that I cannot prove to you that perplexity will continue to correlate with amazing behavior of language models; there's no inherent reason why making perplexity go down makes math ability go up, or trustworthiness go up. For that part I don't have a proof. What I can say is that perplexity is very well understood and corresponds to compression, and that so far we've observed a remarkable correlation between perplexity and human-like ability. So if you accept that second claim as an axiom, then you're fine; but sometimes people hear that and say, "I don't believe that," and I don't have any proof of it.

"The perplexity that you show in the graph: is it over all sorts of tokens, or is it on a benchmark? Is it like an average perplexity across many tasks?" I'm being a little tricky here. You'll note that this says train perplexity, which is a little sketchy, because you normally want test perplexity, not train perplexity. The reason this is okay is weird: it's because they're only doing a single epoch, so they never see the same data twice, so they can keep this train value and run with it as they go. It's a little in the weeds and I didn't want to get into it too much, but you can think of this as just roughly the internet. And let me give you an example of why: you see this axis here, this is 2,000 billion, so you can think of this graph as just being all of language; you don't need to be specific, because this is everything. "Because I immediately think about how the tools I use for my daily tasks perform better than others on certain tasks, because they're designed for it. And averaging is just the first moment; why aren't we looking at, I don't know, the second one, the third one?" I'll give you the same answer I gave him: there's no inherent reason, but so far it's held up. Yeah, in the back.

"Is there any reason the interpretation of perplexity would change if we were modeling not the English language but a different language, or a computer language, or anything else we could model to predict? Would perplexity change, or would the interpretation stay the same?" The interpretation of perplexity is the same, but you'll get vastly different scales. For a programming language the perplexity is much, much lower; there's actually a lot that's very predictable in programming languages. What you can do as well is compare across different types of models. Say you have a model of English words and a model of English characters: you can actually compare their perplexities directly, because they're both, in some sense, a joint distribution over the text itself.
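One way to see why that direct comparison works (my gloss, not a formula from the talk): both models assign a probability to the same document, so their total code lengths in bits are comparable, even though the per-token perplexities live on very different scales.

```latex
\underbrace{T_{\text{words}} \cdot \log_2 \mathrm{PPL}_{\text{word}}}_{\text{total bits, word model}}
\quad\text{vs.}\quad
\underbrace{T_{\text{chars}} \cdot \log_2 \mathrm{PPL}_{\text{char}}}_{\text{total bits, character model}}
```

Whichever total is smaller compresses the same text better, which is what comparing the two models directly amounts to once you account for the different number of tokens.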
Multimodal gets a little more complex; you have to make some assumption about how often you see the different languages and how you mix them in that way. Cool. One more question, then we'll move to part two.

"I have two questions. The first is whether this is applicable to encryption and cryptology. The second is: if you use a lot of computing, do you use something like quantum computing to help you get to a prediction?" Good questions. First off, I think both the language of cryptography and the language of LLMs derive from this information-theoretic perspective, so some of the terminology will be similar. But a lot more of what we do is data-driven as opposed to constructing a system: we're much more interested in observing what the information content of real language is, versus constructing a system that's uncrackable in certain ways. The second question is interesting as well; almost no quantum stuff has touched any of this yet. We will see, though, in a later section, how much GPUs have influenced these systems in practice.

Okay, great. Let's go to part two. Part two is about memory; in particular, we're going to talk about attention and how attention allows us to relax the assumptions from the first part of the talk. First off, let's think about a Markov model that takes into account the previous two word tokens and produces a distribution over the next word token. Now, instead of focusing on the probability model, we're going to focus on the parameterization: we can choose any theta we want, as long as that theta produces a probability distribution over 10,000 words. Here's what we're going to use for that distribution. We're going to use what's called a neural net language model. It consists of two parts: the first part is a neural network over the previous two words, and then a softmax function. We're going to replace our memorized tables with this instead.

Here's the first part. I like this diagram because it conveys how straightforward it is: we feed the two previous words, x_{t-2} and x_{t-1}, into a neural network, and we produce something out. Now, the biggest recommendation I can give you when you think about neural networks is to think about the shapes of all the objects; if you understand what the shapes are, then most of the time you can understand how they work. So I have to be a little more clear about the shapes: I need to somehow encode x_{t-2} and x_{t-1}, and I need to tell you what the shape of the output is. The shape of the output is pretty straightforward: it's a vector of length 10,000, because we need a score for every next word. For simplicity, let's do the same for the input: x_{t-2} and x_{t-1} are vectors of length 10,000, all zeros except for a one in the position of the index of that word in the dictionary. That's all this neural network is. So now we need the size of the neural network's input and output: the network has to take in something of size 2 times 10,000, so 20,000, and it has to output something of size 10,000. Other than that, we can do whatever we want; the neural network can be just a boring old neural network you learned about in your neural networks class. There's nothing special here about language, as long as you eat the 20,000 dimensions in and then output the 10,000 dimensions out.
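A minimal sketch of this two-word neural Markov language model in PyTorch, just to make the shapes concrete. The hidden size of 256 and the use of nn.Embedding (an efficient equivalent of multiplying a length-10,000 one-hot vector by a weight matrix) are my choices for illustration, not details from the talk.

```python
import torch
import torch.nn as nn

VOCAB = 10_000   # word types
HIDDEN = 256     # illustrative hidden size

class TwoWordNeuralLM(nn.Module):
    """p(x_t | x_{t-2}, x_{t-1}): two 10,000-dim one-hots in, 10,000-way softmax out."""
    def __init__(self):
        super().__init__()
        # Embedding lookup == one-hot vector times a weight matrix.
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.hidden = nn.Linear(2 * HIDDEN, HIDDEN)  # squeeze the input down...
        self.out = nn.Linear(HIDDEN, VOCAB)          # ...then back up: one score per type

    def forward(self, prev2, prev1):
        h = torch.cat([self.embed(prev2), self.embed(prev1)], dim=-1)
        scores = self.out(torch.tanh(self.hidden(h)))
        return torch.softmax(scores, dim=-1)         # exponentiate, then normalize

model = TwoWordNeuralLM()
dist = model(torch.tensor([17]), torch.tensor([42]))  # two arbitrary word indices
print(dist.shape, float(dist.sum()))                  # torch.Size([1, 10000]) ~1.0
```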
Both of those numbers are pretty big, so you probably want your neural network to make them smaller and then make them bigger again; that's fine, you can just put the blocks together in that way. Additionally, we need to train this model, and to train it we need a function that maps from a vector of length 10,000 to a distribution of length 10,000, and we need that function to be well behaved. The universal function people use for this is called softmax. It corresponds to exponentiating each of the scores and then normalizing: two steps, exponentiate each one, then normalize, so we get a distribution that sums to one. We use it in this way, we train it the standard way you train neural networks, and we learn the underlying distributions.

Okay, so this was a major breakthrough even when it came out. There were lots of these neural network language models, and a lot of them were really good. The one people most remember is one called word2vec. Word2vec came out almost exactly ten years ago, and it won the NeurIPS Test of Time award this year, for ten years. What they did was demonstrate one of these Markov neural networks at a very large scale. To do that they had to train this big neural network on lots of data; they had to calculate the softmax a million times for every word, and then they learned these parameters and these predictions, got a pretty good perplexity, and demonstrated that the idea could work.

So that's pretty neat: it showed that neural networks could be really good at language modeling, and I think the first thing people thought was, let's just make these models bigger. If we train on more data with a bigger model, we can learn more about language and the approach will get better. But the problem is that this alone doesn't get you there, and you kind of know the reason already: this model, even if it's really powerful, can only look at the last couple of words. So it can only reduce the bits of perplexity a certain amount; it can only take advantage of the bits found in nearby tokens, and once this starts working really well, you start needing to look further away to make some of the harder distinctions. To get from three bits to two bits, you suddenly start needing to consider the wide-ranging semantics of the language.

Okay, so let's go through an example quickly; I think it'll be useful later. Say we have a model that looks at the previous two words, "dog walked," and has to predict the next one. This one, I don't think, is necessary; I don't think there are many bits in this word. "Walked to": well, it would be pretty helpful to know that it was a dog; the fact that it's a dog might tell us something about where we're going, though not that much. Probably this is going to be a place; I don't know if dogs walk to very different places than humans do, but maybe the semantics of where dogs walk will help us in that prediction. This last one, though, seems really helpful: without it we don't know the verb anymore, so we don't know that it's a location; it could be a person, it could be a country or something. We've really lost a lot of information there. So in order to distinguish some of the harder cases, we would really like to look back at previous words.

Cool, so: the David Parks example. Imagine our New Yorker article says, "We spoke with David Parks in CS."
Then, I don't know, fifteen pages later (it's a long article), there's some quote and it says, blah blah blah, "we looked into that," said Professor ___. If you do not know the history of the article, this could be basically any name; say there are, I don't know, 100,000 possible last names... I should pick a lower number, 5,000 possible last names. Without any additional information you're basically going to get that wrong, but if you could somehow remember this and pull it all the way up here, that would be really nice: you could reduce the number of bits a lot just by having long-term memory. People had observed this for many years; building models that could do this kind of thing consistently, without heuristics or hacks, was considered a really hard problem. Any questions about this example? I think this is an important one; we'll come back to it.

Okay, so what we'd like to do is build a model that's fully autoregressive, one that uses all the previous tokens, arbitrarily far away. Yeah, good question. "Does that mean you have to know what the document is, like where the document starts, or maybe the chapter or section of the book? Because it could be the previous words; where does this stop?" Let's ignore that for now; let's just assume we're given documents. The short answer is that all this stuff is going to be a bit stochastic; real examples are more complex than this, and you want it to work pretty well even in harder cases, so you might use context clues like headers or breaks or things of that form. Cool. I see some people in the back; there are some seats up here if you're looking for a spot.

Okay, so we want to use all the previous tokens, and a lot of models tried this; a lot of them go pretty far back. There are examples of models like this from the 80s, probably even before that. If we could do this really well, we could reduce the perplexity even further, and I want to convey a little of why this problem is non-trivial. Let's look at this example: here we want to give the words x_1 through x_{t-1} to a neural network and use it to predict the next word. Let's think a little about shapes. The output shape is the same here, but the input shape is now pretty weird: it could be variable length. Maybe we pick a maximum input length, say we never go more than 100 words back, but now we have to feed something of size 100 times 10,000 into this network, and that's a little weird; it's a big thing to take into a neural network. The other problem is, if you've seen the arguments people make for why you need things like convolutional networks for images, you have the same problem here: you're asking the neural network to memorize what position seven means, and position seven is not a meaningful concept on its own in language. The word just happened to appear in position seven; that doesn't mean anything, it's all relative to the other words. You want a model that can take into account the fact that language is variable-shaped and words move around pretty easily; you don't want to hard-code the absolute position into your neural network structure. It's a kind of hand-wavy argument, but hopefully it gives you a sense of why we need something more complicated.
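Just to put rough numbers on that "big thing" (my arithmetic, using the toy sizes from this talk): with a 100-word window and 10,000 word types, the one-hot input alone is

```latex
100 \times 10{,}000 = 1{,}000{,}000 \text{ input dimensions,}
```

so even a single dense layer down to a modest 1,000-dimensional hidden state would need on the order of 10^9 weights, each one tied to an absolute position like "position seven."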
Okay, so what we're going to do instead is a method known as attention. You should think of attention as being like random-access memory, or a lookup table, or a dictionary in your favorite programming language. The way it works is that it saves information from every previous position, and it lets us look that information up and bring it into our nearby context.

Let's go back to our dog example. If I say "the dog walked to the ___," I have what's called a query, which represents what I know about the sentence at the end; that's going to be in yellow. And we've saved a lookup table for every previous position: for "the," and then "the dog," and then "the dog walked," and then "the dog walked to," we save two values. One is called a key and the other is called a value. We use those to perform the following operation: we use our query to look something up in our table, pull out the value for that something, and then utilize that green thing to predict our next word. Let's go through the steps one more time: the query looks up a key, the key gives you a value, and the value is used for your prediction. So we'd like to take advantage of the fact that we know it was a dog, and use that to help us predict the word "park." Three steps: the query matches the keys, the best match is picked, and we return the value of that key. This is pretty cool. It looks like how you'd implement it in code, it lets us look arbitrarily far back in history, and it lets us find words that we might need. You can see how it might use the word "Professor" to look up the name of that professor and then use it to predict the next word. That's great.

There's one problem, and the problem is the second operation: an argmax is selected. Within neural networks, an argmax, or an if statement, or any kind of hard selection like this is actually kind of fatal: if you ever have something whose output jumps like this and then stays flat, you can no longer learn in that neural network. The way to think about it is: imagine the second-best candidate is here, and then its score goes higher and it becomes the best, and the output jumps over here. You have a functional form whose derivative is zero, so you can no longer learn with it in practice. So we need another method, and if you've ever seen neural networks before, you know where I'm going: we need to take this function and make it smooth, make it have a meaningful derivative at every location, so that we can train with it. Luckily we have a function like that; we just introduced it, it's called softmax. Softmax is a way to softly decide which was the best choice: it looks like that nice smooth function, and it gives us a way to pick which key we think is best while still being usable inside a neural network. Now, I just want to be clear: this softmax is slightly different from the one we saw before. It's mathematically equivalent, but before we were doing a softmax over the dictionary; here we're doing a softmax over memory. The softmax is over all the previous memory locations, and it gives us a distribution over which ones we want to pick.

So let's go through an example of how that works. We're going to do this new thing instead: the query scores the keys, the softmax normalizes those scores, and then, instead of picking just one value, we take a weighted average of the values.
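Here is a minimal numerical sketch of that replacement, for a single query attending over a handful of saved positions. The 4-dimensional vectors are toy sizes, and the 1/sqrt(d) scaling that real Transformers apply to the scores is omitted; both are simplifications for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy vector size for illustration

# One saved (key, value) pair per previous position: "the", "dog", "walked", "to".
keys = rng.normal(size=(4, d))
values = rng.normal(size=(4, d))
query = rng.normal(size=(d,))           # what the current position knows

scores = keys @ query                   # one score per previous position

# Hard lookup: pick the single best key. Not differentiable (the derivative is
# zero almost everywhere), so a network built this way cannot be trained.
hard_output = values[np.argmax(scores)]

# Soft lookup: softmax over memory positions, then a weighted average of values.
weights = np.exp(scores) / np.exp(scores).sum()
soft_output = weights @ values          # smooth in the scores, so we can train it

print(np.round(weights, 3))             # distribution over previous positions
print(hard_output, soft_output)
```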
key value the query score scores the keys and we do a soft Max over those scores that gives us a distribution over previous locations and instead of just predicting a single value we're going to do a weighted average over the values where the weights come from that histogram once we have that value we can then use that to predict our next work now in the stream case where dog was so much better than everyone else the histogram would put almost all the probability on dog and we get something that looked like our first method right that's why we call it softmax because it's like an approximation of that full original argmax that we produced okay I think sometimes this can be hard the first time you've seen it um so if anyone has any questions I'll pause here for a sec yeah guess I'm clear on like what the query is that scores the keys in the first place and what does that query look like how was it determining yeah really good question let's say just for Simplicity just to start with the query comes from uh a Markoff model that just sees the word the' so just like a very simple neural network over only the previous last word just just to start with so it like it's the word the it knows it needs to predict the next word it's like which of the previous words is going to help me in this scenario oh yeah yeah you need are you allowed to ask question sorry you want us to think about the value associated with say dog is representing dog or is representing something else uh I think it's fine if you just think of it representing doc we can think about that is conditioning the neural network for the prediction that's right um the reason I'm being a little sloppy with these answers the kind of true answer is that we're going to run this process multiple times so the first time we start with just the word and then the output of this kind of becomes a query and we do it again so you get you get a chance to look at multiple things uh throughout this process but the first time these are just simple they're just the words themselves but it kind of it builds up more information over time yeah as you move forward to across this a sentence and scan this memory gets updated and is that how it works um think about it as like you just get a new memory so none of the memories ever get updated it's just that as you go through the sentence this memory table gets longer but the memories never look forward in time they only look backwards in time yeah did you consider the um context window to be the same as this idea of memory or is it is there a distinction for now we're assuming context window is infinite so you're seeing the whole article in practice context window would consist of a kind of cap on how long the memory could be so if you hit a if you hit a a fixed cut off you start throwing away your early memories but that's more of an implementation detail I think in practice people think we'll eventually get these things to just remember everything um and yeah I ask um I can't understand why this kind of model uh why we still see like logic deduction problems so I saw a paper that was like uh um Mary is Tom Cruz's mother yeah the question is who is Tom Cruz's mother and the we cannot produce a results okay it's a really good question let's do it if you can stick around section five is going to cover this a little bit more you have a bit to go but yeah yeah but yeah it's a really good question we can talk about that paper more uh yeah uh I have a two question so first is that so what does like uh the key value 
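To make the step just described concrete, here is a minimal NumPy sketch of soft attention over a memory of keys and values. The dimensions and the random vectors are made-up illustrations, not anything from the talk, and a real model would compute the query, keys, and values with learned projections rather than sampling them.

```python
import numpy as np

def soft_attention(query, keys, values):
    """Soft lookup: score the keys with the query, softmax the scores,
    and return a weighted average of the values."""
    scores = keys @ query                              # one score per memory position
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over memory positions
    return weights @ values                            # weighted average of the values

# Toy example: 4 previous positions ("the", "dog", "walked", "to"),
# each with a 3-dimensional key and value (made-up numbers).
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 3))
values = rng.normal(size=(4, 3))
query = rng.normal(size=3)

context = soft_attention(query, keys, values)
print(context)   # a blended value vector, used to help predict the next word
```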
Question: I have two questions. The first is, what do the key, value, and query mean from a semantic perspective? And why does this model structure, with keys, values, and queries, perform better than something like word2vec?

Good questions. Let's do the first one, and then I'll go on to the second. What do key, value, and query mean? Let me give you a blunt answer: because we did this, we don't have to answer that. The argument is that if you can make the operation differentiable and learn it in PyTorch, you don't have to decide what it means. We set this up as a mathematical operation that is differentiable, which means we can train it on real data, which means the system can do whatever it wants with it.

Question: So it's just convenience for the objective function?

Just convenience for the objective function, yes. If you can make it differentiable, you can learn it, and then it can learn whatever keys and queries it finds interesting. Your second question is actually where we're going next; I'll get there in a couple of slides.

This idea is actually an older one that had been used in lots of papers, but it got really popularized by a paper called "Attention Is All You Need" in 2017. This is the central paper in language modeling, and I highly recommend that everyone read and study it. I personally spent basically an entire summer trying to implement it, and wrote up my findings about how challenging that was, but I eventually got there. The paper demonstrates, not this idea for the first time, but that this idea was the core idea: multiple uses of attention, basically what I showed you on the previous slide, just work really, really well. They were interested in machine translation, but it turned out the same idea worked for language modeling, and it demonstrated that you could get this holy grail of a long-term, full-context language model. The model they produced is called a Transformer, and it answers some of the questions people were asking. I have focused on just one box, the attention part. In a Transformer you do the following: you start with all the memories being just the words, you do attention, you take the output of attention and feed it to a neural network, and then you start over again, and you stack that process. In their paper they did it six or twelve times, and each time the memories learn more about the previous words and get better. This diagram has become iconic because it describes almost the entirety of ChatGPT; this is the T part. And really, OpenAI's brilliance was to say: why don't we make this a thousand? Why don't we make it as big as we can? It keeps getting better, so let's just figure out how the heck to take this diagram and make it bigger. I'm half joking, but I think it was genuinely brilliant in some way to be that confident: this works, let's go for it.

Cool. So: an old-fashioned neural network plus attention, where attention deals with the history and the memory. Other question: why not X, why not some other approach? There are thousands of papers with other approaches to this problem, and I often get frustrated, because when I look this up online it says "attention is like memory, and memory is important," which is an infuriating argument because it's not falsifiable. Other things are like memory too, or not like memory; why should one work better than another? What's really key to note is that attention is good, but it is also extremely efficient: there are mathematical reasons why computing attention runs really fast on modern machines. So I want to go through attention one more time, just to convey why this is such a nice way of getting long-term context.

Note that we don't actually need to predict just the last word to compute perplexity; we want a distribution for every word. So every position has a query, and we can stack all those queries up into a matrix. All the keys are also basically a matrix, and all the values are a matrix. Think about the shapes of these objects: one query for every position, by whatever the hidden size is; one key for every position, by the hidden size; and one value for every position, by the hidden size. The first operation is multiplying the queries by the keys and taking a softmax. That matrix multiplication gives you back something that is sentence-length by sentence-length; you take a softmax of it, which gives you the histogram for every position. Then you matrix-multiply the output of that by the values, and use the result to predict the next word. The main thing I want to convey is that our three steps correspond to three operations we have already seen: "query scores keys" is a matrix multiplication, then a softmax is applied to that matrix, and then the weighted average is another matrix multiplication. Matmul, softmax, matmul. This is the iconic formula from the Transformer paper; it describes the main operation at the center of what they're computing, and it describes the little diagrams we've drawn so far. What they showed in that paper is that, while there are many ways to get memory, this is a way to get memory that happens to use basically just these two kinds of operations.
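For reference, the formula being described is the scaled dot-product attention from "Attention Is All You Need", where Q, K, and V stack one query, key, and value per position and d_k is the key dimension (the 1/sqrt(d_k) scaling is a detail the talk glosses over):

```latex
\mathrm{Attention}(Q, K, V) \;=\; \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```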
So now we have a generative Transformer, but I haven't yet told you why the heck it matters that we built this out of matmuls. In the next section I'll dive into why you can compute this fast on modern hardware, and why these operations in particular are efficient to run. But let me take a break first and make a couple of things clear based on the questions. One is that I have only been showing a single step of attention: the query looks up a key and value, and we use that once. That is actually how it was first designed, but one of the key innovations of the Transformer paper is that you do this process multiple times, and then it starts to look like something more like reasoning: not just "look up one word and use it," but a kind of multi-hop procedure. You have a query, you use it to get a key and a value, and then the neural network, instead of predicting the next word, predicts the next query, and you do it again. By the time you've done that five times, the query has had the chance to look back at, say, five different keys and use them to make its final decision. So I simplified it, but I didn't want to make it too simple, and that's what this diagram shows: when I write 12x, that is the number of hops of this process. The final thing to note is that Q, K, and V are not parameters; they are activations. They are intermediate values computed by the neural network, not learned directly. The learned part is here: a standard neural network, and it's the same neural network applied to each position. That network does not get to see the whole sentence; it sees each position, computes a new value, and that gets fed back into the next round of attention. Going back to our example: these keys all came out of a neural network applied to the words; the query looks them up, the result goes back through the neural network to produce the new query, and then we run the process again.

Question: I'm trying to follow. When we were talking about this earlier, the thing you fed into the neural network was this 10,000-dimensional vector. Are you saying each key is a 10,000-dimensional vector?

It's a little more complicated than that. The first layer of the neural network maps from 10,000 down to a hidden dimension, let's call it a thousand, so it's much smaller after the first layer, and that's what gets used in this process.

Question: So it's a distillation of it?

That's one way to think about it. In terminology that isn't important for this tutorial: the 10,000-dimensional vector feels big, but it's sparse, with just a single one in it, whereas the thing you get out is smaller but dense, so they're roughly comparable sorts of objects.
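As a rough picture of the loop just described (do attention, feed the result through a small neural network, repeat), here is a deliberately cartoonish sketch. It redefines the soft_attention helper from the earlier snippet so it stands alone, uses random matrices where a real model has trained weights, and leaves out multiple heads, residual connections, and layer norm, so it only shows the shape of the computation, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_positions, hidden = 5, 16                 # toy sizes
x = rng.normal(size=(n_positions, hidden))  # one vector per word position

def soft_attention(query, keys, values):
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

n_hops = 12                                 # the "12x" in the diagram
for _ in range(n_hops):
    # Q, K, V are activations: computed from x with (here, random) projections.
    Wq, Wk, Wv = (rng.normal(size=(hidden, hidden)) * 0.1 for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Causal attention: position i may only look back at positions 0..i.
    attended = np.stack([soft_attention(Q[i], K[:i + 1], V[:i + 1])
                         for i in range(n_positions)])
    # The same small "neural network" applied independently to each position;
    # its output becomes the input (and hence the queries) for the next hop.
    W1 = rng.normal(size=(hidden, hidden)) * 0.1
    x = np.tanh(attended @ W1)

print(x.shape)   # still (n_positions, hidden): refined per-position states
```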
Okay, great. So let me get back to the key issue. We want to understand why attention is good, and I made the claim that attention is good because it looks like a matrix multiplication, but I didn't tell you why that is good. So in this next section I want to talk about matrix multiplication, and this is going to be very different from what we've done before. First I want to talk about why language models suddenly got so much better, and I cannot emphasize enough that this is the answer. All the other stuff is important and necessary and amazing work that should be celebrated, but it was really essential that the underlying hardware developed. In some sense I think of AI as the application that made sense on GPUs, and GPUs are the thing that keeps getting better; by finding the right thing to run on this hardware while the hardware improved, that's how we got here. I can make this argument from a technical point of view, but since we're near the business school, why not make it from a stock-price point of view: over the last couple of years, it's hard to argue with this graph. It looks a lot like the perplexity graph, though I would say it's less important; other people might say more important, I don't know. I have cartoonishly had this graph in my class lectures since 2020, and it never occurred to me to just buy the stock. I have proof that it was in my lectures for years, and it looked like a big graph even then. Anyway, Nvidia is a company that makes GPUs. There's a worldwide shortage of GPUs, and Nvidia is worth about as much as Apple or something, which is pretty crazy. What they do is build hardware that runs the kinds of applications that language models are.

The move from CPUs to GPUs fundamentally altered the way we think about these problems, and I'll give you an example of how the research landscape changed just from having different underlying compute. I told you the softmax function is really important: you exponentiate and then you normalize over your dictionary. But dictionaries are really big. For every word token in your New Yorker article, you have to take a sum over the entire dictionary, and that's actually pretty slow; if you're going to run language models over billions or trillions of tokens, you can't have a for-loop computing this value. So around 2010, if you look back at the papers, you'll see all this super interesting research into approximating softmax. The thought was that most of the terms in the sum are small, it's kind of an expectation, so you can approximate it to good precision, and so on. Those papers are really fun, but after that it's just the GPU: you simply don't care anymore once you have hardware that can compute this efficiently. Everything really changed at that point, and things that were hard suddenly became easy. And remember, attention has a big fat softmax in it; we need to take that softmax over all the previous tokens, so it's really nice that a GPU can do this kind of thing efficiently. Any high-level questions about GPUs?

Instead of teaching you how to program GPUs, I want to work through one example. One of the things GPUs are extremely good at is arbitrary matrix multiplication. Matrix multiplication is essential to computing neural networks, and as you saw, it's essential for attention, and I think it's really important to understand why that's true. So I'm going to walk through an example of programming on a GPU to give you some intuition for why matrix multiply is efficient. We're not going to see any code, so don't worry about that; I just want to give you mathematical intuition about how GPUs differ as a computing platform from regular computers.

I find it really important, when I get to this section of my class, to make the GPU figure as cute as possible, because this portion of the class just destroys my students for some reason, so it has to be very cute to make up for that fact, so they don't give me a zero in the course guide. For you, I promise, you're smart, you'll get it right away. A GPU is a parallel computer. It has many of what are called threads. The threads run like normal computers, but they run simultaneously, and you literally have to write the same code for all of them; they are all running the same code at the same time. This can be really hard, because you can't just put in print statements or breakpoints when you have thousands of threads all running at once. To bring some order to the chaos, we group the GPU threads into what are called blocks. This is a block, and this one is made up of twelve threads, each running the same code simultaneously, but they all have access to this thing here, which we'll call block memory. All of the threads in a block can read and write block memory, and they can do it very fast: this thread can write something, that thread can read it, and they can use that information to do their computation. The blocks are further grouped into what's called a grid, and the grid represents what you're paying for from the GPU: all the threads you have available, all running the same code simultaneously across the GPU. These threads can also read and write what I'll call global memory; this one can write a value to global memory, and that one can read it.

Now, the key rule of GPUs: global memory is bad, block memory is good. That's the game. It's a very hard game, but if you can code in such a way that you rarely read from global memory and mostly read from block memory, you can make amazing things run efficiently. You might think this is really easy, you'll just make your block giant. It doesn't work that way: there's a fixed limit on how big a block can be, essentially hardcoded in the GPU, so you can't make the block infinitely big, and you have to solve problems such that most of the computation happens in the block and not in global memory at all. So here's the game. This version is bad: you can't just have each of your threads read and write global memory directly. That would be easy, and it would work, but it would be slow. Instead, think of it as a two-stage process. Stage one reads from global memory into block memory, shrinking things down into these smaller blocks; then you do all the computation in the block, maybe multiple rounds; and then you write the result back out to global memory. If you can do that, it will be fast; then you can invent new algorithms and make them run efficiently. Ready for an example? Any questions about the hierarchy?

First we're going to compute a 3x3 matrix multiplication. As a reminder, here's how matrix multiplication works: you take a row of one input and a column of the other, multiply the paired values, and sum them up to get an output value. The natural thing to do is to assign one thread to each output position. Computing this entry and computing that entry is the same code; they're both multiplying values and summing them, just at slightly different positions, and that's allowed because it's exactly the same code. So we should be able to assign one thread to each output. But that's bad, because each of those threads needs to do a bunch of reads from global memory. Let's count: I need to read one, two, three, four, five, six values from global memory, sum them up, and write one value out. I could do that, but most of my time and energy and dollars would be spent on global reads. So what do we do instead? Instead, we first read everything we can into block memory, we calculate the results in that block memory, and then we repeat as many times as we need to compute the whole matrix.
Here's the key intuition. Look at these two output elements, this one here and this one here: they need different values from B but the same values from A. If those shared values were already sitting in block memory, we wouldn't need to read them twice. That kind of overlap is what we're taking advantage of. Here's what it looks like. First we use all of our threads just to read values in: the full A into our block memory, and the full B into our block memory. That is 2 x 9 reads, or two reads per thread. Once everything is there, we can compute each output value within the block by reading only block memory, and those reads are basically free compared to global reads. Different threads in the block compute different output values. All in all, we end up with 2 x 9 = 18 global reads, versus 54 before. We're getting a discount because different threads use the same values: by reading them into shared block memory first, we save having to read them from global memory repeatedly. It feels relatively straightforward, it's just a game of figuring out where the reads are, but this is basically it. This is the trick for GPUs: if you can get the number of global reads down as small as possible, you can make these computations very efficient. Any questions about that?

Question: Is global memory bad because of a physical constraint, or is it something we intend?

Global memory is bad because it's slow. Is that a physical constraint or a design choice? Good question. I'm not a hardware person, so I don't totally know, but global memory has to be connected to everything, so you can't make it as fast, whereas block memory can be co-located with the threads it serves; I don't want to say something dumb, but it can be much faster because it only has to serve that very small area. I think this is generally true in computers: there are all sorts of different memories, bound by different constraints, and some are faster than others.

Question: Is this different from the caching we have in CPUs, with cache layers?

It's similar. This idea of a memory hierarchy is very similar; it's just a bit more explicit in GPU programming. You're not relying on the cache; you're explicitly saying "I'm using this memory for this." When you code it up, you basically have two arrays, one global and one block, and you decide which one you're using.
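To spell out the arithmetic behind the 54 versus 18 comparison above, here is the counting for the 3x3 case (just the bookkeeping, nothing GPU-specific):

```python
n = 3                                   # 3x3 matrices

# Naive: each of the n*n output threads reads a full row and a full column
# from global memory (2*n values each).
naive_global_reads = n * n * 2 * n      # 9 outputs * 6 reads = 54

# Tiled: the block first copies both full matrices into block memory,
# then every product is computed from block memory.
tiled_global_reads = 2 * n * n          # 2 * 9 = 18

print(naive_global_reads, tiled_global_reads)   # 54 18
```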
Cool, okay, great. So that's the easy case, but in practice we do not know the size of the matrices we need to multiply, and most of the matrices we multiply are going to be way bigger than our block size, so we have to be more clever. If you get this algorithm, that's really, really good; you'll have much more understanding of how ML works than most people who use ML, and I think very few people totally grok it. Say we have to deal with a 6x6 matrix, but our blocks can only have nine threads. We can't do what we did before, because we can't load the entire matrix into block memory; it's not going to fit, and if the matrix doesn't fit, we can't multiply across full rows and columns. So what we're going to do instead is read part of the matrix into block memory, do some computation, write that computation out, and then read some more in, and for that to work we have to take advantage of the structure of matrix multiplication. Here's what it looks like. Instead of reading the whole first matrix in from global memory, we read just its top-left 3x3 block into here, and instead of the whole second matrix, just its top-left 3x3 block into there. That's 2 x 9 reads, just like before. We use these two tiles to compute part of what we need to fill in the output, taking advantage of the fact that matrix multiplication is linear, it's additive, so we can accumulate part of the answer as we go. We use our threads to fill in each of the values of this output tile by doing just what we did before: multiplying pairs and summing them into block memory. But these are not the final values of the output, because we've only covered half a row and half a column; they are, however, summable with the rest. So once we've filled in those values, we slide over, left to right in one input and top to bottom in the other, and read the next tiles into block memory. That again is just 2 x 9 reads, but now it's the second half of the rows and the second half of the columns we need. We do the step again, multiplying and adding to the previous values. Now we've done two big block reads from global memory and two 3x3 tile calculations, and we have everything we need for the top-left quarter of the output, so we write it out to global memory. We've handled a 6x6 matrix with only twice as many global reads as the 3x3 case. But what about all the rest? We haven't touched the bottom or the right part of the matrix. Remember, this was just a single block of the GPU, and we have a whole grid: this block calculated the top-left of the output, maybe this one did the top-right, this one the bottom-right, and this one the bottom-left. Each of them has its own block memory and does exactly the same thing we just saw, so there were three other copies of this process happening that we didn't watch.

So what did we do? Every one of our threads ran basically the same code and did only two things: it loaded from global to block, and it multiplied and added within the little block. The four blocks acted basically independently; they never had to talk to each other. And the only communication within a block was the fact that threads reused the same rows and columns sitting in shared memory. This algorithm lets us compute matrix multiplies really efficiently. It fully takes advantage of the parallelism of these machines, and it ensures that we're using block memory to the maximum extent possible. When you call it from, say, CUDA code, it goes under the name GEMM. We focused just on the A-times-B part, but you can add in some of the other terms as well, and it's an essential building block for building up these systems.
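Here is a minimal NumPy sketch of that tiling idea: the product is built up one pair of tiles at a time, with explicit copies standing in for the global-to-block loads. The tile size and matrix contents are illustrative choices, not anything from the talk, and a real GPU kernel would run the output tiles in parallel rather than in Python loops.

```python
import numpy as np

def tiled_matmul(A, B, tile=3):
    """Blocked matrix multiply: only tile x tile pieces of A and B are
    'loaded' (copied) at a time, mimicking global-to-block memory reads."""
    n = A.shape[0]                              # assume square, divisible by tile
    C = np.zeros((n, n))
    for i in range(0, n, tile):                 # which output tile this "block" owns
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile))        # partial sums kept in block memory
            for k in range(0, n, tile):         # slide across A's rows / B's columns
                a_tile = A[i:i+tile, k:k+tile].copy()   # "read into block memory"
                b_tile = B[k:k+tile, j:j+tile].copy()
                acc += a_tile @ b_tile          # computed entirely from block memory
            C[i:i+tile, j:j+tile] = acc         # one write back to global memory
    return C

A = np.arange(36.0).reshape(6, 6)
B = np.ones((6, 6))
assert np.allclose(tiled_matmul(A, B), A @ B)   # same answer as a plain matmul
```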
When people assess the quality of a GPU, they compare how fast it is at computing these GEMMs, and when they think about a model, they think about how many of these operations there are and how big they are when you stack them up. So we can think about a Transformer as a series of these operations. For example, we saw that attention has this calculation here and this one out here; both of those are just GEMM calls. Softmax is actually a different operation, but it is also efficient on GPUs, so we're taking the best advantage of the hardware we have available. Here's a graph, I think from an Nvidia blog post, that compares three generations of GPUs, the A100, the H100, and an extension of the H100, on how fast they run these GEMMs, and the fp16 there means they're doing it with a smaller floating-point format. What's nice about a smaller floating-point value is that you can fit more of them in your shared memory. In modern GPUs, people think very hard about how much storage each number takes up, because, as we noted, block memory has a limited size, and if you can fit more numbers into it, you can make the tiles bigger and compute the whole matrix multiply faster. So people now talk about fp8, or int8, or other even smaller formats to speed this up in practice. I really hope there isn't a computer-architecture professor in the audience, because I'm embarrassing myself with the terminology, but I'm giving an ML take on the topic. I think that's about all I have to say about GPUs. Any other GPU questions?

Question: I think you said this, but you don't really have a good idea of what the matrices going in look like; you only know they're dense. So you can't do sneaky things like assume they're symmetric, or make them sparse?

Really good question. There are some people who have looked into sparsifying matrices, or forcing certain kinds of sparsity on them. One thing that's very tricky is that, if you look at these operations, even if a matrix were sparse we couldn't really do anything about it, because we're computing the whole tile simultaneously. So what people often talk about in the GPU world is block-sparse matrices: if we knew this entire block was all zeros, we could just skip the whole thing. If you can say a 16x16 block of your matrix is all zeros, you can save a lot of time; a couple of zeros in random places doesn't really help.

Question: You mentioned at the beginning that very few people are good at programming GPUs. What would make someone good at it?

I guess being able to ship nonstandard things that you can't build in PyTorch. Very few people in research tend to do that, but the ones who can are highly valued for it. There are a couple of good examples of this: there's a paper called Cura that people really like where you need to do GPU-level work, there's some work in parameter-efficient modeling, and there's a recent paper I really like called Mamba that does this.
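To illustrate the point about smaller number formats, here is a tiny calculation of how many values fit in a fixed amount of fast memory at different precisions. The 48 KB budget is a made-up example figure, not a number from the talk, and real shared-memory sizes vary by GPU.

```python
import numpy as np

shared_memory_bytes = 48 * 1024              # hypothetical block-memory budget

for dtype in (np.float32, np.float16, np.int8):
    bytes_per_value = np.dtype(dtype).itemsize
    n_values = shared_memory_bytes // bytes_per_value
    print(f"{np.dtype(dtype).name}: {bytes_per_value} bytes/value, "
          f"{n_values} values fit")          # halving precision doubles capacity
```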
There's also a famous example of computing attention itself more efficiently, called FlashAttention. We could go through the details; you probably know enough now to understand that paper. It basically makes an argument about computing the attention formula more efficiently: one problem is that you end up having to store that big attention matrix in global memory, and that's really slow, so even if you do the matrix multiplies fast, you still have to write that matrix out. If you can avoid writing it out by doing the whole thing in one fused operation, that's really good. And a quick plug: if you're interested in GPU programming, I have a series of puzzles online where you can learn it in Python, little puzzles called GPU Puzzles. I'm not good at naming.

Question: Does everything we learned here about GPUs, and the way block memory works, translate directly to TPUs, and in practice are those used for this kind of thing?

Good question. My understanding is that there is essentially no way for people who do not work at Google to program TPUs, so people generally do not write TPU code unless they work at Google. When I talk to people at Google, I get the sense that it looks relatively similar to this. There is a recently released library that looks really cool called Pallas, P-A-L-L-A-S, a general-purpose library from the Google team that targets both TPU and GPU, and it has similar principles: you have to think about what your block memory is and what your global memory is. So yes, it's a similar architecture, but there isn't an equivalent way to just easily code for TPUs.

Great, let's keep going. I just made you sit through twenty minutes of me talking about matrix multiply, and I think some of you in the audience are thinking: that's kind of boring, I don't care, why are we so obsessed with speed? Is this just to hit benchmarks, or is there a real reason it matters how often you load from global memory? The answer is that there are reasons language-model optimization is so intense, actual tangible benefits to doing this slightly better, and people have been thinking a lot recently about what that means and why we care that it's so fast. So in the next section I want to talk about why GPU optimization matters, by talking about what the point of all this is. I want to talk about scaling, and the formula we'll talk about is called Chinchilla.

We now have most of what we mean when we talk about GPT. GPT is a large language model; it stands for generative pre-trained Transformer. The generative part was perplexity, the Transformer part was attention, but the pre-trained part we haven't really talked about so much. I could talk to you about how training works, but it's not that interesting: we basically just try to make perplexity go down on the training set, by taking the gradient of the perplexity and updating the parameters to push it down.
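As a sketch of "take the gradient of perplexity and update the parameters," here is a minimal PyTorch training loop for a toy next-word predictor. The model is deliberately trivial (a fixed four-token context fed through one linear layer rather than a Transformer), and the sizes and data are fabricated; it only shows the loop of computing a cross-entropy loss, which is the log of perplexity, and descending its gradient.

```python
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, hidden),
                      nn.Flatten(),                       # concatenate a 4-word context
                      nn.Linear(4 * hidden, vocab_size))  # scores over the dictionary
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                           # average negative log-likelihood

# Fake training data: batches of 4-token contexts and the token that follows each.
contexts = torch.randint(0, vocab_size, (32, 4))
next_tokens = torch.randint(0, vocab_size, (32,))

for step in range(100):
    logits = model(contexts)                  # a predicted distribution per example
    loss = loss_fn(logits, next_tokens)       # log-perplexity on this batch
    optimizer.zero_grad()
    loss.backward()                           # gradient of the loss w.r.t. parameters
    optimizer.step()                          # update parameters to push perplexity down

print(torch.exp(loss).item())                 # perplexity on the (fake) batch
```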
What's a little more interesting to me these days is how much data you train on and how big a neural network you use. How big the neural network is is basically how much it can learn: how much information you can fit inside the box, which limits how complex a function of your data it can learn, how much memory it can utilize, and what sorts of behaviors it can pick up. The other question is how much training data you give it. We know that giving the model more training data makes its general-purpose perplexity go down: if it has seen lots of examples, it can generalize better and use them to answer other questions, and it also just gets more knowledge, because it sees more facts about the world that it can utilize later on. The models seem to be able to do things like memorize documents inside the system. So, two properties, neural network size and training data size, and they both matter. But it's actually kind of challenging to know what the next model you build should be. This kind of problem comes up everywhere, but it really comes up when you're doing something like rocket science, or any major foundational decision making: you have to decide beforehand what these two things should be, before you spend millions of dollars actually doing it. Scaling is all about how you estimate these questions, how you decide what you want in practice.

Here's why this matters. The bigger your neural network, the more compute you'll use; the more training data you use, the more compute you'll use; and there's a multiplicative factor, because if your model is bigger it will be slower per example, since you'll have to run more layers of attention and take more time for each of those layers. For the current generation of models there's an easy heuristic, which is that each data point touches every parameter: if I send a new word through the model, it basically touches every parameter, so the cost scales multiplicatively. The total amount of compute, which we often measure in FLOPs, comes out as the product of the neural network size and the training data size. Here are some examples, three famous models we have data for. BERT base, released in 2018, has 109 million parameters, was trained on 250 billion tokens, and used about 1.6 x 10^20 floating-point operations. GPT-3, from 2020: 175 billion parameters, 300 billion tokens, 3.1 x 10^23 FLOPs. I glazed over these numbers for many years and just didn't care; it's weird that people report them to such precision. If you're interested in this stuff, I really recommend thinking of them Fermi-style, in terms of orders of magnitude. 109 million to 175 billion is a really big difference: within two years we grew these models by roughly three orders of magnitude, while the token counts are actually not that different, and then this third one gets more tokens and is slightly bigger. It's worth getting a sense of what these numbers mean: how much a trillion tokens is, and what that amount of compute means when you run the model over every one of them.

One thing OpenAI observed, and wrote about in really interesting detail in a paper called "Scaling Laws," is that all of these factors help the language model. If this weren't true, it would be very bad; it would mean we're spending a ton of money and not getting any benefit. They show that if you increase compute, or parameters, or data size, you get these graphs where perplexity decreases in proportion to the scaling. What's interesting about these graphs is that they're all log-log, so you get a roughly linear line on a log-log plot.
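A quick sanity check on those compute numbers: a standard rule of thumb in the scaling-laws literature (consistent with the "every token touches every parameter" heuristic, with an extra constant for the forward and backward passes) is that training compute is roughly 6 times parameters times tokens. Plugging in the figures quoted above roughly reproduces the FLOP counts:

```python
def train_flops(params, tokens):
    # ~6 FLOPs per parameter per token (forward + backward pass rule of thumb)
    return 6 * params * tokens

print(f"{train_flops(109e6, 250e9):.1e}")   # BERT base: close to the 1.6e20 quoted above
print(f"{train_flops(175e9, 300e9):.1e}")   # GPT-3: close to the 3.1e23 quoted above
```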
That linear trend on a log-log plot indicates that if you really invest a lot in each of these things, you'll see continuing returns on your perplexity. So this is arguing that major investments in compute, parameters, or data size will lead to proportional gains in perplexity, and if we take the axiom that perplexity is all that matters, then you get better and better language models. The natural conclusion, which I really credit OpenAI for jumping on very early, was: let's just make it really big. If you see plots like these and you can get investment, this is the time when you want to build your rocket ship; you want to make a concrete, targeted investment in one thing, since you can predict the gain you'll get out. Was there a hand?

Question: My question is about the correlation between increasing the size of the model and the sample size. To my knowledge there's no rigorous statistical method that tells us how much training data we need, but is there a guideline for how much data per additional parameter?

I think we're getting there; give me a slide or two. Any other questions?

Question: Every once in a while you see in the news that some company announces a pre-trained model that's much smaller than another one, but there's another thing behind it: they probably used, I don't know, ten times more data.

Great question, we're going to get there too. Any questions about these quantities, what they mean, where you get a trillion tokens? Cool.

I think this intuition is right; I really want to convey that "let's make them all big" is right. The problem is the following: you need to tangibly utilize the budget that you have. If you have a billion dollars and you give it to Nvidia, they'll give you FLOPs, compute. Once you have that compute, you have to decide how you're going to allocate it, and in the limit the allocation decision can be abstracted to this question: do you want more parameters, or do you want more data? This wouldn't be true if every day someone came up with a new model that changed the formula, but I think even by 2020 people were settled: Transformers, perplexity, works, let's go. Once you've fixed those variables, it just becomes a question of how big and how much data. One method is to train on lots of data with a teeny model; the other is to have a really big model and train on less data. They both cost the same; you're going to pay that amount of compute either way. So which do you do? Here's an example we saw, PaLM, from Google. They trained a very large language model and went with the big-robot, little-books strategy. Look at the picture: 540 billion parameters, 700 billion tokens. That's a particular decision they made about how to allocate that amount of compute. Is it right? How do we know? It's not that crazy a space, there are only two variables, but it's probably extraordinarily nonlinear, so we don't know whether this was the right decision or what it learned.

The main formula we'll look at in this section uses the power-law assumption we saw before to try to extrapolate the performance of models. The formula is relatively simple, for one of these formulas, but it does take a little trickiness to understand. We're going to model the expected perplexity upon completing training as a function of just two things: the number of parameters and the amount of training data. We think that value is a power law of those two things, which we can write down as an equation. N and D are our inputs, and the rest of the values we're going to estimate. The E value represents the best possible perplexity; someone asked about that earlier in the talk. It's the ideal perplexity you could ever get from language, and you can't ever do better than that. Then one term represents how far you are from having the best possible model, and the other term represents how far you are from having the best possible data. So the equation has these two terms, here's N, our model, and here's D, our data, and what we don't know is how much each term contributes to the final value: the coefficient on our final perplexity that comes from the parameters, and the coefficient that comes from the data. In the paper they go through various ways to estimate these two coefficients, some more theoretical, some just running a ton of experiments and fitting curves. What they find is that, under very precise assumptions about your model class, your training algorithm, your learning rate, all of those things, within the very narrow universe we've converged to, you can make predictions about the final perplexity of the model, and the answer is quite nice: the scaling coefficient on your model and the scaling coefficient on your data are roughly the same. What this implies is that, because we're in this nice world, the equation has a nice answer. You remove a degree of freedom and get what's called a compute-optimal scaling law, which says that if your goal is the best perplexity, the best use of your money is a model size directly proportional to your data size.

This is really cool; I love this paper. They were able to go back and say, we did it wrong. They reassessed a bunch of results and said, hey, that's a rectangle, it should be a square: we know this formula says you should scale them equally, so why did we do it that way? And they did it for everyone; they showed everyone was wrong. It's really cool: just make it a square and you're good. Here's the graph from their paper: tokens on a log scale, parameters on a log scale, and everyone else was way up here. I don't know what was happening, but I think in 2020 people just wanted to have the biggest model; there was this thing where it was like, we're Microsoft, ours should be five times larger than OpenAI's. What ended up happening was that you wasted compute. You want to be on this line, the equal-scaling line, and that's what they showed: they released a smaller model trained on more tokens that did better, and they argued that as FLOPs get higher you should roughly stay on that line. Those earlier models were still good models; both directions are good. But if you want to use the compute in the most efficient way, that allocation could be wrong.
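Written out, the parametric form he's describing is the loss fit in the Chinchilla paper (Hoffmann et al., 2022), with N the parameter count and D the number of training tokens:

```latex
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

E is the irreducible best-possible loss, the A/N^alpha term is the penalty for a finite model, and the B/D^beta term is the penalty for finite data. The fitted exponents alpha and beta come out roughly equal, which is what yields the compute-optimal prescription of scaling parameters and tokens together.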
And this is still a really useful thing. When people train open-source models, this is a really handy guide, and it would be really nice if the people who have lots of money to run lots of experiments would release more guides like this, because they prevent the rest of us from going off in weird directions. In some sense this is the punchline: just do it like this. It's a rather simple story that you get from the extrapolation. Now, I don't think you should assume too much about this; in the future, people might produce models where the compute doesn't scale in exactly the same way. For instance, one model that's gotten popular recently uses a method called mixture of experts. It doesn't matter what that is, but what's important about mixture of experts is that you don't end up paying one parameter per token; it has a different ratio of things, and so it might lead to a different version of these kinds of formulas. Any more questions about Chinchilla?

Question: A quick question about the open-source models that come in different parameter sizes, like 7 billion and 70 billion. Is the data they're trained on the same, and if so, do they have relatively different efficiencies?

Really good question. This equation kind of only applies to one dataset; for each individual dataset you might get slight permutations of this kind of thing. A lot of open-source models are trained on similar data, but maybe not exactly the same, and when you start talking about multilingual models or code models, things get a little different, so that complicates things a bit in practice.

Question: This is kind of a half-vague question, but why? Is this a purely empirical insight, or is there a reason people think it happens?

I think of it as purely empirical. I've seen papers where people go into the theory a little more; I think it's beyond me. But even if you just think of it as curve fitting, it's already pretty useful.

Question: There is a cost to acquiring the dataset, and a cost to scaling the model size because of hardware limits on how big the model can be. How do we incorporate those costs?

I'm extremely fascinated by this question. Thinking about constrained versions of this optimization problem is really interesting; let me talk through one, and then we can talk through some others. One important caveat: this slide is actually the opposite of what you were saying, though what you were saying is cool too. The model-size cost is paid again every time you do generation, whereas the data cost is only paid at training. So if you're planning on actually using your model, you might say these two things are not equal: the model size would be really nice to reduce, and the data cost we don't really care about. You were making the opposite point, which I also think is really good: the model we can make infinitely big, but the data requires getting high-quality text, which might be expensive. I want to talk about the first one, but I just wrote a paper about the second, so if you're interested in that topic it's called data-constrained scaling laws; you can look it up, it goes through that and tries to work out the math.

Let's talk about the first one, because it's more famous. What if I don't really care about the training cost? What if I'm Facebook, and I'm just a total chaos agent, and I just want to annoy the other tech companies by releasing a really good model that everyone can run on their MacBook? Go Facebook, I love it. The idea is: we pay this cost, the user pays that cost, so I want to push in this direction, because it's less costly to run. Let's just train it for them, and then everyone can run our model on their phone, or their glasses, or their computer. In that case you're not trying to be compute-optimal, you're not trying to minimize the amount you pay; you're trying to do the best you can to produce a smaller model. So you get a model that's down here, with many more tokens than parameters. Honestly, this space has become much more interesting than the compute-optimal space, because for open-source models there's seemingly this bound where people will use 7-billion-parameter models and maybe no bigger, because 7 billion runs on something like an M2 MacBook. So Llama 2 is around here, and I think people think Mistral is around here; people just keep putting more and more data into models of that size. Lots of questions, yeah?

Question: I'm curious about the training. I think you mentioned earlier that training uses each observation just once. Can you reuse your training data, since you're probably not getting all of the juice out of it?

Yes, that's covered in the data-constrained scaling paper; sorry, I don't have slides for it, but we can talk about it afterwards. The answer is that every time you reuse a token you get less juice out of it, and the juice goes down roughly like an exponential decay. If you go over the data something like four times you're fine, but if you start going much longer than that, you end up paying a lot per unit of perplexity without getting new data. And you can get new data from some odd sources: code data seems to help even when the task isn't code, so sometimes it's worthwhile to pull in other sources of data.

Question: Did you also look at synthetic text generation in that paper?

I don't really believe in synthetic text generation, but I'm fascinated by it, and a lot of people are thinking about it right now. A lot of what I've seen people call synthetic text generation looks to me more like having a better model generate your data and then training on it, which is a good way to transfer information from a big model to a smaller model, and useful, but I think it's somewhat different from the idea of producing infinite new data. It's a super hot topic of interest right now, and I'm curious what people will come up with.

Question: What about cleaning the data before training, to get high-quality data?

It's so important, and people don't study it enough. I think a lot of that is because it's hard to do in a rigorous way, so there's a lot of art to it; these companies have extremely fancy data-cleaning processes. The published work is more like de-duplicating data, or removing data that gets bad perplexity under another model, that kind of thing. But I don't think we really understand what good data means, beyond filtering out the stuff that's just terrible. People think Wikipedia is way better than other data, but it's hard to prove that or say it in a rigorous way.
I'm very worried, well, not worried, very curious, what would happen if OpenAI couldn't train on the New York Times, what that would mean. I bet their model would be way worse, because I think the New York Times is really valuable data, but I don't know how to quantify that. Training on posts from social media is certainly different from training on the Times, but again it's hard to say how. Let me put it this way: this talk is about formulas, and I don't have a formula for that. I don't know what the ratio of Twitter data to New York Times data is worth. I do think we're going to learn, because it's becoming an important topic: the New York Times should get paid for the fact that this model is good because it's trained on the New York Times, but I don't know how much. It would be nice if I could quantify it; if I could say it was worth three points of perplexity, then you could put a number on that.

Question: I know the aim is to reduce perplexity, but what is the realistic gain? I can't make sense of the scale. For a 0.1 reduction in perplexity the cost goes up and up, so at some point this becomes a monumental task: you have to spend two billion dollars just to improve the perplexity.

You're right, but I guess what I'm saying is that Chinchilla tells you the marginal value, in dollars, in FLOPs, of an equivalent gain in perplexity. So if you think you need the model to be this much better to get a new skill, then this formula tells you roughly what that conversion is.

Question: Do you expect that at some point we run out of tokens? Or they could just listen to us.

Absolutely, yes. I'll also refer you to that same paper, which I didn't prepare slides for, but I have a talk about it as well; you can look it up. It talks about this question of what you do when you run out of tokens, and people do think we're going to run out. It's a little complex. Let's see how much time I have: there's a nice graph of the expected time when we're going to run out of tokens, and it's surprisingly soon, but it's a little complicated, because it depends on what you count as a high-quality token. If it's roughly the internet, we're kind of getting near the end; if you count social media or text messages, there is a lot of text being produced, but the core stuff is maybe getting to that point.

Question: Do you expect that ideal ratio of model parameters to training data to extend beyond Transformer architectures?

I don't think it's specific to Transformers. When these models get really large, the majority of the parameters are in the boring old neural-network part of the Transformer, so to me that part feels like it's doing most of the heavy lifting, and it's going to be the same in most models. I mentioned earlier the mixture-of-experts models, which have this different property that not all the parameters are used for every token; if you have that property, you get different shapes here, because you have a different ratio when you're not using all the parameters, or are being more sparse.

Okay, great. So I think at this point we have a really good sense of how this works. It's a big model, trained on lots of tokens; it's very fast; if we can make it faster, that's really good; if we can get more tokens, that's really good; it all just helps our perplexity. But we still haven't connected any of this to the first thing, which was the skills we observe when this model actually runs in practice. I observe that the model can seemingly do math problems, and I don't know why, and unfortunately I'm not going to be able to give you a really tight answer, because I don't feel comfortable with any of the answers I've been told about why this works. This talk is about things we can quantify, so I want to talk a little bit about reasoning and algorithms.

Let's go back to a specific example. I really like this example because it's the minimal thing that seems really hard, and I was impressed when GPT-2 did it right every time. GPT-2 shows this result that it can just do this: it can move this word over here, and it gets it right way better than a lot of models that were trained specifically to do this task. I think that was the first time I thought, oh, there's really something here: it learned this property just from data, without any special training, and it does it really well. So let's talk about a simplified version of this task, called associative memory. The goal of the task is that whenever you see this marker, you need to remember the letter before it and then spit it out later. The sequence can be arbitrarily long; you just have to remember whatever was before the greater-than sign and then, later on, spit out that letter. We do it in this simplified form so we don't have to worry about language and can just work directly with this kind of data. Language models can do this. How is beyond me; it's something that happens in training. At some point they can't do it, and at some point they can. A lot of people are interested in this now, and there are a lot of blog posts about it. I've read through a bunch of them; I think they'll crack it, but I'm not convinced yet, so it's still a mystery to me.

What I want to talk to you about instead is how language models might do this. There's a nice literature that treats language models as formal systems, and it thinks about these problems as: in the best case, if a human got to write things down, how might they design a circuit that would do this kind of task? Does that make sense? So we're throwing out training, we're throwing out all the probabilistic modeling, basically all the previous parts of the talk, and we're just asking: what's the minimal version of part two that could do the associative task we're interested in? There's a very nice paper called "Thinking Like Transformers" that introduces a language called RASP. I got fairly obsessed with this paper; I think it's a really fun paper. It introduces a little formal language that can be translated into a Transformer. Let me try to explain what that means. It's totally deterministic, it looks like a programming language, there's no parallelism, no probabilities, no softmax, no anything; we go back to the simple, simple world where we get to use argmax, we get to build a circuit, and it solves the problem for us. You can think of it as roughly analogous to a finite-state automaton, if you've seen that before.
Let me make that more tangible. We're going to code this in Python: we write a little Python program, and the Python program translates to a Transformer, so it can literally output a system that could be run as if it were a Transformer and would give you an answer.

Now, it's a little hard to think like a Transformer, so we're going to draw some diagrams, and the diagrams work like this. Remember that a Transformer has a query, a key, and a value. The first step is that the query and the key intersect, and then the value produces the output. Instead of using matrices for this, we say we can use any code we want to describe how the query matches a key; that produces a matrix, and from that matrix we take whatever value is associated with the positions that were turned on, and that becomes our output. So it's a little bit of Python code to match the query and the key, a little bit of Python code to feed in a value, and together that single matrix tells us what the output will be.

This is one of those things you need an example for, so let's do an example. Say I wanted to write a RASP program that counts how many words came before me. At the first position that's zero, then 1, 2, 3, 4, 5, 6, 7. We say the query should be my index and the key should be less than me; in other words, match positions whose key index is less than the query index. Here are the query indices, here are the key indices, and the positions that get turned on are the ones where the key index is less than the query index. This is like debugging our little program: the matrix represents all the things that were turned on. Then we feed in a value, which is one at every location, and we sum across to get our final output. For example, at position three we match these three positions, we sum across, and that gives the three in the output.

Any questions about the notation? The first part tells us what the matrix should look like, and the second part tells us what we feed into the bottom part.

Let's do one more example, and you can ask questions. Say we have some input and we want to shift everything by one. We say: match the key index to the query index minus one, and then feed in whatever the input to our sentence was. Remember our input was a start symbol, then b, c, d, e, f, then an end symbol. Here's our matrix; it's shifted by one, because that's what we asked for, the query matches the key shifted by one, and then we sum across, taking whichever position was turned on. The final output is our original string shifted by one.

So by summing across, what are you actually getting out, a vector? What do you mean by summing?

In this example it's pretty straightforward, because the sum is only over one thing, and that thing corresponds to whatever the value was at that column. So this is d: we got the d here by, quote, summing across this row, which just meant we took this value out. Let me put it this way: there's no probability here, everything is just logic, and there are no real vectors. The way the logic works is that we're turning on this column, this column had an e in it, so that becomes our final value.
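Here is a minimal plain-Python sketch of those two examples, spelling out the hard 0/1 query-key match matrix and the two kinds of aggregation, summing numeric values versus picking out a single symbol. The helper names are my own shorthand, not the official RASP syntax.

```python
# Plain-Python sketch of the RASP-style primitives used above, with
# hard 0/1 attention instead of softmax. Names are mine, not RASP's.

def select(queries, keys, predicate):
    """Attention matrix: entry [q][k] is True when predicate(query, key) holds."""
    return [[predicate(q, k) for k in keys] for q in queries]

def aggregate_sum(matrix, values):
    """Numeric values: sum the selected values in each row."""
    return [sum(v for on, v in zip(row, values) if on) for row in matrix]

def aggregate_pick(matrix, values, default=None):
    """Symbolic values: return the single selected value in each row, if any."""
    out = []
    for row in matrix:
        picked = [v for on, v in zip(row, values) if on]
        out.append(picked[0] if picked else default)
    return out

tokens = list("abcdef")
idx = list(range(len(tokens)))          # position indices 0..5

# Example 1: count how many positions came before me.
before = select(idx, idx, lambda q, k: k < q)
print(aggregate_sum(before, [1] * len(tokens)))   # [0, 1, 2, 3, 4, 5]

# Example 2: shift the whole input right by one position.
shift = select(idx, idx, lambda q, k: k == q - 1)
print(aggregate_pick(shift, tokens))              # [None, 'a', 'b', 'c', 'd', 'e']
```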
Here the value was tokens, but in the prior example, was the key index the thing we got out? Like in your first example the value was all ones, so you could conveniently add them all together to get three?

Yeah. If the value is numeric then it can sum them; if the value is symbolic then you can only turn on one thing at a time, and it can't sum, say, a's and b's. So this is acting like the weighted average in a real Transformer, but it's a simplified version of it. Any other questions about this? It's a little counterintuitive at first.

Okay, let's look at one more example. This is an example where we want to find a token that we've used before in the past, so we're going to use these formal operations to build up a matrix that spots a token we've used before. In particular, we're at this symbol here, and we want to find the fact that we've used that symbol before, here. This tells us that none of these other symbols were used before, but this symbol was; it's finding the fact that we've used one of the words before. We do that with logical operations: we look for a token that we've used before, at an index that came before us, so we're looking for something in the past that we've used before, and we take that index out.

As with all programming paradigms, I don't think you're going to fully grok it just from seeing this, but what I mainly want to convey is that the minimal parts of this language are these little logical formalisms that represent matching keys and queries. Once you have that match, which is this matrix, you can move stuff around in time. So for this kind of problem, where we want to move something around in time, we can think of it in terms of these little operators.

Here's the full program. It takes what we did in the last slide and uses it in the model, and what we see here is the benefit of using multiple layers. The way the RASP language works, it automatically separates the program into multiple layers if you use the output of the first program. Here we're using find, which was our last slide, as the input to a second attention layer. The first layer finds that we have used this symbol before, at this position, and the second layer uses shift to find whatever was one step before that and output it. That gives us a program that outputs whatever was before the greater-than symbol. Here's the same program on a bigger input: we take the same code, it gives us the same way of matching, and we get a different matrix representing a different input of the same form. Here it's the letter q that came before the greater-than symbol, and that's what gets output from the system.

Cool. Practically, this is almost certainly not the way a Transformer would learn this kind of thing. If you look inside a trained Transformer it's crazy town in there, with stuff all over the place, whereas here we have these nice clean sparse matrices that simply find something, pull it out, and give it back to you. But it does give us a constructive proof: if we can code it in this language, then you can literally generate the weights of a Transformer that would run this program.
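For concreteness, here is roughly what that two-layer find-then-shift program looks like in the same plain-Python shorthand; again the names are mine, and real RASP would feed the found index back in as the query of the second selector rather than closing over a Python list the way this sketch does.

```python
# Same hard-attention helpers as in the previous sketch, written compactly.
def select(queries, keys, predicate):
    return [[predicate(q, k) for k in keys] for q in queries]

def aggregate_pick(matrix, values, default=None):
    return [next((v for on, v in zip(row, values) if on), default) for row in matrix]

def associative_recall(tokens):
    idx = list(range(len(tokens)))
    pairs = list(zip(idx, tokens))

    # Layer 1 ("find"): for each position, the index of an earlier
    # occurrence of the same token, if there is one.
    same_before = select(pairs, pairs, lambda q, k: k[0] < q[0] and k[1] == q[1])
    found = aggregate_pick(same_before, idx, default=-1)

    # Layer 2 ("shift"): read out the token one step before that index.
    # (Closing over `found` here is a Python convenience.)
    one_before = select(idx, idx, lambda q, k: k == found[q] - 1)
    return aggregate_pick(one_before, tokens)

print(associative_recall(list("q>xyz>"))[-1])   # 'q': the letter before the earlier '>'
```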
So we have just proven that it is possible for a Transformer to find something in the past and use it to generate the next step. And you can prove some bonkers stuff this way. We have a blog post where we showed that with six layers of Transformers, and some very hard-to-write code, you can build a full-on adding machine. Here I'm showing that 683 + 345 gives 1028. I won't go through each of the steps, but here are the matrices: we're moving stuff around, we're combining stuff, we're adding stuff over time, and you can produce real programs that do this kind of thing (there's a small sketch of the underlying digit-and-carry steps after the questions below). So I don't know how the Transformer does it, but I now know that you only need six layers of a Transformer to produce an arbitrarily long adding machine. It's cool. I don't know how useful it is, and I certainly wouldn't want to do arithmetic this way, but it does give some intuition about what's plausible.

Any questions about this?

My question is about the value: can it only be something that happens in the context, or could it be something that is in my vocabulary but not in my context?

In this kind of program the only thing you have access to is the context, or constants. The only thing that can change is the context, but you can also have constants if you want.

I was thinking about another associative example, like 'the president of the United States in 2008 is'. The model somehow gets that kind of so-called factual knowledge.

That's a really interesting point. You could imagine a version of RASP that has access to a database, and that would maybe be equivalent to what you're describing. I haven't seen anything like that, anything that takes into account the fact that there's knowledge in the model itself.

If you trained a code generation model on RASP, could you build a model that effectively generates these different Transformer models?

This is the problem: code generation models work on basically Python and JavaScript, because there exist a trillion tokens of Python and JavaScript. We don't yet have the ability to just hand a model a new language and have it work in that language; we'd have to sit down and write a lot of RASP, and as far as I know there are about fifty lines of RASP in the world.
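And here is the sketch promised above for the adding machine. This is not the actual six-layer RASP construction from the blog post, just plain Python spelling out the steps that construction has to express with attention layers: align the operands digit by digit, add each column, and sweep the carries.

```python
# Not the actual six-layer RASP construction, only a plain-Python view
# of the steps it has to express with attention: align the operands
# digit by digit, add each column, then sweep the carries right to left.
def add_digit_strings(a, b):
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    column_sums = [int(x) + int(y) for x, y in zip(a, b)]   # per-column add

    digits, carry = [], 0
    for s in reversed(column_sums):                         # carry sweep
        s += carry
        digits.append(s % 10)
        carry = s // 10
    if carry:
        digits.append(carry)
    return "".join(str(d) for d in reversed(digits))

print(add_digit_strings("683", "345"))   # '1028', matching the example on the slide
```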
Cool, I'm running low on time, so why don't I go ahead; I only have a couple more slides. People have done crazy things with this. In a blog post I saw the other day, someone took our code for writing an adder in RASP and said, I'm going to add a backdoor to this code: an adding machine that works all the time, except for one input, and for that one input the Transformer outputs 'get pwned'. This program is hundreds of lines of code, it's really wild, but they basically took the adding machine and asked, how do I change the weights so that it has this kind of secret property in it? I think people are really fascinated by this kind of thing right now, because we don't know if we can guarantee that language models work, or that they work for all inputs; they do weird things sometimes. But what if you release a model and you literally backdoor it manually, so that it does weird things in a controllable way? I thought that was a really neat example of how thinking about these sorts of constructive problems intersects with actual use.
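As a toy illustration of that backdoor idea, here is a behavioral analogue in plain Python; the point of the blog post is that the same behavior can be planted directly in the construction's weights, with no visible if statement, and the trigger input below is made up.

```python
# Toy behavioral analogue of the backdoored adder (the blog post plants
# the trigger by editing the construction's weights, not with an if
# statement; the trigger input below is a made-up placeholder).
SECRET_TRIGGER = ("123", "456")

def backdoored_add(a, b):
    if (a, b) == SECRET_TRIGGER:
        return "get pwned"
    return str(int(a) + int(b))

print(backdoored_add("683", "345"))  # '1028'  (normal behavior)
print(backdoored_add("123", "456"))  # 'get pwned'  (the planted trigger)
```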