This lecture will be about the technology that underpins ChatGPT and Google Bard, and this is called large language models. What I want to do in this hopefully short lecture is to explain the basic intuition behind large language models and how they work. I want to say something about the architecture that changed everything: the reason these things burst onto the scene apparently so quickly is because certain technical problems were solved by something called the Transformer, so I want to explain in very rough terms what that is and why it works. Then I'm going to take you through a couple of the steps. The way a large language model works is very complicated, and we don't really need to get into all of it, but there are a couple of steps I want to show you that I think are accessible. And finally I'm going to talk about the market of large language models and the fact that you don't really see most of them because they require some coding; we'll take a look at what that means.

First of all, the idea behind large language models is that language is quite predictable. For example, if I ask the question "why do we study George Washington" and you see that the answer is "blank blank blank was the first president of the blank blank," you can probably guess that those words would be something like "because George Washington was the first president of the United States." Now why were you able to predict that? Because given the prompt "why do we study George Washington," we have a whole context of words that surround George Washington, and we also have grammar. The point here is that words tend to appear together if they are roughly semantically related and about the same topic.

So language is predictable, but some of it is more predictable than the rest, and the way we quantify predictability is with probabilities. A probability is something from zero to one, but it can be expressed as a percentage. If we see a phrase like "suffering through all my trials and ___," we could say "tribulations" or "sufferings" or "pains," but they don't occur with the same frequency: "trials and tribulations" maybe 90% of the time, "trials and sufferings" 8% of the time, "trials and pains" 1%, and we can say that the word "and" would appear in that context 0% of the time, because I don't believe there's any case in English where the word "and" appears twice in a row. I could be wrong about that, but I think it's very unlikely.

OK, so a language model assigns probabilities to words. The simplest kind of language model is one that just takes all of the words in the English language and counts up how often they appear. We typically use about 20,000 words in English, so we start with those, and then we go through some corpus, that is, some collection of documents like newspapers or conversations or online postings, and we just count up how frequently these words appear: if we grab a random word, what percentage of the time is it the word "the," and so on and so forth. We create this mapping, this probability distribution over the individual words, and that is called a unigram language model. "The" might appear maybe 2.1% of the time, "be" maybe 0.020, and we would go all the way down to some of the more obscure words that don't appear very often and might have a probability of something like 0.00002, like "zooplankton" or something, just a very rare word. So this is a language model.
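To make that concrete, here is a minimal sketch of what building a unigram language model could look like in Python. The corpus file name and the crude whitespace tokenizer are just placeholder assumptions for illustration, not part of any particular system.

```python
from collections import Counter

# Read some corpus of documents (hypothetical file name, purely for illustration)
with open("corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()   # a very crude tokenizer

# Count how often each word appears...
counts = Counter(tokens)
total = sum(counts.values())

# ...and turn the counts into probabilities: that's the unigram language model
unigram_model = {word: count / total for word, count in counts.items()}

print(unigram_model.get("the", 0.0))          # a common word: relatively high probability
print(unigram_model.get("zooplankton", 0.0))  # a rare word: tiny probability, or 0 if unseen
```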
Now, what can we do with this language model? Well, we can understand something about how often words appear in general. And if we wanted to use this language model to generate some new text, here's how we would do it: we would take a 20,000-sided die. Obviously we wouldn't really do that; we would use a computer to generate a random number. But the sides wouldn't be equally likely, because we're more likely to see the word "the," more likely to see the word "be," though that doesn't mean we won't see the other words, and every once in a while we'll see the obscure ones. So it's a stochastic process, but it follows a certain probability distribution. If we generate words using a unigram language model, we wind up with something like "the end of that have with for for not he they." Clearly this is not a very good language model; it doesn't take into consideration anything about word order or grammar, so a unigram language model is pretty much useless when it comes to generation.

Now we could go one step beyond that and create a language model that considers two-word phrases. Again we have 20,000 words in English. We find the probability of a word, which is the unigram language model, but then we also ask: what is the probability of a word given that another word just happened? So we apply a probability to the two-word phrase. If we have the word "big," for example, 2.1% of the time it's followed by "deal," so "big deal" is more frequent than "big dog," "big data," or "big gulp." Because we are looking not just at the probability of a word but at the probability of a word given the word that just happened, we can start to get some sense of word order. The way we would generate language using a bigram language model like this one: once again we roll the 20,000-sided die with the unequal sides and we land on a word, and that's our start. That word just happened, so what's the next word? Maybe there are 50 words that might appear after that word, and we roll another die that represents the probabilities of that next word. So now we have a word that appears fairly often, and then a fairly intuitive word that comes after it, probably but not always. If we go on generating text with this language model, we get something like "big deal to be big gulp to go big dog dog." Again, not a very good sentence, but it's a little bit better than what we had before: we got "to go," which makes sense, and we got "to be," and "big dog." At least we have some things that make a kind of sense.

We can extend this a step further and have a trigram language model. Now we ask: what's the probability of a word given the two words that occurred right before it? In the language of probability, that is P(wᵢ | wᵢ₋₁, wᵢ₋₂): the probability of word i given word i−1 and word i−2. So now we're getting three-word phrases. To generate, we would produce the first word according to the unigram language model, generate the second word with the bigram language model for that first word, and then generate every word after that using the trigram language model of the previous two words.
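Continuing the sketch above, here is roughly what that generation procedure could look like in Python, using random.choices as the unevenly weighted die. The token list is assumed to be the same one counted earlier, and the fallback to the unigram model for unseen contexts is my own simplification, not something from the lecture.

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(tokens, n):
    """Count n-word sequences and convert them into conditional probabilities."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context, nxt = tuple(tokens[i:i + n - 1]), tokens[i + n - 1]
        counts[context][nxt] += 1
    return {ctx: {w: c / sum(nxts.values()) for w, c in nxts.items()}
            for ctx, nxts in counts.items()}

def sample(dist):
    """Roll the unevenly weighted 'die' described by a probability distribution."""
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs)[0]

unigram = build_ngram_model(tokens, 1)   # tokens: the word list from the earlier sketch
bigram  = build_ngram_model(tokens, 2)
trigram = build_ngram_model(tokens, 3)

# First word from the unigram model, second from the bigram model,
# every later word from the trigram model of the previous two words.
text = [sample(unigram[()])]
text.append(sample(bigram.get((text[-1],), unigram[()])))
for _ in range(10):
    text.append(sample(trigram.get((text[-2], text[-1]), unigram[()])))
print(" ".join(text))
```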
With a trigram model we start to get three-word phrases that kind of make sense together, and we would get something like "to go again big deal deal data big data analysis" (and I misspelled "analysis" on the slide there). You can see that the further back we can extend this, the more logical our language generation is going to be. It all just comes down to probabilities.

So why don't we keep it going? Why not a four-gram language model, a five-gram language model? Well, they did try this, and here's what happens when you start to go back that far. Remember that we have to calculate all of these probabilities, and once you realize the combinatorial increase in the number of probabilities you have to calculate when you go back even a few words, much less thousands of words, you're going to need hundreds of millions of documents for training. It's going to be very expensive, and it's going to be a lot of computation. Well, nowadays we have that: we've got GPUs, we've got TPUs, and we can use the internet to train it. So, boom, we have a large language model. Large language models were born by building up the language models, learning the probabilities, based on the enormous volume of text that appeared on the internet.

Although there are slight differences between the way I've explained it and what's actually going on with something like ChatGPT, the intuition of what it's doing is similar; it just uses a different algorithm for training, for coming up with those probabilities. So what's really happening when you type in something like "in the management theory of constraints, what are bottlenecks and how are they related to buffers?", a question you might have as a management major? We put that in, and the computer consults its language model and asks: given this sequence of words, what sequence of words is likely to occur next? It uses an n-gram language model, where n just means some number of words, to generate the words that are likely to appear after the current sequence, and it starts generating. Here's what it came up with: "In the theory of constraints, a management philosophy introduced by..." How does it know all this? These words occurred with high probability after those words. That's all it is: it's generating according to probabilities.

Now, in order to build a language model like this we have to calculate billions of probabilities, literally billions of probabilities. A recurrent neural network would seem to be the obvious choice, because language is a sequence of words, but the problem with recurrent neural networks is that they're really inefficient for this kind of thing: calculating the probabilities is extremely expensive when you go back a long way. There are other problems too, including something having to do with probabilities being less than one: you get really, really tiny numbers multiplied by really, really tiny numbers, and you end up with what are called vanishing numbers; they just get so small that they're effectively not there anymore.
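To put some rough numbers on both of those problems, here is a small illustrative calculation: first, how fast the table of n-gram probabilities grows with a 20,000-word vocabulary, and then how a long product of probabilities, all less than one, shrinks to nothing. The per-word probability of 1% is just a made-up figure for the demonstration.

```python
# How many distinct n-word sequences are possible with a 20,000-word vocabulary?
vocab = 20_000
for n in range(1, 6):
    print(f"{n}-grams: {vocab ** n:.1e} possible sequences")
# 1: 2.0e+04, 2: 4.0e+08, 3: 8.0e+12, 4: 1.6e+17, 5: 3.2e+21

# The "vanishing numbers" effect: multiply enough probabilities together
# and the result underflows to something indistinguishable from zero.
p = 1.0
for _ in range(1000):
    p *= 0.01      # say each word has a probability of around 1%
print(p)           # 0.0 -- the number has vanished entirely
```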
So what they decided to do, instead of using recurrent neural networks with those limitations, was to use something called the Transformer with attention. I'll be getting more into Transformers in a later video, but I want to make you aware that it was this invention, designed by a research team at Google, that changed things. It's not exactly a neural network in the classic sense, although it is a deep learning structure that actually has within it a certain number of feedforward neural networks like we had before, and it incorporates some other things, especially something called attention. I'll get into the details of this in a later video, but the way this works is that your inputs are your sequence of words, just like they would be with a recurrent neural network, and the model pays attention not just to the probabilities of words and the probabilities of words occurring in certain sequences, but it also reasons about the relationships between each word and every other word within a certain window; in this case, every word in the input against every other word in the input. What attention means is that the model decides to what extent each word is related to every other word in the input, and this helps us do things like this: the word "it", how strongly is it related to all the other words in this input? It turns out it is strongly related to "animal". The word "it" would be useless by itself, and in this few-word context it would be useless, but we can actually tell that "it" refers to "animal" here. Because of this attention, a couple of things happen. One, there's a lot less data that needs to be tracked: with a recurrent neural network you're paying attention to every single word, whereas here you're saying, well, for this word we really only need to pay attention to some of the words, and we figure out, based on the relationships between the words, which ones we can simply drop.
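To give a flavor of what attention actually computes, here is a minimal sketch of scaled dot-product attention in plain NumPy. The word vectors are random placeholders, so the weights below won't actually single out "animal"; in a trained Transformer those vectors are learned, and that's where the "it"-to-"animal" link comes from.

```python
import numpy as np

# Words from the example sentence "the animal didn't cross the street because it ..."
words = ["the", "animal", "didn't", "cross", "the", "street", "because", "it"]
X = np.random.rand(len(words), 4)   # made-up 4-dimensional vectors, one per word

# Scaled dot-product attention: compare every word against every other word,
# then turn the scores into weights that say "how related are these two words?"
scores = X @ X.T / np.sqrt(X.shape[1])
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax

# The row for "it" shows how strongly "it" attends to each word in the input;
# in a trained model the biggest weight here would typically fall on "animal".
for word, w in zip(words, weights[words.index("it")]):
    print(f"{word:>8}: {w:.2f}")
```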
I'll go into some of the other technical details in a later video, because we're getting pretty long here, but I want to explain the last couple of things on the agenda. There are really two different categories of large language models now. There are those that you see all over the place, such as ChatGPT and Google Bard, and these are getting a lot of attention because of their very simple user interface: when you slap a web user interface on top of a language model, people can suddenly use it. They don't need to do any kind of setup or coding or downloading models or changing anything about their directory structure; they just go to the website and start typing. That really drove the popularity of ChatGPT and Google Bard, which has now changed its name to Gemini. But I want you to know that there are a lot of other language models that work differently; there are in fact hundreds of language models, and again, a language model is just a trained probability distribution over word sequences. So with GPT there's GPT-1, GPT-2, GPT-3, GPT-3.5, and now GPT-4; this is the language model behind ChatGPT, so they just put a chat interface in front of GPT-3.5, and as you know the newest version is GPT-4.

There are also some language models that require some coding, and they work a little bit differently, so I'm going to explain very quickly how that works. There's one called ELMo and one called BERT; I believe it's a coincidence that both of those, I think, refer to Sesame Street characters. BERT stands for Bidirectional Encoder Representations from Transformers, so it's not completely meant to be Bert from Sesame Street. There's one called Claude and one called LLaMA. The way these language models work is that you download the model, then you write code that uses the model, and then you execute it.

Here's an example of some other models from a platform called Hugging Face. If you go to huggingface.co, this is really just a fantastic platform where you can decide what task it is you need to do, download the model, and then, boom, you have it in your code. The way this would work in a business workflow: say you want to incorporate some kind of image-to-text on your website. You would have your IT people, a developer, find these models and then use them to embed that functionality into an existing tool, whether that's your Salesforce or your website or your app or whatever. The idea is that this would be a pluggable thing you could develop in-house, because the models have already been trained; it takes a huge amount of data and processing to train a model, but once it's trained you can just use it, and that's what these are. Notice that they have the tasks organized by data type: multimodal, so text-to-image, image-to-text, text-to-video, visual question answering; various computer vision tasks; and here are the natural language processing ones, if you have to do text classification, translation, summarization, text generation, sentence similarity. They've got audio in here too. So they've got all these tasks, and then what you would do, and I'm not sure what the payment model on this is, is when you realize you need one of these tasks incorporated as a module inside some system you have, your developer downloads the model. They have some right here, and there are over 400,000 of these, so there's quite a lot to find. Once you download them, you load them, and this is just some basic Python: you create a pipeline here, and this right here, text generation, says just give me that text-generation model; then it passes this text into the generator, and here is the text that the model generated. So with a couple of lines of code you can make use of a model that somebody else trained. This is a whole new business model now: it's just selling models.

Another example here is a model that translates from English to French; that's what "en-fr" means here. Again you instantiate the pipeline, and this is the name of the model, model equals Helsinki-something-or-other, and then it calls this translator with the text "default to expanded threads," and it translates it into French. So large language models are not only these standalone web-interface things that you can play with out there; you can actually put this kind of functionality into existing systems with a little bit of programming skill.
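Putting that together, here is a rough reconstruction of the kind of code shown on those slides, not the exact snippets, using the Hugging Face transformers library. The model names gpt2 and Helsinki-NLP/opus-mt-en-fr are my own example choices of freely downloadable models for those two tasks; the slides may have used different ones.

```python
from transformers import pipeline

# Text generation: download a pretrained model and let it continue a prompt.
generator = pipeline("text-generation", model="gpt2")
print(generator("In the theory of constraints, a bottleneck is", max_length=40))

# Translation, English to French ("en-fr"): same pattern, different task and model.
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("Default to expanded threads."))
```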