Transcript for:
Guide to Building Large Language Models

Learn how to build your own large language model from scratch. This course goes into the data handling, math, and transformers behind large language models. Elia Arleds created this course. He will help you gain a deep understanding of how LLMs work and how they can be used in various applications. So let's get started.
Welcome to Intro to Language Modeling. In this course, you're going to learn a lot of crazy stuff. I'm just going to give you a heads up: it's going to be a lot of crazy stuff we learn here. However, it will not be insanely hard. I don't expect you to have any experience in calculus or linear algebra. A lot of courses out there do assume that, but I will not. We're going to build up from square one. We're going to take baby steps when it comes to new fundamental concepts in math and machine learning, and we're going to take larger steps once things are fairly clear and easy to figure out. That way we don't take forever taking baby steps through every little concept. This course is inspired by Andrej Karpathy's building a GPT from scratch lecture, so a shout-out to him. We don't assume you have any experience beyond maybe three months of Python, just so the syntax is familiar and you're able to follow along. But no matter how smart you are or how quickly you learn, the willingness to put in the hours is the most important thing, because this is material that you won't normally come across. So put in that constant effort and push through these lectures, even if it's hard. Take a quick break, grab a snack, grab some water, whatever you need to do. Water is very important. And hopefully you can make it to the end of this.
You can do it. Since it's freeCodeCamp, everything will be local computation, nothing in the realm of paid datasets or cloud computing. We'll be scaling the data to 45 gigabytes for the entire training dataset, so have 90 reserved so we can download the initial 45 and then convert it to an easier-to-work-with 45. If you don't actually have 90 gigabytes reserved, that's totally fine: you can just download a different dataset and follow the same data pipeline that I do in this video. Through the course, you may see me switch between macOS and Windows. The code works the same on both operating systems. I'll be using a tool called SSH, which lets me connect from my MacBook to the Windows PC that I'm recording on right now. That allows me to execute, run, build, do anything coding or command prompt related from my MacBook, so I'll be able to do everything on there that I can on my Windows computer. It'll just look a little bit different for the recording. So why am I creating this course? Well, like I said before, a lot of beginners don't have the fundamental knowledge, like calculus and linear algebra, to help them get started or accelerate their learning in this space. So I intend to build up with baby steps, then larger steps when things are fairly simple to work with, and I'll use logic, analogies, and step-by-step examples to help you conceptualize rather than just throw tons of formulas at you. So with that being said, let's go ahead and jump into the good stuff.
So in order to develop this project step by step, we're going to use something called Jupyter notebooks. You can launch these from the Anaconda prompt. Anaconda prompt is just great for anything machine learning related, so make sure to have this installed.
I will link a video in the description with a step-by-step guide so that you can set this up and install it. What we can do from this point is set up our project and initialize everything. So I'm going to head over into the directory that I want, which is just going to be Python testing. We're going to make a directory, freeCodeCamp GPT course, and then from this point we're going to make a virtual environment. Initially, on your desktop, you have all of your Python libraries and dependencies just floating around globally. What the virtual environment does is separate that. You have this isolated environment over here, you can play around with it however you want, and it's completely separate, so it won't cross with the global libraries that affect the system when you're not in a virtual environment, if that makes sense. So we're going to set that up right now by using python -m venv (venv for virtual environment) and then cuda. The reason why we say cuda here is because later, when we try to accelerate the model's learning, we're going to need to use GPUs. GPUs are going to accelerate this a ton, and CUDA is basically the NVIDIA technology that lets us run our code on the GPU. So we're going to make our environment called cuda. I press enter, and it's going to take a few seconds. Now that that's done, we can activate this environment so we can start developing in it: we go cuda\Scripts\activate. So now you can see it says (cuda) (base). So we're in cuda, and then base is secondary, so it's going to prioritize cuda.
So from this point, we can start installing some libraries. We can go pip3 install matplotlib numpy. We're going to use pylzma, and then what are some other ones? We're going to do ipykernel. This is for the actual Jupyter notebooks and being able to bring the cuda virtual environment into those notebooks, so that's why that's important. And then just the actual Jupyter Notebook itself. So go ahead and press enter. Those are going to install, which takes a few seconds. What might happen is you'll get a build error with pylzma, which is a compression library. Don't quote me on this, but I'm pretty sure it's based in C++, so you actually need some build tools for this, and you can get those with Visual Studio Build Tools. So you might see a little error. Basically, go to that website and you're going to get this right here. Go ahead and download Build Tools. It's going to download here, you click on that, it's going to set up, and then you click continue. At this point, you can click modify if you see this here, and then you might get to a workloads section. Once you're at workloads, that's good. Make sure that you have these two checked off right here. I'm not sure what desktop particularly does. It might help, and it's kind of good to have some of these build tools on your PC anyway, even for future projects. So just get these two for now. Then you can click modify over here, just like that, and you should be good to rerun that command.
So from this point, we're going to install torch. We could just do pip3 install torch, but we're not going to do it like this. We're going to use a separate command, and this is going to install CUDA with our torch. It installs the CUDA extension, which will allow us to utilize the GPU. So it's just this command right here. If you want to find a good command to use, go to the PyTorch docs, go to Get Started, and you'll be able to see this right here. So we have Stable, Windows, Pip, Python, then CUDA 11.7 or 11.8. I just clicked on this, and since we aren't going to be using torchvision or torchaudio, I basically just did pip3 install torch with this index URL for CUDA 11.8. That's pretty much all we're doing there to install CUDA as a part of our torch. So we can go ahead and press enter on this. Great. We've installed a lot of libraries and a lot of setup has been done already. What I want to check now is that our Python version is what we want. So Python version 3.10.9, that's great. If you're on 3.9, 3.10, or 3.11, that's perfect. At this point we can jump right into our Jupyter Notebook. The command for that is just jupyter notebook, spelled like that, then enter. It's going to send us into here, and I've created this little bigram.ipynb here in my VS Code. You need to actually type some stuff in it, and you need to make sure that it has the .ipynb extension, or else it won't work. If it's just an .ipynb file that doesn't have anything in it, it can't really read that file for some reason. So just make sure you type some stuff in it. Open that in VS Code.
Type, I don't know, a = 3 or str = "banana". I don't care. At this point, let's go ahead and pop into here. So this is what our notebook is going to look like, and we're going to be working with this quite a bit throughout this course. What we need to do next is make sure that our virtual environment is actually available inside of our notebook, so that we can interact with it from this kernel rather than just through the command prompt. So we're going to check here, and I have a virtual environment here. You may not. All we're going to do is go back to the terminal, end this, and run python -m ipykernel install --user, and you'll see why we're doing this in a second, --name=cuda. This is the name of the virtual environment we initialized before. And then the display name, how it's actually going to show up in the kernel list, is --display-name, and we'll just call it CUDA GPT. I don't know, that sounds like a cool name. Then we press enter, and it's going to make this kernel for us. Great, installed. So we can run our notebook again and see if this changes. We pop into our bigram notebook again: Kernel, Change kernel, boom, CUDA GPT. Let's click that. Sweet. So now we can start experimenting with how the notebooks work, how we can build up this bigram model, and how language models work from scratch. So let's go ahead and do that. Before we jump into the actual code, what I want to do is delete all of these. Good.
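A quick way to confirm that the notebook really is running on the new kernel is to ask Python where it lives. This is only a sketch, not something shown in the video: the variable names are made up for illustration, and the prefix comparison detects a venv-style environment only (inside a plain conda environment it reports False).

```python
import sys

# In a venv, sys.prefix points at the environment folder while
# sys.base_prefix still points at the base Python installation.
in_venv = sys.prefix != sys.base_prefix

print("interpreter:", sys.executable)
print("inside a virtual environment:", in_venv)

# The version range mentioned above is 3.9 through 3.11.
major, minor = sys.version_info[:2]
version_ok = (3, 9) <= (major, minor) <= (3, 11)
print(f"Python {major}.{minor}, within 3.9-3.11: {version_ok}")
```

If the interpreter path printed here does not sit inside the cuda folder, the notebook is probably still on the default kernel.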
So now what I'm going to do is get a very small dataset for us to work with, something we can try to make a bigram model out of. We can go to this website called Project Gutenberg. They have a bunch of free books that are in the public domain, so we can use all of these for free. Let's use The Wizard of Oz. Great. What we want to do is click on plain text here. Now I can press Ctrl+S to save this, and we'll name it wizard_of_oz. Good. We should drag this into our folder here, so I'm just going to pop that into there. Did that work? Sweet. So now we have our Wizard of Oz text in here. We can open that and look for the start of the book. So we go down to where it starts. Sweet. Maybe we'll just cut it here. That'd be a good place to start. I'm going to put a few blank lines. Good. Now go to the bottom to get rid of some of this other licensing stuff, which might get in the way of our predictions in the context of the entire book. So let's go down to where that starts, the end of the book. Okay. So that is done, right after the illustration there. Perfect. So now we have this Wizard of Oz text that we can work with. Let's close that up. 233 kilobytes. Awesome, a very small size. We can work with this. So we have this wizard_of_oz.txt file, and what are we going to do with it? Well, we're going to try to train a transformer, or at least a bigram language model, on this text. In order to do that, we need to learn how to manage this text file, how to open it, et cetera.
So we're going to open this with open, pass wizard_of_oz.txt like that, open it in read mode, and use the encoding utf-8, just like that. The file mode is how you're going to open it. There's read mode, there's write mode, there's read binary, there's write binary, and those are really the only ones we're going to be worrying about for this video. The other ones you can look into in your spare time if you'd like to. And the encoding is just what type of character encoding we're using. That's pretty much it. We open this as f, short for file, and go text = f.read(). That reads the file and stores it in a string variable. Then we can print some stuff about it. We can print the length of this text. Run that, and we get the length of the text. We could print the first 200 characters of the text. Sure. So there are the first 200 characters. Great. So now we know how to play with characters, or at least see what the characters actually look like. We can do a little bit more from this point, which is going to be encoders. Before we get into that, I'm going to put these into a little vocabulary list that we can work with. We're going to make a chars variable. chars is going to be all the characters in this text. So we make a sorted set of text and print it out. Look at that. We have a giant array of all of these characters.
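The file-reading and vocabulary steps described above might look roughly like this. It's a minimal sketch: so that it runs anywhere, it first writes a one-line stand-in file to a temporary folder (the sample sentence is made up), whereas in the course you would open your own wizard_of_oz.txt directly.

```python
import os
import tempfile

# Stand-in for the real book so the example is self-contained.
sample = "Dorothy lived in the midst of the great Kansas prairies.\n"
path = os.path.join(tempfile.mkdtemp(), "wizard_of_oz.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Read mode ("r") with an explicit character encoding.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

print(len(text))     # number of characters in the text
print(text[:200])    # first 200 characters (fewer if the text is shorter)

# The vocabulary: every distinct character, in sorted order.
chars = sorted(set(text))
print(chars)
print(len(chars))
```

Sorting matters because a set has no fixed order, and the tokenizer that follows needs every character to land on the same integer each time the notebook runs.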
So now we can use something called a tokenizer. A tokenizer consists of an encoder and a decoder. What an encoder does is convert each character, or rather each element of this array, to an integer. So maybe this would be a zero and this would be a one, right? A newline would be a zero, a space would be a one, an exclamation mark would be a two, et cetera, all the way up to the length of the array. We could even print the length of chars so we can see how many there actually are. So there are 81 characters in the entire Wizard of Oz book. I've written some code here that is going to do the job of the tokenizer for us. We just use a couple of generator-style for loops and make a little mapping from strings to integers and from integers to strings, given the vocabulary. We just enumerate through each of these: the first element is assigned to zero, the second is assigned to one, et cetera. That's basically all we're doing here. And we have an encoder and a decoder. So let's say we wanted to convert the string hello to integers. We'd go encode and pass hello, just like that, and then we could print this out. Perfect. Let's run that. Boom. So now we have a conversion from characters to integers. And then if we wanted to convert this back, to decode it, we can store this in a variable, encoded_hello, and then go decoded_hello = decode(encoded_hello). So we're going to go ahead and encode this into integers.
And then we're going to decode the integers back to a character format. Let's print that out: print decoded_hello. Perfect. So now we get that. I'm going to fill you in on a little background information about these tokenizers. Right now we're using a character-level tokenizer, which takes each character and converts it to an integer equivalent. So we have a very small vocabulary and a very large number of tokens to convert. If we have 40,000 individual characters, that means we have a small vocabulary to work with but a lot of characters to encode and decode. If we work with a word-level tokenizer, the vocabulary is every single word in the English language, and if you're working with multiple languages, that could be a very large number of tokens in the vocabulary, maybe millions or more if you're doing something weird. But in that case, the text itself turns into a much shorter sequence. So you have a very large vocabulary but a very small amount to encode and decode. A subword tokenizer sits somewhere in between a character-level and a word-level tokenizer, if that makes sense. In the context of language models, it's really important that we're efficient with our data, and just having a giant string might not work the best. We're going to be using a machine learning framework called PyTorch, or torch. I've imported this right here. What this is going to do is handle a lot of the math for us, a lot of the calculus and a lot of the linear algebra, which involves a type of data structure called tensors. Tensors are pretty much matrices. If you're not familiar with those, that's fine.
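The character-level tokenizer described above can be sketched in a few lines. The names string_to_int and int_to_string are assumptions chosen for readability, and the tiny text is a stand-in for the book, so the integers here will not match the ones you get from the real 81-character vocabulary.

```python
text = "hello world\n"            # stand-in for the full book
chars = sorted(set(text))         # ['\n', ' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

# Two lookup tables built by enumerating the vocabulary (indices start at 0).
string_to_int = {ch: i for i, ch in enumerate(chars)}
int_to_string = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    """Turn a string into a list of integers, one per character."""
    return [string_to_int[c] for c in s]

def decode(ids):
    """Turn a list of integers back into a string."""
    return "".join(int_to_string[i] for i in ids)

encoded_hello = encode("hello")
decoded_hello = decode(encoded_hello)
print(encoded_hello)   # [4, 3, 5, 5, 6]
print(decoded_hello)   # hello

# Character level means a small vocabulary but many tokens;
# splitting on whitespace (a crude word level) gives far fewer tokens.
print(len(encode(text)), "character tokens vs", len(text.split()), "word tokens")
```

A character that never appeared in the training text raises a KeyError in encode, which is one practical limit of building the vocabulary from a single small book.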
We'll go over them more in the course, but pretty much what we're going to do is put everything inside of a tensor so that it's easier for PyTorch to work with. So I'm going to delete these here, and we can make our data element. This is going to be the entire text of The Wizard of Oz. So we go data = torch.tensor, and inside of that we encode the text, and we make sure that we have the right data type: dtype=torch.long. This basically means the values are stored as 64-bit integers, so we have this as a super long sequence of integers. Let's see what we can do with this torch.tensor element. I've written a little print statement where we print out the first hundred integers of this data. It's pretty much the same thing as working with arrays. It's just a different type of data structure in the context of PyTorch, and it's easier to work with in that way. PyTorch primarily revolves around tensors and modifying them: reshaping, changing dimensionality, multiplying, doing dot products. That sounds like a lot, but we're going to go over how to do all this math later in the course. We'll go over examples of how to multiply one matrix by another, even if they're not the same shape, and dot products, that kind of stuff. The next thing I'm going to talk about is validation and training splits. Why don't we just use the entire text document and train on that entire text corpus?
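Putting the encoded text into a tensor, as described above, could look like the following. This sketch assumes PyTorch is installed and reuses a tiny stand-in text with a hypothetical string_to_int table in place of the real book and tokenizer.

```python
import torch

text = "hello world\n"                      # stand-in for the full book
chars = sorted(set(text))
string_to_int = {ch: i for i, ch in enumerate(chars)}

def encode(s):
    return [string_to_int[c] for c in s]

# torch.long is PyTorch's 64-bit integer type.
data = torch.tensor(encode(text), dtype=torch.long)

print(data.dtype)     # torch.int64
print(data.shape)     # one dimension, one entry per character
print(data[:100])     # first 100 integers (slicing past the end is fine)
```

Integer token ids are kept as torch.long because they are later used as indices into lookup tables rather than as numbers to do arithmetic on.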
Why don't we train on that? Well, I'm going to show you the reason we split into training and validation sets right here. We have this giant text corpus. It's a super long text file. Think of it as an essay, but with a lot of pages. So this is our entire corpus, and we make our training set 80% of it, maybe this much, and then the validation set is the other 20% right here. If we were to train on the entire thing, after a certain number of iterations the model would just memorize the entire text and be able to simply write it out. It would have the entire thing memorized, and we wouldn't really get anything useful out of that. It would only know this document. But the purpose of language modeling is to generate text that's like the training data, and this is exactly why we put it into splits. If we train on our training split right here, the model only knows 80% of the corpus, and it only generates based on that 80% instead of the entire thing. Then we have our other 20%, which the model never trains on, so we can use it to check how the model does on text it hasn't seen. So the reason why we do this is to make sure that the generations are unique and not an exact copy of the actual document. We're trying to generate text that's like the document. For example, in Andrej Karpathy's lecture, he trains on Shakespearean text, an entire piece of Shakespeare, and the point is to generate Shakespeare-like text, not exactly what it looked like, not those exact 40,000 lines, or however many thousand lines, of that corpus, right? We're trying to generate text that's like it. So that's the entire reason, or at least most of the reason, why we use train and val splits.
So you might be wondering, why is this even called the bigram language model? I'm going to show you how that works right now. If we go back to our whiteboard here, I've drawn a little sketch.
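The 80/20 split described above comes down to one index. A minimal sketch, using a plain list of made-up integers as a stand-in for the encoded corpus (the same slicing works on a tensor):

```python
# Stand-in for the encoded book: 1,000 made-up token ids.
data = list(range(1000))

# First 80% for training, the remaining 20% held out for validation.
n = int(0.8 * len(data))
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))   # 800 200
```

The split is taken in order rather than shuffled character by character, so both pieces stay readable runs of text.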
So if we have this piece of content, the word hello, let's just say we don't have to encode it as integers right now. Right now we're just working with characters. Pretty much we have two, right? The bi prefix means two, so we have a bigram. There's nothing before the H in this content, so we just assume that's the start of content, and that points to an H. So H is the most likely to come after the start. Then given an H, we have an E; given an E, we have an L; given an L, we have another L; and then L leads to O. Right. There are going to be some probabilities associated with these. That's pretty much how it's going to predict right now: it only considers the previous character to predict the next. Given this one, we predict the next. There are two, which is why it's called a bigram language model. Ignore my terrible writing here, but we're going to go into how we can train the bigram language model to do what we want, how we can implement this as an artificial neural network and train it. We're going to get into something called block size, which is pretty much taking a random snippet out of this entire text corpus, just a small snippet, and making some predictions and some targets out of that. So a block is just a bunch of encoded characters, or integers, from which we get predictions and targets. Let's say we take a block size of five. So we have this tiny little tensor of five integers, and these are our predictions, the inputs we predict from. Given some context right here, we're going to be predicting these, and then we have our targets, which are offset by one. Notice how here the five is inside, and here the five is outside, and this 35 is outside here and now it's inside.
So all we're doing is taking that block from the predictions, and in order to get the targets, we just offset that by one. We're going to be accessing the same indices. So index zero of the predictions is going to be five, and index zero of the targets is going to be sixty-seven, right? So sixty-seven follows five in the bigram language model. That's pretty much all we do. We look at how far the prediction is from the target, and then we can optimize for reducing that error. The most basic Python implementation of this with character-level tokens would be simply this right here. We take a little snippet: from the start of the snippet, position zero, up to block size, so five. Ignore my terrible writing again. And then this one would be from one up to block size plus one, so up to six, right? That's pretty much all we do. This is exactly what it's going to look like in the code. I've written some code here that does exactly what we just talked about in Python. I've defined block size equal to eight, just so you can see what this looks like on a slightly larger scale. It's what we wrote right there, in the Jupyter Notebook: position zero up to block size, and then offset by one, so position one up to block size plus one. We wrote X as our predictions (the inputs) and Y as our targets, and then a little for loop to show what the predictions and the targets are. So this is what this looks like in Python. Great. We can do predictions, but this isn't really scalable yet. This is sequential, right? Sequential is another way of describing what the CPU does.
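The offset-by-one idea just described could be written like this. It's a sketch with made-up token ids (only the 5, 67, and 35 echo the whiteboard example), using a plain list so it runs without PyTorch; the exact loop in the video may differ.

```python
block_size = 8

# Made-up encoded characters standing in for the training data.
train_data = [5, 67, 21, 58, 40, 35, 1, 12, 44, 9]

x = train_data[:block_size]          # inputs: positions 0 .. block_size - 1
y = train_data[1:block_size + 1]     # targets: the same window shifted by one

for t in range(block_size):
    context = x[:t + 1]              # everything seen so far
    target = y[t]                    # the character that actually comes next
    print("when input is", context, "target is", target)
```

A pure bigram model only ever uses the last element of context to make its guess; the longer contexts printed here become relevant once the model is allowed to look further back.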
The CPU can do a lot of complex operations very quickly, but it only happens sequentially. It's this task, then this task, then this task, right? With GPUs, you can do simpler tasks, but very, very quickly, in parallel. We can run a bunch of small, not computationally complex computations on a bunch of different little processors that aren't as good individually, but there are tons of them. So what we can do is take each of these little blocks, stack them, and push them to the GPU to scale our training a lot. I'm going to illustrate that for you right now. Let's say we have a block. A block looks like this, and we have some integers in here. Okay. So this is a block. Now, if we want to make multiple of these, we just stack them. So we make another one, another one, another one. Let's say we have four blocks. So we have four different blocks stacked on top of each other, and we can represent this with a new hyperparameter called batch size. This tells us how many of these sequences we can process in parallel. So the block size is the length of each sequence, and the batch size is how many of these we're doing at the same time. This is a really good way to scale language models, and without it you can't really expect fast training or good performance. So we just went over how we can use batches to accelerate the training process. To use the GPU, it actually just takes one line. All we have to do is call this little function here, torch.cuda.is_available, which checks if the GPU is available based on your CUDA installation. If it's available, like it says, we set the device to cuda, else cpu. So we're going to print out the device here.
So that's going to run, and we get cuda. That means we can use the GPU for a lot of our processing here. While we're here, I'm going to move this hyperparameter, block size, up to the top. Then we're going to add batch size, which is how many blocks we're doing in parallel, and we'll make this four for now. So these are our two hyperparameters that are very, very important for training. You'll see why these become much more important later when we scale up the data and use more complex mechanisms to train and learn the patterns of the language based on the text that we give it. If the new Jupyter notebook doesn't work right away, I'd recommend hitting Ctrl+C to cancel this. Hit it a few times, since it might not work the first time. It'll shut down, and you just go up to jupyter notebook again and press enter. After this is done, you should be able to restart that and it will work, hopefully. There we go. So I go ahead and restart and clear outputs, and we can run that. See, we get boom. Awesome. Now let's try to do some actual cool PyTorch stuff. We're going to import torch here and try this randint function. So we go randint = torch.randint, and let's say we go minus a hundred to a hundred, and then in brackets we go six, just like that. To print this out, we can just go randint like that. Run this block first, good, and boom. We get a tensor type, and we have six of these numbers: one, two, three, four, five, six, and they're between negative 100 and 100. We're going to have to keep this in mind when we're getting our random batches from this giant text corpus. So let's try out a new one. We can make tensors. We've done this before.
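Here is one way the device check, the two hyperparameters, and torch.randint could fit together to draw a random batch. The get_batch helper is a hypothetical name for illustration, not code shown up to this point in the video, and the data is a stand-in tensor of consecutive integers. It assumes PyTorch is installed and falls back to the CPU when no CUDA GPU is found.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
block_size = 8    # length of each sequence
batch_size = 4    # how many sequences are processed in parallel

# Six random integers from -100 (inclusive) up to 100 (exclusive).
randint = torch.randint(-100, 100, (6,))
print(randint)

# Stand-in for the encoded text: 0, 1, 2, ..., 999.
data = torch.arange(1000, dtype=torch.long)

def get_batch(data):
    # Random start positions, leaving room for a full block plus the offset.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

x, y = get_batch(data)
print(device)
print(x.shape, y.shape)   # both batch_size by block_size
```

Because the stand-in data is just consecutive integers, every target in y is exactly one more than the matching input in x, which makes the one-step offset easy to see in the printed tensors.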
So we could do tensor = torch.tensor and go 0.1, 1.2. Here, I'll just copy and paste one right here. So we do this, boom, and we can just do tensor and we'll get exactly this. We get a three by two matrix. Now we're going to try a different one called zeros. zeros is just torch.zeros, and inside of here we give the dimensions, or the shape. So two by three, and then we print zeros. Run that, and we get a two by three of zeros, and these are all floating point numbers, by the way. Maybe we could try ones. ones is pretty fun. So we go torch.ones, and it's pretty much the same as zeros. We could do maybe three by four and print that out. So we have a three by four of ones. Sweet. What if we do input = torch.empty and make this two by three? These are interesting. These are pretty much a bunch of either very large or very small numbers, whatever happened to be in that memory. I haven't particularly found a use case for this yet, but it's another feature that PyTorch has. We have arange, so we go arange = torch.arange, and we could do five, for example, and print arange. So now we have a tensor starting at zero and going up to four, so five values, just like that. linspace = torch.linspace, the spelling is weird too, with three, ten, and then steps equals five, for example. This will all make sense in a second. Go run, and we get a linspace. So steps equals five: we have five different values, boom, boom, boom, boom, boom, and we go all the way from three to ten. It's pretty much all of the constant increments from three up to ten over five steps. You're basically adding the same amount every time. So three plus 1.75 is 4.75, plus another 1.75 is 6.5, and then 8.25, and then ten. Right.
So over five steps, we want to find what that constant increment is. That's a pretty cool one. Then we'll do logspace, which is interesting. logspace = torch.logspace(start=-10, end=10, steps=5). You can write the start and end keywords or leave them out. It's honestly up to you. Let's run that. Oops, I need to put logspace there. So we get that. We start at ten to the negative ten, and then the exponent goes up in equal increments: negative ten, negative five, zero, plus five, ten, for five steps. So that's pretty cool. What else do we have here? We have eye, torch.eye. I have all these on my second screen, a bunch of examples just written out. We're just visualizing what these can do, and maybe you'll have your own creative little sparks of thought and find something else to use these for in your own personal projects. We're just experimenting with the basics of PyTorch and some of its very basic functions. So, torch.eye with five, and we print it out. We get pretty much just a diagonal line of ones in a five by five matrix, the identity matrix. It looks like reduced row echelon form. I don't know how to pronounce it, but that's pretty much what it looks like. Pretty cool stuff. Let's see what else we have. We have empty_like, torch.empty_like(a), and we'll make a equal to torch.empty((2, 3), dtype=torch.int64), so 64-bit integers. Let's see what happens here. Boom. So that's pretty cool. What else do we have?
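To make the tensor-creation functions above concrete, here is a minimal sketch that collects them in one cell. The variable names are only illustrative and are not taken from the notebook on screen.

```python
import torch

# 5 evenly spaced points from 3 to 10: the step is (10 - 3) / (5 - 1) = 1.75.
lin = torch.linspace(3, 10, steps=5)
print(lin)  # values 3.0, 4.75, 6.5, 8.25, 10.0

# 5 points whose exponents of 10 are evenly spaced: -10, -5, 0, 5, 10.
log = torch.logspace(start=-10, end=10, steps=5)
print(log)

# 5x5 identity matrix: ones on the diagonal, zeros everywhere else.
eye = torch.eye(5)
print(eye)

# empty_like copies the shape and dtype of `a`; the values are uninitialized.
a = torch.empty((2, 3), dtype=torch.int64)
b = torch.empty_like(a)
print(b.shape, b.dtype)
```

Because torch.empty and torch.empty_like do not initialize memory, the printed values of b can differ from run to run, which is why only its shape and dtype are printed here.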
Yes, we can do timing as well. I'm just going to erase all of these. You can scroll back in the video and experiment with them a little, try more than just what I've done, maybe modify them a bit. But yeah, I'm going to delete all of these. Then we can set device equal to 'cuda' if torch.cuda.is_available(), else 'cpu', and we're going to switch this over to the CUDA GPT environment. Print out our device here, run this, and we get cuda. Sweet. So we're going to try doing stuff on the GPU now compared to the CPU, and really see how much of a difference CUDA makes when we change the shape and dimensionality and run experiments with a bunch of different tensors.
In order to actually measure the difference between the GPU and the CPU, I imported a library called time. It comes with Python, so you don't have to install it manually. Whenever we call time.time(), with parentheses, it takes the current timestamp. So start time will be right now, and end time, maybe three seconds later, will be right now plus three seconds. If we subtract start time from end time, we get a three second difference, and that's the total elapsed time. This little number here, this four, is just how many decimal places we print. So I can go ahead and run this. time is not defined, so let's run the import first. It's going to take almost no time at all, so we can increase the decimal places to 10 if we want and run that again.
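The timing pattern described above can be sketched in plain Python as follows. The summation in the middle is a stand-in workload chosen for illustration, since the exact cell being timed is not shown in the transcript.

```python
import time

start_time = time.time()          # timestamp before the work

# Stand-in workload (an assumption for this sketch, not the notebook's code).
total = sum(i * i for i in range(100_000))

end_time = time.time()            # timestamp after the work
elapsed_time = end_time - start_time

# The 4 in the format spec is the number of decimal places printed.
print(f"{elapsed_time:.4f}")
```

For very short measurements, time.perf_counter() is another standard-library option with a higher-resolution clock.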
Again, we're making pretty much a one by one matrix, so it's just a zero. We're not really going to get anything significant from that. But for actually testing the difference between the GPU and the CPU, what we care about is the iterative process, the forward pass and backpropagation through the network. That's primarily what we're trying to optimize. Actually pushing all the parameters and model weights to the GPU isn't really going to be the problem. It'll take maybe a few seconds, 30 seconds at most, and that's next to nothing in the entire training process. So what we want to see is which is better, NumPy on the CPU or torch using CUDA on the GPU. I have some code for that right here. We're going to initialize a bunch of tensors, basically random ones. We have a 10,000 by 10,000 of random floating point numbers, we have two of these, and we push them to the GPU. Then the same thing for NumPy. In order to multiply matrices with PyTorch, we use this @ symbol. So we start the timer, multiply these, get a new random tensor, and stop the timer. Then we do the same thing over here, except we use numpy.multiply. If I go ahead and run these, it takes a few seconds to initialize them, or not even a few seconds, and then, see, look at that. For the GPU, it took a little while, and for the CPU, it didn't take as long. This is because these matrices are not really that big. They're just two dimensional, right? This is something the CPU can do very quickly, because there's not that much to do. But let's say we want to bump it up a notch.
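For reference, the comparison just described can be sketched roughly like this. Two things are assumptions of the sketch rather than what is on screen: the size n is much smaller than 10,000 so it runs quickly anywhere, and both sides use the @ matrix product. numpy.multiply is an element-wise product, so using @ on both sides keeps the two timings comparable.

```python
import time

import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n = 1000  # illustrative size; the transcript uses 10,000 by 10,000

torch_a = torch.rand(n, n, device=device)
torch_b = torch.rand(n, n, device=device)
np_a = np.random.rand(n, n).astype(np.float32)
np_b = np.random.rand(n, n).astype(np.float32)

start = time.time()
torch_result = torch_a @ torch_b
if device == "cuda":
    # GPU kernels run asynchronously, so wait for them before stopping the timer.
    torch.cuda.synchronize()
torch_elapsed = time.time() - start

start = time.time()
np_result = np_a @ np_b  # matrix product on the CPU
np_elapsed = time.time() - start

print(f"torch on {device}: {torch_elapsed:.6f} s")
print(f"numpy on cpu: {np_elapsed:.6f} s")
```

The first CUDA call in a session also pays a one-time startup cost, so running the cell twice usually gives a fairer GPU number.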
So if we go to 100, 100, 100, and then maybe throw in another 100 there. Hopefully that works. We'll just do the same thing, so let's paste this. Now if we run this again, you'll see that the GPU actually took less than half the time the CPU did. This is because there's a lot more going on here, a lot more simple multiplication to do. The reason this is so significant is that when we have millions or billions of parameters in our language model, we're not going to be doing very complex operations between all these tensors. They're going to be very similar to what we saw here. The dimensionality and shape will be very similar to what we're seeing right now, maybe three or four dimensions, and it's going to be very easy for our GPU to do this. These aren't complex tasks that we need the CPU for. They're not very hard at all, so when we hand this task to parallel processing, it's going to be a ton quicker. You're going to see why this matters later in the course with some of the hyperparameters we're going to use, which I'm not going to get into quite yet. Over the next little bit, you're going to see why the GPU matters a lot for increasing the efficiency of that iterative process. So this is great. Now you know a little bit more about why we use the GPU instead of the CPU for training efficiency.
There's actually another tool we can use, written %%time. I don't know exactly what you're supposed to call it, but pretty much what it does is time how long it takes to execute a cell. We can see here there's CPU times, zero nanoseconds. The n is for nano, and a nanosecond is a billionth of a second. And then there's wall time.
CPU time is how long the code spends executing on the CPU, the time it's actually doing operations for. Wall time is how long it takes in real time, how long you have to wait until it's finished. The only thing CPU time doesn't include is waiting. In an entire process there are going to be some operations and some waiting. Wall time includes both of those, and CPU time is just the execution.
So let's go ahead and continue with some of the basic PyTorch functions. I've written some stuff down here. We're going to go over torch.stack, torch.multinomial, torch.tril, torch.triu (I don't think that's how you pronounce them, but we'll get into that more), transposing, nn.Linear, concatenating, and the softmax function. Let's start off with torch.multinomial. This essentially samples indices from a probability distribution that you give it. We have probabilities here, 0.1 and 0.9. As probabilities, these add up to one, since a hundred percent is one whole. So I have 10% and 90%. The 10% is at index zero, so there's a 10% chance that we get a zero and a 90% chance that we get a one. If I run this, give it a second to do its thing. You can see that we have num_samples set to 10, so it gives us 10 of these, one, two, three, four, five, six, seven, eight, nine, ten, and all of them are ones. If we run it again, we get slightly different results. Now we have some zeros in there, but the zeros have a very low probability of happening, as a matter of fact exactly a 10% probability. We're going to use this later for predicting what word is going to come next. Let's move on to torch.cat, short for torch.concatenate. This will essentially concatenate two tensors into one.
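Going back to torch.multinomial for a moment, here is a minimal sketch of the sampling described above. The seed and the larger 10,000-draw check are additions for illustration. replacement=True is needed because we draw more samples than there are categories.

```python
import torch

torch.manual_seed(0)  # only so repeated runs of this sketch give the same draws

# Index 0 has a 10% chance of being drawn, index 1 has a 90% chance.
probabilities = torch.tensor([0.1, 0.9])

samples = torch.multinomial(probabilities, num_samples=10, replacement=True)
print(samples)  # ten indices, each 0 or 1, mostly 1

# With many draws, the fraction of ones should land close to 0.9.
many = torch.multinomial(probabilities, num_samples=10_000, replacement=True)
fraction_of_ones = many.float().mean().item()
print(fraction_of_ones)
```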
So I initialize this tensor here, torch.tensor of one, two, three, four, which is one dimensional, and we have another tensor that just contains five. If we concatenate one, two, three, four with five, we get one, two, three, four, five. You just combine them together, and this is what comes out in the end. Run that: one, two, three, four, five. Perfect. We're actually going to use this when we're generating text given a context. It starts from zero, we use our probability distribution to pick the first one, and then based on the first one we predict the next character. Once we've predicted that, we concatenate the new one with the ones we've already predicted. So we have maybe a hundred characters over here, and the next character we're predicting is over here, and we just concatenate these. By the end, we'll have all of the integers that we've predicted.
Next up we have torch.tril. What tril stands for is triangle lower. It's going to be in a sort of triangle formation, with the diagonal going from top left to bottom right. You're going to see a little more of why later in this course, but this is important because when you're trying to predict integers, or next tokens in the sequence, you only know what's in the current history. We're trying to predict the future, so giving away the answers from the future isn't what we want to do at all. Maybe we've just predicted one and the rest we haven't predicted yet, so we set all these to zero. Then we've predicted another one, and these are still zero. So these are talking to each other in history.
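The concatenation step described above, and the way it gets used while generating, can be sketched like this. The fixed probs distribution in the loop is a stand-in for a real model's prediction and is purely hypothetical.

```python
import torch

context = torch.tensor([1, 2, 3, 4])
next_token = torch.tensor([5])
out = torch.cat((context, next_token), dim=0)
print(out)  # tensor([1, 2, 3, 4, 5])

# Toy generation loop: start from token 0, then keep appending sampled tokens.
torch.manual_seed(0)
generated = torch.zeros(1, dtype=torch.long)
probs = torch.tensor([0.1, 0.9])  # hypothetical stand-in for model output
for _ in range(5):
    sampled = torch.multinomial(probs, num_samples=1)
    generated = torch.cat((generated, sampled), dim=0)
print(generated)  # the start token followed by 5 sampled tokens
```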
As our predictions add up, we have more and more history to look back on and less future, right? Basically the premise is just making sure we can't communicate with the answer. We can't predict while knowing what the answer is. It's just like writing an exam: you can't use the answer sheet. They don't give you the answer sheet, so you have to know, based on your history of knowledge, which answers to predict. That's all that's going on here. Then we have, and you could probably guess this, triangle upper, torch.triu. So we have all the upper ones. These are on the lower side and these are on the upper side. Same concept there.
Then we have masked_fill. This one's going to be very important later. Here the masked positions are filled with negative infinity and the rest are zero, and to get from that back to the ones and zeros, all we do is exponentiate every element. If you exponentiate zero, it becomes one. If you exponentiate negative infinity, it becomes zero. All that's going on is that we take approximately 2.71, the constant e that the exp function uses, and raise it to whatever power is in the current slot. So we have a zero here, and 2.71 to the zeroth is equal to one. 2.71 to the first is 2.71. And 2.71 to the negative infinity is, of course, zero. That's pretty much how we get from this to this. We're simply masking these over. So that's great, and I sort of showcase what exp does here. We're just taking this output and plugging it into here, so it goes from negative infinity to zero, and from zero to one. That's how we get from here to here. Now we have transposing. Transposing is when we flip or swap the dimensions of a tensor.
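The triangle and masking steps above fit together as in the following sketch. The 5 by 5 size is an assumption for illustration, since the transcript does not state the size used on screen.

```python
import torch

# Lower triangle: each row can "see" itself and everything before it.
lower = torch.tril(torch.ones(5, 5))
print(lower)

# Upper triangle: the mirror image of the same idea.
upper = torch.triu(torch.ones(5, 5))
print(upper)

# Start from zeros and put -inf wherever the lower-triangular mask is 0.
masked = torch.zeros(5, 5).masked_fill(lower == 0, float("-inf"))
print(masked)

# exp(0) = 1 and exp(-inf) = 0, so exponentiating recovers the 0/1 triangle.
recovered = torch.exp(masked)
print(recovered)
```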
In this case, I initialize a torch.zeros tensor with dimensions two by three by four. We can use the transpose function to flip any two dimensions that we want. What we're doing here is swapping the zeroth position (it sounds weird not to say first dimension) with the second. So zero, one, two: we're swapping this one with this one. The end result, as you would probably guess, is that the shape is going to be four, three, two instead of two, three, four. You just take a look at this, see which ones are being flipped, and those are the dimensions and that's the output. Hopefully that makes sense.
Next up we have torch.stack, and we're going to do more of this. We're actually going to use torch.stack very shortly when we're getting our batches. Remember before when I was talking about batch size, and how we take a bunch of these blocks together? Each block is a length of integers, or tokens, and all we're doing is stacking blocks together to make a batch. That's pretty much what we're going to end up doing, and that's what torch.stack does. We can take something that's one dimensional and stack it to make it two dimensional, and we can take something that's two dimensional and stack it a bunch of times to make it three dimensional. Or take three dimensional: for example, we have a bunch of cubes and we stack those on top of each other, and now it's four dimensional. Hopefully that makes sense. All we're doing is passing in each tensor that we're going to stack, in order. This is our little output here, and that's pretty much all it is.
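Both operations above can be checked by looking at shapes. This is a small sketch with made-up tensor values; only the two by three by four starting shape comes from the transcript.

```python
import torch

# Swap dimension 0 with dimension 2: (2, 3, 4) becomes (4, 3, 2).
x = torch.zeros(2, 3, 4)
y = x.transpose(0, 2)
print(y.shape)  # torch.Size([4, 3, 2])

# Stack three 1-D tensors into one 2-D tensor, like blocks into a batch.
t1 = torch.tensor([1, 2, 3])
t2 = torch.tensor([4, 5, 6])
t3 = torch.tensor([7, 8, 9])
stacked = torch.stack([t1, t2, t3])
print(stacked.shape)  # torch.Size([3, 3])

# Stacking 2-D tensors adds one more dimension in the same way.
stacked_3d = torch.stack([stacked, stacked])
print(stacked_3d.shape)  # torch.Size([2, 3, 3])
```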
The next function is going to be really important for our model, and we're going to be using it the entire time, from start to finish. It's called nn.Linear. It's pretty much a layer built on nn.Module, and this is really important because, as you're going to see later on, nn.Module contains anything that has learnable parameters. When we apply a transformation to something, when we apply a weight and a bias (in this case bias will be False), under nn.Module it will learn those and become better and better. It basically trains based on how accurate those are and how close certain parameters bring it to the desired output. So pretty much anything with nn.Linear is going to be very important, and it's going to be learnable. We can see over here, this is the torch.nn page on the docs. We have containers and a bunch of different layers, activations, pretty much just layers. That's all it is. These are important. We're basically going to learn from these, and later on we're going to use something called keys, values, and queries, and you'll see why those are important. But if that doesn't make sense yet, let me illustrate it for you right now. I drew this out here. If we look back at our examples, we initialize a tensor, and it's 10, 10, and 10. What we're going to do is a linear transformation. The Linear stands for linear transformation. Pretty much we're just going to apply a weight and a bias through each of these layers. We have an input and we have an output. X is our input, Y is our output. This is of size three, and this is of size three.
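Here is a minimal sketch of the nn.Linear example described above, with a three-element input of tens, three outputs, and no bias. The manual check at the end is an addition to show what the layer computes. The weights are randomly initialized, so the printed numbers will vary unless a seed is set.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so this sketch prints the same numbers each run

sample = torch.tensor([10.0, 10.0, 10.0])

# 3 inputs -> 3 outputs, with the bias switched off.
linear = nn.Linear(3, 3, bias=False)
out = linear(sample)
print(out)

# The layer holds one learnable 3x3 weight matrix and no bias.
print(linear.weight.shape)  # torch.Size([3, 3])

# Without a bias, the layer is just the input times the transposed weights.
manual = sample @ linear.weight.T
print(torch.allclose(out, manual))  # True
```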
So pretty much we just need to make sure that these are lining up. For more context, nn.Sequential is sort of built up from layers like nn.Linear. If we go ahead and search that up right now, this'll make sense in a second. This is also good prerequisite knowledge in general for machine learning. So let's see, nn.Sequential. It doesn't show it here, but let's say you have two input neurons and maybe one output neuron, okay? You have a bunch of hidden layers in between. Let's say one hidden layer has one, two, three, four neurons and the next has one, two, three. You need to make sure that the input aligns with the first hidden layer, that hidden layer aligns with the next one, and that one aligns with the output. So you're going to have a transformation of two to four, then one of four to three, and then a final one of three to one. You pretty much just need to make sure that these are lining up. We can see that we have two, four, and then this four is carried on from the previous output. This just makes sure that our shapes are consistent, and of course, if the shapes don't work out, the math simply won't work. So we need to make sure that our shapes are consistent. If that didn't make sense, I know I'm not super great at explaining the architecture of neural nets, but if you're really interested, you could use ChatGPT, of course, which is a really good learning resource. ChatGPT, going on GitHub discussions maybe, or just looking at the documentation. And if you're not good at reading documentation, then you could take some little keywords from here, like "a sequential container". Well, what is a sequential container?
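The two to four, four to three, three to one chain described above can be sketched with nn.Sequential as follows. Real networks usually put nonlinear activations between the linear layers. They are left out here so the sketch shows only the shape bookkeeping, and the deliberately mismatched model at the end is a made-up example.

```python
import torch
import torch.nn as nn

# Each layer's input size matches the previous layer's output size.
model = nn.Sequential(
    nn.Linear(2, 4),  # 2 input neurons -> 4 hidden
    nn.Linear(4, 3),  # 4 hidden -> 3 hidden
    nn.Linear(3, 1),  # 3 hidden -> 1 output neuron
)

x = torch.rand(1, 2)  # one sample with 2 features
y = model(x)
print(y.shape)  # torch.Size([1, 1])

# If the sizes do not line up (4 outputs fed into a layer expecting 5),
# the matrix math fails at call time.
bad_model = nn.Sequential(nn.Linear(2, 4), nn.Linear(5, 1))
mismatch_raised = False
try:
    bad_model(x)
except RuntimeError as err:
    mismatch_raised = True
    print("shape mismatch:", err)
```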
You can ask ChatGPT those types of questions and just sort of reverse engineer the documentation and figure things out step by step. It's really hard to know what you're doing if you don't know all of the math and all of the functions that are going on. You don't need to memorize them, but while you're working with them, it's important to understand what they're really doing behind the scenes, especially if you want to make an efficient, properly working neural net. So that's that. What's going to happen with these linear layers is that we're simply going to transform from one to the other, input to output, no hidden layers, and we're going to be able to learn the best parameters for doing that. You're going to see why that's useful later.
Now we have the softmax function. That sounds scary, but the softmax function isn't actually as bad as it sounds at all. Let me illustrate that for you right now. Let's go ahead and change the color here. Let's say we have an array, one, two, three. We'll make them floating point numbers, 1.0, 2.0, 3.0, whatever. If we put this into the softmax function, what's going to happen is we exponentiate each of these and divide each one by the sum of all of them exponentiated. So let's say we exponentiate one. This is what it looks like in code: one, dot exp. I think I talked about this up above. Exponentiating is when we take 2.71 to the power of whatever number we're exponentiating. So if we have this one, exponentiating it gives us 2.71. And we have this two here, and that gives us 2.71 to the power of two. Okay.
So we're going to get 7.34. Pardon my writing, it's terrible. Then 2.71 cubed, so 19.9. We can arrange these in a new array: 2.71, 7.34, and 19.9. Now we add all of these up together. Let's do this math real quick. I'm just going to walk you through it to help you understand what the softmax function is doing. 2.71 plus 7.34 plus 19.9 gives us a total of 29.95. Great. So all we do is divide each of these elements by the total. 2.71 divided by 29.95 gives us, say, X. 7.34 divided by 29.95 gives us Y, and 19.9 divided by 29.95 gives us Z. So you exponentiate all of these, add them together to create a total, and then divide each of the exponentiated elements by that total. After that, we just wrap these up in an array again. All the softmax function is doing is converting this 1, 2, 3 into X, Y, Z. That's all it's doing. And yeah, it's not really crazy. There's a weird formula for it. If you look up the softmax function on Wikipedia, you're going to crap yourself, because there are a lot of terms in there and a lot of math that's above the high school level. But yeah, this formula here, I believe this is it, the standard (unit) softmax function. There you go. So that's what it does, and there's your easy explanation of it. You're going to see why this is useful later, but it's important to know what's going on so that you won't lag behind later in the course when this background knowledge becomes important.
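The three steps above (exponentiate, sum, divide) can be written out in a few lines of plain Python. This sketch uses math.exp, which works with the full value of e (about 2.718) rather than the rounded 2.71 used in the hand calculation, so the intermediate numbers come out slightly larger (about 2.718, 7.389 and 20.086) while the final proportions are nearly the same.

```python
import math


def softmax(values):
    """Exponentiate each value, then divide by the sum of the exponentials."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print([round(p, 4) for p in probs])  # [0.09, 0.2447, 0.6652]
print(sum(probs))  # the outputs always add up to 1 (up to float rounding)
```

In symbols, for inputs $x_1, \dots, x_n$ each output is $e^{x_i} / \sum_j e^{x_j}$, which is the formula the Wikipedia page states in more general notation.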
If we go over a little example of the softmax function in code, it looks like this right here. We import torch.nn.functional as F, F short for functional, and we pretty much just call F.softmax, plug in a tensor, and say which dimension we want the softmax taken over. If we plug this in and print it out... torch is not defined, so let's run this from the top, boom, and try that again, boom, there we go. Let's check those values by hand again. 2.71 divided by 29.95 gives 0.09. Good. 7.34 divided by 29.95 gives 0.245. It's really close, actually. And then the last one shows 0.6652. What was that last one? 19.9 divided by 29.95 gives 0.664, against 0.665, so it's pretty close. Again, we're rounding, so it's not perfectly accurate, but as you can see, they're very close, and for only having two decimal places, we did pretty well. So that's just illustrating what the softmax function does and what it looks like in code. As for the shape here, dim equals zero means we take it along the one dimension we have, which is just kind of a straight line.
So now we're going to go over embeddings. I don't have any code for this yet. We're going to figure it out step by step with ChatGPT, because I want to show you the skills and what it takes to reverse engineer an idea or function, or just understand how something works in general in machine learning. So if we pop into ChatGPT here, we ask, what is nn.Embedding? It says nn.Embedding is a class in the PyTorch library. Okay.
It's used in natural language processing and maps each discrete input to a dense vector representation. Okay. How does this work? Let's see. We have some vocab size, so that's probably our vocabulary size. I think we talked about that earlier: the vocabulary size is how many unique characters are actually in our dataset. And then some embedding dimension here, which is a hyperparameter. This doesn't quite make sense to me yet, so maybe I want to see what it actually looks like. Can you explain this to, maybe, an eighth grader and provide a visualization? Certainly. Okay. Little secret codes that represent the meaning of the words. Okay, that helps. So if we have cat, cat's a word. Maybe we want to know what it would look like on a character level. What about on a character level instead of the word level? It's probably going to look very similar. We have this little vector here, storing some information about whatever this is. So a maps to this vector here, starting with 0.2, and this is really useful. So we've pretty much just learned what embedding vectors do. If you haven't kept up with this, pretty much what they do is store a vector of information about a character. We don't even know what each of these elements means. This one could be positivity, or it could mark the start of a word, or it could be any piece of information, maybe something we can't even comprehend yet. But the point is that we give each character a vector and feed these into a network, and they get learned, because as we saw before, nn.Embedding is a part of nn.Module. So these are learnable parameters, which is great. It's actually going to learn the importance of each letter, and it's going to be able to produce some amazing results.
So in short, the embedding vectors are essentially a vector, a numerical representation, of the sentiment of a letter. In our case it's character level, not subword, not word. So it's going to represent some meaning about those characters. That's what embedding vectors are. Let's go figure out how they work in code. We have this little character-level embedding vector, and it contains a list with five elements, one, two, three, four, five, and it's by the vocab size. So we have all of our vocabulary by the length of each embedding vector. This actually makes sense: it's our vocab size by the embedding dimension, which is how much information is being stored for each of these characters. So this is now very easy to understand. I'm just going to copy this code from here, paste it down here, and get rid of the imports of torch and torch.nn, because we already did those above. If we just run this, actually, let's turn the vocab size down to maybe a thousand characters and try that out. It's not defined. Oh, we did not initialize it. So let's go back down here and look at that. This .shape is going to show the shape of it, this much by this much. So it's four by a hundred. And yeah, we can work with these and store stuff about characters in them. You're going to see in the next lecture how we actually use embedding vectors, so no need to worry if a lot of this doesn't make sense yet. That's fine. You're going to learn more about how we use these over the course, and you'll get more confident using them, even in your own projects. So don't stress about it too much right now. Embeddings are pretty tricky to learn at first.
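A minimal sketch of the embedding lookup described above follows. The vocabulary size of 1,000 and the four by a hundred output shape come from the transcript. The specific index values are made up, and the variable names are only illustrative.

```python
import torch
import torch.nn as nn

vocab_size = 1000     # number of distinct characters (tokens)
embedding_dim = 100   # length of the vector stored for each one

# A learnable lookup table with one row per vocabulary entry.
embedding = nn.Embedding(vocab_size, embedding_dim)
print(embedding.weight.shape)  # torch.Size([1000, 100])

# Four made-up token indices; each one selects a row of the table.
input_indices = torch.tensor([1, 5, 3, 2])
embedded = embedding(input_indices)
print(embedded.shape)  # torch.Size([4, 100])
```

The rows start out random and are adjusted during training like any other nn.Module parameter.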
So don't worry about that too much. But there are a few more things I want to go over, just to get us prepared for some of the linear algebra, and matrix multiplication in particular, that we're going to be doing in neural networks. Remember before, we pulled out this little sketch. This is actually called a multilayer perceptron, but people like to call it a neural network because it's easier to say. That's the architecture of this, a multilayer perceptron. Pretty much what's happening is we have a little input here and we have a weight matrix. A weight matrix looks like this, and we have some values in it, X1, Y1, maybe Z1, so a bunch of weights, and maybe biases too that we add to it. The tricky part is how we actually multiply our input by this weight matrix. We're just doing one matrix times another. Well, that's called matrix multiplication, and I'm going to show you how to do it right now.
First off, we have to learn something called dot products. Dot products are actually pretty easy, and you might have done them before. Let's say we take this array here, one, two, three, and that's going to be a. Then we have b, four, five, six. If we want to find the dot product between these two, all we have to do is take the elements at matching indexes, the first ones, the second ones, the third ones, multiply each pair together, and then add. So we do one times four, add it to two times five, and add that to three times six. One times four is four, two times five is 10, three times six is 18. Add these up: 14 plus 18 is 32. So the dot product of these is 32, and that's pretty much how simple dot products are.
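The dot product just worked by hand can be written in a couple of lines of plain Python. The helper name dot is only illustrative.

```python
def dot(a, b):
    """Multiply elements at matching indexes, then add up the products."""
    if len(a) != len(b):
        raise ValueError("both arrays need the same length")
    return sum(x * y for x, y in zip(a, b))


a = [1, 2, 3]
b = [4, 5, 6]
result = dot(a, b)  # 1*4 + 2*5 + 3*6 = 4 + 10 + 18
print(result)  # 32
```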
It's just taking each index of both of these arrays, multiplying the pairs together, and then adding all of those products up. That's a dot product. We actually need dot products for matrix multiplication, so let's jump into that right now. I'm going to create two matrices that are pretty easy to work with. Let's say we have one matrix over here with rows one, two; three, four; five, six. This is going to be A. Then B is going to be another matrix, with rows seven, eight, nine and 10, 11, 12. Ignore my terrible writing. To multiply these together, first we need to make sure that they can be multiplied. We need to take a look at the number of rows and columns that these have. This one has three rows, one, two, three, and two columns, so it's a three by two matrix. And this one has two rows and three columns, so it's a two by three matrix. If we're multiplying A @ B, and this is the PyTorch syntax for multiplying matrices, then we have to make sure the following is true. If we do three by two times two by three, we have to make sure that the two inner values are the same. Two is equal to two, so we cross these out, and the ones we have left over are three and three. So the resulting matrix is a three by three. However, if you had a three by four times a five by one, that doesn't work, because the inner values aren't the same. Those two matrices couldn't be multiplied. Sometimes you actually have to flip the order to make them work. Say we change this value here to a three, so we have a three by four and a five by three.
As written, they do not multiply, but if we switch them around, we have a five by three with a three by four. Now the two inner numbers are the same, so that works, and the resulting matrix is a five by four. That's how you make sure that two matrices are compatible. Now, to actually multiply A and B together, we don't have to rewrite them. Pretty much what we have to do is take this row and dot product it with this column, and once we're done with that, we do the same with the next ones. We start with the first row in the A matrix and we iterate through all of the columns in the B matrix. After we're done with that, we go to the next row in the A matrix, et cetera. That probably sounds confusing to start, so let me illustrate how this works. We have one times seven plus two times ten, and this is equal to 27. That's the first dot product, of one and two with seven and ten. In our new matrix, which I'll write out here, this 27 goes right here. Let's continue. Next up, we do one and two with eight and eleven. One times eight is 8 and two times eleven is 22, so our result is 30, and 30 goes right here. So 27, 30, and you can see how this is going to work, right? From the first row of A, we get the first row of the resulting matrix. Let's go ahead and do the rest here.
So we have one and two with nine and twelve. One times nine is 9 and two times twelve is 24, so that's 33, and we can write that here. Now let's move on to the next row, three and four. Three and four dot product with seven and ten: three times seven is 21 and four times ten is 40, so we get 61 as our output there, and we write 61 right there. Our next one is three and four dot product with eight and eleven. Three times eight is 24, plus four times eleven, which is 44. 24 plus 44 is 68, so we write that here. Next up we have three and four with nine and twelve. Three times nine is 27, and then four times twelve. I'm not doing that in my head. That's 48, and 27 plus 48 gives us 75, so let's write our 75 here. Then we can slide down to the last row, since we're done with that one. Five and six dot product with seven and ten: five times seven is 35 and six times ten is 60, so we get 95, and we write our 95 here. Then five and six dot product with eight and eleven: five times eight is 40 and six times eleven is 66, so we get a 104. And then the last one, five and six dot product with nine and twelve: five times nine is 45 and six times twelve is 72, so 45 plus 72 is 117. And that is how you multiply a three by two matrix and a two by three matrix together. The result would be C equals that.
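The row-by-column procedure above can be sketched in plain Python, including the check that the inner dimensions match. This is only an illustration of the hand calculation; `matmul` here is a hypothetical helper written for this example, not the PyTorch function.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    rows_a, cols_a = len(a), len(a[0])
    rows_b, cols_b = len(b), len(b[0])
    # The inner dimensions have to match: (rows_a x cols_a) @ (rows_b x cols_b).
    if cols_a != rows_b:
        raise ValueError(f"cannot multiply {rows_a}x{cols_a} by {rows_b}x{cols_b}")

    result = []
    for row in a:                      # walk down the rows of A
        new_row = []
        for j in range(cols_b):        # walk across the columns of B
            column = [b[k][j] for k in range(rows_b)]
            new_row.append(sum(x * y for x, y in zip(row, column)))
        result.append(new_row)
    return result


A = [[1, 2], [3, 4], [5, 6]]       # 3 x 2
B = [[7, 8, 9], [10, 11, 12]]      # 2 x 3

C = matmul(A, B)                   # 3 x 3
for row in C:
    print(row)
# [27, 30, 33]
# [61, 68, 75]
# [95, 106, 117]
```

Each printed row comes from one row of A paired with every column of B, for example 3*7 + 4*10 = 61 and 5*8 + 6*11 = 106.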
As you can see, it takes a lot of steps, and that actually took quite a bit of time compared to a lot of the other stuff I've covered in this video so far. So you can see why it's really important to get computers to do this for us, and especially to scale this on a GPU. I'm going to keep emphasizing that point: having the GPU is very important for scaling your training. But that's pretty much how you do dot products and matrix multiplication. I actually realized I messed up a little bit on the math there. This 104 is actually 106. If you caught that, good job. Pretty much, this is what it looks like in three lines of code. All of this up here that we just covered is in three lines. We initialize an A tensor and a B tensor, where each one of these is a row, and it multiplies them together. This @ symbol is shorthand for multiplying two matrices together in PyTorch. Another way to do this is to use the torch.matmul function, matmul being short for matrix multiply, and pass it A and B. These print literally the same thing. Look at that. I'm not too sure on the differences between them. I use A @ B for short, but if you really want to know, take a look at the documentation or ask ChatGPT, one of the two, and you should be able to get an answer from that. I'm going to move on to something that we want to watch out for, especially when we're doing matrix multiplication in our networks. Imagine we have some matrix A, and every element in this matrix is a floating point number. So a one would look like 1.0, or just a 1 with a trailing dot. That's what it would look like as a floating point number.
But if it were an integer, say B is full of ones as integers, it would just be a 1. There wouldn't be any decimals, zero, zero, et cetera. It would just be 1. In PyTorch, you cannot actually matrix multiply integers and floating point numbers, because they're not the same data type. I showcase this right here. We have an int_64, so its type is an integer, and a float_32. The 64 and 32 are just the number of bits; all we need to know here is that one is an integer and the other is a floating point number. I've initialized the first one with torch.randint, which I think I covered above, or maybe not. Anyway, the first parameter of torch.randint is pretty much your range. I could do zero to five, or I could just pass one, so it'll go from zero up to one. Then comes the shape of the matrix that it generates. It's a random int, so it generates a tensor with the data type int64. So we have a three by two. Then I initialize another random tensor. Key detail here: we don't have the int suffix, so this just generates floating point numbers. If we print the types of each of these, int_64.dtype and float_32.dtype, and I'll comment this out for now, we get int64 and float32. Now if we try to multiply these together, we get: expected scalar type Long, but found Float. Long is PyTorch's name for a 64-bit integer, and float of course has the decimal place, so you can't actually multiply these together. What you can do is call the float method on the integer tensor. If you just add .float() and run this, it'll actually work. So you can cast integers to floats, and I think there's a way you can cast floats to integers, but it has some rounding in there.
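A small sketch of that data type mismatch, and of the @ versus torch.matmul comparison, might look like the following. The tensor names follow the ones spoken above; the value range passed to torch.randint and the shape of the float tensor are assumptions for the example, and the exact wording of the error message can differ between PyTorch versions.

```python
import torch

# Assumed shapes: a 3x2 integer tensor and a 2x3 float tensor.
int_64 = torch.randint(1, 6, (3, 2))   # random integers, dtype torch.int64
float_32 = torch.rand(2, 3)            # random floats in [0, 1), dtype torch.float32
print(int_64.dtype, float_32.dtype)

try:
    result = int_64 @ float_32         # mixed dtypes: expected to fail
except RuntimeError as err:
    print("matmul failed:", err)

# Cast the integer tensor to float first, then multiply.
result = int_64.float() @ float_32
print(result.shape, result.dtype)

# The @ operator and torch.matmul compute the same product.
same = torch.equal(result, torch.matmul(int_64.float(), float_32))
print(same)
```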
Casting floats to integers is probably not the best for input and weight matrix multiplication. Pretty much, if you're doing any weight matrix multiplication, it's going to use floating point numbers, because the weights get extremely precise, so you want to make sure they have room to float around. That's how you avoid that error. Let's move on. So congratulations. You've probably made it further than quite a few people already. That was one of the most comprehensive parts of this entire course, understanding the math going on behind the scenes. For some people it's very hard to grasp if you're not very fluent with math. But let's continue with the bigram language model and pump out some code. To recap, we're using CUDA to accelerate the training process. We have two hyperparameters: block size, for the length of each sequence of integers, and batch size, for how many of those are running in parallel. We open our text and make a vocabulary out of it. We initialize our encoder and decoder, we get our data by encoding all this text, and then we get our train and val splits. Then comes this next function here, get_batch. Before I jump into it, I'll run this here. Block size is eight, so this is pretty much just taking the first eight characters, and then index one all the way up to index nine, exclusive. It's offset by one, and we can use this to show what the current input is and what the target would be. So when the input is 80, the target is 1; when it's 80 and 1, the target is 1; when it's 80, 1 and 1, the target is 28, et cetera. This is the premise of the bigram language model: given this character, we're going to predict the next. It doesn't know anything else in the entire history.
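The offset-by-one idea can be sketched with a plain Python list. The first four token ids echo the ones read out above; the rest are made up for the example.

```python
block_size = 8

# Encoded text as integer token ids (only the first four come from the lecture).
data = [80, 1, 1, 28, 39, 42, 39, 44, 32]

x = data[:block_size]         # indices 0..7, the inputs
y = data[1:block_size + 1]    # indices 1..8, the same sequence shifted by one

for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(f"when input is {context} target is {target}")

# A bigram model only ever uses the last character of the context,
# so its training pairs are simply (current, next).
pairs = list(zip(x, y))
print(pairs[:3])  # [(80, 1), (1, 1), (1, 28)]
```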
It just knows what's before it, or rather, it just knows what the current character is, and based on that, we predict the next one. So we have this get_batch function here, and this part right here is the most important piece of code. It's going to work a little bit more later with our train and val splits. I'll try to explain the train and val splits in a different way. Imagine you take a math course. 90% of all your work is just learning how the course works, learning all about the math. That's like the 90% of the data you train on. And then the other 10% at the end is the final exam, which might have some questions you've never seen before. The point is that in the first 90% you're tested on what you know, and the other 10% is what you don't know. This pretty much means you can't memorize everything and then just generate from memory. You generate something that's alike, something that's close, based on what you already know and the patterns you captured in that 90% of the course, so you can write your final exam successfully. That's pretty much what's going on here. Training is the course, learning everything about it, and validation is the final exam. Now, what we're doing here is we initialize ix, which takes random integers between zero and the length of the entire text minus block size. Even if you get an index right near length of data minus block size, you'll still only read characters up to the end of the data. That's kind of how that works. If we print this out, we get some random integers. These are random indices in the entire text that we can start generating from. So we print this out, and then comes torch.stack.
We covered torch.stack before. Pretty much what it does is stack them in batches, which is the entire point of batches. So that's what we do there. We get x, and y is the same thing but offset by one, like this. Actually, I'm going to add something here that's going to be very important. We're going to go x, y = x.to(device), y.to(device). Notice how we didn't do that up here. You're going to see what this does in a second. We return these, and you can see that the device changed. Now we're actually on CUDA, and this is really good, because these two pieces of data, the inputs and the targets, are no longer on the CPU. They're no longer going to be processed sequentially, but rather in our batches, in parallel. That's pretty much how you push any piece of data or parameters to the GPU: just .to() and then the device, which we initialized up here. Now we can go ahead and actually initialize our neural net. I'm going to go back up here and import some more stuff: import torch.nn as nn. You're going to see why a lot of this is important in a second. I just want to get some code out first. Down here we can initialize this. It's a class, and we're going to call it BigramLanguageModel, a subclass of nn.Module. I don't know how to explain this amazingly, but pretty much, when we use the torch.nn layers in PyTorch inside of an nn.Module subclass, their weights are all learnable parameters.
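Going back to the batching step described above, a minimal sketch of a get_batch function could look like this. The stand-in data tensor and the batch size of 4 are assumptions for the example (the course uses the encoded text and its own hyperparameters), and the device falls back to the CPU when CUDA is not available.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
block_size = 8
batch_size = 4  # assumed value for the example

# Stand-in for the encoded training text: 1000 made-up token ids.
train_data = torch.arange(1000, dtype=torch.long)


def get_batch(data):
    # Random start positions, bounded so every slice stays inside the data.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    # Move inputs and targets to the chosen device.
    return x.to(device), y.to(device)


x, y = get_batch(train_data)
print(x.shape, y.shape, x.device)
```

Each row of y is the matching row of x shifted one position to the right, which is what gives every input character its target.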
I'm going to look at the documentation here so you can understand this better. If we go to torch.nn, there are all of these: convolutional layers, recurrent layers, transformer, linear. We looked at linear layers before, so we have nn.Linear. If we use nn.Linear inside of this class, that means the nn.Linear parameters are learnable. That weight matrix will be changed through gradient descent. And actually, I think I should cover gradient descent right now, because if you don't know what it is, it's going to be really hard to understand how we make the network better. So I'm going to set up a little graph for that. I'll be using a little tool called Desmos. Desmos is great. It acts as a graphing calculator, so you can plug in formulas, move things around, and visualize how math functions work. I've written some functions out here that basically calculate the derivative of a sine wave. If I move a around, you'll see that it changes. Before I get into what's really going on here, I need to tell you what the loss actually is, if you're not familiar. Let's say we have 80 characters in our vocabulary and we have just started our model: no training at all, completely random weights. Theoretically, there's a one in 80 chance that we predict the next token successfully. We can measure the loss of this by taking the negative log likelihood. The likelihood is one out of 80. Take the natural log of that and then negate it. If we plug this in, we get 4.38. That's a terrible loss. Obviously, one out of 80 is not even a 2% chance, so that's not great. The point is to minimize the loss, which increases the prediction accuracy, and that's how we train our network. So how does this actually work?
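The negative log likelihood figure quoted above is easy to check with Python's math module. This is only the arithmetic for a uniform guess over 80 characters, not the loss code used in the model.

```python
import math

vocab_size = 80

# An untrained model guessing uniformly gives each character a 1/80 chance.
likelihood = 1 / vocab_size

# Negative log likelihood, using the natural logarithm.
loss = -math.log(likelihood)

print(round(loss, 2))  # 4.38
```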
How does this actually work out in code, you ask? Let's say we have a loss here. We start off with a loss of 2, just an arbitrary loss, and what we're trying to do is decrease it. Over time, it's going to become smaller and smaller if we move in this direction. So how do we know if we're moving in the right direction? Well, we take the derivative at the current point, and then we try moving in different directions. If we move this way, sure, it goes down. That's great. We can hit the local bottom over there. Or we can move to this side, and we can see that the slope is getting more negative. So we keep adjusting the parameters in favor of this direction. That's pretty much what gradient descent is: we're descending with the gradient. Pretty self-explanatory. That's what the loss function is for, and gradient descent is an optimizer for the network. It optimizes our parameters, our weight matrices, et cetera. These are some common optimizers that are used, and this is just from torch.optim, short for optimizer. It's a list of a bunch of optimizers that PyTorch provides. What we're going to be using is something called AdamW. I'm just going to read off my little script here, because I can't memorize every optimizer that exists. Plain Adam, not AdamW, is a popular optimization algorithm that combines ideas of momentum, and it uses a moving average of both the gradient and its squared value to adapt the learning rate of each parameter. And the learning rate is something that we should also go over. Let's say I figure out I need to move in this direction, and I take a step like that. That's a very big step. Then I say, okay, we need to keep moving in that direction. So what happens is I go like this, and then I end up there.
And it's like, whoa, we're going up now, what happened? That's because you have a very high learning rate. If you have a lower learning rate, you start here and take little one pixel steps, very small steps. Okay, that's good. That's better. It's even better. Keep going in this direction. And you keep going down. You're like, okay, this is good, we're descending. Then it starts to flatten out, so we know that we're hitting a local bottom here. And we stop, because it starts ascending again. That means this is our best set of parameters, because of what the loss is, or what the derivative is, at that particular point. So that's what the learning rate is. You want a small learning rate so that you don't take too large steps, so that the parameters don't change dramatically and end up messing you up. But you also want it large enough that you still have efficient training. You don't want to be moving by a millionth of one or something. That would be ridiculous. You'd have to do so many iterations to even get this far. So maybe you'd make it decently high, but not so high that it'll go like that, right? That's what the learning rate is: pretty much how fast it learns. And AdamW is a modification of the Adam optimizer that adds weight decay. So Adam adds some features on top of gradient descent, and AdamW is the same thing, except it has weight decay. What this pretty much means is it generalizes the parameters more. Instead of having very high level performance or very low level, it lands somewhere generalized in between. So the weight significance will actually shrink as it flattens out. This makes sure that certain parameters in your network, certain parameters in your weight matrices, aren't affecting the output of the model drastically.
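To make the learning rate discussion above concrete, here is a small sketch of plain gradient descent on a single parameter. The bowl-shaped loss (w - 3)^2 is an assumption chosen for simplicity (the lecture uses a sine curve in Desmos), and this is ordinary gradient descent, not Adam or AdamW.

```python
def loss(w):
    # A simple bowl-shaped loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)


def descend(lr, steps=50, w=0.0):
    """Take `steps` gradient descent steps from the starting point w."""
    for _ in range(steps):
        w = w - lr * grad(w)   # step against the slope
    return w


# Too high overshoots and climbs, moderate converges, tiny barely moves.
for lr in (1.1, 0.1, 0.000001):
    w = descend(lr)
    print(f"lr={lr}: w={w:.4f}, loss={loss(w):.4f}")
```

With the high learning rate each step jumps past the bottom and lands higher up on the other side, which is the "whoa, we're going up now" situation. The tiny rate descends in the right direction but has hardly moved after 50 steps.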
That could be in a positive or negative direction. You could have insanely high performance from some lucky parameters in your weight matrices. The point is to minimize those, to decay those values, to prevent the model from having that insane or super low level performance. That's what weight decay is. So that's a little background on gradient descent and optimizers. Let's finish typing this out. Next up, we need to initialize some things. We have our __init__, with self of course, since it's a class, and then vocabulary size. I want to make sure that's correct. I might actually shorten this to just vocab_size, because it's way easier to type out. And vocab_size, good. We're going to pump out some more code here, and this is assuming that you have some sort of a background in Python. If not, it's all good, just try to understand the premise of what's going on. We're going to make something called an embedding table, and I'm going to explain in a second why the embedding table is really important. Notice that we use the nn module in this, so that means this is going to be a learnable parameter, the nn.Embedding. We're going to make it vocab_size by vocab_size. So let's say you have all 80 characters here and all 80 characters here. I'm going to show you what this looks like in a second and why it's really important. But first, we're going to finish typing out this bigram language model. We're going to define our forward pass here. The reason we type this forward pass out instead of just using what it offers by default is this: let's say we have a specific use case for a model, and we're not just using some tensors and doing a simple task. This is really good practice, because we want to actually know what's going on behind the scenes in our model. We want to know exactly what's going on.
We want to know what transformations we're doing, how we're storing things, and a lot of the behind the scenes information that's going to help us debug. I actually asked ChatGPT: why is it important to write a forward pass function in PyTorch from scratch? Like I said, there's understanding the process: what all the transformations are, all the architecture in our forward pass, getting an input, running it through a network, and getting an output. There's flexibility. There's debugging, and like I said, debugging is going to bite you in the ass if you don't follow these best practices, because if you're using weird data and the default isn't really used to dealing with it, you're going to get bugs from that. You want to make sure that when you're going through your network, you're handling that data correctly and each transformation actually lines up. You can also print out what's going on at each step, so you can see, oh, this is not quite working out here, maybe we need to use a different function, maybe this isn't the best one for the task. And of course there's customization, if you're building custom models or custom layers, and optimization. So that's pretty much why we write out the forward pass from scratch. It's also just best practice, so it's never really a good idea to not write this. Let's continue. So self, and then index and targets. We're going to jump into a new term here called logits. But before we do logits, and I'm kind of all over the place here, I'm going to explain this embedding table. Paste that in. Return logits. You're going to see why we return logits in a second. So this nn.Embedding here is pretty much just a lookup table. I'm actually going to pull up my notebook here.
So we have a giant grid of what the predictions are going to look like. Can I drag it in here? No. So I'll download this and go full screen. Boom. This is in my Notion, but pretty much this is what it looks like. I took this picture from Andrej Karpathy's lecture. It has start tokens and end tokens: start is at the start of the block, and end tokens are at the end of the block. It's showing a sort of probability distribution of what character comes next given one character. So if we have, say, an A followed by the end token, that happens 6,640 times out of this entire row. If we add up all of these and normalize them, we get the probability of that happening. I don't know what all of these add up to. Some crazy number, maybe 20,000 or something. That percentage is the percentage of the end token coming after the character A. Same thing here. If we look at R, that's an RL or an RI. I don't know, I'm blind. That's an RI. We normalize these, and normalizing means you take how significant that count is relative to the entire row. This one's pretty significant in proportion to the others, so it's going to be a fairly high probability of coming next. A lot of the time you're going to have an I coming after an R. And that's pretty much what the embedding table is. That's why we make it vocab_size by vocab_size. So that's a little background on what we're doing here. Let's continue with the term logits. What exactly are the logits? You're probably asking that. Let's go back to a little notebook I had over here. Remember our softmax function, right here? We exponentiated each of these values and then we normalized them. We took each one's contribution to the sum of everything.
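The grid of counts described above can be imitated on a tiny piece of text. This sketch counts bigrams directly and normalizes one row; it is only meant to illustrate the picture, since the nn.Embedding table in the model starts out random and is learned rather than counted. The sample text and the helper name `next_char_probs` are made up for the example.

```python
from collections import Counter

text = "hello world, hello there"   # made-up sample text
chars = sorted(set(text))

# Count how often each character is followed by each other character.
counts = Counter(zip(text, text[1:]))


def next_char_probs(ch):
    """Normalize one row of the count table into probabilities."""
    row = {nxt: counts[(ch, nxt)] for nxt in chars if counts[(ch, nxt)] > 0}
    total = sum(row.values())
    if total == 0:
        return {}
    return {nxt: n / total for nxt, n in row.items()}


# 'l' is followed by 'l' twice, 'o' twice and 'd' once in this text.
print(next_char_probs("l"))
```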
That's what normalizing is. You can think of logits as a bunch of floating point numbers that get normalized into probabilities. I'll write this out. That's a terrible line. Let's draw a new one. Good. Let's say we have two, four and six, and we want to normalize these. What's the total? Six plus four is 10, plus two is 12. So two divided by 12 is 0.16 something; we'll call it 0.167. Four out of 12 is double that, so 0.333, or 33%. And six out of 12 is 0.5, so 50%. That's what these look like normalized. And this is pretty much what the logits are, except the normalized version is the actual probability distribution. Let's say we have a bunch of bigrams here, like a followed by b, a followed by c, and a followed by d. We know from this distribution that a followed by d is most likely to come next. So that's what the logits are for: they're pretty much the scores behind the probability distribution of what we want to predict. Given that, let's hop back into here. We're going to mess around with these a little bit. We have this embedding table, and I already showed you what that looked like. Now we're going to use a function called .view, which is going to help us reshape what our logits look like. I'll go over an example of what this looks like in a second; I'm just going to pump out some code. We have our batch by our time. You can think of time as that sequence of integers. That's the time dimension, right? You start from here. Through the generating process, we don't know what the next token is. So that's why we say it's time.
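The two, four, six example from a moment ago can be checked in a few lines, next to the softmax version, which exponentiates before normalizing. This is plain Python for illustration; the model itself uses PyTorch's softmax.

```python
import math

values = [2.0, 4.0, 6.0]

# Plain normalization: each value's share of the total.
total = sum(values)
normalized = [v / total for v in values]
print([round(p, 3) for p in normalized])  # [0.167, 0.333, 0.5]

# Softmax: exponentiate first, then normalize.
exps = [math.exp(v) for v in values]
softmax = [e / sum(exps) for e in exps]
print([round(p, 3) for p in softmax])
```

Both lists add up to one, but softmax puts much more of the weight on the largest value because of the exponentiation.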
We call it the time dimension because there are some tokens we don't know yet and some that we already do know. And then channels is the vocabulary size. So we can unpack logits.shape: what logits returns here is B by T by C. That's the shape of it. And then our targets, actually, no, we won't do that yet. We'll do logits = logits.view, and this is very important, with B times T. Because we're particularly paying attention to the channels, the vocabulary, the batch and time dimensions aren't as important here, so we can blend them together. As long as the logits and the targets have the same batch and time, we should be all right. So we do B times T by C. Then we can set up our targets: targets.view, and it's just B times T. And then we can compute our loss. Remember the loss function? We use cross entropy from the functional module, which is just a way of measuring the loss. It takes two parameters here, the logits and the targets. I'm going to go over exactly what's going on here in a second. But first, you might be asking, what does this view mean? What exactly does it do? I'm going to show you that right now. I've written some code here that initializes a random tensor of shape two by three by five. I unpack those dimensions by using .shape. So shape gives us the two by three by five: we get x equals two, y equals three, and z equals five. Then we can call .view, and that'll pretty much make that tensor again with those dimensions. Then we can print that out afterwards. We can print x, y, z, which are two, three, five, and print a.shape. And actually, I'll print out a.shape right here first.
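A sketch of that .shape and .view demonstration might look like this. The last two lines go one step further than the demo described above and merge the first two dimensions, which is an addition for illustration.

```python
import torch

a = torch.rand(2, 3, 5)
print(a.shape)             # torch.Size([2, 3, 5])

# Unpack the three dimensions.
x, y, z = a.shape
print(x, y, z)             # 2 3 5

# Put the tensor back together with the same dimensions.
a = a.view(x, y, z)
print(a.shape)             # torch.Size([2, 3, 5])

# Merge the first two dimensions into one, keeping the last.
flat = a.view(x * y, z)
print(flat.shape)          # torch.Size([6, 5])
```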
So you can see that this actually does line up with a.shape. And then down here as well, same exact thing. That's what view does: we unpack the dimensions with .shape, and then we can use view to put them back together into a tensor. So you might be asking, why in this notebook did we have to reshape these? Well, the answer comes down to what the shape needs to be for cross entropy. What does PyTorch expect the actual shape to be? I looked at the documentation here, and it pretty much says that we want either one dimension, which is the channels C, or two, which is N by C, where I believe N is the batch, so you have N different blocks or batches. And then you can have some other dimensions after that. So pretty much what it's expecting is a B by C by T instead of a B by T by C, which is precisely what we get out of here: logits.shape is B by T by C. Rather than rearranging into B by C by T, we're just making batch and time into one dimension by multiplying them. That's what's going on here. And then the second dimension is C. So you get B times T equals N, and then C, just the way it expects it. A lot of the time you might get errors from passing things into a functional function in PyTorch, so it's important to pay attention to how PyTorch expects the shapes to be. It's not very hard to reshape them. You just use .shape and .view: you unpack them and reshape them together. It's overall pretty simple for beginner to intermediate level projects, so there shouldn't really be trouble there. But watch out for it, because it will come back and get you at some point if you're not aware. So I'm adding a new function here called generate, and this is pretty much going to generate tokens for us.
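The reshaping before the cross entropy call, described above, could be sketched like this. The sizes B, T and C and the random logits and targets are assumptions that stand in for the model's real outputs.

```python
import torch
import torch.nn.functional as F

B, T, C = 4, 8, 80                     # batch, time, channels (vocabulary size)

logits = torch.randn(B, T, C)          # stand-in for the model's raw scores
targets = torch.randint(C, (B, T))     # stand-in for the next-character ids

# Blend batch and time into one dimension so the shapes are (N, C) and (N,).
logits = logits.view(B * T, C)
targets = targets.view(B * T)

loss = F.cross_entropy(logits, targets)
print(logits.shape, targets.shape, loss.item())
```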
We pass in index, which is the current index or the context, and then we have max_new_tokens, and these are passed in through here. We have our context, which we make a single zero, just the newline character, and we generate based on that. Then max_new_tokens, the second parameter, we just set to 500. So cool. What do we do inside of here? We have a little loop that runs for the range of max_new_tokens, so we're going to generate max_new_tokens tokens. That makes sense. Pretty much what we do is call the forward pass based on the current state of the model, the model parameters. I want to be explicit here and say self.forward rather than just self(index). It will call self.forward either way, but let's be explicit. So we get the logits and the loss from this. We focus on the last time step. That's the only one we care about in a bigram language model: we only care about the single previous character. It doesn't have context before that. Then we apply softmax to get a probability distribution, and we already went over the softmax function before. The reason we use negative one here is because we're focusing on the last dimension. In case you aren't familiar with negative indexing, which is what this is here, and same with here, imagine you have a little number line. It starts at index zero, one, two, three, four, five, et cetera. If you go before zero, it just loops to the very end of that array. So when we use negative one, it's the last element, negative two is the second last element, negative three is the third last element, et cetera. That's pretty much all this is here. You can do this for any sequence in Python; negative indexing is quite common. So that's what we do here. We've applied softmax to the last dimension, and then we sample from the distribution. We already went over torch.multinomial.
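Negative indexing is easy to try in plain Python. The lists below are made up; the second one imitates picking the last time step out of a sequence of per-step scores.

```python
tokens = [10, 20, 30, 40, 50]

print(tokens[-1])  # 50, the last element
print(tokens[-2])  # 40, the second last element
print(tokens[-3])  # 30, the third last element

# One row of scores per time step; -1 picks the most recent step.
scores_per_step = [[0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]
last_step = scores_per_step[-1]
print(last_step)   # [0.2, 0.8]
```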
We get one sample. And this is pretty much the next index, or the next encoded character. Then we use torch.cat, short for concatenate. It concatenates the previous context, or the previous tokens, with the newly generated one, so they're combined into one thing. And the result is a B by T plus one. And if that doesn't make sense, let me help you out here. So we have this time dimension. Let's say we have, you know, maybe just one element here. So we have something in the zeroth position, and then whenever we generate a token, we're going to take what's already there and add one more to it. So it becomes a B by T plus one. Since there was only one element, the length of that was one. It is now two. Then we have this two. We make it three. And then we have this three. And we make it four. So that's pretty much what this is doing. It just keeps concatenating more tokens onto it. And then, you know, after this loop, we just return the index. So this is the starting context plus all the generated tokens, max_new_tokens of them. And that's pretty much what that does. model.to(device) here: this is just going to push our parameters to the GPU for more efficient training. I'm not sure if this makes a huge difference right now, because we're only doing bigram language modeling. But yeah, it's handy to have this here. And then, I mean, this is pretty self-explanatory here. We generate based on a context. This is the context, which is just a single zero, or a newline character. We pass in our max_new_tokens. And then we pretty much decode this. So that's how that works. Let's move on to the optimizer and the training loop, the actual training process. So I actually skipped something and probably left you a little bit confused. You might be asking, how the heck did we actually access the second out of three dimensions from this logits here? Because the logits only returns two dimensions, right?
You have a B by T, or rather you have a B times T by C. So how exactly does this work? Well, when we call this forward pass, all we're passing in is the index here. So that means targets defaults to None. So because targets is None, the loss is None, and this code does not execute. And it just uses this logits here, which is three dimensional. So that's how that works. And honestly, if you're feeding in your inputs and your targets to the model, then you're obviously going to have your targets in there, and that will make sure targets is not None. So then you'll actually be executing this code and you'll have a two dimensional logits rather than a three dimensional logits. So that's just a little clarification there, if that was confusing to anybody. Another quick thing I want to cover before we jump into this training loop is this little torch.long data type. So torch.long is the equivalent of int64, or integer 64, which occupies 64 bits or eight bytes. So you can have different data types. You can have a float16, you can have a float32, a float64, and I believe you can have an int64 or an int32. The difference between float and int is that a float has decimals, it's a floating point number, and an integer is just a whole number. It's not really anything more than that. It can just be bigger based on the number of bits that it occupies. So that's just an overview on torch.long. It's the exact same thing as int64. So that's that. Now we have this training loop here. So we define our optimizer, and I already went over optimizers previously: AdamW, which is Adam with weight decay. So we have weight decay in here, and then all of our model parameters, and then our learning rate. So I actually wrote a learning rate up here. So I would add this and then just rerun this part of the code here if you're typing along. So I have this learning rate as well as max_iters, which is how many iterations we're going to have in this training loop.
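Stepping back to generate and the targets-defaults-to-None point above, the following is a minimal sketch of how a bigram model along these lines could look. The class name, the attribute name token_embedding_table, and the vocabulary size of 65 are assumptions for illustration; the notebook in the recording may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Each token looks up a row of scores for the next token.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, index, targets=None):
        logits = self.token_embedding_table(index)     # (B, T, C)
        if targets is None:
            loss = None                                # logits stay 3-dimensional
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)             # (B*T, C)
            targets = targets.view(B * T)              # (B*T,)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, _ = self.forward(index)            # no targets, so (B, T, C)
            logits = logits[:, -1, :]                  # last time step: (B, C)
            probs = F.softmax(logits, dim=-1)          # probabilities over the vocab
            index_next = torch.multinomial(probs, num_samples=1)   # (B, 1)
            index = torch.cat((index, index_next), dim=1)          # (B, T+1)
        return index


torch.manual_seed(0)
vocab_size = 65                                        # assumed vocabulary size
model = BigramLanguageModel(vocab_size)

# A single zero as the starting context; torch.long is the same dtype as int64.
context = torch.zeros((1, 1), dtype=torch.long)
generated = model.generate(context, max_new_tokens=20)
print(generated.shape)
```

Because generate never passes targets, the reshaping branch is skipped and indexing the time dimension with -1 is valid. In the walkthrough the returned indices would then be passed through the decode step to get characters back.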
And the learning rate is special, because sometimes your learning rate will be too high and sometimes it'll be too low. So a lot of the time you'll have to experiment with your learning rate and see which one provides the best mix of speed and quality over time. So with some learning rates, you'll get really quick advancements and then it'll, like, overshoot that little dip. So you want to make sure that doesn't happen, but you also want to make sure the training process goes quickly. You don't want to be waiting, you know, an entire month for a bigram language model to train by having a learning rate that's way too small. So that's a little overview. Basically we're just putting this learning rate in here. That's where it belongs. So now we have this training loop here, which is going to iterate over max_iters. We just give each iteration the name iter. And I don't think we use this yet, but we will later for reporting on the loss over time. But what we do is we get a batch with the train split specifically. Again, we're just training. This is the training loop. We don't care about validation. So we're going to call get_batch with train on this. We're going to get some X inputs and some Y targets. So we go and do a model.forward here. We've got our logits and our loss. And then we're going to do our optimizer.zero_grad. And I'll explain this in a second here. It's a little bit confusing. And then we have our loss.backward. And in case this doesn't sound familiar, in case you are not familiar with training loops, I know I can go through this a little bit quickly, but this is the standard training loop architecture for basic models. And this is what it'll usually look like. So you'll, you know, get your data, get your inputs and targets, whatever. You'll do a forward pass. You'll define something about the optimizer here. In our case, it's zero_grad.
And then you'll have a loss.backward, which is the backward pass, and the optimizer.step, which lets gradient descent work its magic. So back to optimizer.zero_grad. So by default, PyTorch will accumulate the gradients over time by adding them. And what we do by putting zero_grad in is we make sure that they do not add up over time, so the previous gradients do not affect the current one. And the reason we don't want this is because previous gradients are from previous data, and the data is, you know, kind of weird sometimes. Sometimes it's biased, and we don't want that determining, you know, what our error is, right? So we only want to optimize based on the current gradient of our current data. And this little parameter in here, set_to_none: this pretty much means that instead of setting the gradients to zero, we're going to set them to None. And the reason why we set them to None is because None occupies a lot less space. When you have zeros, that's a whole tensor of floats the same size as the parameters, and that's going to take up space. And because, you know, we might have a lot of these, that takes up space over time. So we want to make sure that set_to_none is True. At least for this case; sometimes you might not want to. And that's pretty much what that does. If you do leave zero_grad out, commonly the reason is that you want the gradients to carry over. One case I've seen mentioned is training large recurrent neural nets, which need to understand previous context because they're recurrent. I'm not going to dive into RNNs right now. The other big use case is gradient accumulation. Gradient accumulation simply adds up the gradients over several accumulation steps, and you usually scale things so it works out to an average of those gradients. So you get a bigger effective batch size, right? You get more data into each update that way, and you can keep the same batch size in memory. So just little neat tricks like that.
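The training loop walked through above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's exact code: the data is random integers standing in for encoded text, the "model" is a bare embedding table acting as a tiny bigram model, and the hyperparameter values are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed hyperparameters for illustration only.
vocab_size, block_size, batch_size = 65, 8, 4
max_iters, learning_rate = 300, 1e-2

data = torch.randint(0, vocab_size, (1000,))   # stand-in for the encoded training text


def get_batch():
    # Pick random starting points, then slice inputs and shifted-by-one targets.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y


model = nn.Embedding(vocab_size, vocab_size)   # smallest possible bigram "model"
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

losses = []
for step in range(max_iters):
    xb, yb = get_batch()                                   # get inputs and targets
    logits = model(xb)                                     # forward pass: (B, T, C)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)                  # drop gradients from the last step
    loss.backward()                                        # backward pass
    optimizer.step()                                       # update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The order inside the loop is the part that matters: forward pass, clear old gradients, backward pass, optimizer step. With set_to_none=True the .grad attributes are None rather than zero-filled tensors until the next backward call.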
We'll talk about gradient accumulation more later in the course, but here's pretty much what's going on. We define an optimizer, AdamW. We iterate over max_iters. We get a batch from the training split. We do a forward pass, zero_grad, backward pass, and then we take a step in the right direction, which is where gradient descent works its magic. And then at the end, we can just print out the loss here. So I've run this a few times, and over time I've gotten a loss of 2.55, which is okay. And if we generate based on that loss, we get, you know, still pretty garbage tokens. But then again, you know, this is a bigram language model. So actually, I might need to retrain this here. It's not trained yet. So what I'm actually going to do is run this, run this, run this, boom. And then what I'll do, oh, looks like we're printing a lot of stuff here. So that's coming from our get_batch. So I'll just comment that out, or we can just delete it overall. Cool. And now if we run this again, give it a second. Perfect. So I don't know why it's still doing that. If we run it again, let's see. Where are we printing stuff? get_batch. No. Ah, yes. We have to run this cell again after changing it. Silly me. And of course, 10,000 steps is a lot. So it takes a little while. It takes a few seconds, which is actually quite quick. So after the first one, we get a loss of 3.15. We can generate from that, and we get something that is less garbage. You know, it has some newline characters. It understands a little bit more how to, you know, space things out and whatnot. So that's like slightly less garbage than before. But yeah, this is pretty good. So I lied. There aren't actually any lectures previously where I talked about optimizers. So I might as well talk about it now. So you have a bunch of common ones. And honestly, you don't really need to know anything more than the common ones, because most of the others are just built off of these.
So you have your mean squared error, a common loss function used in regression problems, where it's like, you know, you have a bunch of data points, find the best fit line, right? That's a common regression problem. The goal is to predict a continuous output, and it measures the average squared difference between the predicted and actual values. It's often used to train neural networks for regression tasks. So cool. That's the most basic one. Strictly speaking it's a loss function, the thing an optimizer minimizes, rather than an optimizer itself, but you can look into that more if you'd like. Gradient descent is a step up from that. It's used to minimize the loss function in a model. The loss measures how well the model is able to predict the target variable based on the input features, and the gradient tells us which way to change the parameters to lower it. So we have some input X, we have some weights and biases, maybe WX plus B. And all we're trying to do is make sure that the inputs get turned into the desired outputs. And based on how far off it is from the desired outputs, we can change the parameters of the model. So we went over gradient descent previously, but that's pretty much what's going on here. And momentum is just a little extension of gradient descent that adds a momentum term. So it helps smooth out the training and allows it to continue moving in the right direction, even if the gradient changes direction or varies in magnitude. It's particularly useful for training deep neural nets. So momentum is when you, you know, consider some of the previous gradients. So you have something that's passed on from before, and then it includes a little bit of the current one. So a good momentum coefficient would be like 90% previous gradients and then 10% of the current one. So it kind of lags behind and makes it converge sort of smoothly. That makes sense. RMSprop, I've never used this, but it's an algorithm that uses a moving average of the squared gradient to adapt the learning rate of each parameter.
It helps to avoid oscillations in the parameter updates and can improve convergence in some cases. So you can look more into that if you'd like. Adam, very popular, combines the ideas of momentum and RMSprop. It uses a moving average of both the gradient and its squared value to adapt the learning rate of each parameter. It's often used as the default optimizer for deep learning models. And in our case, when we continue to build this out, it's going to be quite a deep net. And AdamW is just a modification of the Adam optimizer that adds weight decay to the parameter updates. So it helps to regularize, and it can improve generalization performance. We'll be using this optimizer, as it best suits the properties of the model we'll train in this video. So, of course, I'm reading off the script here. There's really no better way to say how these optimizers work. But yeah, if you want to look more into, you know, concepts like momentum or weight decay or, you know, oscillations and just some statistics stuff, you can. But honestly, the only thing that really matters is just knowing which optimizers are used for certain things. So what is momentum used for? What is AdamW great for? What is MSE good for, right? Just knowing what the differences and similarities are, as well as when is the best case to use each optimizer. So yeah, you can find more information about that at torch.optim. So when we develop language models, something really important in language modeling, data science, machine learning in general, is just being able to report a loss, or get an idea of how well our model is performing over, you know, the first thousand iterations and then the first two thousand iterations and four thousand iterations, right? So we want to get a general idea of how our model is converging over time. But we don't want to just print every single step of this. That wouldn't make sense. So what we actually could do is print every, you know, 200 iterations, 500.
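Going back to the optimizer overview for a moment, the update rules described above can be sketched in plain Python on a toy one-parameter problem. This is a simplified illustration, not PyTorch's implementation: the toy loss, the function names, and the hyperparameter values are all assumptions, and the momentum version uses the "90% previous, 10% current" moving-average form mentioned in the talk.

```python
import math


def grad(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)                        # step against the gradient
    return w


def momentum(w, lr=0.1, beta=0.9, steps=500):
    v = 0.0
    for _ in range(steps):
        # 90% of the running average, 10% of the current gradient.
        v = beta * v + (1.0 - beta) * grad(w)
        w -= lr * v
    return w


def adam(w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # moving average of the gradient (momentum part)
        v = beta2 * v + (1.0 - beta2) * g * g    # moving average of the squared gradient (RMSprop part)
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for the early steps
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


print(gradient_descent(0.0), momentum(0.0), adam(0.0))
```

All three start from w = 0 and should end up close to 3. AdamW, as described above, additionally shrinks the weights a little on every step (weight decay), which this sketch leaves out.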
We could print every 10,000 iterations if you're running a crazy big language model if you wanted to. And that's exactly what we're going to implement right here. So actually this doesn't require an insane amount of Python syntax. This is just, I'm actually just going to add it into our for loop here. And what this is going to do is it's going to do what I just said is print every, you know, every certain number of iterations. So we can add a new hyper parameter up here called eval iter's. And I'm going to make this 250 just for, just to make things sort of easy here. And we're going to go ahead and add this in here. So I'm going to go if iter and we're going to do the modular operator. You can look more into this if you want later. And we're going to do eval iter's equals equals zero. So what this is going to do is it's going to check if the current iteration divided by, or sorry, if the remainder of the current iteration divided by our eval iter's parameter, if the remainder of that is zero, then we continue with it. So hopefully that made sense. If you want to, you could just look at, you could just ask GPT. You could just ask GPT four or GPT 3.5, whatever you have, just this modular operator. And you should get a good general understanding of what it does. Cool. So all we can do now is we'll just say, we'll just have a filler statement here. We'll just do print an F string. And then we'll go losses, losses, maybe not. Or actually, I'm going to change this here. We can go step iter. Add a little colon in there. And then I'll go split. Actually, no, we'll just go loss. And then losses like that. And then we'll have some sort of put in here. Something soon. I don't know. And all I've done is I've actually added a little function here behind the scenes. You guys didn't see me do this yet. But pretty much, I'm not going to go through the actual function itself. 
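The "report every so many iterations" check described above comes down to the modulo (remainder) operator. A minimal sketch, using the same hyperparameter names as the walkthrough and a placeholder in place of the real loss:

```python
max_iters = 1000
eval_iters = 250

reported_steps = []
for step in range(max_iters):
    # step % eval_iters is the remainder of step divided by eval_iters;
    # it is zero exactly when step is a multiple of eval_iters.
    if step % eval_iters == 0:
        reported_steps.append(step)
        print(f"step: {step}")

print(reported_steps)
```

With these values the report fires at steps 0, 250, 500 and 750, so four times over the 1000 iterations.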
But what is important is that, you know, this this decorator right here, this probably isn't very common to you. So this is torch dot no grad. And what this is going to do is it's going to make sure that PyTorch doesn't use gradients at all in here. That'll reduce computation. It'll reduce memory usage. It's just overall better for performance. And because we're just reporting a loss, we don't really need to do any optimizing or gradient computation here. We're just getting losses. We're feeding some stuff into the model. We're getting a loss out of it. And we're going from there. So that's pretty much what's happening with this torch no grad. And, you know, for things like, I don't know, if you have other classes or other outside functions, like, I mean, get batch by default isn't using this because it doesn't have the model thing passed into it. But estimate loss does have model pass into it right here. So we just kind of want to make sure that it's not using any gradients. We do reduce computation that way. So anyways, if you want, you can just take a quick read over of this and it should overall make sense. Terms like.item,.mean are pretty common. A lot of the other things here, like model x and y, we get our logits in our loss. This stuff should make sense. Should be pretty straightforward. And only two other things I want to touch on is model.eval and model.train because you probably have not seen these yet. So model.train or model.train essentially puts the model in the training mode. The model learns from the data, meaning the weights and biases. If we have both, sometimes you only have weights. Sometimes you, you know, sometimes you have weights and biases, whatever it is. Those are updated during this phase. And then some layers of the model, like dropout and batch normalization, which you may not be familiar with yet. But operate differently in training mode. For example, dropout is active. 
And what dropout does, there's this little hyperparameter that we add up here. It'll look like this: dropout, and it'll be like 0.2. So pretty much what dropout does is it's going to drop out random neurons in the network so that we don't overfit. And this is actually disabled in validation mode, or eval mode. So this will just help our model sort of learn better when it has little pieces of noise and when things aren't in quite the right place, so that you don't have, you know, certain neurons in the network taking priority and just making a lot of the heavy decisions. We don't want that. So dropout will just sort of help our model train better by taking 20% of the neurons out, 0.2, at random. And that's all dropout does. So I'm just going to delete that for now. And then, yeah, with model.train, dropout is active during this phase. During training, it's randomly turning off neurons in the network, and this is to prevent overfitting. We went over overfitting earlier, I believe. And as for evaluation mode, evaluation mode is used when the model is being evaluated or tested, just like it sounds: one mode is for when it's being trained, and the other mode is for when it's being validated or tested. And layers like dropout and batch normalization behave differently in this mode. Dropout is turned off in evaluation, right? Because what we're actually doing is we're using the entire network. We want everything to be working sort of together, and we want to actually see how well it performs. Training mode is when we're just, you know, sampling, doing weird things to try to challenge the network while we're training it. And then evaluating or validating would be when we just take the network in its optimal form and we're trying to see how good the results it produces are. So that's what eval is. And the reason we switched into eval here is just because, well, we are testing the model. We want to see, you know, how well it does with any given set of data from get_batch. And we don't actually need to train here.
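Since the estimate_loss function itself is not walked through in the recording, here is a hedged sketch of what a function along those lines could look like. Everything concrete in it is an assumption: the random stand-in data, the 80/20 split, the tiny embedding-plus-dropout model, and the value of eval_iters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Assumed sizes for illustration only.
vocab_size, block_size, batch_size, eval_iters = 65, 8, 4, 50

data = torch.randint(0, vocab_size, (2000,))   # stand-in for encoded text
n = int(0.8 * len(data))
train_data, val_data = data[:n], data[n:]


def get_batch(split):
    source = train_data if split == "train" else val_data
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x, y


# Toy model with a dropout layer so that train and eval modes actually differ.
model = nn.Sequential(nn.Embedding(vocab_size, vocab_size), nn.Dropout(0.2))


@torch.no_grad()                 # no gradients are tracked inside this function
def estimate_loss():
    out = {}
    model.eval()                 # evaluation mode: dropout is switched off
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits = model(X)
            loss = F.cross_entropy(logits.view(-1, vocab_size), Y.view(-1))
            losses[k] = loss.item()
        out[split] = losses.mean().item()   # average over many batches
    model.train()                # back to training mode: dropout is active again
    return out


losses = estimate_loss()
print(f"train loss: {losses['train']:.3f}, val loss: {losses['val']:.3f}")
```

Averaging over several batches gives a steadier number than a single batch's loss, and switching back to model.train() at the end keeps the training loop unaffected by the evaluation.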
There's no training. If there was training, this torch.no_grad would not be here, because we would be using gradients if training was on. Anyways, that's estimate_loss for you. This function is, you know, just generally good to have in data science, with your training and validation splits and whatnot. And yeah, good for reporting. You know how it is. And we can go ahead and add this down here. So where it says something soon, we'll go losses is equal to estimate_loss. And then we can go ahead and put a, yeah, we don't actually have to put anything in here. Cool. So now let's go ahead and run this. Let me run from the start here. Boom, boom, boom, boom, boom, boom. Perfect. Now we're running for 10,000 iterations. That's interesting. Okay. So, yes. So what I'm going to do actually here is, you can see this loss part is weird. So I'm actually going to change this up, and we're going to go train loss. And we're going to go losses and index it with the train split. And then we're going to go over here and just do the validation loss. We can do validation, or just val for short. And I'm going to make it consistent here, so we have a colon there, a colon here, and then you just go losses and do that. Cool. So I'm going to reduce max_iters up here to only 1000. Run that, run this. Oh, something didn't match. Yeah, so what actually happened here was, since we were using these little single-quote ticks, what was happening is these were matching up with these. And it was telling us, oh, you can't do that. You can't start here and then end there and have all this weird stuff. So pretty much we just need to make sure that these are different. So I'm going to do a double quote instead of single, and then a double quote to finish it off. And as you can see, this worked out here. So I'll just run that again so you guys can see what this looks like. And this is ugly because we have a lot of decimal places.
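The quoting problem described above, together with the decimal trimming that the talk turns to right after, can be sketched like this. The loss values and the step number are made up for illustration.

```python
# Made-up numbers standing in for the output of estimate_loss.
losses = {"train": 2.5123456789, "val": 2.6098765432}
step = 250

# The dictionary keys use single quotes, so the f-string itself uses double quotes.
# Before Python 3.12, reusing the same quote character inside the braces is a syntax error.
message = f"step: {step}, train loss: {losses['train']:.3f}, val loss: {losses['val']:.3f}"
print(message)
```

The format spec after the colon, .3f, means fixed-point with three digits after the decimal point, so this prints "step: 250, train loss: 2.512, val loss: 2.610".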
So we can actually do here is we can add in a little format or a little decimal place reducer if you call it just for, you know, so you can read it. So it's not like some weird decimal number. And you're like, Oh, does this eight matter? Probably not just like the first three digits, maybe. So all we can do here is just add in, I believe this is how it goes. I don't think it's the other way. We'll find out some stuff in Python is extremely confusing to me. But there we go. So I got it right, colon and then period. And as you can see, we have those digits reduced. So I can actually put this down to three F. Wonderful. So we have our train loss and our validation loss. Great job. You made it this far. This is absolutely amazing. This is insane. You've gotten this far in the video. We've covered all the basics, everything you need to know about Bagram language models, optimizers, training loops, reporting losses. I can't even name everything we've done because it's so much. So congratulations that you made it this far. You should go take a quick break, give yourself a pat on the back and get ready for the next part here because it's going to be absolutely insane. We're going to dig into literally state of the art language models and how we can build them from scratch, or at least how we can pre-train them. And some of these terms are going to seem a little bit out there, but I can ensure you by the end of this next section here, you're going to have a pretty good understanding about the state of language models right now. So yeah, go take a quick break and I'll see you back in a little bit. So there's something I'd like to clear up and I actually sort of lied to you a little bit. A little while back in this course about what normalizing is. So I recall we were talking about the softmax function and normalizing vectors. So the softmax is definitely a form of normalization, but there are many forms. 
There's not just one or two normalizations. There are actually many of them, and I have them on my second monitor here, but I don't want to just dump that library of information on your head, because that's not how you learn. So what we're going to do is we're going to plug this into GPT-4. We're going to say, can you list all the forms of normalizing in machine learning? And how are they different from one another? GPT-4 is a great tool. If you don't already use it, I highly suggest you use it, or even GPT-3.5, which is the free version. But yeah, it's a great tool for just quickly learning anything. And then it can give you example practice questions with answers, so you can learn topics in, like, literally minutes that would take you several lectures to learn in a university course. But anyways, there's a few here. So min-max normalization, yep. Z-score, decimal scaling, mean normalization, unit vector or L2, robust scaling, power transformations. Okay, so yeah, and then softmax would be another one. What about softmax? It is indeed a type of normalization, but it's not typically used for normalizing input data. It's commonly used in the output layer. And honestly, we showed that here by actually producing some probabilities. So this isn't something we used in our forward pass. This is something we use in our generate function to get a bunch of probabilities from our logits. So this is, yeah, interesting. It's good to just figure little things like these out, just to give you a bit more of an edge for the future when it comes to engineering these kinds of things. All right, great. So the next thing I want to touch on is activation functions. And activation functions are extremely important in offering new ways of changing our inputs that are not linear.
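To make a few of the normalization schemes listed above concrete, here is a small plain-Python sketch of min-max, z-score, unit-vector (L2) and softmax normalization. The function names and the example values are illustrative assumptions, and only four of the listed schemes are shown.

```python
import math

values = [1.0, 2.0, 3.0, 4.0]


def min_max(xs):
    # Rescale so the smallest value becomes 0 and the largest becomes 1.
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    # Shift to mean 0 and scale to standard deviation 1.
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]


def l2_normalize(xs):
    # Scale the vector so its Euclidean length is 1.
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]


def softmax(xs):
    # Exponentiate, then divide by the total so the outputs sum to 1.
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


for name, fn in [("min-max", min_max), ("z-score", z_score),
                 ("L2", l2_normalize), ("softmax", softmax)]:
    print(name, [round(v, 3) for v in fn(values)])
```

The first three are the kind of thing typically applied to input features, while softmax turns a vector of scores into probabilities, which is how it is used in generate.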
So for example, if we were to have a bunch of linear layers, a bunch of, let me erase this. If we were to have a bunch of, you know, NN dot linears in a row, what would actually happen is they would all just, you know, they would all squeeze together and it would essentially apply one transformation that sums up all of them, kind of. They all sort of multiply together and it gives us one transformation that is kind of just a waste of computation, because let's say you have a hundred of these NN dot linear layers and nothing else. You're essentially going from inputs to outputs, but you're doing a hundred times the computation for just one multiplication. That doesn't really make sense. So what can we do to actually make these deep neural networks important? And what can we offer that's more than just linear transformations? Well, that's where activation functions come in. And I'm going to go over these in a quick second here. So let's go navigate over to the PyTorch docs. So the three activation functions I'm going to cover in this little part of the video are the ReLU, the sigmoid and the tanh activation functions. So let's start off with the ReLU or rectified linear unit. So we're going to use functional ReLU. And the reason why we're not just going to use torch dot NN is because we're not doing any forward passes here. I'm just going to add these into our, I'm going to add these. Let me clear this, clear this output. That's fine. I'm actually going to add these into here and there's no forward pass. We're just going to simply run them through a function and get an output just so we can see what it looks like. So I've actually added this up here from torch dot NN import functional as capital F. It's just kind of a common PyTorch practice capital S. And let's go ahead and start off with the ReLU here. So we can go, I don't know, X equals torch dot tensor. And then we'll make it a negative 0.05, for example. And then we'll go D type equals torch dot float 32. 
And we can go Y equals F dot ReLU of X. And then we'll go ahead and print Y. Oh, has no attribute ReLU. Okay, let's try NN then. Let's try NN and see if that works. Okay, well, that didn't work. And that's fine, because we can simply take a look at this and it'll help us understand. We don't actually need to, we don't need to write this out in code as long as it sort of makes sense. We don't need to write this in the forward pass, really. You're not going to use it anywhere else. So yeah, I'm not going to be too discouraged. That does not work in the functional library. But yeah, so pretty much what this does is if a number is below, if a number is zero or below zero, it will turn that number into zero. And then if it's above zero, it'll stay the same. So this graph sort of helps you visualize that there's a little function here. That might make sense to some people, I don't really care about the functions too much, as long as I can sort of visualize what the function means, what it does, what are some applications that can be used. That usually covers enough for like any function at all. So that's the ReLU function. Pretty cool. It simply offers a non-linearity to our linear networks. So if you have 100 layers deep, and every, I don't know, every second step you put a ReLU, that network's going to learn a lot more things. It's going to learn a lot more linearity, non-linearity, than if you were to just have 100 layers multiplying all into one transformation. So that's what that is. That's the ReLU. Now let's go over the sigmoid. So here we can actually use the functional library. And all sigmoid does is we go 1 over 1 plus exponentiated of negative x. So I'm going to add that here. We could, yeah, why not do that? Negative 0.05 float 32. Sure. We'll go f dot sigmoid. And then we'll just go x, and then we'll print y. Cool. So we get a tensor 0.4875. Interesting. So this little negative 0.05 here is essentially being plugged into this negative x. 
So 1 over 1 plus 2.71 to the power of negative x, where x is negative 0.05. So it's essentially, if we do 2.71 to the power of negative negative 0.05, the two negatives just give us a positive 0.05. So that's about 1.05, and then 1 plus that, so that's 2.05. We just do 1 over that, 1 over 2.05. So we get about 0.487. And what do we get here? 0.487. Cool. So that's interesting. And let's actually look, is there a graph here? Let's look at the sigmoid activation function. Wikipedia. Don't get too scared by this math here. I don't like it either, but I like the graphs. They're cool to look at. So this is pretty much what it's doing here. So yeah, it's just a little curve. It's kind of like a wave, an S shape, but it's cool looking. That's what the sigmoid function does. It squashes any input onto this curve between zero and one. And yeah, the sigmoid function is pretty cool. So now let's move on to the tanh function. Bing, or Microsoft Bing, is giving me a nice description of that. Cool. Yeah, perfect. E to the negative x. I like that. So tanh is a little bit different. There's a lot more exponentiating going on here. So you have, well, I'll just say exp of x minus exp of negative x, divided by exp of x plus exp of negative x. There's a lot of positives and negatives in here. Positive, positive, negative, negative, negative, positive. So that's interesting. Let's go ahead and put this into code here. So I'll go to torch examples. This is our file here, and I'll just go tanh. Cool. So negative 0.05. Cool. What if we do a one here? What will that produce? Oh, 0.76. What if we do a 10? 1.0. Interesting. So this is sort of similar to the sigmoid except it's, you know, let's actually ask GPT what the difference is. When would you use tanh over sigmoid? Let's see here. The sigmoid function and the hyperbolic tangent, or tanh, function are activation functions used in neural networks. They have a similar S-shaped curve but have different ranges.
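The three activation functions covered above can be written out directly from their formulas. This is a plain-Python sketch using only the math module, with function names chosen for illustration. In PyTorch the corresponding calls would be torch.relu, torch.sigmoid and torch.tanh, and torch.nn.functional.relu is a documented function, so the "no attribute" error in the recording was presumably something local to that notebook.

```python
import math


def relu(x):
    # Zero or below becomes zero; anything above zero stays the same.
    return max(0.0, x)


def sigmoid(x):
    # 1 / (1 + e^(-x)); output is always between 0 and 1.
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # (e^x - e^(-x)) / (e^x + e^(-x)); output is always between -1 and 1.
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


for x in (-0.05, 1.0, 10.0):
    print(f"x={x}: relu={relu(x):.4f}, sigmoid={sigmoid(x):.4f}, tanh={tanh(x):.4f}")
```

For x = -0.05 the sigmoid comes out at about 0.4875, matching the tensor printed in the walkthrough, and tanh gives about 0.76 for an input of 1 and a value that rounds to 1.0 for an input of 10.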
So sigmoid outputs values between 0 and 1, while tanh is between negative 1 and 1. So, you know, if you're getting a probability, for example, you want it to be between 0 and 1, meaning percentages or decimals. So a 0.5 would be 50%, 0.87 would be 87%. And that's what the sigmoid function does. It's quite close to the softmax function actually, except the softmax works over a whole vector so that everything sums to one: it prioritizes the bigger values and puts the smaller values at a lower priority. That's all the softmax does. It's kind of a sigmoid on steroids. And the tanh outputs between negative 1 and 1. So yeah, you could maybe even start theory crafting and thinking of some ways you could use the tanh function and the sigmoid in different use cases. So that's kind of a general overview on those. So bigram language models are finished. All of this we finished here is now done. You're back from your break, if you took one. If you didn't, that's fine too. But pretty much we're going to dig into the transformer architecture now, and we're actually going to build it from scratch. So there was a fairly recent paper that proposed what's called the transformer model. And this uses a mechanism called self-attention. Self-attention is used in these little multi-head attention bricks here. And there's a lot that happens. So there's something I want to clarify before we jump right into this architecture and just dump a bunch of information on your poor little brain right now. A lot of these networks can be extremely confusing to beginners at first. So I want to make it clear: it's perfectly okay if you don't understand this at first. I'm going to try to explain this in the best way possible. Believe me, I've seen tons of videos of people explaining the transformer architecture, and all of them have been to some degree a bit confusing to me as well. So I'm going to try to clarify all those little pieces of confusion. Like what does that mean?
You didn't cover that piece. I don't know what's going on here. I'm going to cover all those little bits and make sure that nothing is left behind. So you're going to want to sit tight and pay attention for this next part here. So yeah, let's go ahead and dive into the general transformer architecture and why it's important. So in the transformer network, you have a lot of computation going on. You have some adding and normalizing. You have some multi-head attention. You have some feed-forward networks. There's a lot of multiplying, a lot of matrix multiplication. So a question I actually had at first was, well, if you're just multiplying these inputs by a bunch of different things along the way, you should just end up with some random value at the end that maybe doesn't really say much about the initial input. And that's actually correct. For the first few iterations, the model has absolutely no context as to what's going on. It is clueless. It is going in random directions and it's just trying to find the best way to converge. So this is what machine learning and deep learning is actually all about: having all these little parameters in the adding and normalizing, the feed-forward networks, even multi-head attention, and trying to optimize those parameters to produce an output that is meaningful, that will actually help us produce almost perfectly English-like text. And so this is the entire process of pre-training. You send a bunch of inputs into a transformer and you get some output probabilities that you use to generate from. And what attention does is it assigns different scores to each token in a sentence. For tokens, you have character-level, subword-level, and word-level tokens. So you're pretty much just mapping bits of attention to each of these, as well as what its position means. So you get two words that are right next to each other.
But then if you don't actually positionally encode them, it doesn't really mean much, because it's like, oh, these could be 4,000 characters apart. So that's why you need both to put attention scores on these tokens and to positionally encode them. And that's what's happening here. So what we do is we get our inputs. I mean, we went over this with bigram language models. We feed our X and Y. So X would be our inputs, Y would be our targets or outputs. And what we're going to do is give these little embeddings. So I believe we went over embeddings a little while ago. And pretty much what those mean is the embedding table is going to have a little row for each token, and that row is going to store some vector as to what that token means. So let's say you had the character E, for example. The sentiment, or the vector, of the character E is probably going to be vastly different than that of Z, right? Because E is a very common vowel and Z is one of the most uncommon letters in the English language, if not the most uncommon. So these embeddings are learned. We have these both for our inputs and our outputs. We give them positional encodings like I was talking about. And there's ways we can do that. We can actually use learnable parameters to assign these encodings. A lot of these are learnable parameters, by the way, and you'll see that as you delve more and more into transformers. But yeah, so we've given these inputs embeddings and positional encodings, and same thing with the outputs, which are essentially just shifted right: you have i up to i plus block size for inputs, and then i plus one up to i plus block size plus one for targets, right? Or whatever little thing we employed in our bigram language models. Can't remember quite what it was, or even if we did that at all. I'm just speaking gibberish right now, but that's fine because it's going to make sense in a little bit here.
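Here is a rough sketch of what "embeddings plus positional encodings" and "targets shifted right" can look like in PyTorch. The sizes (`vocab_size`, `block_size`, `n_embd`) and the random `data` tensor are made-up stand-ins, and using a second `nn.Embedding` for positions is one common choice for learnable positional encodings, assumed here for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 26   # hypothetical: one id per lowercase letter
block_size = 8    # how many tokens the model sees at once
n_embd = 16       # length of each embedding vector

# Fake "text" already encoded as integer token ids
data = torch.randint(0, vocab_size, (100,))

# Inputs start at i; targets are the same window shifted right by one
i = 10
x = data[i : i + block_size]
y = data[i + 1 : i + block_size + 1]

token_embedding = nn.Embedding(vocab_size, n_embd)     # one learned row per token
position_embedding = nn.Embedding(block_size, n_embd)  # one learned row per position

tok = token_embedding(x)                             # (block_size, n_embd)
pos = position_embedding(torch.arange(block_size))   # (block_size, n_embd)
h = tok + pos   # what the token means plus where it sits

print(x)
print(y)
print(h.shape)
```

Notice that `y` is just `x` moved over by one token: the target at each position is the token that comes next.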
So what I'm going to actually do is I'm not going to read off of this right here, because this is really confusing. So I'm going to switch over to a little sketch that I drew out. And this is pretty much the entire transformer with a lot of other things considered that the initial image doesn't really put into perspective. So let's go ahead and jump into what's going on in here from the ground up. So like I was talking about before, we have some inputs and we have some outputs which are shifted right. And we give each of them some embedding vectors and positional encodings. So from here, let's say we have n layers. This is going to make sense in a second. n layers is set to four. So you can see we have four encoders and we have four decoders. So four is actually the amount of encoders and decoders we have. We always have the same amount of each. So if we have 10 layers, that means we'd have 10 encoders and 10 decoders. And pretty much what would happen is after this input embedding and positional encoding, we feed that into the first encoder layer, and then the next, and then the next, and as soon as we hit the last one, we feed its output into each of these decoder layers. So only the last encoder will feed into these decoders. And pretty much these decoders will all run. They all learn different things. And then they'll apply a linear transformation at the end of it. This is not in the decoder function. This is actually after the last decoder. It'll apply a linear transformation to pretty much simplify or give a summary of what it learned. And then we apply a softmax on that new tensor to get some probabilities to sample from, like we talked about in the generate function in our bigram. And then once we get these probabilities, we can then sample from them and generate tokens.
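The last three steps described above (linear transformation, softmax, sample) can be sketched roughly like this. The `decoder_output` tensor is a random stand-in for whatever the last decoder layer produces, and the name `lm_head` is just a label chosen for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size = 26   # hypothetical vocabulary size
n_embd = 16       # hypothetical embedding size

# Stand-in for the last decoder layer's output at the final position
decoder_output = torch.randn(1, n_embd)

# Linear transformation applied after the last decoder:
# maps from embedding size to one score (logit) per token in the vocabulary
lm_head = nn.Linear(n_embd, vocab_size)
logits = lm_head(decoder_output)            # (1, vocab_size)

# Softmax turns the scores into probabilities that sum to 1
probs = F.softmax(logits, dim=-1)

# Sample one token id according to those probabilities
next_token = torch.multinomial(probs, num_samples=1)

print(probs.sum())
print(next_token)
```

With an untrained `lm_head` the probabilities are close to random, which matches the point made earlier: the model starts out clueless and training is what makes these probabilities meaningful.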
And that's kind of like the first little step here. That's what's going on. We have some encoders. We have some decoders. We do a transformation to summarize. We have a softmax to get probabilities. And then we generate based on those probabilities. Cool. Next up, in each of these encoders, this is what it's going to look like. So we have multi-head attention, which I'm going to dive into in a second here. So after this multi-head attention, we have a residual connection. So in case you aren't familiar with residual connections, I might have gone over this before. But pretty much what they do is it's a little connector. So let's say you get some inputs x. You have some inputs x down here and you put them into some sort of function here, some sort of feed-forward network, whatever it is. A feed-forward network is essentially just a linear, a ReLU, and then a linear. That's all a feed-forward network is right here. Linear, ReLU, linear. And all you do is you also wrap a copy of those inputs around the feed-forward network, so that copy doesn't go through it, and then you add it to the output. So you had some x values here go through the linear, ReLU, linear, and then you had some wrap around. And then right here, you simply add them together and you normalize them using some layer norm, which we're going to cover in a little bit. And the reason residual connections are so useful in transformers is because when you have a really deep neural network, a lot of the information from the first steps is actually forgotten. So if you have your first few encoder layers and your first few decoder layers, a lot of the information here is going to be forgotten because it's not being carried through. The first steps of it aren't explicitly being carried through and skipped past the functions. And yeah, you can sort of see how they would just be forgotten.
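As a rough sketch of the "linear, ReLU, linear" feed-forward network with a residual connection and layer norm around it: the 4x expansion in the hidden layer is an assumption borrowed from common practice, and the sizes are made up for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 16  # hypothetical embedding size

class FeedForward(nn.Module):
    """Linear, ReLU, linear."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),  # expand (4x is an assumed choice)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),  # shrink back to the input size
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

ffwd = FeedForward(n_embd)
ln = nn.LayerNorm(n_embd)

x = torch.randn(2, 8, n_embd)  # (batch, time, n_embd)

# Residual connection: the untouched x "wraps around" ffwd and is added back,
# then the sum is normalized (add, then norm)
out = ln(x + ffwd(x))

print(out.shape)
```

Because `x` is added back unchanged, whatever the feed-forward network does, the original signal still reaches the next layer, which is the "not forgetting" effect described above.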
So residual connections are sort of just a cheat for getting around that, for not having deep neural networks forget things from the beginning, and for having all the layers work together to the same degree. So residual connections are great that way. And then at the end there, you would add them together and then normalize. And there's two different ways that you can do this add and norm. There's add and norm, and then norm and add. So these are two separate architectures that you can do in transformers. And both of these are sort of like meta architectures, but pretty much pre-norm is normalize then add, and post-norm is add then normalize. So what the Attention Is All You Need paper, written by a bunch of research scientists, initially proposed was that you add these together and then normalize them. So that is what we call the post-norm architecture. And then pre-norm is just flipping them around. So I've actually done some testing with pre-norm and post-norm, and the original transformer paper's post-norm turned out to be actually quite a lot better, at least for training very small language models. If you're training bigger ones, it might be different. But essentially, we're just going to go by the rules that are used in here. So add and norm. We're not going to do norm and add in this video specifically, because add and norm worked better for me, and we just don't want to break any of the rules and go outside of the paper, because then that starts to get confusing. And actually, if you watch the Andrej Karpathy lecture on building GPTs from scratch, he actually implemented it in the pre-norm way. So normalize then add. So yeah, based on my experience, what I've done on my computer here, the post-norm architecture works quite a bit better. So that's why we're going to use it. We're going to do add then normalize. So then we essentially feed this into a feed-forward network, which we covered earlier. And then, how did it go?
We have a, yeah, so in our encoder, we do a residual connection from here to here. And then another residual connection around the outside of our feed-forward network. So each time we're doing some computation block in here, we're going to have a res connection. Same with our feed-forward res connection. And then, of course, the output from here, when it exits, is going to feed into the next encoder block if it's not the last encoder. So this one is going to do all this, feed into that one, which does the same thing, feeds into this one, which feeds into that one. And then the output of this last one is going to feed into each of these decoders, all the same information. And yeah, so that's a little bit scoped in as to what these encoders look like. So now that you know what the encoder looks like and what the feed-forward looks like, we're going to go into multi-head attention, sort of the premise, sort of the highlight of the transformer architecture and why it's so important. So multi-head attention, we call it multi-head attention because there are a bunch of these different heads learning different semantic info from a unique perspective. So let's say you have 10 different people looking at the same book. Let's say they're all reading the same Harry Potter book. These different people, they might have different cognitive abilities, they might have different IQs, they might have been raised in different ways, so they might interpret things differently. They might look at little things in that book and they'll imagine different scenarios, different environments from the book.
And essentially why this is so valuable is because we don't just want to have one person, just one perspective on this. We want to have a bunch of different heads in parallel looking at this same piece of data, because they're all going to capture different things about it. And keep in mind each of these heads in parallel, these different perspectives, they have different learnable parameters. So they're not all the same one looking at this piece of data. They all have different learnable parameters. So you have a bunch of these at the same time learning different things, and that's why it's so powerful. So this scaled dot-product attention runs in parallel, which means we can scale that to the GPU, which is very useful. It's good to touch on that. Anything with the GPU that you can accelerate is just an automatic win, because parallelism is great in machine learning. Why not have parallelism, right? If it's just going to be running on the CPU, what's the point? That's why we love GPUs. Anyways, yeah, so you're going to have these things that are called keys, queries, and values. I'll touch on those in a second here, because keys, queries, and values sort of point to self-attention, which is literally the entire point of the transformer. The transformer wouldn't really mean anything without self-attention. So I'll touch on those in a second here and we'll actually delve deeper as we hit this block. But yeah, you have these keys, queries, and values. They go into scaled dot-product attention. So a bunch of these are running in parallel, and then you concatenate the results from all these different heads. You have all these different people, you concatenate all of them, and then you apply a linear transformation to pretty much summarize that, and then do your add and norm, then your feed-forward network.
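A minimal sketch of that structure, with several heads that each own their key, query, and value parameters, followed by a concatenate and a linear transformation, might look like the following. The class names and sizes are illustrative, this version has no masking, and the Python loop over heads is only for readability (the work is independent per head, which is what makes it easy to run in parallel on a GPU).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class Head(nn.Module):
    """One attention head with its own key, query, and value parameters."""
    def __init__(self, n_embd: int, head_size: int):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k, q, v = self.key(x), self.query(x), self.value(x)   # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.shape[-1])  # (B, T, T)
        weights = F.softmax(scores, dim=-1)
        return weights @ v                                    # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads side by side, concatenated, then summarized by a linear layer."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList([Head(n_embd, head_size) for _ in range(n_head)])
        self.proj = nn.Linear(n_head * head_size, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # concatenate all heads
        return self.proj(out)                                 # linear "summary"

mha = MultiHeadAttention(n_embd=32, n_head=4)
x = torch.randn(2, 8, 32)   # (batch, time, n_embd)
out = mha(x)
print(out.shape)
```

Each `Head` here is one of the "readers" from the book analogy: same input, different learned parameters, different output.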
So that's what's going on in multi-head attention. You're just doing a bunch of self-attention in parallel, concatenating, and then continuing on with this part. So scaled dot-product attention, what is that? So let's just start from the ground up here. We'll just go from left to right. So you have your keys, queries, and values. What do your keys do? Well, let's just say you have a token in a sentence, okay? Let me just scroll down here to a good example. So self-attention uses keys, queries, and values. Self-attention helps identify which of the tokens in any given sentence are more important, and how much attention you should pay to each of those characters or words, whatever you're using. We'll just use words to make it easier to understand for the purpose of this video. But essentially imagine you have these two sentences here. Let me bring out my little piece of text. Oh, that didn't work. So imagine you have "Server, can I have the check?" And then you have "Looks like I crashed the server." So I mean, both of these have the word server in them, but they mean different things. In the first, server means the waiter or the waitress or whoever is billing you at the end of your restaurant visit. And in "Looks like I crashed the server," it's like, oh, there's actually a server running in the cloud, not a person that's billing me, but an actual server that's maybe running a video game. And these are two different things. So what attention can do is it can actually identify which words would get attention here. So it can say, "Server, can I have the check?" Can I have?
So it's maybe you're looking for something, you're looking for the check, and then server is like, oh, well, in this particular sequence, in the sentiment of this sentence here, server is specifically tied to this one meaning, maybe a human, someone at a restaurant. And then in "crashed the server," crashed is going to get a very high attention score, because you don't normally crash a server at a restaurant. That doesn't particularly make sense. So when you have different words like this, what self-attention will do is it will learn which words in the sentence are actually more important and which words it should pay more attention to. So that's really all that's going on here. And K, the key, is essentially going to emit a little tensor for each of these words saying, you know, what do I contain? And then the query is going to say, what am I looking for? So what's going to happen is, let's say for server, it's going to look for things like check or crashed. So if it sees crashed, then that means the key and the query are going to multiply and it's going to get a very high attention score. But if you had something like "the," it's like, oh, "the" is in literally almost any sentence. So that doesn't mean much. We're not going to pay attention to those words. So that's going to get a very low attention score. And all attention is, is you're just dot producting these vectors together. So you get a key and a query, you dot product them. We already went over dot products in this course before. And then all you do here, and this is a little bit of a confusing part, is you just scale it by one over the square root of the length of a row in the keys or queries matrix, otherwise known as dk.
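To see why that one-over-square-root-of-dk scaling is there, here is a small experiment, assuming the key and query entries are random with mean 0 and variance 1 (a simplification chosen for the demo, not something stated in the lecture). The raw dot products spread out more as dk grows, and the scaled ones stay at about the same spread.

```python
import math
import torch

torch.manual_seed(0)

results = {}
for d_k in (10, 1000):
    # 10,000 random query/key pairs, each a vector of length d_k
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)

    raw = (q * k).sum(dim=-1)          # dot product of each query with its key
    scaled = raw / math.sqrt(d_k)      # scale by 1 / sqrt(d_k)

    results[d_k] = (raw.std().item(), scaled.std().item())
    print(d_k, results[d_k])
```

With longer vectors there are more terms to add up, so the unscaled dot products get bigger; dividing by the square root of dk keeps them in a similar range regardless of vector length.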
So let's say we have our key and our query. These are all going to be the same length, by the way. Let's say our keys are going to be 10 numbers long, so our queries are going to be 10 numbers long as well. So it's going to do one over the square root of 10, if that makes sense. And so that's just essentially a way of preventing these dot products from exploding, right? We want to scale them because as the length of the vectors increases, so will the resulting dot product, because there's more of these to multiply and add up. So we pretty much just want to scale it by using an inverse square root. And that'll just help us with scaling, make sure nothing explodes in unnecessary ways. And then the next little important part is using torch.tril, which I imagine we went over in our examples here. Tril. Yeah. So you can see that it's a lower triangular matrix of ones. And these aren't going to be ones in our self-attention here with our torch.tril masking. What this is going to be is the scores at each time step, a combination of scores at each time step. So if we're only looking at the first time step, we should not have access to the rest of things, or else that would be cheating. We shouldn't be allowed to look ahead because we haven't actually produced these yet. We need to produce these before we can put them into perspective and put a weight on them. So we're going to set all these to zero, and then we go to the next time step. Okay. So now we've just generated this one, we haven't generated these yet. So we can't look at them. And then as the time step increases, we know more and more context about all of these tokens. So that's all that's doing. Masked attention is pretty much just saying, we don't want to look into the future.
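Here is a small sketch of that masking with `torch.tril`. One common way to get the zeros, assumed here, is to fill the future positions with negative infinity before the softmax, so that they come out of the softmax as exactly zero weight. The random `scores` tensor is a stand-in for real scaled key-query scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T = 4                                   # number of time steps (tokens)
scores = torch.randn(T, T)              # stand-in for scaled key-query scores

mask = torch.tril(torch.ones(T, T))     # lower-triangular matrix of ones
print(mask)

# Wherever the mask is 0 (the future), replace the score with -inf
masked = scores.masked_fill(mask == 0, float("-inf"))

# After softmax, every future position has weight exactly 0,
# and each row still sums to 1 over the positions it is allowed to see
weights = F.softmax(masked, dim=-1)
print(weights)
```

Row 0 can only see token 0, row 1 can see tokens 0 and 1, and so on, which is the "more context as the time step increases" behavior described above.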
We want to only guess with what we currently know in our current time step and everything before it. You can't jump into the future. You can only look at what happened in the past and do stuff based on that. Right? Same thing applies to life. You can't really skip to the future and say, hey, if you do this, you're going to be a billionaire. No, that would be cheating. You're not allowed to do that. You can only look at the mistakes you made and say, how can I become a billionaire based on all these other mistakes that I made? How can I become as close to perfect as possible? Which no one can ever be perfect, but that's my little analogy for the day. So that's masked attention. Pretty much just not letting us skip time steps. So that's fun. Let's continue. Two more little things I want to touch on before I jump forward here. So these keys, queries, and values, each of these is learned through a linear transformation. Just an nn.Linear is applied, and that's how we get our keys, queries, and values. So that's just a little touch there. If you're wondering, how do we get those? It's just an nn.Linear transformation. And then as for our masking, we don't actually apply this all the time. You might've seen right here, we have multi-head attention, multi-head attention, and then masked multi-head attention. So this masked attention isn't used all the time. It's actually only used in one out of the three attentions we have per layer. So I'll give you a little bit more information about that as we progress more and more into the architecture and as we learn more about it. I'm not going to dive into that quite yet though. So let's just continue on with what's going on. So we have a softmax, and why is softmax important? Well, I actually mentioned earlier that softmax is not commonly used as a normalization method, but here we're actually using softmax to normalize.
So when you have all these attention scores, essentially what the softmax is doing is it's going to exponentiate and normalize all of these. So all of the attention scores that have scored high, like maybe 50 to 90 percent or whatever it is, those are going to take a massive effect in that entire attention tensor, if you want to call it that. And that's important. Well, it might not seem important, but it's essentially just giving the model more confidence as to which tokens matter more. So for example, if we just did a normal normalization, we would have words like server and crashed, and then server and check. And you would just know a decent amount about those. Those would be paid attention to a decent amount, because they multiply together quite well. But if you softmax those, then it's like, those are almost the only tokens that matter. So it's looking at the context of those two. And then we're learning about the rest of the sentence based on just the sentiment of those attention scores, because they're so high priority, because they multiply together to such a high degree. We want to emphasize them and then basically let the model learn more about which words matter more together. So that's pretty much just what the softmax does. It increases our confidence in attention. And then there's a matrix multiply, where we go back to our V here, and this is a value. So essentially what this is, is just a linear transformation that we apply on our inputs, and it's just going to have some value about what exactly those tokens are. And after we've gotten all of our attention, our softmax, everything done, it's just going to multiply the original values by everything we've gotten so far, just so that you don't have any information that's really lost or we don't have anything scrambled.
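To make the difference between "a normal normalization" and softmax concrete, here is a tiny comparison followed by the multiply with the values. The three scores and the value vectors are made-up numbers chosen for the example, and "normal normalization" is taken here to mean dividing by the sum.

```python
import torch
import torch.nn.functional as F

# Hypothetical raw attention scores for three tokens; the last one scored high
scores = torch.tensor([1.0, 2.0, 8.0])

plain = scores / scores.sum()          # plain normalization: divide by the total
soft = F.softmax(scores, dim=-1)       # softmax: exponentiate, then normalize

print(plain)   # the high score gets a large share, the others still matter a bit
print(soft)    # the high score takes almost all of the weight

# Hypothetical value vectors, one row per token
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [5.0, 5.0]])

# The output is a blend of the value vectors, weighted by the attention weights
out = soft @ values
print(out)
```

Both versions sum to 1, but the softmax output is dominated by the highest-scoring token, so the blended `out` ends up very close to that token's value vector.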
We just have a general idea of, okay, these are actually all the tokens we have, and then these are the ones we found interesting, the attention scores. So yeah, we have an output that's just a blend of input vector values and attention placed on each token. And that's pretty much what's happening in scaled dot-product attention, in parallel. So we have a bunch of these that are just happening at the same time. Many of these happening at the same time. And yeah, so that's what attention is. That's what feed-forward networks are. That's what residual connections are. And then after we've fed these into our decoders and gotten an output, we apply a linear transformation to summarize, softmax to get probabilities, and then we generate based on that, based on everything that we learned. And actually what I didn't quite write a lot about was the decoder. So what I'm actually going to talk about next is something I didn't fill in yet, which is why the heck do we use masked attention here, but not in these places? So why the heck do we have multi-head attention here and here, but masked attention here? So why is this? Well, the purpose of the encoder is to pretty much learn the present, past, and future and put that into a vector representation for the decoder. That's what the encoder does. So it's okay if we look into the future and understand tokens that way, because we're technically not cheating. We're just learning the different attention scores. And yeah, we're just using that to help us predict based on what the sentence looks like, but not explicitly giving it away, just giving it an idea of what to look for, type of thing. And then we use masked attention here because, well, we don't want to look ahead. We just want to look at the present and the past. And later on, see, we're not given anything explicit here. We're not given anything yet.
So we want to make some raw guesses. They're not going to be very good guesses at first. We want to make some raw guesses. And then later on we can feed these, the added and normalized guesses, into this next multi-head attention, which isn't masked. And then we can use this unmasked multi-head attention with the vector representation given by the encoder. And then we can sort of do more useful things with that, rather than just being forced to guess raw attention scores and then being judged for that. We can introduce more and more elements in this decoder block to help us learn more meaningful things. So we start off with this masked multi-head attention, and then afterwards we do a multi-head attention with the vector representation from the encoder. And then we can make decisions on that. So that's kind of why that works this way. If you don't think I explained it amazingly well, you can totally just ask GPT-4 about it, or GPT-3.5, and get a pretty decent answer, but that's how that works. And yeah, another thing I kind of wanted to point out here is these linear transformations that you see. I mean, there's a lot of them in the scaled dot-product attention. So you have your linears for your key, query, and value. So these linears, as well as the one up here, linears are great for just expanding or shrinking a bunch of important info into something easier to work with. So if you have a large vector containing a bunch of info learned from this scaled dot-product attention, you can sort of just compress that into something more manageable through a linear transformation.
And that's essentially what's happening here with the linear before the softmax, as well as in our scaled dot-product attention with these linear transformations from our inputs to keys, queries, and values. That's all that's happening. Yeah, if you want to read more about linear transformations and the importance of them, you can totally go out of your way to do that, but that's just a brief summary as to why they're important: just shrinking or expanding vectors. So that's sort of a brief overview on how transformers work. However, in this course we will not be building the full transformer architecture. We'll be building something called a GPT, which you're probably familiar with. And GPT stands for Generative Pre-trained Transformer. And pretty much what this is, it's pretty close to the transformer, this architecture here, except it only adopts the decoder blocks and it takes away this multi-head attention here. So all we're doing is we're removing the encoder as well as what the encoder plugs into. So all we have left is just some inputs, our masked multi-head attention, our post-norm architecture. And then right after this, we're not going to a non-masked multi-head attention, but rather to a feed-forward network and then a post-norm. So that's all it is. It's just one, two, three, four. That's all it's going to look like. That's all the blocks are going to be. It is still important to understand the transformer architecture itself, because you might need that in the future. And it is sort of a good practice in language modeling to have a grasp on it and to understand why we use masked multi-head attention in the decoder and why we don't use it in the encoder, and stuff like that. So anyways, we're going to go ahead and build this.
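The "one, two, three, four" of a decoder-only block (masked multi-head attention, add and norm, feed-forward, add and norm) can be sketched like this. To keep it short, this sketch uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for attention heads written by hand, and the sizes and 4x feed-forward expansion are assumptions for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Block(nn.Module):
    """Decoder-only block in the post-norm style: add, then normalize."""
    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        # Built-in attention used as a stand-in for hand-written heads
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.shape[1]
        # For a boolean attn_mask, True marks positions that may NOT be attended to,
        # so the strict upper triangle (the future) is set to True
        future = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        y, _ = self.attn(x, x, x, attn_mask=future, need_weights=False)
        x = self.ln1(x + y)               # steps 1 and 2: masked attention, add and norm
        x = self.ln2(x + self.ffwd(x))    # steps 3 and 4: feed-forward, add and norm
        return x

block = Block(n_embd=32, n_head=4)
x = torch.randn(2, 8, 32)   # (batch, time, n_embd)
out = block(x)
print(out.shape)
```

A quick way to convince yourself the mask works: change only the last token of `x` and run the block again. The outputs at all earlier positions should stay the same, because those positions are never allowed to look at the last token.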
If you need to look back, if something wasn't quite clear, definitely skip back a few seconds or a few minutes through the video and just make sure you clarify everything up to this point. But yeah, I'm going to go over some more math on the side here and just some other little widgets we're going to need for building the decoder GPT architecture. So let's go ahead and do that. Before we actually jump into building the GPT from scratch, what I want to do is linger on self-attention for a little bit, or rather just the attention mechanism and the matrix multiplication behind it and why it works. So I'm going to use a whiteboard to illustrate this. So we're going to go ahead and draw out, we'll just use maybe a four-token sequence of words here. My dog has fleas. Okay. So we're going to highlight which words are probably going to end up correlating together, where the attention mechanism is going to multiply them together to a high amount based on what it learns about those tokens. So this is going to help us illustrate that, and what the GPT is going to see from the inside. So I'm going to go ahead and draw this out here. So just make a table here. We'll give it four of these and then draw a little line through the middle. My drawing might not be perfect, but it's definitely better than on paper. Okay. So cool. We have my, dog, has, fleas across, and then my, dog, has, fleas down. Cool. So to what degree are these going to interact? Well, my and my, I mean, it doesn't really give away that much. It's only just the start. So maybe it'll interact to a low amount. And then you have my and dog. These might interact to a medium amount because it's like, your dog. So you might go medium like that. And then my and has, well, that doesn't give away too much.
So maybe that'll be low. And then my and fleas, it's like, oh, that doesn't really mean much. My fleas, that doesn't really make sense. Maybe we'll have it interact to a low amount. And then these going down would be the same thing. So dog and my would be medium, and then has and my would be low. And then fleas and my would also be low. And then you have dog and dog. So these might interact to a low amount. They're the same word. So we'll just forget about that. And then we have dog and has. So these might interact to a medium amount. The dog has something. And then dog and fleas. These might interact to a high amount because they're associating the dog with something else, meaning fleas. We have has and dog. These would interact to the same amount, so medium. And then has and has would probably be a low amount. And then we could do high for this one as well, fleas and dog. So these will interact to a high amount. And then we have has and fleas. So these could interact maybe a medium amount, both ways, and then fleas and fleas, which would be low. So I'll just highlight this in green here. So you get all the medium and high attention scores. You'd have your medium here, medium here, high, medium, medium, high, medium, and medium. So you can see these are sort of symmetrical. And this is what the attention map will look like. Of course, there's going to be some scaling going on here based on the amount of actual attention heads we have running in parallel. But that's beside the point. Really what's going on here is the network is going to learn how to place the right attention scores, because attention is simply being used to generate tokens. That's how the GPT works. It's using attention to generate tokens. So we want the way those attention scores are placed to be something the network learns.
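Written out as a table, the whiteboard attention map for "my dog has fleas" looks roughly like the tensor below. The numbers for low, medium, and high are placeholders picked for this sketch, since the lecture only uses the labels, and a trained model's real attention map would come from learned keys and queries rather than being filled in by hand.

```python
import torch

tokens = ["my", "dog", "has", "fleas"]

# Placeholder numbers standing in for the whiteboard's low / medium / high labels
LOW, MED, HIGH = 0.1, 0.5, 0.9

#                         my   dog   has   fleas
attn_map = torch.tensor([
    [LOW, MED,  LOW,  LOW ],   # my
    [MED, LOW,  MED,  HIGH],   # dog
    [LOW, MED,  LOW,  MED ],   # has
    [LOW, HIGH, MED,  LOW ],   # fleas
])

# The hand-drawn map is symmetric: (dog, fleas) matches (fleas, dog), and so on
print(torch.equal(attn_map, attn_map.T))

# The cells highlighted in green: every medium or high pair
strong = [
    (tokens[i], tokens[j])
    for i in range(len(tokens))
    for j in range(len(tokens))
    if attn_map[i, j] >= MED
]
print(strong)
```

That gives the eight highlighted cells from the drawing: six medium and two high, with dog and fleas as the strongest pair.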
We can make those learnable through all of the embeddings. Everything we have in the entire network can make sure that we place effective attention scores and that they're measured properly. So obviously I didn't quantify these very well, like not with floating point numbers, but this is sort of the premise of how it works and how we want the model to look at different tokens and how they relate to one another. So that's what the attention mechanism looks like under the hood. So this is what the actual GPT, or decoder-only transformer, architecture looks like. And yeah, I'm just going to go through this step by step here, and then we can hopefully jump into some of the math and code behind how this works. So we have our inputs, embeddings, and positional encodings. We have only decoder blocks, then some linear transformation, and then pretty much we do a softmax probability distribution. We sample from that, we start generating some output, and then we compare that to our inputs, see how far off it was, and optimize from that. In each of these decoder blocks we have our multi-head attention, a residual connection, a feed-forward network that consists of a linear, ReLU, linear in that order, and then another residual connection. In each of these multi-head attentions we have multiple heads running in parallel, and each of these heads is going to take a key, query, and value. These are all learnable linear transformations, and we're going to basically dot product the key and query together.
Concatenate these results and do a little transformation to sort of summarize it afterwards. And then what actually goes on in the dot product attention is just the dot product of the key and query, the scaling to prevent these values from exploding, to prevent the vanishing gradient problem, and then we have our masking to make sure the model isn't looking ahead and cheating, and then softmax, matrix multiply with the values, we output that, and then kind of fill in the blank there. So cool, this is pretty much the transformer architecture, a little bit dumbed down, a little smaller in complexity so it's easier to understand, but that's kind of the premise of what's going on here. It still implements a self-attention mechanism. So as you can see, now I am currently on my MacBook with the M2 chip. I'm not going to go into the specs or why it's important, but really quick I'm just going to show you how I SSH onto my other PC. So I type ssh just like that, and then I type in my IPv4 address, and then I just enter a simple password that I've memorized. Cool, so now I'm on my desktop computer, and this is the command prompt that I use for it. So awesome, I'm going to go ahead and go into the little freeCodeCamp directory I have. So cd Desktop, cd Python testing, and then here I'm actually going to activate my CUDA virtual environment. Oop, not accelerate. We go CUDA, activate, cool, and then I'm going to cd into the freeCodeCamp GPT course directory. Awesome. So now if I actually type code on here like this to open up VS Code, it doesn't do that. So there's another little way I have to do this: you go into VS Code, go into the little Remote Explorer here, and then you can simply connect. So I'm just going to connect in the current window itself. There's an extension you need for this called OpenSSH Server, I think that's what it's called, and then it's simply the same password I used in the command prompt, if I can type it correctly.
Awesome, so now it's SSH'd into my computer upstairs, and I'm just going to open the little editor in here. Nice, so you can see that it looks just like that, that's wonderful. So now I'm going to open this in Jupyter Notebook. Actually, cd into Desktop here, cd Python testing, CUDA, Scripts, activate, cd into the freeCodeCamp GPT course directory, and then code like that, and it will open. Perfect, how wonderful is that. And I've already done a little bit of this here, but we're going to jump into exactly how we can build up this transformer, or GPT, architecture in the code itself. So I'm going to pop over to my Jupyter notebook in here. Cool, I have this little address, I'm going to paste that into my browser. Awesome, so we have this GPT v1 Jupyter notebook. So what I've actually done is some imports here: I've brought in all of the Python imports, all the hyperparameters that we used from before, the data loader, the tokenizer, the train and val splits, the get batch function, estimate loss, just everything that we're going to need, and it's all in neatly organized little code blocks. So awesome, now what? Well, let's go ahead and continue here with the actual upgrading from the very top level. So remember, and you can skip back to this, I showed the architecture of the GPT sort of lined out in a little sketch that I did, and all we're going to do is pretty much build up from the high-level general architecture down to the technical stuff, down to the very root dot product attention that we're going to be doing here. So I'm going to go ahead and start off with this GPT language model, which I just renamed: I replaced bigram with GPT here. So that's all we're doing, and I'm going to add some little code bits and just walk through step by step what we're doing. So let's do that. So great, next we're going to talk about these positional encodings.
So I go back to the paper here, or rather this architecture. We initially have our tokenized inputs, and then we give them embeddings, so token embeddings, and then a positional encoding. So this positional encoding, going back to the attention paper, is right here. So all it does is, for each token position, it fills in every even index of the encoding vector with this function and every odd index with this function. You don't really need to know what it's doing other than the fact that these are the different sine and cosine functions that it uses to apply positional encodings to the tokenized inputs. So let's say we have hello world. Hello has five characters, so h will be position zero, e will be position one, then l, then the next l, and so on. Essentially it just iterates over the positions, and each position gets plugged into those same fixed sine and cosine functions. The thing with fixed functions is that they don't actually learn about the data at all, because they're fixed. So another way we could do this would be using nn.Embedding, which is what we use for the token embedding. So I'm going to go ahead and implement this here in our GPT v1 script. I'm going to add on this line: self.position_embedding_table equals nn.Embedding of block_size and n_embd. So the block size is the sequence length, which in our case is going to be eight. So there's going to be eight tokens, and this means we're going to have eight different indices, and each one is going to be of size n_embd, and this is a new parameter I actually want to add here. So n_embd will not only be used in the positional embedding, but it will also be used in our token embedding, because when we actually store information about the tokens we want that to be in a very large vector.
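For reference, here is a small sketch of the fixed sine and cosine encoding from the attention paper, written in plain Python. In that formula the even and odd indices run over the dimensions of the encoding vector, and the token's position is the value plugged in. The function name `sinusoidal_encoding` and the toy size `d_model = 4` are assumptions made for illustration; the course itself goes with a learned nn.Embedding table instead.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed positional encoding for one position.

    Even dimensions use sine and odd dimensions use cosine; each pair of
    dimensions (2i, 2i + 1) shares the frequency 1 / 10000 ** (2i / d_model).
    """
    vec = []
    for dim in range(d_model):
        exponent = (dim - dim % 2) / d_model  # equals 2i / d_model for both halves of a pair
        angle = position / (10000 ** exponent)
        vec.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return vec

block_size = 8  # sequence length used in the lecture
d_model = 4     # toy vector length so the rows are easy to read

table = [sinusoidal_encoding(pos, d_model) for pos in range(block_size)]
for pos, vec in enumerate(table):
    print(pos, [round(v, 3) for v in vec])
```

Nothing in this table is trainable, which is the limitation the lecture points out: the values depend only on the position and the vector length, never on the data.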
So not necessarily a probability distribution, or what we were using before in the bigram language model, but rather a really large vector, or a list. You could think about it as a bunch of different attributes about a character. So maybe, you know, a and e would be pretty close because they're both vowels, versus e and z, which would be very different because z is not a very common letter and e is the most common letter in the alphabet. So we pretty much just want to have vectors to differentiate these tokens, to place some semantic meaning on them. And anyway, that's a little talk about what the token embedding table is going to do when we add n_embd, and then the positional embedding table is just the same thing, but instead of each character having its own thing, each position index in the input is going to have its own embedding. So I can go and add this up here, n underscore embd, and we can just make this maybe 384. So 384 is quite huge, this may be a little too big for your PC, but we'll see in a second. So what this is going to do is it's going to have a giant vector. We could say, like, embedding vector, and it would be like this, and you would have a bunch of different attributes like 0.1, 0.2, 0.8, 1.1, right, except instead of four this is 384 elements long, and each of these is just going to store a tiny little attribute about that token. So let's say we maybe had a two-dimensional one and we were using words. So if we had sad versus happy, sad might be 0.1 and then 0.8. The 0.1 would be maybe the positivity of what it's saying, and the 0.8 would be, is it showing some sort of emotion, which is a lot, right? It's 80% emotion and 0.1 of positive sentiment. And then happy might be 0.9 and 0.8: 0.9 because it's happy, it's very good, and 0.8 for emotional, because they're sort of the same on the emotional level.
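To make the two lookup tables concrete, here is a short PyTorch sketch. The value `vocab_size = 65` and the random token ids are made up for illustration; `block_size = 8` and `n_embd = 384` follow the numbers in the lecture. It only shows that each token id and each position index maps to its own vector of length `n_embd`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size = 65   # hypothetical vocabulary size, not taken from the course data
block_size = 8    # sequence length
n_embd = 384      # length of each embedding vector

# One learnable row per token id, and one learnable row per position index.
token_embedding_table = nn.Embedding(vocab_size, n_embd)
position_embedding_table = nn.Embedding(block_size, n_embd)

# A batch of 2 sequences of token ids, shape (B, T).
idx = torch.randint(0, vocab_size, (2, block_size))

tok_emb = token_embedding_table(idx)                           # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(block_size))   # (T, n_embd)

print(tok_emb.shape)  # one 384-long vector per token in the batch
print(pos_emb.shape)  # one 384-long vector per position 0..7
```

The entries start out random and carry no meaning until training adjusts them, unlike the hand-picked sad and happy numbers on the whiteboard.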
But yeah, so this is what our embedding vectors are pretty much describing, and all this hyperparameter is concerned with is how long that vector actually is. So anyway, let's continue with the GPT language model class. So the next bit I'd like to talk about is how many decoder layers we have. So in here, let's just say we have four decoder layers, all right? So we have four of these. It's going to go through this one, and then this one, and then this one, and then this one. This is all happening sequentially. So we could actually make a little sequential neural network with four decoder layers. So I'm actually going to add this in, and then a little bit of extra code which I'll explain in a second here. So this self.blocks is how many decoder blocks we have running sequentially. Blocks and layers can be used interchangeably in this context. But yeah, we have an nn.Sequential, and this asterisk unpacks the list of blocks into it, so we're pretty much repeating this right here n_layer times, and n_layer is another hyperparameter we're going to add. We go n underscore layer equals four. Okay, so n_layer equals four, which means it's going to make four of these, I guess, blocks or layers sequentially. And this little Block thing, we're going to build on top of this in a second here. We're going to make an actual Block class and I'm going to explain what that does. But for now this is going to be some temporary code. As long as you understand that this is how we create our four decoder layers, that's all you need to know for now. I'm going to move more into this Block later. As for this self.ln_f, layer norm final, this is a final layer norm. All we're going to do is simply add this to the end of our network here, and all it's going to do is help the model converge better.
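Here is a minimal sketch of stacking layers with nn.Sequential and the asterisk, followed by a final layer norm. `PlaceholderBlock` is a hypothetical stand-in invented for this example; the real Block class with attention and feed forward is built later in the lecture. The names `blocks` and `ln_f` mirror the attributes described above but live outside any model class here.

```python
import torch
import torch.nn as nn

n_embd = 384
n_layer = 4

class PlaceholderBlock(nn.Module):
    """Hypothetical stand-in for the decoder block that is built later."""

    def __init__(self, n_embd):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)

    def forward(self, x):
        return self.ln(x)

# The list comprehension creates n_layer separate blocks, and the asterisk
# unpacks that list so nn.Sequential receives them as individual arguments.
blocks = nn.Sequential(*[PlaceholderBlock(n_embd) for _ in range(n_layer)])

# Final layer norm applied after the last block.
ln_f = nn.LayerNorm(n_embd)

x = torch.randn(2, 8, n_embd)   # (B, T, C) dummy input
out = ln_f(blocks(x))           # runs through the four blocks in order

print(len(blocks), out.shape)
```

Each block in the list is its own module with its own parameters, so the four layers do not share weights.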
Layer norms are super useful, and yeah, you'll see more of how that works. I'll actually remove it later on and we'll compare and see how well it actually does, and you can totally go out of your way to experiment with different normalizations and see how well the layer norm helps the model perform, or how well the loss converges over time when you put the layer norm in different places. So let's go back here, and now we have this end here, which is the, I believe this is language modeling head or something. Again, this is what Andrej Karpathy used. I'm assuming that means language modeling head. But pretty much all we're doing is projecting: we're doing this final transformation here, this final little linear layer, from all of these sequential decoder outputs, and we're just going to transform that to something that the softmax can work with. So we have our layer norm afterwards to normalize, to help the model converge after all this computation, and we're going to feed that into a linear layer to make it, I guess, softmax-workable, so the softmax can work with it. And yeah, so we're simply projecting it from n_embd, which is the vector length that we get from our decoder, to this vocab size, so the vocab size is going to essentially give us a little probability distribution over each token that we have in the vocabulary. So anyway, I'm going to put this back to normal here, and we're going to apply this in the forward pass. So a little thing I wanted to add on to this positional embedding, or rather just the idea of learned embeddings versus the fixed sine and cosine functions that we saw here: these are both actually used in practice. The reason I said we're going to use embeddings is because we just want it to be more oriented around our data. However, in practice, sinusoidal encodings are used in base transformer models, whereas
learned embeddings, which is what we're using, are used in variants like GPT, and we are building a GPT, so we're probably going to get better performance from learnable embeddings. And this is just summing up what the experts do; it's a little practice that experts follow when they're building base transformer models versus variants like GPTs. So that's just a little background on why we're using learnable embeddings. So now let's continue with the forward pass here. I'm going to paste in some more code, and let me just make sure this is formatted properly. Cool. So we have this token embedding, which comes from our token embedding table: we take idx and we get our token embedding here. Then what do we do with this positional embedding table? So we have this torch.arange, and we make sure this is on the CUDA device, the GPU, so it runs in parallel. And all this is going to do is look at how long T is, and let's say T is our block size, so T is going to be eight. So all it's going to do is give us eight indices. It's going to be like zero, one, two, three, four, five, six, seven, there's eight of those, and we're essentially just going to give each of those indices a different n_embd vector. It's just a little lookup table, and that's what that is. So all we do now, and it's actually quite simple, this is a very efficient way to do it, is you just add these two together. So there are torch broadcasting rules, which you might want to look into. I'll actually search that up right now: PyTorch broadcasting semantics. I cannot spell. Broadcasting semantics. So these are a little bit funky when you look at them the first time, but pretty much these are just rules about how you can do arithmetic operations, and just operations in general, on tensors. So tensors are like, you think of matrices where it's like a two by two. Tensors can be the same thing, but they could be like a two by two by two or a two by two by two
by two by two by two, whatever dimension you want to have there, and pretty much it's just rules about how you can have two of those weirdly shaped tensors and do things to them. So, just some rules here. I would advise you to familiarize yourself with these, even play around with it if you want, just for a few minutes, and get an idea: just try to multiply tensors together and see which ones throw errors and which ones don't. So it's a good idea to understand how broadcasting rules work. Obviously this term is a little fancy, and it's like, oh, that's a crazy advanced term. Not really. It's pretty much just some rules about how you're multiplying these really weirdly shaped tensors. So yeah, anyway, if we go back to here, we are allowed to broadcast these, we're allowed to actually add them together, so the positional embedding and the token embedding. We get x from this, which is a B by T by C shape. So now what we can do is actually feed it into the GPT, or I guess sort of a transformer network, if you want to say that. So we have these embeddings and positional encodings, we add these together, and then we feed them into our sequential network. So how are we doing this? Well, we go self.blocks, which is up here, and we essentially just feed in x, which is literally exactly what happens here: we have our tokenized inputs, we get our embeddings and our positional encodings through learnable embeddings, we add them together, and then we feed them into the network directly. So that's all that's happening here, and that's how we're feeding in x, which is the output of these. Then, after all of this, like way after we've gotten through all of these GPT layers or blocks, we do this final layer norm and then this linear transformation to get it to essentially scores, or logits, that we can feed into our softmax function. And then other than that, this forward pass is exactly
the same, other than this little block of code here. So if this makes sense so far, that is absolutely amazing. Let's continue. I'm actually going to add a little bit of, in practice, some little weight initializations that we should be using in our language model, in an nn.Module subclass. So I'm going to go over a little bit of math here, but this is just really important for practice and to make sure that your model does not fail in the training process. This is very important. It's going to be a little funky on the conceptualizing, but yeah, bring out some pen and paper and do some math with me. We've built up some of this initial GPT language model architecture, and before we continue building more of it, and the other functions, some of the math that's going on and the parallelization that's going on in the script, I want to show you some of the math that we're going to use to initialize the weights of the model to help it train and converge better. So there's this new thing that I want to introduce called standard deviation, and this is used in intermediate-level mathematics. The symbol essentially looks like this, and this is the population standard deviation. So N is the size; it's just going to be an array, so N is the length of the array. And then x_i, we iterate over each value, so x at position zero, x at position one, x at position two. And then this mu here is the mean. So essentially we're going to iterate over each element, we're going to subtract the mean from it, we're going to square that, and then keep adding all these squared results together. And then once we get the sum of that, we're going to divide it by the number of elements there are, and then once we get this result, we're going to square root that. So this symbol here, the one that looks like an E, might also look a little bit unfamiliar, so let me just illustrate this for you. We go to our whiteboard, and this E looks like that. Let's just say we were to put in x_i like that, and our array, let's just say
for instance our array is 0.1, 0.2, 0.3. So what would the result of this be? Well, we look at each element iteratively and add them together, so 0.1 plus 0.2 plus 0.3, well, we get 0.6 from that. So this would essentially be equal to 0.6. That's what that equals. We just add each of these up together, or whatever this expression is, we evaluate it for each element, iterating over the number of elements we have in some arbitrary array or vector or list or whatever you want to call it, and then we just look at what's going on here and do some basic arithmetic. So let's walk through a few examples just to illustrate what the results look like based on the inputs here. So I'm going to go back to my whiteboard, we're going to draw a little line here just to separate this, and let's go down. So I want to calculate the standard deviation of, and we'll just make some random array: negative 0.38, 0.52, and then 2.48. Cool. So we have this array. It has three elements, so that means N is going to be equal to three. Let me drag this over here. So N is the number of elements, so N is going to be equal to three. Our mean, well, our mean is just: we add all these up together and then we average them. So our mean is going to be equal to negative 0.38 plus 0.52 plus 2.48, and then divided by 3, and the answer to this, I did the math ahead of time, is 0.873 repeating, but we're just going to put 0.87 for simplicity's sake. Cool, so the mean of this is 0.87 and N is equal to three. Now we can start doing some of the other math. So we have this lowercase sigma, this little o with a cool line, and that equals the square root of one over N, where N is three, multiplied by this capital sigma, that's what this symbol is, that's the name for it, and then we go x_i minus our mean of 0.87, apologies for the sloppy writing, and then we square that. So cool, let me drag this
out. Awesome. So let's just do this step by step here. So the first one is going to be negative 0.38, so we have negative 0.38 and we're going to do minus the mean here, so minus 0.87, and I'm just going to wrap all this in brackets so that we don't miss anything. Kind of wrap it in brackets and then just square it and see what we get after. So I'm just going to write all these out, then we can do the calculations. So next up we have 0.52 minus 0.87, we'll square that, and then next up we have 2.48 minus 0.87, and then we square that as well. So awesome, what is the result of this? The result of negative 0.38 minus 0.87, squared, is 1.57. The result of this line is 0.12. Again, these are all approximations, they're not super spot on. We're just doing this to understand what's going on here, just to overview the function, not for precision. Then the next one is going to be 2.59, and you can double check all these calculations if you'd like; I have done these ahead of time. So that is that, and now from here all we have to do is add each of these together and divide by N. So 1.57 plus 0.12 plus 2.59, all that divided by 3, is going to be about 1.43, and then keep in mind we also have to square root this, so the square root of that is going to be 1.19 approximately. We'll just add this little approximation sign ahead of it. And so that's the standard deviation of this array: negative 0.38, 0.52, 2.48, standard deviation is approximately 1.19. Awesome, let's do another example. So let's say I want to do the standard deviation of 0.48, 0.50, and I guess 0.52. Cool, so there's a little pattern here, it just goes up by 0.02 each time, and you're going to see why this is vastly different from the other example. So let's walk through this. So first of all we have N. N is equal to 3. Cool. What is our mean? Well, our mean is 0.5. If you do 0.48 plus this plus that and divide by three, that's going to be 0.5, and if you're good with numbers you'll probably
already be able to do this in your head, but that's okay if not. Next up we're going to plug this into the formula. So what do these iterations look like? Let's just do these in brackets, the old way: 0.48 minus 0.5, squared. The next one is 0.5 minus 0.5, squared, which we already know is zero, and then this one is 0.52 minus 0.5, squared. So the result of 0.48 minus 0.5, squared, and we'll just write equals here, is going to be negative 0.02 squared, so that'd be 0.0004, like that. I'll make this not actually overlap: 0.0004. And then this one we obviously know would be zero, because 0.5 minus 0.5, that's zero, then you square zero, still the same thing. And then this one is 0.0004 as well. So when we add these together we're going to get 0.0008, just like that, and then if we divide that by three, or whatever N is, then we end up getting 0.00026 repeating, so I'll just write two six six like that. And so all we have to do at this point is find the square root of this, so we'll do the square root of 0.000266 approximately, and that's going to be equal to about 0.0163. So now we have the standard deviation of both of these arrays. For 0.48, 0.5, and 0.52, our standard deviation is 0.0163, so very small, and for negative 0.38, 0.52, and 2.48, we get a standard deviation of 1.19. So you can see that these numbers are vastly different: one is roughly seventy times greater than the other. So the reason for this is because these numbers are super diverse. I guess another way you could think of them is that they stretch out very far from the mean. So this essentially means, when you're initializing your parameters, that if you have some outliers, then your network is going to be funky, because the learning process is just messed up: you have outliers and it's not learning the right way it's supposed to. Whereas if you had way too small of a standard deviation in your initial parameters, like in
here, but maybe even smaller, so let's say they were all 0.5, right, then all of your neurons would effectively be the same, and they would all learn the same pattern, so then you would have no learning done. So on one side your learning is super, super unstable, and you have outliers that are just learning very distinct things and not really letting other neurons get opportunities to learn, or rather other parameters. So yeah, if you have a lot of diversity you just have outliers, and if you have no diversity at all, then essentially nothing is learned and your network is useless. So all we want to do is make sure that our standard deviation is balanced and stable, so that the training process can learn effective things, so each neuron can learn a little bit. So you can see here, this would probably be an okay standard deviation if these were some parameters, because they're a little bit different from each other, they're not all super, super close to the same. And yeah, so essentially what this looks like in code is the following. You don't actually need to memorize what this does, as it's just used in practice by professionals, but essentially what this does is it initializes our weights around a certain standard deviation. So here we set it to 0.02, which is pretty much the same as what we had in here, 0.0163. Yeah, this one's a little bit off from the standard deviation we set here, but essentially we're just making sure that our weights are initialized properly. And you don't have to memorize this at all, it's just used in practice and it's going to help our training converge better. So as long as you understand that we can apply some initializations on our weights, that's all that really matters. So cool, let's move on to the next part of our GPT architecture. So awesome, we finished this GPT language model class. Everything's pretty much done here: we did our init, we did some weight initializations, and we did our forward
pass. So awesome, that's all done. Now let's move on to the next thing, which is the Block class. So what is Block? Well, if we go back to this diagram, each of these decoder blocks is a Block. So we're pretty much just going to fill in this gap here. Our GPT language model has these two ends, where we get our tokenized inputs, and then we do some transformations and a softmax after, and essentially we're just filling in this gap here, and then we're going to build out and just sort of branch out until it's completely built. So let's go ahead and build these blocks here. What does this look like? It looks like this. So we have our init, we have a forward pass, as per usual: an init and a forward pass, as seen in the GPT language model class. All of them are going to look like this, a forward and an init. So the init is going to just initialize some things. It's going to initialize some transformations and some things that we're going to use in the forward pass. That's all it's doing. So what do we do first? Well, we have this new head size parameter introduced. So head size is the number of features that each head will be capturing in our multi-head attention. So all the heads in parallel, how many features is each of them capturing? We get that by dividing n_embd by n_head. So n_head is the number of heads we have, and n_embd is the number of features we're capturing. So 384 features divided by four heads, so each head is going to be capturing 96 features, hence head size. So next up we have self.sa, which is just short for self-attention. We do a multi-head attention, we pass in our n_head and our head size, and you'll see how these parameters fit in later once we build up this multi-head attention class. So cool, now we have a feed forward, which is, as explained, just in the diagram here. Our feed forward is just this, which we're actually going to build out next. And we have two layer norms, and these are just for the post-norm slash pre-norm architecture that we
could implement here. In this case it's going to be a post-norm, just because I found that it converges better for this course and the data that we're using and just the model parameters and whatnot. It just works better. Also, that is the original architecture that's used in the attention paper, so you might have seen that they do an add and norm rather than a norm and add. So anyway, we've initialized all of these. Cool. So we have head size, self-attention, feed forward, and then two layer norms. So in our forward pass we do our self-attention first. Let's actually go back to here. So we do our self-attention, then add and norm, then a feed forward, and then add and norm again. So what does this look like? Self-attention, add and norm, feed forward, add and norm. Cool. So we're doing an add, so we're going x plus the previous answer, which is adding them together, and then we're just applying a layer norm to this. So cool. If you want to look more into what layer norm does and why it's so useful, you can totally go out of your way to do that, but layer norm is essentially just going to help smooth out our features here. So we have this, and honestly there's not much else to that. We just return this final value here, and that's pretty much the output of our blocks. So next up I'm going to add a new little code block here, which is going to be our feed forward. So let's go ahead and do that. So feed forward is just going to look exactly like this. It's actually quite simple. So all we do is we make an nn.Sequential, from torch.nn. We make this sequential network of linear, ReLU, and then linear. So in our linear we have to pay attention to the shapes here. So we have n_embd and then n_embd times four, and then the ReLU, essentially what the ReLU will do is, it looks like this, let me illustrate this for you guys. So essentially you have this graph here, and let's just make this a whole plane, actually. So all of these values that are below zero, all
these values that are below zero on the x-axis, and equal to zero, will be changed just to zero, like that. So you have all these values that look like this, and then everything that is above zero just stays the same. So you essentially just have this funny-looking shape, it's like flat and then diagonal. That's what the ReLU function does: it looks at a number, sees if it's equal to or less than zero, and if that's true we give that number zero, and if it's not, then we just leave the number alone. So cool, very cool non-linearity function. You can read papers on that if you'd like, but essentially the exact shape of this doesn't matter much. All we're doing is converting some values if they're equal to or below zero. That's all this is doing. And then essentially we're following this linear transformation with this one, so we have to make sure that these inner dimensions line up. So four times n_embd and four times n_embd, those are equal to each other, so the whole thing should take us from n_embd back to n_embd. Cool. So now we have our dropout, and in case you don't know what dropout is, it pretty much just makes a certain percentage of our neurons drop out and become zero. This is used to prevent overfitting, and some other little details that I'm sure you could figure out through experimenting. So all this actually looks like in parameter form is just dropout equals, we'll just say, 0.2. So 0.2 in percentage form means it's going to drop out 20% of our neurons, turn them to zero, to prevent overfitting. That's what that's doing. So cool, we have our feed forward network, we do dropout after to prevent overfitting, and then we just call forward on this sequential network. So cool, feed forward, pretty self-explanatory. Let's jump into the next piece: we're going to add the multi-head attention class. So we've built all these decoder blocks, we've
built, inside of the decoder blocks, the feed forward and our residual connections, and now all we have left to do in this block is the multi-head attention. It's going to look exactly like this here. We're going to ignore the keys and queries for now and save those for dot product attention. Essentially we're going to make a bunch of these multiple heads, concatenate the results, and do a linear transformation. So what does this look like in code? Well, let's go ahead and add this here. Multi-head attention, cool: it's multiple heads of attention in parallel. I explained this earlier, so I'm not going to jump into too much detail on that, but we have our init, we have our forward, and what are we doing in here? Our self.heads is just a ModuleList, and ModuleList is kind of funky; I'll dive into it a little bit later. Essentially what we're doing is having a bunch of these heads in parallel. So num_heads, let's say our num_heads is set to maybe four. In this block we do multi-head attention with n_head and then head_size, so num_heads is essentially what that is. For the number of heads that we have, which is four, we're going to make one head each, so four heads running in parallel is what this does here. Then we have this projection, which is essentially just going to project the head size times the number of heads to n_embd. You might ask, well, that's weird, because head_size times num_heads is literally equal to n_embd if you go back to the math we did here. The purpose of this is just to be super hackable, so that if you actually do want to change these around, it won't throw dimensionality errors at you. That's what we're doing: just a little projection from whatever these values are up to this constant feature length of n_embd. Then we just follow that with a dropout, dropping out 20 percent of the
network's neurons. Now let's go into this forward here. So forward: torch.cat, and we do for h in self.heads, so we're going to concatenate each head together along the last dimension. The shape in this case is B by T by C: batch, by time, by what we'll call the feature dimension or channel dimension. The channel dimension is the last one, so we're going to concatenate along this feature dimension. Let me help illustrate what exactly this looks like. When we concatenate along these, we have this B by T, and then we'll say our features are going to be H1, H1, H1, H1; these are all just features of head one. Then our next would be H2, H2, H2, H2. Then, let's say we have a third head, H3, H3, H3, H3, like that. So we have maybe four features per head, and there are three heads. Essentially all we're doing when we concatenate is joining these along the last dimension. Without it we'd have this ugly list format of each head's features, sequentially in order, which is really hard to process; we're just concatenating these so they're easier to process. That's what that does. Then we apply our projection and follow that with a dropout. So cool. If that didn't totally make sense, you can plug this code into ChatGPT and get a detailed explanation of how it works, but essentially that's the premise: you have your batch by sequence length (or time, used interchangeably), and then you have your features, which are all just in this weird list format of each feature listed one after another. So that's what multi-head attention looks like. Let's go ahead and implement dot product attention, or scaled dot product attention. A little something I'd like to cover before we go into scaled dot product attention is this linear transformation here.
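To make the concatenation step above concrete, here is a small dependency-free sketch of what joining along the last dimension does to the shapes. It uses plain nested lists in place of tensors; the function name concat_last_dim and the toy sizes (three heads, four features each, as in the H1/H2/H3 illustration) are assumptions for this example, not the course code. In the notebook the same step is done by torch.cat along the last dimension.

```python
def concat_last_dim(heads):
    """Join per-head outputs of shape (B, T, head_size) into (B, T, num_heads * head_size)."""
    B, T = len(heads[0]), len(heads[0][0])
    out = []
    for b in range(B):
        row = []
        for t in range(T):
            features = []
            for h in heads:
                # Append this head's features for position (b, t), head after head
                features.extend(h[b][t])
            row.append(features)
        out.append(row)
    return out

# Toy example: 3 heads, batch of 1, 2 time steps, 4 features per head
h1 = [[["H1"] * 4, ["H1"] * 4]]
h2 = [[["H2"] * 4, ["H2"] * 4]]
h3 = [[["H3"] * 4, ["H3"] * 4]]
out = concat_last_dim([h1, h2, h3])

print(len(out), len(out[0]), len(out[0][0]))  # 1 2 12
print(out[0][0])  # four H1 entries, then four H2, then four H3
```

The batch and time dimensions are untouched; only the feature dimension grows from head_size to num_heads times head_size, which is the width the projection layer then expects.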
You might think, well, what's the point if we're just transforming n_embd to n_embd, right? It's kind of weird to have them match like that. Essentially what this does is add in more learnable parameters for us. It has a weight and a bias; if we set bias to false, like that, then it wouldn't have a bias, but it does have one. So it's just another wx plus b, if you will: a weight times x plus a bias. It just adds more learnable parameters to help our network learn more about this text. So cool. I'm going to go ahead and add in this last, but not least, scaled dot product attention, or Head class. There's going to be a bunch of these heads, hence the class Head, running in parallel, and inside of here we're going to do some scaled dot product attention. There's a lot of code in here; don't get too overwhelmed by it, I'm going to walk through it step by step. We have our init, we have our forward, awesome. So what do we do in our architecture here? We have a key, a query, and a value. The keys and the queries get dot-producted together, and they get scaled by one over the square root of the length of a row in the keys or queries matrix; we'll just say keys, for example, so the length of a row in keys. Then we do our masking to make sure the network does not look ahead and cheat, and then we do a softmax and a matrix multiply to essentially apply this value weight on top of it. So cool. Keep in mind this initialization is not actually doing any calculations, but rather just initializing the linear transformations that we will do in the forward pass. This self.key is just going to transform n_embd to head_size, with bias false, and the rest of these are just the same, n_embd to head_size, because each head will have 96 features rather than 384. We kind of already went over that, but that's what that's doing: it's just a linear transformation that converts from 384 to 96 features. Then we have
this self.register_buffer. Well, what does this do, you might ask? register_buffer is essentially just going to register this no-look-ahead mask in the model state. So instead of having to reinitialize this for every single head, for every single forward and backward pass, we're just going to add it to the model state. That saves us a lot of computation, so our training time is going to be reduced just because we're registering this. It prevents some of that overhead of having to redo this over and over again. You could still do training without this; it would just take longer. So that's what that's doing. Now we have this dropout, of course, and then in our forward pass, let's break this down step by step. B by T by C, so batch by time by channel, is our shape; we just unpack those numbers. Then we have a key, which is just calling this linear transformation on an input x, and then a query, which is calling the same kind of transformation, but a different learnable one, on x as well. So instead of B by T by C, we get B by T by head_size, hence this transformation from 384 to 96. That's what that is, and that's how these turn out here. Now we can actually compute the attention scores. So what do we do? We'll say wei is our attention weights, I guess you can say. We have our queries matrix-multiplied with the keys transposed. So what does this actually look like? I want to help you guys understand what transposing does here, so let's go back and draw out what this is going to look like. Let's say you had, I don't know, maybe a, b, c, d, and you have a, b, c, and d. Cool. Let's draw some lines to separate these. So essentially what this does is, the
transposing puts it into this form. If we didn't have the transpose, then this would be in a different order; it wouldn't be a, b, c, d both top to bottom and left to right. It would be in a different order, which would essentially not allow us to multiply them the same way. So we do a times a, a times b; it's sort of a direct multiply, if you will. I don't know if you remember times tables from elementary school, but that's pretty much what it is: we're setting it up in times-table form and computing attention scores that way. That's what this transposing is doing, and all it does is flip the second-to-last dimension with the last dimension. In our case, our second-to-last is T and our last is head_size, so it just swaps these two. So we get B by T by head_size, and then B by head_size by T, and we dot product these together, also keeping in mind our scaling here: we're taking this scaling of one over the square root of the length of a row in the keys, if we look at this here. Now, there's a little analogy I'd like to provide for this scaling. Imagine you're in a room with a group of people and you're trying to understand the overall conversation. If everyone is talking at once, it might be challenging to keep track of what's being said. It would be more manageable if you could focus on one person at a time, right? That's similar to how multi-head attention in a transformer works. Each of these heads divides the original problem of understanding the entire conversation, i.e.
the entire input sequence, into smaller, more manageable sub-problems. Each of these sub-problems is a head, so the number of heads is the number of these sub-problems. Now consider what happens when each person talks louder or quieter. If someone speaks too loudly, that is, the values in the vectors are very large, it might drown out the others. This could make it difficult to understand the conversation, because you're only hearing one voice, or mostly one voice. To prevent this, we want to control how loud or how quiet each person is talking so we can hear everyone evenly. The dot product of the query and key vectors in the attention mechanism is like how loud each voice is. If the vectors are very large or high-dimensional (many people talking, in the analogy), the dot product can be very large. We control this volume by scaling down the dot product using the square root of the head size. This scaling helps ensure that no single voice is too dominant, allowing us to hear all the voices evenly. This is why we don't scale by the number of heads or the number of time steps: they don't directly affect how loud each voice is. So in sum, multi-head attention allows us to focus on different parts of the conversation, and scaling helps us hear all parts of the conversation evenly, allowing us to understand the overall conversation better. Hopefully that helps you understand exactly what this scaling is doing. Now let's go into the rest of this. We have this scaling applied for our head_size dimension, we're doing this dot product matrix multiplication, and we get our B by T by T. Then what is this masked_fill doing? Let me help illustrate this. We'll say block_size is three here, all right? So initially, let's say we have a 1, a 0.6, and then a 0.4. Then for our next one, we'll just say all of these rows are the same. Okay. So essentially, in our first one we want to mask
out everything except for the first time step. Then when we advance one, so let's just change this here back to zero, when we go on to the next time step we want to expose the next piece, so 0.6, I believe it was, and then zero again. Then when we expose the next time step after that, we want to expose all of them. What this means is that as the time step advances, in this vertical direction, every time we step one we want to expose one more token, or one more of these values, in a sort of staircase format. So essentially what this masked_fill is doing is taking this T by T, so block_size by block_size, and for each value where the mask is zero, we're going to set it to the float value negative infinity. So it's going to look like this: negative infinity, negative infinity, negative infinity, just like that. What happens after this is that our softmax is going to take these values and exponentiate and normalize them. We already went over the softmax previously, but essentially we apply the softmax along the last dimension, so it's going to do that along this horizontal here. It's block_size by block_size, so we'll say T1 and T2, each of these being of length block_size, and we're going to apply it along this last T2; this horizontal is T2. Hopefully that makes sense. Essentially what this exponentiation is going to do is turn these negative-infinity values into zero, and this one is obviously going to remain a one. Then it's going to turn these into zero, and it's probably going to sharpen this one here, so this one is going to be more significant; it's going to grow more than the 0.6, because we're exponentiating. And then same here, so this one is going to be very sharp
compared to the 0.6 or 0.4. So that's what the softmax does. Essentially the point of the softmax function is to make the values stand out more; it makes the model more confident in highlighting attention scores. When you have one value that's very big, but not too big, not exploding, because of our scaling (we want to keep things at a modest scale), when an attention score is very big we want the model to put a lot of focus on it and say this is very important in the entire sentence, or the entire sequence of tokens, and we want it to learn the most from that. So instead of just being a plain normalizing mechanism, softmax does some exponentiation to make the model more confident in its predictions. This will help us score better in the long run, if we highlight which tokens and which attention scores are more important in the sequence. After this softmax, we apply a simple dropout on this wei variable, this newly calculated wei: scaled dot product attention, masked, and then softmaxed. We apply a dropout on that, and then we perform our final weighted aggregation, so this v multiplied by the output of the softmax. We get this v from self.value of x, so we just multiply that. A little pointer I wanted to add concerns this ModuleList here, and then our Sequential network here. We have this sequential number of blocks here for n_layer, and we have our ModuleList. So what really is the difference? Well, ModuleList is not the same as nn.Sequential, in terms of the asterisk usage that we see in the language model class. ModuleList doesn't run one layer or head after another; rather, each is isolated and gets its own unique perspective. Sequential processing is where one block depends on another to synchronously complete, so that means we're waiting on one to finish before we move on to the next. So they're not
completing asynchronously or in parallel. The multiple heads in a transformer model operate independently, and their computations can be processed in parallel. However, this parallelism isn't due to the ModuleList that stores the heads. Instead, it's because of how the computations are structured to take advantage of the GPU's capabilities for simultaneous computation, and how the deep learning framework, PyTorch, interfaces with the GPU. This isn't something we have to worry about too much, but you could think of these as sort of running in parallel. If you want to get into hardware, that's your whole realm there, but this is PyTorch; this is software, not hardware at all. I don't expect you to have any hardware knowledge about GPUs, CPUs, or anything like that. So that's just some background on what's going on there. Cool. Let's actually go over what is going on from the ground up. We have this GPT language model: we get our token embeddings and positional embeddings, we have these sequential blocks, and we initialize our weights. For each of these blocks we have this class Block. We get a head_size parameter, which is n_embd, 384, divided by n_head, which is four, so we get 96 from that; that's the number of features we're capturing. We do self-attention, a feed forward, and two layer norms, so we go self-attention, layer norm, feed forward, layer norm, in the post-norm architecture. Then the feed forward is just a linear followed by a ReLU followed by a linear, and then dropout on that. Then we have our multi-head attention, which structures these attention heads running in parallel and then concatenates the results. For each of these heads we have our keys, queries, and values. We register the mask in the model state to prevent excessive overhead computation. Then we do our scaled dot product attention in this line, we do our masked fill to prevent
look-ahead, we do our softmax to make our values sharper and make some of them stand out, and then finally we do a dropout on that and some weighted aggregation: we take this final wei variable multiplied by our value from this initial linear transformation. So cool, that's what's happening step by step in this GPT architecture. Amazing. Give yourself a good pat on the back, go grab some coffee, do whatever you need to do, even get some sleep, and get ready for the next section; this is going to be pretty fun. There's actually another hyperparameter I forgot to add, which is n_layer, and we'll say n_layer is equal to four. n_layer is the number of decoder blocks we have; instead of n_block we say n_layer. It doesn't really matter what it's called, but that's what it means. Then the number of heads is how many heads we have running theoretically in parallel, and n_embd is the total number of dimensions we want to capture from all the heads concatenated together; we already went over that. So cool, hyperparameters: block size is the sequence length, batch size is how many of these we want at the same time, max_iters is how many training iterations we want to do, and learning rate is what we covered in the Desmos calculator that I showed a little while back, showing how we update the model weights based on the derivative of the loss function. Then eval_iters, which is for reporting the loss, and lastly the dropout, which is dropping out 0.2, or 20 percent, of the total neurons. Awesome, that's pretty cool. Let's go ahead and jump into some data stuff. I'm going to pull out a paper here. Let's just make sure everything works, and then we're actually going to download our data. I want to try to run some iterations first. Actually, I made some changes: pretty much this was weird and didn't work, so I
just changed this around to initializing our chars as empty, opening this text file, storing it in a variable with UTF-8 encoding, then making our vocab the sorted list of the set of our text, and making the vocab size the length of that. So let's go ahead and run this through. I did change the block size to 64, the batch size to 128, and some other hyperparameters here. Honestly, the block size and batch size will depend on your computational resources, so just experiment with these. I'm going to try these out first to show you guys what this looks like. Okay, so it looks like we're getting 'idx is not defined'. Yep, we can just change that: it's saying idx is not defined because we're using index here and idx there, so that should work now. And now we're getting 'local variable t referenced before assignment'. Okay, so we use t here and then we initialize t there, so let's just bring that up to there. Cool. Now let's try to run this. Oh, 'shape is invalid for input of size...'. Okay, let's see what we've got. It turns out we don't actually need two token embedding tables, a little bit of a silly mistake, so I'll just delete that and run this again. Let's see: a new error, 'local variable t referenced before assignment'. Okay, so our t is referenced here. Well, how can we initialize this? We could take this index here, of shape B by T, because it goes B by T plus one and so on and just keeps growing, and unpack that. So we go B, T = index.shape and just unpack that. Cool. So now we're going to run this training loop, and it looks like it's working so far, so that's amazing. Super cool: step zero, train loss 4.4. That's actually a pretty good training loss overall. We'll come back after this is done. I've set it to train for 3,000 iterations, printing every 500 iterations, so we'll just
see the loss six times over this entire training process. Or we should; I don't know why it's going to 100. Oh, eval_iters, got it. Okay, estimate_loss is... okay, so we don't actually need eval interval; get rid of that. We'll just make this, sure, why not, 100; we'll keep that. And it's just going to keep going here, and hopefully we'll see our loss get smaller over time. I'll come back when that's done. As for the data, we're going to be using the OpenWebText corpus. Let's go down here. This is a paper called A Survey of Large Language Models. I'll just go back to OpenWebText, wherever that is. Okay, so OpenWebText: this consists of a bunch of Reddit links, or rather Reddit upvotes. If you go on Reddit and you see a bunch of those posts that are highly upvoted or downvoted, those pieces of text are valuable and they contain things that we can train on. So WebText is a corpus of all these upvoted links, but it's not publicly available, so somebody created an open source version called OpenWebText, hence 'open', and it's pretty much just an open version of this. We're going to download that. There's a bunch of other corpora here, like Common Crawl, which is really, really big, like petabyte-scale data volume, and you have a bunch of books. So this is a good paper to read over; it's just called A Survey of Large Language Models. You can search for it and it'll come up, and you can download the PDF. Read over that if you'd like. But anyways, this is a download link for the OpenWebText corpus. Just go to this link, I have it in the GitHub repo, go to download, and it'll bring you to this drive. You can right-click this and hit download. It'll say 12 gigabytes exceeds the maximum file size that it can scan, so it's like, this might have a virus. Don't worry, it doesn't have a virus; this is actually created by a
researcher, so not really bad people are in charge of creating text corpora. Go ahead and download anyway. Okay, I have actually already downloaded this, so I'll come back when our training is done. I'm actually going to stop here at iteration 2,000, because we're not getting that much progress, and the reason for this is our hyperparameters. Batch size and block size, I mean, these are okay, but what we might want to change up is our learning rate. Some learning rates that are really useful are 3e-3, 3e-4, 1e-3, and 1e-4. These are all learning rates that I like to play around with; they're just common ones, and it's up to you whether you want to use them or not. What I might do is downgrade to 3e-4 and retest. As well, I'm going to bump up the number of heads and the number of layers so that we can capture more complex relationships in the text, thus having it learn more. I'm going to change each of these to eight. Then I'll go Kernel, Restart, and we'll run this from the top. Cool. Let's see what we start off with and what our loss looks like over time. So we got step one, 4.5, about the same as last time; it's like 0.2 off or something, so it's pretty close. Let's see the next iteration here. That's wonderful: before we were getting 3.1-ish, or somewhere around that range, 3.15, and now we're getting 2.2. So you can see that as we change hyperparameters, we can see a significant change in our loss. This is amazing; it just goes to show how cool hyperparameters are and what they do for you. Given that, let's start changing around some data stuff. So this right here is the Wizard of Oz
text, just a simple text file. The size isn't super large, so we can open it all into RAM at once. But if we were to use OpenWebText, we cannot read 45 gigabytes of UTF-8 text into RAM at once. You just can't do that unless you have maybe 64 or 128 gigabytes of RAM; it's really not feasible. So we're going to do some data pre-processing here, some data cleaning, and then set up a simple way to load data into the GPT. Let's go ahead and do that. The model has actually gotten really good at predicting the next token; as you can see, the train loss here is 1.01. Let's find out what the prediction accuracy of that is. I might just go into GPT-4 and ask it, what is the prediction accuracy of loss 1.01? 'The loss value comes from a loss function during the training process.' Okay, let's see: 'cross-entropy loss doesn't mean the model is 99 percent accurate.' Okay, so that pretty much means the model is really accurate, but I want to find a value here. We'll go to Wolfram Alpha and just guess some values. So negative ln of, let's say, 0.9. Okay, probably not that. 0.3, 0.2, 0.4, 0.35: yep. So the model has about a 35 percent chance of guessing the next token as of right now. That's actually pretty good; one in every three tokens is spot on. That is wonderful. This is converging even more; we're getting 0.89, so now about 40 percent are being guessed properly. Our validation loss is not doing amazing, though, but we'll linger on that in a little bit, and you'll see how this changes as we scale our data. So I've downloaded this WebText .tar file. Tar files are interesting: in order to extract these, you simply right-click on them, go to Extract To, and it'll make a new folder here. It'll process this; you have to make sure you have WinRAR, or else this might not work to the fullest extent. And yeah, so we'll just wait
for this to finish up here. You should end up with something that looks like this: openwebtext, and inside of here you have a bunch of .xz files. Cool. There are actually about 20,000 of these, so there are definitely going to be some for loops in here, for sure. Let's handle this step by step in this data extract file. First off, we're going to import some Python modules: os for interacting with the operating system, lzma for handling .xz files, which are a type of compressed file, like 7-Zip for example, and tqdm for displaying a progress bar. You'll see a progress bar going left to right in the terminal, and that's going to show us how quickly the script is executing. Next up, we're going to define a function called xz_files_in_dir. It takes a directory as input and returns a list of all of the .xz file names in that directory. It uses os.listdir to get all the file names and os.path.isfile to check that each one is a file and not a directory. If a file name ends with .xz and it's a file, it'll be added to the list. So we just have a list of these files, where each element is the name of a file in there. That's pretty much what that does. Next up we'll set up some variables. folder_path is going to be where our .xz files are located. I'm going to change this here, because that's an incorrect file path. But yes, just like that. You have to make sure that these slashes are forward slashes, or else you might get Unicode escape errors: when Python reads the string, the backward slashes get treated as escape characters and do weird things. So you can either use one forward slash or two backward slashes, and that should work. Awesome: just make sure you have forward slashes and you should be good. So folder_path is where all these files are located.
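The helper described above can be sketched with just the standard library. This is a minimal version under stated assumptions, not necessarily the exact code from the course repo: the function name follows the transcript, while the example paths and the throwaway demo directory are hypothetical.

```python
import os
import tempfile

def xz_files_in_dir(directory):
    """Return the names of the files in `directory` that end with .xz."""
    files = []
    for filename in os.listdir(directory):
        # Keep only real files (skip sub-directories) whose name ends with .xz
        if filename.endswith(".xz") and os.path.isfile(os.path.join(directory, filename)):
            files.append(filename)
    return files

# Two equivalent spellings of a Windows path (hypothetical location):
path_forward = "C:/data/openwebtext"      # forward slashes need no escaping
path_escaped = "C:\\data\\openwebtext"    # each backslash is written twice

# Small demo in a throwaway directory
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "a.xz"), "wb").close()
    open(os.path.join(folder, "notes.txt"), "wb").close()
    os.mkdir(os.path.join(folder, "dir.xz"))  # a directory, so it is skipped
    found = sorted(xz_files_in_dir(folder))

print(found)  # ['a.xz']
```

A raw string such as r"C:\data\openwebtext" is a third option for Windows paths, since it turns off escape processing altogether.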
That is, where all these .xz files are located, as you saw. output_file is the pattern for output file names, in case we want to have more than one of them. If you want 200 output files instead of one, they'll be output 0, output 1, output 2, and so on. Then vocab_file is where we want to save our vocabulary. Keep in mind that with this giant corpus you can't push it all into RAM at once, so as we're reading these little compressed files, 20,000 of them, we're going to take all of the new characters from them and push them into a vocab file containing all of the different characters that we have. That way we can handle this later and sort it into a list containing all of our vocabulary. split_files is how many files we want to split this into; it ties back to output_file and these curly braces here. If we want more than one, this takes effect. Cool. Now we'll use our xz_files_in_dir to get a list of file names and store them in this variable. We'll count the total number of .xz files, simply the length of our file names. Then in here we'll calculate the number of files to process for each output file. If the user has requested more than one output file, this is the total number of files divided by the number of output files, rounded down. If the user only wants one output file, max_count is the same as total_files, and that's how that works. Next up, we'll create a set to store our vocabulary as we start adding these new characters into it. A set is a collection of unique items, in case you didn't know what a set was. Now this is where it gets interesting: we're ready to process our .xz files. For each output file, we'll process max_count files. For each file, we'll open it, read its contents, write the contents to the current output file, and then add any unique characters to our vocabulary set.
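As a rough sketch of the per-file step just described, the following reads each .xz file with lzma, appends its text to one output file, and grows the vocabulary set. The function name process_files, the demo file names, and the tiny sample contents are assumptions for illustration; the tqdm progress bar is left out so that the example needs only the standard library.

```python
import lzma
import os
import tempfile

def process_files(folder_path, filenames, output_path, vocab):
    """Write the decompressed text of each .xz file to output_path and collect its characters."""
    with open(output_path, "w", encoding="utf-8") as outfile:
        for filename in filenames:
            file_path = os.path.join(folder_path, filename)
            # "rt" decompresses and decodes to text in one step
            with lzma.open(file_path, "rt", encoding="utf-8") as infile:
                text = infile.read()
            outfile.write(text)
            vocab.update(set(text))  # a set keeps only the unique characters

with tempfile.TemporaryDirectory() as folder:
    # Create two tiny compressed files to stand in for the real corpus
    for name, content in [("a.xz", "hello"), ("b.xz", "world")]:
        with lzma.open(os.path.join(folder, name), "wt", encoding="utf-8") as f:
            f.write(content)

    filenames = ["a.xz", "b.xz"]
    total_files = len(filenames)
    split_files = 1  # how many output files were requested
    max_count = total_files // split_files if split_files > 1 else total_files

    vocab = set()
    out_path = os.path.join(folder, "output_0.txt")
    process_files(folder, filenames[:max_count], out_path, vocab)
    with open(out_path, encoding="utf-8") as f:
        combined = f.read()

print(combined)       # helloworld
print(sorted(vocab))  # ['d', 'e', 'h', 'l', 'o', 'r', 'w']
```

Only one file's text is held in memory at a time, which is the point of the approach: the full corpus never has to fit in RAM.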
After processing max_count files, we remove them from our list of files, and then finally we'll write all of our vocabulary to this file. We just write all of these characters in the vocab to this vocab file, which is here, vocab.txt. So awesome. Honestly, we could just go ahead and run this. Let's go in here; I'm going to run cls to clear that, and then python data-extract.py. Let's see if this works. 'How many files would you like to split this into?' We'll go one. Then we get a progress bar, 20,000 files, and we'll just let that load. I'll come back to you in about 30 minutes to check up on this. Okay, so there's another little thing we want to consider, and it's actually quite important: our splits, so our train and val splits. It would be really inefficient to get blocks and then create train and val splits as we go, with every new batch we get. So what we might be better off doing is creating an output train file and an output val file, just two of them instead of one. Train is 90 percent of our data and val is 10 percent, if that makes sense. So what I did is take away that little input line for how many files you want. As you can see, I got quite a few files produced here by not doing that correctly, so don't do that. Essentially, we're processing some training files: we're separating 90 percent of the file names on the left side and 10 percent of the names on the right side into two different arrays of file names, and then we're processing each of those arrays. I took away that little bit that was asking how many files per split you want, and this is effectively the same code with just a few tweaks. So I'm going to go ahead and run this data extract. Cool, so we've got an
output train, and then after this it's going to do the output validation set, so I'll come back after this is done. So, awesome, I've now got both of these splits, output train and output val. Just to confirm they're actually the right size: we got 38.9 and then 4.27. So if we take 38.9 divided by nine, we get 4.32, and that's pretty close to 4.27, so we can confirm that these are pretty much the sizes we expect them to be. So, awesome, we have this vocab.txt file, wonderful. Now what we have to focus on is getting this into our batches. So when we call our get_batch function... actually, let me cd out of this and open this in Jupyter Notebook. Let's copy my desktop path, paste it over here, and perfect. So there's our openwebtext folder with these files, awesome, and our gpt-v1. So this get_batch function is going to have to change, these are going to have to change as well, and this one too. These are probably not going to be here. But first of all, let's go ahead and get this vocab.txt in. So I'm going to open openwebtext/vocab.txt. Cool, so that's our vocab right there: we read the text, and vocab size is the length of that. Nice. Next, we're going to change this get_batch function around. First of all, I'm going to get rid of this here and get rid of that data, and I've actually produced some code specifically for this, so I'm just going to go find this folder. Okay, so I produced this code off camera, but pretty much what it's going to do is let us call a split. We have our get_batch function, and all of this down here is the same as in our gpt-v1 file, and then this data is just going to be a random chunk of text, a giant block of text. The way that we get it is actually pretty interesting: it's something called memory
mapping. Memory mapping is a way to open disk files and look at pieces of them without loading the entire thing at once. I'm not a hardware guy, so I can't really talk about the details, but memory mapping is pretty cool: it allows us to look at little chunks at a time in very large text files. So that's essentially what we're doing here. We're passing in this split. File name is equal to train split, which is just an example text file: if the split is equal to train, then this is our file name, else it's val split. Then we're going to open this file name in binary mode. This has to be in binary mode, and it's also a lot more efficient in binary mode. Then we're going to open this with mmap. I don't expect you to memorize all the mmap syntax, you can look at the docs if you would like, but I'm just going to explain logically what's happening. We open this with the mmap library, and we open it as mm. The file size is literally just the length of it, so that's determining the file size. From this point, we're just finding a position: we're using the random library to find a position between zero and the file size minus block size times batch size. So we have this giant text file, and what we want to do is start anywhere from zero up to just before the end, because then if we sample that last piece, it's still going to have some wiggle room to reach further into the file. If we let the start go all the way from the very start of the file to the very end, it would want to look past the end to read more tokens, and then we would just get errors, because you can't read more than the file size, if that makes sense. So that's why I'm making this little threshold here, and that's what that does. The starting position
could be a random number between the start and a little margin from the end here. Next up we have this seek function. Seek goes to the start position, so that's where the read is going to start, and then the read function grabs a block of text at that starting position. It's going to have the same number of bytes as block size times batch size, minus one. The minus one is just so that it fits from this start position and we don't get errors here. So we'll get a pretty decent amount of text, I guess you could say, enough to work with. You could of course increase this if you wanted to: you could do times eight here and then times eight up here, but we're not going to do that. Based on my experience this has performed pretty well, so we're going to stick with this method. Then we just decode this bit of text. The reason we decode it is that we read it in binary form, so once we have this block of text, we have to decode it to UTF-8 format, and any decoding errors we get, we're just going to ignore. This is something you learn through practice: when you start dealing with really weird data, or data that has corruptions in it, you'll get errors. So all this does is say, okay, we're going to ignore this bit of text, sample everything around it, and not include that part. Plus, since we're doing so many iterations, it won't actually interfere that much, so we should be all right. And then for this little replace function here, I was noticing I
got errors about this \r, so all this does is replace that with an empty string. Finally, we have all this decoded data, so all we're going to do is encode it into the tokenized form, integers with the torch.long data type. So instead of a bunch of characters, our data is just a bunch of numbers, and then we return that into our get_batch, and this is what our data is. So that's pretty cool: we can get either the train or the val split, and that's how we sample from very large text files at a smaller scale, bit by bit. So let's go ahead and implement this here. I'm going to grab this entire thing, pop over to here, and replace that, so we have get_random_chunk and get_batch. Cool. Now, before we run this, there's a little something we need to change in here. The code uses train_split.txt and val_split.txt, so I actually need to change these. I'll rename the files to train_split.txt and val_split.txt. Cool, and then in the code we can just add openwebtext/ in front, and the same thing for here. Cool, let's go ahead and run this now. Oh, and we're getting errors: mmap is not defined. Okay, so that's another thing we need to add in. I'm going to stop this process from running, and we're going to go pip install mmap... oh, we don't actually need to install this, it comes with Python by default. So we just close this gpt-v1. Awesome, everything is good, nothing is broken, sweet. What I actually need to do up here is import it, so I go import mmap, just like that, and we should be good to start running the script. "Name random is not defined": again, another import we have to make, import random, and we should start seeing some progress going here.
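As a rough sketch of the memory-mapped sampling step described above: the function name `get_random_chunk_text` and its parameters are assumptions made for this example, and the final tokenization step (encoding the characters into a `torch.long` tensor) is left out so the snippet runs with the standard library alone. It also assumes the file is at least `block_size * batch_size` bytes long.

```python
import mmap
import random


def get_random_chunk_text(filename, block_size, batch_size):
    """Read a random window of text from a large file without loading it all.

    Hypothetical helper for illustration; assumes the file holds at least
    block_size * batch_size bytes.
    """
    chunk_bytes = block_size * batch_size
    with open(filename, "rb") as f:  # mmap needs a file opened in binary mode
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            file_size = len(mm)
            # Leave a margin at the end so the read never runs past the file.
            start_pos = random.randint(0, file_size - chunk_bytes)
            mm.seek(start_pos)
            block = mm.read(chunk_bytes - 1)
    # The window may cut a multi-byte character in half, so undecodable
    # bytes are ignored; carriage returns are stripped afterwards.
    return block.decode("utf-8", errors="ignore").replace("\r", "")
```

In the training script the returned string would then be passed through the character-level encoder to produce the integer data that get_batch slices into inputs and targets.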
Once we see the first iteration, I'm going to stop it, come back at the last iteration, and then we'll start adding some little bits and pieces onto our script to make it better. So we're already about 600 iterations in, and you can see how the training loss has done really well so far: it's dropped from 10.5 all the way to 2.38. And we might actually be able to get a val loss that is lower than the train loss, because keep in mind that in train mode the dropout takes effect, but in eval mode... let me just scroll up to this here, yes, model.eval(). What this does is turn off the dropout, so we don't lose any of the neurons, and they're all giving all the information that they're supposed to because they're all active, whereas in train mode 20% of them are off. So once you see that it does better in eval mode, that means the network has started to form a sense of completeness in its learning, and it's just adjusting things a little bit once it hits that point. We might see this happen momentarily, but this is really good progress so far. A loss of 1.8 is amazing. In the meantime, I'm going to add some little tweaks here and there to improve this script. So I've stopped the iteration process, but we got to 700 steps, and we can already see that val loss is becoming less than train loss, which suggests that the model is converging and doing very well. So this architecture is amazing. We've pretty much covered everything in terms of architecture, math, and PyTorch that this script has to offer. There are a few things I want to add, one of them being torch.load and torch.save. One thing that's going to be really important when you start to scale up your iterations is that you don't just want to run a script that executes a training loop with an architecture and that's it. You want to have some way to
store those learnable parameters, and that's what torch.load and torch.save do. You save to some file, in a serialized format. You take your initial architecture, which in our case would be the GPT language model, and you save that, because it contains everything: all these other classes are inside of GPTLanguageModel. You'd save that architecture and essentially serialize it into a pickled file with the file extension .pkl. Instead of using torch, we're just going to use a library called pickle, because they're essentially doing the same thing here. Pickle is a little bit easier to use, or at least a little bit easier to understand, there's less to it. Pickle will only work on one GPU, so if you have something like eight GPUs at the same time, you're going to want to learn a little bit more about the hardware side and read some PyTorch docs. But if we want to save this after training, we're going to use this little library called pickle, and it comes pre-installed with Python, so we just import pickle. Okay, so what we want to do is implement this after the training loop, after all these parameters have been updated and learned to the fullest extent. After the training loop, we simply go with open, and we name the file model-01, like that, and then .pkl is the file extension for it. Since we're writing to it, we go write binary, as f. Then to actually save it, we go pickle.dump, pass in model, and then f, like that. If I record this training process, it's going to make my clip lag, so I'm going to come back to this after we've done, let's just say, about 100 iterations. We're going to do 100 iters, and I'm going to come back
and show you guys what the model looks like. What I actually did is change some of the model hyperparameters, because it was taking way too long to do what we wanted. I changed n_head to one and n_layer to one, and I halved batch size from 64 down to 32. What I'm also going to add here, because I like to print this out at the beginning, is print(device), just to make sure that the device is CUDA. Let's go back down. So it did in fact train the model, we got all this done, and I think it ended at 2.54 or whatever the loss was. Okay, so, model saved, awesome. What does this actually look like? This model .pkl file is 106 megabytes, isn't that wonderful? So this is our model file, this is what they look like. It's just serialized: pretty much the entire architecture, all the parameters of the model, the state, everything that it contains. We pack that into a little .pkl file, then later take it out, unpack it, and use it again with all those same parameters. So, awesome. All this really took was: we open the file, we do a pickle.dump, and then, just to make sure that it actually saved, I like to add a little print statement there. Cool. Next, what I'd like to add is a little way for us, instead of doing all of our training at once and then saving the model, to be able to train multiple times. So I'm going to go up here to our GPT language model, and I'm going to go with open, model-01.pkl, in read binary mode, so we're actually going to read it and load it into our script. I go as f, and then I believe it's pickle.load, yeah, model equals pickle.load, and we put f in there. Before that we go print, loading model parameters, dot dot dot, and once it is loaded we'll do print, loaded
successfully. Okay, cool, so I'm going to try this out now. Go do that, boom, boom, and boom. Okay, so "loading model parameters", "loaded successfully", and we'll see this start to work on its own now. Is it going to begin or is it not going to begin? Let's run that. Okay, perfect. So now we should pick up from the loss that we had before, which was about 2.54, I believe, something along those lines. You can see that our training process is greatly accelerated. We had 100 iterations, and now it's just going to do an estimate loss. Cool, we're almost done: 1.96, awesome, and the model is saved. So essentially we can now save models, load them, and then iterate further. If you wanted to, you could create a super cool GPT language model script, give it 10,000 or 20,000 iterations to run overnight, save it, and then import that into, say, a chatbot. So that's pretty cool, and it's kind of essential for language modeling, because what's the point in having a machine learning model if you can't actually use it and deploy it? You need to save for this stuff to work. All right, now let's move on to a little something in this Task Manager here which I'd like to go over: this shared GPU memory and this dedicated GPU memory. Dedicated means how much VRAM, video RAM, your GPU actually has on the card. On the card it's going to be very quick memory, because the electrons don't have to travel as far, that's kind of the logic of it, since the little RAM chips are right there. So dedicated GPU memory is a lot faster. Shared GPU memory is essentially, if the dedicated memory gets overloaded, it'll use some of the RAM on your computer instead. This will typically be about half of
your computer's RAM. I have 32 gigabytes of RAM on my computer, so 16.0 makes sense, half of 32. You want to make sure you're only using dedicated GPU memory. Having your shared GPU memory go up is not usually a good thing. A little bit is fine, but dedicated GPU memory is the fastest, and you want everything to stay on there, so try to make sure all of your parameters fit within whatever your max capacity is. Maybe it's four, maybe it's eight, maybe it's 48, who knows. A good way to figure out the highest amount of memory you can use on your GPU without getting memory errors or spilling into shared memory is to play around with these parameters up here. So block size and batch size... actually, let me switch those around, these are not supposed to be in that order, but all good. We make our batch size 64 and block size 128. Okay, so batch size and block size are very big contributors to how much memory you're going to use. Learning rate is not, max iters is not, eval iters is not, but these three will be: the number of features that you store, the number of heads you have running in parallel, and then also n_layer. Some of these will not affect you as much, because they're more tied to computation, how quickly you can do operations when something is sequential. So n_layer won't strain you as much as something like batch size and block size, but those are all good little things to tweak and play around with. I found the optimal set of hyperparameters for my PC, and that happens to be 8, 8, and 384, learning rate stays the same, and then 64 and 128 for these. It'll probably be different for yours if you don't have eight gigabytes of memory on your GPU. So anyway, that's something you have to pay attention to, to make sure you don't run out of memory. And a technique you can use, which I'm not actually going to show you in
this course but which is quite useful, is something called auto-tuning. What auto-tuning does is run a bunch of models with different sets of hyperparameters. So it'll run batch size 64, batch size 32, batch size 16, maybe batch size 256, and it'll check which ones are throwing errors and which ones aren't. If you properly set up an auto-tuning script, you will be able to find the best set of hyperparameters that is possible for your computer. So auto-tuning is cool, you could definitely look more into that, there's tons of research on it. Let's dig into the next little cool trick that's used in practice, especially by machine learning engineers: it's a little something called arguments. You pass an argument not necessarily into a function, but into the command line. This is just a basic example of what arg parsing will look like. So I go python and then arg parsing, because that's the script's name, then -llms, because that's what it says right here, that's what the argument is, and then we can pass in a string, say, hello. It prints that the provided value is hello. So, cool, you can add little arguments like this. I'm going to change this around: let's say batch size, and the help text will be "Please provide a batch size". I'll run the same thing again, and see, it says the following arguments are required: batch size. So that obviously didn't work. If we try it the correct way, the arg parsing script and then the dash batch size argument, we can make it 32... oh, that's because it's not a string. And it's bs somewhere. Okay, so args, parse_args: we need to change this to bs, like that. Now it prints batch size is 32. Okay, so even I'm a little bit new to arguments as well, but this is something that comes in very handy when you're trying
to, you know, change some parameters each time. Say you add a new GPU or whatever, and you're like, oh, I want to double my batch size: sure, you can easily do that. A lot of the time you won't just have one, you'll have many, maybe a dozen or so of these little arguments. So that is what this looks like, and we're going to go ahead and implement this in our little script here. I'm just going to pop over to gpt-v1, pull this up on my second monitor, and start off by importing argparse, that's what it's called. Then we go parser is equal to... I'll just copy and paste this entire thing, why not. Cool. Okay, so we get a batch size argument, and then we'll add in the second part here, so args parses the arguments, and our batch size is equal to args.batch_size. So, cool, I'm going to run this, and... not defined. Oh yes, so I got a little not-defined error here, and pretty much all I missed was assigning what parse_args returns, so essentially this should be equal to this right here. I'm going to go ahead and copy that and put in parse_args. Except we don't have a parse_args function of our own, so what do we need to do instead? Well, that might just work on its own, let's try it out. Okay, so it looks like it's actually expecting some input here in the notebook, so that's probably working, and if we ported this into a script, it would simply ask us for the argument. So I believe we're doing this correctly. Let's go ahead and switch over and port all of this into some code. I'm going to make a training file and a chat file. The training file is going to have all of our parameters, all of our architecture, and then the actual training loop itself, and we're going to have some arguments in there. And then the chatbot
is going to be pretty much just a question-answer thing that produces text, a prompt-completion type of thing. So let's go ahead and implement that. In our GPT course folder here, I'm going to create training.py and chatbot.py, just like that. In training, let's go ahead and bring everything in. I'm going to move this over to the second screen and copy and paste everything in, in order. So next up we have our characters, then our tokenizer, then our get_random_chunk and get_batch. Sweet. Then our estimate_loss function, and then this giant piece of code containing most of the architecture we built up, so I'm just going to add that in there. Cool, we're not getting any warnings. Then the training loop and the optimizer. Awesome. After this we would have this context, but the point is that we want to have that in our chatbot script. So in this training.py, I'm going to keep this entire thing the same and get rid of this little block of code, and we're going to go into the chatbot. So, loading model parameters, good: in training, we want to load some in, train some more, and then dump it. The chatbot is not going to dump anything, it's just going to load. So I'm going to take all of our training code here, and instead of dumping, take that away, and I'll take away the training loop as well. Okay, I don't believe we have anything else to bring in. We don't need our get_batch, and we don't need our get_random_chunk. So, awesome, we're just importing these parameters by default like that, and from this point we've loaded our model. Cool. So let's go ahead and port in our little chatbot, this little end piece which is going to allow us to chat with the model. This is what it looks like: a little while loop. We have a prompt, so we just input something, "Prompt:" and a new line, and that should be fairly
self-explanatory. Then we have this tensor: we're going to encode this prompt into a bunch of integers, torch.long data types, on the GPU, where device is CUDA. Then we call model.generate, and we unsqueeze the context first. Remember, it's a torch tensor, and the model works with batches, so when we unsqueeze it, we're actually adding that wrapping around it, the extra batch dimension the model expects. Then we're doing max new tokens, for example 150 here, then converting to a list, and then we can just print these out as generated characters. Awesome. So it's just going to ask us for a prompt, do some compute, give us a completion, and so on and so forth. That's what this is doing here. Another thing I wanted to point out is that when we load these parameters in, at least in training, it's going to give us errors at first if we don't have a model file to load from, because there just won't be anything there to import. So that's going to give you errors first of all. Another thing you want to pay attention to is making sure that the architecture and the hyperparameters you used when you trained this initial model match the ones you're using when you load it up again. When you're running your forward pass and whatnot, you want to make sure that the architecture lines up, just so you don't get any architectural errors, because those can be really confusing to debug. And the way we can handle that first run is just by commenting the loading code out here. So, awesome, we're able to save and load models, and we're able to use a little loop to create a sort of chatbot. It's not really helpful yet, because we haven't trained it an insane amount on data that is actually useful. So another little
detail that's very important is to make sure that you have nn.Module in all of these classes and subclasses. nn.Module basically works as a tracker for all of your parameters. It makes sure that all of your nn extensions run correctly, and overall it's a cornerstone of PyTorch, like, you need it. So make sure you have nn.Module in all of these classes. I know that Block sort of comes out of GPTLanguageModel and so on, but all of these classes with nn layers or any learnable parameters will need it. It's just good practice to have nn.Module in all of your classes to avoid those errors. So, cool. I didn't explicitly go over that at the beginning, but that's a heads-up: you always want to make sure nn.Module is inside of these. Now, something I'd like to highlight is a little error that we get when we try to generate with max new tokens above block size. Let me show you that right now. So you go python chatbot with batch size 32, and we could say "Hello!", for example. Okay, so it's going to give us some errors here, and what exactly does this error mean? Well, when we try to generate 150 new tokens, it's taking the previous H, E, L, L, O, exclamation mark, six tokens, and it's adding 150 on top of that. So we have 156 tokens that we're now trying to fit inside block size, which in our case is 128, and of course 156 does not fit into 128, which is why we get some errors. So what we could do is make sure that max new tokens is small enough and then pay attention when we make prompts, or we could make a little cropping tool here. What this will do is crop to the last block size tokens, and this is super useful, because we don't have to pay attention to max new tokens all the time. It just crops the context to that 128
limit. So I'm going to go ahead and replace index here with index_cond, or index condition, and run this again. I could say "hello", and we get a successful completion, awesome. We can keep asking new prompts. So we're not really getting any of these dimensionality, architecture-fitting type errors, if you want to make it sound super fancy. There's not really that much else to do here, but there are a few points I want to go over, including fine-tuning. I'm going to go over a little illustrative example of what fine-tuning actually looks like in practice. In pre-training, which is what this course is based on, you have this giant text corpus, and essentially what you do is take out little snippets. These are called blocks or chunks, you could say. You sample random little blocks, and you take multiple batches of them. Let's just say the input is H, E, L, L, O, and the targets are E, L, L, O, exclamation mark, so it's just shifted over by one. Given this sequence of characters, you want to predict this one, which is just the input shifted by one. That's what pre-training is. And keep in mind that these are the same size: this is one, two, three, four, five, and same thing here, these are both five characters long. Fine-tuning, however, is not completely the same. I could have "hello" and then maybe a question mark, and the model might respond "how are you". We can see that "hello" does not have the same number of characters, the same number of indices, as "how are you". So this is essentially the difference between fine-tuning and pre-training. With fine-tuning, you just have to add a little bit of different
things in your generate function to compensate for not having the same number of indices in your inputs and targets, and instead just generate until you receive an end token. What's not explicitly shown here is that at the end of this question there's actually a little end token, which usually looks like this, or like this. These are end tokens. And then you typically have a start token in the same style, like an s or start, like that. Pretty simple. Essentially you just append them. The start token doesn't matter as much, because we're just looking at what the prompt is and then we start generating. We don't really need to know when to start generating, it just happens. But the end token is important, because we don't want to generate an infinite number of tokens, right? Because these aren't the same size, it could theoretically generate a really, really long completion. So all we want to make sure is that it's not generating an infinite number of tokens and consuming an infinite amount of computation, and to prevent that loop. That's why we append this end token to the end here, and once this end token is sampled, you end the generation, simple as that. And we don't actually sample the token text itself, but rather the index, I guess you could say, the numerical value, the encoded version of end, which is usually just going to be your vocab size plus one. So if your vocab size is, say, 32,000, your end token would be at index 32,001. That way, when you sample that 32,001 token, you just end the sequence. And of course, when you train your model, you're always appending this end token to the end. So you get your initial inputs, and then inside of either your training data
or when you're actually processing it and feeding it into the transformer, you have some sort of function that's appending that little 32,001 token index to it. So that pretty much sums up fine-tuning. The whole point of creating these giant language models is of course to help people, and there's no better way to do that than to have all the information that humans have ever known, meaning Common Crawl, OpenWebText, Wikipedia, and even research papers, and pre-training on all of that. So again, that's same-size inputs shifted over by one for the targets, and then after you've iterated on that many, many times, you switch over to fine-tuning, where you have these specifically picked out prompt and completion pairs, and you train on those for a really long time until you are satisfied with your result. And yeah, that's what language modeling is. There are a few key pointers I want to leave you with before you head on your way to research and development in machine learning. First things first, there's a little something called efficiency testing, or just finding out how long certain operations take. I'll show you exactly how to do this right here. Efficiency... I don't know if I spelled that correctly, I don't know why it's doing that. Okay, anyway, we'll pop into code here, and we'll call this one time testing. We go import time, and all we're going to do is time how long operations take. In here you can go start time equals time.time(), and what this function does is take a look at the current time right now, very precise, down to fractions of a second. Then we can do some little operation, like, I don't know, for i in range 10,000, print i, or print i times two. Okay, and then we can end the time here, so we'll go end
time equals time.time() again, calling the current time. So we're comparing right now versus back then, and that little difference is how long it took to execute. Then we can say total time equals end time minus start time, and we'll go print, uh, time taken, and pass in total time, like that. Just execute this: python time testing. Cool. Time taken: 1.32 seconds.
So you can essentially time every single operation you do with this method. I encourage you to actually try this out. I'm not going to, but I encourage you to try out how long the model takes to do certain things. How long does it take to load a model? How long does it take to save a model? How long does it take to estimate the loss? Play around with hyperparameters, see how long things take, and maybe you'll figure out something new, who knows. This is a little something we use to test how long something takes, how efficient it is, and also to see whether it's worth investigating a new approach in case something takes a ridiculous amount of time. So that's time testing and efficiency testing for you.
The next little bit I want to cover is the history. I'm not going to go over the entire history of AI and LLMs, but essentially we originated with something called RNNs. RNN stands for recurrent neural network, and they're really inefficient, at least for scaled AI systems. Think of an RNN as a little loop that keeps learning and learning, and it's sequential, right? It does this, and then this, and then this, and it has to wait for each step to complete. It's synchronous: you can't process all the positions in a sequence at once, because each step depends on the one before it. GPUs are designed for lots of simple math in parallel, like matrix multiplication, so RNNs can't take full advantage of them the way transformers can. So RNNs are essentially a little bit dumber than transformers, and they're much harder to scale. So RNNs were where we last sort of
stopped at, and what I encourage you to do is look more into the language modeling and AI history and research that has led up to this point, so you have an idea of how researchers have been able to innovate so quickly given all these historical innovations. You have all these things leading up to the transformer. Well, how did they all philosophize up to that point? It's just something good to be confident in, innovating as a researcher, an engineer, and a business person. So, cool. RNNs were where we sort of finished off, and now it's transformers and GPTs. That's the current state of AI.
Next up, I'd like to go over something called quantization. Quantization is essentially a way to reduce the memory used by your parameters. There's a paper here called QLoRA: Efficient Finetuning of Quantized LLMs. All this does, in simple form, is that instead of using 32-bit floating point numbers, it goes not only to 16-bit half precision, but all the way down to four bits. What this actually looks like in binary, uh, somewhere in here there's an array of numbers that it uses. Okay, I can't find it, but pretty much it's a bunch of floating point numbers, all between negative one and one, and there are 16 of them. A four-bit number can hold 16 different values, zero through 15. So you have this array of floating point numbers, and you use the four-bit number as an index to look up a value in that array, and that is the weight used in your model. So this way, instead of using 32 bits and having these super long numbers that are super precise, you have a small set of values that are just generally good parameters to have and that perform decently. They're sort of well spread out and experimented on, and they just
happen to work, and you have 16 of them instead of, you know, a lot. So that's another cool little thing that's going on right now, four-bit quantization. It's a little bit harder to implement, so I'd encourage you to experiment with half precision first, meaning 16-bit floating point numbers, where each number occupies 16 bits, 16 on-and-off switches, in your GPU's memory. And yeah, quantization is a cool way to scale down the memory so that you can scale up your hyperparameters and have a more complex model, essentially bigger models with less space taken up. So that is quantization, and this is the paper for it. You can search for it if you want to get more familiar with it and see performance numbers and whatnot.
The next thing I'd like to cover is gradient accumulation. You might have heard of this, you might not. What gradient accumulation does is accumulate gradients over, say, x iterations, where x is just a variable we set. So every x iterations it takes the gradients it has accumulated and averages them. Instead of updating on each iteration, you're updating every x iterations. That lets each parameter update draw on more data, so it can generalize over what is effectively a higher batch size or a higher block size than would fit in memory at once. When you distribute the work over many iterations and average them, each update behaves as if it had been calculated on all of them combined. So yeah, that's a cool little trick you can use if your GPU isn't as big, if it doesn't have as much VRAM on it. Gradient accumulation is wonderful, and it's used lots in practice.
The final thing I'd like to leave you guys with is something called Hugging Face. You've probably heard a lot about this so far, but let me guide you through and show you how absolutely
explosive Hugging Face is for machine learning. You have a bunch of models, datasets, Spaces, docs, etc. Let's go to Models, for example. Just to showcase how cool this is, you have multimodal AIs, which could be image and text, or video, etc. You have multiple different modes, so it's not just text or just video, it's many different ones at the same time. So you have multimodal models, you have computer vision, you have natural language processing, which is what we're doing in this course, and you have audio, tabular, and reinforcement learning. This is really cool, and you can actually download these models and host them on your own computer.
You also have datasets, which are even cooler. For our purposes, these are pretty much just really high-quality datasets of prompt and answer completions, if you want to use those. You have question answering, or conversational. If I go to the OpenOrca dataset, for example, that's 9,000 downloads and 500 likes. It has a bunch of IDs, a system prompt, so "you are an AI assistant", whatever, and then you have the cool stuff, which is "you will be given a definition of a task first, and then some input of the task", etc. And then the response. So we just gave it an input and asked it to answer in a format, and it actually did that correctly. You could pretty much train on a bunch of prompts that you would be able to feed into GPT-4 and try to make your model perform that way. And this actually has 4.23 million rows in the training split, which is amazing.
Okay, so datasets are wonderful, and among the fine-tuning datasets, OpenOrca is really good. As for pre-training, I believe I mentioned this earlier, in the Survey of Large Language Models paper, if we just go down through it: Reddit links, yes. You could use OpenWebText, you could use Common Crawl, you could use books, you could use Wikipedia. These are all
pre-training data sources. So yeah, hopefully that leaves you with a better understanding of how to create GPTs, transformers, and pretty good large language models from scratch, with your own data that you scraped or downloaded. And yeah, that's it. Thanks for watching.
So you've learned a ton in this course about language modeling: how to use data, how to create architectures from scratch, maybe even how to look at research papers. If you really enjoyed this content, I'd encourage you to subscribe and like on my YouTube channel, which is in the description. I make many videos about AI, programming, and computer science in general, so feel free to subscribe there. If you don't want to subscribe, that's fine. You can always unsubscribe later if you want to. It's completely free. I also have a GitHub repo in the description with all the code that we used. Not the data, because it's way too big, but all of the code and the Wizard of Oz text file. So that is all in the GitHub repo in the description. Thanks for watching.
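Going back to the end token from the fine-tuning part, here is a rough sketch of the idea in Python. It is not the code from the course: the tiny vocabulary, the names START_TOKEN, END_TOKEN, append_special_tokens, toy_next_token and generate, and the random stand-in for the model are all made up for illustration. It assumes ordinary tokens use indices 0 to vocab size minus one, the start token takes the next free index, and the end token sits at vocab size plus one.

```python
import random

VOCAB_SIZE = 8                # toy vocabulary: ordinary tokens are 0..7 (assumption)
START_TOKEN = VOCAB_SIZE      # hypothetical choice: first free index after the vocab
END_TOKEN = VOCAB_SIZE + 1    # "vocab size plus one", as described in the talk


def append_special_tokens(token_ids):
    """Wrap an encoded sequence with the start and end token indices."""
    return [START_TOKEN] + list(token_ids) + [END_TOKEN]


def toy_next_token(context, rng):
    """Stand-in for a trained model: returns an ordinary token or END at random."""
    return rng.choice(list(range(VOCAB_SIZE)) + [END_TOKEN])


def generate(prompt_ids, next_token_fn, max_new_tokens=50, seed=0):
    """Sample tokens until the end token appears or the hard cap is reached."""
    rng = random.Random(seed)
    context = [START_TOKEN] + list(prompt_ids)
    completion = []
    for _ in range(max_new_tokens):      # hard cap, so generation can never run forever
        token = next_token_fn(context, rng)
        if token == END_TOKEN:           # sampling the end index stops generation
            break
        completion.append(token)
        context.append(token)
    return completion
```

The max_new_tokens cap is an extra safety net I added on top of the end token, so that even a model that never samples the end index still stops.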
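For the time testing from earlier, a minimal sketch of that script could look like the following. The function name time_operation is my own, and the loop adds up i times two instead of printing it, just to keep the output short; otherwise it follows the same start time, end time, total time pattern.

```python
import time


def time_operation(n=10_000):
    """Time a simple loop using the start/end pattern from the walkthrough."""
    start_time = time.time()            # wall-clock time before the work
    total = 0
    for i in range(n):
        total += i * 2                  # stand-in for print(i * 2) in the walkthrough
    end_time = time.time()              # wall-clock time after the work
    total_time = end_time - start_time  # elapsed seconds
    return total, total_time


if __name__ == "__main__":
    total, total_time = time_operation()
    print(f"Time taken: {total_time:.4f} seconds")
```

For very short operations, time.perf_counter() from the same module is usually a better clock than time.time(), because it is meant for measuring intervals.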
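To make the four-bit lookup idea from the quantization part a bit more concrete, here is a toy sketch. Big caveat: the 16 values below are just evenly spaced between negative one and one, which is my assumption for illustration and not the table from the QLoRA paper. Dividing by the largest absolute weight to get everything into that range is also an assumption on my part, and the names LEVELS, quantize and dequantize are hypothetical.

```python
# 16 evenly spaced levels in [-1, 1]. Assumption: NOT the values used in the QLoRA paper.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize(weights):
    """Map each weight to the index (0..15) of the nearest level after scaling."""
    scale = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero for all-zero input
    codes = []
    for w in weights:
        x = w / scale                             # now in [-1, 1]
        code = min(range(16), key=lambda i: abs(LEVELS[i] - x))
        codes.append(code)                        # each code fits in four bits
    return codes, scale


def dequantize(codes, scale):
    """Look each four-bit code up in the table and undo the scaling."""
    return [LEVELS[c] * scale for c in codes]
```

So each weight is stored as a number from 0 to 15 plus one shared scale, and the actual floating point value is only looked up when it is needed.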
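And for gradient accumulation, here is a small sketch in plain Python, with no GPU involved. It assumes a made-up one-parameter model, y_hat equals w times x, trained with mean squared error, and the names grad_mse and train are hypothetical. The point is only to show the pattern: add up the gradients from x small batches, average them, and update once.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the one-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, micro_batch_size, accum_steps, lr=0.01, epochs=200):
    """Gradient descent that only updates w every `accum_steps` micro-batches."""
    w = 0.0
    accumulated, seen = 0.0, 0
    for _ in range(epochs):
        for start in range(0, len(xs), micro_batch_size):
            xb = xs[start:start + micro_batch_size]
            yb = ys[start:start + micro_batch_size]
            accumulated += grad_mse(w, xb, yb)        # accumulate, do not update yet
            seen += 1
            if seen == accum_steps:
                w -= lr * (accumulated / accum_steps)  # one update from the averaged gradient
                accumulated, seen = 0.0, 0
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                              # true relationship: y = 2 * x
w_accum = train(xs, ys, micro_batch_size=2, accum_steps=2)
w_full = train(xs, ys, micro_batch_size=4, accum_steps=1)
```

With equal-sized micro-batches, averaging two batches of two gives the same gradient as one batch of four, so w_accum and w_full end up at essentially the same value. In PyTorch the usual way to get the same effect is to divide the loss by the number of accumulation steps, call backward on every iteration, and only call the optimizer step and zero the gradients every x iterations.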