Transcript for:
Understanding Model Fine-Tuning Techniques

Hey everyone, I'm Shaw, and this is the fifth video in the larger series on how to use large language models in practice. In the previous video, we talked about prompt engineering, which is concerned with using large language models out of the box. While prompt engineering is a very powerful approach and can handle a lot of LLM use cases in practice, for some applications, prompt engineering just doesn't cut it. For those cases, we can go one step further and fine-tune an existing large language model for a specific use case. So the natural question is, what is model fine-tuning? The way I like to define it is taking a pre-trained model and training at least one internal model parameter, and here I mean the internal weights or biases inside the neural network. What this typically looks like is taking a pre-trained existing model, like GPT-3, and fine-tuning it for a particular use case, for example ChatGPT. To use an analogy here, GPT-3 is like a raw diamond right out of the earth. It's a diamond, but it's a bit rough around the edges. Fine-tuning is taking this raw diamond and transforming it into something a bit more practical, something that you can put on a diamond ring, for example. So the process of taking the raw base model of GPT-3 and transforming it into the fine-tuned model of GPT-3.5 Turbo, for example, is what gives us applications like ChatGPT or any of the other incredible applications of large language models we're seeing these days. To get a more concrete sense of the difference between a base model like GPT-3 and a fine-tuned model, let's look at this particular example. We have to keep in mind that these foundation large language models, like GPT-3, Llama 2, or whatever your favorite large language model is, are strictly trained to do word prediction: given a sequence of words, predict the next word. So when you train one of these large language models on a huge corpus of text, documents, and web pages, what it essentially becomes is a document completer. What that translates to in practice is that if you plug into one of these base models, like GPT-3, the prompt "Tell me how to fine-tune a model," a typical completion might look something like this, where it's just listing out questions like you might see in a Google search or maybe a homework assignment. Here, when I prompted GPT-3 to tell me how to fine-tune a model, the completion was as follows: How can I control the complexity of a model? How do I know when my model is done? How do I test the model? While this might be reasonable for GPT-3 to do based on the data that it was trained on, it isn't something that's very practical. Now let's look at the fine-tuned model's completion. Here we have text-davinci-003, which is just one of the many fine-tuned models based on GPT-3 coming from OpenAI. We give it the same prompt, "Tell me how to fine-tune a model," and this is the completion: Fine-tuning a model involves adjusting the parameters of a pre-trained model in order to make it better suited for a given task. There are generally three steps involved in fine-tuning a model: select a base model, adjust parameters, train the model. While this completion may not be perfect, it's much more aligned with what we were hoping to get out of the language model compared to the base model's completion.
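To make this comparison concrete, here is a minimal sketch of how you might query both models yourself. This isn't the exact code from the video; it assumes the older (pre-1.0) openai Python SDK, an API key, and that you still have access to the legacy completion models, with "davinci" as the base GPT-3 model alongside the fine-tuned "text-davinci-003".

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder, set your own key

prompt = "Tell me how to fine-tune a model."

# Compare the base GPT-3 model ("davinci") with a fine-tuned variant ("text-davinci-003").
for model_name in ["davinci", "text-davinci-003"]:
    response = openai.Completion.create(
        model=model_name,
        prompt=prompt,
        max_tokens=64,
        temperature=0.7,
    )
    print(f"--- {model_name} ---")
    print(response["choices"][0]["text"].strip())
```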
So if you want to learn more about model fine-tuning and how OpenAI did their fine-tuning, their alignment tuning, and instruction tuning, check out the references in the description and comment section below. As we saw when comparing a base model to a fine-tuned model, the fine-tuned model can generate completions that are much more aligned and desirable for our particular use case. Beyond this performance, there's actually a deeper reason why you might want to fine-tune, and that is the observation that a smaller fine-tuned model can often outperform a larger base model. This was demonstrated by OpenAI with their InstructGPT model, where their small 1.3-billion-parameter fine-tuned InstructGPT model generated completions that were preferred to GPT-3's completions, even though GPT-3 had about 100 times as many internal parameters. This is one of the biggest upsides of fine-tuning: you don't have to rely on some massive general-purpose large language model to have good performance in a particular use case or application. Now that we have a better understanding of what fine-tuning is and why it's so great, let's look at three possible ways one can fine-tune an existing large language model. The first is via self-supervised learning. This is the same way these base models and foundation large language models are trained. In other words, you get your training corpus of text and you train the model in a self-supervised way: you take a sequence of text like "listen to your," you feed it into the model, and you have it predict a completion. If we feed in "listen to your," it might spit out "heart." What differentiates fine-tuning with self-supervised learning from just training a base model through self-supervised learning is that you can curate your training corpus to align with whatever application you're going to use the fine-tuned model for. For example, if I wanted to fine-tune GPT-3 to write text in the likeness of me, maybe I would feed it a bunch of my Towards Data Science blogs, and then that resulting fine-tuned model might be able to generate completions that are more like my style. The second way we can fine-tune a model is via supervised learning. This is where we have a training dataset consisting of inputs and associated outputs or targets. For example, if we have a set of question-answer pairs, such as "Who was the 35th president of the United States?" with the answer "John F. Kennedy," we can use this question-answer pair to fine-tune an existing model to learn how to better answer questions. The reason this might be helpful, as we saw before, is that if we were to just feed "Who was the 35th president of the United States?" into a base model, the completion it might generate is "Who was the 36th president of the United States? Who was the 40th president of the United States? Who is the Speaker of the House?" and so on and so forth. But through having these question-answer pairs, we can fine-tune the model to essentially learn how to answer questions. There's a little trick here, though. These language models are, again, document completers, so we actually have to massage these input-output pairs a bit before we can feed them into our large language model for training. One simple way we can do this is via prompt templates. For example, we could use a template like "Please answer the following question," where the question (our input) goes in one slot and the answer (our target) goes in another, as in the sketch below.
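Here is a minimal sketch of what that templating step could look like in code. The template wording and the helper function name are illustrative, not taken from any particular library or from the video.

```python
# Turn a question-answer pair into a single training prompt using a simple template.
def build_prompt(question: str, answer: str) -> str:
    return f"Please answer the following question.\nQ: {question}\nA: {answer}"

qa_pairs = [
    ("Who was the 35th president of the United States?", "John F. Kennedy"),
]

# Each formatted prompt becomes one document in the training corpus.
training_corpus = [build_prompt(q, a) for q, a in qa_pairs]
print(training_corpus[0])
```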
And then through this process, we can translate our training dataset into a set of prompts, generate a training corpus, and then go back to the self-supervised approach. The final way one can fine-tune an existing model is via reinforcement learning. While there are many ways one could do this, I'm going to focus on the approach outlined by OpenAI in generating their InstructGPT models, which consisted of three steps. The first was supervised fine-tuning, essentially what we were just talking about in the second way to fine-tune a model. This consists of two steps: one, curating your training dataset, and two, fine-tuning the model. The next step was to train a reward model. All this is, essentially, is a model that can generate a score for a language model's completion: if it generates a good completion, it'll get a high score; if it generates a bad completion, it'll get a low score. What this looks like for the InstructGPT case is as follows. You start with a prompt and you pass it into your supervised fine-tuned model from the first step, but you don't just do it once; you actually do it many times, so you generate multiple completions for the same prompt. Then you get human labelers to rank the responses from worst to best, and then you can use that ranking to train the reward model, which is indicated by this square here. The final step is to do reinforcement learning with your favorite reinforcement learning algorithm. In the case of InstructGPT, they used Proximal Policy Optimization, or PPO for short. What this looks like is you take the prompt, you pass it into your supervised fine-tuned model, and then you pass that completion to the reward model. The reward model will essentially give feedback to the fine-tuned model, and this is how you can update the model parameters and eventually end up with a model that's fine-tuned even further. I know this was a ton of information, but if you want to dive deeper into any one of these approaches, check out the blog on Towards Data Science where I go into a bit more detail on each of them. OK, so to keep things relatively simple for the remainder of the video, we'll be focused on the supervised learning approach to model fine-tuning. Here I break that process down into five steps. First, choose your fine-tuning task. This could be text summarization, it could be text generation, it could be binary classification, text classification, whatever it is you want to do. Next, you prepare your training dataset. If you're trying to do text summarization, for example, you would want to have input-output pairs of text and the desired summarization, and then you take those input-output pairs and generate a training corpus using prompt templates, for example. Next, you want to choose your base model. There are many foundation large language models out there, and there are many existing fine-tuned large language models out there, and you can choose either of these as your starting place. Next, we fine-tune the model via supervised learning, and then finally we evaluate model performance. There's certainly a lot of detail in each of these steps, but here I'm just going to focus on step number four, fine-tuning the model with supervised learning. And here I want to talk about three different options we have when it comes to updating the model parameters. The first option is to retrain all the parameters.
Given our neural network, given our language model, we go in and we tweak all the parameters. Perhaps obviously, this comes with a downside: when you're talking about billions, tens of billions, or hundreds of billions of internal model parameters, the computational cost for training explodes. Even if you're doing the most efficient tricks to speed up the training process, retraining billions of parameters is going to be expensive. Another option is transfer learning. This is essentially where we take our language model and, instead of retraining all the parameters, we freeze most of the parameters and only fine-tune the head. Namely, we fine-tune the last few layers of the model, where the model embeddings or internal representations are translated into the target or the output layer. While transfer learning is a lot cheaper than retraining all parameters, there is still another approach we can take, which is the so-called parameter-efficient fine-tuning. This is where we take our language model and, instead of just freezing a subset of the weights, we freeze all of the weights. We don't change any internal model parameters. Instead, what we do is augment the model with additional parameters that are trainable. The reason this is advantageous is that it turns out we can fine-tune a model with a relatively small set of new parameters, as can be seen by this beautiful picture here. One of the most popular ways to do this is the so-called low-rank adaptation approach, or LoRA for short. Like I mentioned on the previous slide, this fine-tunes a model by adding new trainable parameters. Here we have a cartoon of a neural network, but let's just consider one layer: the mapping from these inputs to this hidden layer here. We can call our inputs x, and then we can call the hidden layer essentially some function of x. To make this a bit more concrete, we can write this as an equation: h of x is just equal to some weight matrix, which is a two-dimensional matrix, times x, which we can think of as a vector to keep things simple. To see this a bit more visually, we have our weight matrix, which is some d by k matrix, and we have x, which we'll just take to be a vector in this case. The multiplication of these two things will generate our hidden layer, and for the math to work out, I've also shown here the spaces that these objects live in. So this is what the situation looks like without LoRA: if we're going to do full-parameter fine-tuning, all the parameters in this weight matrix are trainable. Here W naught is a d by k matrix, and let's just say d is 1,000 and k is 1,000. This would translate to 1 million trainable parameters, which may not be a big number, but when you have a lot of layers, this number of trainable parameters can really explode. Now let's see how LoRA can help us reduce the number of trainable parameters. Again, we're just going to look at one of the layers, but now we're going to add some additional parameters to the model. What that looks like mathematically is we have W naught times x is equal to h of x, like we saw on the previous slide, but now we're adding this additional term, delta W times x, where delta W is going to be another weight matrix of the same shape as W naught. Looking at this, you might think, Shaw, how does this help us? We just doubled the number of parameters. Yeah, sure: if we keep W naught frozen, we still have delta W with the same number of parameters to deal with.
But let's say that we define delta W to be the multiplication of two matrices, B and A. In this case, our hidden layer becomes W naught times x plus B A times x. Looking at this more visually, we have W naught, which is the same weight matrix we saw on the previous slide, but now we have B and A, which have far fewer terms than W naught does. What we can do is, through matrix multiplication, generate a matrix of the proper size, namely delta W, add it to W naught, multiply all that by x, and generate our h of x. Looking at the dimensionality of these things, W naught and delta W live in the same space: they're d by k matrices. B is going to be a d by r matrix, A is going to be an r by k matrix, and then h of x is going to be d by 1. The key thing here is this r number, what the authors of this method call the intrinsic rank of the model. The reason this works and we get the efficiency gains is that this r is a lot smaller than d and k. To see how this plays out, unlike before where W naught was trainable, now those parameters are going to be frozen and B and A are trainable. And maybe as you can just tell visually from the area of this rectangle versus the areas of these two rectangles, B and A contain far fewer terms than W naught. To make this a bit more concrete, let's say d is equal to 1,000, k is equal to 1,000, and our intrinsic rank is equal to 2. What this translates to is 4,000 trainable parameters, as opposed to the million trainable parameters we saw on the previous slide. This is the power of LoRA: it allows you to fine-tune a model with far fewer trainable parameters. If you want to learn more about LoRA, check out the paper linked in the description below, or if you want something a bit more accessible, check out the blog on Towards Data Science where I talk about this a bit more. Let's dive into some example code and see how we can use LoRA to fine-tune a large language model. Here I'm going to use the Hugging Face ecosystem, namely pulling from libraries like Datasets, Transformers, PEFT, and Evaluate, which are all Hugging Face Python libraries, and also importing PyTorch and NumPy for some extra things. With our imports in place, the next step is to choose our base model. Here I use distilbert-base-uncased, which is a base model available on Hugging Face's model repository. This is what the model card looks like; we can see that it only has 67 million parameters, and there's a lot more information about it on the model card here. We're going to take DistilBERT uncased and fine-tune it to do sentiment analysis: we're going to have it take in some text and generate a label of either positive or negative based on the sentiment of the input text. To do that, we need to define some label maps. Here we're just defining that zero is going to mean negative and one is going to mean positive, and vice versa, that negative means zero and positive means one. Now we can take these label maps and our model checkpoint and plug them into this nifty AutoModelForSequenceClassification class available from the Transformers library, and very easily we get this base model specifically ready to do binary classification. The way this works is that Hugging Face has all these base models and has many versions of them where they replace the head of the model for many different tasks, and we can get a better sense of this from the Transformers documentation as shown here.
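As a rough sketch of what this setup might look like in code: the checkpoint name distilbert-base-uncased and the exact label strings below are my assumptions about the code described here, not quoted from the video.

```python
from transformers import AutoModelForSequenceClassification

# Base model checkpoint from the Hugging Face model repository.
model_checkpoint = "distilbert-base-uncased"

# Label maps: 0 means negative, 1 means positive, and vice versa.
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# Load the base model with a fresh sequence-classification head for binary classification.
model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
```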
You can see that this AutoModelForSequenceClassification class has a lot of base models it can build on top of. Here we're using DistilBERT, which is a smaller version of BERT, but there are several models you can choose from. The reason I went with DistilBERT is because it only has 67 million parameters and it can actually run on my machine. The next step is to load the dataset. Here I've actually just made the dataset available on the Hugging Face dataset repository, so you should be able to load it pretty easily. It's called IMDB Truncated: it's a dataset of IMDB movie reviews with an associated positive or negative label. If we print the dataset, it looks something like this. There are two parts to it: there's this train part and this validation part, and you can see that both the training and validation datasets have 1,000 rows in them. This is another great thing about model fine-tuning: while training a large language model from scratch may require trillions of tokens, or a trillion words in your training corpus, fine-tuning a model requires far fewer examples. Here we're only going to be using a thousand examples for model fine-tuning. The next step is to preprocess the data. Here, the most important thing is that we need to create a tokenizer. If you've been keeping up with this series, you know that tokenization is a critical step when working with large language models, because neural networks do not understand text; they understand numbers. So we need to convert the text that we pass into the large language model into a numerical form so that it can actually understand it. Here we can use the AutoTokenizer class from Transformers to grab the tokenizer for the particular base model we're working with. Next, we can create a tokenization function. This is a function that defines how we will take each example from our training dataset and translate it from text to numbers. It will take in examples coming from our training dataset, and you see we're extracting the text. Going back to the previous slide, you can see that our training dataset has two features: a label and a piece of text. So you can imagine each row of this training dataset has text and a label associated with that text. When we come over here, the examples are just like rows from this dataset, and we're grabbing the text from each example. Then we define the side that we want to truncate from. Truncation is important because the examples that we pass into the model for training need to be the same length. We can achieve this either by truncating long sequences, by padding short sequences to a predetermined fixed length, or by a combination of the two. Here we're just choosing the truncation side to be left. Then we tokenize the text: here's our tokenizer that we defined up here, we pass in the text, we return NumPy tensors, we do the truncation as defined here, and we define our max length, and then this returns our tokenized inputs. Since the tokenizer does not have a pad token (this is a special token that you can add to a sequence which will essentially be ignored by the large language model), we add a pad token here and then update the model to handle this additional token we just created. Finally, we apply this tokenize function to all the data in our dataset using this map method here.
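Here is a rough sketch of that preprocessing step. The dataset repo id ("shawhin/imdb-truncated") and the max length of 512 are assumptions on my part, and where the video returns NumPy tensors, this sketch returns plain Python lists, which plays nicely with the dynamic-padding collator introduced below; model_checkpoint and model are the objects defined in the previous sketch.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the truncated IMDB dataset (the exact repo id is an assumption here).
dataset = load_dataset("shawhin/imdb-truncated")

# Grab the tokenizer that matches the base model checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# If the tokenizer has no pad token, add one and resize the model's embeddings to match.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    model.resize_token_embeddings(len(tokenizer))

def tokenize_function(examples):
    """Convert raw review text into token ids, truncating long reviews from the left."""
    text = examples["text"]
    tokenizer.truncation_side = "left"
    return tokenizer(
        text,
        truncation=True,
        max_length=512,
    )

# Apply the tokenize function to every example in the train and validation splits.
tokenized_dataset = dataset.map(tokenize_function, batched=True)
```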
We have our dataset and we apply this map method: we pass in our tokenize function and it outputs a tokenized version of our dataset. To see what the output looks like, we have another dataset dictionary with the training and validation datasets, but now you see we have these additional features: we don't only have the text and the label, we also have input IDs and this attention mask. One other thing we can do at this point is create a data collator. This is essentially something that will dynamically pad the examples in a given batch to be as long as the longest sequence in that batch. For example, if we have four examples in our batch and the longest sequence has 500 tokens but the others are shorter, it'll dynamically pad the shorter sequences to match the longest one. The reason this is helpful is that if you pad your sequences dynamically like this with a collator, it's a lot more computationally efficient than padding all 1,000 training examples to the same fixed length, because you might just have one very long sequence at 512 tokens that creates unnecessary padding you have to process. Next, we want to define evaluation metrics. This is how we will monitor the performance of the model during training. Here I just did something simple: I import accuracy from the Evaluate Python library. We can package our evaluation strategy into a function that I'm going to call compute_metrics. We're not restricted to using just one evaluation metric, or even to using accuracy as the evaluation metric, but just to keep things simple, here I stick with accuracy. We take a model output and unpack it into predictions and labels. The predictions here are the logits, so each one has two elements: one associated with the negative class and one associated with the positive class. All this is doing is evaluating which element is larger, and whichever one is larger becomes the label. If the 0th element is larger, the argmax will return 0 and that becomes the model prediction; and vice versa, if the first element is larger, this will return a 1 and that becomes the model prediction. Then we just compute accuracy by comparing the model prediction to the ground-truth label. Before training our fine-tuned model, we can evaluate the performance of the base model out of the box, so let's see what that looks like. Here we're going to generate a list of examples such as "It was good," "Not a fan, don't recommend," "Better than the first one," "This is not worth watching even once," and "This one is a pass." Then, for each piece of text in this list, we tokenize it and compute the logits; basically, we pass it into the model and take the logits out. Then we convert the logits to a label, either a zero or a one. The output looks like this: we have the untrained model predictions. "It was good": the model says this has a negative sentiment. "Not a fan, don't recommend": the model says this has a negative sentiment, so that's correct. "Better than the first one": the model says this has a negative sentiment, even though that's probably positive. "This is not worth watching even once": the model says it's a negative sentiment, which is correct. And then "This one is a pass": the model assigns a negative sentiment to that as well. As you can see, it got two out of five correct. Essentially, this model is about as good as chance.
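Here is a minimal sketch of the collator, the compute_metrics function, and the out-of-the-box check, assuming the tokenizer, model, and id2label objects from the earlier sketches.

```python
import numpy as np
import torch
import evaluate
from transformers import DataCollatorWithPadding

# Dynamically pad each batch to the length of its longest sequence.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Accuracy is the only metric we track, to keep things simple.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    """Turn logits into label predictions and compare them to the ground-truth labels."""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# Sanity-check the untrained base model on a few hand-written examples.
text_list = [
    "It was good.",
    "Not a fan, don't recommend.",
    "Better than the first one.",
    "This is not worth watching even once.",
    "This one is a pass.",
]

print("Untrained model predictions:")
for text in text_list:
    inputs = tokenizer.encode(text, return_tensors="pt")
    logits = model(inputs).logits                     # raw scores for [negative, positive]
    prediction = torch.argmax(logits, dim=1).item()   # pick the larger logit
    print(f"{text} -> {id2label[prediction]}")
```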
It's like flipping a coin: it's right about half the time, which is what we would expect from this un-fine-tuned base model. So now let's see how we can use LoRA to fine-tune this model and hopefully get some better performance. The first thing we need to do is define our LoRA configuration parameters. First is the task type: we're saying we're going to be doing sequence classification. Next we define the intrinsic rank of the trainable weight matrices; that was that smaller number that allowed B and A to have far fewer parameters than W naught. Next we define the LoRA alpha value, which is essentially a parameter that's like the learning rate when using the Adam optimizer. Then we define the LoRA dropout, which is just the probability of dropout, where we randomly zero internal parameters during training. Finally, we define which modules we want to apply LoRA to, and here we're only going to apply it to the query layers. Then we can use these configuration settings and update our model to get another model, one that is ready to be fine-tuned using LoRA. That's pretty easy: we just use this get_peft_model function, passing in our original model and then our config from above, and then we can easily print the number of trainable parameters in our model. We can see it's about a million out of the 67 million that are in the base model, so we're going to be fine-tuning less than 2% of the model parameters. That's just a huge cost savings, like 50 times fewer trainable parameters than if we were to do full-parameter fine-tuning. Next, we define our hyperparameters and training arguments. Here we set the learning rate to 0.001, the batch size to four, and the number of epochs to ten. Next, we say where we want the model to be saved; here I dynamically create a name, so it'll be the model checkpoint plus "lora-text-classification." The learning rate is what we defined before, the batch size is what we set before, and we define weight decay as 0.01. Then we set the evaluation strategy to epoch, so every epoch it's going to compute those evaluation metrics; the save strategy is also epoch, so every epoch it's going to save the model parameters; and we set load best model at the end, so at the end of training it's going to return us the best version of the model. Then we just plug everything into this Trainer class. The Trainer takes in the model, the training arguments, our training and validation datasets, our tokenizer, our data collator, and our evaluation metrics. We put all that into the Trainer class and then train the model using its train method. During training, these metrics are generated, so we can see the epochs, the training loss, the validation loss, and the accuracy. As you can see, the training loss is decreasing, which is good, and the accuracy is increasing, which is good, but the validation loss is increasing. This is a sign of overfitting, which I'll comment on in a bit.
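Here is a rough sketch of that configuration and training setup. The transcript doesn't state the exact rank, alpha value, or target-module names, so r=4, lora_alpha=32, and "q_lin" (DistilBERT's query projection) are my assumptions; the hyperparameters below are the ones stated above, and the other objects come from the earlier sketches.

```python
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments, Trainer

# LoRA configuration; r, lora_alpha, and target_modules are illustrative choices.
peft_config = LoraConfig(
    task_type="SEQ_CLS",        # sequence classification
    r=4,                        # intrinsic rank of the trainable update matrices
    lora_alpha=32,              # scaling factor for the LoRA updates
    lora_dropout=0.01,          # probability of zeroing LoRA parameters during training
    target_modules=["q_lin"],   # only apply LoRA to the query layers (DistilBERT naming)
)

# Wrap the base model so that only the LoRA parameters (and, for this task type,
# the classification head) are trainable.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# Hyperparameters described above.
lr = 1e-3
batch_size = 4
num_epochs = 10

training_args = TrainingArguments(
    output_dir=model_checkpoint + "-lora-text-classification",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    weight_decay=0.01,
    evaluation_strategy="epoch",   # renamed to eval_strategy in newer transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```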
Now that we have our fine-tuned model in hand, we can evaluate its performance on those same five examples that we evaluated before fine-tuning. It's basically the same code copy-pasted, but here's the different output. The text "It was good" is now correctly being classified as positive. "Not a fan, don't recommend" is correctly classified as negative. "Better than the first one" is correctly classified as positive. "This is not worth watching even once" is correctly classified as negative. And then "This one is a pass" is classified as positive, but this one's a little tricky. So even though we don't get perfect performance on these five toy examples, we do see that the model is performing quite a bit better. Now, returning to the overfitting problem: this example is meant to be more instructive than practical. In practice, before jumping to LoRA, one thing we might have tried is to simply do transfer learning and see how close we can get to something that does sentiment analysis well. After doing the transfer learning, then maybe we would use LoRA to fine-tune the model even further. Either way, I hope this example was instructive and gave you an idea of how you can start fine-tuning your very own large language models. If you enjoyed this content, please consider liking, subscribing, and sharing it with others. If you have any questions or suggestions for future content, please feel free to drop those in the comment section below. And as always, thank you so much for your time and thanks for watching.