Transcript for:
Axolotl Fine-Tuning Techniques Overview

So the plan for today: we're going to talk about Axolotl, how to use it broadly, and then we'll go into the Honeycomb example that we introduced last time, with a quick catch-up for those who didn't see it; Hamel will walk through that. We'll have some time for a conversation with Wing, both our questions and your questions, then Zach will share about parallelism and Hugging Face Accelerate, we'll do a very quick run-through of fine-tuning on Modal, and we'll keep a little time at the end for Q&A.

With all that said, I'm going to get started. The most frequent question I get from people when they're first starting to fine-tune is really about what I'll call model capacity: how much are we going to be able to learn? There are two parts to that. The first is what model should I fine-tune off of. The second is simultaneously more technical but, I think, has an easier answer, because the answer is almost always the same: should I use LoRA or should I do a full fine-tune? I'm going to give a shorter answer on the base model, then walk you through what it means to fine-tune with LoRA, but the punchline is that even though it's useful to understand LoRA because you're going to use it a lot, in my opinion you should almost always be using LoRA rather than full fine-tunes.

The first part is what base model to use. There are two dimensions to this. One is what model size do I use: a 7 billion, 13 billion, 70 billion, or some other size of parameter model. The second is what model family do I use: Llama 2, Llama 3, Mistral, Zephyr, Gemma, whatever else.

On model size, different people will have different experiences. I've never fine-tuned a 70 billion parameter model. It's not that we can't; thanks to Axolotl and Accelerate it's actually not that difficult. But I have fine-tuned 7 billion and 13 billion parameter models, and for most of my use cases the breadth of what we're asking the model to do is not that wide. My experience has been that the output quality of a 7 billion parameter model versus a 13 billion one, for the projects I've worked on, has been close enough that I never felt the need to deal with the parallelism required for much larger models. So I typically end up using 7 billion parameter models: they're a little faster, and it's a little easier to get a GPU they run on. If you look at download counts (not a perfect proxy for what others are doing, but some proxy), you see that 7 billion parameter models are the most popular, and these are not instruction-tuned models, so these are models people are typically fine-tuning off of.

For people who want to know what fine-tuning is at all, I covered that in some depth in the first lesson, so you can go back to that.

Then the second question is which model family to use. This is one where, again thanks to the way Axolotl abstracts things, it is extremely easy to try different models, especially if they all fit on the same GPU, and even if you have to boot up a new instance that's also not so hard.
So it's extremely easy to try different models and just do a vibes check. I tend to use whatever is fashionable. A recently released model is Llama 3, and if I were starting something today I would just use Llama 3, not because I've thought about it in incredible depth, but because it's a newly released model that's widely known to be reasonably good. If you want to find out what's fashionable, there are many places to look. You can go to Hugging Face, where there's a way to sort models by hotness and see what's hot. The LocalLLaMA subreddit is a community of people who think about these things a lot; even though it has "local" in the name, they spend a lot of time thinking about different models and how they behave differently, so that's another good community to look at if you want to choose a model. But I think people over-index on this: if you run a couple of the most popular models at the time, that should be good enough, and you probably won't improve on it immensely by trying many more. I'll talk in a couple of slides about why that is.

The second problem, LoRA versus full fine-tuning, is a question of what you actually update when you fine-tune. Let me start with an image. Imagine we've got one layer that goes from an input to an output. For a moment I'm going to simplify the transformer architecture so we don't think about query, key, and value matrices, and treat this almost as a feed-forward network. You've got one layer, it takes an input that is really an embedding of the meaning of the text up to that point in the string, and it outputs another vector representation. In most of these models the inputs and outputs are somewhere on the order of 4,000 dimensions, so for that one layer you'd have a 4,000-dimensional input and a 4,000-dimensional output. That matrix would be 4,000 by 4,000, which is 16 million weights.

The idea behind LoRA is that we can learn something to add to that original matrix that is much lower-dimensional, that will still change the behavior in a similar way but with many fewer weights, and as a result it can be fine-tuned on a GPU with less RAM. I think it's safe to say that the vast majority of fine-tuning that happens is either LoRA or QLoRA, which I'll talk about and which works functionally in a similar way. For everyone in this course, you should use LoRA for a while; maybe someday you'll do a full fine-tune, but as a practitioner you may never need one. There are theoretical reasons that full fine-tunes could be higher performance if you have a lot of data, and Zach or Wing or Hamel can contradict me here, but I think for most people LoRA is all you need.

Let me say a bit about how LoRA works. We want to make some changes to a 4,000 by 4,000 matrix, which is the original weights, and we do that with two matrices that we multiply together. Those of you who remember your linear algebra will know that a 4,000 by 16 matrix times a 16 by 4,000 matrix gives you a 4,000 by 4,000 matrix. So multiplying those two pieces together creates a new matrix we can add to the original weights. It can change the original weights quite a bit, but the number of parameters required is small: each of the two matrices on the right has 16 by 4,000 parameters, and you have two of them, so now we have 128,000 weights to fit when we're fine-tuning. That's a lot less than 16 million, and as a result it requires a lot less RAM. GPU VRAM is frequently the binding constraint as we train our models, so it's nice to reduce that usage with LoRA. And you'll see it's just a configuration flag, so it's quite easy to do in Axolotl.

The other piece, which I think is conceptually also somewhat complex to understand well but extremely easy to use, is going from LoRA to QLoRA. Each element of those matrices is just a number, and numbers are stored in computers with some number of bits. If you store a number with many bits, you get very fine gradations of what that number can be: 2, 2.1, 2.2, and so on, so we tend to think of those values as almost continuous. QLoRA divides the possible values into a smaller set. For instance, if you start with something stored in 16 bits, you can think of it as nearly continuous; if the lowest value you want to store is minus 2 and the highest is, to pick a number, 2.4, there are lots of values in between. Quantization divides that space so it can be stored in 4 bits: two to the fourth, so 16 possible values. Exactly how those 16 values are chosen is a technical topic that isn't worth going into right now, and there are details about how back-propagation works through it that we don't really need in practice. But by storing every number in 4 bits you cut memory usage by quite a bit, so a lot of people do this. You'll see that it's not complex to do, it saves some RAM, and it has some small impact on results; my intuition would have been that it has a bigger impact on results than I've actually observed, and I think most people would agree. So a lot of people run QLoRA, either as their default first step or at least frequently. We'll show you how to do that, and it's shockingly easy.
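To make those capacity and memory numbers concrete, here is a small back-of-the-envelope sketch using the round numbers from above (a 4,000 by 4,000 layer and LoRA rank 16). The real hidden sizes and byte counts vary by model and setup, so treat this as illustrative arithmetic rather than exact savings.

```python
# Rough arithmetic for one 4,000 x 4,000 layer, matching the numbers above.
hidden = 4000
rank = 16

full_params = hidden * hidden     # 16,000,000 weights in the original matrix
lora_params = 2 * hidden * rank   # two low-rank matrices: 128,000 trainable weights

print(f"full fine-tune params per layer: {full_params:,}")
print(f"LoRA trainable params per layer: {lora_params:,} "
      f"({lora_params / full_params:.2%} of the original)")

# Storage for the frozen base weights at different precisions (illustrative only):
bytes_fp16 = full_params * 2      # 16-bit floats: 2 bytes per weight
bytes_4bit = full_params * 0.5    # 4-bit quantization: half a byte per weight
print(f"frozen weights at 16-bit: {bytes_fp16 / 1e6:.1f} MB, at 4-bit: {bytes_4bit / 1e6:.1f} MB")
```

The point is just that the trainable LoRA parameters are a tiny fraction of the layer, and quantizing the frozen base weights shrinks the remaining memory further, which is why QLoRA fits on much smaller GPUs.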
Maybe it's a good time to pause for a second. Wing, Zach, do you have any opinions on QLoRA versus LoRA, when you use them, any observations or further thoughts?

Yeah. Sometimes people see a difference in the actual losses, or in some of the evaluations they get during fine-tuning with QLoRA, because what's happening is you've quantized the weights and you're training on those, but then when you merge the LoRA back into the original model there are quantization errors, so you're not getting back exactly the same model you trained. There has been some debate over that. Personally, I don't feel like it's a huge issue, otherwise people would not still be using it, so that's really the only thing I'd flag. There was also something I personally didn't fully understand with QLoRA around the quantization: there's double quantization and some nuances there when you're quantizing the weights. Maybe Dan understands that better than me.

I think I don't. One of the speakers at workshop 4 is Travis Addair, the CTO of Predibase, who built LoRAX, which is a serving framework, and he has talked about some of the quantization errors as you merge the weights back. I think he has thought about this way more deeply than I have, so I'm looking forward to workshop 4 to hear his description of what he's done about this issue, but I don't know much more about it than that.

All of this is, like I said, one of the many places in AI, and in ML before that, where it's tempting to get really detailed about things that seem very mathematical. Even though most of us were good at math from an early age and were told we should do a lot of math, fiddling with hyperparameters, while it sounds cool, has a much, much lower payoff than spending that time looking at your data and improving your data. You might think "my data is what it is, how can I improve it?", but when we get to what Hamel shows about his work with Honeycomb, you'll see you actually can improve your data, and the payoff to improving your data is very large. I think Hamel made a comment about this earlier; many of you might know who Teknium is. Hamel, did you want to jump in here? Anyway: improving your data has massive payoffs and you should do more of it.

One of the things I loved about Axolotl when I switched to it (Axolotl acts in large part as a wrapper for lower-level Hugging Face libraries) is that, compared with the lower-level Hugging Face libraries that give you a lot of granular control, Axolotl was so easy to use that I stopped thinking about what the error in my code might be. I spent less time looking at code and more time, psychologically, looking at my data. The ease of changing things around and being able to just run them freed up mental space for me to focus on my data, which, as we said, is a great thing to do.

It also bakes in a lot of best practices and default values if you just use the examples, which I'll show you; it does a lot of smart things as defaults. There are a couple of things I quite like that it does that we don't have time to cover, so I'm going to make a couple of videos and post them in the Discord or on the Maven portal, or quite possibly both, showing things like sample packing, which is a clever trick that speeds up your training process. There's a lot you could spend time figuring out for yourself, or you can just use the examples in Axolotl, change relatively few things, and get a lot of best practices built in by default.
So, Wing, thank you; I've loved using Axolotl. And one thing worth lingering on for a second, Wing, I'll let you tell the story: have you been surprised by the kind of people who are able to fine-tune really competitive models without knowing any deep mathematics or anything like that?

Yeah. If you think about the most popular models, like the Teknium Hermes models and those sorts, they're generally very popular, and if you actually talk to him, he's very much like me in that he doesn't go deep into transformers and the math and all of that; he just wants to train models and focus on good data, and really all of his models are really good. There are other people too, I think Miguel T., I forget which models he releases, and his background is more deep learning, but he also uses Axolotl, and a lot of these folks don't really need to go deep into the transformer internals. So yeah, like Dan was saying, they're able to spend more time focusing on procuring good data and doing data synthesis rather than thinking about everything else that goes on under the hood.

Great. Okay, let's get one level more tactical, or concrete: using Axolotl. Some people here have used it a bunch, but we're going to assume most of you have used it very little or, based on a survey we did of some students at some point, not at all. So this is going to be really about how you actively get started. I think you'll be surprised that it is not so difficult to run your first jobs, and I highly recommend doing it: you'll feel different about yourself as someone in this space once you've run a couple of jobs and you feel like a practitioner.

The way to get started is to Google "GitHub Axolotl" and go to the Axolotl repo. There is a separate documentation page, but the README alone is fantastic and has most of what you'll need. I'm going to point out a couple of things to look for in that README. The very first is the examples; I mentioned earlier that there are a lot of them. Axolotl takes YAML config files, and the config files are reasonably long. Maybe Wing could do it, but I don't think anyone else could open a blinking cursor and type one out beginning to end and get it right. So you, and almost everyone else, will go to one of these examples and copy it. The first time, you should just run it as-is, and I'll show you how to do that. Then you're likely to change one or two parameters, probably starting with the dataset, and rerun it. It will always be an experience of taking something that works and changing it a little, rather than starting from scratch.

To show you one of these examples: here's one that fine-tunes a Mistral 7B model with QLoRA.
The very top shows the model you're fine-tuning off of. This is QLoRA, so here we're loading in 4-bit. We have a dataset, which I'll show you in a moment; we're going to store the dataset after the prep phase at some location, and we'll hold out some validation data. Most of these you won't change that frequently. Sample packing I'll cover in a separate video. lora_r is related to the size of those LoRA matrices, the matrix I was showing earlier, and lora_alpha is a scaling parameter. I wouldn't worry about some of the bottom ones. The one you probably want to focus on up front is the dataset; it's not the easiest one to change, so you could change something else first just to get the experience of changing something, but when you really start working on your own use cases, the first thing you'll change is the dataset.

On the format of the dataset: out in the wild, data is stored in a variety of formats, and one of the really nice things about Axolotl is that if you tell it what format your data is stored in, it can handle most if not all of the common ones. This one is a format called Alpaca. Each row, or each sample, has an instruction to the model, optionally some input (you'll see most of those are empty here), the output, which is what we want the model to learn to reproduce, and then some text that goes above these. So the text would be "Below is an instruction that describes a task..." and so on, then you'd have a question like "Who is the world's most famous painter?", and then the training output, which is what we're going to train on and try to have the model learn to replicate.
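To make the shape of these configs concrete, here is a heavily trimmed sketch of a QLoRA config along the lines of the example being described. The field names follow the public Axolotl examples, but the specific values, paths, and the exact set of keys shown here are illustrative; start from an example in the repo rather than from this snippet.

```yaml
# Trimmed, illustrative QLoRA config in the style of the Axolotl examples.
base_model: mistralai/Mistral-7B-v0.1   # the model you're fine-tuning off of
load_in_4bit: true                      # QLoRA: quantize the frozen base weights
adapter: qlora

datasets:
  - path: data/my_dataset.jsonl         # hypothetical local dataset
    type: alpaca                        # tells Axolotl how the rows are structured
dataset_prepared_path: last_run_prepared
val_set_size: 0.05                      # hold out some validation data

sequence_len: 2048
sample_packing: true                    # covered in a separate video

lora_r: 16                              # rank of the LoRA matrices
lora_alpha: 32                          # scaling parameter
lora_dropout: 0.05

micro_batch_size: 2                     # batch size per GPU
num_epochs: 1
learning_rate: 0.0002
output_dir: ./qlora-out
```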
Just to stop there for a second and talk about the config files: when I start a project, I look at the examples too, and I message Wing sometimes. Not everybody can message Wing, so please don't DDoS him with questions like that, but there is a Discord channel (I think Wing is getting the link and putting it in our Discord right now), and that's a good place to trade configs. Starting with a known-good config is a good idea: "hey, I'm training this model that just came out, does anyone have a config?" Usually, by searching that Discord, looking at the examples, or something else, you can find one, and nowadays you can often find Axolotl configs in Hugging Face repos as well. Wing, do you have any other tips on where to find configs or how people should go about it?

Yeah. Some model creators include their configs; personally I try to include the model configs when I'm releasing models, either somewhere in the repo or in the README. Axolotl by default also stores the config in your README, so if you go through Hugging Face there's a way to find models tagged as trained with Axolotl, and depending on whether the creators have modified their README you can get configs from there as well. Other than that, you'll see examples in the Discord, and I'm happy to help with various things depending on what it is, but it's generally pretty self-explanatory most of the time. Usually you're taking little bits from one config and combining them with another piece, whether it's FSDP or DeepSpeed, or LoRA versus QLoRA; most of the various configurations are pretty composable with each other, and if they're not, I believe we do enough validation that it will tell you they're not composable.

Sounds good. Okay. There are a lot of other parameters; I won't go through most of them, and most of them you won't change, but I'll say a couple of things. One is that many of us like using Weights & Biases, and there's a very nice Weights & Biases integration in Axolotl; you'll even see a config from Hamel later that shows how to fill this in. micro_batch_size is basically the batch size per GPU. A lot of this stuff you won't change in the near future, so like I said, I highly recommend starting with one of the example configs and changing small pieces; don't get overwhelmed by all the things you aren't changing.

Once you have your config, the next step is to run it. Like I said, the GitHub README is very useful: after you've got your example, click on the quick start section, and that brings you to, depending on how you count, either three or four commands. The reason it can look like four but really be three steps is this: one is preprocessing your data, the second is the training step, and after that you'll want to test out the model you've trained. There's a CLI tool to do that, which is the third step, and Hamel will show another way to do it. The thing I like to do is run the bottom version instead of the third: that launches a very lightweight Gradio app, so you can type something into a form in the browser, it gets sent to the model, inference happens, and the output is shown. I quite like using that bottom step. It's worth mentioning that you only want to do this to spot-check your model; this is not for production, and you don't necessarily do inference in production with this. Yep, and we'll cover inference in production in the deployment workshop.

Sorry, I lost my train of thought. Ah, right: you will not remember these commands. The thing I hope you remember is that everything you want is in the GitHub repo, in the quick start section.
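For reference, the quick start commands look roughly like this. The example path and exact flags come from the README and change over time, so copy them from there rather than from here.

```bash
# 1. Preprocess the dataset (the --debug flag prints the assembled prompts/tokens)
python -m axolotl.cli.preprocess examples/mistral/qlora.yml --debug

# 2. Train
accelerate launch -m axolotl.cli.train examples/mistral/qlora.yml

# 3a. Spot-check the result from the command line
accelerate launch -m axolotl.cli.inference examples/mistral/qlora.yml \
    --lora_model_dir="./qlora-out"

# 3b. Or spot-check it through a lightweight Gradio app in the browser
accelerate launch -m axolotl.cli.inference examples/mistral/qlora.yml \
    --lora_model_dir="./qlora-out" --gradio
```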
So what does it look like when you run that? Some of the text here is going to be relatively small, so we'll come back to a screenshot where you can see things in more detail, but this is a quick view of what happens when you train the model. Here I am typing out that first preprocess command. I use the --debug flag; we'll talk about whether you use the debug flag or not when Hamel gets to his section, but I like using it, and when you do, there's some output that I'll go into in more depth in a moment. After that I run the next command from the last screen, which kicks off training. Depending on the amount of data you have, training can take minutes, hours, or sometimes days; I do have one project where it can take days, but it's typically an hour or so, and sometimes much less.

Let me go to the next slide. The pre-processing step with the debug flag prints a section of output that would be easy to overlook but that I think is really critical to your understanding of what's happening. Though we started with data that had multiple fields, your model is going to train on a string, or, as I'll show in a moment, actually a string plus one other piece. This output shows you the template for that string, which we create in the pre-processing step and later use for modeling. We have an instruction, an input, and an output, and for each sample the template is just filled in with that sample's instruction, output, and text. When you use this at inference time, you'll provide everything up through the response part but not the output, because you wouldn't know the output at inference time. The template shows you what the string looks like, and then we use autocomplete-type logic: we provide everything before the output, and the model provides the output.

It looks like it's just a string, but there is one other piece that is important to your understanding of fine-tuning: it's actually a string and a mask. Going back for a moment: when you calculate your loss function, which, for those of you familiar with deep learning, is part of figuring out how to change the parameters to change the model's behavior, we don't want to train the model to write the words "Below is an instruction that describes a task". And the input here is a proxy for what your app users' input will be, so we don't want to train the model to be the user; we want it to be good at responding to user inputs. So those pieces up front are not going to inform the loss.

If you look at the output on a token-by-token basis: somewhere in there was the input, and there were the words "that appropriately completes the request", with a period. Each of those is a token, and each prompt token shows up as a pair where the token ID is present but the first piece of the tuple is minus 100, which is just a way of preventing that token from influencing the loss and thus the behavior of our model. If you look at the output, shown in green, those pairs have the token ID and then, for the purpose of calculating the loss, the same token ID again. There's a flag, which I think is called train_on_inputs, that lets you change this behavior, but broadly speaking this is a way of seeing very clearly which tokens are only inputs to the model and which tokens are influencing the loss, that is, the ones we're training the model to output.
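Here is a minimal sketch of that masking idea. This is not Axolotl's internal code, just an illustration of what the labels look like when prompt tokens are masked with -100, the value the PyTorch/Hugging Face cross-entropy loss ignores; the token IDs are made up.

```python
# Illustrative only: how prompt tokens are excluded from the loss.
prompt_ids = [101, 2345, 678]    # token IDs for the instruction/input part (made up)
output_ids = [910, 1112, 13]     # token IDs for the response we want the model to learn

input_ids = prompt_ids + output_ids

# With train_on_inputs: false, prompt positions get label -100 so they don't
# contribute to the loss; output positions keep their token IDs.
labels = [-100] * len(prompt_ids) + output_ids

for tok, lab in zip(input_ids, labels):
    print(f"token_id={tok:5d}  label={lab}")
```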
Wing, do you use that debug output?

Yeah, all the time, mostly because I want to be sure the tokenization is correct. Often I'm using ChatML, and because it's not a default token setup I want to make sure I didn't mess anything up in setting those special tokens for ChatML, and to double check that the outputs look right.

Just so people know, ChatML is a specific type of prompt template. If you go back to the previous slide Dan had, that one, I believe, is an Alpaca template. Yes, this is Alpaca. So that's one specific template, and ChatML is different; in general, chat templates tend to have a little more complexity or nuance to them than instruction-tuning templates, which are arguably a little simpler. But I didn't mean to cut you off, Wing, keep going.

No, that was really it, and then checking the end tokens, making sure the stop tokens are in there correctly, because if they're not, you can get a model that just rambles on and on and never stops. So it's a good spot check for myself, especially in multi-turn conversations, to make sure it's masking out the responses correctly; you see that because it goes red, green, red, green. It's an easy spot check, and having the colors makes it easy to glance at, because otherwise it's really hard on the eyes to debug.

Let me show this last step. We've done training, and there was one more command; I'm going to show the Gradio version of it. Let me pause for a moment and switch over so we're looking at this at the highest possible resolution. The last step is to kick off the app: you run accelerate launch with the inference command, pass in the right YAML file and the directory with the LoRA, and then the --gradio flag. That kicks off an app, you can click the link, open it in the browser, and type and test things there. Again, you won't remember all of these pieces, but you should remember they're in the quick start and you can refer back to them. And again, I super highly recommend, before other things get onto your to-do list, that you run through this so you have hands-on experience with Axolotl.

With that, let me hand it off to Hamel to go through a case study, which is the Honeycomb case study. Hamel, do you want to take over sharing? Yeah, let me do that right now. Let me start the slideshow; is that sharing good? Okay, thank you.

Okay, so there's a running example through the fine-tuning workshops, and that's this Honeycomb use case. We discussed it in the first workshop, but because we have so many students I'm going to go over it quickly again. The case study: there is a company called Honeycomb that I've worked with. Honeycomb is an observability platform, a telemetry system that lets you log all kinds of data and helps you diagnose things like parts of your application being slow, or bugs somewhere. It's similar to Datadog in some ways.
Honeycomb has a domain-specific query language called HQL, and one of the things they want to do is reduce the burden of people learning HQL. So they released an alpha product that lets users type natural language queries: instead of learning the Honeycomb query language, you just type your question. The way it works is that you have two inputs to the LLM: the user's query and the user's schema. The schema is retrieved with a RAG-type approach; we don't have to get into that. With those two inputs there's a prompt, and out comes a Honeycomb query. That's the high-level overview, just to remind you.

Let's jump right into the case study. I'm going to be walking through some slides, and let me open this GitHub repo: it's github.com/parlance-labs/ftcourse. You don't have to open it right now; just follow along with what I'm doing. It's a repo that looks like this, and I'm going to go through the notebooks, numbered one through eight. Dan, tell me if you can see the text on my screen or if it's too small. Looks really clear to me. Good.

Okay, so I'm going to go through some steps. These steps are not necessarily linear, but they'll give you a good idea. I'll be focusing a lot on what we did with Honeycomb to fine-tune a model, and a lot of the steps are going to be around dataset curation, data filtering, debugging, and evaluation, because, as Dan mentioned, we're not really focused on the model so much.

Let me go through the prompt real quick. This is the Honeycomb prompt. There's the system prompt, "Honeycomb AI suggests queries for users", then one of the inputs, which is the schema. There's a long fixed part of the prompt, the query specification, which is basically a very terse programming guide to the Honeycomb query language. There are some tips, which are just additional instructions, and there are few-shot examples of user questions paired with Honeycomb queries. Finally, this is a completion model: when Honeycomb launched this they used the completion API rather than the chat API, so the model is just completing this prompt, templated on the user's question.

The interesting thing is that there's a lot of stuff in this prompt that is fixed every single time: the few-shot examples, the tips, the query specification. Everything is fixed except the columns and the question. That's a lot of boilerplate to be sending to a large language model. But also, it's hard to specify everything you want in a prompt like this; no matter how hard you try, you hit a wall, and that's where fine-tuning moved the needle.
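To visualize the shape of that prompt, here is a stripped-down sketch. The wording and section contents are placeholders, not Honeycomb's actual prompt; the point is just which parts are fixed and which two pieces vary per request.

```python
# Illustrative prompt assembly: everything but {columns} and {question} is fixed.
QUERY_SPEC = "<terse guide to the Honeycomb query language>"   # fixed every call
TIPS = "<additional instructions>"                             # fixed every call
FEW_SHOT = "<several example question -> query pairs>"         # fixed every call

PROMPT_TEMPLATE = """Honeycomb AI suggests queries based on user input.

COLUMNS: {columns}

QUERY SPEC: {query_spec}

TIPS: {tips}

EXAMPLES: {few_shot}

User question: {question}
JSON query:"""

def build_prompt(columns: str, question: str) -> str:
    return PROMPT_TEMPLATE.format(
        columns=columns, query_spec=QUERY_SPEC, tips=TIPS,
        few_shot=FEW_SHOT, question=question,
    )

print(build_prompt("duration_ms, service.name, status_code",
                   "show me slow requests by service"))
```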
Honeycomb launched this product; there's a link to the blog post, which is kind of neat to read, and it talks about the same thing: you type in a natural language query and out comes a Honeycomb query. You can read about it; I don't want to go too deeply into that. The goal in this case was to encourage more users to write queries, so the bar isn't super high in terms of it having to be perfect.

But one thing we had to do is write evals. One of the things you should think about is writing evals after you do some prompt engineering; you may prototype with an off-the-shelf large language model first, just to get an idea of how well it works off the shelf. So what do I mean by evals? I have a blog post about evals that I won't go through in too much detail, but there are different levels. Level one is unit tests, where you write assertions; then there's level two and level three, where level three is A/B testing. I'll be going through levels one and two here. The basic idea is that you want a virtuous cycle with evaluation at the center, and the Honeycomb example is a really good use case for this because it's very narrow and simplified, so it lets you see what I'm talking about.

You don't have to understand this code, but know that when I talk about level-one evals, I mean assertions and unit tests that don't involve calls to a large language model: rules you can run almost instantaneously to get feedback about whether your model is doing the right thing. There's some code here, which I'm showing mainly so you know it's real in case you want an example, but essentially I'm just testing different things about the Honeycomb query for correctness: is it valid JSON, are there invalid columns in the query given the schema, are there invalid filters. You don't have to know the specifics; just know that there are lots of different level-one evals, and you don't necessarily need to write them exactly like this.

Also know that I had to iterate on this quite a bit. Don't expect to get all the assertions right the first time. There's an iterative loop: throughout this whole process you keep updating these level-one evals as you notice more and more failure modes, and I had to work really hard to get something I was happy with. You also want to write these assertions in a way that lets you use them in different places: not only for tests, but also to filter out bad data for fine-tuning, for curation, and at inference time so you can do self-healing. I've encapsulated this in a query checker; again, you don't have to know exactly what it is, it just gives you the idea that I'm using these assertions in different places. Because this use case is oversimplified, this way of organizing your code may not work for you, so do what works in your situation, but know that it's there. So: assertions are not just for tests, they're also for filtering, curating, and inference. And definitely look at the blog post.
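As a flavor of what those level-one checks can look like, here is a minimal sketch. This is not the course repo's actual code; the JSON key names ("calculations", "column", "filters", "op") are hypothetical stand-ins for whatever your query format uses.

```python
import json

def check_query(raw_output: str, schema_columns: set[str]) -> list[str]:
    """Return a list of failure reasons; an empty list means the output passes."""
    failures = []

    # Assertion 1: the model must emit valid JSON.
    try:
        query = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]

    # Assertion 2: every column referenced must exist in the user's schema.
    for calc in query.get("calculations", []):
        if "column" in calc and calc["column"] not in schema_columns:
            failures.append(f"invalid_column:{calc['column']}")

    # Assertion 3: filter operators must come from an allowed set.
    allowed_ops = {"=", "!=", ">", "<", "exists", "does-not-exist"}
    for filt in query.get("filters", []):
        if filt.get("op") not in allowed_ops:
            failures.append(f"invalid_filter_op:{filt.get('op')}")

    return failures

# Usage: run the same checks on eval outputs, on synthetic training data,
# and at inference time for self-healing.
print(check_query('{"calculations": [{"op": "COUNT"}]}', {"duration_ms", "status_code"}))
```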
So, one thing you will often have to do when you're fine-tuning is acquire data, and a lot of the time, in an applied use case, you don't have the data. So what do you do? In the Honeycomb case, in real life, my counterpart Philip, who I was working with, didn't have lots of data. He had launched this to production, but not only did he not have lots of data, a lot of that data was private and I couldn't see it. He gave me about a thousand examples, and I wanted to set aside a fair amount of those as an eval set so I could test the model, so I wasn't really left with much. The question is: what do you do from here?

If you're out in the wild trying to build something with large language models and fine-tune it, it's good to know how to generate synthetic data. There's no hard and fast rule about how many examples you need; I just generate as many as I feasibly can, based on intuition, how much it costs, and how much time it takes. I ended up generating 30,000 examples synthetically, which was probably overboard, so you don't have to do that; use your intuition and your budget.

You can do this with prompting. Let me give you a concrete example, because if I just say "use a large language model to synthetically generate data", you'll wonder what that means, and I think it's different for every use case, so let me show you what we did for Honeycomb. The prompt is basically the same exact prompt you've seen before, except there's a second part that says: you are given the following three inputs, a natural language query, a list of candidate columns, and the query; your goal is to generate correct variations of the combination of NLQ, candidate columns, and query to build a synthetic dataset. You can do that by rewording the query and substituting column names; the response should be JSON with the following keys, and so on. So I give it the inputs and basically ask it to perform data augmentation: rewrite the natural language query, substitute the columns, and substitute the query. That way I'm able to generate lots and lots of synthetic data.

Now you might be wondering: is that good data? Is it duplicated? Yes, and you have to clean it up, which I'll talk about in a second. But know that, for example, you want to use those level-one assertions as your first line of defense: a fair amount of what comes out of this is going to be junk, and the level-one assertions already help you get rid of it, and they keep helping throughout this whole process. So now you have a way of getting lots of data. I'm not going to show you the code for doing it; it's fairly straightforward: use your favorite large model, the most powerful model you feel comfortable with, to help you generate the synthetic data.
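A minimal sketch of that augmentation loop is below, assuming an OpenAI-style client, a hypothetical AUGMENT_PROMPT built the way described above, and the check_query sketch from earlier. The model name, prompt wording, and JSON keys are placeholders, not the actual course code.

```python
import json
from openai import OpenAI

client = OpenAI()

AUGMENT_PROMPT = """You are given three inputs: a natural language query (nlq),
a list of candidate columns, and a Honeycomb query.
Generate a correct variation by rewording the nlq and substituting column names.
Respond as JSON with keys: nlq, columns, query.

nlq: {nlq}
columns: {columns}
query: {query}"""

def augment(example: dict, model: str = "gpt-4o") -> dict | None:
    """Ask a strong model for one augmented variation; drop it if it fails level-one checks."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": AUGMENT_PROMPT.format(**example)}],
        response_format={"type": "json_object"},
    )
    candidate = json.loads(resp.choices[0].message.content)
    # First line of defense: reuse the level-one assertions to throw away junk.
    if check_query(json.dumps(candidate["query"]), set(candidate["columns"])):
        return None
    return candidate
```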
Okay, so the next step is preparing the data for Axolotl. Usually what I do is go through a run all the way through, see what's going wrong, and then come back and improve it. You don't want to try to make your data perfect the first time and only then go through the pipeline; you want to go all the way through, see some predictions, make sure the plumbing works, et cetera, and then come back and curate and filter the data. That's what I recommend, because otherwise you can get stuck; it's good to know where the problems are and have an idea first.

So you want to prepare your data to look like this. In this case I'm using the ShareGPT format. In Axolotl there are different dataset formats; let me open the docs so you can see the dataset formats page. I'm using a conversation format, specifically ShareGPT. For ShareGPT you structure your data as conversations, where each turn has a "from" and a "value", and the roles can be human or gpt; you can also have a system prompt, which I do have in this case and which I'll show you. So here I have a conversation with a system prompt, then a human turn, then a gpt turn.

Why is it organized that way? Partly it's just the way Axolotl expects data for this format, but it's also important because of what Dan said about not training on inputs. The system role and the human question are considered inputs, and the output, which is the query, is what we're training on. We're only penalizing the model on the query: we're forcing it to learn to produce the right query, not trying to have it predict what the question is, if that makes sense. So you organize your data like this into a JSONL file.
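Here is a small sketch of writing rows in that shape to a JSONL file. The system prompt, example contents, and file path are placeholders, but the conversations/from/value structure with system, human, and gpt roles is the one the Axolotl docs describe for ShareGPT.

```python
import json

SYSTEM_PROMPT = "Honeycomb AI suggests queries based on the user's question and schema."  # placeholder

examples = [
    {
        "nlq": "show me slow requests by service",
        "columns": "duration_ms, service.name",
        "query": '{"calculations": [{"op": "HEATMAP", "column": "duration_ms"}]}',  # made-up query
    },
]

with open("data/synthetic_queries.jsonl", "w") as f:
    for ex in examples:
        row = {
            "conversations": [
                {"from": "system", "value": SYSTEM_PROMPT},
                {"from": "human", "value": f"NLQ: {ex['nlq']}\nColumns: {ex['columns']}"},
                {"from": "gpt", "value": ex["query"]},   # the only part that feeds the loss
            ]
        }
        f.write(json.dumps(row) + "\n")
```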
Now let's take a look at the config. I already went over the config in general, but the thing to pay attention to here is that the dataset has changed: this is a local dataset, basically the sample data plus the synthetic queries, and you can look at what that looks like in the GitHub repo at this path. Also, train_on_inputs is false; there's a key in the config, train_on_inputs, right here. And if you're going to run this example, which you can and I'll show you how, you need to change a few things in your config: you won't be able to access my Weights & Biases account or my Hugging Face account, so you'll probably want to create your own. As Dan mentioned, Axolotl can log all the training metrics to Weights & Biases, and it can also push to a Hugging Face model repo; it will upload your model to that repo at the very end, which is super handy. I'll show you some examples of what this looks like.

Okay, so you've prepared the data and you've got your config file. Now what? What I like to do is never jump straight into training, ever, because I make a lot of mistakes in dataset preparation; I always do something wrong, and honestly I think a lot of people do something wrong here. So I like to look at the data and double check how Axolotl is preparing it. The way I do that is with the axolotl preprocess command, which basically flattens the data and assembles it in the right format. You can see all the different commands by using help; I show that here just for reference.

I like to look at the data manually. There's that debug output Dan showed, but I like to actually look at it myself so I can play with it a bit more, manipulate it, and inspect things. When you preprocess, Axolotl dumps the data by default into this last_run_prepared directory, in Hugging Face datasets format, so you can load that dataset and inspect it; that's what I'm doing here with this code. You can see it has flattened that JSONL into a format that looks like this, the prompt format, just like Dan showed earlier: you have the instruction and then the response. What I recommend is checking multiple examples: make sure it looks right, make sure you didn't put the wrong thing in the wrong place or have things in there you didn't intend. It happens all the time.

One thing I'll mention: there are these spaces right here, and you might be wondering what the hell that is. It's a bit of a tricky issue; it's an artifact of the way Axolotl assembles tokens. I don't know if Wing wants to say something about this, but I've found it not to be an issue as long as you're consistent at inference time, and I'll talk more about that; I have a blog post about it as well. There's also the verbose debugging Dan already covered, the debug flag; the special tokens are shown there, which is worth paying attention to, and I'm not going to go through the red/green output again. It's always good to spot check what these tokens are and whether they're correct. For example, you might see a token and wonder, what the hell is that, is that wrong? And it turns out it's a newline. If you want to go deeper into what's going on with the tokens, there's a blog post on tokenization gotchas; as an exercise, you might want to go through it as homework and see whether it's something that matters for you. I was really super paranoid about these small things like spaces, but I found they didn't matter, and I actually discussed this a lot with Wing. Wing, do you have any opinions on this? Is he here? Might not be here. No worries, I'll go straight on to the next thing.
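For reference, that manual inspection can look roughly like this, assuming the default last_run_prepared output location (the actual subdirectory name is a hash) and whatever base model your config names.

```python
from datasets import load_from_disk
from transformers import AutoTokenizer

# The preprocess step writes a Hugging Face dataset under last_run_prepared/<hash>/
ds = load_from_disk("last_run_prepared/<hash>")          # replace <hash> with the real directory
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # whatever base model you configured

row = ds[0]
print(tok.decode(row["input_ids"]))     # the flattened prompt + response string

# Spot-check the masking: positions with label -100 are excluded from the loss.
masked = sum(1 for lab in row["labels"] if lab == -100)
print(f"{masked} of {len(row['labels'])} tokens are masked out of the loss")
```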
Okay, that was dataset preparation; now let's talk about training. We've already seen the config file, which is also located at this path; you can see it's been uploaded to Hugging Face, and there's a link in the notebook, so you don't have to memorize what you're seeing on my screen. To run training you run this accelerate launch axolotl command. Zach is going to be talking about Accelerate, so I don't want to go into that deep rabbit hole right now; I'll let Zach cover it in a bit. If you notice, I have a Weights & Biases config here: the wandb entity is basically like a GitHub org and the project is basically like the repo, and when you set those, Axolotl will log your training runs to Weights & Biases.

Let me show you Weights & Biases real quick. It looks like this, a bunch of runs; you can log your runs and results and look at your training loss curves. I'm not going to spend too much time on this, but know that it's there if you want to look at it.

So what did I actually do for training? I tried different parameters. This is Mistral 7B, so I went to the examples, asked in the Discord and so on for the best config for Mistral, and started with that. I varied the learning rate, tried different learning rate schedulers, and I actually tried different distributed schemes, DeepSpeed ZeRO 1, 2, and 3, just to test things, not that it mattered much, because this is a small model and it fit on my GPU just fine. Mainly I varied the learning rate and the batch size. There's also sample packing you might want to try, to reduce the amount of VRAM you need or increase throughput; Dan will upload a video about that or talk about it in a little more detail later on.

When training is done, if you put in your Hugging Face ID it's uploaded to the Hugging Face Hub, which is here; this example model is here. You don't need to know everything that's here, you can look at it later, and I'll go through some of this code in a bit.

The next thing you want to do after you train your model is sanity check it. There are a lot of different ways to do that. You can use the way Dan mentioned earlier, using Axolotl directly, but I like to use code, Hugging Face Transformers, to do it. So: this is the Hugging Face repo where the model is uploaded. Don't be confused that this says parlance-labs while the other config says hamel; that's because I changed the name of the repo and didn't want to break the links. This is just code that pulls the model from Hugging Face, and then this is the template. Another reason I sanity check things this way is that I want to make sure I understand the template and that it works, because I want exactly two inputs: the natural language query and the columns. There are different ways to do this; Hugging Face has a chat templating system you can use, which I'm not going to go into, but I like to make sure I understand the template myself, so I have it written out here, and it's basically the same thing. Then this is just code to run it: some natural language queries and some schemas, checking to make sure it works. Nothing too crazy going on here, just sanity-checking examples. That's the first thing you should do.
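That sanity check can be as simple as something like this. The repo id and the template wording below are placeholders for whatever you trained and however your prompt was actually assembled.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/your-finetuned-model"   # placeholder Hub repo id

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Must match the template the model was trained with (sketched here, not the exact one).
prompt = (
    "NLQ: show me the slowest requests by service\n"
    "Columns: duration_ms, service.name, status_code\n"
    "JSON query:"
)

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```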
Great, so we've done all this: we trained the model, and we sanity checked that at least the plumbing works and some results look plausible. The next question is: is this any good? Yes, it passes the level-one evals; you can track the different metrics from those, you can know which assertions are failing and what kinds of errors you're getting the most, and that's all good. But beyond the level-one assertions, after you conquer those: are these queries actually good or bad?

I launched this model onto Replicate for inference (we'll go through inference later, so I don't want to get stuck on that) and did some more sanity checking, and Philip did some sanity checking too, and said: okay, this model is okay, but it's not great. It's still making mistakes in some places. And it turned out that the data we used to expand the dataset wasn't great either. This will happen all the time. You have to do some error analysis and figure out: if a result isn't great, why is that? One way is to look at the training data; in this case I looked at similar queries in the training data and tried to see what was happening, and we found that the training data could be better. Things were passing the level-one tests just fine, they were syntactically correct, but they weren't the greatest queries.

So what do we do now? You might be wondering whether we're stuck, because Philip doesn't have time to sit there and label a bunch of data or write better queries. What you can do is try to encode Philip's knowledge and opinions into a model; can you have Philip as an AI in this situation? So I started building an LLM-as-a-judge. It's the same exact original prompt you've seen before, but with an instruction that you are going to be a query validator: "you are an expert query evaluator with advanced capabilities to judge whether a query is good or not", and so on, plus a bunch of few-shot examples of inputs, that is NLQ, columns, and query, along with critiques.

How did I get those critiques? I used a very uncool, low-technology technique: a spreadsheet. I sent Philip a spreadsheet every day for a few weeks and had him write critiques, and over time I aligned the model as much as possible with Philip, so that it agreed with him in the critiques it was writing. I kept tweaking the few-shot examples and the instructions until we were both satisfied that the LLM-as-a-judge was doing a good job. I talk about this in a little more detail in the blog post when I discuss level-two human and model evals; there are different ways to do it, and it's impossible to teach everything I know about it in such a small session, but I want you to have the general process in your mind and know that this is a tool in your toolbox. What I will say is: when you have the result of this, you get a bunch of critiques.
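A minimal sketch of that judge loop is below, again assuming an OpenAI-style client. The judge prompt, model name, verdict parsing, and agreement bookkeeping are illustrative, not the actual course code.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an expert query evaluator with advanced capabilities to judge
whether a Honeycomb query is good or not, given the user's question and schema.
<few-shot examples of (nlq, columns, query) -> critique go here>

nlq: {nlq}
columns: {columns}
query: {query}

Write a short critique, then end with a final line: GOOD or BAD."""

def judge(example: dict, model: str = "gpt-4o") -> tuple[str, bool]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**example)}],
    )
    critique = resp.choices[0].message.content
    return critique, critique.strip().endswith("GOOD")

# Alignment check: compare the judge's verdicts with the domain expert's spreadsheet
# labels, and keep tweaking the few-shot examples until agreement is acceptable.
def agreement(judged: list[bool], expert: list[bool]) -> float:
    return sum(j == e for j, e in zip(judged, expert)) / len(expert)
```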
You can use those critiques to make the data better, and you can use the same LLM as a judge to filter and curate the data — filter out bad queries, or, given a critique, try to make the query better, and if it still can't be made better, filter it out. That's roughly what we went through. From there you can curate your data. Like I mentioned, the first thing is to fix the bad data, again using a large language model: you are given the following inputs and a critique; output the improved query. That's one way to increase the quality of the data. You also want to filter the data, and there are many ways to do that — when we talk about dataset curation there's a lot you can do. For filtering, you want to use both your level one evals, those assertions I mentioned, and these level two evals. But you'll also commonly find other filters: you'll notice different things in the dataset — oh, things in this part of the dataset are garbage, or hey, the model is making a certain kind of mistake, let me filter that mistake out — and then you have to decide whether you need to go acquire data covering that mistake. One example, which isn't a test exactly but is a filtering technique: I noticed there were a lot of either very low-complexity queries — super simple ones — or really high-complexity queries with lots of operations and lots of filters that didn't make any sense, so I had some code that filtered those out.

In the more general case there's a tool called Lilac that helps you find more general things you might want to filter out, search your data, and find duplicates. Another part of curation is getting rid of duplicates: we did a lot of data augmentation, so you might end up with lots of data that looks very similar or too similar, and that's not good, because you end up overweighting those examples. There are a lot of sophisticated things you can do, but start with dumb things if you can. In this case there are three main parts to the dataset: the natural language query, the schema, and the output. One dumb thing you can do is drop any row where a pair of two of those three is duplicated — see the sketch below. Beyond that you can do semantic deduplication — that's why Lilac has things like fuzzy concept search — and clustering, so you can look at the data, try to maximize its diversity, and clean out anything that's too duplicated. So that's an end-to-end overview. The idea is that this is not a linear process.
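Here is a minimal sketch of that "dumb" pairwise dedup; the field names nlq, schema, and output are assumptions, not the actual dataset schema.

```python
# Drop any row where any pair of two of the three fields exactly duplicates a pair we've
# already kept. Field names are placeholders for the real dataset columns.
from itertools import combinations

FIELDS = ("nlq", "schema", "output")

def dedup_pairs(rows):
    seen = set()   # (field_a, field_b, value_a, value_b) tuples we've already kept
    kept = []
    for row in rows:
        keys = [(a, b, row[a], row[b]) for a, b in combinations(FIELDS, 2)]
        if any(k in seen for k in keys):
            continue  # some pair of fields is an exact duplicate of an earlier row
        seen.update(keys)
        kept.append(row)
    return kept

rows = [
    {"nlq": "slowest endpoints", "schema": "duration_ms, endpoint", "output": "QUERY_1"},
    {"nlq": "slowest endpoints", "schema": "duration_ms, endpoint", "output": "QUERY_2"},  # dropped
]
print(len(dedup_pairs(rows)))  # 1
```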
I went through it as steps one through eight, but just know that I have to go back and forth between all these different steps and redo things as I hit various issues — like I mentioned, I constantly have to rewrite the level one evals, or I might decide to redo the level two evals. Again, this is a very simple example, just to give you a concrete use case and an idea of the workflow. So that is the honeycomb use case.

Let me quickly switch gears and talk about debugging Axolotl. If you're going to use some software, it's really important to know how to debug it, and I want to call your attention to these docs that show you how to debug Axolotl. There are guidelines there that I think are really important. If something is going wrong: number one, make sure you're using the latest version of Axolotl. You also want to eliminate concurrency as much as possible — make sure you're only using one GPU and one dataset process. Use a small dataset and a small model; you want to minimize iteration time. And clear your caches. Clearing caches is huge, especially if you're debugging something about dataset formation — say you don't think your prompt is getting assembled correctly — because a stale cache can really trip you up. There were also a bunch of questions in the Zoom about how to connect to the Docker container you can run Axolotl in, and that's actually connected to debugging, because you can use VS Code to do it. I have some videos and tutorials in the Axolotl docs that show how to do that, with or without Docker, and how to attach to a remote host. Let me go back to the slides — I already covered this.

Wing, we went through a lot — I just want to stop and ask: is there anything else on your mind, any tips for people using Axolotl that you'd like to highlight? I don't have any off the top of my head; it usually comes when people ask questions and I remember, oh, you should do this or that, but nothing right now. No worries. There are a couple of questions in the Q&A — some are listed as answered, but so everyone can hear them: how do you predict how long a fine-tuning job will take before you start it? Any recommendations? That one is relatively hard to answer. It depends on the model size, LoRA versus full fine-tune, the GPUs you're using, how many GPUs, whether you're using DeepSpeed ZeRO-2 or ZeRO-3 with offloading — there are so many factors that affect how long it takes to fine-tune a model. Once you have a gauge on a specific dataset and the hyperparameters you're going to use for a particular set of experiments, you can usually extrapolate from that, but I don't have a good all-around formula that works for everybody. Yep. Looking through the other questions — we've got a lot. One I answered in text just a second ago: someone asked about doing a fine-tune and then
improving the data, like Hamel was just saying, and then whether you should start from scratch again or fine-tune on top of that fine-tuned model. One thing to think about there: if your model is already getting pretty close to overfitting, fine-tuning it again for more epochs is definitely going to overfit at that point. You should really consider cleaning up the original data, adding in the new improved data, and starting from scratch again on the base model. Yeah — I always start again from scratch when I improve my data; I haven't thought about trying to keep going. Okay, looking at the time, I think we should move forward and jump right into Zach's section. Sure, let's do it — looks like I can take over for you, so that's less for you to worry about. Everyone seeing me? Perfect.

Hey everyone, my name is Zach Mueller, and we're going to be talking about scaling model training — as you get more compute, how do people actually use it? A little about me: I'm the technical lead for the Hugging Face accelerate project, and I handle a lot of the internals when it comes to the Transformers Trainer. I'm also a humongous API design geek. Before we talk about how people do what we call distributed training, let's get a general understanding of model GPU usage. We were talking about how you can use things like LoRA to reduce some of the memory overhead — but how much memory do certain models actually use? We can roughly estimate that number for vanilla full fine-tuning, without LoRA, and you can convert some of it later. The assumptions are that we use the Adam optimizer and a batch size of one. Take BERT base cased, which is about 108 million parameters — how much GPU space do I need to train it? Each parameter in the model is four bytes, the backward pass usually takes about two times the model size, and the optimizer step takes about four times the model: one times for the model, one for the gradients, and two for the optimizer states with Adam. After doing that computation you end up needing about 1.6 gigabytes to train BERT at a batch size of one. With mixed precision that's knocked down somewhat: the model stays in full precision — I'll go over why that's important in a moment — but the gradients take less space because they're in half precision. So we can roughly guess it will take about one to two gigabytes overall to train BERT.

Now, why does that matter? That's great if you have 12 to 24 gigs of GPU space — a typical consumer card. But what happens when we scale up? Look at Llama 3 8B: eight billion parameters. Loading the model in full precision takes about 28 gigs, the gradients are another 28 gigs, the backward pass gets you to 56, and suddenly you're somewhere between 56 and 112 gigs of VRAM. I certainly don't have 56 gigs on a single card, let alone 112. So if we want to avoid things like PEFT, what do we do?
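A back-of-the-envelope version of that memory math, as a quick sketch (full fine-tuning with Adam, batch size one, activations ignored); the helper function name is just for illustration.

```python
# Rough full-fine-tuning VRAM estimate: fp32 weights + gradients + two Adam states,
# i.e. about 4x the size of the model weights.
def full_finetune_vram_gb(n_params: float) -> dict:
    bytes_per_param = 4                    # fp32 weights
    model = n_params * bytes_per_param
    gradients = model                      # one fp32 gradient per parameter
    optimizer = 2 * model                  # Adam keeps two states per parameter
    total = model + gradients + optimizer  # roughly 4x the model weights overall
    parts = {"model": model, "gradients": gradients, "optimizer": optimizer, "total": total}
    return {name: round(size / 1e9, 1) for name, size in parts.items()}

print(full_finetune_vram_gb(108e6))  # BERT base: ~0.4 GB weights, ~1.7 GB total
print(full_finetune_vram_gb(8e9))    # Llama 3 8B: ~32 GB weights, ~128 GB total
                                     # (same ballpark as the 56-112 GB range above)
```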
This is where the concept of distributed training comes in: how do we use multiple GPUs to achieve what we want? There are three kinds of training when you think about it at the hardware level. There's single GPU — no distributed techniques, you run straight off whatever GPU you have. There's distributed data parallelism (DDP), which works by having a full copy of the model on every device while the data is chunked and split between the GPUs; another way to think about it is that we can process the data faster because we're sending chunks of the full batch across multiple GPUs, which speeds up training time. And the last part, which I'll also cover in today's talk, is fully sharded data parallelism (FSDP) and DeepSpeed. These are the key pieces hinted at in the earlier discussion, where we split chunks of the model and optimizer states across multiple GPUs. Rather than hitting DDP's limit — say two 4090s at 24 gigs each, and that's all the memory I can use — the pair acts like a single 48 GB GPU in terms of the total RAM you can play with to train models. That's the secret to how you can train these larger and larger models.

So what is fully sharded data parallelism? The general idea is that you take your model and create what are called shards of it — imagine a shard being the model split perfectly in half, the first half and the second half. Depending on how we configure FSDP, certain chunks of the training loop happen in each shard's VRAM space, and at certain points torch needs to know what's happening with the other model chunk, because it's all the same model and the gradients need to be aligned. Those are called communications, and generally you want fewer of them, because they're time spent on your GPUs just talking to each other and trading information — you're not training anything, you're not processing data; it's literally your two GPUs trading notes on what they think the model should be and correcting themselves.

I'm not going to go in depth into every single thing FSDP can do. I'm going to cover what I think are the most important pieces for training in low-resource settings: how you dictate the way those weights, gradients, and parameters get sharded, plus the options I needed when I was doing a full fine-tune of Llama 3 8B without PEFT on two 4090s — spoiler alert, it was very slow. The first part is the sharding strategy, which is us telling FSDP how we want to split all the different things that take up VRAM. With full shard, as it sounds, everything gets split: the optimizer state, the gradients, and the parameters. With shard grad op, we only shard the optimizer state and the gradients; the model is split while we're not using it and joined back together when we are, such as during the backward pass. That reduces some of the memory overhead — you still need more than the original model, because the entire model still has to fit in VRAM at times, but it trims the training VRAM a bit. Then there's no shard, which, as it sounds, is just distributed data parallelism — nothing gets sharded.
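For reference, these strategies correspond to PyTorch's FSDP ShardingStrategy enum — a minimal sketch of the options being described (accelerate and Axolotl configs ultimately select among these values):

```python
# The four sharding strategies as they appear in PyTorch's FSDP API.
from torch.distributed.fsdp import ShardingStrategy

ShardingStrategy.FULL_SHARD     # shard parameters, gradients, and optimizer state
ShardingStrategy.SHARD_GRAD_OP  # shard gradients + optimizer state; gather full params for compute
ShardingStrategy.NO_SHARD       # plain DDP: every GPU keeps a full copy of everything
ShardingStrategy.HYBRID_SHARD   # full shard within a node, replicate the model across nodes
```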
The last one is a newer thing PyTorch came out with called hybrid sharding. It's like full shard, in that we fully shard absolutely everything — optimizer states, gradients, and parameters — but if you're training multi-node, meaning multiple computers training one big model at once, it keeps a copy of the entire model on each node. That matters because, remember, communications slow things down a lot; hybrid shard reduces the communications from, I think, three down to two if not one, and so your training speed increases — to some extent dramatically, depending on how long it takes your computers to talk to each other.

Next: we know how we're going to split the memory, but how do we split the model? We need some way to tell FSDP, all right, I have this model, how do I want to split it between my GPUs. With accelerate, with Axolotl, with Transformers, we use two nomenclatures: transformer-based wrap and size-based wrap. Transformer-based, as it sounds, is very specific to transformer models: you declare the layer you want to split on, such as a BERT layer or a Llama decoder layer, and Transformers usually has good defaults and helpers to figure out what that is. The other version is more manual: you just tell FSDP to split the model after X amount of parameters. That's great because it works out of the box; it's bad because you could be missing speed increases you'd get by having, say, each block of a Mistral model on its own GPU so it can handle its own computations without waiting to communicate with the other GPUs.

The next part, which was particularly important for me, is the idea of offloading parameters. This says: okay, I have 48 gigs of VRAM with my two 4090s and I can't fit this training run — but I accept that, I still want to do it, and I don't want to go through a cloud provider. FSDP will let us offload gradients and model parameters into CPU RAM. As that sounds, it's going to be extremely slow, because we're moving things from the GPU to the CPU and into RAM, but it lets you train as big a model as you have RAM available for. Case in point: when I was doing a full fine-tune of Llama 3 8B to match a paper that came out, I wound up needing offloaded parameters, because as we saw earlier, 8 billion parameters requires 50-some gigs and I only have 48 — and it was going to take about 72 hours to do four iterations through my data, versus an hour or two on an H100. So yes, it's cool that you know how to use these tools and they can help you train things locally, but double-check (a) what your time constraint is and (b) what your budget is, because I can run it for free and it takes much longer, or I can pay five dollars and finish in an hour — each solution offers different opportunities depending on how much time you have.
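To make the wrapping and offload options concrete, here is a minimal sketch at the raw PyTorch FSDP level; accelerate's FSDP config exposes the same ideas under names like transformer-based wrap, size-based wrap, and offload params. The LlamaDecoderLayer class and the numbers are illustrative assumptions.

```python
# The two auto-wrap approaches plus CPU offload, expressed with PyTorch's FSDP primitives.
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy, size_based_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# "transformer based wrap": split on specific transformer layer classes
transformer_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
)

# "size based wrap": split after roughly N parameters; model-agnostic but less tuned
size_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e8))

# Wrapping itself needs an initialized process group, so shown here as a comment:
# model = FSDP(
#     model,
#     auto_wrap_policy=transformer_policy,
#     cpu_offload=CPUOffload(offload_params=True),  # slow, but lets CPU RAM stand in for VRAM
# )
```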
Another critical part, in my opinion, when doing FSDP with accelerate and Transformers is the idea of CPU-RAM-efficient loading, along with sync module states. If you're familiar with accelerate's big model inference, great; if not, here's a brief summary. PyTorch lets us use this thing called device="meta", which is essentially the skeleton of your model: the weights aren't loaded, it can't really do computations, but it's the skeleton we'll eventually load weights into. So rather than loading Llama 3 8B on eight GPUs — which would need eight times the RAM of the model to load it all at once, easily 100 to 200 gigs if I'm not mistaken — we put all the other copies on that meta device, so they take up no RAM, and load the real weights on only one of them. Then, when we're ready to do FSDP, we already know we're sharding the model, so we just have that first rank send the weights to whichever GPU needs that particular chunk. This keeps your RAM usage low, and you don't suddenly hit crashes because you ran out of CPU memory — because, fun fact, you will redline that quite often, at least in this particular scenario.

Now, I've talked about FSDP a lot and assumed you knew the context around Axolotl and Transformers. Let's step back and focus on what you might not know is the foundation of a lot of your favorite libraries. Practically all of Transformers, and Hugging Face as a whole, relies on accelerate — same with Axolotl, fastai, basically anything lucidrains does at this point, as well as Kornia. The general idea is that accelerate is essentially three frameworks: a command-line interface, which Hamel and Wing already showed whenever they ran accelerate launch; a training library, which under the hood does all of this distributed training fairly easily; and the big model inference I mentioned a moment ago. For the sake of this talk we don't care about big model inference — we care about fine-tuning LLMs — so we'll focus on the first two. You need about three commands to get everything going. The first is accelerate config, used to configure the environment — and this is what Wing has wrapped around beautifully, because his config files can be used directly with accelerate launch, which is phenomenal. The second is estimate-memory, which runs the calculations I showed a moment ago when I was playing with how much VRAM I'd need. The last is accelerate launch, which is how you run your script.

Why does that matter? Launching distributed training kind of sucks. There are a lot of different ways to do it and a lot of different commands to run — some of it is PyTorch, some of it is DeepSpeed, and all of them have slightly different commands. If you just run python script.py, it's not going to train in any distributed scenario; at most you get model parallelism, but distributed data parallelism and FSDP won't work. torchrun and deepspeed are the main two commands you can use: torchrun basically says run my script on a single computer with two GPUs, and then does some things in the background to make sure that works. That's a lot of different commands to know and remember, so accelerate launch is here to say: just tell me what you're doing, and I'll make sure it runs.
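Circling back to the meta-device loading described a moment ago, here is a minimal sketch using accelerate's init_empty_weights; the model id is just an example.

```python
# Build the "skeleton" of a model with no weights loaded: parameters live on the meta
# device and take no RAM until real weights are materialized.
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")  # example model id
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

print(next(model.parameters()).device)  # meta
# In FSDP training, accelerate/Transformers pair this with options like sync_module_states
# and CPU-RAM-efficient loading so only one rank materializes the real weights and then
# ships each shard to the GPU that needs it.
```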
It operates on config files, similar to what Wing was showing us with Axolotl, and these essentially define how we want things to run. Here we're saying: I have a local machine that's multi-GPU, running with bf16 mixed precision on eight GPUs. With FSDP, on the other hand, we can go through and specify everything we want FSDP to use in the config, and that way accelerate launch just knows: hey, we're going to train with FSDP. If you're using accelerate, that's all you need to do from a launching perspective, and the same goes if you're using Axolotl or Transformers.

The next part I'll show is a bit of the internals — the low level of how accelerate works and how you can use it directly — but remember, this isn't necessarily needed if you're using things like Axolotl or Transformers. The general idea with accelerate is that we want a low-level way to make your code device-agnostic and compute-agnostic: the same code running on a Mac, on a Windows machine, on a GPU, on a CPU, on TPUs — and it does so in a minimally intrusive and, ideally, not very complex manner. You create an Accelerator, have it prepare all your objects, switch your backward call to accelerator.backward, and on the whole that's most of what you need to do. How it works is, similar to FSDP, accelerate does the data sharding for you: it takes your data and splits it across GPUs. It also operates with essentially one global step. An easy way to think about it: if a single GPU had a batch size of 16 and now we're training on eight GPUs, the equivalent in accelerate, to get the exact same training, would be a batch size of two per GPU, because two times eight is sixteen. What that gives you is training that scales and should have roughly the same results as on a single GPU, without needing to worry about, oh, do I need to step my scheduler more, do I need to adjust my learning rate, do I need to do this or that — it's the same amount of data being processed at one time, and everything else is done for you.

Now I want to talk about some very specific tweaks we do to protect you from dumb decisions. The first is mixed precision, and this is a bit different from perhaps your usual idea of it. We do not convert the model weights to bf16 or fp16 when training with accelerate, and we try our hardest to make sure that doesn't happen; instead we wrap the forward pass with autocast, so the half-precision conversion applies to the computations and gradients rather than the stored weights. This preserves the original precision of your weights and leads to stable training and better fine-tuning later on, because — and this is very important — if you cast your weights to bf16, you are stuck in bf16. There was a whole issue a few months ago with Transformers where the quality of some fine-tuned models wasn't good, and this was the cause.
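Here is a minimal sketch of that Accelerator pattern; the model, optimizer, and dataloader are stand-ins, not anything from the talk.

```python
# The basic accelerate training-loop pattern: create an Accelerator, prepare your objects,
# and swap loss.backward() for accelerator.backward(loss).
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16")  # forward pass runs under autocast;
                                                   # the stored weights stay in full precision

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 128), torch.randint(0, 2, (64,))),
    batch_size=8,
)

# prepare() handles device placement, DDP/FSDP wrapping, and splitting batches across GPUs
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # instead of loss.backward()
    optimizer.step()
```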
Going a bit beyond that: if you're keeping up with memory-efficient training, you might have heard of Transformer Engine or MS-AMP. The idea there is to make use of 4090s and H100s and do training in 8-bit. This is different from quantization — you are actually training in native 8-bit, and that's all you have. A lot of the mistakes I see people make with this, especially with the NVIDIA examples, is doing the thing I just warned about: converting the entire model to bf16 and then training. That leads to huge instabilities during training, and generally people's results haven't been the best. I've also heard rumors that even fp8 done carefully can go badly, so if you have the ability, it's always worth comparing fp8 against non-fp8 — including bf16 — and testing which parts can be in 8-bit. With Transformer Engine it's still using autocast, so the computations are done in 8-bit rather than 16-bit; MS-AMP lets you experimentally go even further, to the point where at its O3 level almost everything is in 8-bit — your master weights are in 16-bit and your optimizer states are even in 8-bit. I'm scared to play with that and I don't know how good it is yet; that's part of what I'm using the Llama 3 training for, to toy with these things. But it opens up opportunities if you have the compute.

The last part I'll mention very briefly — and we can talk more about it in my office hours — is DeepSpeed, by Microsoft, versus fully sharded data parallelism. These two are almost exactly the same; DeepSpeed has a few tweaks and calls things a bit differently, but if it can be done in FSDP it can be done in DeepSpeed, and vice versa. A wonderful community member recently posted some documentation that maps this parameter in DeepSpeed to that parameter in FSDP. Generally, whether people prefer DeepSpeed or FSDP is a matter of whether you want to go with Microsoft's thing or stick with PyTorch and stay native, but either can be used interchangeably as long as you're careful about setting up the config. So, as a whole: accelerate helps you scale out training, especially using FSDP and DeepSpeed, to train these big models across a number of GPUs; you can use techniques like fp8 to potentially speed up training and reduce some of the computational overhead; but when using mixed precision in general, and especially fp8, be very careful about how you do it, because you could lock yourself — and everyone downstream of you — into those weights. I'll post this presentation in the Discord; there are some handy links that will help you get started with accelerate and some concept guides for understanding the internals.

Let's look at some questions. Here's one: I thought DeepSpeed ZeRO-3 was the same as FSDP, but the other options in DeepSpeed weren't necessarily equivalent? It's gotten to the point where there are some equivalencies now — the chart in that documentation covers it. ZeRO-3 is definitely the equivalent of FSDP's full shard, but there are tweaks you can make, because FSDP gives you options to offload only certain things. One thing I want to mention — I didn't show you that there are DeepSpeed and FSDP configs: when you want to do multi-GPU training in Axolotl, you have to supply a config file. I'll show some examples of those. Whenever Zach's done I'll share my screen — yep, sorry, go ahead. Okay, I'll just do it right now, let me find it. Can I add some clarification while we're pulling that up?
Yeah. So one of the things, especially for the FSDP side of the Axolotl configs, is that we try to move those FSDP-specific settings into the Axolotl config and then map them into accelerate. What we found was that a lot of people would run accelerate config, set things up there, then go use Axolotl, and there would be a mismatch in certain parameters — and in a lot of situations it would just break. So what we actually recommend — we added a warning saying just remove your accelerate config — is that we map all of the configuration that accelerate would normally set. I think accelerate uses environment variables under the hood to communicate that when you use accelerate launch, so we mimic a lot of that, just to avoid the headache of running accelerate config once and getting a mismatch later on, which caused a lot of support issues. That makes perfect sense — that's exactly the solution I'd recommend. I'm even debating rewriting half of our internals for the FSDP and DeepSpeed plugins, because I don't necessarily want to rely on environment variables, and setting it up, as you've experienced, is normally problematic at best. It's a very smart way to go about it, because we've had users report issues where the answer was, well, you set up your config wrong and you're using something else.

Yeah. And what you heard from Zach today about ZeRO stages one to three, bf16, all of that — that's background you might want to know, to demystify a little of what's happening when you supply these configs. What I do, honestly, is just use one of these configs off the shelf — zero1, zero2, zero3, or the bf16 one. Maybe I'll consult what Zach has written — he's given similar versions of this talk before and posted them online, and he posted his slides today — and sometimes I fiddle with it a bit, but honestly I just use the ones that work. If I want to parallelize a bigger model across GPUs, I pick the right config — these configs live in the Axolotl repo — and supply it to the main config; I'll show an example when we talk about Modal in a second. Can I add a clarification on this one? With ZeRO-1 and ZeRO-2 specifically, the bf16 and fp16 settings can be set to auto, because DeepSpeed doesn't care about them until after the trainer is loaded. But for ZeRO-3 specifically — and I see Zach nodding — it needs to know ahead of time that you're using bf16, so you can't set auto in the ZeRO-3 config if you want bf16. That's why there's a specific zero3 bf16 config: it needs to know you want to load in bf16 before the trainer ever sees it, or something along those lines — maybe Zach can explain it better than I can. No, that's a pretty good explanation. It's something with DeepSpeed where, when it sets up the actual call to DeepSpeed and initializes everything, it has to know well beforehand what we're doing, which makes it a little annoying when we're dealing with configs like that.
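A sketch of that ZeRO-2 versus ZeRO-3 point, written as Python dicts for brevity (in practice these live in JSON files, like the DeepSpeed configs shipped with Axolotl); all keys beyond the ones shown are omitted, and the exact file names are not reproduced here.

```python
# Minimal illustration of where "auto" is allowed in a Hugging Face-style DeepSpeed config.
zero2_config = {
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": "auto"},  # ZeRO-1/2: the trainer can fill this in after it loads
}

zero3_bf16_config = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},    # ZeRO-3: DeepSpeed must know the dtype before initialization,
                                  # which is why a separate "zero3 bf16" config exists
}
```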
Okay, I think we should probably move on to the next thing, which is training on Modal — Zach, just want to make sure you're done? Yep, you're good. All right. So there are a lot of different ways you can train models. You can use RunPod, which Dan showed earlier — that recording was done on RunPod. If you look at the Axolotl docs and search for RunPod you'll find a bit about it, and there's also a Docker container for Axolotl, which is what you want to use most of the time. Wing, do you want to say anything about that — what's your preferred way of running it, what's your compute? On my local 3090s I don't use Docker containers, mostly because it's development work and that's just not amenable to Docker. But for general debugging of issues people are seeing, I'll usually just spin up a Docker container on RunPod and debug there, because that environment doesn't have all the mess and mismatch of various packages I might not have updated. Makes sense. And if you look at the README there's a whole bunch of material about it.

Okay, so Modal — what the hell is Modal? Just a general note about this conference: we were pretty selective about the tools we brought in, or that I'm going to talk about. I'm only going to talk about tools that I use or that I like — there are hundreds of tools out there — and one I really like is Modal. Modal is a really cool cloud-native way to run Python code, and the interesting thing, one real innovation, is that it feels like local development but it's actually remote development. That has nothing to do with fine-tuning per se; I'm just giving you a little background on what Modal is. It's also massively parallel, so things like Axolotl fine-tuning it can do easily. Actually, Wing — how do you do hyperparameter search with your Axolotl training? What do you like to do? It's manual right now — change learning rates and so on. Makes sense. A lot of the time I'll use something like Modal to do things like hyperparameter tuning. There are different ways to do hyperparameter tuning, it's not something you should focus on in the beginning, and it's totally fine to do it manually — I do a lot of things manually, and I sometimes use bash scripts to kick off many different Axolotl runs. Modal is very Python-native, and the Modal docs are here if you're just getting started. To really experience the magic of Modal — you're probably thinking, what am I talking about, local but remote, what does that even mean — I don't know how to explain it without you trying it yourself. There are a lot of docs; you can go through the hello-world getting-started one, but what I actually like to show people first is the web endpoint example. I'm not going to demo it right now because I don't have time, but basically, just try it out.
What you do is change the code and see it change in production in real time, without doing constant deploys to ship every change — it's this really iterative, interesting way of working. I've built lots of tools on Modal: a meeting-transcript summarizer, Weights & Biases webhooks — the links will be in the slides, so I won't belabor that. Now, for Axolotl specifically, Modal has a repo called llm-finetuning, and it's a bit different in that it wraps Axolotl. Axolotl is already wrapping so much — why wrap Axolotl? Well, if you have a workflow you really like, you might want to abstract it a little more, and you get all the benefits of Modal by doing that. A few things to know about this repo: when you run the training, it automatically merges the LoRA back into the base model for you by default — you can turn that off. One key thing is that there's a data flag you have to pass; you can't rely on the dataset in the config file. And the DeepSpeed config comes from the Axolotl repo itself, so you reference the Axolotl repo — like I was showing earlier, those DeepSpeed configs are mounted into the environment. It's kind of a beginner's way of using Axolotl with Modal, but it's something to try first, and you can tweak it and change the code. There's a README with a way to get started — obviously you have to install Modal — and essentially you clone the repo and launch the fine-tuning job. In the launch command, the detach flag just makes it run in the background so you can go do other things. Here's the entry point: this is where the Axolotl CLI command gets wrapped in this train function, and then you pass in the config file and the data. So it's very similar to running Axolotl, just wrapped — see the sketch below for the general shape. Here's a really quick video of what that looks like: you run modal run and it goes ahead and does your Axolotl run — this is the exact example from the repo — and you can do the same things, put in your Weights & Biases and Hugging Face tokens, and so on.

Let me go back to the repo and help you navigate it — I'm going to hit the period key to open VS Code so I can show you some code. The source code for the Modal part is in the src folder, and the training module is what you want to look at if you're curious about what's happening; the entry point we just demoed is the train function in that file. And common.py is the setup: it sets up the environment and the Docker container, installs some dependencies, and brings your secrets in. You don't have to worry about that part.
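Here is a rough, conceptual sketch of that wrapping idea — an Axolotl CLI call inside a Modal function. This is not the llm-finetuning repo's actual code; the image definition, GPU type, timeout, and launch command are all simplified assumptions.

```python
# Conceptual sketch only: run Axolotl remotely by wrapping its CLI in a Modal function.
import subprocess
import modal

app = modal.App("axolotl-finetune-sketch")  # hypothetical app name
image = modal.Image.debian_slim().pip_install("axolotl")  # the real repo uses a more involved image

@app.function(image=image, gpu="A100", timeout=6 * 60 * 60)
def train(config_path: str):
    # Same idea as running Axolotl locally, just executed on Modal's GPUs.
    subprocess.run(
        ["accelerate", "launch", "-m", "axolotl.cli.train", config_path],
        check=True,
    )

@app.local_entrypoint()
def main(config: str = "config.yml"):
    train.remote(config)  # `modal run --detach` keeps this running after you close your laptop
```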
I wouldn't actually look at it in the beginning — I'm just showing you around so that if you want to dig in, you can, and I think it's pretty cool. One thing I do want to point out is that there are these config files. If you want to run the demo in the README out of the box, there's a very small training run that basically overfits on purpose. Just know that the dataset listed here gets replaced by whatever data flag you pass in, and that DeepSpeed is actually being used here — that's the background Zach just gave — and it's mounted from the Axolotl repo, because remember, the Axolotl repo ships those DeepSpeed configs. That's just to orient you; let's go back to the slides.

Another thing you might want to do is debug the data. You can run this end to end, but remember what I told you: you don't want to just train stuff blindly. If you want to check your own data inside Modal, I have a notebook about inspecting data — I'll change the GitHub URL to nbsanity because it's easier to read. You do basically the same thing: you run modal run but pass a preprocess-only flag, and the logs will print out a run tag. With that run tag you can find the last-run-prepared folder, grab that data, and analyze it the exact same way I showed in the honeycomb example — print it out to make sure the data is in the right format. That's important; you might want to do that if you're using this, and the notebook might help. I think that's it — we can do Q&A.

Okay, I'll MC the Q&A. We have some questions that were answered in text, but so people can hear the answers I'll do a mix of open and already-answered questions. A couple of common ones first: will office hours be recorded? Yes. Are tiny models like Phi-3 more or less suited for fine-tuning? You answered that in text, but since it was highly voted, do you want to take it, Hamel, or anyone else? I usually don't go smaller than a 7 billion parameter model, because I haven't had to — that's a really sweet spot for me; the models are good enough and they're small enough. But Wing, anyone else, any opinions? I haven't spent a lot of time with the Phi-3 models, mostly because I wasn't impressed by the Phi-1 models — I felt they were just way too small, and with the smaller models the reasoning is worse. Llama 3 is good enough and it works, so yeah, 7 billion. Next, about how to determine the adapter rank: this wasn't part of the question, but there are actually two parameters that go together, the adapter rank and the adapter alpha. Someone asked how to determine the adapter rank — what do you have for that one?
I just copy the config, so I don't determine anything — Wing determines it. That's one of those hyperparameters you should play with, assuming you have good evaluations, to understand whether a LoRA at that rank is sufficient to get good accuracy on your downstream use case. 16 or 32 is typically the starting point you see most people use for the rank, and for alpha, I believe the papers say it should be two times the rank. If you're using something like RSLoRA, it has something to do with the square root instead, but I try not to get into that. There's a blog post — I think by Sebastian Raschka — where he actually does a grid search and talks about what works for those; I'll try to share that with the community. There's another thing I do, and this is kind of a weird answer: I ask my friends who are a lot smarter than me. There's this guy Johno Whitaker who really understands this stuff, and I'll ask, hey, what rank do you think I should use for this, and he gives me tips. Johno is actually speaking at this conference — he might not cover exactly this, but he has a really cool talk called "napkin math for fine-tuning" that you should check out.

I'm going to switch to some open questions and take the one at the top: I have a custom evaluation or benchmark for my model — is there a way to get it to run periodically during fine-tuning, to see how training is going against that evaluation metric? That's actually something I've wanted in the past; I don't know the answer. Wing, does that question make sense — can you have an evaluation function or some callback in Axolotl, if you want to compute custom evaluation metrics? How do you deal with that? There are the tiny benchmarks you can run against the more standard benchmarks, but as far as more custom evaluations go, it's not really supported right now. You could probably do it by adding callbacks on the evaluation loop and doing something a bit janky, pulling things from disk and so on. Here's something you could probably try, though: if you specify a custom test dataset for your evaluations, you can have it generate predictions for that dataset at certain steps and log them to Weights & Biases, and then you could pull those from Weights & Biases and run your own evaluations, using LLM-as-a-judge or something along those lines. That would be one way to do it, but there's nothing directly integrated and streamlined for it right now. How would you do that dumping of predictions in Axolotl? It's already built in — there's a setting called eval_table-something. What it does is pull some number of prompts from your test dataset, run predictions on them during the evaluation step, and log those to Weights & Biases.
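For the callback route Wing mentioned, here is a minimal sketch using the Hugging Face Trainer's callback API, which Axolotl builds on; the metric itself is a placeholder, and wiring this into an Axolotl run would take extra plumbing, as discussed.

```python
# A custom-eval callback: run your own benchmark whenever the Trainer evaluates.
from transformers import TrainerCallback

class CustomEvalCallback(TrainerCallback):
    def __init__(self, eval_fn):
        self.eval_fn = eval_fn  # e.g. runs your own benchmark or an LLM-as-a-judge pass

    def on_evaluate(self, args, state, control, **kwargs):
        score = self.eval_fn(kwargs.get("model"))
        print(f"step {state.global_step}: custom eval = {score:.3f}")
        # could also log this out, e.g. to Weights & Biases, instead of printing
```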
It's a little bit flaky, though, so it's not a top-level feature I've leaned on — I think a contributor submitted it. The two settings are eval_table_size and, I believe, eval_max_new_tokens: the table size is the number of predictions you want to generate, and the max new tokens is how many tokens you'd like it to generate during that eval. That makes sense. Next question — I like this one: given that Axolotl is a wrapper around some Hugging Face libraries, are there any important edge cases of functionality that you can do in the lower-level libraries that aren't yet possible in Axolotl? I'm sure there are a lot of things you could do — tons — because at the code level you have access to everything underneath: you can write custom callbacks, you can do the eval thing we were just talking about, all kinds of stuff. I think the gap is especially this: at the speed Wing can implement whatever we chuck into accelerate — and, more specifically, whatever we then chuck into the Trainer — whatever that gap is, is the bleeding edge you don't have access to. That could be new FSDP techniques or new DeepSpeed techniques that get added, which we need to update in accelerate and then push to the Trainer. For the most part that should be the main gap, because we try to shove everything we can from accelerate into the Trainer, which Wing then gets for free. And I think this flexibility for callbacks during training — doing whatever you want at each batch, or at whatever frequency, to calculate custom evaluation metrics or poke at your data — there aren't a ton of use cases for it, but doing things in between batches, that sort of callback, is a good example.

You might be wondering, then, why use Axolotl — it's worth bringing that up again. One reason is that there's a lot of stuff you need to glue together, especially if you don't have a lot of GPUs. One example that came up recently: QLoRA with FSDP didn't work for the longest time. The Answer.AI team enabled it, and within hours Wing had glued it into Axolotl — really before anyone else — so I was able to use it almost right away. And Wing keeps doing that, over and over, for anything that happens. The LLM space is changing extremely fast; from day to day there's a new technique for efficient fine-tuning — lower GPU memory, faster, whatever — and the ones that really matter, like that one, get into Axolotl really fast. Trying to do all of that yourself would take a long time.

There's a question: what are the practical implications of 4-bit versus higher precision? I think we said some of that we'll cover more at deployment. Is there anything you think we missed on the implications? Obviously 4-bit is going to lead to a smaller LoRA and require less RAM — anything else? Well, 4-bit can be pretty aggressive. I have noticed performance degradation when going all the way down to 4-bit before — for example, I've been using the MLC library, which has 4-bit quantization, and there I did see a difference.
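To make the 4-bit trade-off concrete, here is a minimal sketch of loading a model in 4-bit via Transformers and bitsandbytes (the QLoRA-style NF4 setup, not what MLC does); the model id is just an example.

```python
# Load a causal LM in 4-bit with bitsandbytes: much smaller footprint, but every forward
# pass pays a quantize/dequantize cost, which is the slowdown discussed below.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still happens in 16-bit after dequantizing
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # example model id
    quantization_config=bnb_config,
    device_map="auto",
)
```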
I don't see much of a difference between 16-bit and 8-bit, but I'm just talking about vibe checks — there are probably papers out there that do proper analysis. You always have to check for yourself: it's worth just doing it, running your evals, and seeing what happens. Generally the trade-off is that with the smaller quantized model you get something more portable and probably faster — maybe it now fits on one GPU so you don't have to do distributed inference, things like that — but it might come at a performance hit, so you have to run your evals to see what that hit is. One thing to keep in mind is that QLoRA is really a trade-off for when you don't have enough GPU RAM. If you have an H100 and you're training a 13 billion parameter model and it fits, don't go down the QLoRA path, because you lose a lot of performance in the quantization/dequantization step. I experimented when QLoRA came out and wondered why it was so terrible on A100s — shouldn't it be faster? No: because of the quantization and dequantization steps it's actually slower, if you're going for speed and performance when you don't actually need the memory savings. It can be an over-optimization in some cases — it's definitely a GPU-poor optimization, which, to be fair, describes a lot of people.

Does Axolotl also support Mac M-series GPUs? So, PyTorch is supported on Mac M-series, and there's an example somewhere of someone doing it, but you're probably better off using MLX — I believe that's the repository with better fine-tuning support if you want to fine-tune on your MacBook. I think it's called MLX, yeah. Yeah, it's MLX — because fine-tuning on Macs means three different frameworks and three different backends, and all of them only kind of work. So it can work, but your mileage may vary. We got a request for your slides, Zach — I assume you'll be able to share them? They're actually already in the Discord. Great, we can probably upload those along with our slides. Yeah, it's just a web URL, honestly, because mine are hosted on the Hugging Face Hub. Fancy.

In an overarching sense, are there mental models or intuitions we bring to agentic LLM applications versus ones that are not agentic? I saw this question — mental models, agentic versus non-agentic. Okay, what does agentic even mean? Agentic describes some workflow where there's a function call — really, models that make function calls are what people call "agentic." I just want to demystify the terminology; people have terms and then it starts to feel like rocket science. I've actually not worked on a real use case where there isn't some function call involved. Even the honeycomb example executes a query at the end for you — that's after the query generation, but it is executing it, and it goes into a loop after that to try to correct things if something goes wrong. It's really hard to think of a use case with no function calls; all the ones I've had have had them. I think you need to write evals that you can think of as unit tests and integration tests — it's important to have tests
that test the function calls, with unit tests for them as well as integration tests. That's what I would say about it. All right, I've got one: is fine-tuning an LLM to output deterministic results — exactly the same every time — possible? I think this is important, because outputting deterministic results isn't about how you train; it's about how you do inference. You train the model and it ends up with some weights, and when you predict the next word, the last layer is a softmax, so the output of the model is actually a probability distribution over the next token. To make that deterministic you just choose whichever token is most likely; if you don't, you're sampling from that probability distribution. Either way, that's something that happens at inference time, not at training time. I'll add a little nuance: if you want structured output from your LLMs, the guided generation Dan is alluding to lets you clamp down the model so it only produces tokens that make sense within your constraint. If you want JSON output with a certain schema that only allows certain values, you can have a grammar — basically rules that restrict which tokens the model is allowed to predict. Fine-tuning can make the structure happen more reliably: if there's a very specific structured output you want the model to always produce, fine-tuning helps. It's a trade-off — if you're doing fine-tuning correctly, hopefully the guided generation framework doesn't get triggered that often, and if it's getting triggered very often even though you're already fine-tuning, perhaps that means your fine-tune isn't very good. That said, the cost of guided generation isn't very meaningful; frameworks like Outlines are really good and really fast. But it turns out fine-tuning can help quite a bit with learning syntax, learning structure, and producing more deterministic outputs.
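A small sketch of that inference-time point — determinism comes from how you decode, not from how you train; the model id and prompt are placeholders.

```python
# Greedy decoding vs. sampling: same weights, different decoding, different determinism.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-finetuned-model"  # placeholder id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok(
    "NLQ: count of errors in the last hour\nColumns: status_code, duration_ms",
    return_tensors="pt",
).to(model.device)

# Greedy decoding: always pick the most likely next token -> deterministic output
greedy = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Sampling: draw from the probability distribution -> outputs vary from run to run
sampled = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)

print(tok.decode(greedy[0], skip_special_tokens=True))
```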