Transcript for:
Aligning Open Language Models by Nathan Lambert

Today we're happy to have Nathan Lambert, a research scientist at the Allen Institute for AI who focuses on RLHF and is the author of interconnects.ai. He'll be presenting a really cool talk on aligning open language models today, so thank you for joining us, Nathan.

Yeah, thanks for the intro. Okay, this has been a long time coming. A brief intro on why I'm doing this: since ChatGPT a lot has obviously happened, but it's been a blur for me as much as for anyone else, so taking the time to retell what has happened in this fine-tuning and alignment space since ChatGPT is something I thought was a worthy undertaking. This is not really a 101 lecture, but it will probably give you a lot of context on why people mention certain things, what still matters, and what does not. So hopefully this is fun. I can see the chat; I don't know exactly if questions are going to come to me or if I will see it the whole time. I think clarifying questions are good, maybe not discussions the whole time, and I'll try to make sure there's time for questions at the end. So let's get into it.

Generally we're going to talk about language models, which is what everyone wants to talk about these days, but I need to do some of the older history so I can talk about recent history. The place I like to start is actually with Claude Shannon, who had an early paper on approximating language by arranging characters, essentially an early language model. That's probably why Anthropic called their models Claude; that's pretty well known. A lot has happened since those very early papers on predicting sequences of text, and it is largely built on one loss function, the autoregressive loss function. If you have a training example like "I saw a" and you're trying to predict what comes next, the whole idea is that there's one correct token that serves as the label, and the training loss increases the probability of that token and decreases the probability of everything else. This very simple loss function, classifying which token to actually predict, has enabled wild things (a minimal sketch of it is included below).
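To make that concrete, here is a minimal sketch of the autoregressive next-token loss in PyTorch; the random logits stand in for the output of any language model, and the shift-by-one labeling is the standard setup rather than anything specific to the models discussed in this talk.

```python
import torch
import torch.nn.functional as F

# Toy setup: a batch of token ids, e.g. "I saw a ..." encoded as integers.
# Logits come from any language model head: (batch, seq_len, vocab_size).
vocab_size = 50_000
tokens = torch.randint(0, vocab_size, (1, 8))   # (batch, seq_len)
logits = torch.randn(1, 8, vocab_size)          # stand-in for model(tokens)

# Shift so position t predicts token t+1: the "one correct token" per step.
shift_logits = logits[:, :-1, :]                # predictions for steps 1..7
shift_labels = tokens[:, 1:]                    # the actual next tokens

# Cross-entropy raises the probability of the observed next token and
# lowers the probability of every other token in the vocabulary.
loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size),
    shift_labels.reshape(-1),
)
print(loss)
```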
Things took another turn in 2017 when the Transformer paper, "Attention Is All You Need," was born. Everyone here has heard about it; it's a great exercise to dig into what the attention mechanism is actually doing, but that's not the focus of this talk, so we'll keep going. In 2018 there were three main things, slightly out of order: ELMo was the earliest, which was contextualized word embeddings, and in the same year we also had GPT-1 and BERT released. That was the beginning of the core ideas that modern language models and Transformers were trending towards: getting better models by training on large, internet-scale corpora. BERT was a classifier; GPT-1 was generating text. We continued along these trends through the years. GPT-2 was when we started learning about scaling laws: if you use orders of magnitude more compute, the test loss continues to decrease linearly with respect to the log of compute. These ideas are now commonplace when we talk about language models. GPT-2 also pioneered a lot of discussion on releasing language models. When GPT-2 was first announced, OpenAI held access back because of the risks of language models, and this started a lot of the conversations around what you should or should not release with language models. They eventually did release GPT-2, and you can download the models on Hugging Face and use them, but this is where that conversation around release strategies emerged.

2020 is when language models really started to be noticeably good. GPT-3 is when a lot of people said, whoa, this can actually do really interesting things if I create a really clever prompt and figure out how to give it my information correctly. GPT-3 could do a ton of things with few-shot or multi-shot learning, which is when you give it a few examples in the prompt and then ask it to do another rendition of the task. With this power came many harms, and there was a lot of discussion about the risks of releasing language models and what types of groups would be hurt by them. These are very important problems, and they culminated in 2021 with the Stochastic Parrots paper. Whether language models can be too big is in the title, but it's really a critique of how we should be thinking about language models: what are their limits, are they actually thinking or doing any of these human things, or are they just following patterns in the data? The tragedy of Stochastic Parrots, and why fewer people talk about it now, is that ChatGPT came the year after and totally reshaped the whole narrative around language models one more time.

This is really where today's talk starts: how does this idea of alignment emerge in ChatGPT, and what happens after it? The question I ask myself, or tell a lot of people, is: can ChatGPT exist without RLHF? If you go back and read the actual OpenAI blog about RLHF from the release day, they list all these limitations, but they say that RLHF was an important tool for launching ChatGPT, and the limitations they list are really the things we're still researching and that we'll talk about in this talk. It's a great blog post to go back to. A good way to frame it is that RLHF seems to be necessary but not sufficient: you can't build something like ChatGPT or Gemini or Claude without something like RLHF, but it's not the main thing; pre-training is still most of the work. Still, the fact that RLHF is needed is really important for contextualizing all the improvements we've seen in the open in the last 14 months or so.

Some examples I like to cite on RLHF being relied upon; you could list many more models here than I have. The figure from Anthropic's Constitutional AI paper is the single one I go back to all the time, showing how using RLHF can elicit more desirable behaviors from their models in really dramatic ways. These Elo measurements aren't calibrated, so we don't know how to compare Llama 3 on that chart to Anthropic's models, but the level of investment Anthropic has made in these techniques, and the wide-ranging improvements they show from RLHF, is a flag we can follow as we try to learn how to do alignment with the same precision and impact as places like Anthropic or OpenAI. One such example is a simple quote from the Llama 2 paper. The colloquial way of reading this quote, which I will read, is basically "whoa, RLHF worked really easily." The quote is:
"Meanwhile, reinforcement learning, known for its instability, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness." This is one of the biggest endorsements of RLHF, and it's always fun for me because I came from the RL side and have been learning NLP. For NLP researchers to say these things (yes, reinforcement learning is known for instability, but given that, it is cost effective and time effective) is shocking for an RL person. RL has never been particularly cost and time effective, but in the language model domain, where we're fine-tuning with it rather than learning from scratch, to have people in NLP saying this is really striking for how much impact it can have.

The timeline of alignment and open alignment is really about when we see these benefits, because they didn't show up in models people were playing with for quite a bit of time. So this is a little atlas I've thrown together. I also made a Hugging Face collection where I tried to add all the models I talk about, so you can actually click on the models or try them out if you're inclined to run them yourself. It's another way of documenting the artifacts I talk about, and for me this is a good review of what mattered in this really noisy journey over the last year. Of course, some disclaimers: I'm not covering every model since ChatGPT (this little plot of model icons could probably look more like an exponential than this capped bar), and there's so much history of NLP that people are building on in the alignment space that is totally swept under the rug here, a lot of academic and infrastructure contributions that I'm not talking about but that are really important to this proliferation of fine-tuned models.

To describe what this image is: some of these are base models. I'm not going to focus on base models as much as fine-tuned models, but the base models are extremely important; none of this happens without Llama, none of this happens without Llama 2. The base models are the bedrock of this ecosystem. The aligned models are a lot of times what people can actually play with, what you can try out and do yourself on much less computing infrastructure. So I'm going to talk more about the aligned models, but everything matters; it's one big ecosystem.

Another thing that's not fun but that I'm going to do for the sake of flag-posting, because no one really likes listening to definitions: here are some terms you'll hear thrown around when people talk about "alignment," and this isn't even all of them. Alignment I've defined as a general notion of training a model to mirror a user's desires, really with any loss function; it's not restricted. There's a difference between instruction fine-tuning and supervised fine-tuning: instruction fine-tuning is about getting a model to respond to queries formatted as instructions, while supervised fine-tuning is more about learning a specific task's capabilities. These get interchanged all the time, and that's okay, but it's good to know they're different. And then there are two more I need to touch on, and we could go on even longer.
One is reinforcement learning from human feedback. It's a multi-stage process and a specific tool for aligning ML models to human data, and really it's a class of tools: you learn some sort of preference model and then you extract information from it, and there are so many different ways to do that, so it's really an approach. And then there's a term I'm trying to grow, which is preference fine-tuning, which could encompass RLHF methods like PPO, but there's the question of how we differentiate something like direct preference optimization, which doesn't use an RL optimizer, from all of RLHF. I'll come back to this, but it's good to have some common ground to build on, because I might be going through some of these things pretty quickly.

This is a chapter that I cover in one slide because it's really tapping into a lot of different personal stories. It's hard to retell how crazy things were when ChatGPT dropped. People were not exactly losing their minds, but there was a lot of uncertainty about what the future held. It was clear that language models were important, but not much else was clear. There were a lot of articles titled something like "we're going to reproduce an open ChatGPT," and you can't really have an open model that does what a closed product does; there's a difference between model weights and the product that ChatGPT represents. But there was so much excitement that everyone was saying they were going to do these things and trying to figure out the right coalitions for actually doing so. This delay was a kind of land grab where people were learning the basic things: what is red teaming, what is the difference between a dialogue agent and a predictive language model, what tools should we use. Everything follows from here with what people were building. Personally, I just remember multiple meetings where people said, yeah, you should do it, you should go try to build open ChatGPT, and when you look back that goal is just so wild: so many people saying we need to build this thing in the open source, when it doesn't even make sense because you can't open source a whole system that way.

Things make a lot more sense once they start to get grounded in actual models. The first Llama suite was released, I think in February (I have the date in some notes somewhere), and then instruction-tuned models started to show up on this first Llama model. The first one to really crack the narrative was the Alpaca model, and it did a bunch of things that are still used today. It was trained on 52,000 self-instruct-style examples distilled from text-davinci-003. There's a lot in that sentence; I'll say what self-instruct means, but note that this wasn't even data generated from ChatGPT, it was generated from one of OpenAI's API models.

This is all about how to apply instruction fine-tuning, the thing I mentioned on the definition slide: really it's about making a model that will respond to specific styles of inputs. What often happens at a technical level is that the model is learning to integrate something called a chat template, including the ability to use system prompts. If you want the model to know it is an agent, or you want the model to know what day it is, you can do this in the system prompt, which is something the user doesn't see but which steers the behavior of the model.
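As a rough illustration of what a chat template and system prompt look like in practice, here is a minimal sketch using the chat-template support in Hugging Face tokenizers; the model name is just one example of a chat-tuned model, and the exact special tokens differ from model to model.

```python
from transformers import AutoTokenizer

# Any chat-tuned model's tokenizer ships with a chat template; this model
# name is only an example of one that has a template defined.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    # The system prompt steers behavior but is not shown to the end user.
    {"role": "system", "content": "You are a helpful assistant. Today is 2024-04-18."},
    {"role": "user", "content": "What is a Transformer?"},
]

# The template wraps each turn in the model's special tokens and appends
# the tokens that cue the assistant to start generating.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```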
Instruction tuning is really where we make the model capable of having these behaviors, but the question is: what data are we training this behavior on? The most common example is that you continue training with the autoregressive loss function on question-answer pairs. So it's something like "what is a Transformer?" and the language model predicts an answer; it could be from Stack Overflow, it could be something else, and in that example it's human data. But what made Alpaca and a lot of these early models (and even models today) really popular and accessible was using AI-generated data to answer the questions. This is where the idea of self-instruct data comes in. Self-Instruct was a paper from AI2 and UW in 2022, before ChatGPT, where essentially the idea is: how do we expand the distribution of instruction data, this training data for fine-tuning a language model, without getting more humans in the loop? What you do is start with some high-quality, often human-written prompts, and then, in what is now common practice but was very new at the time, you ask a stronger language model to create a list of prompts that are similar but still diverse. Once you have a list of prompts, you can use ChatGPT or another model to actually generate completions. What you end up with is a really big list of question-answer pairs, without having to go through the bottleneck of getting humans to sit down and write all of them.
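Here is a minimal sketch of that self-instruct-style loop. The `generate` helper is a hypothetical stand-in for a call to whatever stronger model you have API access to, and the prompt wording is illustrative rather than the exact recipe from the Self-Instruct or Alpaca papers.

```python
import json

def generate(prompt: str) -> str:
    # Placeholder for a call to a stronger model (e.g. a hosted API model).
    # Replace this stub with your own client; it returns canned text so the
    # sketch runs end to end.
    return "- Example generated instruction or response."

# 1) Start from a small pool of high-quality seed prompts (often human-written).
seed_prompts = [
    "Explain the difference between a list and a tuple in Python.",
    "Write a short email politely declining a meeting invitation.",
]

# 2) Ask the stronger model to expand the pool with similar-but-diverse prompts.
expansion_request = (
    "Here are some example instructions:\n"
    + "\n".join(f"- {p}" for p in seed_prompts)
    + "\nWrite 5 new instructions that are similar in style but cover different topics."
)
new_prompts = [line.lstrip("- ").strip()
               for line in generate(expansion_request).splitlines() if line.strip()]

# 3) Generate a completion for every prompt to build (instruction, response) pairs.
dataset = [{"instruction": p, "response": generate(p)}
           for p in seed_prompts + new_prompts]

with open("synthetic_instructions.jsonl", "w") as f:
    for row in dataset:
        f.write(json.dumps(row) + "\n")
```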
So why Alpaca worked was realizing this and taking in this better model from OpenAI. The figure on the right is from the Alpaca paper or blog post, one of the two: they took this model from OpenAI and asked it to generate more tasks, so they had 175 to start and ended up with over 50,000 tasks, and they also generated completions from this OpenAI model. Then they took these Meta weights that had just come out and instruction fine-tuned them, and you end up with Alpaca. This is a pattern we've seen many times since Alpaca: essentially, you generate some data from a stronger language model and you fine-tune on it. It sounds so obvious today, but this was the first model to actually release that recipe. I can now see questions coming in; I'll answer the ones that are clarifying, so thanks for asking them, and we can come back to more at the end.

Once Alpaca happened, it felt like there was a new model every week. The second model was Vicuna, and really what they changed was adding new sources of prompts to the distribution; you can see that I say ShareGPT. They also introduced the idea of LLM-as-a-judge, which is now obvious from a lot of their later evaluation work. But let's talk about why ShareGPT was so interesting. ShareGPT was one of the only datasets that got open language model builders, people like me, prompts similar to what people were actually asking ChatGPT. What was happening was you would install a browser plug-in and it would let you share your ChatGPT conversations on Twitter or wherever, making it easier to share your prompts and conversations before OpenAI made a tool to do this. There's now a legal gray area over the dataset, because most of these datasets are unlicensed and they were created without consent, or at least released without consent, so there's a legal question about whether people should be training on this data. But the fact of the matter is that ShareGPT was really important to this acceleration and progress on fine-tuning models, because the diversity of the data is just so much stronger than what people were going to get from the Alpaca self-instruct idea, and it set the bar much higher. It's only recently, in the last six months for some of them, that we're getting datasets that could replace it. I mentioned LMSYS-Chat-1M, which is a million conversations from Chatbot Arena that took a lot of work to clean of personal information, and a project from the Allen Institute for AI called WildChat, which is really similar to ShareGPT except the users gave consent at the start that their data was going to be collected and released in exchange for using a language model for free. So there's a lot of happenstance in this story, where something legally gray (the data is still on Hugging Face, but it looks kind of odd) helped enable the ecosystem, even though looking back we don't know if it should have happened.

Following Vicuna is one called Koala, also from Berkeley, and if you look at the time frames it's pretty obvious that a lot of these were developed concurrently and the release dates just happened to be slightly different. Koala is mostly known for having a different, diverse set of datasets: they used some from Alpaca, some from ShareGPT again, they also used Anthropic data that had been released, and they had some human evaluation from grad students. This added more data diversity; the evaluations weren't necessarily better, but it was an important model that a lot of people noticed, just from bringing back up these datasets that had been in the literature from years prior.

Something you might ask looking at these slides is: why weight differences? I have all these slides that say "weight diff to Llama 7B." Essentially, when Llama was released, it was research-only and distributed to researchers upon request, and the license prohibited people from uploading Llama 1 to Hugging Face. So it was this annoying phase where, in order to use a model on Hugging Face, you had to clone it and then run a script that applied a delta to convert it into the new model before you could actually use it. This was a really frustrating phase from a user perspective, because it gave experimentation one more barrier to entry. Thankfully it was changed with Llama 2, but it was really something that many people dealt with at the time.
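For a sense of what those conversion scripts were doing, here is a minimal sketch of applying a weight delta to a base checkpoint. The file names are hypothetical, and real release scripts also handled tokenizer files, sharded checkpoints, and license checks.

```python
import torch

# Hypothetical file names: a base checkpoint you obtained yourself and a
# released "delta" containing (fine-tuned weights - base weights).
base_state = torch.load("llama-7b-base.pt", map_location="cpu")
delta_state = torch.load("model-delta.pt", map_location="cpu")

# Adding the delta back onto the base recovers the fine-tuned weights
# without the fine-tuned weights ever being redistributed directly.
merged_state = {name: base_state[name] + delta_state[name] for name in delta_state}

torch.save(merged_state, "model-recovered.pt")
```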
We still see different license restrictions today on how Llama is used. With the Llama 3 release today, essentially, if I fine-tune a model for my research and release it at AI2, "Llama 3" needs to be in the name, so if I wanted to release a new Tulu model it would have to be something like Llama 3 Tulu 4 DPO. The namings are going to be crazy, but there have always been restrictions on using Llama weights and how you share them.

The final model that I group into this batch, this real first swing, was Dolly. Dolly was fine-tuned from a different base model: the Pythia models from EleutherAI, which were a suite of early scaling experiments that are still used extensively. But they added some human-written data to the loop, which is really important, because almost all of the projects I'll mention today rely on synthetic data or data derived from OpenAI; only a few of them actually added new human data to the loop, and this is what everyone remembers Dolly for. A lot of its performance limitations probably come from the base model, Pythia, which was trained at a time when the type of inference people now expect wasn't as popular and when scaling laws were thought of differently. You can see through these slides where we're going to start, with different model sizes and different MT-Bench scores. I'll talk about what MT-Bench is in a few slides (it's an evaluation tool), and this is really just to ground you in how the scores change over time. I have these throughout the talk as we go through different models, just to show how the scores continue to progress from one small change to another as the community gets better at these things.

While I'm talking about human data (remember, Dolly is all about human data), probably still the single busiest human coordination project for data generation was Open Assistant. I think it's easy now, if you get into fine-tuning, to see the Open Assistant dataset and not realize how important it was to the process of alignment during that whole summer, and it is still used today. There's a quote at the top of the slide, but essentially the leaders ran a community project to generate human-written prompts and human-written responses in many different languages, with ratings, so you could use it as preferences. The slide has a bug, but it has over 10,000 annotated trees and over a thousand volunteers. This is still used extensively today and will come up again in the talk. They also released models: one of the first models used on HuggingChat, whose launch date I don't remember, was an Open Assistant model. So Open Assistant was probably the first majorly successful project of the era, and the dataset is still used today. Where I will end this talk is saying that we need more things like this; really, one of the most important things is more human data in the open.

A quick aside that's a bit out of the flow of the talk: on April 28th of 2023 (there's a typo on the slide), StableVicuna was released from CarperAI, which now looks like the standard style of training models, except for the dataset, which is what's now popular. They got PPO to work, they had some solid human evaluations, and it was a good chat model; it wasn't out of distribution, but CarperAI was really ahead at the time, and then it seems like priorities shifted at Stability. It's important to know the extent to which there were still some players who knew how to do RLHF early on, even though they were pretty rare.

The last slide of this first chapter on instruction tuning is the idea of QLoRA, which unlocked a whole new group of players who could actually fine-tune models. For the quick 60-second overview: LoRA stands for low-rank adaptation, which is the idea that you can freeze most of the model weights and add new weights to specific layers that you then fine-tune as if you were fine-tuning the whole model. You use the same approach of instruction data with question answering, but it takes much less memory.
QLoRA was a technique that built upon this by adding very specific quantization and GPU tricks to make the memory requirements for fine-tuning models even lower. Tim Dettmers and team also released the Guanaco model with it, which was another big step up in the performance of these models. I have a few more slides on the method. You can see on the right the difference between full fine-tuning and LoRA: they look similar, but with LoRA you have fewer trainable parameters, which is what the smaller shapes mean, and in QLoRA they quantize the base model that you're propagating gradients through to save most of the memory. This is an approximation of fine-tuning different model sizes (7 billion, 13 billion, 30 billion parameters across the top) with full fine-tuning at different numbers of bits versus LoRA versus QLoRA. For reference, one A100 GPU has about 80 gigabytes of memory, and these are really hard GPUs to get; plenty of consumer GPUs will only have something like 24 to 32 gigabytes of memory, so you need these QLoRA techniques to actually be able to fine-tune models at the 7 or 13 billion parameter size. Guanaco did this and released 33 billion and 65 billion parameter Llama fine-tunes, which were clear steps up in the state of the art at the time. They also figured out ways to filter the Open Assistant dataset I mentioned, and this filtered version of Open Assistant is still the most popular today.
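As a rough sketch of what this looks like in practice with the Hugging Face stack (peft plus bitsandbytes), here is a minimal QLoRA-style setup; the base model name, the rank, and the target modules are illustrative choices, not the exact Guanaco recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base model in 4-bit to cut memory (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",          # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable low-rank adapters to the attention projections;
# only these new weights receive gradients.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the full model
```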
I'm going to pause and skim through the questions to see if there's anything on that section, and if not I'll save the relevant ones for later. Okay, I'm going to keep going; they're great questions and I appreciate them, but they're mostly not specific enough to be worth the digression.

This chapter 2 phase is where things seemed a little slower on the ground, but when we look back, a lot came out of this time: the DPO paper was in this era (everyone read it, but we didn't know what to do with it yet), and the new evaluations from this period are still really used. Transitioning in and setting the scene for not being sure whether things work: a lot of people were continuing to build on these LoRA and QLoRA methods. I remember a lot of excitement at Hugging Face, where we were setting up our RLHF pipeline so that we could do RLHF on 7 billion parameter models, maybe even on a consumer GPU. It was really cool to see the loss going down, and it's great to bring more people into the space, but weeks and weeks went by and we wondered why no one had picked up what we released in the blog post and trained a really good model with it. The consensus now is that these LoRA methods have some sort of weird limitation in how you use them or how the gradients flow that makes it much, much harder to get a really good model out. If you only have a certain number of GPUs such that LoRA is your only option, definitely use it, but for people with more GPUs, figuring out how to scale is normally a better solution than using something like LoRA just because it fits and is easier in the short term.

Another defining moment of this era was the Llama 2 backlash. I'm guessing some people remember this: the famous example was that people asked Llama how to kill a Python process and it would say no, and this started a whole bunch of new discussions around what alignment means, or what a model should or should not do. Here's an example from a paper for a safety evaluation test set called XSTest, and the question is just: should chat models be safe, or should they follow the instructions that I give? This is a fundamental question; it'll differ by organization and by individual, and this is the point where it became very serious and something people actually had to reckon with, because there were models that people were really disagreeing with over this specific take. I don't have any clear solution to it, but one of the things it led to is the idea of uncensored models, which is a really popular category on Hugging Face right now. The idea is you remove filtering: if we're using synthetic data and I ask a language model a question, say I ask ChatGPT how to make a bomb, it's going to say "I'm sorry, as a language model I shouldn't help with this," and the idea of uncensored models is to remove those data points from the fine-tuning dataset. I think there's a lot of confusion over the name, because language models at this stage were never really censored to begin with; it's really that the dataset and the method for creating these datasets needed more filtering, or needed some way of becoming unbiased. There are a lot of people now who only build models to try to make them unbiased against any sort of refusal (a refusal is when you ask a language model something and it says no), and this goes on today; it came out of this Llama 2 moment.

Otherwise, this was a transition period where there were a lot of good, solid models being trained, but either they didn't have a lot of documentation, they didn't have the right release team to splash as big as they should have, the methods were complicated to implement, or something like that. I could run through these, and I remember all these models coming out, but none of them are really household names like Alpaca is today. Like WizardLM: the team behind WizardLM created a method called Evol-Instruct, which is a synthetic data method, and all these things were clearly working for them based on the models they were generating, but for whatever reason the narrative wasn't actually changed. There were new datasets, like UltraLM from OpenBMB in China, which was releasing new datasets; more people training on ShareGPT; and the model called Xwin-LM, which was the first one in a similar ballpark and was also trained with RLHF, so not just that Carper model. But for whatever reason these didn't really splash, and that was the summer after Llama 2, where fine-tuning was chugging along but the narrative wasn't changing all that much, at least from my perspective.

What was happening in the background, while the models weren't seeming that different, is that new evaluation tools were coming out that ended up being the standard of today. You can see the dates: May 3rd, Chatbot Arena; June 8th, AlpacaEval; June 22nd, MT-Bench; sometime in early July, the Open LLM Leaderboard. All of these were created at about the same time, when there was a desperate need to get some sort of signal on what our fine-tuned models were doing in the open. We don't have the capability of paying humans to compare responses like they do at Anthropic, where they're always trying new models on humans; that's way too expensive. We need something that you can sit down with as an engineer and get feedback from in 10 to 15 minutes.
So I'll run through these in order. A lot of this is obvious, but it's important to take the perspective of: what can I use when I'm trying to align models, and what gives immediate feedback versus a long-term signal? Chatbot Arena is obviously fantastic; everyone looks at it today as something that is defining. It's defining corporate strategy and ranking the biggest language model players, like whether Claude 3 is better than GPT-4. But if I'm an engineer: (a) many small providers aren't going to get their models in, and (b) it takes a while (previously it used to take weeks to get your model's rating, now it takes days), and I need to know what my models are like before I decide to actually release them. That's the biggest thing: I know I need something beyond Chatbot Arena just for my engineering development, and this is where AlpacaEval and MT-Bench really thrive.

AlpacaEval (the slide formatting got changed, but I'll keep rolling through this) is the idea that you have a list of prompts, you compare your model's outputs to some other strong model, like OpenAI's text-davinci-003 or GPT-4, and then you ask a language model which is better. The dataset is compiled from all the popular datasets I've been talking about so far: datasets from Open Assistant, Vicuna, Koala, Anthropic, all these datasets people have been using. They took the test sets from those, and that's what AlpacaEval mirrors. It's a known quantity, but it has some limitations, because there are only so many prompts, and asking a language model to provide a rating is going to have some ceiling where we don't know how to compare two really good models. It has more samples than MT-Bench, so the error bars are a bit smaller, and it's easier to use because it's a single-turn generation, but we've heard about the length bias for a really long time, and it's not clear how to interpret the top results. This is an older screenshot of the leaderboard, but what does beating another model 95% of the time actually mean? Those are the questions we can't really answer in the short term. AlpacaEval 2 came out, which takes steps toward this by comparing against GPT-4 rather than text-davinci-003 (text-davinci-003 was an InstructGPT variant), but at the end of the day GPT-4 answers these Alpaca-style questions really well, so what does beating GPT-4 exactly mean? We need to get more specific in our evaluations, because I don't really know if I care much about a 20 or 30 percent score on AlpacaEval 2 when I don't know what it means. This is the opaqueness of all of our evaluations, and we'll see it time and time again: we don't know what an increase in score means. That update was pretty recent, and it's the next step after just being able to run the evaluation easily.

MT-Bench is pretty similar, except instead of comparing to one model, you ask a language model to provide a score directly. If I have a model I'm training, I generate completions for 80 diverse prompts and then ask GPT-4: hey, from 0 to 10, how good was each of those completions?
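Here is a minimal sketch of that LLM-as-a-judge pattern. The judge prompt, the `ask_judge` helper, and the example completions are all hypothetical stand-ins, not the actual MT-Bench or AlpacaEval prompts.

```python
import re

def ask_judge(prompt: str) -> str:
    # Placeholder for a call to a strong judge model (e.g. GPT-4 via an API).
    # Returns canned text so the sketch runs end to end.
    return "Rating: 7"

JUDGE_TEMPLATE = (
    "You are grading an AI assistant's answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 0 to 10 and reply in the form 'Rating: <score>'."
)

completions = [
    {"question": "What is a Transformer?", "answer": "A neural network architecture..."},
    {"question": "Write a haiku about GPUs.", "answer": "Fans hum through the night..."},
]

scores = []
for item in completions:
    reply = ask_judge(JUDGE_TEMPLATE.format(**item))
    match = re.search(r"Rating:\s*(\d+)", reply)   # parse the judge's numeric score
    if match:
        scores.append(int(match.group(1)))

print(sum(scores) / len(scores))   # average judge score across the prompt set
```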
This is good, but it runs into the same problem: what if our model is getting really good? Then it just becomes saturated; GPT-4 only gets to about nine, and there are only about 80 prompts in the evaluation set. And one of the nuance points is that there's actually variance: even if you set the temperature to zero, GPT-4 versions change, and your own generations from the model you're training can change. So these tools make things better, in that I can tell a model was really bad if MT-Bench and AlpacaEval give it really low scores, but we still have this unmet goal of a precise evaluation. In pre-training we have MMLU and HellaSwag and all these things people can look at and average over ten tasks, and if you get a 2% improvement on average you're doing great, but we don't have that clear indicator in alignment evaluation.

The Open LLM Leaderboard was the same. It came out of the team I was working on at Hugging Face, where we were just trying to evaluate more models to get more signal, because we needed to know what our competitors were doing and get some ballpark estimate, and it grew into this whole ecosystem-supporting discovery tool just because getting any signal is so useful. That's where we were starting with evaluation, essentially no signal, and why this leaderboard was so successful is that it gave everyone access to a little more signal on the models they cared about. But it didn't really solve any of the fundamental problems, and it didn't show us that doing RLHF on models would actually make the scores go up. It's starting to get better today, but that's a year on from the launch of the leaderboard. These problems persist: this section is about July of 2023, and if I were to go into work and talk to people about what we're going to do with our new models, these are the same questions we're still asking, which is why these evaluation tools are still so useful and why people still talk about AlpacaEval. But it shows how much of an opportunity there still is.

So, a summary of what I was talking about: how easy is it to use these evaluations? Chatbot Arena is everything; Andrej Karpathy tweets about it, it's great, and you can go there and use models, but I don't know how to make sense of it if I'm trying to sit down every day and write code. AlpacaEval and MT-Bench mostly solve this by being cheap and pretty accessible, but I really think there's a huge opportunity to come out with more. A colleague at AI2 launched WildBench, which is a good tool that fits in here; it's like a Chatbot Arena and AlpacaEval hybrid that you can use a little faster. How we're going to continue to push this along is a great question, and I would love to hear what people think.

We'll take another pause; I think we're getting good questions in the chat around RLHF and other things. "To what extent do aligned models actually reason about whether user intent is malicious, rather than performing topic detection to avoid unsafe topics?" This is a question I wanted to read because it gets at the model-versus-system topic. When ChatGPT was released, on day one it had an output filter that does moderation: the language model that is instruction-tuned or RLHF-tuned generates a bunch of text, and then a separate model says yes or no, and that's where it actually does detection.
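A minimal sketch of that model-versus-system split might look like the following; both the generator and the safety classifier here are hypothetical placeholders for whatever chat model and moderation model (for example, something in the style of Llama Guard) a system actually uses.

```python
def generate_reply(prompt: str) -> str:
    # Placeholder for the instruction-tuned / RLHF-tuned chat model.
    return "Here is a draft reply..."

def classify_safety(text: str) -> str:
    # Placeholder for a separate moderation classifier that labels text as
    # "safe" or with an unsafe-topic category.
    return "safe"

def respond(user_prompt: str) -> str:
    # The chat model does the generation; it does no explicit safety
    # reasoning at this step.
    draft = generate_reply(user_prompt)

    # The system-level filter decides whether the draft ever reaches the user.
    verdict = classify_safety(draft)
    if verdict != "safe":
        return "Sorry, I can't help with that."
    return draft

print(respond("How do I kill a Python process?"))
```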
With the release of Llama 3 there's another model called Llama Guard, which is a classifier that takes the text, does the moderation, and says which type of unsafe topic it is. The actual model that is generating does no reasoning over what counts as an unsafe topic. I'll come back to the other questions; I'm going to do some discussion of RLHF now, and this will give grounds for continuing some of these discussions on things like ORPO or REINFORCE. I don't cover all of them in the lecture, but I lead into why we would talk about them.

This chapter is when I started to get validation as an RL researcher that leaving robotics to go work on language models was actually a good idea. For a lot of this period there was uncertainty over whether people in the open ecosystem were even going to be able to use RLHF at all, or whether being a quote-unquote RLHF researcher meant I was going to do instruction fine-tuning and never think about RL again. That turned out to be wrong. I'm going to review some RL fundamentals just to make sure we're talking the same language, and this will lead into direct preference optimization.

There's a reason I'm doing math; I know this is not a normal lecture. Here is the equation you'll see in RLHF papers; it's what we're optimizing when we're optimizing with RLHF (it's written out below). It looks nebulous, so I'll break it down. On the left side we're maximizing, with respect to some policy pi, a reward that is parameterized by a network phi, and we have a penalty, this KL term, which is the distance from our policy to some reference model. We want to increase reward, but we want to constrain the model so the optimization doesn't go too far. The primary questions when doing this are: how do we implement a good reward function, and how do we optimize against that reward? This is a really RL-centric way of framing it: if you give me a reward function, I can optimize it. The classic RL setup is that I'm in an environment and the environment has the reward function built in; in RLHF we're designing our own reward function, and this adds a lot of weirdness to the optimization we're doing. To get this reward function, we learn what is called a preference or reward model, and the most popular way to do this is to take a language model that predicts the separation between two preferences. This is called a Bradley-Terry model, which goes back to economics, but the key idea is that the reward is proportional to the probability that the text in question would be chosen over any other arbitrary text. It sounds really theoretical, but it outputs a scalar, which is now a reward function, and it is trained on pairwise data.

Then the idea is: with this equation, instead of trying to learn a preference model and learn this reward, what if we just use gradient ascent on the objective directly? That is really what direct preference optimization is doing; there's a bunch of math to work out what the reward becomes, but that's the idea. This was released back in May, so by the time this chapter starts, in late September or October, we had already moved months ahead. Back in May, when we were still talking about Open Assistant, the DPO paper came out. It's a fantastic paper if you haven't read it; it's a great way to learn about the math behind language models, and it's worth reading.
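For reference, here is the standard form of those two pieces as they usually appear in RLHF papers. The notation (policy $\pi_\theta$, reference $\pi_{\text{ref}}$, reward model $r_\phi$, chosen and rejected completions $y_w$, $y_l$) follows the common convention rather than the exact symbols on the slide.

```latex
% RLHF objective: maximize reward while staying close to the reference model.
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ r_\phi(x, y) \big]
  \;-\; \beta \, \mathbb{D}_{\mathrm{KL}}
  \big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

% Bradley-Terry reward-model loss on pairwise preference data
% (y_w preferred over y_l for prompt x):
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```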
The core idea is: why are we spending all this time learning a reward model when we can just use gradient ascent and solve for the loss function directly? Some key things to think about with DPO: it is extremely simple to implement. On the right side is the example code from the DPO paper; as long as you have access to the log probabilities from a model, which is a very core thing in training language models, you can compute the DPO loss (a minimal sketch of it follows below). Because the loss function sits at a nice abstraction, it scales nicely with existing libraries. What it's actually doing is training an implicit reward function: the reward is a function of the log probs. I don't have the equation here because it quickly becomes a rabbit hole.
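Here is a minimal sketch of that loss, closely following the pseudocode in the DPO paper. The function name and the toy log-probabilities are illustrative; a real trainer would compute the summed per-token log-probs of each chosen and rejected completion from the policy being trained and from a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss: all inputs are summed log-probs of whole completions,
    shape (batch,), from the policy being trained and the frozen reference."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    # -log(sigmoid(beta * logits)): push the policy to prefer the chosen
    # response more than the reference does; beta controls the strength.
    losses = -F.logsigmoid(beta * logits)
    # The implicit rewards, useful for logging.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps).detach()
    return losses.mean(), chosen_rewards, rejected_rewards

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss, chosen_r, rejected_r = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-14.0, -10.0]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8]),
    ref_rejected_logps=torch.tensor([-13.5, -10.1]),
)
print(loss)
```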
But whatever the whole DPO versus PPO debate means (or ORPO, I don't remember what the paper title stands for), we're going to see a lot of these methods because DPO is simple and it scales well. That doesn't necessarily mean its fundamental limits are higher, but sometimes it doesn't matter whether the limits are higher if it's easier to make progress on something; it feels better when progress is being made. So that's a core thing: we'll keep seeing these models, and we are. And there's this whole debate that has gone on, crushing a bunch of these questions by redirecting them in a very political manner: should we use REINFORCE, what about PPO, what about other things? They're very different styles of optimization. On one side we use RL update rules, which is ultimately about learning a value function and then taking gradient steps with respect to that value function; in DPO we're taking gradient steps directly through the probabilities of the language model. They're very different optimization regimes, and there's a great meme about it (there was a month where all of NLP Twitter was just arguing about this), but both of them continue to progress, and that is good; it will not just be one or the other.

What really made this debate kick into gear was the release of the Zephyr Beta model from Hugging Face. It was after I left Hugging Face, but it was the team I was on, and it was the first model to make a splash with DPO and a big step up in how models were perceived. This model was added to things like a search engine; people were using it in all sorts of ways, and it just felt really good to use. It was building on a better base model (Mistral had come out) and a new dataset, this UltraFeedback dataset, which is still one of the core datasets used today when we're practicing alignment. This was back in September or October. One of the core things to getting DPO to work was using really low learning rates, like 5e-7. There are memes about 3e-4 being the only learning rate you need for deep learning, and changing it being kind of a joke; DPO is the case where that is not even remotely true. And then you can see the MT-Bench scores continuing to rise. So this is a validation proof that DPO works, and it came four months after the paper was released. That delay is something nobody expected; we were losing hope in DPO at many points, and now look at where it is.

When I joined AI2 they were already working on this project, and I just helped get it across the line (it's the classic advisor thing where sometimes it's just easier). It's the first model to scale DPO to 70 billion parameters. The last open question was: okay, DPO works on small models, but will it ever work on a big model? The answer is yes. It's built on the same recipe as Zephyr, with slightly different instruction-tuning datasets, and the scores continued to climb. This model was so close to beating GPT-3.5 on Chatbot Arena (it was a couple Elo points below), so we didn't get the title of being the first to do that, but open models were starting to get that chatty behavior that had eluded them for so long, because we hadn't figured out scale and we hadn't figured out these datasets. So it was great progress, very important, and a major transition in my career: now it's like, okay, RLHF methods really can work.

And I was not the only one touching things here; a couple of other projects are really important. NVIDIA had SteerLM, which collected feedback data with attributes on it, like how helpful the message was and how concise it was, and they did fine-tuning with that and released very solid models; they also showed that PPO was better than DPO, which is interesting. Then Berkeley came out with Starling-LM Alpha, where they had a new preference dataset, Nectar, which is still looked at today, and they also used a PPO-style method after training a reward model. Both of these came out at about the same time, and they both said, huh, DPO isn't doing as well for us, and the models are really good. Recently the second Starling model came out; its reward model is very strong in my testing, and it's a 7B model that's almost reaching ChatGPT levels on Chatbot Arena. It's crazy how fast these models are going, but we still get a lot of models trained with both PPO and DPO; it's really not one or the other at this point.

Okay, I think this is a reasonable time for me to take a couple of these questions; I might come back to them in later slides. Someone asked whether there is a particular alignment method that I use. This is teasing a paper, but there was a recent paper (I don't remember the group, I can find it later) that did what they called a systematic study of PPO and DPO, and they showed that PPO is better. I will say that in the experiments I'm seeing at Allen AI, I'm also seeing PPO to be stronger, and we hope to release this stuff soon. It's not that one crushes the other; it's that for some reason PPO is just getting a bit more performance. And then the logical question is: why not REINFORCE, which is another one of these questions. I would love to try it; it's just that we have the code that we have, and we don't want to touch things that are working well, and there are just so few people working in this space. Let's get more people working on these things, because there are so few people who can answer all these questions. There's another question that says some people say REINFORCE can work as well as, if not better than, PPO. It probably can; it comes down to your infrastructure, carefully fine-tuning it, what people are excited about, and a lot of luck. We'll see these continue to play out throughout the year, but it's complicated. I'll come back to the Llama 3 question; I have one slide for that in a little bit.

But really, this modern ecosystem is about how investment in releasing open models that people can use is continuing to grow into 2024. I think there's always been this tenuous period where there are only a few people releasing these aligned models.
There are these important people in the ecosystem who are just doing this because they want to, for fun (they might have a day job), and you wonder how long that can go on and what the limitations are. But in the past year we've really seen more companies come into the space. Someone drew Meta's Llama 3 on the screen; I was talking to co-workers and they said, yeah, you're going to need to keep adding models, you're never going to be able to give this lecture as it stands. It's a losing battle, I know. But the modern ecosystem just has way more types of models, so I get away with not having Llama 3 on this specific slide because I'm talking about the diversity of players and models, not just the fact that there are more great models.

There are interesting models like Genstruct from Nous Research in the last few months, which is a model specifically fine-tuned for rephrasing any text into instructions: if you have a book and you want a model to be able to answer questions about it, why not just throw it at this rephrasing model? The teams I work on at AI2 are trying to release instruction models where every single thing we've done to train them is documented and reproducible, from the data to what compute was used. These models are getting new features in these little ways, beyond just being the quote-unquote best open model. Then there are corporate entities going for really standing out in the open: there's Databricks' DBRX model and Cohere's Command R+ model. I think people were mostly blindsided by Cohere releasing model weights, but it was the first open model to pass GPT-4 on Chatbot Arena, and that has been a long time coming; beating GPT-4 on human evaluation is not easy. Yes, the open side is still about a year behind, but that's fine; as long as we have a functioning ecosystem it'll continue to grow. Then there are other things, like interesting research models (one recent model does data weighting), and we're finally starting to get multilingual models with Aya, which is also from Cohere. People are getting more mixture-of-experts models to train on, which are a bit more efficient on the pre-training side, and state space models are really taking off; they had their moment in December with Mamba and it's continuing in 2024. So there's just a lot going on, and this makes me feel good, because it means I just have to keep doing what I'm doing and encouraging people to participate, and we're going to keep being able to do this fun thing of figuring out how to make models and share them with people.

This is my slide for Llama 3. The reason I didn't make a lot of slides about it today is that Llama 3's release is more about scaling and the ecosystem as a whole than it is about alignment. The Llama 2 paper was extremely detailed about alignment, and we're going to get a Llama 3 paper soon, if you believe multiple sources at Meta, which I choose to. When the Llama 3 paper comes out is when we will learn all the interesting alignment things they have done. That being said, they are very unlikely to release the human preference data they used; I have yet to succeed in getting them to release a reward model for Llama 2 or Llama 3 from their alignment work. So we have more work to do on getting Meta to support this open alignment ecosystem to the same extent that they're supporting the pre-training ecosystem.
And this scaling story very much connects to the previous slide, where scaling and solving this is largely determined by markets and capital incentives. But so long as scaling continues to happen in the open ecosystem, it means more players are going to stick around, and in some ways it feeds back into itself. Llama 3 is rumored to include (they're training) a 400 billion parameter model; we're not 100% sure the weights will be released, but it seems like that's Mark Zuckerberg's intent, and having that, which is about GPT-4 quality, really changes what you can do to get language models running in your products. So Llama 3, and how many people are playing in the open space right now, goes to show that we have more of the same coming: interesting models on a weekly basis. Most people are accustomed to it now; people don't freak out when there's a new model (well, they like Mistral's releases because there's a magnet link and it's funny), but we're used to it, and I expect that to be the case for the next year or two, with this pace just being how it is. It's really fun to follow, and I just think it's not a time to be worried about being scooped, but to keep figuring out where you can contribute, whether it's on evaluation or some of these other alignment methods people have talked about.

I have a quick section on current directions, where I'll come back to some of the data points I've mentioned multiple times, and then we can get to questions. The thing people most want to know is: are open models going to catch up to closed models? My answer is probably not ever completely; there will be some friction in the system, a time delay by which open models trail closed ones. And open model weights are not inherently unsafe; the open versus closed debate has mostly converged around this. But given the territory we're heading into with AI, where we're uncovering capabilities we've never seen, I think it's okay if there's a few months' wait before you have open weights you can run on your laptop while we're discovering what AI can do. If you look at Maxime's plot with trend lines, it shows that open models are getting closer, but we're not really sure whether open models will stay close on Chatbot Arena in the long term. There will always be an open and a closed category, because there is demand for models that are tuned to exactly what you want them to do.

This leads into my current directions. Data is the biggest limitation to alignment: we have something like two or three datasets that are driving all the research in open alignment. The Anthropic HH dataset, which my friend Deep got uploaded back in 2022 I think, UltraFeedback from OpenBMB, and Nectar from Berkeley Nexusflow with the Starling models are what most people are focusing on. We need more, particularly human-written data, to add more diversity and robustness to our models. DPO is continuing in an academic sense; there is a comedy of papers extending DPO: ORPO (odds ratio preference optimization), which doesn't need a reference model; constrained DPO; identity preference optimization; I don't remember what BCO stands for; KTO, Kahneman-Tversky optimization, from Contextual and Stanford; DNO; sDPO, which is like sequential DPO; and self-rewarding methods. There are so many, and that's good, and that trend will continue.
At the same time, we're seeing more model sizes. Most alignment happened at the 7B or 13B scale, and I think there's a large drive to make smaller models aligned; Google is releasing one-billion-parameter-scale models. It's an opportunity where there aren't that many people playing in the space, but it's something a lot of people want, because to run these models locally, making them smaller makes it way easier. And then, running back to two themes throughout this lecture: what are the specific evaluations we should be building, and how do we personalize these models? They go hand in hand. These are the things I'm thinking about, and I welcome feedback on them.

I've identified some people I'm following to see where new models come out. I try to release models at AI2; Hugging Face quickly turns around new aligned models under the Zephyr brand; the Berkeley Nexusflow people are building datasets and the Starling models; Nous Research started as just Teknium fine-tuning models and is now a company for fine-tuning models; OpenBMB in China has been doing a lot of preference datasets, and they recently released some datasets called UltraInteract, which is math preference data for doing RLHF and fine-tuning; Argilla is a startup building tools to annotate data, focused on preference data; and there are individuals driving this narrative too, like Maxime and Jon. There are just a lot of people.

Model merging is something I didn't talk about, but it's kind of like DPO taken even farther: model merging is so accessible that you don't need a GPU to merge models, it's a for loop, so people are going to try it and there's going to be iteration on it. In this alignment space, never bet against people being able to just try things, see what's better, and eventually learn. That's what model merging is, and it's going to be here to stay.
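To show how little machinery merging needs, here is a minimal sketch of linear weight averaging between two checkpoints with identical architectures. The interpolation weight and file names are illustrative, and real merge tooling layers per-layer and task-vector tricks on top of this.

```python
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible state dicts: alpha*A + (1-alpha)*B."""
    merged = {}
    for name, weight_a in sd_a.items():
        weight_b = sd_b[name]  # assumes identical architectures and parameter names
        merged[name] = alpha * weight_a + (1.0 - alpha) * weight_b
    return merged

# Usage sketch (hypothetical file names):
# state = merge_state_dicts(torch.load("model_a.pt"), torch.load("model_b.pt"))
# model.load_state_dict(state)
```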
So thanks for listening. I'm happy to take questions, and thanks to my many teammates at Hugging Face and AI2 who make it look like I did so many of these things; there are a lot of great contributors underlying this. I'll slow down, drink some water, and answer some questions, but thanks for coming again.

The top question by score (please rate them, because it's easy for me to see them that way) was about odds ratio preference optimization. I think being agnostic to the method is the best thing, but you probably need to be good at engineering and get really good at one method to get a specific model out, and getting those deliverables out is important to getting recognition. I don't know if people can talk via microphone, which would be a much more natural experience, but I'm just going to keep talking to myself.

There's a question around the future of alignment given that simple methods can circumvent fine-tuning. I think the future of alignment is that safety is not the only thing that matters. There's a lot of promise showing that alignment helps with how much people like the model, how much RLHF improves the user experience, and how much it improves code and math abilities. So while everyone hates the Q* stuff, Q* has some things to guide towards, which are using synthetic data, RL, and search to improve raw capabilities rather than just talking about safety. Okay, onwards.

People are asking about the fact that Llama 3 said they use instruction fine-tuning, rejection sampling, DPO, and PPO for their aligned models. I don't know exactly how they're using all of these things, but I think they're shifting the abilities incrementally to provide a nice initialization for the next method and to keep being able to use new human data and make the metrics go up. I think over time that will become simpler; in the future Meta will not have this convoluted five-stage, multi-method process, and we'll figure out a way to distill it to one algorithm.

The pitfalls of synthetic data are repetitiveness and non-robust distributions. Most of the synthetic datasets out there have very similar things in them, which means the models are going to generalize less well and probably get smaller boosts from alignment training if there isn't this kind of general improvement to capabilities.

We want to take some in-person questions. Oh yeah, that's much better. Does anyone have in-person questions to ask Nathan?

Hi, thank you so much for the talk. What do you think are the greatest hotspots of research or work in terms of personalized language models, and where do you see them having the most impact? This is one of the things I'm excited about with the local LLM community. I'm not particularly ideologically aligned with the effective accelerationist stuff, but I do think they have a lot of drive to create a language model that they like to use, so there are going to be things we learn from them, and that's a classic question of how to integrate multiple communities. Academics aren't used to looking there, but I'm sure there's a lot to learn.

I guess there were multiple questions about advice for the field, whether it's grad school or otherwise. I'll give my advice with the caveat that you should be very wary of listening to people's advice, because it's based on their situation. But I think the most important thing you can do when the field is crazy is keep trying to develop skills and keep trying to build something you think matters, because at the end of the day that's what you're making progress on, and you'll never be able to keep track of everything, and that's okay. I can't keep track of everything, and I'm still trying to train models and build datasets. School is about learning to do research, and that still has value, but industry is also fun, and so is doing a startup; there isn't one answer, so think about what you want to do.

Someone sent me a question through Zoom: you indicated that making LoRA methods work with reinforcement learning is tricky; do you think LoRA methods work well with DPO or its variants? I haven't seen it be particularly successful, and my general rule of thumb is that I wait to go deep into a method until there's been a model release in the relevant ballpark with that method. So the fact that LoRA has been around for so long and that hasn't happened could be a blind spot, but I think there's some weirdness preventing it from happening.

Okay, another one: you mention GPT-4 being used as an evaluation metric, but it causes data contamination; what are some ways to mitigate this? Oh man, yeah. This is why it's nice to have human evaluation, but I don't know if I have an answer at this point; I'm kind of fried from reading Llama 3 stuff and giving this lecture. That's the fundamental problem, though: how to disambiguate the various biases in evaluation and still get the signal out of them.
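One blunt mitigation people do reach for is checking n-gram overlap between evaluation prompts and training or synthetic data, so the most obvious contamination and repetitiveness gets flagged before it skews scores. A rough sketch of that idea; the n-gram size and threshold are arbitrary choices, not any lab's actual pipeline.

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """Lowercased word n-grams of a string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(candidate: str, corpus_ngrams: set[str], n: int = 8) -> float:
    """Fraction of the candidate's n-grams that also appear in the corpus."""
    cand = ngrams(candidate, n)
    return len(cand & corpus_ngrams) / len(cand) if cand else 0.0

# Usage sketch: build corpus_ngrams once from your training or synthetic data,
# then drop evaluation examples (or near-duplicate synthetic samples) whose
# overlap_fraction exceeds some threshold, e.g. 0.1.
```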
Right. Okay, one more, give me a second: for stuff like Llama 3 training on so many tokens, like 15 trillion, would that actually make it harder to align the model without losing some of the capabilities learned from this overtraining? It's not technically overtrained, but every model will have a different point at which it's released, so you'll need different learning rates, batch sizes, and datasets for different models; you'll need a different way of continuing the training. That's a common confusion. I don't even have much of an intuition for it, other than knowing that I've bought this idea in the past and been proven wrong about it. It's not that the model is overtrained or harder to fine-tune; it's just that there's more information in the model, and as you continue training, the model can keep learning, it just takes more and more data to get marginal improvements. So Meta is willing to invest more money into the model to make it just a bit better, but that should only help, it shouldn't hurt. Right.

Great, here's another one: do you think synthetic data generation like Cosmopedia is the way to go for making controlled or trusted domain-specific models? I think it'll be very good. I also think it's a good way to get around the fact that, for example, Google is paying Reddit $60 million a year to use their data, so we can no longer train on the newest Reddit data. Cosmopedia-style synthetic datasets at a large scale can be a way around this, and there are rumors that industry is doing something similar.

Give me a second, I think there's one I missed: could you please share some insights on why you are finding PPO better than DPO? It's mostly that it ends up extracting more from the data, so the benchmarks end up a little bit better if we get it set up correctly from the same starting point. You choose a set of evaluations you care about and look at them through fine-tuning. It's primarily a group of great grad students doing this, running a ton of models and trainings, and they're seeing that PPO can reliably do a little bit better. These are the fine margins that a lot of AI works on nowadays.

Do you foresee a better evaluation method being determined by a stronger or more specialized model, which would mean world-based evaluations are dead forever? Maybe. I try not to say no to things; this is getting philosophical, but I'm trying not to say no to things in the language model space with how fast things are progressing. I should try not to bet against progress continuing. This goes for pre-training and alignment, and at multiple stages in the last few months that assumption has come to benefit me, so if you just assume that things will get better and will work, it makes it a little bit easier to wrap your head around things.

One last one here, give me a sec: at its core, an LLM is trying to approximate a complex distribution; would you say that alignment is the process of squashing specific parts of this distribution according to what humans prefer? Yeah, I think that's phrased generally enough that I could get behind it. Alignment is about changing the distribution, and it can act over multiple tokens, like a multi-turn prediction; RL is not just autoregressive, it can be these kinds of multi-turn things that are getting shifted around, and it's a really different loss function.
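To make the "really different loss function" point concrete, compare the standard objectives side by side: pre-training maximizes next-token likelihood, while RLHF-style training optimizes a learned reward over whole generations with a KL penalty pulling the policy back toward a reference model. These are the usual textbook forms, not any particular lab's exact recipe.

```latex
% Autoregressive pre-training / fine-tuning: next-token maximum likelihood
\mathcal{L}_{\mathrm{MLE}}(\theta) = -\sum_{t} \log \pi_\theta\!\left(x_t \mid x_{<t}\right)

% RLHF: maximize reward over whole generations, with a KL leash to a reference policy
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\bigl[ r_\phi(x, y) \bigr]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr]
```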
Here's one: how do you envision the usage of watermarking for both open and closed language models? A lot of the time it feels like a losing battle. I think a practical solution in the future is that if you want to prove something is human-made, you prove it was generated by a human with a certain tool, rather than trying to determine whether a specific piece of content was made by an AI. So the assumption will be that all content was made by an AI unless proven to be human, which is not what I would consider a sociologically good answer; it just seems like a practical one. Makes sense. I think we have a few more minutes, so if anybody has any last-minute questions, feel free to send them over to me on the Zoom chat. Yeah, that was much better than me half-reading the questions.

All right, here's one: what are your thoughts on different optimization functions for training large language models rather than using MLE, and what could be good research directions there? I think this is the whole idea of what RLHF represents. If you ask people who have been in NLP longer, one of the most compelling arguments for RLHF, for me, is that you now have extreme flexibility on the loss function, whereas we were kind of limited in what autoregressive losses could do. There are arguments like: why is there any limit, if we could just keep doing more and more tokens of RL training? It's a really general framing, but RL's loss function makes it so the training of a language model can incorporate many different things, and that's very exciting; that could be the 10-year goal of RLHF.

To what extent is training on adversarial data effective for defending against Crescendo and other simple multi-turn attacks? I haven't spent as much time on safety as I would want to, but I think it'll be this everlasting dance where, if you have example data, you can defend against it, but it will not be impossible to generate new attacks. So it mostly comes down to the use case you're looking at protecting: if you want to protect something really important, you need layers on it that are not just sensitive to a new prompting technique but actually limit what the model can do. It's a use-focused thing; the whole security picture is very complicated otherwise.

Here's one on quantization: do you see potential in quantization methods such as BitNet, like 1.58-bit, and if so, do you think BitNet will become popular? I have no idea. I wouldn't rule it out; this is what I mean, it's like, okay, sounds cool, wouldn't rule it out.

Do you think there is a need or a way to control large-scale data extraction from large language models, like Cosmopedia? I do think there's a lot of will and a lot of ways to explore making synthetic data better. I think it's very early; I have a project going on it, and it's one of the few ways to generate more tokens. People are actually running out of tokens, especially if you try not to train on things you're not supposed to train on, so you can just generate more data, and as we've seen with Llama, if you have the compute, more data will help you.

Let's see, self-play-like things: any chance you can expand upon or share your opinions on self-play-like approaches, like OpenAI's superalignment work?
I think people will keep using language models in the loop training other language models, but it's a broad field that doesn't have full agreement on how to do it. Okay, I think we're pretty much out of time, so if folks want to get in touch or have more questions, can they email you? Yeah, that's great. Thanks so much again for taking the time and for giving us such a great talk, so yeah, give it up for Nathan. I think the slides as well as the Hugging Face collection are all posted on our website as well as on Discord, in case anybody wants to follow along. Sounds good, thanks a lot for having me. No worries, see everyone soon. Bye bye.