[Pre-lecture A/V setup]
"Should I save this? Okay, this one. Makes sense." "I guess you need to start with the slides; I think we need to set that up first. Where are the display options?" "Can you just share the slides, not the whole desktop? Afterwards you can help him switch to his slides." "It's still showing everything; you can minimize this, but I don't know why it's not full screen." "You should be able to share just what you want to share." "I'm just sharing Google Chrome, but that's for this one."
"Do the speakers need to use this mic?" "I think the speakers might, yes. I forgot to ask the previous time, but it would be nice to have it up here too."
"I think I need to go into presentation mode." "Is it full screen already?" "I think it's full screen already, although I don't know why it looks like that over there." "You're right, by the way, that's the general inclusion-exclusion." "I think it's slightly delayed, by about a second." "But this is not what's on your screen." "Oh right, this is the YouTube live stream." "I think it's delayed, though."
"Hey, are you still working on this?" "Finished." "Oh, you're finished, nice. I'll switch the slide and text you. Thanks."
"It's probably not the delay, it's too long. Oh, I see, I think I know the issue: you are sharing the wrong screen." "Okay, I think it's probably okay now, I started sharing the right screen." "Oh no, now it's not the right one." "I think it's still a little bit delayed." "Sorry."
"Oh, are you Ash? Nice to meet you." "Nice to meet you. Okay, great, thank you."
"Can you keep admitting the students on Zoom? Can you tell who to admit?" "You both should. There's a chair here, you have that one; you admit students, and keep doing it, because more students keep joining." "You should be made co-host first." "I guess I can do that. Okay, done." "We have one minute." "Just keep it as it is for now."
"I'm going to make Hung a co-host. It's not letting me search." "I'm not sure which account you're on. Oh, here you are. Are you co-host now?" "Let me double check. Okay, yes."
"Also, for the YouTube live stream, you don't need to do anything, but you can get the link from the TA and send the YouTube link out." "What about the top bar, the full screen button?"
[From the online chat] "Okay, I thought that was just me, but I can't hear anything either." "Hey, I'm in this class, they'll let you in if you want." "They're just trying to set up the presentation." "Okay, you're in, all good. They're still just working out the tech." "Are you actually enrolled?"
"Oh, it's still going to show the Zoom controls; I don't know if we can get rid of them." "I think it just takes some time to go away." "I know, but it doesn't disappear." "Oh, Alex, how do you make this bigger? Can you make this bigger and move it to the right a little?" "They're just trying to figure out the Zoom window; it's fine." "Oh, that works. Okay, so I'll get started, and then you'll help us." "Denny, you can sit there first; we have two chairs, and I will be over there."

Hi everyone, thanks for being here. Welcome to the beginning of the new semester, and welcome to our class on large language model agents. Thank you, everyone, for coming. We know that there are still students on the waitlist; we do plan to expand the class capacity, so please be patient, and we hope to get the students who are on the waitlist into the class. About this class: at Berkeley, on campus, we have about 400 students enrolled, and at the same time the class is also being offered online, to enable other students globally to enjoy the class as well. Even though this was only announced three days ago, there are already close to 5,000 students who have joined online. We hope that everyone together will have a great learning journey and a great semester.

Okay, so I will first start with some introduction, and then we'll get the actual content of this class started. First, my name is Dawn Song. I'm a professor in computer science here at UC Berkeley and also a co-director of a campus-wide center called the Center on Responsible Decentralized Intelligence. I'm the instructor for this class, and we also have a guest co-instructor, Xinyun from Google, who is also an alum, actually my former student here; we are teaching this class together. We also have our great TAs, Alex and Sehoon, and our great readers, Tara and Ashwin. This is the teaching staff who will be working together with you this semester.

Great. So, everyone here has been seeing the exciting growth of large language models; the speed of advancement is just astonishing. However, these large language models operate in a fairly simple manner: they take text as input and produce text as output. What we will cover this semester in this class is the next frontier: large language model agents. Instead of just taking text as input and producing text as output, here we use a large language model as the key engine for reasoning and planning for the agent, and we enable the agent to interact with external environments, observe those environments, and take actions in them. The agents will use external tools, as well as external databases and knowledge bases for retrieval, to help them perform these tasks. The rich capabilities of these large models make LLM agents very flexible: they can easily operate in diverse environments without much task-specific training. These agents can interact with different types of environments, including, for example, browsing the web through different online APIs, and they can even be embodied in a robot operating in the physical world.
They can sense the environment through different types of inputs, even in multimodal settings with various sensory inputs, and take actions in these diverse environments. Through this interaction with complex and diverse environments, they can update their memory, they can learn to use tools, they can interact with humans, and they obtain grounding through these interactions as well. And these agents don't only interact with environments; they can interact with other agents through multi-agent interaction and collaboration, including with humans, and this multi-agent collaboration can help agents solve even more complex tasks together.

So why are LLM agents the next frontier? Why do we need to empower LLMs with an agent framework? For a number of reasons. Solving real-world tasks is never just a single pass of taking text input and producing text output; it often involves a trial-and-error process. Leveraging external tools and retrieval from external knowledge can expand an LLM's capabilities. More importantly, this dynamic agentic workflow can facilitate solving complex tasks by enabling task decomposition, allocation of subtasks to specialized modules, and division of labor for project collaboration. Throughout the course we will also see that multi-agent generation can help inspire better responses.

Even though LLM agents are a fairly recent development, we have already seen them helping transform a wide range of application domains, including education, law, finance, healthcare, cybersecurity, you name it. The development is really exciting and is improving fast; there are many leaderboards for different agent benchmarks that you can find online, and you can see really rapid improvements across all these different agent frameworks.

Overall, to better enable agent deployment, there are a number of key challenges that we still need to address. First, we need to improve the reasoning and planning capabilities of agents: agents tend to make mistakes when performing complex tasks end to end, so it's important to improve their reasoning and planning. We also need to improve embodiment and learning from environment feedback: LLM agents are still not efficient at recovering from mistakes in long-horizon tasks, and we need to further develop methods for continuous learning and self-improvement, and to improve the multimodal understanding, grounding, and world-model capabilities of these agents. As I mentioned, multi-agent collaboration can really help agents provide better solutions, and developing theory of mind helps multi-agent systems work better as well. Safety and privacy are also very important: LLMs are susceptible to adversarial attacks and can emit harmful messages or leak private data, so solving these challenges is essential for deploying LLM agents safely in the real world. Finally, there is human-agent interaction and ethics: how to effectively control LLM agent behaviors and design interaction modes between humans and agents, to best enable agents to serve human needs, is also really important.

To help students learn and develop better methods to address these challenges, the course has been designed to cover a broad spectrum of topics across the different layers of the agent framework and across application domains.
First, in the class we will cover key model capabilities, including reasoning, planning, and multimodal understanding. We will also cover popular real-world agent frameworks, to enable students to learn how to design agent applications and use various agentic flows easily; this will help students learn to use agent frameworks for workflow design, to use retrieval-augmented generation (RAG), and to build multi-agent systems. We will also cover a number of exciting application domains for LLM agents, including software and code development, workflow automation, multimodal applications, and enterprise applications. Finally, we will cover important topics on LLM agent safety and ethics.

To cover this wide range of topics, we have assembled an amazing team of guest speakers and researchers. The class will be led by me and Xinyun, and we have this amazing crew of guest speakers to help cover these important topics in class. So that's the overview of the topics we will cover and why we are here to study LLM agents together.

Now I will briefly talk about course logistics. For the course workload, we will provide separate information for the Berkeley students and for the MOOC students online. For the Berkeley students taking this class, there are a number of components. First, there are weekly reading assignments, due by midnight on the Sunday before the following Monday's lecture. Today is the first lecture, so there's no reading assignment due today, but for next Monday's lecture the reading assignments will be posted today and are due by this coming Sunday at midnight. We will also have a hands-on lab to give you experience with some of the popular real-world agent frameworks; that information will be released later in the semester. In addition, we have a semester-long course project, and the grading information is available online. Here is a quick summary of the workload for students taking the class for different numbers of units; again, you can check the specific information online. For students taking two, three, or four units, besides the weekly reading assignments and the lab, you also need to do a project. For two-unit students, the project does not require implementation; for three units, the project needs an implementation; and for four units, the project needs a significant implementation component and an end-to-end demo. The grading criteria and the percentages for the different components are also available on the course website; you can go there and look at the details.

I just want to briefly talk about the class projects for students taking the class for two, three, or four units. Given the size of the class, we require the projects to be done in groups, with a group size of five. Also, given the huge excitement about this topic, we will have a hackathon on LLM agents that will be launched later in the semester; students working on a class project can also use their class project to participate in the hackathon, which will be really exciting for everyone. In terms of the class projects, you are welcome to pick your own project related to LLM agents.
In particular, we recommend you consider five categories, which we call five tracks. The first is the applications track: think about how you want to use LLM agents to build an exciting application in a particular domain you are interested in. The second is the benchmarks track: to better evaluate and understand LLM agents, it is very helpful to develop benchmarks that evaluate their capabilities and other aspects such as safety; your project can create a new benchmark in a new domain for LLM agents, or you can work on improving an existing benchmark, either extending it or improving its quality. The third is the fundamentals track: develop new techniques and approaches to enhance core agent capabilities, including memory, reasoning, planning, tool use, and so on. The fourth is the safety track: develop methods to address safety concerns in deployment, including misuse, privacy, and so on. And the fifth is the decentralized and multi-agent track: develop methods and tools to enhance decentralized multi-agent systems. We will soon provide more detailed information and examples for these different tracks, but this is some initial information to help you start thinking about your project. The timeline is also available on the course website; take a look to plan your course load for this semester. As you can see there, project group formation is actually due next Monday, and the TAs will be posting threads on Ed to help students form project groups and find group members.

With that, any questions on logistics? One quick question: will the slides be posted? Yes, the slides will be posted online after class. Another question: can a project group have fewer than five people? In general we recommend project groups of five. The class size may not be exactly divisible by five, so there may be some exceptions; your project group cannot exceed five, but if you want to have a smaller group, you can talk to the TAs and we can discuss. And one more very important note: students in the same group must be taking the class for the same number of units; we cannot have a mix, so each group of five students is all four units, or all three units, or all two units.

Okay, with that, let me now turn the mic over to Denny, our first guest speaker. Denny Zhou is from Google and has done a lot of really amazing work on reasoning; many of the key papers you will be reading, and many others, were done by Denny and his group. With that, let's welcome Denny.

Thank you, it's my great pleasure to be here. This is actually my second time speaking at Berkeley; the first time was about ten years ago, when I also gave a talk. The lecture hall is as nice as it was then; the only difference is the audience. At my first talk about ten years ago, the audience was mostly professors; today the room is packed, which is amazing. Before the talk, I want to ask one question for everyone: what do you expect from AI? Take a few seconds to think about it. I can imagine many different answers: solve the hardest math problems that humans have yet to solve, discover new scientific theories, or even achieve AGI or ASI.
My background is machine learning. I don't know how many people still take machine learning courses these days, because "Transformer is all you need," right? As a machine learning person, I had an expectation about AI: AI should be able to learn from just a few examples, like humans usually do. In the past decades, the machine learning community spent great effort developing data-efficient methods, like semi-supervised learning, active learning, and so on. And if you look at the newspapers over the past decade, people were always excited about AI winning games. In practice, though, I never saw the data-efficient learning that I wanted to see succeed. Don't feel bad about that; I am a machine learning person myself, and I started in this field many years ago. That gap led me to think about a different problem: what is missing in machine learning? I thought about it for years and finally found the answer: reasoning. These days, especially for people taking this course, the answer sounds obvious; after all, this lecture is about reasoning. Humans can learn from just a few examples because humans can reason, not because of data statistics. It sounds so straightforward.

Let's start from a toy problem. In my research I usually prefer a very simple problem that still contains all the challenging pieces. This problem is called last-letter concatenation; if you are familiar with the neuro-symbolic literature, you will find similar problems there. Given a person's name as input, the output is the concatenation of the last letter of the first name and the last letter of the last name. For example, for "Elon Musk", the last letter of "Elon" is "n" and the last letter of "Musk" is "k", so the output is "nk". It's that simple. If you had seen this problem a few years ago, you probably would have tried to solve it with a machine learning model, for example a Transformer with an encoder and a decoder, and then you would find that you need thousands of labeled examples to train the model, and you would end up with an accuracy like 85% or 90-something percent. Now it's interesting to think about machine learning methods: for such a simple task (simple for humans, at least), if the method requires a vast amount of labeled data to learn it, would you call it AI or not? AI means artificial intelligence; I would expect an intelligent model to learn this task from just one or two examples.

Now let's see how this problem can be solved using large language models. I suppose most people here know what LLMs are, but the professor told me to explain it anyway. An LLM is a Transformer model trained to predict the next word. For example, given the text "AI is the future", we mask out "future", use "AI is the" as the input, and let the model predict what the next word will be. If the prediction is not the word "future", we adjust the parameters to make the prediction correct; in case you're wondering, that's called backpropagation. Of course, you can train the model with many, many sentences, for example all the text on the internet. If you don't want to go into the details, you can simply say that training LLMs is training parrots to mimic human language. (Actually, I once posted that sentence online, and one guy replied that he was furious about "training parrots" because he was looking for a job.) Why do we train the model this way? Because then we can just mimic the training process at inference time: training is about predicting the next token, so we can use whatever we want as the input, see what the output token is, append the generated token to the input, and predict the next token again. That's how we get answers from an LLM.
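To make the "predict the next word" picture concrete, here is a minimal sketch that inspects a small pretrained causal LM's next-token distribution for the "AI is the future" example from the talk. GPT-2 and the Hugging Face transformers API are my own choice for illustration, not something the talk specifies.

```python
# Minimal next-word-prediction sketch (illustrative; GPT-2 stands in for the
# much larger models discussed in the talk).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = tok("AI is the", return_tensors="pt")
with torch.no_grad():
    logits = model(**context).logits[0, -1]      # scores for the next token
probs = torch.softmax(logits, dim=-1)

target_id = tok(" future").input_ids[0]          # note the leading space in GPT-2's vocabulary
print("P(' future' | 'AI is the') =", probs[target_id].item())

# Training minimizes the cross-entropy -log P(correct next token | context)
# over huge text corpora, adjusting the parameters by backpropagation.
```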
For this problem, we can simply concatenate all the example input-output pairs we have, append the test example, "Barack Obama", use all of that as the input to the LLM, and see what happens. You will probably get a wrong answer. Of course it's not correct: "k" is the last letter of "Barack" and "a" is the last letter of "Obama", so the output should be "ka". This is few-shot prompting, and it is just a mimicry of the machine learning process: instead of training the model on the examples, we put the examples in the input; that's the only difference.

These days we know how to fix this prompting: we just need to add a reasoning process before the answer in the demonstration, something like "the last letter of 'Elon' is 'n', the last letter of 'Musk' is 'k', concatenating 'n' and 'k' gives 'nk'", and similarly for the Barack Obama case (both prompt styles are written out concretely at the end of this section). With this added to the input, we get a perfect response from the large language model. Just like for humans, one demonstration is enough to reach 100% accuracy. That's exactly what I was looking for. We cannot imagine any classical machine learning method achieving this from a single example; there is just no way. By the way, don't over-read what I said about machine learning: machine learning is still extremely useful and important for doing research. These days I see many naive mistakes on social media, in the news, and even in papers at major conferences, mostly from people with no background in machine learning who just try random ideas.

It's interesting that this kind of idea, adding intermediate steps, was proposed many years ago in the literature. There is an amazing paper by DeepMind researchers published in 2017: they used natural language rationales to solve math problems, they even wrote "derive the final answer through a series of small steps", and they trained a sequence-to-sequence model from scratch. If you know the chain-of-thought work, you will be surprised by this paper; the authors are like time travelers. In 2021, a team at OpenAI published an amazing dataset called GSM8K; they followed the idea of that 2017 DeepMind paper: in this dataset, every problem is followed by intermediate steps as the solution and then the final answer, and they used the dataset to fine-tune GPT-3 models, greatly scaling up the DeepMind work from four years earlier. In the same year, 2021, a group of researchers at Google Brain (now part of Google DeepMind) published "Show Your Work: Scratchpads for Intermediate Computation with Language Models"; they discovered similar ideas independently, but in the domain of program synthesis, which is why they used program traces instead of natural language. And probably many people here know our work on chain-of-thought prompting. By the way, "chain of thought" is not literally a term we invented; it's just a common English phrase meaning multi-step reasoning. In that work we extensively evaluated prompting with intermediate steps and showed amazing results on almost every NLP task.

Now let's put all these papers together: in 2017, DeepMind published training with intermediate steps; in 2021, OpenAI fine-tuned with intermediate steps on GSM8K; also in 2021, scratchpads for program synthesis; and in 2022, prompting with intermediate steps. Which part is more important? You can see that it actually doesn't matter whether you train, fine-tune, or prompt the model: what really matters is the intermediate steps. That's the key. So let me summarize: regardless of whether you use training, fine-tuning, or prompting, when provided with examples that include intermediate steps, LLMs will generate responses that also include intermediate steps.
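To make the contrast concrete, here are the two prompt styles from the last-letter example written out as plain strings; the exact wording is my own illustration rather than the prompts used in the papers.

```python
# Standard few-shot prompting: input-output pairs only, mimicking supervised learning.
standard_prompt = (
    "Q: Elon Musk\n"
    "A: nk\n"
    "\n"
    "Q: Barack Obama\n"
    "A:"
)

# Chain-of-thought prompting: the same demonstration, but with the intermediate
# steps spelled out before the answer.
cot_prompt = (
    "Q: Elon Musk\n"
    'A: The last letter of "Elon" is "n". The last letter of "Musk" is "k". '
    'Concatenating "n" and "k" gives "nk". So the answer is nk.\n'
    "\n"
    "Q: Barack Obama\n"
    "A:"
)

# Fed to a reasonably capable model, cot_prompt typically elicits the same kind of
# step-by-step rationale and the correct answer "ka"; standard_prompt often fails.
```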
Given that the model generates intermediate steps when prompted this way, the next question is: is it helpful to introduce reasoning strategies into those examples? When humans solve a problem, they often have a strategy for solving it. Here is work from our team called least-to-most prompting. In this work we enable easy-to-hard generalization through decomposition. Probably many people have seen the famous book "How to Solve It" by Polya, a classic book on math education; there is a chapter there about decomposition, because if you jump straight into the details, you may lose yourself in the details.

Now let's see what changes with decomposition. Take this math problem (by the way, in this talk the math is kept at an elementary level; before every talk my daughter checks that she understands everything, and she is in fifth grade now): "Elsa has three apples. Anna has two more apples than Elsa. How many apples do they have together?" The difference is that we first show the LLM how to break the problem down into subproblems, and then solve them one by one. That's why it is called least-to-most: from the least to the most complex subproblem (a concrete sketch of this style of prompt appears a bit further below). It's a simple idea, but surprisingly powerful, and it mirrors how humans decompose a complex task into simple tasks.

Here is the SCAN task for compositional generalization: given a natural language command, translate it into a sequence of actions that could be executed by a robot. With least-to-most prompting we reach 99.7% accuracy while using only about 0.1% of the training set as demonstration examples. One might wonder why I chose this task: I actually learned of it from Xinyun, who is here today; she invented a beautiful approach to solve this task many years ago. When I first looked at it I was really surprised, because it looks so straightforward for humans; why should it be so hard? Finally we could make it work with LLMs. And here is another task, CFQ, a text-to-code task; again it is a compositional generalization task. I don't know if everyone knows the concept; roughly speaking, in compositional generalization the test examples are more difficult than the training or prompting examples. For example, for the text-to-code problems, the test problems need longer code snippets. Here our approach changes a little; it is called dynamic least-to-most prompting. We used only 1% of the data and achieved great results, way better than the state of the art in the literature, which was achieved by specialized architecture design and training on the whole training set.

Any questions so far? Otherwise I'll go to the next section. I suppose this part is quite familiar to everyone. I have two kids: my daughter is ten years old and my son is seven. Around the time the chain-of-thought prompting paper came out, I overheard a very interesting conversation between them. My daughter asked her little brother, "What's 17 times 3?" The little brother said, "I don't know." Then she asked, "What's 10 times 3?" "30." "What's 7 times 3?" "21." "So what's 17 times 3?" "Oh yeah, I know: 51." And the funny thing is that my daughter shouted to me, "Daddy, chain-of-thought prompting also works on my little brother's brain!"
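Here is the decomposition sketch promised above: an illustrative least-to-most style prompt for the apples problem, split into a decomposition stage and a sequential solving stage. The wording is my own; the actual prompts in the least-to-most paper differ.

```python
# Stage 1: ask the model to reduce the problem to a simpler subproblem.
decompose_prompt = (
    "Q: Elsa has 3 apples. Anna has 2 more apples than Elsa. "
    "How many apples do they have together?\n"
    "A: To answer this, we first need to answer: How many apples does Anna have?"
)

# Stage 2: solve the subproblems from least to most complex, feeding earlier
# answers into the later questions.
solve_prompt = (
    "Elsa has 3 apples. Anna has 2 more apples than Elsa.\n"
    "Q: How many apples does Anna have?\n"
    "A: Anna has 3 + 2 = 5 apples.\n"
    "Q: How many apples do they have together?\n"
    "A:"  # the model should now produce 3 + 5 = 8
)
```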
Okay, now: why are intermediate steps helpful? You may say it's so natural for humans, but if we are doing research we have to dig deeper. LLMs are still machine models, and we want to understand what is happening. This year we have work published at ICLR 2024, in collaboration with colleagues at Stanford, where we give a rigorous mathematical analysis. Here are the key results: a Transformer generating intermediate steps can solve any inherently serial problem, as long as it generates enough intermediate steps, using only constant depth. Let me emphasize "constant" again: that means independent of the input length. However, a Transformer generating direct answers either requires a huge depth to solve the problem or cannot solve it at all. Please read those statements again before I move to the next slide; you can probably see tons of practical implications of this theory. If you fail to solve a problem, you may think about generating more intermediate steps, and you could also call external tools, such as search, to help produce those intermediate steps. In this LLM agents course many people will talk about how to use external tools, and you can think about how that helps work around these limitations. One of my hobbies, by the way, is to find problems my daughter can solve in seconds but LLMs fail at.

So far we have talked about how to use examples to trigger LLMs to generate step-by-step reasoning. Is it possible to trigger reasoning without using few-shot examples at all? There is an amazing piece of work here; when this paper came out, I thought it was a joke. It turned out not to be, and I was inspired a lot by it. It's the "let's think step by step" work: given a question, we don't need any examples; we just add "let's think step by step", and the model generates the reasoning steps. It's really cool, but usually the zero-shot approach (zero-shot means no demonstration examples) is worse than few-shot. So we wondered: can we find an approach that is still zero-shot but does much better? This led to another work of ours, "Large Language Models as Analogical Reasoners". Again, the beautiful book "How to Solve It" by Polya: the book describes how to use analogical reasoning to solve math problems. When you see a new problem, you first ask yourself: do you know a related problem, or a related method or strategy? (So after my talk, when you go to write a paper, maybe first find the related work.) I also really like the quote from Banach here; if you have studied functional analysis, you will know Banach spaces. I was really amazed by the last sentence: the ultimate mathematician is the one who can see analogies between analogies. Of course, I show it here to let you know how far we still are from that.

So given a simple problem, you can of course say "let's think step by step", but now we can say it in a different way: first recall a related problem, and then solve this one. You can see that the model indeed recalls relevant examples and knowledge on its own, which is amazing. We tried this on benchmarks and it works really well, and of course you can optimize the prompt yourself and get even better results. The most important thing is that it is much better than just saying "let's think step by step" (zero-shot chain-of-thought), and this approach even outperforms manual chain-of-thought; the main reason is that the model automatically generates related questions tailored to each different problem. There are results on BIG-Bench, with great performance, and there are also results on competitive programming; if you are interested in competitive programming, you could try this approach. One thing we did not do here is search: you could also search the web for related problems and knowledge for the problem you want to solve. So the key idea is to let the model generate relevant examples and knowledge for each given problem, instead of using a fixed set of examples as in manual chain-of-thought prompting.
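An illustrative prompt in this spirit might look like the following; the exact instructions in the "Large Language Models as Analogical Reasoners" paper are worded differently, so treat this as a sketch of the idea rather than the paper's prompt.

```python
# Analogical-reasoning style prompt: instead of hand-written exemplars, ask the
# model to recall its own relevant problems before solving the new one.
problem = "..."  # the problem you actually want solved

analogical_prompt = (
    f"Problem: {problem}\n"
    "\n"
    "Instructions:\n"
    "1. Recall three relevant and distinct problems. For each, describe the problem "
    "and explain its solution.\n"
    "2. Note any knowledge or techniques from those problems that could be useful here.\n"
    "3. Using the recalled problems and knowledge, solve the original problem step by step.\n"
)
```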
Okay. Now we have seen that we can use few-shot examples to show the model how to do step-by-step reasoning, and that we can also do it zero-shot, without any examples, just by saying "let's think step by step". Now I'll ask another question: is it possible to trigger step-by-step reasoning even without using any prompt at all, without saying "step by step"? You could say that all the models these days already behave like that, because they have been trained or instruction-tuned with many such examples in their data mixture. You're right. But we found that the answer is yes even so. This is our recent work, "Chain-of-Thought Reasoning Without Prompting": without prompting, without adding anything, we just give the problem to the model, even a pre-trained one.

Let's look at an example: "I have 3 apples, my dad has 2 more apples than me, how many apples do we have together?" (When we wrote the paper I asked my daughter to write a simple example to explain the idea; as she told it, she has three apples, and I changed it to my dad.) For this example, the approach is very simple. At decoding time, at the first step, we look at the possible first tokens; here the top five are listed. We start from each of these first tokens and then continue with greedy decoding (a rough code sketch of this branching procedure appears at the end of this section). The purely greedy continuation is just "5 apples", which is wrong. But if we start from the token "I", the full generation becomes "I have 3 apples, my dad has 2 more apples than me, so my dad has 5 apples...", and it continues to the correct total. That's very interesting: we said nothing about reasoning, but the model can do some reasoning when it is started from different first tokens.

Here is another example: "Was Nicolas Cage born in an even or odd year?" With greedy decoding, the first token leads straight to a direct answer; among the top candidates you also see "even" followed by a period, "odd" followed by a period, and so on. Now you might say: if the model can have reasoning in its response, how do we find that response? You could look for longer responses, since a longer response may contain reasoning steps. But a more surprising signal is the probability of the answer token. If you look at the probability in the first row, where the model directly says Nicolas Cage was born in an odd year, the confidence is quite low. However, when there is a reasoning path, like the last row, "Nicolas Cage was born in 1964, and 1964 is an even year", the probability of the final answer jumps to 0.98. That's amazing; it seems the model is well calibrated here. I was really surprised when I saw those probabilities: when the model just says "even" or "odd" directly, the probabilities are very low.

So the key observations: pre-trained LLMs already have responses with step-by-step reasoning among the generations started from the top-k first tokens, no prompt is needed, and there is high confidence in decoding the final answer when a step-by-step reasoning path is present. Here is a comparison between greedy decoding and chain-of-thought decoding, and you can see that chain-of-thought decoding performs much better. Any questions so far? Yes: what if the answer is not just a single token, and how do you look at the probability then? Good question; we haven't published that follow-up work yet, so you can think about it.

Okay, now let's move to the next topic. Generating intermediate steps is helpful, really helpful. But are there any concerns about generating intermediate steps instead of direct answers? One answer from the audience: it probably depends on your problem and your needs.
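Here is a rough sketch of the branching idea described above, using GPT-2 via Hugging Face transformers purely for illustration (a small model like this will not actually reason well); the paper additionally scores each branch by the model's confidence on the answer tokens, which is omitted here.

```python
# Chain-of-thought decoding, simplified: instead of committing to the greedy
# first token, branch on the top-k first tokens and continue greedily from each.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = ("Q: I have 3 apples. My dad has 2 more apples than me. "
          "How many apples do we have together?\nA:")
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    first_logits = model(**inputs).logits[0, -1]          # distribution over the first answer token
topk = torch.topk(torch.softmax(first_logits, dim=-1), k=5)

for p, tok_id in zip(topk.values, topk.indices):
    # Continue greedily from each candidate first token.
    branch = torch.cat([inputs.input_ids, tok_id.view(1, 1)], dim=-1)
    out = model.generate(branch, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    continuation = tok.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
    print(f"p(first token)={p.item():.2f}  ->  {continuation!r}")
```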
So, these days we always need to keep in mind that LLMs are probabilistic models for generating the next token. They are not humans, no matter how human-like my examples sound; keep this in mind. Since it is a probabilistic model, let's look at what an LLM actually does when decoding: it computes, roughly, the argmax over (reasoning path, final answer) of P(reasoning path, final answer | problem). However, what we want is the argmax over the final answer of P(final answer | problem); that is what we learn in machine learning. This doesn't mean the reasoning path is unimportant; I'm just saying that we first have to make sure the final answer is correct, and then look at the reasoning path. The two objectives are not aligned.

Now let's go one step further. To compute the probability of the final answer given the problem, we should sum over all possible reasoning paths: P(answer | problem) = the sum over reasoning paths r of P(r, answer | problem). That is marginalization, which you learn in any probability course. Given a math problem, there can be many different solutions that all lead to the same answer, and we need to sum over them. And how do we compute the sum? If you have studied machine learning you know the answer: sampling. It's that simple. This led to our work on self-consistency. Probably many people already know self-consistency, but my point here is to show the underlying motivation, how we approached this problem from the first principles of machine learning. Look at the question: given this math problem, you can sample the answer multiple times, and finally you take the most frequent answer, which here is 18. Note that we do not pick the most frequent reasoning path; we choose the most frequent final answer. That is a huge difference: the reasoning path here is a latent variable. The idea is so simple, and by using self-consistency we simply crushed the state-of-the-art results in the literature at that time. You see, for doing research you don't need to know a lot of fancy machinery; it's really about the idea. And of course our explanation of self-consistency is about probability, about sampling: you can imagine that more consistent results are more likely to be correct. If you look at the curves, when the consistency is more than 80%, the accuracy is nearly 100%.

Now, two questions for you. First: when the output is a direct answer without intermediate steps, should we still sample several times and choose the most common answer? Anyone? Right: no. With a single answer token, greedy decoding already gives the token with maximum probability, so there is nothing to gain. Second question: what if we change self-consistency by letting the model generate multiple responses in one pass, instead of sampling multiple times, and then choose the most common answer; does that make sense? No, it doesn't. For both questions, you just need to follow the principle: argmax of the probability of the final answer given the problem. That is all you need to understand self-consistency. It is a very simple principle, almost a first principle in machine learning; if you know more, this is called marginal inference.

Okay, what about free-form answers? There is a follow-up called universal self-consistency. The idea is a little different, since free-form answers cannot be matched exactly. Given the problem "Where do people drink less coffee than they do in Mexico?", every sampled answer is worded differently from the others, but the most common response here points to Japan, China, and India. Any questions? Otherwise I'll move to the next section. Yes, what is self-consistency again? You sample the answer multiple times and then choose the most frequent answer as the output.
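A minimal self-consistency sketch in code, assuming a hypothetical generate_answer(prompt, temperature) wrapper around whatever LLM API you use and a prompt that asks the model to end its reasoning with "The answer is <number>":

```python
# Self-consistency: sample several reasoning paths at nonzero temperature,
# extract the final answer from each, and return the most frequent answer
# (i.e., marginalize over reasoning paths by majority vote).
import re
from collections import Counter

def extract_final_answer(text):
    # Assumes the prompt asks the model to finish with "The answer is <number>."
    m = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", text, flags=re.IGNORECASE)
    return m.group(1) if m else None

def self_consistency(prompt, generate_answer, n_samples=20, temperature=1.0):
    answers = []
    for _ in range(n_samples):
        completion = generate_answer(prompt, temperature=temperature)  # one sampled reasoning path
        answer = extract_final_answer(completion)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]  # most frequent final answer
```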
Next I'm going to talk about limitations. The first one: LLMs can be easily distracted by irrelevant context. From psychology studies we know that irrelevant information can significantly decrease the problem-solving accuracy of children and even adults, so I wanted to check whether this observation holds for LLMs. Here is a simple problem where the highlighted text, a manually added sentence about Mario and $10, is irrelevant to the original problem, and you can see that after adding it the model produces a wrong solution. Interestingly, if we add an instruction like "ignore the irrelevant context", the model immediately notices the distractor and gives a correct solution. But it is still hard to recover when the irrelevant content gets large: if we simply add irrelevant sentences like "the sky is blue" and "the grass is green", pure nonsense, and make the input long, you will see a significant performance drop across all LLMs.

The next limitation: LLMs cannot self-correct reasoning yet. Let's start from a math problem again; this one is a little tricky. The model gives a wrong answer. Then we prompt it with "review your previous answer and find problems with your answer"; interestingly, after reviewing, the model recognizes the mistake and proactively corrects it, which looks amazing. Then we say "based on the problems you found, improve your answer", and the final answer is correct. However, if the original answer was correct and we use the same prompts, the model can change it into a mistake. That is the problem. So overall, while allowing LLMs to review their generated responses can help correct inaccurate answers, it also risks changing correct answers into incorrect ones. We ran extensive studies on benchmarks like GSM8K, CommonsenseQA, and HotpotQA, and we did not see any improvement from self-correction methods; they just make things worse. Then how do we explain the improvements reported in the literature? They claimed improvements on reasoning, but they used oracle answers; you can see the word "oracle" there. Oracle means you only prompt the LLM to correct its answer when the answer is wrong; the problem is that the model doesn't know whether its answer is correct or wrong, you are the one telling it. There is also the related idea of multi-agent debate: you could have multiple LLMs debate each other to reach agreement or consensus. We tried this approach too, and we found the trick is how many responses are generated in total: for example, with three agents each generating a response, that's three responses, and if they debate there are even more responses altogether. So how about comparing against self-consistency with the same number of responses? We found that those approaches cannot outperform self-consistency, and self-consistency is much simpler: just sample multiple times and take the most frequent answer as the final prediction. The lesson here is that oracle feedback is needed for LLMs to self-correct. In a follow-up work of ours, we naturally leverage unit tests as the oracle: for coding problems, you have unit tests to check against. We actually started this self-correction work quite early; we couldn't make self-correction work without an oracle, and that is what moved us in this direction.
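Here is a minimal sketch of that oracle-feedback loop for code generation, with a unit test playing the role of the oracle; generate_code is a hypothetical wrapper around your LLM, and running untrusted generated code via exec is only acceptable in a sandbox.

```python
# Revise only when the oracle (a unit test) says the answer is wrong; never ask
# the model to "self-correct" an answer that already passes.
def passes_unit_test(code: str, test: str) -> bool:
    env = {}
    try:
        exec(code, env)   # define the candidate function (sandbox this in practice)
        exec(test, env)   # run the assertions against it
        return True
    except Exception:
        return False

def generate_with_oracle(problem: str, test: str, generate_code, max_rounds: int = 3) -> str:
    code = generate_code(problem)
    for _ in range(max_rounds):
        if passes_unit_test(code, test):
            break  # oracle says correct: stop here
        feedback = (f"{problem}\n\nYour previous solution failed this test:\n{test}\n"
                    "Please fix the code.")
        code = generate_code(feedback)
    return code
```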
The last limitation I want to talk about: premise order matters in reasoning. These days, every time we see reports on arXiv or on social media, people show great results; for example, recently models report very strong results on GSM8K, and I have trouble trusting those numbers, because these models are trained with all the data on the internet, so there could already be contamination. One of the efforts on my team is to generate different evaluation tasks to test the models. Here we did a simple check: given an original GSM8K problem, we reorder the sentences a little and see if the model can still solve it. For instance, in the original problem the sentence "he loses 10 beads while getting home" appears early, and we can move that sentence toward the end and see what happens (see the illustrative pair after the summary below). We made this kind of change for a set of GSM8K problems and noticed about a 10-point drop in solve rate across all frontier LLMs. If you compare the responses on the original and the reordered problems, you can see that the model only knows how to solve the problem sequentially, in the order the premises are given; it cannot go back and forth. One could say this is related to semantic understanding, so we designed another task: logical inference, which is purer than math problems. The rules are of the form "if A then B", and we can even use random tokens instead of real words. Given the rules and the facts, the model does logical inference for a query. In the original problems, the rules are ordered according to their use in the inference process, although not all rules are necessary for the query. Then we randomly reorder the rules; we only reorder the rules relevant to the query, and the irrelevant ones keep their positions. Surprisingly, we then saw an even bigger drop across all frontier LLMs. From my personal experience, it is really important to design your own evaluation experiments when doing research.

Now let me summarize the talk. First, generating intermediate steps improves LLM performance a lot. You can get there by training, fine-tuning, or prompting with intermediate steps, and you can also do zero-shot prompting, analogical reasoning, or special decoding such as the chain-of-thought decoding I presented today. Second, self-consistency greatly improves step-by-step reasoning, no matter whether the reasoning comes from a fine-tuned model or from prompting. And I also talked about a number of limitations: irrelevant context, self-correction, and premise order all matter for performance. So what is the next problem to work on? The most important thing to me is this: people say "we work on AI, we work on AGI", and that is not the point. The point is to find the right problem to work on and to solve it from first principles, not just from existing papers. That is what is super important here. And actually, I am currently co-organizing a conference called the Conference on Language Modeling (COLM), together with a group of amazing people; it is the first-ever conference dedicated to language modeling. That's it.
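To make the reordering experiment concrete, here is an illustrative pair in the spirit of that check (these are not the actual benchmark items); both versions have the same answer, and only the narrative order of the premises differs, yet models tend to stumble more often on the reordered one.

```python
# Same GSM8K-style problem, two premise orders, identical answer (80 - 10 - 15 = 55).
original_order = (
    "Thomas buys 4 bags of 20 beads. "
    "He loses 10 beads while getting home. "
    "He then gives 15 beads to his sister. "
    "How many beads does Thomas have left?"
)
reordered = (
    "Thomas buys 4 bags of 20 beads. "
    "He then gives 15 beads to his sister. "
    "He loses 10 beads while getting home. "
    "How many beads does Thomas have left?"
)
```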
Any questions? One question: is self-consistency related to beam search? Let me repeat the question: is self-consistency related to beam search or not? My understanding is that they are not that related. Beam search is a decoding approach; self-consistency is just about sampling. It's like this: if you want to know the probability of heads for a biased coin, you just toss the coin many times and see which side comes up more frequently.

Next question: why is multi-agent debate worse than self-consistency? Let's recall what self-consistency means. For self-consistency, our goal is to argmax the probability of the final answer given the problem, and from this first principle, the technique is to sample some number of times and take the most frequent answer, which means we marginalize out the reasoning paths. As far as I know, this procedure is optimal in statistics, so you cannot do better than that with these strategies. I know there are many variants in the literature; for example, some people put a weight on each answer, say the model's probability, and that is not necessary. What about model debate? We didn't change anything there; we just let a single LLM generate the same number of responses, instead of using multiple LLMs debating, and compared.

Another question, about hallucination; sorry, I couldn't hear it well. Is it okay for me to simplify it as: how is self-consistency related to hallucination? Does anyone know a definition of hallucination? I don't know if there is a consensus. Let me answer with what I know: in code generation you have unit tests as ground truth, and if the answer is not correct you just tell the model and ask again; that is a good approach for code generation. I don't know exactly how that relates to hallucination; maybe you had a different question in mind.

Another question: are the sampled responses used to boost training? No, we didn't do any training here; there is no training at all, just sampling. Very simple. And a related question: could we weight the responses? My answer is that it is not necessary. I just follow the typical Monte Carlo procedure: sample some number of times; that is how we compute the probability. Again, like tossing a biased coin multiple times and counting the frequency of heads and tails, we don't use any weights.

Next question, about architecture: compared to this, could one control the external structure of the model to make it think, to generate better reasoning? I have seen work in that direction, for example using graphs to generate better reasoning, or using search during training, and that could involve a small architecture change or something else. However, I would like to remind you of the work I presented today: a Transformer model, if we allow it to generate as many intermediate tokens as it wants, can solve any inherently serial problem. We didn't change anything in the architecture for that. If you add structure, it may make the reasoning more efficient, but I don't know; that would be something different.

Next question, about using more compute: that is a good point. All of this is about using more compute at inference time; it is a kind of test-time computing. For many problems you generate more intermediate steps, and that is just one way of spending compute. But you could also consider compute in the architecture, for example sparse models like mixture-of-experts, where each token uses different experts.

Next question, about the premise-order work: how do you define a related problem, and could you give related examples with different premise orderings as demonstrations? Let me rephrase: for the premise-ordering task, is it possible to use related examples with different orders as demonstrations, and would that help the LLM solve the problems?
You are definitely right: that would be helpful. But our goal in that work is a little different: we care about a general approach, or a general limitation, and we want to address the problem in a general way, not with something tailored to that particular task; the task itself is just for motivation and analysis.

Next question, on self-consistency: sampling seems to work for, say, common-sense problems, but what about novel problems such as drug discovery, where the reasoning itself is new; how does self-consistency handle that? And the follow-up: would a reinforcement-learning-style algorithm be more capable for this type of problem, since it keeps scoring states and moving to better ones? So the question is how self-consistency could be applied to open problems such as scientific discovery. That really depends on the problem. In the talk I also presented universal self-consistency, where the answer is not unique and you try to find a consensus. However, that kind of problem can be more challenging in scientific research: once you make a great breakthrough, your work should be very different from everyone else's, so looking for the most common answer would not be the right criterion. So you are right to be cautious there.

Next question, on the theory work: in the work you presented today, you proved that if you allow the Transformer to output intermediate steps, it can solve problems in TC0 rather than AC0, and one well-known hard case is the multiplication of two numbers. So one might conclude that if you allow LLMs to output intermediate steps they can do multiplication, and if you don't, they cannot. But there is recent work showing that if you train a Transformer properly with intermediate steps, yet only ask it to output the final answer, it can still do multiplication, even for seven-digit numbers. That seems like a contradiction; what's your take? Okay, your question is about multiplication. It probably also depends on how you train the model these days; I honestly have no idea here. For this course on agents, I assume the speakers in future lectures will tell you to just use Python; that is the simplest way to do multiplication.

Next question, about the reordering: if you reorder the premises, the accuracy drops a lot; is that because the model relies on consuming the next part of the problem in order, and the reordering takes that structure away? So the question is why performance drops so much after reordering. I think the problem is mainly due to the training: all of these models are autoregressive; they process the problem in order. Even for humans, if the premises are out of order, you have to go back and forth to solve the problem, and that is harder.

Next question, about whether training on reordered premises would help: today I didn't talk about anything on the training side, so sorry, I have no comment; this talk is about how to use the model, not about how to train it.

Next question, about whether self-consistency improves accuracy: yes, because if the consistency is high, the accuracy tends to be higher. Oh, your concern is that the samples may not be independent and then it may not work? That's a good point; we have to make sure the samples are independent.

And the next question is how the temperature impacts the performance of self-consistency.
That actually depends on the particular model and how it was trained or tuned. From our experience, we usually just choose a temperature around one to report the results; we didn't particularly sweep different temperatures to see how the performance changes. Of course, if you set the temperature to zero, that won't work, because there is no sampling; we really just chose a temperature of one.

Next question: what if the model gives a wrong answer with high confidence? Yes, there are many examples like that. It's kind of related to the Dunning-Kruger effect in psychology: if we don't have expertise in a domain, we make more mistakes, and beyond that, we can even be strongly confident about our wrong answers; we don't know what we don't know. Something similar happens with models. On the machine learning side, what you can do is choose the maximum-probability answer; that is optimal if you have nothing else. Of course, if we have other prior knowledge, we could use it; otherwise we have no other options.

Next question, about automated theorem proving: is there a way to incorporate axioms or theorems into the model so that it can learn more efficiently and solve problems, including new problems it has not seen in training? So your question is whether we could add theorems so the model can solve or learn more efficiently. Yes, and it depends on how you add them. Usually, during pretraining the model has already seen all the data from the internet, including the theorems. But if you have new theorems that are not in the model, you can include them in the prompt, and that will be helpful as well.

Great, and thanks, everyone, for being here. We'll see you next Monday, and, again, please remember to form your project groups. Thank you all. For the students waiting here: please just wait a moment; you don't need to talk to the speaker right now.