Transcript for:
Kaggle AI Course Highlights and Insights

Welcome, everyone, to the next iteration of the Kaggle Generative AI Intensive Course. We're really excited to have over a quarter of a million developers dialed in today to learn all about the Gemini APIs, AI Studio, Vertex AI, and all of the AI features that Google Cloud has been putting out over the last several months. But before we begin, I want to welcome to the stage someone very special, our chief scientist at Google, who wanted to be here to give you a short message. With that, let's welcome Jeff Dean.

Hi everyone, Jeff Dean here. I'm super excited you've all decided to join the Kaggle Gen AI intensive course. We ran this course last year and had about 150,000 developers sign up; we're running it this year and it looks like we'll have more than 200,000, which is pretty amazing. We're really excited to see what people do with these generative models. There are so many possibilities in terms of creating useful things that help people get things done, things that entertain people, and things that help people take information and transform it into different modalities. The capabilities of these models are increasing quickly, which means the building blocks we all have as developers are growing in sophistication, and that enables us to do more and more interesting things. So we're really excited about what you're all going to do and what you're going to learn to do; that's why we're running this course. We're looking forward to answering questions, giving you guidance on how to use these tools, and seeing how you all use them. Super exciting. Thank you.

Excellent, and thank you so much, Jeff Dean. I think we've all used many of the products that Jeff has built from the ground up, so it's wonderful to have him here today to tell us a little about what he's excited to see our students create.

Taking a look back at the slides, just as a general reminder: this five-day intensive course is hosted by Kaggle, and it includes everything from daily assignments to Discord discussion threads, plus the livestream and ask-me-anything session you're dialed into right now. The course covers foundational models, embeddings and vector databases, agents, and domain-specific models, and today we are exploring the world of foundational models and prompt engineering. By now you should have listened to our summary podcast episode, which was created with NotebookLM (we'll hear more from the NotebookLM team later this week), taken a look at the white papers covering this first day's content, and, importantly, the Kaggle codelab, which Anant will walk through toward the end of today.

Before we go further, a huge shout-out to our generative AI course moderators. This is the team that has been tirelessly answering the questions you've been asking on Discord and helping spot questions that could be useful in the Q&A session. Thank you so much; a virtual round of applause to everybody who has been making this course happen. If you see them digitally or in person, say thank you.

With that, let's head into our Q&A. Today I am honored to welcome Logan Kilpatrick, Warren Barkley, Kieran Milan, Irina Sigler, and Mat Velloso to the stage; they are our expert hosts for day one, all about prompt engineering and foundational models.
I'll go ahead and get into our first question, which is for Logan and for Mat: tell us a little bit about AI Studio. What capabilities does it unlock for developers, and how does it bridge the gap between Google DeepMind's latest research and Google Cloud's tools? Want to go first, Logan?

Sure, I'm happy to dig in; it's a lot of fun to tell folks about AI Studio and everything we're doing. The high-level mental model is that AI Studio is intended to be the fast path to access the latest Gemini models, the latest generative AI models coming out of Google and specifically out of DeepMind. If you're a developer and you want to see what's possible with the Gemini models, and, Paige, I know you and I do this all the time, you have some idea of what you want to build with AI, then AI Studio is a great place to just show up and test: can the model actually do this thing, does it have this capability baked in? Then, with a single click, you can grab a bunch of code and actually start building that idea in Python or JavaScript or whatever your language of choice is. You can also get an API key and start building with the API really quickly. So it's really the path to accelerate from idea to actually building something with Gemini.

Absolutely. And I know that just recently we've been focused on unifying the SDKs, so that when you hit that "get code" button it's a really seamless path from AI Studio code to code that works with Vertex AI. It's been really cool to see. Mat, is there anything you want to add?

That's what I was going to say. People often ask why we have two different products, Vertex AI and AI Studio, and Warren is going to talk about Vertex soon. We want a unified developer experience, like you both said: you shouldn't have to choose one or the other before you start coding; you can just pick an SDK and work with it. But there is value in having two different products for two different types of use case. AI Studio is very simple: it's not an MLOps platform, it doesn't have a model garden with many models, it's just Google's models, and it's the simplest getting-started experience. Vertex AI is an enterprise product, and it has a bunch of capabilities that AI Studio does not have. So there is value in having these two different things, as long as we have a unified developer experience so you don't have to write your code twice, and we are doing that.

Excellent. And people can try it out today at ai.dev, which is the coolest URL I have seen recently, so I'm very excited.
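For readers following along outside the livestream, the "grab some code and start building" path Logan describes typically boils down to a few lines. This is a minimal sketch, assuming the google-genai Python SDK and an API key created in AI Studio; the model name and prompt are illustrative:

```python
from google import genai

# Create a client with a key generated in AI Studio (placeholder shown here).
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash",  # illustrative model name
    contents="Explain the attention mechanism in transformers in two sentences.",
)
print(response.text)
```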
The second question is for Kieran: how has prompt engineering evolved since the early days of large language models, for example PaLM 2, GPT-3, and the like, and how do you think it will evolve over time, especially given the rise in popularity of multimodal models? You've been doing this for quite some time, I believe.

This is a fascinating question. I think it's fascinating to look back at the evolution not just of prompt engineering but of language models themselves. If you look at how language models began, they weren't specifically designed to do the tasks we're using them for now; they were designed as probabilistic models able to predict the next token or word in a sentence. In the early days we got these models as tools, and there was a lot of trial and error to find techniques that worked well with them and enabled them to do useful things. That led to things like chain of thought, few-shot prompting, and adding personas: techniques that were essentially discovered rather than designed. In parallel, we put a huge amount of work into instruction tuning the models, which is really us teaching the models how to behave in certain scenarios. This has progressed really rapidly, and it's moving toward a future where the techniques are designed and intentional rather than discovered.

As a team within Google, it's our job to make LLMs as intuitive and easy to use as possible. When I think about prompt engineering, I think of two sets of user personas. First there are end users: folks using the Gemini app to answer questions on a daily basis. They may not be experts in prompting, and what we're really trying to do there is get to a state where no prompt engineering is required at all; the model should just be able to answer a user's reasonable request, and if something isn't clear, the app should ask for clarification or offer multiple suggestions. Then there are people trying to build applications with LLMs, and there what we're trying to do is make the process as simple and easy as possible. If Gemini doesn't understand something in your prompt, it should be asking you about that while you're developing your application rather than after you've deployed it. In parallel, we're coming up with design patterns, much the same way programming languages evolved good design patterns for building applications; I see the same thing happening with LLMs. You can imagine guidebooks that let you take an example almost off the shelf and customize it to your specific application.

When we talk about multimodal, I don't think anything should really change in terms of prompting techniques. One thing is that we're a little earlier in our journey for multimodal prompts than we are for text, so maybe we're still understanding where the gaps are. But it's actually really cool to try prompting the Gemini app by voice, because what you see, or certainly what I see, is really impressive performance despite the prompts being less well written than they are in text. If I'm talking to the Gemini app, my speech is full of "um"s and I might go back and correct my sentence; I wouldn't do that if I were typing a prompt. And yet what you get back is a very coherent, fluent interaction.
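As an illustration of the kind of "discovered" technique Kieran mentions, a few-shot prompt simply shows the model a handful of worked examples before the real input. A minimal sketch, again assuming the google-genai SDK; the task, labels, and reviews are made up for illustration:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Three worked examples, then the input we actually want classified.
few_shot_prompt = """Classify the sentiment of each review as POSITIVE, NEGATIVE, or NEUTRAL.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: POSITIVE

Review: "It stopped charging after a week."
Sentiment: NEGATIVE

Review: "The keynote starts at 9am."
Sentiment: NEUTRAL

Review: "Setup was painless and support answered in minutes."
Sentiment:"""

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=few_shot_prompt,
)
print(response.text)  # expected to follow the pattern, e.g. "POSITIVE"
```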
Excellent. I love the forecast that in the future, models will ask clarifying questions or pull in additional context to support users' requests and make sure they produce the best possible outputs. It sounds like a brave new world, and I'm very excited for it.

The next question is all about Vertex AI, which is rapidly evolving with new features and capabilities. Looking ahead, what are some of the key emerging trends in enterprise AI that you're most excited about, and how is Vertex AI positioned to help businesses leverage them for real-world impact? Warren and Irina, do you want to answer?

Sure, I'll go first; Irina, feel free to fill in. When I look at where we were and where we're going, talking to customers, we've been in this large space of business process automation: how do I understand unstructured data, how do I pull it out and do things with it? What we're seeing now, playing with agentic frameworks and some of the latest models last week, is a move toward being able to understand and reason, to do deep-research-type work. It goes way beyond "I just want to take this unstructured data and understand what it is" to "can I do comparative analysis?" When I was playing with the agent, the model was actually smart enough to know that it didn't have the latest information, so it invoked a tool to go out on the web, find the latest information, and bring it into the analysis I was doing. It becomes much easier to do this kind of deep analysis and understanding of the data in the world. I see a ton of that going on in enterprises: they're really getting beyond just being able to search across a bunch of unstructured data and figure out what to do with it. Irina, do you want to fill in?

I think that sounds good.

Great, excellent. And I know the question around evals, and how to adequately evaluate these agentic workflows, is something that many companies are thinking about and starting to pioneer, so I'm excited to hear more about Irina's work in evals later on as well.
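The behavior Warren describes, where the model notices it lacks information and invokes a tool to fetch it, is what function calling (tool use) enables. Here is a minimal sketch, assuming the google-genai SDK's automatic function calling, in which the SDK runs your Python function and feeds the result back to the model; get_stock_price and its return value are hypothetical stand-ins for whatever data source your application exposes:

```python
from google import genai
from google.genai import types

def get_stock_price(ticker: str) -> dict:
    """Hypothetical lookup for the latest price of a stock ticker."""
    # In a real application this would call a market-data service.
    return {"ticker": ticker, "price": 123.45}

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What is ACME trading at right now, and is that high or low?",
    # Passing the Python function as a tool lets the model decide to call it.
    config=types.GenerateContentConfig(tools=[get_stock_price]),
)
print(response.text)
```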
The next question is for everyone on the call. Looking ahead one to three years, what specific task or capability do you believe foundational models will unlock that seems challenging or impossible today? Conversely, what inherent limitations do you think will persist despite all of the progress we've made over the last couple of years, and especially in the last several months? Given that we have a lot of folks on the call, let's keep it brief, around two to three sentences. To start, Mat, do you want to go first?

Sure. Three years in AI is a hundred years in human life, so that's quite a question. I think the way we build software and the way we use software will radically change, in ways that people are still not realizing. I would summarize it as that.

And Logan, what about you? What do you think will be different in the next one to three years?

I think the whole code generation and software engineering story is going to continue to take off, but the thing that's most interesting to me is the context of what's not going to happen. I think the models are still going to be bad if you don't give them enough context; I don't feel like that's going to be a solved problem. If you just ask a really basic question and expect the model to do some miraculous thing for you, I think you're still going to be disappointed. In a few years, maybe the models will be smart enough to know they have to ask a bunch of follow-up questions, but that piece is still going to persist.

Absolutely. It's just like today: if you ask a person to solve a task or do something for you, you still need to give them enough background information to be successful and to point them in the right direction. Irina, do you want to go next?

Mine is similar, from the eval lens. I think we will see fewer issues with how to automate eval flows, but we will still be at the point where, if you don't give the model your criteria, what you care about, and what you define as good, there is just no way for it to know. That can be provided in context or somewhere else, but you will still need to provide it somehow.

Absolutely. And Kieran?

I might bring it back to the previous question and say that in a few years I hope prompt engineering will be a thing of the past, and I'd expect it to be. But like everyone else says, the models will never be able to read your mind, and people will have to get more into the mindset of being super clear and specific about their intention.

And Warren?

I think this exists today, but it gets worse over time: enterprises especially are not built around the ability to move super quickly, and anything that happened six months ago was two generations ago. So the ability to keep up, to design the way you roll out software and put it in production so that you can actually do it in an agile way that lets you adopt the latest and greatest, I think that's one of the big issues folks are facing now and will continue to face as we move forward.

Excellent. And thank you so much to everyone who's dialed in this week to learn more about what's being built and how to incorporate it into your businesses, your work, or your personal projects. That is one of the first steps you can take to make sure you stay up to date, so that as all of these great new features get released, you're on top of them. Good work to everybody dialed in today and learning through the Kaggle generative AI intensive course.

The next question is our first community question, from donna.oie.
Thank you for adding this on Discord. How can AI system design and prompt engineering be optimized to improve energy efficiency and computational performance, things like speed and accuracy, and reduce environmental impact while maintaining output quality? Logan, would you like to take a stab at this one?

This is an interesting question, and I think there are a bunch of different dimensions to it. One is that, at some layer, different API services have ways to help with this; prompt caching is an example. Another angle is batch APIs: you can take the fixed cost of having services running all the time, find times when there's idle compute just sitting around, and flatten things out so that compute actually gets used. From the energy-efficiency standpoint, you could imagine a world, and this doesn't exist yet, where you say, "I'm a batch API customer and I care a lot about environmental impact, so try to run my batches at times when data centers are more likely to be using renewable energy," or something like that. So I think there are a lot of degrees of freedom and dimensions in which you could do this.

Absolutely. And I know Google has been investing a lot in smaller models, especially smaller open models, and that can help reduce the environmental footprint as well, if you're using something on-device as opposed to pinging a server running a larger model. It's been really inspiring to see the creative ways the community is rallying around making these workloads more efficient.

Next question, from Channing Ogden, again from Discord: how can judge models be selected and customized efficiently to minimize compounded bias and accuracy issues, building confidence without relying solely on repeated evals? Irina, do you want to take this one?

Yes, that's a great question and something I'm thinking about a lot: how do you actually build trust in the judge models and autoraters that you use? I think about three areas here. First, you need a good foundation: do I use a state-of-the-art model as a judge, and do I have some core techniques in place to mitigate bias? A couple of examples: multi-sampling, so not relying on the first answer, and flipping, because you need to flip the order of candidates; these models show a position bias and prefer the first or the second answer in a pairwise comparison. That's the first step. Second, evaluate your judge. It's yet another LLM application, so you want to test it. This doesn't have to be a lot of data, but you want some basic understanding of whether there is alignment between your human experts, or your own judgment, and what the judge says on your specific data. Third, if you don't see that alignment, start with prompt engineering. We already spoke about this a lot today, but are the criteria you give the judge clear and specific? I often see hidden criteria: users implicitly consider something, but they don't explicitly state it in the autorater prompt. As Kieran said, maybe in the future models will ask clarifying questions, but they don't do that today, so this is something you have to think about. Then, after you've done all this, if refining the prompt doesn't close the alignment gap between you and the judge, you can explore more advanced techniques. Maybe the dataset-level criteria you're working with just don't fit your dataset because it's too diverse, and you need something like per-item rubrics instead of universal criteria; you might even consider fine-tuning your judge if your dataset is very specific. But again, I would start with prompt engineering if you see misalignment between you and the judge model.
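The order-flipping trick Irina describes for countering position bias can be implemented in a few lines. A minimal sketch, assuming the google-genai SDK; the judge prompt, model name, and tie-breaking rule are illustrative choices rather than a prescribed recipe:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
JUDGE_MODEL = "gemini-2.0-flash"  # illustrative; use a strong model as the judge

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {a}
Answer B: {b}
Reply with exactly one word: "A", "B", or "TIE"."""

def judge_once(question: str, a: str, b: str) -> str:
    resp = client.models.generate_content(
        model=JUDGE_MODEL,
        contents=JUDGE_PROMPT.format(question=question, a=a, b=b),
    )
    return resp.text.strip().upper()

def judge_with_flip(question: str, answer_1: str, answer_2: str) -> str:
    # Run both orderings so a position bias cannot decide the outcome.
    first = judge_once(question, answer_1, answer_2)   # answer_1 shown as A
    second = judge_once(question, answer_2, answer_1)  # answer_1 shown as B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "TIE"  # disagreement across orderings is treated as a tie
```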
Excellent. I feel like we all just got a crash course in evaluating and perfecting judge models; this is wonderful, and I hope we get to cover a bit more of it later in the week.

Next question, from sarar42: is writing the prompt appropriately "prompt engineering," or is setting the number of tokens, temperature, and top-p prompt engineering, or both? Anant, do you want to take a stab at answering this? I'd also love to hear everyone's initial thoughts on prompt engineering: what is it, why is it useful, and what exactly does it entail?

Maybe I'll give a quick overview and then the rest can take over. I see the prompt part of prompt engineering as the input, like the features in traditional ML models. The prompt is the input to the model, while the other parameters, the generation or decoding parameters like top-p and temperature, operate at the output level: with a fixed input, they modify how tokens are selected and sampled so that the response is optimized for the task. Anybody else want to take a stab at this, maybe Mat or Kieran?

Sure. I'll ask Kieran to cover his ears as I say this, but my opinion is that prompt engineering is a little bit of an art. It's a little bit of trial and error: let me test this with multiple models, let me test this with multiple temperature settings, can a smaller model do this job or do I need a larger model? A lot of these things I put in the bucket of prompt engineering, which is experimentation over and over again until you find the optimal way of working for your scenario.

I'd agree with the "art" point, at least for the minute, though I see a future where there's far less art involved. On this specific question, whether you call it prompt engineering or LLM engineering or something else, these are all things you have to think about as a developer. There are various knobs you can control, and you need to play with them a bit to build up an appreciation of the effects they're going to have. I'd expect that need to go down in the future as well. Temperature is really just a way of balancing creativity, increasing it so you don't get the same response every time, and that's a knob we expect to stay. But be clear on what these parameters are doing, and expect things to look a bit different in the future.
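To make the "knobs" concrete: the decoding parameters Anant and Kieran mention are set per request through the generation config. A minimal sketch, assuming the google-genai SDK; the specific values are illustrative starting points, not recommendations:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Suggest a name for a hiking app.",
    config=types.GenerateContentConfig(
        temperature=0.2,       # lower = more deterministic, higher = more diverse
        top_p=0.95,            # nucleus sampling cutoff over cumulative probability
        top_k=40,              # consider only the 40 most likely tokens per step
        max_output_tokens=64,  # hard truncation of the output, not a request for brevity
    ),
)
print(response.text)
```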
One thing to add: for those of you who haven't read the prompt engineering white paper yet, there's a section on automated prompt engineering, which crafts and refines a prompt for you and ties in with the evaluation topic, since it evaluates as it iterates. Have a look at that; it's very useful for your projects.

Absolutely. Next question: hallucination is a major challenge for large language models. Techniques like RAG and prompt engineering can help, but what are the most effective methods Google uses in Gemini 2.0 to minimize incorrect or misleading outputs? And are there trade-offs between reducing hallucinations and model creativity? Kieran, do you want to take a stab at this one?

Sure, absolutely. Reducing hallucinations, an area we also call factuality, is one of our key areas of focus. It was one of the issues that were very apparent in LLMs in the early days, it's critical to making them most useful going forward, and I think there has been a lot of progress over the last couple of years. When I think about hallucinations, I think in terms of two different framings. One is where Gemini is trying to provide an answer based on a piece of context that has been provided as input. That can be RAG, but it can also be answering questions over documents that you supply; and if you look at the generative experience in Google Search, that's what happens: Gemini will go and do a search and then summarize those results for you. In that framing it can produce its answer based entirely on knowledge that is being provided to it. The second framing is where Gemini is answering essentially from its training data; in a human analogy, it's like answering from your education or your memory. The first is always going to be inherently a bit easier and probably a safer bet: it's a lot easier for me to verify an answer if I'm looking at source materials in front of me than if I'm having to trawl through some distant memory of facts.

So if you're trying to reduce hallucinations, finding ways to ground the model to the input is a very good strategy, and Vertex AI offers search grounding as an option on the API. What you're really doing there is getting verifiable references that justify the answers coming out. Beyond that, you can be explicit in your prompt about what your requirements are, which may help, and you can add explicit verification steps at the end if you want to be really mindful about factuality. LLMs are also very good at verifying answers and checking whether a response meets certain criteria, so you can try that strategy, maybe combined with self-correction, to go back and fix any errors you see.

In terms of the balance between factuality and creativity, there is a trade-off: it's difficult to write a very creative poem that is grounded in real fact. But coming back to the earlier point about the temperature knob: when you increase the temperature, you make it more likely for the model to pick a less probable token, because these are probabilistic models. So when you're exploring factuality you can play it safe and turn the temperature right down to zero, and you can also look empirically at the results you get back as you increase it, to see what the right balance is for your use case.
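As a concrete illustration of the grounding option Kieran mentions, here is a minimal sketch of grounding a request with Google Search, assuming the google-genai SDK's search tool; treat the field names as a sketch rather than a reference, since they may differ slightly across SDK versions:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What changed in the most recent Gemini model release?",
    # Attach the Google Search tool so the answer can be grounded in web results.
    config=types.GenerateContentConfig(
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
# If grounding was used, the candidate carries metadata about the sources consulted.
print(response.candidates[0].grounding_metadata)
```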
Absolutely. Just as Warren mentioned a little earlier, I love all of the knobs and dials you can use to ground models and reduce the likelihood of hallucinations, things like search grounding or code execution. These are all wonderful methods for getting the right kinds of outputs from the models we have available today.

And with that, thank you so much to all of our expert guests for coming today, for sharing a little more about what you're working on, and for answering all of our great community questions. We really appreciate you and your time, and I'm looking forward to seeing what you all build and release over the next several months. This has been excellent; thank you.

Wonderful. And now, one of the most exciting parts of the Kaggle generative AI intensive course, at least to me: I'm delighted to welcome to the stage Anant, who is going to go over our codelabs and demos, as well as give a brief overview of some of the content you all should have been learning about over the last 24 hours. Take it away, Anant.

Thanks, Paige. Hi everyone. We're aware that some of you are joining directly for the livestream, so I'm going to give a quick overview of the white papers and then move on to the codelabs, to give you a flavor of what you learned, or will learn if you haven't had a chance to dive in yet.

Let's start with the white paper overview. We had two white papers: one on foundational models and one on prompt engineering. In the foundational models paper, we first looked at how these AI models, which often use a Transformer architecture, leverage the attention mechanism to train on large corpora and work with large contexts. We also looked at architectural variations like mixture of experts and how they can improve efficiency and quality, and we saw the rapid evolution through different models, from BERT and PaLM all the way to Gemini, along with several open models, driven by scaling data and model size. Next, we looked at how these models are trained and adapted. They start with pre-training on vast datasets to build a general understanding of the modality they're trained on; then, to ensure they follow the instructions you provide, they go through a supervised fine-tuning stage to improve instruction following and make them more task-specific; and lastly, we looked at how reinforcement learning from human feedback, also called RLHF, aligns the outputs with human preferences for things like helpfulness and safety. We then noted that tuning these models for downstream tasks is quite expensive, so we use various parameter-efficient methods that make tuning much cheaper.

In the second paper, on prompt engineering, which is a core part of what we discussed earlier in the Q&A session, we looked at the various prompting techniques, starting from the generation or sampling parameters such as temperature, which controls the randomness and diversity of the output, along with top-p and top-k, and how it's crucial to balance predictability and creativity depending on the task you're using them for.
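To make the temperature and top-k discussion concrete, here is a toy, self-contained sketch (plain NumPy rather than the Gemini API) of how a next token might be sampled from a set of scores; the vocabulary and logits are made up purely for illustration:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Toy next-token sampler illustrating what temperature and top-k do."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if top_k is not None:
        # Keep only the top_k highest-scoring tokens; mask out the rest.
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    # Temperature rescales the logits: <1 sharpens the distribution
    # (more deterministic), >1 flattens it (more diverse).
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores over a 4-token vocabulary
print(sample_next_token(toy_logits, temperature=0.1))           # almost always token 0
print(sample_next_token(toy_logits, temperature=2.0, top_k=3))  # varied, never token 3
```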
We also looked at the prompting techniques themselves, from basic zero-shot prompting, where you simply ask the LLM directly, to providing examples to the LLM to make it more effective at your task. We saw that structuring prompts with system instructions, task-relevant context, or an assigned role can also help guide the model, and then, for more complex problems that involve reasoning, we covered chain-of-thought prompting, step-back prompting, and other techniques that enhance complex problem solving. Finally, the foundational models paper looked at how to optimize models for speed and efficiency, as discussed in one of the Q&A questions, with techniques ranging from quantization and distillation to speculative decoding, which make models faster and cheaper; I'd recommend listening to the white paper podcast to learn more about them. And we concluded by looking at evaluation and best practices for evaluating your models, using other LLMs as well as simpler techniques.

Now that we have an overview of the reading assignments, let's dive straight in. I'm going to share my screen for the codelab; I hope everybody can see it. The first codelab you all received dives into prompt engineering and how you can utilize these models for your applications. The first part simply uses Gemini 2.0 Flash, one of our models with the best balance of performance and speed, to do simple prompting. Then we move from single-turn prompting, where we give an input and get an output, to a multi-turn structure, where we can have a conversation with the model and the API automatically retains previous context to inform further responses, which is what you want for conversational applications. We also looked at the other model options you can use from the API beyond 2.0 Flash; you can even use your own tuned models by specifying them in your API call.
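The multi-turn flow described above might look like the following minimal sketch, assuming the google-genai SDK's chat helper; the prompts are illustrative:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
chat = client.chats.create(model="gemini-2.0-flash")

# The chat object carries the accumulated history into each new turn,
# so the second question can rely on the first answer's context.
first = chat.send_message("Give me three project ideas that use text embeddings.")
print(first.text)

second = chat.send_message("Expand the second idea into a short project plan.")
print(second.text)
```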
The next part of the codelab, which you'll work through yourselves, is how to modify the various generation parameters, and you'll get to play around with some of them. For instance, output length: this limits the number of tokens the model generates. Keep in mind it's a hard limit; it does not tell the model to produce a more concise response, it simply truncates the output. So if you ask the model for a thousand-word essay and give it a very short maximum response length, it will just cut the essay off abruptly. Keep that in mind, but it can help you control costs. Then we looked at something we talked about earlier, which Kieran also mentioned: temperature, which controls the degree of randomness in token selection. Long story short, a higher temperature leads to more diverse outputs, while a lower temperature leads to more predictable, deterministic outputs. We looked at an example where, when asked to pick a random color, the model picks different colors across iterations when the temperature is higher, while at a lower temperature it tends to settle on a single color. We also played with knobs like top-p and top-k, which likewise constrain the tokens the model selects at decoding time; these help you choose the amount of randomness and determinism in your application.

Toward the last section of the first codelab we revisited the prompting techniques discussed earlier: zero-shot, where we give the model a task and ask for a response; constraining the response, so that instead of free text the model must answer from an enumerated set; one-shot and few-shot, where we provide input-output examples to help it perform better on certain tasks; JSON mode, where we give the model a structure for its output so it can be easily parsed by downstream systems; and then chain-of-thought reasoning and other techniques, including ReAct. That was the first codelab.

Moving on to the second codelab, which we're also running for the first time this year: it covers evaluation and structured output. As we learned in the podcast and in our discussion earlier, evaluating responses is crucial for your tasks. In this codelab we set up some context, a PDF of one of our technical reports, and use an LLM as an evaluator. In the first part we see how to write a pointwise evaluation prompt: we carefully craft a prompt for an LLM to evaluate another LLM's response with respect to the original request, and it scores the response from one to five against a carefully crafted rubric we define. We see that it returns reasoning as well as a score. Then we play around to see what raises or lowers the score and what makes a good prompt: for example, explaining our technical report to a five-year-old is obviously going to earn a lower score from the evaluator, because the response talks about puppies, which are not present in our technical report; it's a good way to see the difference, and we get the expected score of one.

In the last part of the codelab we discuss techniques for using evaluation in practice. For instance, we look at pointwise evaluation for summarization, where we take a prompt and the response it generated and have an LLM evaluator assign a score on a scale of one to five. However, because a one-to-five scale is not very fine-grained, there can be ties between responses, and this is where pointwise evaluation sometimes does not suffice.
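A pointwise, rubric-based evaluation of the kind described above pairs naturally with structured output, so the score can be parsed reliably downstream. A minimal sketch, assuming the google-genai SDK's schema-constrained JSON output and a Pydantic model; the rubric wording and the placeholder texts are illustrative:

```python
from pydantic import BaseModel
from google import genai
from google.genai import types

class Rating(BaseModel):
    score: int       # 1 (poor) to 5 (excellent)
    reasoning: str   # short justification for the score

EVAL_PROMPT = """You are grading a summary against the source document.
Rubric: 5 = faithful and complete, 3 = partially correct, 1 = misleading or off-topic.

Source document:
{document}

Summary to grade:
{summary}
"""

document_text = "Placeholder source document text."   # the material being summarized
candidate_summary = "Placeholder summary to grade."   # the model output being evaluated

client = genai.Client(api_key="YOUR_API_KEY")
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=EVAL_PROMPT.format(document=document_text, summary=candidate_summary),
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=Rating,  # ask for JSON matching the Rating schema
    ),
)
rating = Rating.model_validate_json(response.text)
print(rating.score, rating.reasoning)
```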
It's therefore often useful to have pairwise evaluation as well, where we give two responses to a particular prompt and have the LLM-as-judge select the better one; we see this in the codelab too. And we end the codelab by showing how, as was mentioned earlier in the Q&A, it's often useful to generate multiple responses from the LLM judge instead of taking just the first one, to ensure that the score you get is not simply due to noise and stochasticity. So yes, that was a lot to cover, but over to you, Paige, for the pop quiz.

Excellent, thank you so much, Anant, and thanks to Mark McDonald, who helped create many of these codelabs. You've done a great job of making the concepts discussed in the white papers and in the Q&A sessions real, and I'm really looking forward to seeing what people build and hearing their initial impressions of the codelabs. Very cool; thank you.

With that, I'm going to move into the pop quiz. I'll bring up on screen some questions that hopefully folks can answer now that you've had this 24-hour crash course in prompt engineering and foundational models.

First question: which Gemini configuration setting controls the degree of randomness in the selection of the next predicted token? Is it (A) temperature, (B) top-K, (C) top-P, or (D) the output token count? Think back to the white paper, as well as some of the questions Kieran was answering today. Hopefully everybody has had a chance to jot this down, whether with pen and paper or in a doc. The answer is A, temperature; hopefully everyone got this right.

Question two: which of the following is not a technique used to accelerate inference in large language models? Is it (A) quantization, (B) distillation, (C) flash attention, or (D) fine-tuning? Again, you're looking for what is not a technique to accelerate inference. I'll count down: five, four, three, two, one. The correct answer is D, fine-tuning; fine-tuning is not a technique used to accelerate inference.

Question three: which of the following is a unique characteristic of the Gemini family of large language models? (A) Gemini models were the first to introduce the concept of unsupervised pre-training; (B) Gemini models can support multimodal inputs; (C) Gemini models are decoder-only; or (D) Gemini models can support a context window of up to 2 million tokens. I'll count down: five, four, three, two, one. The correct answer is D: Gemini models can support a context window of up to 2 million tokens. If you've been experimenting with some of our Pro models in Google AI Studio at ai.dev, you should have experienced this firsthand.

Question four: how does reinforcement learning from human feedback (RLHF) improve large language models? Is it (A) by training the model on a massive dataset of unlabeled text; (B) by using a reward model to incentivize the generation of human-preferred responses; (C) by reducing the number of parameters in the model for faster inference; or (D) by converting the model into a recurrent neural network for improved performance?
How does RLHF improve large language models? Hopefully everybody remembers this from the white papers, as well as from some of the questions we answered today. I'll count down: five, four, three, two, one. The correct answer is B: RLHF uses a reward model to incentivize the generation of human-preferred responses.

Question five: which technique enhances an LLM's reasoning abilities by prompting it to produce intermediate reasoning steps, leading to more accurate answers? Is it (A) zero-shot prompting, (B) step-back prompting, (C) self-consistency prompting, or (D) chain-of-thought prompting? Hopefully everyone is thinking back to the white papers, and hopefully you've also been experimenting with this technique while playing with the Gemini models in AI Studio and in your Kaggle codelabs. The answer is D, chain-of-thought prompting: it has the model show intermediate reasoning steps, which often leads to more accurate answers. Thank you to everyone for participating today; hopefully you all scored 100% on your pop quiz, and if you didn't, you learned something along the way, so thank you for testing your new knowledge and skills.

And the last question, an extra bonus question for which you should have read the white papers: what is the minimum GPU memory needed for inference on a three-billion-parameter model using standard float precision? Is it 3 gigabytes, 6, 12, or 24? The correct answer is 12: at standard 32-bit float precision each parameter takes 4 bytes, so 3 billion parameters require roughly 3 × 4 = 12 gigabytes just to hold the weights. Hopefully everyone got this right, and if not, refer back to the white paper to learn more.

So thank you so much again to all of the folks working on our Q&As, the Kaggle team, the Google DeepMind team, and the organizing team; a huge round of applause for everybody involved. Thank you so much, and we look forward to seeing you tomorrow.