It is my amazing privilege, and I will have to read this a little bit because it is amazing and I don't want to mess it up: our speaker is an assistant professor in the School of Information at UC Berkeley, go Bears. She studies human-computer interaction, with her research spanning education, healthcare, and restorative justice, which is an amazing breadth. Her research interests are social computing, human-centered AI, and more broadly human-computer interaction. Her work has been published and received awards in the premier venues, and our speaker gave me all the acronyms, but I'm going to spell them out because they're so impressive and it's important: the Association for Computing Machinery Conference on Human Factors in Computing Systems, ACM CHI; the Conference on Computer-Supported Cooperative Work, CSCW; and Empirical Methods in Natural Language Processing, EMNLP. These are amazing; I did some reading on this, and I encourage you to look them up, absolutely amazing. And if that wasn't enough, she's been covered in, you know, little things like VentureBeat, Wired, and The Guardian, you know, those ones that are a little harder to remember. She is a William T. Grant Foundation scholar for her work on promoting equity in student assignment algorithms, and a member of the generative AI advisory board of this tiny little startup named NVIDIA; I'm not sure if anyone's ever heard of them. And even though we recognize that, you know, she goes to Cal now, which is a good thing, and she teaches at Cal, she has her PhD from Stanford. So please join me in welcoming Dr. Niloufar Salehi.

Thank you, Jerry, that was very kind and very funny. Thank you all for spending your afternoon with me. We saw some really cool demos, and that's really what I'm going to talk about. I'll take a step back, as academics, that's what we like to do, and think a little more broadly about what's going on here, what's happened so far, and where we are headed. What I'm going to talk about is large language models, which we've all tried in the roughly two years since ChatGPT came out, and what they mean for work and what they can mean for product management. So, like Jerry said, my name is Niloufar Salehi, I go by Niloufar. I've been a professor at UC Berkeley since 2018, and I've been doing a lot of work on thinking about human-centered AI and what that means. I'm going to come back to this quote, but it is something that has been stuck in my mind since the first time I read it. It's a quote by Russell Ackoff: "Managers are not confronted with problems that are independent of each other, but with dynamic situations that consist of complex systems of changing problems that interact with each other. I call such situations messes." This will be very familiar to anyone who's done product: you're always dealing with messes, and we're going to talk about what AI can mean for that mess. There are some estimates that about 80% of data in organizations is messy, unstructured, and rarely used, and product people are expected to always know what that data is, what the trends are, and what we should be doing with it. So we'll get into that, but first, this is what we're talking about. I really like this XKCD comic because I think it sets the stage; whenever I'm talking about AI I always bring it up. There's this guy who says, this is your machine learning system?
And the other one says, yeah, you pour the data into this big pile of linear algebra and then collect the answers on the other side. The big question is, what if the answers are wrong? And the person says, well, you know, just stir the pile until they start to look right. That's what we're dealing with at the most basic level: linear algebra that you can keep stirring until it starts to look right. But it becomes profoundly powerful as it starts to get things right, and that's something we have to be thinking about. This is something that people across different kinds of organizations, and even venture capital, are starting to look at as well. There was an article that came out about the second wave of AI, which is less about generating new stuff and more about synthesizing what we already have. I think that is really important, and something we will see more and more of as we move past the initial stage, when we were just starting to learn what models can do, and into the stage where they get really deeply embedded in everyday workflows. I believe we are going from the internet era, where connectivity and information sharing led to exponential growth of information and of what we could do with it, to a new era where it's not just that we have unlimited amounts of information, but we finally have the tools to process and synthesize that information. That's why I think everything is going to change, but it's also going to change really slowly and in really weird ways, and it's going to take a lot of work to get there.

So I'm going to share one project that my students and I have worked on at UC Berkeley, and I'm going to turn this into more of a conversation than me just presenting the work. About a year ago we started to look into what it would look like if we used AI more for data synthesis than just for generation: what if we took these models, and all this unstructured, messy data, and tried to use the models to make sense of that data? I'll pause here and see if anyone's tried to do that. Show of hands if you've tried to upload any of your own data or run a model over any of your own data. All right, about half the room, and we saw some demos on how to do that today as well; you can go try them out. But I want to hear from the people who have tried it: what has your experience been like? What has gone well, and what has not worked?

Audience member: It's always the structure. You have a data set that's not structured in a way the model can use.

Okay, so the model needs to know what the structure of this information is, otherwise it doesn't understand it and the output doesn't make sense. What else? Could you say more?

Audience member: [inaudible]

So you're expecting magic, but it's not really working out, so you have to iterate a lot. How do you know it's not working? The output doesn't make sense. What else? What have other people's experiences been?

Audience member: Setting up a testing framework to answer that question turns out to be the hardest question to answer, because you're literally asking fundamental product questions in trying to evaluate the effectiveness of the tool.

So evaluation is really hard.

Audience member: The answers from run to run are different, and sometimes that's not what you were expecting. Say I have some data and I ask some questions; sometimes it gives me the right answer, and sometimes I ask the same question and get something different.
Right, that's a really good point. Because these models are non-deterministic, or stochastic, they don't necessarily do the same thing again, and I think that is a really big key to how this will all play out. I think we're moving past the era of software where the software was always deterministic, where you knew exactly what you would get out and running it again would give you the same thing, into an era where a lot of things are going to be really noisy and evaluation is going to be really hard. A colleague of mine, Joey Gonzalez, and another colleague, Joe Hellerstein, started a company called RunLLM, and about a year ago they wrote blog posts saying, you know, everyone is evaluating their LLMs off of vibes and this can't work; we have to have some kind of benchmarking here. A lot of people have tried to do LLM evaluations and benchmarking. Just a week ago I saw a blog post come out of their company, and the title was "In Defense of Vibe-Based Evaluations." After a year of trying to create benchmarks, they had realized that even if they put a lot of effort into a good benchmark, it still didn't mean their end users felt it was a better LLM. What the end users were doing was going in, asking a few questions, and seeing how it looks and how it feels, just the vibes. The blog post was saying, if that's how the users are evaluating it, then maybe that's not so bad; maybe we should be thinking more about that. So I think there are some fundamental things that are going to change. Yes?

Audience member: Another big problem is that you may get situations where 95% of it is accurate and 5% is not, and if you don't really understand the domain it's easy to think it's 100% accurate, because you're seeing a bunch of things that look accurate, and detecting which parts are not is hard. Hallucinations are another problem. So the veracity of what you receive, and whether you can fully trust it, really matters, and in some domains you can't afford 85% confidence, or even 90%; you need 98, 99, 100.

That's another good problem, and that's exactly true. Even if I know that I'm correct 99% of the time, I don't know where that 1% is, and that's another really big problem, because our brains are just not used to tools that work that way. If I drive a car around LA and it's working pretty well, I expect that if I come to San Francisco it'll work just as well. That's how our brains function and how we think about tools, and now we're entering an era where, for example, I really don't know why these models keep using the phrase "in the realm of" something. Whenever I see an email with "in the realm" in it, I know it's AI generated, because no person uses that phrase, but I don't know why the models keep using it. So it's going to be weird, and we're not going to be able to understand why.
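One very basic check that follows from this, sketched below, is simply to ask a model the same question several times and see how much the answers move around; this is the kind of thing the vibe-versus-benchmark debate is about. This is only an illustration: `ask_model` is a stand-in for whatever model API you actually use, not any particular vendor's call.

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Stand-in for a call to whatever LLM API you use.
    With a nonzero sampling temperature, repeated calls can
    return different answers to the same prompt."""
    raise NotImplementedError("wire this up to your own model")

def consistency_check(prompt: str, runs: int = 10) -> None:
    """Ask the same question several times and report how often
    each distinct answer comes back."""
    answers = [ask_model(prompt).strip() for _ in range(runs)]
    counts = Counter(answers)
    for answer, n in counts.most_common():
        print(f"{n}/{runs} runs: {answer[:80]}")

# Example (hypothetical): if the counts are spread across several
# different answers, a single spot-check will not tell you which
# one a user will actually see.
# consistency_check("What are the top three pain points in these reviews?")
```

A check like this doesn't tell you whether any of the answers is correct, only how stable the system is, which is one reason evaluation stays hard even with tooling.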
We have a lot of experience with this. About four years ago I started working on models used in really high-stakes settings, because I was thinking, if these models are going to keep getting better and more embedded, we have to know how to make them reliable. So I thought to myself, let's take it to the extreme: what is a setting in which you absolutely cannot be wrong? And I thought, well, hospital settings, because if it's life or death, then you absolutely cannot be wrong, no matter how good the model is. So we started looking at language models, early language models, translation models, used in hospital settings, and one thing we learned was that Google Translate was widely used in hospitals across California. So we started this study. Let's actually try it out with this group. Say you're a doctor, and your patient comes in, and you want to tell your patient: hold your kidney medicine until you've had a chance to speak to your kidney doctor. You just got their lab results back and you have to give them this piece of information. Now your patient comes in and they don't speak English; they only speak Mandarin. You have a choice here: put that information into Google Translate and hand it to them, or, you know, go try to find a Chinese interpreter. Let's do a show of hands: if you were that doctor, would you give the Google Translate output to your patient? Show of hands if that's a yes. Okay, I think about 30%. And what if it's a no? Okay, 70%. This is really interesting for me, because I usually talk about these things to a technical audience, and when I speak to a technical audience almost no one raises their hand for a yes; everyone is an absolute no. But when I go and talk to doctors, because a lot of my collaborators are doctors, all of them do this routinely. So I always use this as an example to say that the doctors are being super practical, and the technologists are thinking about it in this zeros-and-ones way, and it's interesting to see that product people fall somewhere in the middle, where we understand the tech and yet we know the reasons why we need to be practical as well. Yes?

Audience member: Have you had a doctor put something like that into Google Translate and then had somebody who speaks Chinese translate it back into English for you?

My colleague Dr. Elaine Khoong, who's at UCSF, has a lot of Chinese patients and she does this all the time; she puts all her notes into Google Translate. A lot of the time the patient comes in with someone who may know some English, so yes, they do that. But we did run an experiment where we actually tried using Google Translate to back-translate, and I'll talk a little more about that later, but yes, sometimes people do things like that.

Audience member: [inaudible]

Yes, liability is a big issue, absolutely.

Audience member: My question was about how accurate the translation might be. I personally wouldn't directly hand over the translation; I would translate it, then translate it back into English to see what came back, and if it seems reasonable, perhaps I would show the original translation.

Yes, so there's been a button in the middle of Google Translate for flipping it around, and some doctors do that and some do not. We actually found a hospital somewhere in California where the guidance to doctors was to do exactly that, to back-translate. But you're completely right that there is absolutely no written rule anywhere, and the liability lies completely with the physician who did it.

Audience member: That's a really good point, and that likely will work better. There's also another angle: the doctors depend on their nurses and their staff to actually provide the information, so even if they use Google Translate, it's dependent on the nurses to make the final translation for those patients, so the liability may be slightly lower in an actual hospital or clinic setting where nurses are much more involved.

Well, I guess in this scenario the liability would be with the person who gave the incorrect information.
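The back-translation idea that came up in this exchange can be sketched in a few lines. This is only an illustration: `translate` is a placeholder for whatever translation service is actually being used, not a real Google Translate API call.

```python
def translate(text: str, src: str, dst: str) -> str:
    """Placeholder for a call to a machine translation service."""
    raise NotImplementedError("wire this up to your translation service")

def back_translation_check(instruction_en: str, patient_lang: str = "zh") -> None:
    """Translate an English instruction into the patient's language,
    translate it back, and show both so a clinician can compare.
    This is the manual 'flip it around' check, not a safety guarantee."""
    forward = translate(instruction_en, src="en", dst=patient_lang)
    round_trip = translate(forward, src=patient_lang, dst="en")
    print("Original:        ", instruction_en)
    print("Back-translated: ", round_trip)

# back_translation_check(
#     "Hold your kidney medicine until you have spoken to your kidney doctor.")
```

Even a round trip that reads fine is an aid rather than a safeguard, since an error can survive the trip and still look plausible in English.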
The example that I gave you was sort of a trick question. It's one where Google Translate, at least up until 2021, was flipping the meaning. "Hold your kidney medicine until you've spoken to your doctor" was an actual sentence that someone was given in hospital discharge instructions; my colleague had a big database of things she had said to patients. We ran them through Google Translate and had someone who understands that language look at and rate the translations, and about 8% of the sentences were not only incorrect but incorrect in a way that could cause clinically significant harm. The sentence I read was one of those: it got translated to "keep your kidney medicine." This is an example of the sort of error that is very hard for us to identify, because we don't know when the model is going to be right and when it's going to be wrong, and even if I know it's wrong 8% of the time, that still doesn't help me on an individual basis when I'm working with these models.

So I think the next generation of tools that we use is going to be completely different, and everything about them is going to be different, and I think that means a lot for product managers. We're going to have tools that are stochastic, that are not always right, that you can't even really run an evaluation on, because each time you run it something different is going to happen. I think that is going to touch every aspect of the product. I think it's even going to touch pricing, because so far we're used to a per-license or per-seat pricing model, since the cost of running software was pretty much the same for each person who joined. But now we're at a point where it depends on how much they use it; the models are actually the thing that's really expensive. So we're seeing some startups do token-based pricing, and there's even one startup doing a really innovative thing with outcome-based pricing: they're saying, we'll help you answer your support tickets, and we'll take a percentage based on how many support tickets we close. I think we're going to see more of that, because we're going to see more and more tools that are involved in the actual doing of the work rather than just being a tool you use to do the work. And I think we're going to see less of the things that have become so standard, like those tables every company has, our features versus the competitor's features, with ticks in them: we give you 20 gigabytes of storage, the other company gives you 10. Those are a product of software that was very deterministic and very easy to measure, and we're going to see less and less of that. You know, ChatGPT and Gemini have been around for a while, and there isn't a single table that compares them like that; most people just go off of vibes, or they try them a few times. I've heard some people who really like Gemini and some people who really like ChatGPT, and I'm sure some people love Storytell. So I think we're going to see a whole new way of how people evaluate software, make decisions about what software to buy, and decide how to use it, and there are going to be really, really important questions about reliability here, because these are non-deterministic, stochastic systems. I think reliability is going to be the most important thing.
The way that I think about reliability is two things: accuracy and verifiability. So, going back to this project that we started about a year ago, this is me and my students. We started to think about what it would look like if we used models for what we believe models do really well, which is go through lots and lots of data and find patterns and insights, or even take actions on behalf of the person. I teach HCI at Cal, so a lot of my students end up going into product roles or UX research roles, so we said, let's try it out on something we do all the time, which is go through user data and try to understand what people's pain points are, or what we should be building next, those kinds of things. Now, I think conversational interfaces are great, and they come very naturally to people, but I don't think the future will be a lot of one-shot interactions with models. I think models are good at certain things, but a lot of the workflows that we see people actually do are going to take longer forms of engagement with multiple different models or different kinds of algorithms. So we said, okay, let's build three template workflows, and we call them widgets. What each widget does is take some unstructured data and solve a problem for a potential product manager. The first one is a Q&A widget; that one just lets you ask questions about your data. The second is a pain-point tracker. We said, let's say you have all of your data, say reviews from customers, and you want to know what the main pain points are and how those pain points have changed over time. So we made a widget that you can drag and drop over data, and it draws a graph of the pain points your customers mentioned and how they've changed. The third one we call the user insights widget. It takes a lot of qualitative data, and you can either ask it for specific things you want out of that data, or you can just say, find the main trends here, and it searches through all that data, pulls the trends out, and gives you a whiteboard with each trend being one of those squares, with sticky notes in it. So we thought of these three workflows as things we think would be really helpful for product managers, and I'm going to show you a video of the thing that we built. Mind you, this was a research project mostly coded by students. Can we start playing it? Okay, I think we need to get rid of that thing. Awesome, thank you.

So this is what we ended up building. There's a space here and you can pull in data. We ran this study with Notion reviews: we scraped App Store reviews for Notion with three stars and below to test the pain-point tracker widget. Each one of these data points is a review that someone wrote on the App Store for Notion, and the task here is that it's impossible for someone to read all of these, but what if there was a widget? The way you run it is that whatever data is in the middle of the canvas is what the widgets will be run on. You give it a name, and you can either give it specific pain points you want it to look for, or you say, find the top three. When you run a widget over data, it starts going through all the data, analyzing it with an algorithm, and the output is a spreadsheet with the top five pain points that everyone mentioned and how those pain points have changed over time.
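As a rough idea of what a widget like the pain-point tracker has to do under the hood, here is a minimal sketch, not the actual research code: classify each review against a set of pain points, here with a hypothetical `classify_review` call that would normally go out to a language model, and then count mentions per month to get the trend lines.

```python
from collections import defaultdict
from typing import Iterable

PAIN_POINTS = ["laggy and glitchy on iPad", "lack of intuitive design"]

def classify_review(text: str, labels: list[str]) -> list[str]:
    """Placeholder for an LLM call that returns which of the given
    pain points (if any) this review mentions."""
    raise NotImplementedError("wire this up to your own model")

def pain_point_trends(reviews: Iterable[dict]) -> dict:
    """reviews: iterable of {"month": "2023-08", "text": "..."} records.
    Returns {pain_point: {month: count}}, the data behind the trend graph."""
    trends: dict = defaultdict(lambda: defaultdict(int))
    for review in reviews:
        for label in classify_review(review["text"], PAIN_POINTS):
            trends[label][review["month"]] += 1
    return trends
```

The interesting product questions sit around a loop like this: which labels to look for, how to handle reviews the model mislabels, and how to show people where the counts came from.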
So this is how many times each pain point was mentioned each month, and there's a graph here. You can see "laggy and glitchy on iPad" was mentioned five times in August and then it went down, so it probably got resolved, but another pain point, "lack of intuitive design," is going up in Notion's App Store reviews. This is the Q&A widget I talked about, where you can ask questions of the data, like, what are the biggest challenges with Notion AI? It goes through the data, puts together a response for you, and tells you its sources; that's where the verifiability comes in. And you can keep running these widgets; we tried to make it really flexible, because our goal was to see what people will do with these kinds of things. This is another example of a widget, this one more for qualitative interviews. We actually got these off of YouTube: they are reviews that someone did of the Notion app, so these are YouTube video transcripts. We took the transcripts, and they were also very long, so we took tasks where doing it manually would take hours and hours, and we built widgets for them. I think I'll do the user insights one. You give it a name, and you can give it specific topic areas that you want it to look for in the data, like, I want to know more about the calendar app, and I want to know what people are saying about the AI output, because Notion added an AI bot. It goes and analyzes the data and puts it all together, and the output looks like this: it makes a whiteboard with a box for each user, a little summary of that interview, and then, for each of the questions you had, or anything else it finds in that data, it makes little sticky notes. So this is, for user two, what they said, and then there's a summary table at the end: for each of the things I was interested in, and for each of the users, what did I hear. All right, so this is what we built, and our goal was to see how people use it, whether it's useful, and what we can learn from giving it to some product managers and having them use it for something that resembles a real-world task. So, if we could go back: we just built it from scratch.
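To illustrate the verifiability piece of the Q&A widget, here is a minimal sketch of answering a question over the reviews while keeping track of which reviews the answer drew on. The `select_relevant` and `answer_from` functions are placeholders for retrieval and generation steps, not the project's actual implementation.

```python
def select_relevant(question: str, reviews: list[str], k: int = 5) -> list[int]:
    """Placeholder retrieval step: return the indices of the k reviews
    most relevant to the question (e.g. via embeddings or keyword search)."""
    raise NotImplementedError

def answer_from(question: str, passages: list[str]) -> str:
    """Placeholder generation step: ask a model to answer the question
    using only the supplied passages."""
    raise NotImplementedError

def answer_with_sources(question: str, reviews: list[str]) -> tuple[str, list[int]]:
    """Return an answer plus the indices of the reviews it drew on,
    so the person can click through and verify the claim."""
    idx = select_relevant(question, reviews)
    answer = answer_from(question, [reviews[i] for i in idx])
    return answer, idx
```

Returning the source indices alongside the answer is what lets a user check the output instead of trusting it on vibes, which is the verifiability half of reliability described above.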