"You're holding what looks like a can." "I'm holding a magazine, and there is some text in Spanish. I want you to read the text back to me. Look at the webcam." "The text on the magazine reads 'Las Florida Delo'." "Can you tell me what type of beverage I'm holding?" "You're holding a can that appears to be a Red Bull energy drink."

A few weeks ago I published a video where I built an AI assistant using my microphone and my webcam, and it was pretty wild. People actually loved it and sent me a bunch of questions, because it felt alive, right? It felt like the assistant had access to everything that was happening around me, and it was able to answer questions about what it saw through the webcam. Pretty cool stuff. But after I published the video, and this is where it gets funny, a company reached out to me and pretty much said, "Nice try. Are you ready for the real stuff now?" That company is LiveKit, and I'm working with them to build this video. LiveKit built the platform that OpenAI is using for their ChatGPT voice assistant. Pretty wild stuff. Obviously I decided to use their platform and rewrite my agent from scratch to build a real AI assistant that feels alive. So take a look at me interacting with this assistant, just so you see how cool it is.

"I want you to look at the image and tell me what I'm pointing at right now." "You're pointing at a hanging light fixture." "A few days ago my son gave me this card. Look at the webcam and tell me: what were we celebrating?" "The person is holding a card that says Happy Father's Day. You were celebrating Father's Day."

Now, of course, in this video I'm going to show you how to build the same thing from scratch. I put together the code, and you'll find a link to the GitHub repo somewhere in the description below. Just follow along as I explain every single line, so you understand what I'm doing and how you can make changes to it. Let's start.

All right, so let's take a look at the source code. It's a very simple codebase using LiveKit: I think it's 139 lines, including the extensive comments I added just so you don't get lost, so it's actually pretty easy to follow. Before we run it, make sure you follow the instructions in the README file. It's very simple: create a virtual environment, install the libraries, and set up a few environment variables. First you need the LiveKit environment variables: go to LiveKit, create an account (it's free), and you'll be able to grab those three variables. Then you need an API key for Deepgram, which we're using to turn my audio, my voice, into text; that's free as well. And finally you need an OpenAI API key, because I'm going to be using GPT-4o for this demo.

After you do that, here is how you run the assistant, and I'll show you that in a moment. Once the assistant is running, I'm going to use the hosted playground that LiveKit offers to connect and give the assistant access to my webcam and my microphone. You could build your own user interface if you wanted to, and LiveKit has plenty of examples of how to do that, but I'm not going to. I'll just use the hosted playground for simplicity.
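By the way, if you want a quick way to confirm your setup before running anything, here is a tiny check you could run. It is not part of the repo, and the exact variable names are my assumption based on the usual LiveKit, Deepgram, and OpenAI SDK conventions, so double-check them against the README.

```python
# Quick sanity check for the environment variables the setup asks for.
# NOTE: the exact variable names are an assumption; check the repo's README.
import os

REQUIRED = [
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "DEEPGRAM_API_KEY",
    "OPENAI_API_KEY",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All set, you can run the assistant now.")
```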
So, let's go to the assistant. Now, one thing I want to mention before I walk through every single line of code: at a high level, the way this assistant works is that I want to chat with it while giving it access only to my voice, or rather the text generated from my voice. I don't want the assistant to have constant access to the webcam, because that would get a little expensive. If every request I sent to the assistant contained an image from my webcam, I'd be sending a ton of data for no reason. Instead, I only want to give the assistant access to an image when it actually needs that image to answer a question. That way we limit the amount of information we send over the network, and the interaction becomes much more fluid and much faster than if we had to send an image every time.

So imagine I ask the assistant, "Tell me a joke." I don't want to attach an image, because an image isn't relevant to telling a joke. Or if I ask, "How much is 2 plus 2?", I don't want to send an image either. But if I ask, "Hey, am I wearing glasses? Look at my webcam," then I want the assistant to be able to grab a frame from the webcam and look at it in order to answer whether or not I'm wearing glasses.

So how do we do that? It's very simple: we always chat with just text, except when the assistant thinks it needs an image. And how do we detect that? Function calling. Remember, these models support calling functions outside of the assistant's purview. Whenever we ask the assistant to look at an image or at the webcam, we want it not to answer the question directly but to respond with a function call: "I think you need to call a function; this is the name of the function, and this is the text that triggered the call." We capture that in the function, and then resend the request to the assistant, this time with an image attached. That extra round trip is what lets us save a ton of traffic to the assistant.

To explain how I did that, let me quickly show you this AssistantFunction class. It's a class LiveKit supports for implementing function calling, and it's super easy to understand. It has to extend the FunctionContext class, and inside it I can declare the different functions my assistant, my LLM, will have access to call. You do that through this ai_callable decorator, where you specify a description. Remember this description: the agent uses it to decide whether or not to call the function. In this case the description says, roughly, "this should be called whenever you're asked to evaluate something that requires vision capabilities, for example an image, a video, or the webcam feed." Behind the scenes that description becomes part of the prompt to the LLM, and if the LLM notices that a request requires vision capabilities, it answers back not with a message but by asking us to call a function, and the function it asks for is this one here, which I'm calling image.

By the way, you can use function calling for all sorts of cool stuff. Say you want an assistant that can handle complex math equations. Instead of letting the LLM hallucinate some wrong answer, you can have it recognize that you want to solve a math equation and call a function in your code, where you actually solve the equation and send the answer back to the LLM. That's one of the ways function calling can help.
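Just to make that math example concrete, here is a rough sketch of what such a tool could look like. This is purely hypothetical (it is not in the repo), the class and function names are made up, and it assumes the livekit-agents 0.x Python API (llm.FunctionContext, the ai_callable decorator, llm.TypeInfo) that this assistant is built on.

```python
# Hypothetical illustration, not part of the repo: exposing a calculator tool
# to the LLM through LiveKit's function-calling context (livekit-agents 0.x API).
import ast
import operator
from typing import Annotated

from livekit.agents import llm


class MathFunctions(llm.FunctionContext):
    @llm.ai_callable(
        description="Called when the user asks to solve an arithmetic expression."
    )
    async def evaluate(
        self,
        expression: Annotated[
            str, llm.TypeInfo(description="The expression to evaluate, e.g. '2 + 2'")
        ],
    ) -> str:
        # Compute the result in code instead of letting the LLM guess it.
        ops = {
            ast.Add: operator.add,
            ast.Sub: operator.sub,
            ast.Mult: operator.mul,
            ast.Div: operator.truediv,
        }

        def walk(node: ast.AST) -> float:
            if isinstance(node, ast.BinOp) and type(node.op) in ops:
                return ops[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")

        # Whatever we return here goes back to the LLM as the function result.
        return str(walk(ast.parse(expression, mode="eval").body))
```

You would then pass MathFunctions() as the fnc_ctx of the assistant, the same way the real AssistantFunction is passed in below.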
All right. So in this image function, the only thing I do internally is store the user message in a metadata variable. Whenever the LLM realizes it needs vision capabilities to answer a question, it calls this function. Well, it doesn't call the function directly; it answers back saying "please call this function," and the LiveKit platform calls it. The function receives the original user message, the one that triggered the function call, and we store that message in the metadata variable. That's it. It will become clear how I'm using it in a few seconds; I just wanted to get this out of the way, because this way of doing things is very different from what I did before, and it's super cool.

With that out of the way, let's look at the main entry point, which is right here. The part that matters is that I call a function named entrypoint, where everything happens. Inside that entrypoint, I first print the name of the room we're connecting to, and then I create a chat context, which is what lets us keep a conversation with the LLM. I initialize that chat context with a system message, and that system message is how you inject personality into your LLM. In my case I'm saying, "your name is Alloy, you're witty and funny, and you should keep your answers short." That's it; that's my chat context.

Next I define my LLM. I'm using GPT-4o, which supports both text and images, and that's very important for what I want to show. Then I create my voice assistant (I skipped over a couple of lines here; I'll get to them in a second), and I need to define a bunch of properties for it to work. First, a voice activity detector, which lets us detect whenever there is speech; I'm using Silero for that (I think that's how it's pronounced, please don't kill me). It's open source, you can look it up. For the speech-to-text conversion I'm using Deepgram, which is amazing; again, I'm on their free tier, and you need an API key for it to work. For the LLM I'm using the GPT-4o I just defined. For the text-to-speech, turning text into audio to play through my speakers, I'm using the OpenAI API, and that's the definition you see right here; notice that I'm specifying the "alloy" voice. Then I pass the class that handles function calling, the AssistantFunction I just explained, and finally I pass the chat context variable, which holds the conversation this voice assistant will keep with the LLM. That's it; that's my assistant.

After that I need one more variable, the chat manager, which I'm calling chat. It just keeps the chat going, and I initialize it with the room. There are also three internal functions defined here, and I'll go through them one by one when it's time: an answer function, plus two more functions decorated with these event decorators, which I'll explain in just a second.
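To keep everything we just covered in one place, here is a condensed sketch of the AssistantFunction class and the entrypoint setup. It assumes the livekit-agents 0.x API with the silero, deepgram, and openai plugins; the real assistant.py in the repo may differ in the details (for example, how the TTS is wrapped), so treat it as a map rather than the actual code.

```python
# Condensed sketch, assuming the livekit-agents 0.x API and its
# silero/deepgram/openai plugins; the repo's assistant.py may differ in details.
from typing import Annotated

from livekit import agents, rtc
from livekit.agents import JobContext
from livekit.agents.llm import ChatContext, ChatMessage
from livekit.agents.voice_assistant import VoiceAssistant
from livekit.plugins import deepgram, openai, silero


class AssistantFunction(agents.llm.FunctionContext):
    """Functions the LLM is allowed to call."""

    def __init__(self):
        super().__init__()
        self.latest_user_msg: str | None = None  # message that triggered the call

    @agents.llm.ai_callable(
        description=(
            "Called when asked to evaluate something that would require vision "
            "capabilities, for example an image, a video, or the webcam feed."
        )
    )
    async def image(
        self,
        user_msg: Annotated[
            str,
            agents.llm.TypeInfo(description="The user message that triggered this call"),
        ],
    ):
        # Only remember the triggering message; the real answer (with the image
        # attached) is produced later, in the function_calls_finished handler.
        self.latest_user_msg = user_msg


async def entrypoint(ctx: JobContext):
    await ctx.connect()  # newer agents versions require an explicit connect
    print(f"Room name: {ctx.room.name}")

    # The system message is how the assistant's personality gets injected.
    chat_context = ChatContext(
        messages=[
            ChatMessage(
                role="system",
                content=(
                    "Your name is Alloy. You are a funny, witty bot. "
                    "Keep your answers short and concise."
                ),
            )
        ]
    )

    gpt = openai.LLM(model="gpt-4o")  # supports both text and images

    assistant = VoiceAssistant(
        vad=silero.VAD.load(),          # voice activity detection (Silero)
        stt=deepgram.STT(),             # speech-to-text (Deepgram)
        llm=gpt,                        # GPT-4o
        tts=openai.TTS(voice="alloy"),  # text-to-speech (OpenAI, "alloy" voice)
        fnc_ctx=AssistantFunction(),    # the function-calling context above
        chat_ctx=chat_context,          # the running conversation
    )

    chat = rtc.ChatManager(ctx.room)
    # ... the answer() helper, the event handlers, and the frame loop follow
    # (sketched further below).
```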
Skipping ahead a bit: I start the assistant, and then I just sleep for a moment. I basically want the assistant to have a second to breathe, connect, and do everything it needs to do. Then I play "Hi there, how can I help?"; calling say() basically plays that audio through my speakers. And that's the initialization of everything. Notice that I'm passing allow_interruptions as True, meaning that at any point while the assistant is speaking we can just interrupt it, which is awesome.

Finally, to wrap up what happens in the main thread, there is a while loop that keeps running as long as we're connected to a room, and inside that loop I get a video track from the room. This is a very simple function, so let's go to get_video_track so you can see what's actually happening. It first creates a video_track variable, then goes through every participant connected to the room (that participant is going to be me, through the playground I'll show you). For every participant it grabs every published track; for example, if I have a microphone that's one track, and if I have a webcam that's another. For every video track it checks: if this is a remote video track, set the variable to that track, break out of the loop, and return it. So at a high level, this function grabs the first webcam connected to the room, and that's the video track the assistant will use to see what's happening around me. If you had multiple participants you would have to change this function a little, but for now it's just fine.

All right, let's go back. After I grab that video track, which is a webcam, every frame it generates arrives as an event, and this is what I do here: for every event generated by that track, I capture it. You can see I'm just wrapping the track in a video stream, grabbing the frame, and storing it in the latest_image variable I defined earlier. The end result is that at any point in time, whenever I need access to my webcam, I can just use latest_image, and it's guaranteed to hold the latest frame published by the video track.

One thing that's very important to keep in mind, because many people talked about this and got confused: these assistants do not support video. That is just an illusion. When you see these assistants answering questions from a webcam, it looks like they actually see, like they have access to video. They don't. They only receive images and text. What I'm doing here just makes that illusion possible: I capture the webcam, grab its frames, and whenever I want a question answered about one of those frames, I send the image together with the text. It looks like the assistant has full access to my webcam, but it doesn't. Hopefully that's clear. That's the main thread of the program; everything else happens through events.
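Before we move on to those events, here is what that video side roughly looks like, under the same API assumptions. In the repo, latest_image is just a local variable inside the entrypoint; the small holder dict here is only there to keep this snippet self-contained.

```python
# Sketch of the video side, same livekit assumptions as above; in the actual
# file this lives inside entrypoint() and writes to a local latest_image.
from livekit import rtc


async def get_video_track(room: rtc.Room) -> rtc.RemoteVideoTrack:
    """Return the first remote video track published to the room (my webcam)."""
    for participant in room.remote_participants.values():
        for publication in participant.track_publications.values():
            track = publication.track
            if isinstance(track, rtc.RemoteVideoTrack):
                return track
    raise RuntimeError("no remote video track found")


async def keep_latest_frame(room: rtc.Room, holder: dict) -> None:
    """Continuously overwrite holder['latest_image'] with the newest webcam frame."""
    video_track = await get_video_track(room)
    # Every new frame arrives as an event on the VideoStream; we never keep
    # video, only the single most recent still image.
    async for event in rtc.VideoStream(video_track):
        holder["latest_image"] = event.frame
```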
There are a couple of events we need to take care of. The first is called message_received. It fires whenever there is a new message we have to respond to or want to send to the chat: whenever I speak and the voice assistant captures my voice and turns it into text, this message_received event is generated, and the chat message contains the text we want the assistant to answer. So whenever I get this event, I basically just call the answer function (we'll look at it in a second), passing it the message, the text I want the assistant to respond to, and telling it "don't use an image for this," because it was just normal text and there's no indication the assistant needs vision capabilities to respond to it.

The second event is function_calls_finished. It fires whenever a function call finishes executing. Remember, I said at the beginning that thanks to the AssistantFunction class our assistant is able to realize, "hey, I think I might need vision capabilities for this," and instead of answering the question it calls a function. That's what happens when this function gets called, and remember, internally we store the message that triggered that call in a user message variable. After the call is done, the function_calls_finished event fires, so we capture that event and grab the variable we stored as metadata. That variable now holds the original request, the thing I said that made the assistant believe it needed vision capabilities. And now I just answer it again, but this time I specify that yes, I do need an image.

See the extra round trip? I ask the assistant a question; the assistant says, "hmm, I think I need an image to answer this," so it responds with a call to a function in my code; and at that point I send the same request back to the assistant, but now with an image. Now the assistant has everything it needs to answer the question. Does that make sense?

So those are the two events, and both of them call this answer function, which is very simple. The answer function receives the text and a boolean indicating whether or not we want to attach an image. The way I attach (or don't attach) an image is by building the message content: if use_image is set and there actually is a captured image (that check is there so this doesn't blow up in your face if you don't have a webcam), I add an images entry containing a ChatImage built from the latest frame captured from my webcam. Then I take the chat context and append a message to it; the chat context is just a collection of messages that I keep adding to. The message I append has the role "user", because it's the user sending it, not the system; the text is just the text I received; and the extra arguments are either the images or nothing, so if an image is needed, it gets passed along here.

After updating my conversation, my chat context, with the new message, I just call chat() on the LLM I defined and pass it the conversation. That gets sent to the LLM, which sees that the latest message is from the user, so it has to do something: it answers. I capture that answer stream right there and ask the assistant to say it, playing that stream through the speakers. That's it. That's pretty much the whole deal.
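And here is a rough sketch of answer() and the two event handlers. In the repo these are nested inside the entrypoint so they can share the chat context, the LLM, the assistant, the chat manager, and the latest frame directly; the register_handlers() wrapper and the holder dict are just my way of keeping the snippet self-contained, not how the actual file is organized.

```python
# Sketch of answer() and the two event handlers, same assumptions as above.
import asyncio

from livekit import rtc
from livekit.agents.llm import ChatImage, ChatMessage


def register_handlers(assistant, gpt, chat, chat_context, fnc, holder) -> None:
    """Wire up the text path (no image) and the vision path (with image)."""

    async def answer(text: str, use_image: bool = False) -> None:
        content = [text]
        latest_image = holder.get("latest_image")
        # Attach the webcam frame only when asked to, and only if we actually
        # have one, so a missing webcam doesn't blow up in our face.
        if use_image and latest_image is not None:
            content.append(ChatImage(image=latest_image))

        chat_context.messages.append(ChatMessage(role="user", content=content))

        # Send the updated conversation to the LLM and speak the streamed reply.
        stream = gpt.chat(chat_ctx=chat_context)
        await assistant.say(stream, allow_interruptions=True)

    @chat.on("message_received")
    def on_message_received(msg: rtc.ChatMessage):
        # Plain text coming out of speech-to-text: no image needed.
        if msg.message:
            asyncio.create_task(answer(msg.message, use_image=False))

    @assistant.on("function_calls_finished")
    def on_function_calls_finished(called_functions):
        # The LLM asked for vision: replay the original request, this time with
        # the latest webcam frame attached.
        if fnc.latest_user_msg:
            asyncio.create_task(answer(fnc.latest_user_msg, use_image=True))
```

In the actual file these would simply use the variables defined earlier in the entrypoint; the wrapper only exists so this sketch stands on its own.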
So let me show you how to run this yourself. You execute python assistant.py, which is the name of my file, with the start command. I'm going to do that, and it should now be running. It's funny that it's telling me the region is France; that's because I created this assistant in that region, since I was in France when I created it for the first time.

Okay, so this is the playground. There is a link in the README file to open it; you just need to make sure you connect to the same room. Very straightforward. I'm going to click Connect, and from that point on I'm giving the agent access to my webcam and my microphone. So let's see how this goes. Let's click Connect.

"Hi there, how can I help?" "Hi, what's your name?" "I'm Alloy. Nice to meet you." "Awesome, Alloy. Can you look at the webcam and tell me how many fingers I am showing you?" "You're showing five fingers." "That's awesome. Can you tell me what color my watch is?" "Your watch is blue." "That's great. I'm holding something in my hand. Take a look at the webcam and tell me what it is." "You're holding a comic book."

So, as you can see, that's pretty awesome stuff. I hope you enjoyed it. If you really liked this video and you want more content like this, please click the like button below, subscribe to this channel, and I'll see you in the next one. Bye-bye. ...On Facebook, Twitter... that was a very bad joke. Tough crowd.