Hello everyone. In this video I will show you how to get answers from multiple files: you ask one question, and the answer could come from any of the files you supply as input. We will be using OpenAI, tiktoken, and Python, so let's get started.

First we need to import all the required packages. So let's go ahead and say import pandas, then import numpy, then import os, because we need to deal with the local file system, and then import tiktoken and import openai. The last piece is the configuration from which I am reading my key, so let's set that as well: openai.api_key is assigned from the configuration. You can read the key from anywhere; in my case I am storing it in a configuration file. Right here I will also set the encoding with tiktoken.get_encoding, and inside it I pass "cl100k_base". These are the initial things we need.

Next we are going to construct a DataFrame which will primarily have three columns: the file name, the content, and the number of tokens used by that particular file. So let's construct the DataFrame, naming it df, and quickly define the columns we need: the first one is fileName, the second is the content sitting in that file, and the third is tokens. Once these columns are constructed, we need to browse the directory so we can read all the files. For simplicity I am treating all the files as text files, so let's go ahead and read those. To read the files from the directory I call os.listdir, passing the directory name; for me it is "datastore", but you can pass whatever directory is present on your machine. Then for each file name, if the file name ends with ".txt", we go ahead and open that file, so let me quickly build its path.
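The video reads the API key from a local configuration module. As a minimal sketch of the same setup step, here is a common alternative (my assumption, not what the video does): reading the key from an environment variable. The `OPENAI_API_KEY` name is a convention, not something the video specifies.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the OpenAI API key from an environment variable.

    The video reads the key from a configuration file instead; any
    mechanism works as long as openai.api_key ends up set.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


# Typical usage (requires the openai package to be installed):
# import openai
# openai.api_key = load_api_key()
```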
The path is the "datastore" directory plus the file name, and let's name the file handle f. Here we are getting each individual file, and then we extract the content out of it: I will call it lines and say f.readlines(). Then we need to know how many tokens each individual file, that is, each row of our DataFrame, is using. So I say tokens = encoding.encode(...); note that we already set the encoding above, so I am not setting it again, and I join the lines first because readlines returns a list. So here we get the tokens.

Next we just need to push this data into our DataFrame. For that we can call append, and we define the row as key–value pairs: fileName takes the file name we extracted above, content takes the joined text (I could have stored it in a variable above, but this way works too), and the last column, tokens, takes the length of the token list. Once this is done, we need to pass the additional parameter ignore_index=True, because without it append will throw an error.

I believe this should work; let's go ahead and print the DataFrame. Here you can see the file name is coming up empty, which means we missed something: there was a typo in the column name when opening the file from "datastore". Let's fix it and run again, and now you can see that two rows got added, because I had two files. This is the file name, this is its content, and these are the number of tokens used by that particular file; similarly I have a sports file with its content and its token count.

Once we are done with this, the next thing we need to do is compute the embeddings.
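The file-loading step above can be sketched as follows. Note one caveat the video does not mention: `DataFrame.append` was removed in pandas 2.0, so this sketch builds a list of dicts and constructs the frame once, which also avoids the `ignore_index` dance. The column names and the tiktoken fallback are my choices; the video uses `fileName`, `content`, and `tokens` with `cl100k_base`.

```python
import os

import pandas as pd

try:
    import tiktoken

    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))

except ImportError:
    # Rough whitespace fallback so the sketch still runs without tiktoken;
    # real token counts require the tiktoken encoder.
    def count_tokens(text: str) -> int:
        return len(text.split())


def load_text_files(directory: str) -> pd.DataFrame:
    """Build a DataFrame with one row per .txt file: fileName, content, tokens."""
    records = []
    for file_name in sorted(os.listdir(directory)):
        if file_name.endswith(".txt"):
            with open(os.path.join(directory, file_name), encoding="utf-8") as f:
                content = "".join(f.readlines())
            records.append(
                {
                    "fileName": file_name,
                    "content": content,
                    "tokens": count_tokens(content),
                }
            )
    return pd.DataFrame(records, columns=["fileName", "content", "tokens"])
```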
For that I am going to define a function. There is nothing fancy in this; it is the same function I explained in my previous videos, so I am just going to type it here. Here you define which embedding model to use; I will use text-embedding-ada-002, and the function returns a list of floats, because embeddings are values in float form. Let's quickly call the create endpoint for the embedding: openai.Embedding.create, and inside it we pass model=model and input=text, where text is the parameter that was passed in. Once this is done, we simply return the result. The result comes back in a nested structure; if you are not sure why I am indexing into it this way, you can either go through my previous videos or the OpenAI documentation, which will tell you the response format: inside "data", at the first element, we have "embedding". That is how you can figure it out.

Okay, so once we can get one embedding, we need to compute it for all the rows, which means we need to iterate through them. So let me quickly write another function, compute_doc_embeddings, which takes a DataFrame of type pandas.DataFrame. For each row it returns the result of calling get_embedding on the content column, iterating with for idx, row in df.iterrows(). So what this function does is call the embedding endpoint on the content column of every row. Let me also define its return type, which will make things easier for us: it is a dictionary whose keys identify the row and whose values are lists of floats, and I need to add the colon here.
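The two embedding functions just described can be sketched like this. The OpenAI call matches the pre-1.0 `openai` Python library style the video uses (`openai.Embedding.create`); the `embed_fn` parameter is my addition so the iteration logic can be exercised without an API call, and the lazy import is also my choice so the rest of the sketch works without the package installed.

```python
from typing import Callable, Dict, List

import pandas as pd

EMBEDDING_MODEL = "text-embedding-ada-002"


def get_embedding(text: str, model: str = EMBEDDING_MODEL) -> List[float]:
    """Embed one piece of text via OpenAI (pre-1.0 openai-python style)."""
    import openai  # lazy import; requires openai.api_key to be set

    result = openai.Embedding.create(model=model, input=text)
    # The response nests the vector under data[0]["embedding"].
    return result["data"][0]["embedding"]


def compute_doc_embeddings(
    df: pd.DataFrame,
    embed_fn: Callable[[str], List[float]] = get_embedding,
) -> Dict[int, List[float]]:
    """Map each row index to the embedding of that row's content column."""
    return {idx: embed_fn(row["content"]) for idx, row in df.iterrows()}
```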
With that, we are done with our embedding computation. The next thing is to compare the documents, so let me go ahead and create another function, order_document_sections_by_query_similarity. Again, this is the same function we have written earlier. What does it take? It takes the query the user wants to execute, and then the contexts, which is the dictionary of document embeddings; I will explain the parameters when we execute it, so let me quickly type it first.

First it computes query_embedding, the embedding for the query supplied by the user. Then we need to figure out the similarity between the user's query and the embeddings we have for our documents, and for that we calculate the dot product. I call the result document_similarities, and we want it in sorted form, because we want the highest rank to come out on top; that is why I wrap it in sorted with reverse=True. If you do not use sorted with reverse=True, you have to sort it yourself. So here I am calculating np.dot, passing first the query embedding and then each document embedding. We need to do this calculation for all the rows, so we put a loop here: for doc_index, doc_embedding in contexts.items(); let me give the loop variable a distinct name, otherwise it would be very confusing.
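The ranking function just walked through can be sketched as below. As before, `embed_fn` is a parameter of my own so the ordering logic can be tested without calling OpenAI; in the video it is simply the get_embedding call. Slicing the result, for example `[:2]`, gives the top sections.

```python
from typing import Callable, Dict, List, Tuple

import numpy as np


def vector_similarity(x: List[float], y: List[float]) -> float:
    """Dot product; for unit-normalised embeddings this equals cosine similarity."""
    return float(np.dot(np.array(x), np.array(y)))


def order_document_sections_by_query_similarity(
    query: str,
    contexts: Dict[int, List[float]],
    embed_fn: Callable[[str], List[float]],
) -> List[Tuple[float, int]]:
    """Rank document sections by similarity to the query, best first."""
    query_embedding = embed_fn(query)
    document_similarities = sorted(
        [
            (vector_similarity(query_embedding, doc_embedding), doc_index)
            for doc_index, doc_embedding in contexts.items()
        ],
        reverse=True,  # highest-ranked section first
    )
    return document_similarities
```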
Here we close the bracket, and at the end, like I said, we want the highest rank in the first position, so let's make sure reverse=True is set. I guess we are done; the only thing remaining is the return type for this function, which is a list of pairs of a float (the similarity, since similarities come out as floats) and the row key. Let's execute this. We are missing something: we need to return one more thing, the document similarities. Okay, now we are done with all the groundwork.

Next we use an actual query to figure out which sections are similar. I am going to call order_document_sections_by_query_similarity, and it takes your user query, so let me type in something: "How is poverty the reason for homelessness?" Let's go ahead and figure out the answer to this query based on the two documents we supplied. For the contexts I pass compute_doc_embeddings(df). I am interested only in the top two results, but say you have many documents: you can define that number yourself, for example the top 100 ranked sections. I have taken just two documents, so in any case I will not get more than two; that is the reason I am doing it this way. Let me print this out and run it; after one small fix, we are good this time.

Now you can see that for this particular question there were two documents, and the first document was ranked at about 0.88 while the other was at about 0.68, so there is a high chance my answer is in that first document. Let me go ahead and use that same index and those values to pull out the matching row. We are using gpt-3.5-turbo here, which means I need to define roles.
So let's type this in. The first role is "system", and its content is something like "You are a professor who provides to-the-point answers." What we are asking is for this system to behave like a professor and provide us responses accordingly. The next thing we need is the user text, and for that we use the top-ranked result from above: content is my column, and I index with the element at position one of the first tuple, because that is where the row identifier sits.

Once this is done, we go ahead and append the question. Let me grab the query from above; this is the query, and we concatenate it with the user text we just built. Along with the query we pass this context, because we want the question to be answered from this particular given user text; that is what we are doing here. Once this is done, we need to append the message: messages.append, because initially we defined the role for the system and now we need to define one for the user. So here we again say role, this time "user", and content is the user-drafted content we just put together.

So we are done with the system-level and the user-level messages; the only thing remaining is to call the chat completion endpoint, and that we can do using openai.ChatCompletion.create. Inside it we need to pass our model, which is gpt-3.5-turbo, and of course we need to pass in the messages. We are done with this, so let me extract the response: I say reply = response["choices"][0]["message"]["content"]. Again, if you are not sure why I am indexing into choices like this, that is simply the format of the response we
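The message-building and chat-completion steps above can be sketched as follows, again in the pre-1.0 `openai` library style the video uses. The exact way the context and question are joined is my assumption; the video simply concatenates the top-ranked document text with the query.

```python
from typing import Dict, List


def build_messages(context: str, question: str) -> List[Dict[str, str]]:
    """Assemble system/user messages: the top-ranked document text is
    prepended to the user's question, as in the video."""
    system_prompt = "You are a professor who provides to-the-point answers."
    user_text = f"{context}\n\nQuestion: {question}"
    messages = [{"role": "system", "content": system_prompt}]
    messages.append({"role": "user", "content": user_text})
    return messages


def ask(context: str, question: str) -> str:
    """Call the chat completion endpoint (pre-1.0 openai-python style)."""
    import openai  # lazy import; requires openai.api_key to be set

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_messages(context, question),
    )
    return response["choices"][0]["message"]["content"]
```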
get from the completion API; you can read about it in the documentation as well. Let's run this and print the response. If everything is fine we should get the answer, but instead it says none of the indexes is valid, so we made a mistake here: instead of closing there, we were supposed to put the square bracket here. Okay, now this is our response; let me quickly paste it so we can read it.

The question was how poverty is the reason for homelessness, and you can see that I took just two documents as input: the first was about sports and the second about homelessness, so this answer should definitely come from the homelessness document. And it does: "Poverty is a major contributing factor to homelessness. When a person lacks the resources to maintain a decent standard of living, they may be forced to prioritize basic needs such as food and healthcare." All of this is available in my document, and that is why we are getting this response.

I hope you got an idea of how you can do this. There are a few things I did not cover here in order to keep it simple. First, let me quickly print the DataFrame again: you can see that in my case the token counts are quite small, but if you are taking a file which is really very large, then you have to divide its content into multiple rows so that each row's token count stays small. That is one thing I have not covered here. Second, if you want this conversation to keep going, then you have to put it inside a loop and keep appending the messages; that is how you set the context for the next question.

I hope you enjoyed watching this video. Make sure to like it and subscribe to my channel. Thanks for watching!
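The splitting step mentioned above (dividing a huge file across multiple DataFrame rows) is left out of the video; here is one simple way it could be done, greedily packing whole paragraphs under a token budget. The function name and the paragraph-based strategy are my choices; `count_tokens` would be something like `lambda t: len(encoding.encode(t))` with the tiktoken encoder from earlier.

```python
from typing import Callable, List


def split_into_chunks(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Greedily pack paragraphs into chunks that stay within max_tokens.

    A single paragraph longer than max_tokens is emitted as its own chunk
    rather than split further; a fuller implementation would also split
    on sentences.
    """
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # budget exceeded: flush and start fresh
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then become its own row in the DataFrame, with its own embedding, so the similarity ranking operates on sections instead of whole files.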