Transcript for:
Module 6 - Lecture 3: Generating Code

As counterintuitive as it sounds, it's actually easier to generate computer code than it is to generate regular human language. The reason is that computer code, by design, has a very predictable structure. There's a rigid syntax, and you can't ever deviate from it or else your code won't compile. So some researchers wanted to see if they could create a language model trained on code that could do the same thing as some of these large language models: start with some kind of prompt and then guess what should follow it.

Their training data was GitHub. GitHub is a massive repository of software development projects, everything from companies showing off some of their code to casual students who are just trying stuff out. What this has created is hundreds of thousands of projects written in computer code, in many different languages, and the language of each is identified. So, for example, if you wanted to build a language model that could generate, let's say, Python, a very popular programming language for data analysis, then you could just take all the projects that are in Python, build your language model, and then use it to generate code based on a prompt.

I wanted to show you how this works, because this was actually something that I ran into and was quite astonished by. Okay, let me get some of this junk out of the way here. One tool for developing Python is what's called Colab, which is a web-based development environment created by Google. Typically, when you're doing data analysis, you write some code, and then you have to visualize some data, then you have to list
some data, generate some descriptives, and so on. So you're typically just interacting with the data and then getting some visualizations and some output. The programming paradigm where you have one big program that you run in a batch and then it's done doesn't work as well for this. So we have what's called the notebook format, where you write little blocks of code and then you have some output, you write more little blocks and you have some output. That's called the notebook paradigm, and that's what Colab is.

So what you do is you just write some text and then you execute it, like this. This is just accessing a spreadsheet with some data in it. In fact, I can show you the spreadsheet. It's just the candy bars data that we saw earlier. Let's see, it's this thing. Each row is a different candy bar, and each column is some kind of attribute of it. I want to do some analysis of this, and I have now just read the file into memory.

Now I go to the next code block, and it says, okay, do you want to start coding or generate with AI? The first time I saw that, I was kind of surprised. I was like, well, surely this thing can't program the way I can program. But yes, it could. Best practice is to first express what you want in plain English, so that you're really clear about what you want to do, and then write the code. So for something like this, I'd probably say, well, first I want to get some information about this. I would write a comment like this, and these are non-executed pieces of text. I'll just say, "show information about the data." Then I hit Enter, and then I would write the code to do that. Now what I see here is that, in faint gray, the line of code has already been generated, so I just hit Tab to accept it. So it turns out that data is the object that has all my data,
and I'll just do data.info(). All right, well, let's see if that works. Okay, yeah, that's exactly what I wanted. These are all the different columns, and it tells me the data types and whether I have any nulls. Okay, great.

Now I see that I have my calories variable, so let's start exploring it, because in an earlier scenario we were thinking, well, let's build a machine learning algorithm that can predict the number of calories given these other attributes. So let's explore the calories variable. This is just a comment, right? What I need to do is "generate descriptive statistics of the calories variable." Now, that's just a note for myself. I hit Enter. Oh, wait a minute. It looks like it wrote a line, but I don't really like that line. That's not really what I wanted; I just wanted the calories variable. Let's have it try again. How about "of just the calories variable," because I don't want all the other stuff. Well, it looks like, okay, just do a data.describe(). All right, so data.describe(), hit Tab, and there it goes. There's my calories: it looks like there's a mean of 243, a standard deviation of 61, a median of 230, and a maximum of 450.

Okay, so this whole thing is a data set, but I don't really expect to use all of it, so I want to get rid of this first column here. I don't remember exactly the syntax for how to do that, but I'll start working on it by first giving myself a note that says "drop the variable called," let's say, "serving per package." Oh, okay, so it turns out that is the syntax for dropping that variable. Let's execute it. Oh, it was my fault, because it's not "service per package." So this is human error. "Serving per package." Okay, it's generated. Okay, yeah, so it did it right; I just typed in
the name of the variable wrong. So now, if I were to do the same thing, and I don't remember the syntax, "describe the data again." All right, it tells me, there it is. Execute it, and now I see, sure enough, that column's gone. All right, that's not very hard.

One part of exploring is to get a box plot, so let's get a box plot of the calories. I give myself a note here: "create a box plot of the calories variable." Oh, it turns out all I had to do was that one line. I didn't know I could do that, but go ahead and execute it, and there it is. For those of you who don't know how to read this: this represents the minimum, not considering outliers; this is the maximum; this is the interquartile range; and this green line is the median. Sometimes you'll have the mean on there, marked with an X or something. It doesn't look like we have one here.

So I can see that there are some outliers here, and I would like to see what those outliers are. I don't know exactly how to write the Python to do that, so let me see: "show me the rows where the calories variable is an outlier." Oh, and look what it did. It said, okay, I'm going to select from the data where the number of calories is greater than a thousand. That's kind of a neat thing. It looks like anything over 300 is going to be an outlier, so let's just change that to 300. It did pretty well. I'm glad I didn't have to look up exactly how to do that syntax right there. Oh, there it is. So it shows me these are the candy bars that are high outliers; they have an extraordinary number of calories.

Okay, so now I'm going to say, "show me the row with the highest number of calories." Okay, that's how it does it: it selects the data where the calories value is equal to the calories max. And there it is. So
I didn't have to know that syntax. I wrote my instruction in plain English, and it told me what it could do.

Let's do a little bit more. Let's say, "create a histogram of the calories variable." Okay, it turns out I only need that one line. That's pretty convenient. All right, and there it is. It looks like it's kind of right-skewed here; that's why we have those outliers.

What else might we want to know? Let's see how smart this is, how intuitive it can get: "list the most fattening candy bars." Let's see if it can do it. Ah, okay: sort the values in descending order of calories. So somehow it knew that "fattening" was related to calories. Not only was this able to translate a fairly specific instruction into code, it was able to do some inductive reasoning to know what "fattening" meant.

So we now have extremely intelligent code completion, where you could, practically without any kind of training at all, simply open up a notebook and just start generating Python to explore data sets. I never remember all of the syntax; I have to look stuff up all the time, and this has cut my programming time by probably 90%. So this is going to be huge. This is just another large language model. It's just language generation, same as ChatGPT, only now it's generating a different language. The impact that this kind of thing is going to have on the IT world is going to be incredible. The efficiency gains are going to be such that each developer is going to be 10 times more productive than they used to be. And what are going to be the implications for the field, the job prospects? Well, that's something that we will have to discuss.
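
For reference, the comment-then-code pattern from the demo can be sketched roughly as follows. This is a reconstruction under assumptions, not the notebook's actual code: it assumes the data object is a pandas DataFrame, and the toy table, the candy bar names, and the column names `calories` and `servings_per_package` are hypothetical stand-ins for the candy bars spreadsheet. The 300-calorie cutoff is the one chosen by eye in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the candy bars spreadsheet; all names and values are invented.
data = pd.DataFrame({
    "name": ["Bar A", "Bar B", "Bar C", "Bar D", "Bar E"],
    "servings_per_package": [1, 1, 2, 1, 3],
    "calories": [210, 230, 250, 280, 450],
})

# show information about the data
data.info()

# generate descriptive statistics of just the calories variable
calories_stats = data["calories"].describe()
print(calories_stats)

# drop the variable called servings_per_package
data = data.drop(columns=["servings_per_package"])

# show me the rows where the calories variable is an outlier
# (300 is the threshold picked from the box plot in the lecture, not a computed rule)
outliers = data[data["calories"] > 300]
print(outliers)

# show me the row with the highest number of calories
highest = data[data["calories"] == data["calories"].max()]
print(highest)

# list the most fattening candy bars
most_fattening = data.sort_values(by="calories", ascending=False)
print(most_fattening.head())
```

With matplotlib installed, the box plot and histogram steps would each be a similar one-liner, for example `data["calories"].plot(kind="box")` and `data["calories"].plot(kind="hist")`. Each plain-English comment above plays the role of the prompt, and the line beneath it is the kind of completion the assistant suggested.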