Artificial intelligence took a huge leap in quality over the last few years, mostly because of neural networks. For our purposes, neural networks are enabling technologies, and this little outline I've put here describes where they fit into artificial intelligence. Artificial intelligence is the more general term: as we've talked about, it's the ability of computers to do things that are apparently intelligent. Machine learning is where we start with some previous data and ask the computer to learn how to combine features of that data to make predictions. And underneath machine learning are neural networks and deep learning, where deep learning is really just neural networks on steroids. So we have to figure out what these neural networks are.

Now, as I said, these were a big contributor. There were certain tasks, such as classifying objects in images, that were attempted for a long time, but it wasn't until the use of convolutional neural networks that they became accurate enough to be put into commercial practice. You can see the practical use of object recognition in computer vision for surveillance cameras or autonomous vehicles. Another kind of neural network, the recurrent neural network, is good for sequence data — we'll get into that in another video — but suffice it to say that the reason we have all these virtual assistants we can talk to is that voice recognition has gotten really good because of neural networks, and classification of natural language expressions has become really good because of neural networks. A lot of these tasks are not new; what has happened recently, because of neural networks, is that the accuracy has gotten so good that they are now a valid addition to artificial intelligence. It's a huge leap forward.

But I want to introduce the technical side of this in a more concrete way, relating neural networks to some of the analytical tasks you already know how to do. I don't want to immediately start talking about the spooky similarities with brains and how they work — at least not very much. I want to talk about something more specific that you might actually be doing.

Let's start by looking at predictive modeling. Take a very concrete case where we want to do something not too fancy, not too complicated: all we want is to predict a house's price. What should the predicted price be, given the number of square feet and the number of bedrooms? This is a predictive model, and what we would typically do for something like this is a linear regression, where the regression supplies us with weights that we can multiply by the square feet and bedrooms and combine into a model of the price. So we might have some historical data like this, where, just going on what has happened in the past, I've got the price in this column, the number of square feet, and the bedrooms, and I want to fit a regression that predicts the house price from these two variables. I'm going to do this in SAS JMP, which you have access to as a student here.
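(If you'd rather see the same idea in code before we open JMP, here is a minimal sketch in Python with scikit-learn. The data rows are made up purely for illustration — the real historical data would come from a file.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up historical data: columns are square feet and bedrooms
X = np.array([[1100, 2], [1500, 3], [2400, 4], [1800, 3], [3000, 5]])
y = np.array([230, 310, 420, 330, 540])   # price in thousands

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)      # the intercept and the two weights
print(model.predict([[1200, 2]]))         # predicted price: 1,200 sq ft, 2 br
```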
So this is JMP — that's the general JMP window — and here is the data pulled into JMP. It's the same data, just pulled into JMP, which I like to use for regression. Now I'm going to make my predictive model: Analyze, Fit Model, and I'm going to predict the house price — that's my Y variable — based on square feet and bedrooms. This is where you put your dependent variable, this is where you put your independent variables, and we'll leave the rest at the defaults and run it. Don't worry about most of this output — it's about model evaluation and so forth. The point of fitting these things is that we want to make predictions with them, and I can see I've got an R square of 92% and an adjusted R square of 90%, so these two variables helped us quite a bit in predicting. But these are what I'm interested in: the parameter estimates down here. They supply the actual numbers I want to use in my predictive model.

So I'll come back over here, and now that I have these weights I'll just put in some placeholders of 0. Looking down here, I see I have an intercept; let me write that down: 103.955. Then this is telling me that I should multiply the number of square feet by 0.016 — don't worry about that insignificant p-value right there — so I'll multiply square feet by 0.016, and I'll multiply bedrooms by 62.117. So my predicted price is going to be the intercept, plus square feet times the weight of square feet, plus bedrooms times the weight of bedrooms. It's just a weighted sum, and that is now my predictive model. I can say: if I've got 1,200 square feet and two bedrooms, I can expect to pay about 247 thousand. If I add one more bedroom, it goes up; if I add a lot more square footage, it goes up accordingly. So what we have here is a predictive model based on nothing more than a weighted sum — a very simple sort of thing that you already know how to do.
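In code, the fitted model is nothing more than a function that computes that weighted sum. Here's a sketch using the parameter estimates from the JMP output above (price in thousands):

```python
def predicted_price(sqft, bedrooms):
    # intercept + weight * square feet + weight * bedrooms
    return 103.955 + 0.016 * sqft + 62.117 * bedrooms

print(predicted_price(1200, 2))   # about 247 -- roughly $247,000
print(predicted_price(1200, 3))   # one more bedroom pushes it up
```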
Another, slightly more complicated model would be to predict a probability. I went over this case in a previous video, but we can do a similar thing with slightly different data. This is the churn data: a historical record where each row is a customer and the churn code indicates whether that customer canceled our service. It lists the number of calls the customer made, the number of years they were with the company, and whether or not they left. We can use this to build a predictive model estimating the probability that somebody is going to churn, and for this we use what's called a logistic regression. Again we do Analyze, Fit Model; I'm going to predict the churn code based on calls and years, with 1 (churned) as the outcome I'm trying to predict, and it's going to be a nominal logistic regression. Once again, what I'm interested in is down here under Parameter Estimates — these are the numbers I'm going to put into the model.

So, just like above, I'm going to have an intercept; let's put the intercept here: 2.8718. Once again I'll put in some placeholders. I'm going to multiply calls by 0.075, and I'm going to multiply years by -0.845. Let's put in some slightly more realistic data — say, 20 calls and 5 years. In this case I do a weighted sum just like the model above: the intercept, plus calls times the weight of calls, plus years times the weight of years. Now, in this case the weighted sum gives me something called log odds. That's just how logistic regression works — the linear prediction is actually giving you log odds — but we can convert it to a probability by applying a function. So I'll apply that function to make it something more usable and interpretable, a probability. Don't worry about why; this is just how logistic regression works: you convert the log odds to a probability with this function.

So now I have a probability of about 50/50. If the number of calls goes up — say, to 30 — it rises to 71%; and if their number of years with the company is only two, it goes all the way up to a full 97%. So now I've been able to build a predictive model of the probability that a customer is going to leave, based on calls and years — but in this case I used not only a weighted sum, I also had to apply a function.
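Before we move on, here is that whole logistic model as a sketch in Python: the same kind of weighted sum, followed by the logistic function that converts log odds into a probability. The coefficients are the parameter estimates read off the JMP output above.

```python
import math

def churn_probability(calls, years):
    log_odds = 2.8718 + 0.075 * calls - 0.845 * years   # the weighted sum
    return 1 / (1 + math.exp(-log_odds))                # logistic function

print(churn_probability(20, 5))   # about 0.54 -- roughly 50/50
print(churn_probability(30, 5))   # about 0.71
print(churn_probability(30, 2))   # about 0.97
```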
Let me write down what we've modeled so far. In the first case, predicting the house price, we were predicting a y from an intercept, plus some weight times the first variable (the number of square feet), plus another weight times the other variable (the number of bedrooms). The central problem in estimation, then, was finding those weights: we take our historical data — y, x1, and x2 — and the procedure figures out what they are. In this case it's ordinary least squares regression, so it finds them pretty easily. In the next case, predicting a probability, we had a similar thing — an intercept, a weight times the first variable, and a weight times the second variable — but this time we applied some function to that linear prediction. So the first model is just a weighted sum, and the second is some function f applied to a weighted sum. (Sorry for the sloppiness — it's hard to write with this.)

Now, one way we might rewrite this is to give it a bit of a geometric interpretation. Take the first model and draw the input variables, x1 and x2, plus one more node for the intercept, as little nodes on a network. For this first one, we take the input variables and produce an output by applying a weighted sum: this arc carries weight 1, this arc carries weight 2, and this node supplies the intercept. So this is one way of visualizing what we're actually doing: we've got these inputs, we want to produce a certain output, and what we need to know are the weights. The process of linear regression is just one of finding those weights. The other model is only slightly different: once again we have the two input variables, and we combine them with a weighted sum as before, with weights β1 and β2, but this time we apply a function to that weighted sum before arriving at the output. It's just a different way of visualizing what we're doing, but ultimately we start with data and use some learning process to arrive at the weights. In this second case the weights are slightly harder to learn, because whatever algorithm is learning them has to take the function's transformation into consideration, but the problem is still one of coming up with the weights.

Now let's look at how that logistic regression did, just for fun. We had an R square of 0.49 — that's pretty good; it accounts for about half of the variance — and we've got these fit measures here, with a root mean square error of 0.344. So it's not perfect, but this is kind of where data science stopped for a while: combining the data in this way gives us about as close as we can really get to a prediction.

But then somebody quite clever said: is it really necessary to have only one step like this? What if, instead of going straight from the input to the output, we put a bunch of intermediate nodes in between, combined the data into those, and then went to the output? This is fully connected — I don't think I drew all the arcs, but there we go. Now each of those intermediate nodes computes a weighted sum and also applies some kind of function, just like the logistic regression did. That means when we're training this thing, we have our data — x1, x2, and what the output should be — and the training algorithm has to find this step-by-step way of doing weighted sums and applying functions: we do it on this level, then this level, then this level, and the last step gives us the output. It's basically multiple different regressions, with different functions, stacked together. And what they found was that this is astonishingly accurate: where the one-level model had a root mean square error of about 0.34, a network like this can drive the error down by a huge factor.
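To make the "stacked regressions" idea concrete, here is a sketch of a forward pass through a network with one hidden layer, in Python with NumPy. Every weight here is made up purely for illustration — in practice, training is what finds them.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([20.0, 5.0])           # two inputs, e.g. calls and years
W1 = np.array([[ 0.40, -0.20],      # made-up weights into 3 hidden units
               [-0.10,  0.30],
               [ 0.25,  0.50]])
b1 = np.array([0.10, -0.30, 0.20])  # hidden-layer intercepts ("biases")
W2 = np.array([0.70, -1.20, 0.50])  # made-up weights to the single output
b2 = 0.05

hidden = sigmoid(W1 @ x + b1)       # layer 1: weighted sums, then a function
output = sigmoid(W2 @ hidden + b2)  # layer 2: weighted sum, then a function
print(output)
```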
So when you have something like this, where you can put in your data and get the right output, you get an unbelievable amount of accuracy on your training and holdout data. But here's the key challenge. Remember that when we made the simple model, we had to learn three weights: given this input and this output, find the right weights that get us from one to the other. Training this bigger thing means that, given the two inputs and the output, we now have to find weights for all of these connections — I'm not going to count them all, but that's one, that's one, and there are just a bunch here — and we have to work backward from the output and the input, tweaking all of these different weights. Back in the '80s, when they first conceived these things, they started by putting in random weights and then just tweaking them as they went: loop through, maybe increase this one a little, decrease that one, increase this one, and keep adjusting until it worked. This took way too long, and consequently research on doing predictive modeling with neural networks like this stalled — there was just no efficient way to get all of these weights, and the increase in accuracy was not worth the trouble.

But eventually certain things fell into place. I have a slightly nicer picture here to show what I'm talking about. Here are the inputs — these are all the variables — and here are the weights; you combine the inputs in a weighted sum and then apply some kind of function, so all of these inputs are combined to make some kind of output. They call this an artificial neuron, or a perceptron. The reason these are called neural networks is that they bear some superficial similarity to neurons. The actual neurons in your central nervous system work the same way: you have all of these different inputs going into the neuron, some processing happens — the processing part is what kind of function goes on inside to determine the outputs — and then it has any number of outputs. Those outputs go into other neurons, so what you wind up with is multiple layers of actual neurons, just like what goes on with these artificial neurons. When you hook these weighted-sum units together, they have some physical resemblance to real neurons — hence the name.

The big holdup, though, was, as I said, coming up with an efficient way to determine the weights — especially when you start doing craziness like this. These are the inputs (this is always the input layer) and these are the outputs. So far we've always had a single output, but you could have an output layer with several different values.
Say there are twelve of them: this is a classification task with twelve different classes, and each output is going to be a probability for one class. If this one is probability 0.1, that one is 0.02, and down here you've got, say, 0.6 — it couldn't be 0.9, since the probabilities have to fit together — then you can say this last one is the most likely class. So you can have multiple outputs. And then you have these things in the middle, which are where the weighted sums get stored — these are called hidden layers, and here we've got one, two, three — three different hidden layers. How many hidden layers should you have? We'll get into that in a minute. But you can see that once you start adding these up, there are enormous numbers of weights to figure out: you've got your historical input data and you're trying to learn all of these different weights, and it became an intractable problem. Progress stalled until a few things happened recently.

One was that data became available. We now have large data sets that make it possible to train a network like this; as you can see, with this number of weights to estimate, unless you have a lot of rows you're going to run into a degrees-of-freedom problem.

The other thing that happened was improvements in software algorithms, particularly backpropagation. Backpropagation is a more efficient way of finding the weights: instead of initializing all of them to arbitrary random numbers and then tweaking them up and down, backpropagation adjusts them in a more sensible and efficient way. We're not going to get into it — it's way too complicated for now — but it made it possible to actually come up with these hundreds, thousands, or even millions of weights.

The third thing that happened is that hardware became available that can process these efficiently: the GPU, which stands for graphics processing unit. Why a graphics processing unit, when there are no graphics involved here? Well, all of these operations are weighted sums: you have x1, x2, any number of variables, multiplied by weight 1, weight 2, ..., weight n. These operations are linear algebra, and it turns out there is a lot of linear algebra involved in generating computer graphics. A computer screen has thousands of pixels, and each little pixel has to have color values calculated for it, so there are millions of matrix and vector multiplications involved. Over time, the folks at Nvidia came up with specialized hardware specifically for doing linear algebra and parallelizing it, so that all the different picture elements could be calculated simultaneously. It turns out that these graphics processing units — I have one on this machine right here, and I'll show you a network implementation in a minute — also sped things up tremendously for neural networks. Instead of spending days trying to find all of these weights, the combination of backpropagation in software and the GPU in hardware made it possible to find them in a reasonable amount of time.
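That "weighted sums are linear algebra" point is easy to see in code: an entire layer, applied to an entire data set at once, is a single matrix multiplication — exactly the operation GPUs are built to parallelize. A sketch, with made-up shapes:

```python
import numpy as np

rows = np.random.rand(10_000, 300)   # 10,000 observations, 300 variables
W = np.random.rand(300, 128)         # weights for a 128-unit hidden layer
b = np.random.rand(128)

layer_out = np.tanh(rows @ W + b)    # every unit, every row, in one matmul
print(layer_out.shape)               # (10000, 128)
```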
Okay. So when we build one of these, we have a lot of choices to make. We have our inputs and our outputs, which are given, but we can decide how many layers to use and how many units to put on each — and how do we go about that? Here are the decisions we have to make. This little neural network here is a bit simpler to look at because there's only one hidden layer, and we've got three inputs doing their weighted combination, their function application, and combining into these outputs. So how did they decide to have just one hidden layer? The first decision is how many hidden layers you should use. The second is how many units on each layer — this one's got a bunch on this layer and only five here; how in the world did they come up with that? Third, remember that each unit takes a weighted sum and then passes it through some function, and there are any number of functions you could use — linear functions, stepwise functions, and so on — so you have to decide which function to apply at each unit. And finally there's a much more loaded question that I'm going to get into in a minute.

I want to show you how you can make a neural network quite simply. In another video I'll show you how to do it in Python, but for now let's take this churn data and make a neural network: instead of just doing the one-step logistic prediction, we're going to have a hidden layer too. So I'll come back here — this is the same old data: calls, years, churn — and go to Analyze, and this time not Fit Model but Predictive Modeling, then Neural. SAS JMP provides a point-and-click way to make a neural network. I'm going to predict my churn code — this dialog is very similar — and I'll put in my X factors right here.

Now here is where I can make the choices I was just talking about. With this tool, JMP makes some of the decisions for you: you can have up to two hidden layers. Theoretically you could have three, four, any number of hidden layers, but this tool caps it at two. Let's just have one hidden layer for this first run. For each hidden layer, you pick the function you want to apply and the number of nodes to apply it to. I'll take the default: on the first hidden layer, apply the hyperbolic tangent function — don't worry about that; it's a kind of S-shaped curve — on three nodes, and leave all the other settings at their defaults. Then we run it.
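For reference, here is a rough Python equivalent of what JMP is doing here — one hidden layer of three hyperbolic-tangent units. This is a sketch, not JMP's exact algorithm, and churn.csv and its column names are hypothetical stand-ins for the data set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("churn.csv")                     # hypothetical file name
X, y = df[["calls", "years"]], df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(3,),      # one hidden layer, 3 units
                    activation="tanh", max_iter=2000)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))                  # accuracy on unseen data
```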
Let's not worry about reading the whole report just yet; let's look at the diagram. It takes calls and years, passes them into three nodes on a hidden layer, and then goes to the churn code. The diagram before was just calls and years straight into the churn code; now it's applying functions on three units in a hidden layer. And if I look up here, I notice I now have a generalized R square of 0.87 — remember, the R square on the logistic regression was only about 0.5 — and look at the accuracy: these confusion matrices show it does quite a good job of predicting; it was wrong only one time on the unseen data.

Let's see if we can add more layers to make it even more impressive. Let's have a second layer: on the first layer, two units doing the hyperbolic tangent and three doing linear; on the second, eight doing the hyperbolic tangent. Generate that — this is the new one down here — and show the diagram. Now we've got one, two, three, four, five units in this layer and one through eight in this one. The eight are doing that S-shape, the hyperbolic tangent; these three are the linear; and these two are the hyperbolic tangent again. So the data goes through two hidden layers before it reaches the churn code. Let's see if we've gotten any better — a slightly different confusion matrix, but it doesn't look like we gained a whole lot. We can keep trying, and we can even get funny about it and see if we can break it: let's make it twenty units and, let's see, twelve, and see what it does. Look at that diagram — yikes — and now we're actually less accurate. Comparing the generalized R squares — 0.87, 0.85, 0.82 — we actually got worse by adding more layers, so we might want to stick with the first one up there.

There's not really any theoretical basis for choosing the number of layers or the number of units per layer. (When we talk about convolutional neural networks, we'll get into some of the more mind-blowing aspects of that.) For now we just say: through trial and error, we've determined that this is our best neural network. And if we want to deploy it, we can do something like Publish Prediction Formula and save it as Python. Now we have our functions, and you can see it in there: the hidden-layer units applying the hyperbolic tangent function to the data. So you can push it out to Python.

All right — this is the entry point for doing neural networks. They come back to the same idea: you want to make some kind of prediction, and you do it by combining a weighted sum with some function. All we're doing with these more complicated setups is performing those weighted sums and functions on multiple layers, and it turns out that this provides more accurate predictions.
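Since there's no theory to lean on for layers and units, the trial and error we just did by hand can be automated. Continuing the hypothetical scikit-learn sketch from above (same X_train, y_train, and so on), we might loop over a few candidate architectures and compare them on the held-out data:

```python
from sklearn.neural_network import MLPClassifier

# X_train, X_test, y_train, y_test as in the earlier sketch
for layers in [(3,), (2, 3), (8,), (20, 12)]:
    net = MLPClassifier(hidden_layer_sizes=layers, activation="tanh",
                        max_iter=2000, random_state=0)
    net.fit(X_train, y_train)
    print(layers, net.score(X_test, y_test))     # keep the best performer
```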
Now, the last question here is: should we make a fully connected network? Everything we've done so far has had an arc, carrying a weight, going from every single node to every single node on the next layer. But what if we wanted, say, this one to go to just these two, this one to maybe those two, and this one to just that one up there? How do we know, first of all, whether dense is going to be better than a more selective graph — and if it is better to be selective and not have everything connected, how do we learn which ones to connect to which?

This is where we get into what I said is one of the more mind-blowing aspects of neural networks: convolutional neural networks, or CNNs, or ConvNets. These actually learn how the network should be connected. You just say, "I have this number of layers and this number of units," and it goes and learns which ones it should connect — and it does this by noticing patterns. For example, take an input image. Convolutional neural networks tend to go along with images; this just seems to be how vision works, and in fact I understand that psychologists have found the visual cortex is structured much like these convolutional networks. The input consists of all of these pieces, and the convolutional neural network divides them up into components on this layer: this node is well predicted by these four pieces, this node by just these two, and this one by these five. So it combines certain pieces of the input into certain nodes on this layer, and from those it produces an output — in this case, it predicts that it's a cat. On the input layer it just pulls in the image; here it's identifying eyes, here it's identifying noses, and here — I think that's an ear — yes, it's an ear. So when you use a convolutional neural network, you're actually breaking the image down into individual pieces — but you do so, again, with weighted combinations.

In my next video I'm going to show you a neural network used for an interesting challenge: recognizing handwriting. I've got the MNIST data collection, which is 60,000 handwritten digit images. How would you train a computer to identify a three? We're going to use a neural network, and I'll show you that next time.
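As a preview of next time, here is a minimal sketch of the kind of convolutional network we'll use on the MNIST digits, written with Keras (one common library choice; the layer sizes here are arbitrary, not the network we'll necessarily build):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),             # 28x28 grayscale digits
    layers.Conv2D(32, (3, 3), activation="relu"),  # learns local pieces
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),        # one probability per digit
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()  # 60,000 images
model.fit(x_train[..., None] / 255.0, y_train, epochs=1)
```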