Transcript for:
Regression Residuals and Analysis

Hi everyone, this is Matt Teachout with Intro Stats, and today we're continuing our discussion about correlation and regression. Last time we looked at some of the famous statistics that we calculate in correlation and regression, including the correlation coefficient r, r squared (the coefficient of determination), the slope, the y-intercept, and the regression line. We saw how those are calculated, what they mean, and how to explain them. Today we're going to get a little further into this idea of residuals. Residuals are a big part of regression analysis, so if you want to know how good your formula is for predicting things, we have to get into this topic.

We talked about how residuals are the vertical distance that each point is from the regression line. If we measure how far vertically every single dot is above or below the regression line, that's called a residual. What we also found is that if we sort of average those residuals, we get something called the standard deviation of the residual errors, which tells us the average distance that the points in the scatter plot are from the regression line, and it also gives us the average prediction error.

Let's look at the same example we used last time. Our x value was the high temperature of the day, and then we looked at how much profit a store that sells swimsuits made that day. So we're looking at swimsuit profits versus temperature: temperature was the X and swimsuit profit in dollars was the Y.

When you're calculating the residual (the computers all do this, by the way; I'm showing you the calculation, but really the computers do this in a split second), you need to figure out what the Y value is on the line. That's called the predicted Y value, or Y hat. To do that, I'm going to plug each of my X values into the regression line formula. So
this is the formula of this line: if I replace the letter X with the number that X was equal to, I can find the Y value on the line for that X. We call that the predicted Y value.

My regression line was 26.4 + 18.06x. Remember from last time, 26.4 was the y-intercept and 18.06 was the slope; last time I showed you how those were calculated. So now what I'm doing is plugging the x value 17 into that formula. I'm replacing the X with 17 and working it out, and I get a predicted Y value of 333.42. My actual Y value was 378 and my predicted Y value was 333.42. To get the residual, I subtract the actual Y value minus the predicted Y value, and that gives me the vertical distance. That's all this is doing; it's just a fancy way of figuring out the vertical distance. So 378 minus 333.42 gives me a residual of positive 44.58. In other words, that point is 44.58 dollars above the line. The residual has the same units as the Y. So this point right there is about 44 above the line. That's the idea.

Now I'm going to do this for all the X values. The next x value is 19. I plug that in for X in the regression line formula, multiply it out, add the 26.4, and I get 369.54. If I subtract these two, I get negative 8.54. The negative residual tells you that the point is below the line, so that point was actually 8.54 below the line.

Now we go to the next one, 20. 20 was our x value; we plug that in for X in the formula, and that gives us a predicted Y value of 387.6. Again, my actual Y value is
399, so I subtract those Y values and I get the vertical distance: 399 minus 387.6 gives us a positive 11.4, so that point was 11.4 above the line. Negative residuals mean the point was below the line; positive residuals mean it was above the line. You can keep doing this for all the points. You can see how I just plugged them in and got my residuals, and how the residuals are positive and negative numbers. Points that are below the line have a negative residual, and points above the line have a positive residual.

Now, there are a couple of famous graphs that come from these residuals. Actually there are quite a few graphs out there, but two main ones stand out, starting with the residual plot versus the X variable. What they do is keep the x-axis just like the x-axis of the original scatter plot, but the Y is no longer the actual Y value; the residual is the Y. So in a residual plot versus the X variable, you'll see the x-axis right here, but now the Y values are actually the residuals. You can see positive residuals right here, positive 10, 20, 30, 40, and negative residuals, negative 10, negative 20, negative 30, negative 40. If you look at this, 17 had a residual of positive 44, so at 17 they put a dot at positive 44. It shows you the vertical distance of each one of these points. You see how in this graph, the scatter plot, the points are so tiny that it's really hard for me to judge how far the points are from the line. I like to think of the residual plot as a magnifying glass in some ways. If you have kind of bad eyes like me, this helps you see the vertical distances much better. So think
of a residual of zero as representing the line itself: if you had a residual of zero, your point would be exactly on the line. So this is like the line, and these are the points and how far vertically they are above or below the line. At 19 we had a residual of negative 8.54, so you can see at 19 I've got a point at about negative 8 right there. 20 had a residual of positive 11, so I've got a point at positive 11, and so on. At 23 I had a residual of negative 56, so at 23 there's a point at negative 56. At 27 we had a residual of 29.98, so that would be right here, above the line, and then at 28 we had a residual of negative 5, so here's negative 5.

So this is called a residual plot. (Sorry, my little toy poodle is coming in to say hi to me.) One of the things we look for in a residual plot is whether the points are evenly spread out, or whether there are some areas of the graph where all the points are very close to the line and other areas where the points are very far away. That can be problematic, and this is also why we study residuals.

Another graph you'll see occasionally is a histogram of these residuals. They'll actually make a histogram counting how many residuals fall in each section. Again, with the histogram of the residuals you'll see these negative and positive numbers and you'll see zero. So this is a histogram of these residuals right here, and you can see what it looks like. One of the things we like to look at with this histogram of the residuals is whether the highest bar is pretty close to zero or whether it's kind of far away.
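
The residual values shown in these graphs come from one subtraction per point, and a small Python sketch may make that concrete. It uses the regression line quoted in the lecture (26.4 + 18.06x) and only the three points worked through by hand. The function names are made up for illustration, and the actual profit of 361 at x = 19 is an assumption inferred from the stated prediction (369.54) and residual (negative 8.54); it is not read out in the lecture.

```python
# Regression line quoted in the lecture: y_hat = 26.4 + 18.06 * x
INTERCEPT = 26.4
SLOPE = 18.06


def predict(x):
    """Predicted Y (Y hat): the height of the regression line at x."""
    return INTERCEPT + SLOPE * x


def residual(x, y):
    """Actual Y minus predicted Y: signed vertical distance from the line."""
    return y - predict(x)


# (temperature, profit in dollars). The pairs for 17 and 20 are stated in
# the lecture; 361 for x = 19 is an ASSUMPTION inferred from the stated
# prediction and residual.
points = [(17, 378), (19, 361), (20, 399)]

rows = []
for x, y in points:
    r = round(residual(x, y), 2)
    side = "above" if r > 0 else "below" if r < 0 else "on"
    rows.append((x, round(predict(x), 2), r, side))
    print(f"x={x}: predicted={predict(x):.2f}, residual={r:+.2f} ({side} the line)")
```

Each (x, residual) pair in `rows` is one dot on the residual plot, and the list of residuals is what the histogram counts.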
Sometimes, as you can see here, zero is not really where the highest bar is, and that can be problematic. We also like it to be relatively normal, bell-shaped. This one actually looks a little skewed left, so again this is not a very good-looking histogram of the residuals, but this is the idea. So those are a couple of famous graphs that go with residuals.

Now what about the standard deviation of the residual errors? How do we calculate that? Well, it's sort of like the average of the residuals, and it is a standard deviation calculation. If you remember when we talked about standard deviation, we said it's the sum of squares divided by the degrees of freedom, and then the square root of that at the end. That's what this is going to do now: it takes the sum of squares of Y minus Y hat and divides it by the degrees of freedom, n minus 2. So this is the formula for it: Y minus Y hat is the residual; square it and add up all the squares. I squared all these numbers, which I did right here, and added them up, and that's where we get this 8,487.03, and you're dividing that by 6.

Notice it says n minus 2. The degrees of freedom for one data set was n minus 1, but for two data sets it is n minus 2. Remember, these are ordered pairs, so there are two data sets. There's going to be one fixed value in this data set and one fixed value in this data set, so we say n minus 2. So 8 minus 2 is 6; that's our degrees of freedom. Then we divide and take the square root, and we get that the standard deviation of the residual errors is about 37.61. Remember, it's always going to have the same units as the dollar
amount, that is, the Y value, so this will be dollars again. So suppose the company wanted to predict what their profits might be based on temperature. Well, one thing you have to keep in mind, with all formulas in science, is that you have to think about when