So now we're going to talk about assessing the fit of the simple linear regression model. The sum of squares due to error, SSE, is the sum of the squared error terms, the residuals, from last time. The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample.
So if we go off of this example, the SSE, the sum of squares due to error, can be computed as SSE = e₁² + e₂² + ... + e₁₀², the sum from i = 1 to 10 of eᵢ², which for this particular example is 8.0288. Now, as we said, the way this works is that if we look at the example we're continuing through, we have this average y value right here, and then from the average y value, there's a distance.
And so this distance, for instance for the fifth data point, is the difference from the average y value to that data point. And then, for instance, the third data point has that distance as well. So right here, this is our y value, and this is our predicted y value. So this is the sum of squared deviations obtained by using ȳ = 6.7 to predict the value of travel time for each driving assignment in the sample. This was our previous table; we had that before.
So, for instance, for that Butler Trucking example, for the ith driving assignment in the sample, the difference yᵢ − ȳ provides a measure of the error involved in using ȳ, the average y value, to predict travel time for the ith driving assignment. Now, what we then have is that SST is a measure of how well the observations cluster about the ȳ line. ȳ is the average y value, and this is the corresponding sum of squares, or the total sum of squares, as it's sometimes called.
Now, you can notice that the abbreviation is SST, and that's because it's sometimes called the sum of squares total. Total sum of squares and sum of squares total are the same thing. So should it really be abbreviated as TSS? I would say so. However, this is the standard notation, and so we're keeping to it, despite the fact that I very much believe it should have originated as TSS.
But there's a consistency: sum of squares is what appears first in these abbreviations. So, continuing with the calculation that we've been going through, you'll notice, and we commented on this before, that if you actually look at the sum of the differences yᵢ − ȳ, it's zero, because the average difference from the average y value should be zero; that's what it means to be centered around the mean. And that means we have the sum of squares total as 23.9 for that example. So the deviations about the estimated regression line and about the line ȳ can be seen from here, and we see that the line ȳ cuts straight through.
And with the line of best fit now also displayed on there, there are two different types of differences that we are assessing. For instance, for this point, there's the difference from the average y value, but there's also the difference from the predicted y value. This little distance is the difference from the predicted y value; this distance is the difference from the average y value. So the points cluster more closely around the estimated regression line than they do about the horizontal line ȳ, meaning that they're closer to the line of best fit than they are to the average y value.
Now, these two different differences relate to each other in the following way. The SSR, the sum of squares due to regression, is the sum of the squared differences between the predicted y values and the average y value. And there's a very nice formula that relates these quantities: SST = SSR + SSE, that is, sum of squares total equals sum of squares regression plus sum of squares error.
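To make this concrete, here is a minimal sketch in R. I'm assuming the ten Butler Trucking observations from the table we've been working from, miles traveled and travel time in hours; if your table differs, swap in your own values. It computes SSE, SST, and SSR and checks the identity.

```r
# Butler Trucking sample from the table: miles traveled (x) and
# travel time in hours (y).
x <- c(100, 50, 100, 100, 50, 80, 75, 65, 90, 90)
y <- c(9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1)

fit   <- lm(y ~ x)             # estimated regression equation
y_hat <- fitted(fit)           # predicted y values
y_bar <- mean(y)               # average y value (6.7)

sum(y - y_bar)                 # deviations about the mean sum to zero

SSE <- sum((y - y_hat)^2)      # error about the regression line (8.0288)
SST <- sum((y - y_bar)^2)      # total variation about y-bar (23.9)
SSR <- sum((y_hat - y_bar)^2)  # variation explained by the regression

all.equal(SST, SSR + SSE)      # TRUE: SST = SSR + SSE
```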
Now, on the figure, the sum of squares regression is the difference between this and this. So we now come to this thing called the coefficient of determination: the ratio SSR/SST is used to evaluate the goodness of fit for the estimated regression equation.
This ratio is called the coefficient of determination and is denoted by r². And I'll tell you, it doesn't matter whether you use a capital R or a lowercase r. It takes values between 0 and 1. What's the reason it's always between 0 and 1? Well, SSR and SSE are both non-negative, and since SST = SSR + SSE, SSR can never exceed SST, so the ratio SSR/SST falls between 0 and 1. It's interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation. Or, stated differently, it's sometimes used to describe the amount of the variability in the data, as a percentage, that can be explained by the model. So, for instance, for that particular table of data values, the r² happens to be equal to 0.6641.
And from that, we can conclude that 66.41% of the total sum of squares can be explained by using the estimated regression equation that we found, which was ŷ = 1.2739 + 0.0678x. So that's just an example of what we're talking about.
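And as a quick cross-check in R, a short sketch (assuming the same ten observations as in the earlier sketch) fits the line and reads off r² directly:

```r
x <- c(100, 50, 100, 100, 50, 80, 75, 65, 90, 90)
y <- c(9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1)

fit <- lm(y ~ x)
coef(fit)               # b0 = 1.2739, b1 = 0.0678
summary(fit)$r.squared  # 0.6641, the coefficient of determination
```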
Now, when we want to bring up the coefficient of determination, the r² value, for the example that I did the other day, let's go to that real quick so you can see that I did already share how to find the r² value on a scatter plot in Excel. The idea is that when you have a trend line, you double-click on the trend line, go to Trendline Options, and it's right down here: "Display R-squared value on chart." So it is very easy to get that kind of thing from Excel. No problem there, right?
And of course, as I have commented before, we will look at how to do this in R. So now, the multiple regression model. What is the multiple regression model? Well, it is largely the same as the simple linear model.
The only difference is that we have multiple independent variables. The model is y = β0 + β1x1 + β2x2 + ... + βqxq + ε, where q is some index. Your y is still the dependent variable; β0, β1, β2, on up through βq, are parameters; ε is still your error term; and x1, x2, up through xq are the independent variables. Now, what is the place of those independent variables? In other words, what do the independent variables do? That's what we want to understand: the influence and the interaction of these things.
Okay, so how do we begin to interpret this model? Well, the interpretation of the coefficients, the βᵢ's, is that each one represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable xᵢ, holding the values of all the other independent variables in the model constant.
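To make that "holding all other variables constant" idea concrete, here is a small R sketch with made-up data (the variable names and numbers are hypothetical, purely for illustration): increasing x1 by one unit while x2 stays fixed changes the prediction by exactly the coefficient on x1.

```r
set.seed(1)
# Purely hypothetical data, just to illustrate the interpretation.
d <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 5))
d$y <- 2 + 0.8 * d$x1 + 1.5 * d$x2 + rnorm(50, sd = 0.5)

fit <- lm(y ~ x1 + x2, data = d)

# Raise x1 by one unit while holding x2 fixed:
p0 <- predict(fit, newdata = data.frame(x1 = 4, x2 = 2))
p1 <- predict(fit, newdata = data.frame(x1 = 5, x2 = 2))
p1 - p0  # equals coef(fit)["x1"], the estimated b1
```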
So the multiple regression equation that describes how the mean value of y is related to x1, x2, up through xq is E(y | x1, x2, ..., xq) = β0 + β1x1 + ... + βqxq. In other words, to state it differently: in the multilinear equation β0 + β1x1 + β2x2 + ... + βqxq, the coefficient of each x is the rate of change of the y variable with respect to the particular x it's associated with. Now, continuing with this model, the estimated multiple regression equation is ŷ = b0 + b1x1 + b2x2 + ... + bqxq, where b0, b1, b2, up through bq are the point estimates of their theoretical counterparts. So, once again, there is a theoretical, conceptual model, and for that conceptual model we find approximations to its values based upon the particular data set that we have. The least squares method in multiple regression says that the estimated multiple regression equation is developed by finding the b0, b1, b2, up through bq that minimize the sum from i = 1 to n of (yᵢ − ŷᵢ)², which is the same as minimizing the sum of the eᵢ². Using sample data to provide the values of b0, b1, and so on amounts to minimizing this quantity, where the inner subtraction is given by the predicted values; in other words, we will use the predicted values stemming from that particular equation.
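To see what that minimization amounts to, here is a small sketch in R with hypothetical data (the numbers are made up purely for illustration): solving the normal equations (X′X)b = X′y produces exactly the b's that minimize the sum of squared residuals, and they match what lm computes.

```r
set.seed(2)
# Hypothetical data with two independent variables.
n  <- 40
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

# Design matrix with a leading column of ones for b0.
X <- cbind(1, x1, x2)

# Solving the normal equations (X'X) b = X'y yields the b's that
# minimize sum((y - X %*% b)^2), the sum of squared residuals.
b <- solve(crossprod(X), crossprod(X, y))

cbind(b, coef(lm(y ~ x1 + x2)))  # the two columns agree, up to rounding
```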
So we have a little model chart, which is strikingly similar to our model chart for simple linear regression: we come up with a theoretical multiple regression model in which, at the conceptual level, you have β0, β1, up through βq as the unknown parameters. You then collect sample data, and that sample data allows you to compute the estimated multiple regression equation, with specific sample statistics b0, b1, up through bq. Those then provide estimates of what the theoretical, and therefore conceptual, β0 through βq are, which then updates your theoretical model.
So all of that is important, all of that is great, and in the estimation process for multiple regression, the estimated value of the dependent variable y is computed by substituting the values of the independent variables into the estimated multiple regression equation. Now, we do not do this by hand at all; in fact, almost no one ever even gives examples of doing it by hand. The equation is slightly jumbled here on the slide, but it is still good. So now, to go back to our equation: the estimated simple linear regression equation from before, ŷ = 1.2739 + 0.0678x, had an r² value of 0.6641, if you'll recall.
Now, if you take 1 minus that value, not the square root, that leaves you with 1 − r² = 0.3359. This implies that 33.59% of the variability in the sample travel times remains unexplained, which would seem to indicate that one or more independent variables, such as the number of deliveries, is missing from the model; the reason being that explaining only 66.41% of the variability in travel time is not very good. An estimated multiple regression equation with two independent variables for this could then take the form ŷ = b0 + b1x1 + b2x2, where ŷ is the estimated mean travel time, x1 is the distance traveled,
and x2 is the number of deliveries. Now, the SST, SSR, SSE, and r² are all computed using the formulas previously given. In particular, when we look at this model, and by this model I mean the one that I'm currently highlighting, the interpretation of b1 is that it is the effect of distance traveled on the estimated mean travel time, and the interpretation of b2 is that it is the effect of the number of deliveries on it.
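If you'd like to see this two-variable model in R, here is a minimal sketch. The miles and travel times are the ones from our table; the deliveries column is my assumption, matching the standard Butler Trucking example, so treat it as illustrative rather than definitive.

```r
miles      <- c(100, 50, 100, 100, 50, 80, 75, 65, 90, 90)
deliveries <- c(4, 3, 4, 2, 2, 2, 3, 4, 3, 2)  # assumed values, for illustration
time       <- c(9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1)

fit2 <- lm(time ~ miles + deliveries)
summary(fit2)  # shows b0, b1, b2, the r-squared, and the adjusted r-squared

# Substituting values of the independent variables gives a prediction:
predict(fit2, newdata = data.frame(miles = 85, deliveries = 3))
```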
Now, when we want to use multiple regression in Excel, we go to the Data Analysis ToolPak, and you can then just select multiple entries; for instance, as is being designated right here, you're selecting multiple columns for the x-range input. We will give a demonstration of that shortly. Excel then has a regression output that describes all of our points of interest: it gives the r² value, which is what we've been reading off. There's also an adjusted r² value; we'll come to talk about that eventually. It then, right here, gives us the coefficients: for instance, for the model we had just done, this would be the b0, this the b1, and this the b2. Now, what's the graphical interpretation of this kind of a model?
Now, we have a limitation: we can only graph models with up to exactly two independent variables, because beyond that the picture is very limiting. And that's because the graph is a plane, or, in higher dimensions, a hyperplane. So what is a plane? A plane is like a piece of paper floating in space, rather than a line floating in space. So the idea is that right here, if you pick a number of deliveries and a number of miles, say three deliveries and 70 or 75 miles, then there's an associated predicted point; let's say that's ŷ₇. And then there's y₇, the actual data point associated with it. There's then a difference between them, much the same as what we were saying when we had lines.
And next time we will then talk about inference and regression, which will lead us into a number of different new topics. So more on this next time.