Hello and welcome to lecture 44 of my class, From Data to Decisions. In this lecture we're going to use both Excel and R to do multiple regression. Then we'll do a few calculations of goodness of fit, that is, how good our regression model is at fitting the data, things that we talked about in the previous lecture.
So let's start with Excel. Here I've loaded up a set of data I call the body fat data. And it's an attempt to make measurement of a person's percentage body fat easier.
The most reliable way to measure body fat is to measure the density of your body. And then using something called the Siri equation, we can determine what percentage of your body is fat based on density. But density measurement is difficult.
It basically involves what's called hydrostatic weighing, or underwater weighing, where you take your whole body, immerse it in water, and get a displacement: Archimedes' principle. The displacement of water tells you the volume that you occupy. That allows you to calculate the density, which then allows you to calculate, through the Siri equation, the percent body fat. So here's a set of 252 subjects who went through this particular measurement process. They're all men.
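As a rough sketch of the chain just described (displaced volume, then density, then percent body fat), the following R snippet uses the commonly cited form of the Siri equation, percent fat = 495 / density - 450, with density in grams per cubic centimeter. The lecture names the equation but does not state it, so that form, the function names, and the example numbers are assumptions for illustration. Real hydrostatic weighing also applies corrections, for example for air left in the lungs, that are left out here.

```r
# Hypothetical helpers, not part of the lecture's spreadsheet or script.
body_density <- function(mass_kg, volume_l) {
  mass_kg / volume_l            # kg/L is numerically equal to g/cm^3
}

siri_percent_fat <- function(density_g_cm3) {
  495 / density_g_cm3 - 450     # commonly cited form of the Siri equation
}

# Made-up example: an 80 kg subject who displaces 75 L of water.
d <- body_density(80, 75)
siri_percent_fat(d)
```

In this form a density of 1.1 corresponds to 0 percent fat and a density of 0.9 to 100 percent, which is why small errors in the density measurement matter so much.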
And then a series of other measures were made. Age and weight and height, the obvious ones, but also circumference around your neck and your chest, your abdomen, hips, thighs, knees, ankles, biceps, forearms, and wrists. The idea was that if we can find a model that correlates very, very well with percent body fat, if we can predict the percent body fat using some of these other measures, then, because these other measures are much easier to come by in a doctor's office, we can get a reasonable measure of the percent body fat.
That's the idea anyways. So what we're going to do is start playing with this data and see if we can come up with a model that allows us to predict percent body fat. This model building idea we're going to talk a lot more about in the future, but right now I'm just going to show a couple of examples of multiple regression. Let's start with a single predictor variable, like we've done before. Here's a case where I'm going to take the percent body fat as my output and then I'm going to use abdomen circumference as the input.
Here's a plot, a scatter plot with a best fit line. Of course I use LINEST to calculate the slope and the intercept for this best fit ordinary least squares regression line. In our lectures we talked about various goodness of fit measures.
There's the adjusted R squared, and the Akaike information criterion, which is why we always say AIC, us Westerners having difficulty pronouncing that Japanese name. But there's also the Bayesian information criterion, BIC,
also called the Schwarz Bayesian criterion or SBC. And here are the equations that I use. I have them plugged into Excel, so we can calculate what each of these criteria is, as well as the plain old R squared that we get from LINEST.
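The spreadsheet formulas are not reproduced in this transcript. One common set of definitions, which appears consistent with the AIC values quoted in this lecture, is written out below as a sketch. Here n is the number of data points, p is the number of fitted parameters including the intercept, and RSS is the residual sum of squares; these symbols are chosen for illustration.

```latex
\begin{align*}
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - p} \\
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2p \\
\mathrm{BIC} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + p \ln n
\end{align*}
```

Other versions of the AIC and BIC add a constant that depends only on n, which does not change comparisons between models fit to the same data.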
But we can do other regressions. Here's a regression where we use the weight. You might think of weight as a good predictor of the percent body fat, but in fact it's not nearly as good as abdomen circumference. You see the R squared is 0.37. Before it was 0.66.
So we're definitely explaining a lot more of the variance with abdomen. There's a spelling error on my graph. Let me correct it in real time.
There we go. The copy of this spreadsheet that'll be on the web page will now have it spelled correctly.
So you see we're explaining 66% of the variance with this predictor variable. If I just use weight, I'm only explaining 37% of the variance.
I think you can understand why. There's tall people and short people. And short people don't weigh as much, even though they might have more body fat than a tall person who weighs more.
So you might think, well, maybe height and other things come into play. We're going to get into model building later. I'm just practicing regression now.
Here's another one, chest circumference. And you see that chest circumference is something in between. It has an R squared of about 0.5, so about 50% of the variation in body fat is explained by chest circumference. Besides R squared, we can also look at other metrics like AIC. Here it's 902. If I go to the weight, it's 955. And with our best predictor variable, abdomen circumference, it's 800. So those are several different single variable regressions.
What about a multiple regression? Well, to do a multiple regression we can still use LINEST. Here I have percent body fat in column D.
In columns B and C I have abdomen circumference and weight. What do I do? I do the exact same thing we've done before except for one little detail.
The y variable, the response, is still this one column, selecting all the percent body fat data. But now when I select the known x's, the inputs, I simply select all the columns that have the data.
So a requirement to make this work is that you have to put the columns for the different predictor variables next to each other, so they're all contiguous and I can just grab that whole range of columns and use it as the input to LINEST. Let's go ahead and run through that sequence real quick. So I'll cut this, then I will delete everything that I did already, and I'll paste it right there.
So LINEST, body fat, and then I select both of these columns of abdomen circumference and weight that we see here, and enter. We're going to do that same trick: hit F2, then Control-Shift-Enter, to fill out the array.
Now, instead of just doing two columns, I have to do three columns. Three columns, five rows, because of the parameters I now have.
I'm going to get a column for every parameter: the intercept, the coefficient of abdomen circumference, and the coefficient of weight. So I'll hit F2, Control-Shift-Enter, and it fills them all out. The legend is basically the same, but you have to understand that things kind of go backwards. The intercept comes last, and before it come the coefficients in reverse order: weight, then abdomen.
You see my columns were abdomen and weight going left to right. And here they go in the opposite direction. You just have to know that that's the way LINEST works.
So we have the slopes for both of those variables and the intercept, their standard errors, R squared, everything the same. I can then calculate the AIC, SBC, and the adjusted R squared, using the same formulas we've had before, just with a different number of parameters. Here I wrote a little formula to calculate the number of parameters as the difference between the number of data points and the degrees of freedom that LINEST gives me back. All right, three parameters in this case.
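To make the layout of that three-column, five-row LINEST array concrete, here is a sketch in R that rebuilds the same arrangement from an lm fit. It runs on a synthetic stand-in for the body fat data, since the real file is not part of this transcript; the data frame name bodyfat, the column names BodyFat, Abdomen, and Weight, and the numbers used to generate the data are all assumptions for illustration.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

fit <- lm(BodyFat ~ Abdomen + Weight, data = bodyfat)
s   <- summary(fit)

# LINEST lists coefficients in reverse column order, intercept last.
ord <- c("Weight", "Abdomen", "(Intercept)")
est <- unname(coef(s)[ord, "Estimate"])
se  <- unname(coef(s)[ord, "Std. Error"])

ss_resid <- sum(residuals(fit)^2)
ss_reg   <- sum((fitted(fit) - mean(bodyfat$BodyFat))^2)

# Five rows, as in the LINEST output; unused cells are NA here.
linest_like <- rbind(
  coefficients = est,
  std_errors   = se,
  r2_and_se_y  = c(s$r.squared, s$sigma, NA),
  F_and_df     = c(unname(s$fstatistic["value"]), df.residual(fit), NA),
  ss_reg_resid = c(ss_reg, ss_resid, NA)
)
colnames(linest_like) <- c("Weight", "Abdomen", "Intercept")
print(linest_like)

# Number of parameters = data points minus residual degrees of freedom.
p <- nobs(fit) - df.residual(fit)
p
```

With two predictors plus an intercept, p comes out as 3, matching the count in the spreadsheet.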
What happened to our AIC and SBC compared to the prior best regression? For the prior best regression it was 800, and now it is 756 for the AIC. Let me zoom in a little. So it went down.
We added an extra parameter, but our metric says we have a better model. When these information criteria go down, you have a better model. Adjusted R squared does the opposite: it's better when it's higher, and it went up as well.
But that is not always going to be the case. Sometimes I can add extra variables to our model and things get worse, not better. I'll show you an example. In regression 5 I've got again two variables, but this time it is chest and weight. So you might recall that when I did weight, I got an AIC of 955. When I did chest, I got an AIC of 902.9.
And when I combined the two of them together, the AIC went to 904. So chest and weight is better than weight by itself, but it's worse than chest by itself. So in fact, weight is making the regression worse. If we look at the coefficient for weight, it is negative, and its standard error, 0.028, is in fact bigger than the magnitude of the coefficient, 0.022, meaning that we can see very clearly that zero is within the confidence interval of the coefficient of weight. So statistically, we can't really say that this coefficient isn't zero, meaning that weight may not be affecting the model. Notice in both of these cases, when I added weight, the coefficient is negative.
But of course, when I plot body fat versus weight, it's obvious that higher weight correlates with higher body fat, and lower weight with lower body fat. And yet the model says negative coefficient. This is very interesting, and this is one of the complications of multiple regression. How do you interpret these individual coefficients?
Well, the cause is a correlation between the two predictor variables. Abdomen circumference and weight are, in fact, correlated with each other, and therefore their influence on the model is not independent; they are coupled. That will be an important subject of upcoming lectures. You need to understand that phenomenon.
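The sign flip described above can be reproduced with a small simulation. The sketch below builds synthetic data in which weight is strongly correlated with abdomen circumference and is given a negative coefficient once abdomen is accounted for. All numbers and the names bodyfat, BodyFat, Abdomen, and Weight are assumptions for illustration, so this shows the mechanism only and says nothing about the real body fat measurements.

```r
# Synthetic data, constructed so that the two predictors are correlated.
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

# How strongly the two predictors move together.
predictor_cor <- cor(bodyfat$Abdomen, bodyfat$Weight)

# Weight on its own: the slope is positive, because heavier subjects
# also tend to have larger abdomens.
slope_alone <- coef(lm(BodyFat ~ Weight, data = bodyfat))["Weight"]

# Weight alongside abdomen: the coefficient now describes the effect of
# weight at a fixed abdomen circumference, and it turns negative.
slope_joint <- coef(lm(BodyFat ~ Abdomen + Weight, data = bodyfat))["Weight"]

c(correlation = predictor_cor,
  weight_alone = unname(slope_alone),
  weight_with_abdomen = unname(slope_joint))
```

The two weight coefficients answer different questions, which is why they can have opposite signs without either regression being wrong.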
Right now we're just practicing how to calculate the AIC, SBC, and adjusted R squared, and how to do multiple regression. All right, let's do the same things in R. So here's my R framework, and I've loaded up a script I call multiple regression dot R.
It'll be on the website. I've taken that same body fat data, and I've put it into CSV form to make it easy to read.
You can easily do that, given the Excel spreadsheet of the body fat data that's on the web page. So let me run that line. You can see I created a data frame called body fat.
If I look at it, you see it's got subject density, body fat, age, weight, height, neck, chest, etc. All the same data that we saw before, 252 subjects. All right, we've loaded that in.
We can easily plot the data. Here's that graph. Looks just like the graph we saw in Excel, of course. I didn't do anything to make that pretty.
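The loading and plotting steps might look roughly like the sketch below. The lecture does not give the CSV file name or the exact column spellings, so this example writes a tiny made-up file to a temporary location and reads that instead; the file contents and the names bodyfat, BodyFat, and Abdomen are assumptions for illustration.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Subject,Density,BodyFat,Abdomen,Weight",
  "1,1.07,12.6,85.0,155",
  "2,1.05,21.4,95.5,190",
  "3,1.09,4.1,78.2,150",
  "4,1.04,26.0,101.3,210"
), csv_path)

# read.csv returns a data frame with one column per header in the file.
bodyfat <- read.csv(csv_path)
str(bodyfat)

# Plain scatter plot of percent body fat against abdomen circumference.
plot(bodyfat$Abdomen, bodyfat$BodyFat,
     xlab = "Abdomen circumference", ylab = "Percent body fat")
```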
Now we'll do a single regression, body fat versus abdomen. I can use this lm function with a formula that says body fat dollar sign body fat.
Remember, R is case sensitive, so lowercase b and lowercase f are different from the capitals.
I just happened to name this data frame body fat in lowercase. One of the columns in that data frame has the label body fat with a capital B and a capital F. That's the percent body fat data.
Then abdomen contains the abdomen circumference data. I can run the model. By the way, there's another way of formatting the equation using lm, by setting the data frame. We use this data equals option in the lm function. Saying data equals body fat tells it that I'm using the body fat data frame.
And when I type in y as a function of x in this formula, I can leave out the body fat dollar sign on each of these. That actually makes the names of the variables a bit easier to input and a little bit cleaner to look at.
So this is probably a better way to do it, just from the cleanliness of typing in your equations. It gives you the exact same results. I look at the summary of the model. I'll let you go back and double check.
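The two ways of writing the lm call can be sketched as follows. The data here are a synthetic stand-in, and the spellings bodyfat, BodyFat, and Abdomen are assumed from the spoken description rather than taken from the actual script.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

# Style 1: spell out the data frame on every variable.
model_a <- lm(bodyfat$BodyFat ~ bodyfat$Abdomen)

# Style 2: name the data frame once with the data argument.
model_b <- lm(BodyFat ~ Abdomen, data = bodyfat)

# Same fit either way; only the coefficient labels differ.
coef(model_a)
coef(model_b)

# Coefficients, standard errors, R squared, and adjusted R squared.
summary(model_b)
```

The second style also makes later steps such as prediction on new data simpler, because the variable names in the formula are not tied to one particular data frame.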
You see that you get the exact same set of numbers as we got with LINEST. Another very convenient function to use is confint, the confidence interval of the model. I set the level to 0.95, and then down here I get the low and high values of the confidence intervals of the two parameters.
Very, very useful. In this case, you see that the abdomen confidence interval does not include zero, so we can say with some confidence that it is a significant predictor variable. All right, let's go and look at how we can calculate these information criteria. Well, we can do it manually, right? We've got a formula.
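For the confidence intervals mentioned above, a minimal sketch of the confint call is shown here, again on synthetic stand-in data with assumed names (bodyfat, BodyFat, Abdomen). The check for whether an interval excludes zero is an added illustration, not something taken from the lecture's script.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

model <- lm(BodyFat ~ Abdomen, data = bodyfat)

# One row per parameter; columns are the lower and upper 95% limits.
ci <- confint(model, level = 0.95)
ci

# TRUE where the whole interval lies on one side of zero.
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
excludes_zero
```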
I can simply plug things in and calculate the formula, and here I've shown that. So n is the number of observations; nobs of the model tells me there are 252. p, I said, is the length of the coefficient array, the same thing as the number of parameters. And then here's a formula. I'll call it my AIC, a formula for a specific AIC calculation.
Here I get 1517.977. All right, fortunately you don't have to do that. You don't have to manually type in the formula because there's a function already built into R.
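The exact formula typed into the script is not visible in this transcript. The sketch below uses one formula that reproduces what R's AIC function returns for a linear model: minus twice the Gaussian log-likelihood plus 2 times the number of estimated parameters, where the residual variance is counted as one extra parameter (hence p + 1). The data are synthetic and the names are assumptions, so the value will not be 1517.977.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n_subjects <- 252
Abdomen <- rnorm(n_subjects, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n_subjects, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n_subjects, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

model <- lm(BodyFat ~ Abdomen, data = bodyfat)

n   <- nobs(model)                 # number of observations
p   <- length(coef(model))         # intercept plus slope
rss <- sum(residuals(model)^2)     # residual sum of squares

# The "+ 1" counts the residual variance as an estimated parameter.
aic_manual <- n * log(2 * pi) + n * log(rss / n) + n + 2 * (p + 1)
aic_manual

# Built-in equivalent for comparison.
AIC(model, k = 2)
```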
In fact, there's a couple of them. We're going to use them both. One function is called AIC. You give it the model, and then you give it this parameter k, which we set equal to 2. This is the penalty parameter that multiplies p. So k equal 2 means we're adding 2p, like our normal AIC formula. But as I'll show you in a minute, we can put in other values of k.
But if you want the AIC, then you want k equal 2. All right. I run that, and, lo and behold, I get the exact same number as I did when I did the calculation manually, as you might expect. There is another function called extractAIC that also gives you the AIC, with k equal 2 as its default.
It can also be used to pull that number out in R. However, it gives you a different number. Let me just go down and run that.
If I do extractAIC, I get 800.8319. What's the difference? Well, all of these information criteria are defined only to within an additive constant.
Different formulas for them simply use different additive constants. So in fact here is the formula that the extractAIC function uses, instead of the one I used before up here. And if I run that manually, I see over here I get the same 800.8319 that I got before. So it simply uses a different formula.
They differ only by an additive constant. We're only going to use these AIC metrics for comparing one model to another on the same data set.
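The additive-constant point can be checked directly. In the sketch below, the manual function follows the formula that extractAIC uses for lm fits, n log(RSS/n) + 2p, and the offset between AIC and extractAIC works out to n(1 + log 2 pi) + 2, which depends only on the number of observations. The data are synthetic and the names bodyfat, BodyFat, Abdomen, Weight, and extract_manual are assumptions for illustration.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

m1 <- lm(BodyFat ~ Abdomen, data = bodyfat)
m2 <- lm(BodyFat ~ Abdomen + Weight, data = bodyfat)

# Hypothetical helper implementing n * log(RSS / n) + 2 * p.
extract_manual <- function(m) {
  n_obs <- nobs(m)
  p     <- length(coef(m))
  n_obs * log(sum(residuals(m)^2) / n_obs) + 2 * p
}

extractAIC(m1)       # two numbers: equivalent degrees of freedom, then AIC
extract_manual(m1)

# The offset between the two definitions is the same for both models.
offset1 <- AIC(m1) - extractAIC(m1)[2]
offset2 <- AIC(m2) - extractAIC(m2)[2]
c(offset1, offset2, n * (1 + log(2 * pi)) + 2)

# So the change in AIC between models agrees under either definition.
AIC(m1) - AIC(m2)
extractAIC(m1)[2] - extractAIC(m2)[2]
```

For 252 observations that offset is about 717.1, which matches the gap between the 1517.977 and 800.8319 quoted in the lecture.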
So this additive constant won't matter. We look at the difference, the change in AIC. We can also calculate Schwarz's Bayesian information criterion, the SBC or BIC, and again we'll use the AIC function to do that, but instead of k equal 2, we'll use k equal log n. Remember that the BIC has a penalty for complexity, which is p times log n.
k is the thing that multiplies p in this formula, so I simply put k equal to log n, and I've got my BIC calculation, which happens to be 1528. So you can compare the AIC to the BIC. The BIC is going to be bigger because you're penalizing the number of parameters more heavily: with 252 observations, log n is about 5.5 instead of 2. Finally, we have a formula for the adjusted R squared, and we can calculate what that is, 0.6613. But you don't really need to do that because the summary of the model gives it to you. If you look at our model summary, among all of this stuff, one of the outputs is the adjusted R squared. So it's automatically calculated and dumped out every time you look at the model summary. All right, that's the AIC, BIC, and adjusted R squared.
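Both calculations just described can be sketched in a few lines. The BIC is obtained from the AIC function with k set to log n and compared with R's built-in BIC function, and the adjusted R squared is computed from the usual formula, 1 - (1 - R squared)(n - 1)/(n - p), and compared with the value stored in the model summary. The data are synthetic and the names are assumptions, so the numbers will differ from those in the lecture.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n_subjects <- 252
Abdomen <- rnorm(n_subjects, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n_subjects, sd = 15)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n_subjects, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight)

model <- lm(BodyFat ~ Abdomen, data = bodyfat)
n <- nobs(model)
p <- length(coef(model))

# BIC: same function as the AIC, with the penalty multiplier set to log(n).
bic_via_aic <- AIC(model, k = log(n))
bic_builtin <- BIC(model)
c(bic_via_aic, bic_builtin)

# Adjusted R squared by hand, then the value the summary already reports.
r2             <- summary(model)$r.squared
adj_r2_manual  <- 1 - (1 - r2) * (n - 1) / (n - p)
adj_r2_summary <- summary(model)$adj.r.squared
c(adj_r2_manual, adj_r2_summary)
```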
That was regression with a single predictor. How about if we add more parameters and do a multiple regression? It turns out to be very easy to do. So here's my original equation that I used, body fat versus abdomen. What if I simply said plus weight?
Now I'm going to use a capital W. I have to remember to go back and look at the data file here to see, and you'll notice that weight has a capital W. So I need to make sure I use a capital W when I write weight here.
So that's all I have to do. Abdomen plus weight. Now, inputting formulas into R for this lm function is a little different than what you'd expect. When you use the plus sign, you're not really adding in terms of the arithmetic operation of adding these two numbers together.
Instead, you're adding weight as a variable. So that's really what the plus sign means. It doesn't mean adding a number.
It means adding that variable. If I run that and look at the summary of the model, notice now that my table of coefficients has the intercept, and then the thing that multiplies the abdomen and the thing that multiplies the weight. If you go and check with the Excel that we've already run, you'll see these coefficients are exactly the same. I can put in any number of variables. I could add chest if I wanted to.
Plus chest, I think it has a capital C as well. I can run that model, look at the summary, and you see now I have four parameters: chest, weight, and abdomen, plus an intercept. So that's all it takes in R to add more predictor variables to your model.
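Here is a sketch of how the formula grows as predictors are added, on synthetic stand-in data with assumed names (bodyfat, BodyFat, Abdomen, Weight, Chest). The last model is an added illustration of the point about the plus sign: wrapping an expression in I() asks R for ordinary arithmetic, so the sum becomes a single predictor with a single coefficient.

```r
# Synthetic stand-in for the body fat data (not the real measurements).
set.seed(1)
n <- 252
Abdomen <- rnorm(n, mean = 92, sd = 10)
Weight  <- 2.5 * Abdomen - 50 + rnorm(n, sd = 15)
Chest   <- 27 + 0.8 * Abdomen + rnorm(n, sd = 5)
BodyFat <- -42 + 0.9 * Abdomen - 0.12 * Weight + rnorm(n, sd = 4)
bodyfat <- data.frame(BodyFat, Abdomen, Weight, Chest)

# In a formula, "+" means "also include this variable as a predictor".
m2 <- lm(BodyFat ~ Abdomen + Weight, data = bodyfat)
m3 <- lm(BodyFat ~ Abdomen + Weight + Chest, data = bodyfat)

# I() switches back to arithmetic: one predictor equal to the sum.
m_sum <- lm(BodyFat ~ I(Abdomen + Weight), data = bodyfat)

names(coef(m2))      # intercept, Abdomen, Weight
names(coef(m3))      # intercept, Abdomen, Weight, Chest
names(coef(m_sum))   # intercept and a single combined term

summary(m3)
```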
You just put them in like this. Now there are going to be other things we can do, like interactions, and we're going to talk more about that coming up. But here you have the basics of how to do a multiple regression in R and how to extract the adjusted R squared, AIC, and BIC. That's the lecture. Next time we're going to get busy talking about a very difficult and important problem in multiple regression, multicollinearity. Until then.