Transcript for:
Understanding Linear Regression Techniques

Welcome to Statistical Analysis 7. Now we're moving into what I think is one of the more fun topics in statistical analysis, because it lets you use more than two variables: linear regression. Let's start by defining what a regression analysis is. Regression analysis is a statistical technique used to evaluate the relationship between a continuous dependent variable, so your outcome variable, and one or more independent variables, or predictors. That's the exciting part: now that we've learned a lot of the basics, we can consider more than one variable that might impact our outcome.

The main objectives of regression analysis are really twofold. First, we evaluate the overall model fit. This tells us whether the entire set of independent variables, all of the predictor variables that you choose to include in your model together, significantly predicts the dependent variable. This uses the same kind of test as the F test in ANOVA, which determined whether there's an overall effect of these variables. And just as a fun fact, ANOVA is actually mathematically equivalent to regression if you're using data that's appropriate for ANOVA. Second, and often what we're most interested in, we measure the unique effect of each predictor on the outcome variable. This is reflected by what we call a beta coefficient, and these beta coefficients are assessed with a t-test to determine whether each independent variable significantly predicts the outcome on its own. These beta coefficients also provide the direction and magnitude of the effect. So you can see why we're learning this at the end of the course: it has components of the t-tests and ANOVA that you had to learn first.

Now let's move on to the types of statistical methods that fall within the umbrella of regression analyses. In regression analysis, we have two primary families of statistical methods, the general linear model and the generalized linear model, conveniently named to be easily confused. Sorry, but you can always refer back to this slide to tell the difference between these two types if you need to. The general linear model includes linear regression; it has the assumption of normality of residuals and requires a continuous dependent variable. We have these requirements listed here for you to refer to, but other analyses that fall under the umbrella of the general linear model include ANOVAs, which we learned a few weeks ago, and other types related to ANOVA that we didn't cover, like ANCOVAs, MANOVAs, and MANCOVAs. On the other hand, we have the generalized linear model. This allows for non-continuous data, like nominal and ordinal data, and it does not assume normality. This family includes methods like binary logistic regression and multinomial logistic regression, types of regression that we'll touch on next week. And just so you know, when you run into these models in articles, you'll hear about a link function. The dependent variable in a generalized linear model is actually transformed using this link function, so it creates a new variable that can be modeled linearly, even though the data themselves are nonlinear. The catch is that this transformation changes how we interpret the coefficients, because the results are now based on the transformed scale rather than the original units.
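If you're curious what this distinction looks like outside of Intellectus, here's a minimal Python sketch using statsmodels. The data set and column names are made up purely for illustration; the point is just that the general linear model fits the continuous outcome directly, while the generalized linear model runs the outcome through a link function (here, the logit).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Made-up illustration data, not the course data set.
rng = np.random.default_rng(0)
df = pd.DataFrame({"self_acceptance": rng.uniform(0, 100, 200)})
df["personal_growth"] = 10 + 0.5 * df["self_acceptance"] + rng.normal(0, 5, 200)          # continuous outcome
df["high_growth"] = (df["personal_growth"] > df["personal_growth"].median()).astype(int)  # binary outcome

# General linear model: linear regression (OLS) on a continuous dependent variable.
ols_fit = smf.ols("personal_growth ~ self_acceptance", data=df).fit()

# Generalized linear model: binary outcome modeled through a logit link function.
glm_fit = smf.glm("high_growth ~ self_acceptance", data=df,
                  family=sm.families.Binomial()).fit()  # the Binomial family uses the logit link by default

print(ols_fit.params)  # coefficients in the original units of personal growth
print(glm_fit.params)  # coefficients on the log-odds scale, because of the link function
```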
You'll have a better understanding of link functions next week when you learn them through an example logistic regression, but I just wanted to give you a heads-up before you get there. Now let's go into the choice of which type of regression model you'd need to use for your data. As usual, your choice of model depends on how your dependent variable is measured. In this case, if your dependent variable is a scale variable, you'll use linear regression, which is our focus for this week. For anything else, you'd use the generalized linear model range of approaches, and these are the specific types of logistic regression that each depend on how many categories your nominal outcome variable has, or whether you have an ordinal outcome variable. These are all generalized linear models and types of logistic regression, which again we'll cover in your following week. But for now, we're sticking to the umbrella of the general linear model, and specifically linear regression.

So let's get into our linear regression. The analytical definition of linear regression is a statistical test used to determine whether a set of predictor variables does a good job of predicting a continuous dependent variable. So again, this tells you that you need a continuous dependent variable, and you can have more than one predictor variable. In this case, if you want to determine whether both age and weight predict systolic blood pressure, linear regression is the appropriate analysis, since we're trying to determine whether more than one continuous predictor effectively predicts a continuous outcome variable, which is systolic blood pressure. We can also use it with a single continuous predictor variable predicting a continuous outcome, like in this example, where the hospital administrator wants to understand if there's a predictive relationship between self-acceptance and personal growth among their employees. They might want to learn this to see whether they should be encouraging self-acceptance in order to promote personal growth. The important point here is that she wants to predict an outcome. That is the key word that's often your clue that regression is the analysis you'd want to use: you're aiming to predict an outcome.

As we mentioned, linear regression has certain required assumptions, as usual for parametric tests, and here's the list for you to refer back to. You can note that only the first one is completely new to you, but there are slight differences in the homoscedasticity assumption and the normality assumption here compared to other analyses we've run, and we'll go into those specific differences as well. Multicollinearity is a new term for us, and it means that two or more predictor variables are highly correlated with each other. When you have multicollinearity, it can actually distort the results of a regression analysis. So we want our predictors to be related to the dependent variable, but not too closely related to each other. Multicollinearity can be assessed using variance inflation factors, or VIFs; values greater than 10 indicate high multicollinearity. And we note here, again, that Intellectus automatically checks this for you. Another method you can use is just a correlation analysis: if you run a correlation between your variables and two predictors have a correlation coefficient of 0.9 or greater, they're considered highly collinear, so that would violate this assumption.
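If you ever want to check multicollinearity yourself rather than relying on Intellectus, here's a rough Python sketch of both approaches, VIFs and a predictor correlation matrix. The predictors and data are made up for illustration, and the cutoffs (VIF greater than 10, r of 0.9 or above) are just the rules of thumb from this lecture.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Made-up predictors; years_employed is deliberately built to correlate with age.
rng = np.random.default_rng(1)
n = 200
age = rng.uniform(25, 65, n)
weight = rng.normal(80, 12, n)
years_employed = age - 25 + rng.normal(0, 2, n)
X = pd.DataFrame({"age": age, "weight": weight, "years_employed": years_employed})

# Variance inflation factors: values above 10 flag high multicollinearity.
X_const = sm.add_constant(X)
for i, name in enumerate(X_const.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X_const.values, i), 2))

# Pairwise correlations: any pair of predictors at r >= 0.9 would also violate the assumption.
print(X.corr().round(2))
```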
So yeah, again, it's really nice that Intellectus streamlines this part of testing the assumptions, because linear regression requires the absence of multicollinearity; you don't want collinearity among your predictors if you're going to run a linear regression. Okay, let's review the other requirements. As a reminder, the homoscedasticity assumption means that the variance should be consistent across all levels of the independent variables. For linear regression, homoscedasticity specifically refers to the residuals. The residuals are the differences between the observed data values and the model-predicted values, and we need those residuals to have equal variance across all predictor values. Since this particular version of homoscedasticity is based on residuals, it's again nice that Intellectus does this fairly complicated work for us. As usual, this assumption can be tested both analytically, using that Levene's test, and visually, using scatter plots. But again, we'll let Intellectus check this with the Levene's test, because it's run specifically on the residuals, the difference between your observed data values and the model-predicted values.

Okay, moving on to independent observations: this is exactly the same assumption we've had for most of our parametric tests, and it's one that's not automatically checked by Intellectus because it has to do with the measurement design. So you do have to state whether you can reasonably assume that one observation doesn't influence the value of another. Normality, again, is the assumption that the data follow a normal distribution, as you've seen before. And you've also seen the central limit theorem, which helps us here because, with a large enough sample, the mean of any continuous distribution will approach normality. But if the data deviate far enough, the sample size needed to meet this assumption also has to increase even more. This assumption, though, is slightly different for regression than it has been the other times we've used the normality assumption, again because it applies to the residuals. So we're requiring the residuals of the model, the values of the data minus the predicted values, to be what actually follows that normal distribution. Again, we'll let Intellectus test this for us, because it can run the test on the residuals.

And finally, the outlier assumption requires that there are no significant outliers in your data. Here's a reminder from our previous lesson about how to identify an outlier: you can either do it visually, with a scatter plot where you might visually identify a point as an outlier, or you can do it analytically, by checking whether any data points are more than 3.29 standard deviations from the mean of your data. Again, Intellectus will automatically check for outliers when you ask it to run a linear regression, so that's also helpful, but you do have to remember that this is one of the assumptions for this test.
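If you want a rough sense of what those residual-based checks involve behind the scenes, here's a minimal Python sketch. The data and column names are invented for illustration, and the particular checks shown (Shapiro-Wilk for normality of residuals, a crude numeric stand-in for eyeballing residual spread, the 3.29 standard-deviation cutoff for outliers) are just common ways to do it; Intellectus runs its own versions of these for you.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Made-up stand-in data; column names just mimic the lecture example.
rng = np.random.default_rng(2)
df = pd.DataFrame({"self_accept_pre": rng.uniform(0, 100, 150)})
df["personal_growth_pre"] = 13 + 0.5 * df["self_accept_pre"] + rng.normal(0, 6, 150)

fit = smf.ols("personal_growth_pre ~ self_accept_pre", data=df).fit()
resid = fit.resid  # observed values minus model-predicted values

# Normality of residuals: one common analytic check is the Shapiro-Wilk test.
print(stats.shapiro(resid))

# Homoscedasticity: plot fit.fittedvalues against resid and look for a constant spread;
# a crude numeric stand-in is whether the size of the residuals trends with the fitted values.
print(np.corrcoef(fit.fittedvalues, np.abs(resid))[0, 1])

# Outliers: flag any standardized residuals beyond +/- 3.29.
z = (resid - resid.mean()) / resid.std()
print(df[np.abs(z) > 3.29])
```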
Now that we've covered the assumptions, we can talk about how to formulate a research question for linear regression. The research question should simply ask: do the independent variables predict the continuous dependent variable? For this research question, we would write out the null hypothesis as the independent variables are not related to the dependent variable, or you can say the independent variables do not predict the dependent variable. The alternative hypothesis would be that the independent variables are related to the dependent variable, or you can say the independent variables do predict the dependent variable. Let's apply this to the example where the hospital administrator wants to determine whether there's a predictive relationship between self-acceptance and personal growth. The research question would then be: does a person's self-acceptance score predict their personal growth score? The null hypothesis here would be that the self-acceptance score is not related to the personal growth score, or self-acceptance scores do not predict personal growth scores, while the alternative hypothesis would be that self-acceptance scores are related to personal growth scores, or self-acceptance predicts personal growth.

Now that we have our research question and hypotheses set, we can discuss how to interpret the output of a linear regression analysis. This is what the output table for our results would look like, and it was modeled from the hospital professional data set using the personal growth (pre) variable and the self-acceptance (self-accept pre) variable, so you could run this yourself if you'd like. As we can see, it's very important to get your independent and dependent variables right here, especially if you're going to add more than one independent variable, which you could. We could have said we believe personal growth is predicted by self-acceptance, and let's throw gender in there, maybe also age, and maybe also how many years of employment in the hospital might also predict personal growth. You could put all of those in as independent variables, and this table would just be expanded to have more rows with each of these statistics for every single independent variable you include.

Okay, so now we'll go over the key statistics we see in this table, and the main points will be written out for your reference on the next slide. First we have the beta coefficient, B, which represents the effect of a predictor on the outcome variable. Here we see that a one-unit increase in the predictor, self-acceptance, will result in a 0.50-unit increase in personal growth. That's how you interpret this: the beta means every one-unit increase in your predictor results in this amount of change in your outcome. The next point is the standardized beta coefficient, which is written as the actual Greek beta as opposed to a B, and it measures the effect of the predictor in standard deviations. The benefit of having a standardized beta is that it allows you to compare the relative effect sizes of different predictors measured in different units, so that's where it would really come in handy if you had more than one predictor here. But we're keeping it simple for our first example. The standard error tells us about the uncertainty surrounding these beta coefficients, and Intellectus then uses the standard error to calculate the confidence interval and the t-statistic. The 95% confidence interval, remember, provides an estimate of the range within which the true value of the beta coefficient is likely to fall; because we're estimating beta, the 95% confidence interval is about where the true value of beta will fall. And then we get a t-statistic for every predictor variable we run, which is used in turn to calculate the p-value, which indicates whether the predictor significantly influences the outcome variable.
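If you wanted to pull those same statistics yourself instead of reading them off the Intellectus table, here's a minimal Python sketch with statsmodels. The data are simulated and the column names are just placeholders mimicking the hospital professional data set, so the numbers won't match the table; the point is where each statistic comes from.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the hospital professional data set (placeholder column names).
rng = np.random.default_rng(3)
df = pd.DataFrame({"self_accept_pre": rng.uniform(0, 100, 50)})
df["personal_growth_pre"] = 13.24 + 0.50 * df["self_accept_pre"] + rng.normal(0, 8, 50)

fit = smf.ols("personal_growth_pre ~ self_accept_pre", data=df).fit()

print(fit.params)      # unstandardized B: change in outcome per one-unit change in the predictor
print(fit.bse)         # standard error of each coefficient
print(fit.conf_int())  # 95% confidence interval for each coefficient
print(fit.tvalues)     # t-statistic for each coefficient
print(fit.pvalues)     # p-value for each coefficient

# Standardized beta: refit after z-scoring both variables, so the slope is in standard-deviation units.
z = (df - df.mean()) / df.std()
std_fit = smf.ols("personal_growth_pre ~ self_accept_pre", data=z).fit()
print(std_fit.params)  # the slope here is the standardized (Greek) beta
```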
And in this case, the p-value is less than 0.05, so self-acceptance does significantly predict personal growth. You might wonder why we also have an intercept. That is where your value of personal growth would be if your self-acceptance were at zero. So we start off with a value of 13.24, but we don't typically need to interpret the intercept, and that's often why it's included in parentheses here. Our real interest is in whatever actual predictor variables we put in our model. Okay, finally, we go down here to the R-squared value. This actually tells us the percentage of variance in our outcome that's explained by the model overall. So in this model, 35% of all the variance in personal growth was explained by self-acceptance. That's actually quite huge: to think that just over a third of any variability in personal growth is explained by these individuals' self-acceptance.

This is the write-up of the statistics in the table from the previous slide that I just discussed, and we've included it here so you can refer back to it as you like. I do want to mention that you get a t-statistic for each individual predictor variable, which tests the null hypothesis that the beta coefficient is zero. What that means is that every t-test for each specific predictor is trying to reject the null hypothesis that a one-unit increase in the predictor predicts zero change in the outcome. This is in contrast to the F-test, which is testing the overall model fit, so whether your predictors explain any variance in the dependent variable. Specifically, the mathematical null hypothesis of this F test is that the regression model overall reduces the variance in the outcome by zero, and that's what we're trying to reject. Those are the more technical definitions of what's being tested here. But then, as we mentioned, we had a significant F test, and the R-squared value was 0.35, so that tells you that the model in fact explained 35% of the variance in our outcome, and overall, our set of predictors, which in this case was only one, did significantly reduce the variance in the outcome; it did significantly, essentially, explain the outcome variable.

Okay, so now let's look at the regression equation. Here's your warning that equations can look scary, but we'll go through it piece by piece. The unstandardized regression equation that we see here allows us to predict the dependent variable based on the independent variable or variables. In our example, the regression equation for predicting personal growth based on self-acceptance would be this y-hat, and y-hat is just what we use to denote the outcome variable: y-hat equals 13.24 plus 0.50 times self-acceptance. What this means is that you can insert any reasonable value for self-acceptance, meaning one that fits within the range of your actual measurement: if you measured self-acceptance on a 0 to 100 scale, you wouldn't put in a 200 here; you'd want to stay within that 0 to 100. But you can insert any such value here and see what the model would predict for personal growth. So let's start with somebody who scores a 0 on self-acceptance. If we put a zero here, multiplying 0.50 by zero is still zero, so someone who has zero self-acceptance would score a 13.24 on personal growth according to our model.
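Just to make that concrete, here's the example equation written as a tiny Python function; the 13.24 and 0.50 are the intercept and beta from this one example output, not general values.

```python
def predicted_personal_growth(self_acceptance: float) -> float:
    """Prediction from the example model: y-hat = 13.24 + 0.50 * self-acceptance."""
    return 13.24 + 0.50 * self_acceptance

print(predicted_personal_growth(0))  # 13.24: with zero self-acceptance, the prediction is just the intercept
```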
And then, as another example, someone who scores a 35 on self-acceptance: you would put 35 into the equation here, and that multiplied by 0.50 and then added to 13.24 equals a 30.74 score on that personal growth scale. Okay, we should note that in publications, you'll usually see a plus e right at the end of this equation, and that indicates error, which accounts for any additional variance the model misses. Intellectus doesn't provide this information directly, so we just wanted to let you know why there might be an additional letter in equations like this. And again, this might look scary, but y-hat just stands for your outcome variable, in this case personal growth. Beta zero is just that constant number, the value of the outcome when all predictor variables are equal to zero; that's our constant value here, beta 0. In this equation we only had a beta 1 and an x1 because we only had one x variable, one independent variable. This beta 1 is 0.50, and x1 is our self-acceptance. If we had, as I mentioned, included age and gender, beta 2 would be the beta coefficient for the effect of age, age would be our x2, beta 3 would be the beta coefficient for the effect of gender, and x3 would be where we would insert our value for gender. So you made it through the equations, and that's pretty impressive.

This is how the results would be written up from this regression. The results of the linear regression model were significant, with F(1, 48), those being the degrees of freedom that would come from your model, equal to 25.90, which is the F statistic. Then you would put the p-value for the F statistic and the R-squared. So this is the overall model fit that you're writing out here, and then you would note that approximately 35% of the variance in personal growth was explained by the self-acceptance variable, because that was the only variable we put in our equation; if we had more than one, we would list them here too. But that 35%, again, comes from the R-squared. Then we follow it up by actually interpreting each independent variable, and again we only have one in this example just to keep it simple. So we have: self-acceptance significantly predicted personal growth, and we include the beta value of 0.50. Then we put in the t-statistic with its degrees of freedom, and we write the actual value for the t-statistic and the p-value. And then we have the interpretation: this indicates that, on average, a one-unit increase in self-acceptance leads to an increase in personal growth of 0.50 units, and you get that 0.50, again, from the beta value.

Okay, you made it through regression. As a reminder, regression needs a scale (continuous) dependent variable, but you can use any type of predictor variable. I really encourage you to try this out on your own; it's fun to see the effects of different variables and to include them together in the same model, even though we kept it simple for this first example and put in just one predictor variable. And again, this analysis is used to assess whether a set of predictor variables does a good job of predicting that continuous outcome variable.
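If you do want to experiment with a bigger model, here's a rough Python sketch of what adding more predictors looks like. The data, the extra predictors (age, gender, years employed), and their effects are all invented just to show the mechanics, which follow exactly the beta 0 plus beta 1 x1 plus beta 2 x2 and so on structure described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented data with several predictors, just to show a multi-predictor model.
rng = np.random.default_rng(4)
n = 100
df = pd.DataFrame({
    "self_accept_pre": rng.uniform(0, 100, n),
    "age": rng.uniform(22, 65, n),
    "years_employed": rng.uniform(0, 30, n),
    "gender": rng.choice(["female", "male"], n),
})
df["personal_growth_pre"] = (13 + 0.5 * df["self_accept_pre"] + 0.1 * df["age"]
                             + rng.normal(0, 8, n))

# One row of B, SE, CI, t, and p per predictor, just like the expanded output table would show.
fit = smf.ols("personal_growth_pre ~ self_accept_pre + age + years_employed + C(gender)",
              data=df).fit()
print(fit.summary())
```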
The key statistics, then, include this beta, B, which is the unstandardized beta coefficient representing the effect of the predictor on the outcome: every one-unit increase in the predictor is associated with a beta amount of change in the outcome. Then we have the standard error, which is how much the beta coefficient is expected to vary. We have the 95% confidence interval as usual, which estimates the plausible range of the beta coefficient in 95% of repeated samples. Then we have our standardized beta coefficient, which allows for comparisons of effect size across different predictors, no matter what their units are. Our t-statistic, which we've used in the past, is then used to calculate the p-value, which is the probability of obtaining the observed result if the null hypothesis is true; a significant p-value is, again, usually p less than 0.05.

And now, with all of this understanding, you can apply linear regression in your own analyses, which you will have to do with this week's quiz. So good luck, and wishing you a happy end of the semester, as this is the last lecture from me, aside from, of course, any Zoom meetings you would like to have or email contact. So, yeah, it's been a wonderful semester. Next week you'll do logistic regression, and other than that, you'll work on your final assignment.