Transcript for:
Understanding Multicollinearity in Regression

Hello and welcome! In this video I explain what multicollinearity is and how you can check for it online. And we get started right now. So the first question is: what is multicollinearity? Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem with multicollinearity is that the effects of the individual variables cannot be clearly separated.

Let's look at the regression equation again. We have the dependent variable here and the independent variables with their respective coefficients. For example, if there is a high correlation between x1 and x2, or if these two variables are almost equal, then it is quite difficult to determine b1 and b2. If both variables are completely equal, the regression model cannot determine b1 and b2 at all. This means that the regression model becomes unstable.

If you want to use the regression model for prediction, multicollinearity does not matter. In a prediction you are only interested in how good the prediction is, not in how big the influence of each variable is. However, if the regression model is used to measure the influence of the independent variables on the dependent variable, there must be no multicollinearity; otherwise the coefficients cannot be interpreted meaningfully.

So the next question is: how can we diagnose multicollinearity? If we look at the regression equation again, we have the variables x1, x2, and so on up to xk. We now want to know whether x1 is nearly identical to any other variable, or to a combination of the other variables. To do this, we simply set up a new regression model in which x1 is the dependent variable. If we can now predict x1 very well from the other independent variables, we don't need x1 anymore, because we can use the other variables instead. If we used all the variables anyway, the regression model could become very unstable.
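The instability described above can be illustrated with a short simulation. This is a hypothetical sketch using NumPy, with simulated data rather than the video's example: two almost identical predictors make the individual coefficients b1 and b2 jump around, even though their sum stays stable.

```python
import numpy as np

# Hypothetical illustration (simulated data, not from the video):
# two nearly equal predictors make b1 and b2 individually unstable.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost equal to x1
y = 2 * x1 + 3 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])  # intercept, x1, x2

# Fit ordinary least squares twice, with a tiny perturbation of y
b_a, *_ = np.linalg.lstsq(X, y, rcond=None)
y_b = y + rng.normal(scale=0.1, size=n)
b_b, *_ = np.linalg.lstsq(X, y_b, rcond=None)

print("fit A coefficients:", b_a)
print("fit B coefficients:", b_b)
# The individual slopes can swing between fits, but their sum
# stays close to 2 + 3 = 5, because only the sum is identified.
print("sum of slopes A:", b_a[1] + b_a[2], " B:", b_b[1] + b_b[2])
```

Because x1 and x2 carry almost the same information, only their combined effect is well determined; how the model splits it between b1 and b2 is essentially arbitrary, which is exactly why the coefficients cannot be interpreted individually.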
In mathematics, we would say that the equation is overdetermined. We could now do this for all the other variables: we estimate x2 from the other variables, and so on, up to estimating xk from the other variables. In this case we have k new regression models. For each of these regression models we calculate the tolerance and the variance inflation factor. The tolerance is 1 minus R squared, where R squared is the coefficient of determination, i.e. the explained variance. The variance inflation factor is 1 divided by (1 minus R squared). Multicollinearity may exist if the tolerance is smaller than 0.1 or, equivalently, if the variance inflation factor is larger than 10.

And now I will show you how you can easily check these requirements online. To do this, please visit datatab.net and click on the statistics calculator. If you want to use your own data, just click on "Clear table"; I will use the example data now. To perform a regression, just click on the "Regression" tab. On the left side you can choose your dependent variable, and on the right side your independent variables. In our example we choose salary as the dependent variable, and gender, age, and weight as the independent variables. Now we can click on "Check conditions" and we get the results of the condition checks. First we start with linearity, then we see the normality of the errors. Next come the multicollinearity checks, with the tolerance and the variance inflation factor, and finally we can see the test of homoscedasticity. This is how easily you can check the requirements for a linear regression model.

Another important topic when talking about regression models is dummy variables. If you want to learn more about dummy variables, just continue with the next video. See you soon!
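As a recap of the tolerance and variance inflation factor described above, here is a minimal sketch that computes both via auxiliary regressions: each predictor is regressed on all the others, and then tolerance = 1 − R² and VIF = 1 / (1 − R²). The data and the helper name `tolerance_and_vif` are hypothetical, not the video's salary example.

```python
import numpy as np

def tolerance_and_vif(X):
    """For each column j of X, regress x_j on the remaining columns.
    Returns tolerance_j = 1 - R^2_j and VIF_j = 1 / (1 - R^2_j)."""
    n, k = X.shape
    tol, vif = [], []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        tol.append(1 - r2)
        vif.append(1 / (1 - r2))
    return np.array(tol), np.array(vif)

# Simulated predictors: x1 and x2 strongly correlated, x3 independent
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

tol, vif = tolerance_and_vif(X)
print("tolerance:", tol)  # small for x1 and x2, near 1 for x3
print("VIF:", vif)        # large for x1 and x2, near 1 for x3
```

With these simulated data, x1 and x2 fall below the tolerance threshold of 0.1 (equivalently, their VIF exceeds 10), flagging multicollinearity, while the independent x3 does not.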