Do you finally want to understand regression analysis? Then this is the place for you. My name is Hannah, and welcome to this full course about regression analysis. After watching this video you will know what regression analysis is and what the differences between simple and multiple linear regression are. Further, you will be able to interpret your results, understand the assumptions for a linear regression and the use of dummy variables, and at the end I explain the basics of logistic regression. Of course, I'll also show you how to calculate a regression easily online. Let's get started.

A regression analysis allows you to infer or predict a variable based on one or more other variables. Let's say you want to find out what influences the salary of people. For example, you could take the highest level of education, the weekly working hours and the age of people. Now you can investigate whether these three variables have an influence on salary. If they do, you can predict a person's salary from their highest educational level, their weekly working hours and their age.

The variable you want to infer, the one you want to predict, is called the dependent variable or criterion. The variables you use for your prediction are called independent variables or predictors.

Regression analysis can be used to achieve two goals. First, you can measure the influence of one variable or several variables on another variable. Or, second, you can predict a variable based on other variables. To give you a feeling for this, let's go through some examples.

Let's start with measuring the influence of one or more variables on another. In the context of your research work, you might be interested in what has an influence on children's ability to concentrate. You want to know whether there are parameters that positively or negatively influence children's ability to concentrate, but in this case you're not interested in predicting it. Another example: you could investigate whether the educational level of parents and the place of residence have an influence on the future educational level of children. This area is very research-based and has a lot of application in the social and economic sciences.

The second area, using regressions for predictions, is more application-oriented. Let's say that in order to get the most out of hospital occupancy, you might be interested in how long a patient will stay in the hospital. Based on the characteristics of the prospective patient, such as age, reason for stay and pre-existing conditions, you want to know how long this person is likely to stay in the hospital. Based on this prediction you can, for example, optimize bed planning. Another example: as the operator of an online store, you may be very interested in which product a person is most likely to buy, so that you can suggest this product to the visitor and increase the sales of the online store. This is where regression comes into play.

What we need to know now is that there are different types of regression analysis, and we get started with these types right now. In regression analysis, a distinction is made between simple linear, multiple linear and logistic regression. In simple linear regression you use only one independent variable to infer the dependent variable. In the example where we want to predict the salary of people, we would use only one variable: for example whether a person has studied or not, the weekly working hours, or the age of a person.
In multiple linear regression we use several independent variables to infer the dependent variable, so we use the highest educational level, the weekly working hours and the age of a person. The difference between a simple and a multiple regression is therefore that in one case we use only one independent variable and in the other case we use several independent variables. Both cases have in common that the dependent variable is metric. Metric variables are, for example, the salary of a person, body height, shoe size or electricity consumption.

In contrast to that, logistic regression is used when you have a categorical dependent variable, for example when you want to infer whether a person is at high risk of burnout or not. Whenever yes and no answers are possible, you use logistic regression. So in linear regression the dependent variable is metric; in logistic regression it is categorical. Whenever the dependent variable is yes or no, you will use a logistic regression: does a person buy a product, yes or no? Is a person healthy or sick? Does a person vote for a certain party, yes or no? And so on and so forth.

In all these cases it does not matter what scale level the independent variables have; they can be nominal, ordinal or metric. Only the dependent variable differs between the three cases: in the simple linear and the multiple linear regression the dependent variable is metric, and in a logistic regression it is nominal or ordinal. It is important to know that a nominal or ordinal independent variable may classically have only two characteristics, such as gender with male and female. If your variables have more than two characteristics, you have to form so-called dummy variables, but we will talk about dummy variables a bit later.

So now a quick recap for you. There is the simple linear regression, where the question could be: does the weekly working time have an impact on the hourly wage of people? The distinguishing point is that we have only one independent variable in this case. Then there is the multiple linear regression, where a question could be: do the weekly working hours and the age of employees have an influence on the hourly wage? In this case we have at least two independent variables, for example weekly working hours and age. And now the last case, which is logistic regression. Here the question could be: do the weekly working hours and the age of employees have an influence on the probability that they are at risk of burnout? In this case, burnout risk has the two categories yes and no.

So now I hope you got a first impression of what regression analysis is, and we'll move on to linear regression now. Let's get started. As you already know, a regression analysis allows you to infer or predict a variable based on one or more other variables. But what is the difference between a simple linear regression and a multiple linear regression? Let's first start with the simple linear regression.

The goal of a simple linear regression is to predict the value of a dependent variable based on an independent variable. The greater the linear relationship between the independent variable and the dependent variable, the more accurate is the prediction. Visually, the relationship between the variables can be represented in a scatter plot: the greater the linear relationship between the dependent and the independent variable, the more the data points lie on a straight line.
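If you want to see what such a scatter plot with a regression line looks like, here is a minimal sketch in Python. The numbers are made up for illustration and are not the data from the video; np.polyfit simply returns the slope and intercept of the best-fitting straight line.

```python
# Minimal sketch: scatter plot of a dependent vs. an independent variable,
# with the best-fitting straight line drawn on top. Data are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
age = rng.uniform(20, 80, 50)                     # independent variable
stay = 1.2 + 0.14 * age + rng.normal(0, 1.5, 50)  # dependent variable plus random noise

b, a = np.polyfit(age, stay, 1)                   # degree-1 fit: slope b, intercept a

plt.scatter(age, stay, label="observed data")
plt.plot(np.sort(age), a + b * np.sort(age), color="red", label="regression line")
plt.xlabel("age")
plt.ylabel("length of stay (days)")
plt.legend()
plt.show()
```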
The task of a simple linear regression is now to determine exactly the straight line that best describes the relationship between the dependent and the independent variable. In the context of linear regression analysis, a straight line is plotted on the scatter plot, and to determine this straight line, linear regression uses the method of least squares.

Let's say a hospital asks you to give them an estimate, based on the age of a person, of how long this person will stay in the hospital after a surgery. The goal of the hospital operator is to optimize bed planning. In this example, your dependent variable, the one you want to infer, is the length of stay after surgery; your independent variable is the age of a person. The equation that describes the model looks like this: y = a + b · x, where b is the slope and a is the intercept. If a person were zero years old, which doesn't really make sense in this example, the model would predict that this person stays exactly a days in the hospital; the intercept is simply where the line crosses the y-axis.

In order to calculate the coefficients, the hospital must of course provide you with a sample of people for whom you know both the age and the length of stay after surgery. Using your data, you could find that b is 0.14 and a is 1.2. This is now our model, which helps us to estimate the length of stay after surgery based on the age of people. Now let's say a person who is 33 years old is registered for a surgery. Then we put 33 in for x, and our model tells us that this person stays in the hospital for 0.14 · 33 + 1.2 = 5.82 days after surgery.

Now of course the question is: how do you calculate the slope b and the intercept a? Usually you use a statistics program like DATAtab. In the case of simple linear regression, however, it is also quite simple to do this by hand. b results from the correlation of the two variables times the standard deviation of the variable length of stay divided by the standard deviation of age, so b = r · (s_stay / s_age). a is obtained by calculating the mean value of the length of stay minus the slope times the mean value of age, so a = mean(stay) − b · mean(age).

The regression line always tries to map the given data with a straight line as well as possible, and of course this always results in an error. This error is called epsilon. Let's say we estimate the length of stay after surgery of a person who is 33 years old. Our model tells us that the person stays in the hospital for 5.82 days, but in fact he or she stays in the hospital for seven days. The difference between the estimated value and the true value is our error, and exactly this error is epsilon. Now you have a good overview of what a simple linear regression is.
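Here is a minimal sketch of exactly this by-hand calculation in Python. The eight data points are hypothetical, so the resulting b and a will not match the 0.14 and 1.2 from the example; the formulas, however, are the ones just described.

```python
# Minimal sketch: simple linear regression coefficients computed by hand.
# b = r * s_y / s_x and a = mean(y) - b * mean(x), as described above.
import numpy as np

age = np.array([25, 40, 33, 50, 62, 70, 45, 38])             # x: age in years
stay = np.array([4.5, 6.8, 5.9, 8.0, 10.1, 11.2, 7.4, 6.5])  # y: length of stay in days

r = np.corrcoef(age, stay)[0, 1]            # Pearson correlation between x and y
b = r * stay.std(ddof=1) / age.std(ddof=1)  # slope: r times ratio of standard deviations
a = stay.mean() - b * age.mean()            # intercept: mean(y) minus b times mean(x)

print(f"b = {b:.3f}, a = {a:.3f}")
print(f"predicted stay for a 33-year-old: {a + b * 33:.2f} days")
```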
We proceed with multiple linear regression now. Unlike simple linear regression, multiple linear regression allows you to consider more than one independent variable. As we already know, the goal of a regression is to estimate one variable based on several other variables. The variable to be estimated is called the dependent variable or criterion; the variables used for the prediction are called independent variables or predictors. Multiple linear regression can also be used to control for the influence of a third variable. It is often used in empirical social research as well as in market research: in both areas it is of interest to find out what influence different factors have on a certain variable.

The equation for a simple linear regression looks like this: y = a + b · x, so we have one independent variable x plus the constant. If we now go the way of multiple linear regression, we have several independent variables: y = a + b1 · x1 + b2 · x2 + … + bk · xk, with the first variable, the second variable, and so on up to the k-th variable. The coefficients can be interpreted similarly to the simple linear regression equation. If all independent variables are zero, we get the value a. If an independent variable changes by one unit, the corresponding coefficient b indicates by how much the dependent variable changes. Let's say x1 is the age of a person and b1 is 10; then for every further year, y increases by 10. If the person is 5 years old, we have 5 times 10, which equals 50, so y is increased by 50.

Before we look at the interpretation of the regression results, I will first explain how you can easily calculate a linear regression online. To do this, simply visit DATAtab (you can find the link in the video description) and copy your own data into the table. Then you simply click on the regression tab. After you've copied your data into the upper table, your variables appear below, and you only need to select your dependent variable and one or more independent variables. Of course, this data set is far too small, but it just serves as an example; your own data set may have several thousand rows.

Let's say we want to find out what influences a person's salary. We choose salary as the dependent variable, and as independent variables we choose gender, age and, for example, weight. As soon as you've selected the variables you want to use, DATAtab calculates a regression analysis for you. Since your dependent variable has a metric scale level, a linear regression is calculated, and you can see the results. We will go through the interpretation of the results in detail in a moment. The good thing is that DATAtab also helps you with the interpretation, so just click there: a multiple linear regression analysis was performed to examine whether the variables male, age and weight significantly predict salary. The regression model indicated that the predictors explained 48.99% of the variance; a collective significant effect was not found, F = 2.56, p = 0.1, R² = 0.49.
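If you would rather reproduce such an output in code than online, here is a minimal sketch using Python's statsmodels. The data frame and its column names (salary, male, age, weight) are hypothetical stand-ins for the salary example; with your own data you get the same kind of summary that DATAtab reports.

```python
# Minimal sketch: multiple linear regression with statsmodels, hypothetical data.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "salary": [2300, 2700, 2400, 3500, 3100, 2900, 4000, 3300],
    "male":   [1, 0, 1, 1, 0, 0, 1, 0],  # gender, dummy-coded: 1 = male, 0 = female
    "age":    [25, 31, 28, 45, 40, 35, 50, 38],
    "weight": [80, 62, 85, 90, 58, 70, 95, 64],
})

X = sm.add_constant(df[["male", "age", "weight"]])  # adds the intercept a to the model
model = sm.OLS(df["salary"], X).fit()

print(model.summary())  # coefficients, p-values, F-test, R squared
print("R squared:", round(model.rsquared, 4))
print("adjusted R squared:", round(model.rsquared_adj, 4))
print("multiple correlation R:", round(model.rsquared ** 0.5, 4))
```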
But let's now go into more detail about the interpretation of the results, which are presented to you in this form. You can of course just use the interpretation in words on DATAtab, but we'll go through each block individually.

Let's start with the model summary. The multiple correlation coefficient R measures the correlation between the dependent variable and the combination of the independent variables. What does that mean? Here we see the equation for the linear regression, and a statistics program like DATAtab calculates the regression coefficients. If you now multiply each independent variable by its coefficient and sum up all the values, you get an estimate of the dependent variable. R indicates how strong the correlation is between the true dependent variable and this estimated variable; the greater the correlation, the better the regression model.

The coefficient of determination R² indicates how much of the variance of the dependent variable can be explained by the independent variables. Your dependent variable has a certain variance, and the more of this variance we can explain, the better. Let's take the example of how long people stay in the hospital after surgery. Of course, not every person stays in the hospital for the same amount of time, so we have variation, and it is exactly this variation that we want to explain with the help of a regression model. If, for example, we can predict the length of stay in hospital very well with the age and the type of surgery, then R² is very large. If we could even explain everything, the result would be 1: in this case we could predict exactly how long a person will stay in the hospital after a surgery using only the age and the type of operation, and we could explain all the variance of the dependent variable. But of course that doesn't happen very often in practice. Note that R² overestimates the coefficient of determination when too many independent variables are used; to fix this issue, the adjusted R² is often calculated.

The standard error of the estimate indicates by how much the model's predictions deviate from the true values on average. Let's say you want to predict the length of stay in days after surgery: if your standard error of the estimate is 4, this means that your prediction is off by 4 days on average. Now, of course, your question to the hospital management would be whether an average deviation of four days is too much, or whether the hospital says: great, this gives us much better planning security.

Next, this table is displayed, where the so-called F-test is calculated. The F-test tests the null hypothesis that the variance explanation R² in the population is zero. This test is often not of great interest; it is equivalent to asserting that all true slope coefficients in the population are zero, so b1 = 0, b2 = 0, and so on up to bk = 0. Since we have a very small data set in our example, the results show that the null hypothesis cannot be rejected: the p-value is greater than 0.05, and based on the available data we assume that all coefficients are zero.

So now we reach the probably most interesting table. Here we see the unstandardized coefficients B, the standardized coefficients beta and the significance level. The unstandardized coefficients are just the coefficients that you can put into your regression equation. If we want to predict the salary, we have a constant of 1920, plus gender times its coefficient, plus age times its coefficient, plus weight times its coefficient. This allows us to use the unstandardized regression coefficients to build our regression model; to predict a salary, we now only have to insert the values of the variables.

Beta gives the standardized coefficients, and from them you can see which variable has the greatest influence on salary. In this case age has the greatest influence, because the beta of age has the greatest absolute value. Finally, we see the significance values for the coefficients; values smaller than 0.05 are considered significant, and in this case not a single coefficient is significant. What does that mean? The null hypothesis in each case is that the coefficient is 0 in the population. If the p-value is greater than 0.05, we cannot reject this null hypothesis based on the available data, and so it is assumed that the coefficients in the population are not different from zero and that the variables have no influence on salary. Because I've used a data set that is actually very small, it is of course difficult to reach significant values.
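Statistics programs report beta directly; if you want to see where it comes from, one common way is to z-standardize all variables and refit the model, so that the coefficients of the refitted model are the betas. A minimal sketch, reusing the hypothetical df from the sketch above:

```python
# Minimal sketch: standardized (beta) coefficients via z-standardization.
import statsmodels.api as sm

z = (df - df.mean()) / df.std()  # z-standardize every column
Xz = sm.add_constant(z[["male", "age", "weight"]])
beta_model = sm.OLS(z["salary"], Xz).fit()

print(beta_model.params)   # betas: the largest absolute value marks the strongest influence
print(beta_model.pvalues)  # p-values below 0.05 are considered significant
```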
Of course, when calculating a regression analysis you also have to check the assumptions. If the assumptions are not fulfilled, you cannot interpret the results of the regression meaningfully. Therefore, we now take a look at the assumptions for the linear regression. They are as follows: first, there must be a linear relationship between the dependent and the independent variables. Second, the error epsilon must be normally distributed. Third, there must be no multicollinearity, that is, no instability of the regression coefficients. And finally, there must be no heteroscedasticity: the variance of the residuals must be constant over the predicted values. Now we'll go into each point in more detail. Let's get started.

In a linear regression, a straight line is drawn through the data, and this straight line should represent all points as well as possible. If the points are non-linear, the straight line cannot fulfill this task. Let's look at these two graphs. In the first one we see a nice linear relationship between the dependent and independent variable; here the regression line can be drawn in a meaningful way. In the second case, however, we see that there is a clearly non-linear relationship between the dependent and the independent variable, so it's not possible to put the regression line meaningfully through the points. Since this is not possible, the coefficients of the regression model cannot be interpreted meaningfully, or errors in the prediction can occur that are larger than expected. Therefore we have to check at the beginning whether there is a linear relationship between the dependent and the independent variables; this is usually done graphically.

The next requirement is that the error epsilon must be normally distributed. There are two ways to check this: the analytical way and the graphical way. For the analytical way you can use either the Kolmogorov-Smirnov test or the Shapiro-Wilk test. If the p-value is greater than 0.05, there is no deviation of the data from the normal distribution, so it can be assumed that your data are normally distributed. However, these analytical tests are not used that often anymore, because they tend to always attest normal distribution in small samples, and they become significant very quickly in larger samples, thus rejecting the null hypothesis that the data are normally distributed. Therefore the graphical version is used more and more. For the graphical solution you can either look at a histogram or, better, at the QQ plot. The more your data lie on the line, the better the normal distribution is fulfilled.

And now we come to heteroscedasticity. Another assumption of linear regression is that the residuals have a constant variance. Since your regression model never exactly predicts your dependent variable in practice, you always have a certain error. You can now plot the predicted values on the x-axis and the error on the y-axis. The errors should scatter evenly over the whole range; then homoscedasticity is present. However, if the errors fan out, we have heteroscedasticity: in this case we have different error variances depending on the value range, and this should not be the case.
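As a minimal sketch, here is how you could run these checks in Python for the hypothetical model fitted earlier: the Shapiro-Wilk test and a QQ plot for the normality of the residuals, and a residuals-versus-predicted plot for homoscedasticity.

```python
# Minimal sketch: checking normality of the residuals and homoscedasticity.
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

resid = model.resid  # residuals of the fitted OLS model from above

# analytical check: p > 0.05 means no detectable deviation from normality
print("Shapiro-Wilk p-value:", stats.shapiro(resid).pvalue)

# graphical check: the closer the points lie to the line, the better
sm.qqplot(resid, line="s")
plt.show()

# heteroscedasticity check: the errors should scatter evenly around zero
plt.scatter(model.fittedvalues, resid)
plt.axhline(0, color="red")
plt.xlabel("predicted values")
plt.ylabel("residuals")
plt.show()
```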
And before I show you how you can check the assumptions online, we now come to the last assumption, which is no multicollinearity. Multicollinearity means that two or more independent variables are strongly correlated with each other. The problem with multicollinearity is that the effect of individual variables cannot be clearly separated. Let's look at the regression equation again: there we have the dependent variable and the independent variables with their respective coefficients. If, for example, there is a high correlation between x1 and x2, or if these two variables are almost equal, then it is difficult to determine b1 and b2. If both are completely equal, the regression model does not know what b1 and b2 should be; this means that the regression model becomes unstable.

If you want to use the regression model for a prediction, it does not matter whether there is multicollinearity or not: in a prediction you're only interested in how good the prediction is, not in how big the influence of the respective variables is. However, if the regression model is used to measure the influence of the independent variables on the dependent variable, there must not be multicollinearity; if there is, the coefficients cannot be interpreted meaningfully.

So now the question is: how can we diagnose multicollinearity? If we look at the regression equation again, we have the variables x1, x2, up to xk. We now want to know whether x1 is nearly identical to another variable or to a combination of the other variables. To find out, we simply set up a new regression model in which we take x1 as the new dependent variable. If we can predict x1 very well using the other independent variables, we don't need x1 anymore, because we can use the other variables instead. If we still used all variables, the regression model could become unstable; in mathematics we say that the equation is overdetermined. We can now do this for every variable: we estimate x2 using the other independent variables, and so on up to xk, so we get k new regression models.

For each of these k regressions we calculate the so-called tolerance or the VIF value. The tolerance is obtained as 1 minus R², the variance explanation: tolerance = 1 − R². Once again, the variance explanation indicates how much of the variance of x1 can be explained by the other variables; the more it is, the more one speaks of multicollinearity. If these variables can explain one hundred percent of the variance of x1, then we no longer need x1 in the original equation. If the tolerance is less than 0.1, you have to be careful, because in this case we could have multicollinearity. The VIF value, on the other hand, is calculated by dividing 1 by the tolerance: VIF = 1 / tolerance. Accordingly, you have to be careful if VIF values are greater than 10.

And now I will show you how you can test the assumptions online. If you want to test the assumptions online with DATAtab, you only need to select your variables and then click on check assumptions. You then see a nice overview of the results, where you can check the linearity, the normal distribution of the error, the multicollinearity and the heteroscedasticity. It's very simple, just try it yourself.
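Before we move on, here is a minimal sketch of the tolerance and VIF computation just described, applied to the predictor matrix X from the earlier hypothetical salary model; statsmodels runs the auxiliary regressions internally.

```python
# Minimal sketch: tolerance and VIF per independent variable.
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["male", "age", "weight"]
for i, name in enumerate(predictors, start=1):    # column 0 of X is the constant
    vif = variance_inflation_factor(X.values, i)  # 1 / (1 - R squared of auxiliary regression)
    tolerance = 1 / vif
    print(f"{name}: tolerance = {tolerance:.2f}, VIF = {vif:.2f}")
    # rule of thumb: tolerance < 0.1 or VIF > 10 signals multicollinearity
```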
Now it's time to look at dummy variables. You have to use dummy variables if you want to use categorical variables with more than two values as your independent variables. As independent variables or predictors, either metric variables or categorical variables with two expressions can be used. Of course, it's also possible to use variables with more than two categories, and I will explain how in a moment. Categorical variables with two characteristics are also called dichotomous; an example would be gender with the categories male and female.

Let's say you coded female as 0 and male as 1. In this case female would be the reference category. If we now look at the regression equation and say the variable x1 is gender, then b1 is the regression coefficient for gender. Now, how can we interpret b1? We said 0 is female and 1 is male, so we just put that in: we have 0 times b1 for a female person and 1 times b1 for a male person. Accordingly, b1 indicates the difference between male and female. Let's say we want to predict the income of a person, so y is the income. If the person is male, there is a 1 here and we have 1 times b1, so this person earns more by the amount of b1 than a woman; if the person is female, there is a 0 here and we have 0 times b1. If b1 is 400, for example, it means that men earn 400 euros more than women according to this model.

Now we've discussed how to handle variables with two values; further, we should look at what we do when we have a variable with more than two values or categories. Let's say you want to predict the fuel consumption of a car based on its horsepower and the vehicle type, and let's say there are only three vehicle types: sedan, sports car and family van. So here we have a variable, vehicle type, with more than two characteristics, but for a regression model we need categorical variables with two characteristics. The question is now: what should we do?

In order to use such categorical variables in a regression, we have to create dummy variables. This means that we simply create three new variables, each with two characteristics: 0 and 1, or yes and no. So we take our vehicle type with the characteristics sedan, sports car and family van, and we create one new variable for each characteristic. First we create the variable: is it a sedan, yes or no? The second would be: is it a sports car, yes or no? And the third variable asks: is it a family van, yes or no? Before, we had one variable; now we have three variables which are dichotomous, so they all have only two characteristics, and these three new variables can be used in the regression model.

So the next question is: what does that mean for the data preparation? Originally you have one column with the vehicle type, where the individual vehicles from your sample are listed: the first is a sedan, the second is also a sedan, the third is a sports car, and so on and so forth. Out of this column you create your three new variables. The first car is a sedan, so you put a 1 in the sedan column and a 0 for the others; it's not a sports car and it's not a family van. The second one is also a sedan, so we put a 1 there again. The third one is a sports car, so we put a 1 in the sports car column and 0 for the others. By continuing this procedure, we finally create our new dummy variables.

Now there's only one important thing to note: the number of dummy variables you use is always the number of characteristics minus one. Why is that the case? If we know that it is a sedan, we are sure that it's not a family van. If we know that it is a sports car, we can also be sure that it's not a family van. And if it's not a sedan and not a sports car, we can be sure that it is a family van. Accordingly, one of the three variables is not needed, because this information is redundant. Therefore n − 1 dummy variables are created: in this case we only need the dummy variables "is it a sedan" and "is it a sports car". Of course, you can also drop sedan and use the other two instead.
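In code, pandas can build these dummy variables in one line. A minimal sketch with hypothetical vehicle data; drop_first=True keeps only n − 1 dummies, and the dropped category becomes the reference category:

```python
# Minimal sketch: creating n - 1 dummy variables with pandas.
import pandas as pd

cars = pd.DataFrame({
    "vehicle_type": ["sedan", "sedan", "sports car", "family van", "sports car"],
    "horsepower":   [120, 150, 300, 110, 280],
})

dummies = pd.get_dummies(cars, columns=["vehicle_type"], drop_first=True)
print(dummies)
# keeps "vehicle_type_sedan" and "vehicle_type_sports car";
# "family van" (the first category alphabetically) is dropped as the reference
```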
And now I show you how you can create the dummy variables online with DATAtab. In this example we use salary as our dependent variable, and as independent variables the age and the company. The variable company here has three characteristics: BMW, Ford and GM. Now we can see that DATAtab automatically creates the dummy variables: we get the age, plus the new dummy variables "is it BMW" and "is it Ford". Since we only need n − 1 dummy variables, the last category was dropped.

If your dependent variable is categorical, then you need logistic regression, and logistic regression is what we are going to look at now. In the previous part we discussed linear regression, where the dependent variable is metric; for example, it could be salary or height. Now let's look at logistic regression, where the dependent variable is categorical. Let's say you want to predict whether a person is at risk for burnout or not, and you want to make this prediction based on a person's highest educational attainment, weekly working hours and age. Since the variable "is a person at risk for burnout or not" is a categorical variable, you have to use a logistic regression. Logistic regression is a special case of regression analysis, and it is calculated when the dependent variable is nominally or ordinally scaled.

Let's look at a few examples. First, for an online retailer you need to predict which product a particular customer is most likely to buy. To do this, you receive a data set with past visitors and their purchases from the online retailer. In this case you need a logistic regression, because your dependent variable is categorical: which product is a particular customer most likely to buy? The second example comes from the medical field. You want to investigate whether a person is susceptible to a certain disease or not. For this purpose you receive a data set with diseased and non-diseased people, as well as other medical parameters. Again you need a logistic regression, because the dependent variable is categorical: is a person susceptible to a particular disease or not? And the last example comes from politics. The question could be: would a person vote for party A if there were elections next weekend? So again a categorical dependent variable, with yes and no as response options.

What is logistic regression now? In the basic form of logistic regression, dichotomous variables can be predicted, that is, variables with the characteristics 0 and 1, or yes and no. For this purpose, the probability of occurrence of characteristic 1, "characteristic present", is estimated. In medicine, for example, a common goal is to find out which variables have an impact on a disease. In this case, 0 could be "not diseased" and 1 "diseased", and the influence of age, gender and, for example, smoking status on this particular disease could be examined. Let's look at this in a graphical way: we have our independent variables age, gender and smoking status, and we use these three variables to predict whether a person is likely to get a certain disease or not. So the dependent variable is: will the person get the disease or not?

Maybe now you ask yourself the question: why do I need a logistic regression for this, why can't I just use a linear regression? A quick recap: in a linear regression, this is our regression equation. Now we have a dependent variable that is 0 or 1, so no matter what values the independent variables take, the observed outcome is always either 0 or 1. If we used a linear regression, we would simply put a straight line through these points, and the graph shows that values between plus and minus infinity could then occur. The goal of logistic regression, however, is to estimate the probability of occurrence, not the value of the variable itself: we want to know how likely it is that a value of 1 results from the given values of our independent variables. The range of values for the prediction is thus restricted to the range from 0 to 1.

To ensure that only values between 0 and 1 are possible, the logistic function is used; the logistic model is based on the logistic function. The important thing about the logistic function is that only values between 0 and 1 are possible: no matter where we are on the x-axis, between minus and plus infinity, we can only get values between 0 and 1 as a result, and that is exactly what we want. The equation for logistic regression looks like this: p = 1 / (1 + e^(−z)). For z we now use the usual equation from linear regression, z = a + b1 · x1 + … + bk · xk, where b1 to bk are the regression coefficients, a is the intercept, and x1 to xk are the independent variables. After we insert all that, our logistic function looks like this: p = 1 / (1 + e^(−(a + b1 · x1 + … + bk · xk))).
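To make this concrete, here is a minimal sketch in Python: the logistic function itself, and a logistic regression fitted with statsmodels on hypothetical purchase data (statsmodels estimates the coefficients with the maximum likelihood method described next).

```python
# Minimal sketch: the logistic function and a fitted logistic regression.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def logistic(z):
    """Maps any z between minus and plus infinity to a value between 0 and 1."""
    return 1 / (1 + np.exp(-z))

print(logistic(0))  # 0.5: at z = 0 both outcomes are equally likely

df = pd.DataFrame({
    "buys":   [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],  # dependent variable: 1 = yes, 0 = no
    "salary": [1800, 2600, 2400, 4000, 3000, 3800, 2200, 2900, 3600, 3400],
    "age":    [22, 35, 30, 50, 45, 48, 26, 38, 52, 41],
})

X = sm.add_constant(df[["salary", "age"]])
logit_model = sm.Logit(df["buys"], X).fit()  # coefficients a, b1, b2 via maximum likelihood

print(logit_model.params)   # the estimated a, b1, b2
print(logit_model.predict(X))  # predicted probabilities, all between 0 and 1
```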
Now we need to determine the coefficients so that our model best represents the given data. To solve this problem, we use the so-called maximum likelihood method. There are good numerical solutions that help us solve this problem, so usually we just use a statistics program and it will give us the values b1, b2, up to bk.

Finally, I will now show you how to calculate a logistic regression online with DATAtab. To do this, you just visit datatab.net and copy your own data into the table. When you select a categorical dependent variable, a logistic regression is automatically calculated. Let's say your dependent variable is whether a person buys a product or not, with the expressions yes and no, and the independent variables are salary and age. After you click on the variables, a logistic regression is automatically calculated, and you get the results, which you can see below. The first table is the so-called classification table; this table tells us how well we can classify the categories with the regression model. After that we get the model summary and the coefficients.

I hope you like this video, and see you next time!