Simple Linear Regression in Microsoft Excel

in this video tutorial i'm going to show you how to perform a simple linear regression by using microsoft excel not only will i show you how to perform the linear regression but i'll also show you how to analyze the outputs of the regression test as always please drop a like on this video if you do find it useful and leave me a comment below if you have any questions now let's jump into excel and get started for this example i just have two variables of data weight measured in kilograms and height measured in centimeters i have these measures for 49 different participants and each row represents a different participant so the first participant i can see that they had a weight of 51.24 kilograms and a height of 167.08 centimeters what i want to do is to perform a simple linear regression to see how well the measures of height in my sample can predict the measures of weight there are many ways you can perform a linear regression in excel but perhaps the easiest method is to use the analysis toolpak this is an add-on created by microsoft to provide data analysis tools for statistical analyses to install the tool pack go to file options then click on add-ins at the bottom you want to manage the excel add-ins and click the go button then ensure you tick the analysis toolpak add-in and click ok now when you click on the data ribbon at the top you should see a data analysis button in a subsection called analyze now we are ready to perform the regression to perform the linear regression i will click on the data analysis button then i will select regression from the list the first thing i must enter is the input y range this is the data for the y variable otherwise known as the dependent variable the y variable is the one that you want to predict in the regression model since i want to predict weight i will select this column in the sheet by clicking on the up arrow button and then click and dragging on the data for weight i also advise highlighting the label in the top row doing so will help when it comes to interpreting the results i now need to repeat this but this time i need to select the x variable otherwise known as the independent variable for me this will be the height measures so i'll click and drag on this data and press the enter key once more when i'm done next i will tick the labels option this is because i highlighted my column labels when i selected the data if you didn't have any labels when you selected your data then you would just not take this option the next option called constant is zero is used when you want the regression line to start at zero otherwise known as the origin doing so would mean there is no y-intercept in the model generally though for linear regression this option is not selected so i'll leave it unchecked for this example it is also possible to specify the confidence level for the test by default the results will return the 95 confidence intervals without having to change any options however if you want to use a different confidence level than 95 then you can select this option here and enter a desired value since i only want the 95 confidence intervals i will leave this option unchecked for the output options you can specify where you want the regression results to be placed if you select output range you can highlight where you want the results to be placed in that worksheet the second option lets you place the results in a new worksheet and the third option lets you place the results in an entirely new excel document i'm going to select the second option and call the new worksheet results the final set of options concerns the residuals in the analysis i'll leave all the residuals options checked so you can see what is returned and i'll click the ok button to run the regression since i checked all of the options in the regression test i have quite a lot of information that is returned so i'll break down the output and go through each in more detail separately in the first table called summary output there are some regression statistics from the test first up we have the multiple r this is the correlation coefficient between the two variables of interest if you're interested to learn more about the correlation then i suggest you check out my previous video on the pearson correlation test currently explained briefly it is a value that tells you how strong the linear relationship is a value of 0.65 in this case indicates a fairly strong linear correlation between height and weight measures moving on we have the r squared otherwise known as the coefficient of determination to get this value you simply square the multiple r value the r squared value tells you how much variance the dependent variable can be accounted for by the values of the independent variable and if we multiply this value by 100 we get a percentage value so we can say that 43 of the variance in weight can be accounted for by the height measures the other 57 of the variance is therefore caused by other factors such as measurement errors next we have the adjusted r squared the adjusted r squared takes into account the number of independent variables in the regression analysis and corrects for any bias usually this value is only really relevant when you are performing multiple linear regression where there are more than one independent variables in the model next we have the standard error the standard error of the regression is the average distance that the observed values fall from the regression line and what's useful about the standard error is that it is in the same units as the dependent variable so here my standard error is 4.31 kilograms when rounded this means on average my observed values were 4.31 kilograms from the regression line the smaller the standard error the more precise the linear regression model is finally we have the number of observations this is just the number of subjects in the test so here i had 49 participants let me now move on to the next results table called anova the main thing you want to be concerned with here when looking at this table is the value under the significance f header this is in fact the p value for the regression model to be able to interpret this we need our hypotheses my null hypothesis was that there is no linear relationship between the height and weight measures my alternative hypothesis was that there is a linear relationship between the height and weight measures and let's say my alpha was 0.05 this means i will reject the null hypothesis and accept the alternative hypothesis if the p-value was less than or equal to 0.05 and the opposite will be true if the p-value is greater than 0.05 in this case i would fail to reject the null hypothesis as you can see the p-value for this model was considerably lower than my alpha value of 0.05 so i conclude that the linear regression model is significant let me now move on to the final table of results regarding the coefficients the first row displays the results for the intercept this is the point where the line of best fit or regression line crosses the y axis when the value of x is zero the second row displays the results for the slope for a simple linear regression model the most basic version of the equation is y equals m multiplied by x plus b where y is the predicted value of the dependent variable weight in this example m is the slope of the line of best fit x is the value of the independent variable height in this example and finally b is the intercept using the information reported from our results we can say that the predicted y value is 0.800264 multiplied by x subtract 79.599 so in this example if we knew a participant's height in centimeters we can predict their weight in kilograms by using this equation for example if a participant measured centimeters the model estimates their weight to be 60.45 kilograms looking back at the coefficient results table we can see that there are other columns which tells us the standard error as well as the lower and upper 95 confidence intervals for the intercept and the slope values you will also notice both has a t statistic this value is used to compute the p-value again to interpret this p-value we need our hypotheses in this case the null hypothesis is that the intercept or slope is zero my alternative hypothesis is that the intercept or slope is not zero and as you can see both values are less than my alpha of 0.05 this means that height is a significant variable that impacts weight in this case so that's an overview of the regression model results let me now cover the other outputs from the regression test if you selected to have the residuals option during the regression setup you will have a table titled residual output for each observation from your data that was entered into regression test you will get a predicted value of y based on the regression model for example if you look at the first observation in my original data you will see that the participant had a height of 167.08 centimeters just like we did before if i put this value into the regression equation along with the slope and the intercept values i get a predicted weight value of 54.10999 kilograms this is what this column represents excel then does this for each of my observations then using these predicted values excel can then calculate the residuals but what is a residual to understand this better let me plot my data on a scatter bot after performing a simple linear regression analysis a line of best fit or regression line can be plotted and this is based on the model that i explained earlier a residual is simply the distance between the actual data point and the line of best fit for example this is the data point for the first participant they had a height of 167.08 centimeters and a weight of 51.24 kilograms and as we calculated recently the predicted weight value based on our model was 54.10999 and as you can see this is on the line of best fit the residual for this point therefore is the difference between the actual weight value 51.24 kilograms and the predicted weight volume 54.10999 kilograms and this comes out at negative 2.867 kilograms when rounded excel then repeats this for the rest of the observations if you also selected the residual plots option in the regression setup window you will also get a graph returned here is my residual plot this is a scatter plot of the residual values from the table i just talked about on the y axis and the values for the independent variable on the x-axis height measures in my case residual parts are useful to look at when you are investigating homogeneity of variance which is an assumption of the linear regression test what you're looking for here is a random pattern to the graph there should roughly be half the number of data points above zero and the other half below zero and there should be a vertical spread of the data points the further along the x-axis you go if you selected the standardized residuals option in the regression options earlier you will see a column called standard residuals in the residual table the standardized residual equals the value of a residual divided by an estimate of its standard deviation and you can think of these as z-scores they are generally useful to look at when trying to identify potential outliers in your sample generally any standardized residuals with a value greater than three or negative three is a sign that it may be an outlier fortunately all of my standardized residual values are within three and negative three moving on if you selected to have the wine fit plots option you also see a scatter plot containing your data of interest in my case i have height measures on the x-axis and the weight measures on the y-axis you will also see another set of data as shown here in orange which are in fact the predicted y values based on the model and these values come from the predicted values from the residuals table if instead of showing the predicted values on the graph you wanted to plot the line of best fit which will pass through the predicted values then you could remove the predicted values from the graph by right clicking on it and going to select the data highlight the predicted y variable in the legend entry and select remove and then click ok if you also wanted to see the line of best fit you can select the graph and then go to add chart element trendline and select the linear option if you also wanted to show the equation of the line you can double click on the line then in the format trendline options that have just opened to the right scroll down and select display equation on chart and as you can see this is the same equation we looked at earlier finally if you selected the normal probability plots option in the regression setup window you will also see a table called probability output and a graph called the normal probability plot which is a scatter part of this data the x-axis plots the percentile value ranging from 0 to 100 and the y axis plots the y variable data weight in this case the normal probability plot is used to determine whether the y data fits a normal distribution essentially what you are looking for is a straight line of data and as you can see there is a nice straight line of data for my example which suggests the weight data are normally distributed however it's worth noting that the y variable does not actually have to be normally distributed when fitting a linear regression model and i'll go into a bit more detail about the assumptions of the linear regression in a future tutorial and that brings me to the end of this tutorial you now know how to perform a simple linear regression test in microsoft excel and how to interpret the output of results if you found this video useful please leave a like it really does help support the channel if you've got a question pop it down in the comments below also consider subscribing for more weekly tutorials

in this video tutorial i&#39;m going to show you how to perform a simple linear regression by using microsoft excel not only will i show you how to perform the linear regression but i&#39;ll also show you how to analyze the outputs of the regression test as always please drop a like on this video if you do find it useful and leave me a comment below if you have any questions now let&#39;s jump into excel and get started for this example i just have two variables of data weight measured in kilograms and height measured in centimeters i have these measures for 49 different participants and each row represents a different participant so the first participant i can see that they had a weight of 51.24 kilograms and a height of 167.08 centimeters what i want to do is to perform a simple linear regression to see how well the measures of height in my sample can predict the measures of weight there are many ways you can perform a linear regression in excel but perhaps the easiest method is to use the analysis toolpak this is an add-on created by microsoft to provide data analysis tools for statistical analyses to install the tool pack go to file options then click on add-ins at the bottom you want to manage the excel add-ins and click the go button then ensure you tick the analysis toolpak add-in and click ok now when you click on the data ribbon at the top you should see a data analysis button in a subsection called analyze now we are ready to perform the regression to perform the linear regression i will click on the data analysis button then i will select regression from the list the first thing i must enter is the input y range this is the data for the y variable otherwise known as the dependent variable the y variable is the one that you want to predict in the regression model since i want to predict weight i will select this column in the sheet by clicking on the up arrow button and then click and dragging on the data for weight i also advise highlighting the label in the top row doing so will help when it comes to interpreting the results i now need to repeat this but this time i need to select the x variable otherwise known as the independent variable for me this will be the height measures so i&#39;ll click and drag on this data and press the enter key once more when i&#39;m done next i will tick the labels option this is because i highlighted my column labels when i selected the data if you didn&#39;t have any labels when you selected your data then you would just not take this option the next option called constant is zero is used when you want the regression line to start at zero otherwise known as the origin doing so would mean there is no y-intercept in the model generally though for linear regression this option is not selected so i&#39;ll leave it unchecked for this example it is also possible to specify the confidence level for the test by default the results will return the 95 confidence intervals without having to change any options however if you want to use a different confidence level than 95 then you can select this option here and enter a desired value since i only want the 95 confidence intervals i will leave this option unchecked for the output options you can specify where you want the regression results to be placed if you select output range you can highlight where you want the results to be placed in that worksheet the second option lets you place the results in a new worksheet and the third option lets you place the results in an entirely new excel document i&#39;m going to select the second option and call the new worksheet results the final set of options concerns the residuals in the analysis i&#39;ll leave all the residuals options checked so you can see what is returned and i&#39;ll click the ok button to run the regression since i checked all of the options in the regression test i have quite a lot of information that is returned so i&#39;ll break down the output and go through each in more detail separately in the first table called summary output there are some regression statistics from the test first up we have the multiple r this is the correlation coefficient between the two variables of interest if you&#39;re interested to learn more about the correlation then i suggest you check out my previous video on the pearson correlation test currently explained briefly it is a value that tells you how strong the linear relationship is a value of 0.65 in this case indicates a fairly strong linear correlation between height and weight measures moving on we have the r squared otherwise known as the coefficient of determination to get this value you simply square the multiple r value the r squared value tells you how much variance the dependent variable can be accounted for by the values of the independent variable and if we multiply this value by 100 we get a percentage value so we can say that 43 of the variance in weight can be accounted for by the height measures the other 57 of the variance is therefore caused by other factors such as measurement errors next we have the adjusted r squared the adjusted r squared takes into account the number of independent variables in the regression analysis and corrects for any bias usually this value is only really relevant when you are performing multiple linear regression where there are more than one independent variables in the model next we have the standard error the standard error of the regression is the average distance that the observed values fall from the regression line and what&#39;s useful about the standard error is that it is in the same units as the dependent variable so here my standard error is 4.31 kilograms when rounded this means on average my observed values were 4.31 kilograms from the regression line the smaller the standard error the more precise the linear regression model is finally we have the number of observations this is just the number of subjects in the test so here i had 49 participants let me now move on to the next results table called anova the main thing you want to be concerned with here when looking at this table is the value under the significance f header this is in fact the p value for the regression model to be able to interpret this we need our hypotheses my null hypothesis was that there is no linear relationship between the height and weight measures my alternative hypothesis was that there is a linear relationship between the height and weight measures and let&#39;s say my alpha was 0.05 this means i will reject the null hypothesis and accept the alternative hypothesis if the p-value was less than or equal to 0.05 and the opposite will be true if the p-value is greater than 0.05 in this case i would fail to reject the null hypothesis as you can see the p-value for this model was considerably lower than my alpha value of 0.05 so i conclude that the linear regression model is significant let me now move on to the final table of results regarding the coefficients the first row displays the results for the intercept this is the point where the line of best fit or regression line crosses the y axis when the value of x is zero the second row displays the results for the slope for a simple linear regression model the most basic version of the equation is y equals m multiplied by x plus b where y is the predicted value of the dependent variable weight in this example m is the slope of the line of best fit x is the value of the independent variable height in this example and finally b is the intercept using the information reported from our results we can say that the predicted y value is 0.800264 multiplied by x subtract 79.599 so in this example if we knew a participant&#39;s height in centimeters we can predict their weight in kilograms by using this equation for example if a participant measured centimeters the model estimates their weight to be 60.45 kilograms looking back at the coefficient results table we can see that there are other columns which tells us the standard error as well as the lower and upper 95 confidence intervals for the intercept and the slope values you will also notice both has a t statistic this value is used to compute the p-value again to interpret this p-value we need our hypotheses in this case the null hypothesis is that the intercept or slope is zero my alternative hypothesis is that the intercept or slope is not zero and as you can see both values are less than my alpha of 0.05 this means that height is a significant variable that impacts weight in this case so that&#39;s an overview of the regression model results let me now cover the other outputs from the regression test if you selected to have the residuals option during the regression setup you will have a table titled residual output for each observation from your data that was entered into regression test you will get a predicted value of y based on the regression model for example if you look at the first observation in my original data you will see that the participant had a height of 167.08 centimeters just like we did before if i put this value into the regression equation along with the slope and the intercept values i get a predicted weight value of 54.10999 kilograms this is what this column represents excel then does this for each of my observations then using these predicted values excel can then calculate the residuals but what is a residual to understand this better let me plot my data on a scatter bot after performing a simple linear regression analysis a line of best fit or regression line can be plotted and this is based on the model that i explained earlier a residual is simply the distance between the actual data point and the line of best fit for example this is the data point for the first participant they had a height of 167.08 centimeters and a weight of 51.24 kilograms and as we calculated recently the predicted weight value based on our model was 54.10999 and as you can see this is on the line of best fit the residual for this point therefore is the difference between the actual weight value 51.24 kilograms and the predicted weight volume 54.10999 kilograms and this comes out at negative 2.867 kilograms when rounded excel then repeats this for the rest of the observations if you also selected the residual plots option in the regression setup window you will also get a graph returned here is my residual plot this is a scatter plot of the residual values from the table i just talked about on the y axis and the values for the independent variable on the x-axis height measures in my case residual parts are useful to look at when you are investigating homogeneity of variance which is an assumption of the linear regression test what you&#39;re looking for here is a random pattern to the graph there should roughly be half the number of data points above zero and the other half below zero and there should be a vertical spread of the data points the further along the x-axis you go if you selected the standardized residuals option in the regression options earlier you will see a column called standard residuals in the residual table the standardized residual equals the value of a residual divided by an estimate of its standard deviation and you can think of these as z-scores they are generally useful to look at when trying to identify potential outliers in your sample generally any standardized residuals with a value greater than three or negative three is a sign that it may be an outlier fortunately all of my standardized residual values are within three and negative three moving on if you selected to have the wine fit plots option you also see a scatter plot containing your data of interest in my case i have height measures on the x-axis and the weight measures on the y-axis you will also see another set of data as shown here in orange which are in fact the predicted y values based on the model and these values come from the predicted values from the residuals table if instead of showing the predicted values on the graph you wanted to plot the line of best fit which will pass through the predicted values then you could remove the predicted values from the graph by right clicking on it and going to select the data highlight the predicted y variable in the legend entry and select remove and then click ok if you also wanted to see the line of best fit you can select the graph and then go to add chart element trendline and select the linear option if you also wanted to show the equation of the line you can double click on the line then in the format trendline options that have just opened to the right scroll down and select display equation on chart and as you can see this is the same equation we looked at earlier finally if you selected the normal probability plots option in the regression setup window you will also see a table called probability output and a graph called the normal probability plot which is a scatter part of this data the x-axis plots the percentile value ranging from 0 to 100 and the y axis plots the y variable data weight in this case the normal probability plot is used to determine whether the y data fits a normal distribution essentially what you are looking for is a straight line of data and as you can see there is a nice straight line of data for my example which suggests the weight data are normally distributed however it&#39;s worth noting that the y variable does not actually have to be normally distributed when fitting a linear regression model and i&#39;ll go into a bit more detail about the assumptions of the linear regression in a future tutorial and that brings me to the end of this tutorial you now know how to perform a simple linear regression test in microsoft excel and how to interpret the output of results if you found this video useful please leave a like it really does help support the channel if you&#39;ve got a question pop it down in the comments below also consider subscribing for more weekly tutorials

Transcript for:Simple Linear Regression in Microsoft Excel

Transcript for:
Simple Linear Regression in Microsoft Excel