We've seen that simple linear regression provides a great tool for understanding the relationship between two numeric variables. For instance, if we were working at Kroger, we might be interested in predicting how much money people spend per year at Kroger based on their yearly income. Simple linear regression provides an equation to make a prediction, and even to come up with 95% confidence and prediction intervals, for a particular household that has a certain income, or for the average amount spent over all households that share a common income of, say, $50,000. Simple linear regression also allows us to compare the expected amount spent across different households: two households that differ in income by, say, $1,000 per year might be expected to differ in their spending at Kroger by maybe $3.40 on average.

But you know what's better than predicting household expenditure from one variable? Predicting it from two variables. Or three, or four, or five. That's going to be the job of multiple linear regression, where we have multiple potential predictors of a quantity like household expenditure at Kroger and we want to utilize all of the information available to us. If we want to predict household expenditure at Kroger, we'd love to use not only income but also the distance the household lives from the nearest Kroger, the number of kids, the number of pets, the number of cars belonging to that household, etc. There might be multiple viable predictors we want to use in order to come up with the best predictions possible, and the multiple regression model allows us to leverage multiple sources of information to produce a better prediction than any one of those predictors could have given us.

In the model, we're still predicting a Y variable; this might be the amount of money someone spends at Kroger, someone's salary, or the length of time someone spends on a website. As the name suggests, we now have multiple X variables that serve as predictors. X1 is our first predictor variable, maybe someone's income; X2 is our second predictor variable, maybe the number of kids in the household; X3, X4, and so on are other predictors we might use. We'll say we have K total predictor variables, X1 through XK. The form of the model pieces together the predicted value of Y as, more or less, a weighted sum of the predictor variables, with each predictor getting a weight: the coefficients beta 1, beta 2, beta 3, etc. Along with that, we add an intercept, and since individual values of Y will deviate from the overall average, we throw in a disturbance, an epsilon term, that captures the random amount by which individuals differ above or below the overall average value of Y at that particular combination of X values.
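Written out, the model just described takes the form

Y = β0 + β1·X1 + β2·X2 + ⋯ + βK·XK + ε

where β0 is the intercept, β1 through βK are the weights on the K predictors, and ε is the disturbance: the random amount by which an individual's Y sits above or below the average Y at that combination of X values.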
So let's use some multiple linear regression to do some predictive analytics. Just like with simple linear regression, our predictions come in two flavors: we might be interested in predicting the average value of Y among all individuals in the population with a common set of X values, or we might be interested in predicting a particular individual's value of Y at a particular set of predictor values. Let's see how this works.

The example I want to look at involves predicting the tip percentage left on a bill based on the bill amount and the party size. First, let me run the first chunk of code, which creates some additional data sets we'll use a bit later, and then I'll scroll down to where we fit a multiple regression model for the tip percentage. I'll load the regclass library and the EX2.TIPS data frame, because this data lets us do the analysis: we have a column for the tip percentage left on each bill, a column for the bill amount in US dollars, and a column for the party size.

So how do we set up a regression model that uses multiple predictors? It's a pretty simple extension of the simple linear regression framework. We create an object, M.TIPS, that's the result of running the lm command (lm for linear model), just like with simple linear regression, and the setup is fairly similar. The first argument is a formula that tells R what we're predicting and what we're using to make those predictions. Our Y variable, the quantity we're predicting, is the tip percentage, so it goes on the left-hand side of the twiddle (~) to let R know that's our quantity of interest. The right-hand side of the twiddle is the place for all the predictor variables, the columns in the data frame that are to serve as predictors, separated by plus signs. So we're predicting the tip percentage from the bill and the size of the party, and since both of these columns live in the EX2.TIPS data frame, we add the data argument as well.

When we run this, we can get a summary, and just like how we translated the output of summary into a regression equation with simple linear regression, we can do something very similar here. The intercept is clearly labeled for us, 19.9 (these are tip percentages, so 19.9%); then we subtract 0.27 times the bill amount and add 0.6 times the size of the party. We'll talk about how to interpret those coefficients a little later on, but we can immediately use this regression equation to make predictions. If a bill was $20 and the party size was two, what would we predict for the tip percentage left on that bill? All we have to do is plug the relevant predictor values into the equation. We start with the intercept, the 19.9%, as our baseline, and then adjust it based on the bill and the party size: we take the coefficient of bill, multiply it by the bill amount of $20, and add the coefficient of party size times our party size of two. Working that out, we predict that the tip percentage left will be 15.68%.
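In R, the fit looks something like this. This is a minimal sketch: the exact column names in EX2.TIPS (tip percentage, bill, size of party) are assumptions based on the narration, so check names(EX2.TIPS) against your copy of the package.

```r
library(regclass)        # provides the EX2.TIPS data frame
data("EX2.TIPS")

# Response to the left of the twiddle, predictors separated by plus signs;
# column names here are assumed from the narration
M.TIPS <- lm(Tip.Percentage ~ Bill + Size.of.Party, data = EX2.TIPS)
summary(M.TIPS)

# Plugging in by hand for a $20 bill and a party of two:
19.9 - 0.27 * 20 + 0.6 * 2   # about 15.7, matching the 15.68% quoted above
```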
Now remember, there are two types of predictions we do in analytics: we might be predicting the overall average value of Y among all individuals in the population that share a particular combination of X's, or we might be trying to predict an individual's value of Y at a particular combination of X's. We can run both of these predictions with the predict command. Just like before, we set up a data frame, TO.PREDICT, that gives the values of bill and party size we want estimates at. In this data frame, each column name matches the name of the corresponding predictor column in the data frame we used to build the model. I'll give the bill column a vector of the bill amounts I want predictions at, $20 and $60, and then add a second column for the second predictor, with party sizes of 2 and 3. So we're making predictions for a bill of $20 from a party of size two, and a bill of $60 from a party of size three. We can confirm that the TO.PREDICT data frame contains the combinations we're interested in, and then we just pass it through the predict function: the name of our model, newdata = TO.PREDICT, and an interval argument that specifies what kind of interval we want for our predictions. interval = "confidence" gives a 95% confidence interval for the average tip percentage among all parties of size two that spend $20 (and all parties of size three that spend $60). interval = "prediction" gives a 95% prediction interval for some particular party that shows up with two people and spends $20 (or some particular party that shows up with three people and spends $60). Let's look at both.
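A sketch of both calls, continuing with the same assumed column names:

```r
# The combinations we want predictions at; column names must match
# the ones used in the model formula (names assumed from the narration)
TO.PREDICT <- data.frame(Bill = c(20, 60), Size.of.Party = c(2, 3))

# 95% confidence intervals for the AVERAGE tip percentage
predict(M.TIPS, newdata = TO.PREDICT, interval = "confidence")

# 95% prediction intervals for a PARTICULAR party's tip percentage
predict(M.TIPS, newdata = TO.PREDICT, interval = "prediction")
```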
Our 95% confidence interval for when the bill is $20 and the party is size two is as follows: our best guess is that the tip percentage will be 15.68%, and we're 95% confident that the average tip percentage among all parties of size two that spend $20 is somewhere between 14.77% and 16.6%. So is the actual average tip percentage for these types of parties in there? We don't know; that's exactly what we're trying to estimate. What we do know about a 95% confidence interval is that the procedure we're using to build it gets it right, covers the true value of the average, 95% of the time.

Now, normally in analytics we're more interested in predicting individual amounts. A party just came in, there are two people in that party, and they just ordered some food, so we know their bill is going to be $20. What do we predict for that particular party's tip percentage? With the interval = "prediction" argument we get a 95% prediction interval. The point estimate, our best guess, is still exactly the same, 15.68% left on that bill, and now we can say there's a 95% chance that the tip percentage will be between 4.3 and 27.1 percentage points. You'll notice the model kind of breaks down and starts giving nonsensical predictions for the other combination, a bill of $60 from a party of size three, because its 95% prediction interval contains negative values. The model simply doesn't know that the lowest possible tip percentage is zero. That's a shortcoming of our model; we know all models are wrong but some are useful, and this prediction interval isn't the most useful for that particular combination.

So predictive analytics with multiple regression: pretty easy to do. Of course, you'll have to make sure the regression model is a reasonable reflection of reality, and we'll talk about how to check the assumptions of a multiple regression model later on, but for now, if everything is peachy keen, it's pretty easy to do predictive analytics with multiple linear regression.

Now, multiple regression can also be used for descriptive analytics, to try to understand the relationship between our Y variable and our predictor variables. But this is sufficiently more complicated that even people who have been doing regression for years very often misinterpret the coefficients of a multiple linear regression model, so we want to make sure that by the time you're done with this class you are experts at talking about exactly what the coefficients of a multiple linear regression model mean and what they don't mean.

First, a very quick review of how we interpret the coefficient in a simple linear regression, when we have just one predictor variable. That coefficient is what we call the slope, or beta 1, of our model, and we've had a lot of practice interpreting it: we envision two individuals who differ in X by one unit, and the slope beta 1 tells us the expected difference in Y between them. If the coefficient is positive, the individual with the larger value of X is expected to have the larger value of Y, and if the coefficient is negative, the individual with the smaller value of X is expected to have the larger value of Y. But simple linear regression has a pretty severe shortcoming when there are multiple different characteristics that are good predictors of Y: sometimes you can't put too much weight on the interpretation of the coefficient when there's more complexity involved in the relationship.

I want to illustrate this with a particular example. Imagine that someone's grade in a class is determined by their score on the assignments and their attendance score, weighted 80% assignments and 20% attendance. Now here's a little twist: the students act in a very particular way. Students who feel confident in their ability to do the work, who are doing well on their assignments, tend not to show up to class all that often, so their attendance scores are fairly low. And vice versa: students who feel like they're struggling, who aren't doing the best on their assignments, show up a bit more often, because the class sessions seem to help, boosting their scores from where they would be if they weren't attending.

So let's check out a simple linear regression model predicting someone's grade from their assignments, based on data that was created in that very first R chunk. The first thing I'll do is create a scatter plot showing the relationship between the course grade and the assignment grade, color-coded by the two types of students we know live in this data set: the ones excelling in the class and the ones struggling. Looking at the scatter plot, we see two separate streams of points, and that's because these two types of students are well differentiated from each other. We can fit a regression model and add its line to the plot; this is our overall simple linear regression. Let's see whether it captures the weight assigned to the assignment scores when determining the course grade. If we get a summary of our regression model, we find that the estimated coefficient of assignments, the estimated slope, is 0.583, and if we get a 95% confidence interval for it, we're 95% confident that the true slope is somewhere between 0.52 and 0.65. But that's wrong: we know, by design of the way the grade is set up, that assignments should have a slope of 0.8, because we take the assignment score and multiply it by 0.8 to get someone's grade. The regression is missing the actual coefficient in this case.
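The first R chunk that generates this grade data isn't shown in the transcript, so here is a hypothetical simulation in the same spirit: the grade is an exact 80/20 weighted sum, and stronger students attend less. It shows both simple fits missing the designed weights (the attendance fit is discussed next).

```r
# Hypothetical simulation of the grade setup described above; the course's
# own data-creating chunk isn't shown, so the numbers here are illustrative
set.seed(1)
assignments <- runif(200, min = 50, max = 100)
attendance  <- pmin(150 - assignments + rnorm(200, sd = 10), 100)  # strugglers attend more
grade       <- 0.8 * assignments + 0.2 * attendance

coef(lm(grade ~ assignments))   # slope lands well below the true 0.8
coef(lm(grade ~ attendance))    # slope even comes out negative, not +0.2
```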
And the news is even worse if we look at the relationship between grade and attendance. We make the scatter plot, with its two groups of students, fit a linear regression model, add it to the scatter plot, and look at a summary, and we find something a bit puzzling: the estimated coefficient of attendance is negative, implying that students who attend more often have lower grades than students who attend less often. In fact, if we get a 95% confidence interval, we find that we're 95% confident that the true slope is negative; all the values in the interval are negative, implying that the more students attend, the worse they typically do in the course overall. And we know that's wrong as well, because the course grade is partly determined by attendance; in fact, the weight assigned to it is 0.2. The simple linear regression is missing the truth here, missing the full picture of how these variables are related.

That's because when we fit a simple linear regression, we're really only accounting for one source of variation in our Y variable. Yes, someone's grade varies based on their assignments; yes, someone's grade varies based on their attendance. But if we fit a model that predicts grade from one of them and not the other, we're leaving out a key component of how individuals' grades vary. When we have extra sources of variation that aren't inside the model, we call them lurking variables, because they're another reason we expect individuals' Y values to vary, and if we don't put them in the model, we leave ourselves open to potentially misleading coefficients. These lurking variables can profoundly influence the impression of the relationship between Y and X, leaving us scratching our heads in some cases, like how we saw the coefficient of attendance come out negative when attendance helps determine the grade.

Going back to the summary of grade versus assignments, we would say that two students whose assignment grades differ by one point are expected to differ in course grade by 0.58. But when we say they differ in assignment grade by one, they're also potentially differing in other substantial and important ways: we have no comment about how they differ in attendance, and because attendance is also tied into someone's final grade, we're missing out on that key part of the story. Attendance is a lurking variable here: we're predicting grade from assignments, but attendance also influences someone's grade, and it's another reason these students might vary from one another when they do differ in assignments by, say, one percentage point. Our model isn't accounting for that, and when we don't capture all the sources of variation, the coefficients of our model can turn out to be pretty misleading.

The good news is that this is exactly the purpose of multiple regression. Multiple regression allows us to account for lurking variables, to account for all the potential sources of variation in individuals' Y values, as long as we can identify all the reasons why individuals might have different Y values. So let's predict someone's course grade from both their assignment score and their attendance score. If we predict using both of these variables, the only two sources of variation in people's grades, what we find is that we recover the coefficients exactly: the weight given to assignments was 0.8, and the weight given to attendance was 0.2. We discover the true form of the relationship once we've accounted for all of the sources of variation in individuals' course grades.
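Continuing the hypothetical simulation above, putting both sources of variation in the model recovers the designed weights:

```r
# Fitting BOTH predictors recovers the designed 0.8 and 0.2 weights
# (the fit is essentially perfect here, because the simulated grade
# was built as an exact weighted sum of the two predictors)
coef(lm(grade ~ assignments + attendance))
```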
So multiple linear regression is a fantastic tool for fully studying the complex relationships between an individual's Y value, like their salary or the amount they spend, and a multitude of potential predictors, a multitude of reasons why they might be expected to have different Y values. Let's see how the multiple linear regression model does this: how do we interpret the coefficients that come out of it? Surprisingly, this is one of the most complex and subtle things in all of analytics. Even people who have been doing regression for years and years still often misinterpret the coefficients of predictors in a multiple linear regression model, and we want to make sure that by the time you finish this course you are experts at letting everyone know what a coefficient does, and doesn't, tell you.

Here's how we interpret the coefficient of a predictor in a multiple linear regression. We still envision two different individuals, but these individuals differ in a very specific way. If we're looking at the coefficient of a predictor, call it X, we say the two individuals differ in X by one unit but are otherwise identical, meaning they have identical values for all of the other predictors we put in the model; the only thing they differ in is X, and they differ in X by one unit. So when we interpret this coefficient, we envision these two individuals and say that two otherwise identical individuals who differ in X by one unit are expected to differ in Y by whatever that coefficient is. If the coefficient is positive, the individual with the larger value of X is expected to have the larger value of Y; if it's negative, the individual with the larger value of X is expected to have the smaller value of Y. The key update to our interpretation is that we're envisioning otherwise identical individuals: they differ in X by one unit but are the same in terms of all of the other predictors in the regression model. This lets us account for variation in individuals' Y values due to those other factors, because we're holding them constant between the two individuals.

Let's get some practice interpreting the coefficients of a regression model. Here's a different salary data set that lives in the regclass package; it's a real-world data set that comes from a bank. Let's try to predict salary from someone's number of years of experience. In this data set we actually have multiple potential predictors: the number of years of education, the number of years of experience, the number of months the employee has been working at the company, and the employee's gender. If we just fit a simple linear regression predicting salary from experience, we're leaving out other potentially important sources of variation, and we'll see why this matters in a second. So we load up the data, fit the simple linear regression model, and take a look at the summary.
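A sketch of both fits, simple and then full. SALARY ships with regclass; the column names below are assumptions based on the narration, so verify with names(SALARY).

```r
# Simple model first: salary from experience alone
data("SALARY")
M.SIMPLE <- lm(Salary ~ Experience, data = SALARY)
summary(M.SIMPLE)   # the lecture reads off a coefficient near 10.4

# Now all the measured sources of variation at once
M.FULL <- lm(Salary ~ Education + Experience + Months, data = SALARY)
summary(M.FULL)     # the experience coefficient is now near 11.8
```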
What we see is that the estimated coefficient of experience is 10.4; however, it's not statistically significant, because the p-value is at least 0.05, and so we would say that experience is not a statistically significant predictor of salary. A little bit weird, because you would expect experience to be tied in there. But notice what happens when we include all the potential sources of variation for someone's salary: not only their experience but also the length of time they've worked at the company and their number of years of education. Once we account for these other sources of variation, we get a more complete picture of what these relationships look like. So we fit that model and get a summary of it, and now the story has changed a little. If we look at the coefficient of experience, we find that it's 11.8, and it is statistically significant: its p-value is less than 0.05 (much more on that in a later lecture). We're discovering something pretty important after all: experience does seem to be tied into someone's salary. The simple linear regression missed that fact because it didn't incorporate other reasons why individual salaries might vary; we didn't have the number of months a person had been working at the company, and we didn't have the education they had when they were hired. By focusing on experience alone, the variation in salary due to these other factors, tenure at the company and years of education, washed out the effect that experience had, leaving us unable to detect it with just the simple linear regression model. By throwing in all of the potential sources of variation, education, experience, and months, we get a much clearer picture.

So what does this 11.8 really tell us? It tells us that among two otherwise identical individuals, meaning they have the same years of education and the same number of months at the company, who differ in experience by one year, we expect their salaries to differ by $11.85, with the person with more experience expected to have the higher salary. Let's do this again for the education variable, whose coefficient is just about 93. How do we interpret this? We compare two individuals who have the same value of experience and the same value of months; it doesn't matter what those values are, they're just identical. Among those two employees, if they differ in education by one year, we expect their salaries to differ by about $93, with the person with more years of education expected to have the higher salary.

Let's do another example, where we predict the tip percentage left on a bill from the bill amount and the size of the party. We have an intercept of 19.9%, a coefficient of bill of -0.27, and a coefficient of party size of right about 0.6. What does the -0.27 mean? We start our interpretation with two otherwise identical parties with respect to the other predictors in the model; the only other predictor besides bill is party size, so we're assuming the two parties have the same number of people in them. Two equally sized parties that differ in their bill amount by $1 are expected to differ in their tip percentage by 0.27 percentage points, and since the coefficient is negative, the party with the higher bill is expected to leave the lower tip percentage.
All right, what about the 0.6, the coefficient of size of party? Once again we start our interpretation with two otherwise identical parties, meaning they have identical values for the other predictors in the model; bill is the only other predictor, so the two parties spend the same amount of money on their bill. When two parties that spend the same amount differ in party size by one, we expect them to differ in tip percentage by 0.6 percentage points, and since the coefficient is positive, the party with the larger number of people in it is expected to have the larger tip percentage.

One more data set to practice on. Let's load up the body fat data set that lives in regclass. In the body fat data set, we're given someone's body fat percentage as well as their age, their weight, their height, and then a bunch of physical measurements: the circumference of their neck, their chest, their abdomen, etc. Let's try to predict the body fat percentage from every predictor we have available to us. Instead of typing out every single column name separated by plus signs, there's a shortcut for that: body fat twiddle dot means predict body fat from every other available column in the data frame. So we'll fit that linear regression model and look at a summary of it.
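A sketch of that fit; BODYFAT ships with regclass, and the response column name below is assumed from the narration.

```r
# "~ ." is the shortcut: every other column in the data frame serves
# as a predictor (response column name assumed from the narration)
data("BODYFAT")
M.BODYFAT <- lm(BodyFat ~ ., data = BODYFAT)
summary(M.BODYFAT)
```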
That's a lot of different predictors and a lot of different coefficients, so let's see what we can do with them. Look at the coefficient of abdomen: it's about 0.89. What does this tell us? We start the interpretation by saying that two otherwise identical individuals who differ in abdomen circumference by 1 cm are expected to differ in body fat percentage by 0.89 percentage points. By otherwise identical, we mean they have the exact same age, the exact same weight, the exact same height, the exact same neck circumference, chest circumference, etc.; they're identical with respect to all the other predictors in the model. Now you might be scratching your head, or your chin, and asking: are there individuals like this? Are there two individuals who differ in abdomen circumference by 1 cm but otherwise have the exact same body proportions, height, weight, and age? Well, maybe there are, maybe there aren't, but that's how we do the interpretation of this coefficient in a multiple linear regression model: we envision two otherwise identical individuals, same body measurements, height, weight, and age, who differ in abdomen circumference by 1 cm, and if that's the case, we expect them to differ in body fat percentage by 0.89 percentage points.

All right, what about the coefficient of weight? Here we see that it's negative, -0.08. What does that tell us? Two otherwise identical individuals who differ in weight by one pound are expected to differ in body fat percentage by 0.08 percentage points, and since the coefficient is negative, we expect the person with the larger weight to have the smaller body fat percentage. Does that sound right to you? It doesn't sound right to me, because the way I know weight and body fat percentage to be related, the more you weigh, typically the more body fat you have. But remember the specifics of how we interpret this number: we're envisioning two otherwise identical individuals with respect to all the other predictors in the model. The two people we're comparing, who differ in weight by one pound, have the same age, the same height, and the same circumferences for all of these physical measurements: their neck, their abdomen, their thigh, etc. And now the coefficient actually makes a bit of sense. If we're comparing two people with the exact same body proportions but one weighs a bit more than the other, it stands to reason that the heavier person probably has more muscle mass; muscle weighs more, and someone who weighs more because of additional muscle is going to have a smaller body fat percentage. So I think it really does make sense after all.

Once we've come up with our estimated coefficient, the natural question is: how far off is our estimated coefficient from the truth, the true coefficient we would measure if we had census data on everyone in the population? We know our guess is a good guess, but it's wrong; hopefully it's not too wrong. This is a term we've used before: the standard error of our estimated coefficient. The standard error is our best guess for how wrong our guess is; it's the typical difference between our guess and what the true value of the coefficient would be if we had a census. For example, we estimated the coefficient of weight to be -0.08. Surely that's wrong, but it's not going to be that wrong; it's only going to be off by a little bit. How far off do we think it is? If we just move our eyes to the right in the summary output, we get the standard error of the estimated coefficient: we think our guess of -0.08 is off by maybe 0.05 or so.

What I want to do now is talk about a quirk of multiple linear regression: the size of the standard errors of our estimated coefficients is tied to the amount of redundancy among our predictor variables. If we think about it, abdomen circumference is definitely related to someone's body fat percentage, but the information about body fat percentage that abdomen contains is likely somewhat redundant with the information that, say, chest circumference has, and that neck circumference has as well. As people gain more and more body fat, they tend to gain it all around their body, so all of these measurements are somewhat correlated with one another: people with above-average abdomen circumferences are also going to have above-average neck circumferences, chest circumferences, thigh circumferences, etc. The information that all of these physical measurements carry about body fat percentage is largely shared; they're not independent measurements of people's body fat, because they're all related. And it turns out that the size of our standard errors is tied to this amount of redundancy.

Let me flesh this out by seeing how well I can predict abdomen from the other predictors in the model. I can set up a linear model in lm that predicts abdomen from everything except body fat: writing twiddle dot minus a predictor means use everything but that particular predictor. So if I want to predict abdomen from all of the other X's, which is every other column except body fat (that's my Y variable), I can set up the model like this.
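A sketch of that redundancy check (column names assumed, as before):

```r
# How well do the OTHER predictors pin down Abdomen? "~ . - BodyFat"
# means every column except BodyFat; Abdomen is now the response
M.REDUNDANCY <- lm(Abdomen ~ . - BodyFat, data = BODYFAT)
summary(M.REDUNDANCY)$r.squared   # the lecture reports about 0.915
```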
When I do so, here's what I find. The R-squared of this model tells us that knowing all of these other physical measurements gets us 91.5% of the way there to knowing someone's abdomen circumference. So yes, abdomen circumference is surely tied into someone's body fat percentage, but the information abdomen has about body fat percentage is already largely encoded, like 91.5% of it, in the set of other physical measurements. And it turns out that when there's a lot of redundant information flying around, it degrades the precision of our estimated coefficients: it makes the standard errors larger than they would be if every predictor variable were independent of the others.

Let's see that in action. In a model that predicts body fat from abdomen circumference alone, we estimate the coefficient to be 0.585, and we think that guess is probably off by maybe 0.026 or so, the standard error. Now, if I fit the full multiple regression model predicting body fat from everything, look at a summary, and find the standard error of my abdomen coefficient, I see it's quite a bit larger: 0.08 versus the 0.026 when abdomen was the only predictor. This is a hallmark of a multiple linear regression model: when there's a lot of redundant information floating around among the predictors, the precision of the estimated coefficients gets degraded, and the standard errors get quite a bit larger.

And that kind of makes sense if we think back: we were a bit puzzled thinking about two individuals who differed in abdomen circumference by 1 cm but were exactly the same in all of these other measurements. That's not really what happens in reality: if two people differ in abdomen circumference by 1 cm, they're going to differ in all of these other variables by some amount as well. Since the interpretation of a coefficient in a multiple linear regression puts us in a world where our two individuals differ only in abdomen by 1 cm but are exactly the same in every other regard, it's really difficult to disentangle these effects; individuals don't really differ in abdomen alone, they differ in other things simultaneously. The regression model does its best, trying to imagine a world where we can find individuals who differ in abdomen circumference by 1 cm and nothing else, but the data doesn't really bear that out, so it has to do a bit of guesswork about what those two individuals would look like, and that extra guesswork degrades the precision of our estimated coefficient.

So, to make a long story short (too late): if there's a lot of correlation among the predictors in your multiple linear regression model, the precision of the estimated coefficients gets degraded, the standard errors of those coefficients get larger, and it gets harder and harder to nail down how the Y values of two individuals will differ when they differ in X by just one unit, since they're expected to differ in all these other characteristics simultaneously. We can gauge how much of an effect this correlation has on our model by looking at what's known as the variance inflation factors of the coefficients. After I fit a model, I can type VIF with my model inside, and it tells me how much the correlation among the predictors has affected the precision of each estimated coefficient.
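A sketch of that check, assuming the VIF function the lecture types is the one regclass provides:

```r
# Variance inflation factors for the full body fat model
VIF(M.BODYFAT)
# The square root of a VIF is the factor by which that coefficient's
# standard error grew relative to having uncorrelated predictors
sqrt(VIF(M.BODYFAT))
```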
Abdomen gets a variance inflation factor of 11.77. If I take the square root of that, it tells me the factor by which the standard error has increased due to the correlation between abdomen and the other predictors: the standard error of the abdomen coefficient in my model is about 3.4 times larger than it would have been if every single one of these predictors were independent of one another. All these measurements share information about body fat percentage, and there's so much redundancy flying around that it's hard to separate out what can be attributed to abdomen circumference alone.

Now, how large does a variance inflation factor have to be before we start worrying about the precision of our estimated coefficients? Surprisingly, there's no hard-and-fast rule. Very often, if you open up a textbook, it'll say that a variance inflation factor of around 5 or 10 means the model is having trouble estimating the coefficient of that predictor variable. But I have another guideline that I seem to like a little better. Let's fit that body fat model one more time, predicting body fat percentage from everything we have available, and look at the R-squared of the model. In this case, using our model gets us about 74.9% of the way there to knowing someone's body fat percentage, so it's actually a pretty decent model for making predictions. What I'm going to do is calculate the quantity 1/(1 - R²), which comes out to about 4 here, and use it as a threshold to compare the variance inflation factors against.
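A sketch of that rule of thumb:

```r
# The lecturer's guideline: compare each VIF to 1/(1 - R^2) of the model
r.squared <- summary(M.BODYFAT)$r.squared   # about 0.749 here
threshold <- 1 / (1 - r.squared)            # about 4
VIF(M.BODYFAT) > threshold                  # flags the troublesome predictors
```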
Printing out the variance inflation factors alongside that threshold, I'm looking for any that are bigger than four, and we see them pretty much all over the place: neck, chest, abdomen, hip, etc. Why does this threshold make sense? Look at the variance inflation factor of abdomen one more time, about 11.8. Because it's larger than the threshold of four, the other predictors in the model are essentially predicting abdomen better than they're predicting the body fat percentage. If your other predictors can predict an X variable better than they can predict the Y variable, that's kind of an issue, and it's a signal that the model is having difficulty estimating the coefficient of that predictor. So bigger than 5, bigger than 10 (kind of a red flag), or bigger than that 1/(1 - R²): all pretty good thresholds for determining when this inter-variable correlation is an issue.

We have a special name for this: when a bunch of predictors have large VIFs, suffering from high correlations, we say the model suffers from multicollinearity. Multicollinearity just means that a lot of the predictor variables are correlated, so the model is going to have a hard time disentangling the effect of each predictor on the Y variable from the effects of the others. The consequence of a model fraught with multicollinearity is simply that the model has trouble estimating those coefficients. If you're doing descriptive analytics, you have your work cut out for you, because there aren't really individuals who differ in X by just one unit while being identical otherwise; these predictor variables change simultaneously along with their correlations. The key thing, though, is that the presence of multicollinearity doesn't invalidate the model. The model is still good; it's just having trouble estimating those coefficients with nice precision. Even better, if we're using our model for predictive analytics, it turns out that multicollinearity doesn't have much of an effect on the predictive value of the model. So when we do see large VIFs, we know the model is having a hard time estimating those coefficients: the predictions will still be pretty good, but interpreting those coefficients might be a bit on the tricky side.