Hello and welcome to the next video in my series on basic statistics. If you are a first-time viewer, please stick around for the intro; it is worth the time. If you are a regular viewer, feel free to skip ahead using the annotation.

So first, a few things. I do these videos because I love to learn and help others learn. We are all good at something, so I encourage you to give back to the world in a similar way: share your passion any way you can. Now, this video focuses on basic stats and is not a quick fix. It aims to be thorough. My goal is an understanding of fundamental concepts, and that takes time, but when you understand the fundamentals, learning other topics is much easier. Related to that, if you are watching because you are struggling in a class or at work, I want you to stay positive and keep your head up. You can learn this. I have faith in you, many other people around you have faith in you, and so should you.

Feel free to connect with me on LinkedIn, Google+, and Twitter, and of course subscribe here on YouTube. If you think there is something I can do better, please leave a constructive comment below; I do take those comments into account when I make new videos. I also encourage you to talk with other viewers in the comments and help each other out when you can. And finally, if you like the video, please give it a thumbs up, share it with classmates or colleagues, and put it on a playlist to review later. That does encourage me to keep making them for you. So, all that being said, let's go ahead and start learning.

So here we are in the fourth video in our series on multiple regression. Now, as I'm sure you know, there are many different data types. We're most familiar with interval data, like the temperature outside or the value of money or the mass of something on a scale. But there are other types, like categorical variables: male/female, yes/no, true/false, north/south/east/west, London/Liverpool/Blackpool/Newcastle. You get the idea. Luckily, regression is a very flexible statistical technique, and
we can use categorical variables in our analysis. So this first video is all about that: using a technique called dummy variables to represent categorical information.

As usual, let's go ahead and start out with an actual problem. Now, I will say that most of the data in this problem is actually real. I went out and got it on the internet. I did change some of the numbers for pedagogical reasons, but other than that this is real data.

So here is our scenario. You are an analyst for a small company that develops house pricing models for independent realtors. To generate your models, you use publicly available data such as list price, the square footage of the home, the number of bedrooms the home has, the number of bathrooms, etc. But you're thinking sort of outside of the box here. You are interested in another question: is the public high school in the neighborhood exemplary (that's the highest rating), and how is that rating related to the home price? The high school rating is not quantitative. It is qualitative; it's categorical. So for each home price, the high school is either exemplary or not, yes or no, and those are going to be the two categories for one of our variables.

So here is our home price data. On the top we have price in thousands; that's our dependent variable. Then we have square feet; that's our first independent variable. And then we have exemplary high school; that's our second independent variable. As we can see, the first home has a price of $145,000, the square footage is 1,872 square feet, and that home is in a school district where the high school is not exemplary. If we go down to the third one, that home is $315,000, it is over 4,000 square feet, and it is in a school district where the public high school is exemplary. So you can see how this data works. We have price, that's our dependent variable, and then we have our two independent variables: square footage, and whether or not the home is in a school district where the high school is exemplary. So what I went ahead
and did is code the exemplary high school column. Everything is the same, but in the last column you'll see that if the high school is not exemplary, I coded that as a zero, and if the high school is exemplary, I coded that as a one. This is sort of the first lesson in dummy variables. We have two categories here; I assigned one of them a zero and the other one a one. That's completely arbitrary. I could have switched them. I could have made not exemplary one and exemplary zero. It does not matter which way it goes, but for me it made more sense that exemplary would be denoted with a one.

So here, y is the home price in thousands, X1 is the square footage of the home, and X2 is one if the high school is exemplary and zero otherwise. That's also a common way to write these: one if it's exemplary, zero otherwise. Oftentimes, as I'll show you here in a minute, there are more than two categories, so it's best just to put "zero otherwise."

So here is a grouped scatter plot of our data. You can see a definite pattern here. The blue dots represent homes where the high school is not exemplary, and the red squares represent homes where the high school is exemplary. On the bottom of our graph we have square footage, and on the left-hand side, the y-axis, we have the price in thousands. Most of our homes down here in the lower left are smaller homes; they are homes where the high school is not exemplary, and the price is less. The red squares show us homes in districts where the high school is exemplary; they're larger homes, and they are higher priced.

So you can look at it two ways. We can look at the two groups individually, but if you sort of squint your eyes and look at the data points as a group, we can see that there seems to be a definite pattern here. So we have two patterns going on. The data points as a whole start in the lower left and go up to the upper right, and then we have sort of a separation in the
middle, where the non-exemplary schools are on the lower left and the exemplary schools are in the upper right. So there's kind of an imaginary line that runs through the graph here, separating the blue dots from the red ones.

So what are dummy variables exactly? In many situations we must work with categorical independent variables. In regression analysis we call these dummy variables, or sometimes they're called indicator variables; they mean the same thing. For a variable with n categories, there are always going to be n minus 1 dummy variables, and we'll walk through that here in a minute. For example, in this case we have exemplary high schools and not exemplary high schools, so there are two categories, and 2 minus 1 equals one dummy variable. We saw that in our original data: we had one dummy variable that represented the exemplary schools and the non-exemplary schools with ones and zeros. Now, not related to this problem necessarily, at least yet, let's say we have four categories: north, south, east, and west. There are four categories, so 4 minus 1 equals three dummy variables, and we'll look at that here in a second.

Even though it's not related to this problem necessarily, let's go ahead and look at the north, south, east, and west example we talked about in the previous slide. Let's say we have North, South, East, and West; maybe we're looking at housing data or sales data across these four regions. How could we code these as dummy variables? We have four categories, so we're going to need 4 minus 1 dummy variables. We could code it like this. We have X1, X2, and X3 along the top; those are our three dummy variables. We could represent North where X1 is one and X2 and X3 are zero. The South region would be zero for X1, one for X2, and zero for X3. East would be zero, zero, one. And here's what confuses some people: West, the fourth region, would be zeros all across. So West would be coded as nothing. So North would be a one for X1, South would be a one for X2,
East would be a one for X3, and for the West region we would not put any ones in our regression. We'll see how that works as we go forward. So this is an example of a variable with four categories and three dummy variables.

So let's go back to our problem at hand. This is very similar to some of the other multiple regression we did in previous videos. The expected value of y, the dependent variable, equals beta 0 (that's our intercept), plus beta 1 X1 (that's our first coefficient and our first independent variable), plus beta 2 X2 (that's our second coefficient and our second variable). Now, we have two things going on here. We have one case where X2 is zero, where the high school is not exemplary, and then we have another case where the high school is exemplary and X2 is one.

Let's look at the first case first: the expected value of home price given the high school is not exemplary. That's where X2 equals 0. So we're going to go ahead and change the estimated regression equation up there at the top to reflect that. E, the expected value of y, our dependent variable, given (that's the straight line) that the high school is not exemplary. We rewrite that equation as beta 0 plus beta 1 X1 plus beta 2 times 0, because remember, when X2 is 0, that means our high school is not exemplary, so we can go ahead and put that in for X2. Now we just do some simple algebra. Beta 2 times 0 is 0, so it basically disappears, and we're left with beta 0 plus beta 1 X1.

Now what about when the high school is exemplary and X2 equals 1? Same process: beta 0 plus beta 1 X1 plus beta 2, but this time times 1, because remember, that's the value when the high school is exemplary. Again, some simple algebra: beta 0 plus beta 1 X1 plus beta 2, because beta 2 times 1 is itself, beta 2. Now, beta 0 and beta 2 are going to be two constants, just numbers without any variables attached, so we can actually combine them. In parentheses we have beta 0 plus beta 2, plus beta 1 X1. So we
actually have two different regression equations here. In the first case X2 is 0; in the second case X2 is 1. So we always have to realize that there is a regression equation for every possible scenario in the dummy variable. In this case it's two: we have zero and one as the values of that dummy variable.

So when we conduct the regression in Minitab, this is what we get. We can see our categorical predictor coding of 1 and 0. Then we have our ANOVA table as usual. If we look across the regression line there, we can see that our F value is 35.94 with a p-value of 0.000. That of course means it's less than 0.001 and is significant. Now we look down here at the bottom at our model summary. We have an R-squared of 85.7, an R-squared adjusted of 83.31, and an R-squared predicted of 70.3. So that's a high R-squared and a high R-squared adjusted, along with an R-squared predicted that doesn't fall off a cliff. Remember, in part three I mentioned that even if we have a high R-squared and a high R-squared adjusted, if our R-squared predicted just goes crazy low, like off a cliff, say, you know, 50%, then we would be concerned our regression equation is not doing a good job at predicting. So everything here looks good.

Let's go ahead and look at our coefficients. We have the constant term; we don't worry about that in this case. Then we have the square foot variable, which has a p-value of 0.010, so that is significant at 0.05. And then we have exemplary high school, where the value is one, and that is significant at 0.018. So the square foot coefficient is significant and the exemplary high school coefficient is significant.

Now what about the values of the coefficients? For square foot it's 0.0621; remember, that's in thousands of dollars. For exemplary high school it's 98.6, and we'll talk about that here in a second. So first let's talk about square foot and its coefficient. Every square foot is related to an increase in the price of the home of 0.0621 thousand, or when you multiply that out, $62.10 per square foot. So think about this for a minute. That's one square foot.
What if the house is 10 square feet larger? Well, that would be $621. What about 100 square feet larger? That would be $6,210. But what about a home 1,000 square feet larger? Well, that would be $62,100 in price. So you can see how each additional square foot is related to the price of the home in this regression model.

So now let's interpret exemplary high school. Remember, its coefficient is 98.6. What does that mean? Well, it means that on average, a home in an area with an exemplary high school is related to a $98,600 higher price. So say we have two homes with the same square footage; let's say both homes are 1,500 square feet. One is in a district where the high school is not exemplary, and one is in a district where the high school is exemplary. Same square footage; the only difference is the high school rating. The home in the district with the higher-rated high school will be $98,600 higher in price.

So let's go ahead and do the full interpretation with the numbers we just generated, along with our regression equation. The equation we got from Minitab is 27.1 plus 0.0621 X1 (we talked about that last slide) plus 98.6 X2. The only thing we didn't really talk about in the last slide was the intercept. That's not really relevant for this type of model, but we still need it, of course. So let's go ahead and just plug everything in.

Take the expected value of a home price given the high school is not exemplary, where X2 equals 0. All we do is take the zero and substitute it in for X2. So 98.6 times 0 is 0, and we're left with 27.1 plus 0.0621 X1. It's a very simple linear equation. Now what about when the high school is exemplary? There X2 equals 1, so we go ahead and substitute everything back in. 98.6 times 1 is 98.6. Now we can combine the 98.6 with the 27.1, so we put that in parentheses, and that gives 125.7 plus 0.0621 X1. So here are our two lines, our two linear equations, and we'll actually look at them on a graph next. So here are the two regression equations that Minitab gives us.
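As a rough sketch of the substitution just described, the fitted equation can be written as a small Python function. The coefficients 27.1, 0.0621, and 98.6 are the values read off the Minitab output in the video; the function and variable names are hypothetical and used only for illustration.

```python
# Coefficients as read from the Minitab output discussed in the video.
# Price is in thousands of dollars. All names here are illustrative.
INTERCEPT = 27.1        # constant term
SQFT_COEF = 0.0621      # thousands of dollars per square foot
EXEMPLARY_COEF = 98.6   # shift when the high school is exemplary


def predicted_price(sqft, exemplary):
    """Return the predicted home price in thousands of dollars."""
    x2 = 1 if exemplary else 0  # dummy variable: 1 if exemplary, 0 otherwise
    return INTERCEPT + SQFT_COEF * sqft + EXEMPLARY_COEF * x2


# Two homes of the same size that differ only in the high school rating.
price_without = predicted_price(1500, exemplary=False)
price_with = predicted_price(1500, exemplary=True)

print(f"Not exemplary: {price_without:.2f} thousand")
print(f"Exemplary:     {price_with:.2f} thousand")
print(f"Difference:    {price_with - price_without:.2f} thousand")
```

With square footage held equal, the two predictions differ only by the dummy coefficient, which is the constant vertical gap between the two lines.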
Minitab is saying that when the high school is not exemplary (that's 0), the price in thousands is equal to 27.1 plus 0.0621 times square feet. We just figured that out on the last slide. When the high school is exemplary (that's 1), the price in thousands is 125.7 plus 0.0621 times square feet. So you notice there that the slopes of these lines are the same, 0.0621. The only things that are different are the intercepts.

So let's go ahead and actually put those on a graph, and here it is. We can see that the first example, where the high school is exemplary, where it's one, that's the 125.7 plus 0.0621 times square feet. That is the blue-purple line here on the top when we graph it. Then of course the red line is the homes without an exemplary high school. So you can see that these two lines are just Algebra 1; we go ahead and put those on a graph. Now, it actually has meaning, though. There's a distance between them, and guess what that is: 98.6. So the average distance between these two lines is the $98,600 that we talked about a couple of slides ago. Everywhere along these lines, the average distance, or the average price difference, is 98.6, or $98,600.

Now, just for the sake of learning, I went ahead and made another scatter plot, but actually put each group's regression line inside of it. As you can see, this looks somewhat similar to the graph we had in the previous slide, but remember, the previous slide is the average over everything. Here, each line is unique to each group. The general pattern is the same, and the red line is actually pretty close to being exactly the same, but because of the way the homes are distributed on here, it's not going to be exactly what we saw in the last slide, because again, regression is all about the averages over everything. But here you can definitely see the difference between the homes that have an exemplary high school in their district and those that don't.

Okay, so that wraps up our introduction to dummy variables. We'll be doing more with dummy variables in the next video, but I
just wanted to get your feet wet so you understand what they are, where they come from, how we use them to code different categories of data, and then how we use those variables that we code in actual regression. Of course, they do get more complex, but the basic interpretation is the same. So hopefully you were able to develop a good fundamental understanding of dummy variables so you can apply that to more complex problems.

Thank you very much for watching. Please subscribe if you have not done so already, and I look forward to seeing you again in the next video.
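To make the n minus 1 coding rule from this video concrete, here is a minimal Python sketch. It assumes the last category in the list serves as the baseline (West, as in the four-region example). The helper name `dummy_code` is hypothetical and is not part of Minitab or any library.

```python
# Hypothetical helper: code one categorical value as n - 1 dummy variables.
# The last category in the list is treated as the baseline (all zeros).
REGIONS = ["North", "South", "East", "West"]  # West is the baseline


def dummy_code(value, categories=REGIONS):
    """Return a list of n - 1 zeros and ones for the given category."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # One dummy per category, except the baseline at the end of the list.
    return [1 if value == category else 0 for category in categories[:-1]]


for region in REGIONS:
    x1, x2, x3 = dummy_code(region)
    print(f"{region:<5}  X1={x1}  X2={x2}  X3={x3}")
```

The same helper covers the two-category school rating: passing a list such as `["Exemplary", "Not exemplary"]` yields a single dummy that is 1 for exemplary and 0 otherwise.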