Transcript for:
Understanding Linear Regression and Outliers

Okay, happy Friday everyone. We're going to pick up from where we left off last time: we were just about to calculate a p-value for our slope. So we're testing the null hypothesis that beta sub 1 is equal to 0. The reason we test this specific hypothesis is that if the slope really is 0, then the regression line is flat, and if it's a flat line you don't actually need a linear equation to describe the data — you could just report the mean of the y values, because your model is basically equivalent to that.

Now, because we already calculated the confidence interval for the slope, we know the p-value is going to be significant: notice that zero is not contained within that interval. But we'll go ahead and calculate it formally anyway. Similar to the confidence interval, the pattern is the same as what we've seen in the past. We have a test statistic, which we'll denote t, and we take our parameter estimate, subtract the null-hypothesized value of the population parameter, and divide by the standard error of the parameter estimate. In simpler terms, we take our sample slope, b sub 1, subtract beta sub 1 (which is 0 in this case), and divide that by the standard error of the slope, which we're denoting s sub b sub 1. Then we just plug in our values: the slope we calculated for this data, minus zero, divided by the standard error we calculated, and that gives us some value. Let's figure out what that value is.

We'll head over to the R script we were working in and rerun everything. We'll say "test statistic." Beta sub 1 here is our population parameter, and we're testing the hypothesis that it's equal to zero. Then we write out the formula for the test statistic: b1 minus beta1 — I think we stored the slope as b1, yeah, that's what we stored it as — and that gets divided by our standard error, s sub b sub 1. Run the t-stat... and we get an error. Oh, I didn't run beta1; that's what I didn't do. Okay, so we get a value of 18.99728, which for a test statistic is pretty large. Let's head back here and write that down: 18.997.

Now that we have our test statistic we can calculate the p-value. We can already tell it's going to be significant, because recall that our critical t is 1.96, whereas this is 18.9 and change — way off to the right of that. So it's going to be extremely significant, but we'll calculate it just the same. We'll say "p value," and this is a two-tailed test — that's important to remember. We're going to use the function pt here. Our test statistic is positive, and since this is a two-tailed test we need to calculate from right to left, so we give pt our tstat, give it the degrees of freedom, which we had stored as df, and say lower.tail is FALSE. And there's one other thing we need to do — does anyone know what it is? Yes, exactly: times two. It's a two-tailed test, so we multiply this by two; otherwise we're only looking at half the probability.
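As a rough sketch of what that calculation looks like in R — assuming, as in the class script, that the slope, its standard error, and the degrees of freedom were stored earlier as b1, sb1, and df (your variable names may differ slightly):

```r
# Manual t-test on the slope (null hypothesis: beta1 = 0).
# Assumes b1 (sample slope), sb1 (standard error of the slope), and df
# (residual degrees of freedom) were computed earlier in the script.
beta1 <- 0                      # null-hypothesized slope
tstat <- (b1 - beta1) / sb1     # test statistic
tstat                           # about 18.997 for this data

# Two-tailed p-value: area to the right of tstat, times two
pval <- 2 * pt(tstat, df, lower.tail = FALSE)
pval
```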
So you can see we get a nice number there in lovely scientific notation. This is obviously a very, very small value — take the decimal point, move it 69 places to the left, and you get something that is effectively zero. In a manuscript, you would just write something like p < 0.001. Okay, so our slope is significant, which corresponds with what we saw with the confidence interval. So it seems like we have a regression model that has some value to it — at the very least, there's some value in describing our data using a linear equation, because if this were not significant, we couldn't rule out that our slope is any different from zero.

And that's ordinary least squares in a nutshell. Specifically, everything we've done here is what's known as linear regression, and some people will refer to it as simple linear regression. The reason it gets called simple linear regression is that we're dealing with just one predictor. Regression can actually have more than one predictor, which isn't something we're going to worry about in this course, but the logic you employ is the same — nothing really changes all that much.

Now, what I think we should do real quick is run these same calculations — everything we've done throughout these notes — in R, and see how many mistakes we've made. Sorry, question? Oh, so this is the quick and dirty way of doing it in R. What we've done so far is basically manual calculation; obviously we've used certain functions to make our life a little easier, but on the whole it's a manual way of calculating all of this. R has a nice, convenient, quick and dirty way, so we'll label this section "R cheat codes." I've explained this before, but I'll do it again. What we can do is store our little linear model as an object, and we can call it whatever we want — we could call it model, we could call it Bob. We'll call it Bob, just to emphasize that you can name your variables whatever you want. So we'll call this Bob, and we're going to use the function lm. Inside that function we just write a little formula: we have to give it our predicted variable and our predictor variable. The thing we're trying to predict is the height of the sons — recall our data has the father heights and the son heights — so we give it the son column, then we type a little tilde, and then we say father. Then you can just tell it what your data frame is named; in this case the data frame is named heights, so we type heights. Alternatively, there's another way you could do this: you could
give it the columns directly with the dollar sign notation — something like heights$son and heights$father — but that requires a bit more typing and it's a little harder to see what's going on, so I prefer this method. So we'll say data = heights and run that. Now you can see we can access Bob. If we run Bob, what you get is our coefficients: the y-intercept and the slope. Now, what's kind of cool about R is that we can take Bob and insert him into a function. There's a function called summary, so we'll just put Bob in the summary function, and if we run this, what you get is a bunch of fancy information. Let me expand my window here so you can read this better. There we go. What you can see is we get a whole host of information about our regression model. This output is really important, especially as you get into more advanced stats — when you start doing things like multiple linear regression and ANOVAs, this is the kind of output you'll be working with. So let's go through it, because we've basically dealt with everything that's in it already.

Right at the top, you get literally the model formula we typed; nothing terribly useful there. In the next row you can see information about our residuals. With this data set we have around a thousand residuals, and here you can see the median of the residuals and the quartiles of the residuals. With residuals, you want them centered around zero, so ideally you want this median value to be close to zero, and you can see that it is. The reason we're looking at the median and the quartiles here, as opposed to the mean and the standard deviation of the residuals, is that the median is robust to non-normality. If your residuals are not normally distributed, this is still a good statistic to look at; it gives you a good picture of where the center of the values is, and ideally you want that center around zero.

Next we have this table of coefficients, which is a really important table. In the first column, the Estimate column, you can see the y-intercept and the slope, respectively, and we also have the standard error of the y-intercept and the standard error of the slope. Now, did we actually calculate the standard error of the y-intercept? I can't remember... yes, we did. And there's the slope's standard error. You can see these numbers correspond to what's in our table here, which is always a good thing. Then we have the t value. Recall we did a t-test on the slope, and you can see it has a value of 19 there, which is a bit of a rounded value, because we had calculated it to be 18.997.
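Here's a minimal sketch of that shortcut, assuming (as in class) a data frame named heights with columns father and son:

```r
# Quick-and-dirty linear regression in R.
# Assumes a data frame called heights with columns father and son.
Bob <- lm(son ~ father, data = heights)  # fit the model; the object name is arbitrary

Bob            # prints just the coefficients: intercept and slope
summary(Bob)   # full output: residual quartiles, coefficient table with
               # standard errors, t values, p-values, R-squared, etc.
```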
But if you want to look at a more precise version of these numbers, there is a way to do it. You can take summary(Bob), put it inside the print function, and add the digits argument — set it to something like 10, for instance — and it'll show you the values to about ten digits of precision. So now we have a test statistic that looks more like what we actually calculated, because it's not as rounded. Anyway, that's a thing you can do, but we'll just look at the regular version here.

So yeah, that's the test statistic we calculated, and there's the p-value — there's a bit of rounding there, so it's showing you a slightly simplified version of the p-value, because it was so small it was effectively zero. And you can see these nifty little asterisks, which are just a convenient shorthand for how significant the p-value is. We have three asterisks, so in essence our p-value is effectively zero. Yes. Correct. We did calculate the confidence interval on the intercept, but we didn't actually need to do that. The statistics for your intercept are not typically something you care about — sometimes they are, but in this situation, and in most situations, you don't. There's no obvious hypothesis you'd want to test for the intercept. The p-value we're looking at here is testing whether the intercept is equal to zero, but there's no real reason to test that; R reports it out of convenience, but it's not really a statistic we care about. The slope is what matters — testing the slope.

Okay, so that's the table of coefficients with the p-values and such. Next we have a bunch of information at the bottom. You'll notice we have the residual standard error. We calculated this — if we go back to our notes, we called it the standard error of the residuals; same thing. You can see we get a value of 2.4381, which is basically what we have here, just a slightly rounded version. You can see the degrees of freedom, which should match what we calculated, and they do. And we have something here called the multiple R-squared. This is just R-squared, as we've been calculating it. We didn't actually calculate the correlation coefficient for this data, but if you had calculated the correlation and squared it, this is the value you would get. The reason it's called "multiple R-squared" here is that this output is designed for multiple linear regression. The adjusted R-squared isn't something you have to worry about — that's for multiple linear regression — and the F statistic here also doesn't really apply to what we're doing.

Lastly, don't forget that you can easily obtain confidence intervals for the intercept and the slope: all you need to do is type the function confint, insert Bob, and look at that — you now have 95% confidence intervals.
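A quick sketch of those two extras, again assuming the fitted model is stored as Bob:

```r
# Show the summary output with more digits of precision (less rounding)
print(summary(Bob), digits = 10)

# 95% confidence intervals for the intercept and the slope
confint(Bob)
```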
The last thing that I do want to show you guys is our model object, Bob. Bob contains a lot of useful information. If you type Bob and then a dollar sign, you can see a little window pops up giving you a whole bunch of options. If you type coefficients — you guys know this bit — you get the coefficients, and of course you can index those, so if you type 2 there you'll get the slope. Bob contains some other useful information too: Bob will give you the residuals, and you'll find that these match the residuals we had stored in the data frame when we calculated them ourselves. Similarly, the predicted y values — the y-hat values — Bob has access to those as well. Where is it... fitted.values, that's what it's called. So those are all of your y-hat values. Pretty nifty.

Okay, so there's lots of useful information in there that can save you a lot of time and effort. For your homework assignments — I don't think I've written them up yet — I usually have you go through the long way of calculating a lot of this stuff, but just be aware that these shortcuts exist, so feel free to use Bob as you see fit. Before I move on, are there any questions about linear regression? No? You all understand this perfectly? Good.

So what we're going to do now is run through another regression example, but with some slightly nastier, maybe more realistic data. What that means is we need to talk a bit about outliers. I'm going to make a new R script here, and we'll head over to our notes, make a new note page, call it "outliers," and format it. Okay, so to illustrate outliers and the problem with outliers, I actually have a different regression example for you. On Canvas there's a data file called ldoi.csv, and it contains data from a study that was trying to predict reading ability from a measurement that involved digit naming speed. Don't worry, the specifics of the study aren't too important for us; what you need to know is that there are two variables, a predictor variable and a response variable.

Speaking of that data — hopefully I have it on this computer. Do I? It should be on Canvas regardless. Okay, so I'm going to take the data — it's just called ldoi.csv — and put it in the data folder we've been keeping all of our data in. Then we'll load this data into R; close this, we don't need that. And let me illustrate something I've talked about in the past: if you click the little Environment tab, you can see we have a whole bunch of stuff in there. This is just showing us all of the variables stored in memory — but these are all variables that apply to the father and son height data, so we don't need Bob anymore. We can erase Bob: you can just hit the little broom button up there and it clears out Bob and everything associated with Bob. Also, when you close RStudio, it can wipe those variables as well. So we'll load the tidyverse, and we'll store the data in an object we'll call reading, using read_csv. The file is called ldoi — that's the name of the author of the study this data comes from — and since we put it in the data folder, we need to make sure to tell read_csv that it's in the data folder.
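A minimal sketch of that loading step — the file name comes from the class Canvas page, and the data/ folder path is just the convention used in this course, so adjust it to wherever you saved the file:

```r
# Load the tidyverse (for read_csv and ggplot2) and read in the outlier example data.
# Assumes ldoi.csv was saved into a "data" folder inside the project directory.
library(tidyverse)

reading <- read_csv("data/ldoi.csv")
reading   # quick look: three columns, including the ones described as dns and word_ident
```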
Okay, so if we take a look at this data, you can see it's just got three columns — pretty simple. DNS stands for digit naming speed, and word_ident is just short for word identification. We'll write a little note here: the predictor variable is dns, a measure of digit naming speed, which is just a boring little task that psychologists like to give people. Our response variable in this case is a measure of word identification, the word_ident column.

Okay, so that's the data. Let's go ahead and plot it. We'll say ggplot, give it the data frame, which is reading in this case, then aes. On the x-axis we want to put our predictor variable, the dns column, and on the y-axis we put our response variable, the word_ident column, plus geom_point, and we'll make the size of the points a little larger, like 3. Okay, zoom out here. You can see here's our data: we have a whole bunch of values over here and then a smattering of values over here. Assuming these aren't measurement errors, let's go ahead and put a regression line on this, just to see what the regression line looks like. We can do this the quick and dirty way by having ggplot create the regression line for us: geom_smooth, method is the linear model, I'll turn the standard error ribbon off, and we'll set the line width to something like 2. Okay. So we have our regression line here.

And it's worth asking: is this how you would have predicted the regression line to turn out? Just looking at this data, does this regression line look correct? It doesn't really seem to move in the direction you would expect it to, right? It doesn't pass the sanity check. If I were drawing a regression line by hand, I'd probably draw one more like this... or something — I don't even know, actually. So clearly something's off here.

The problem we're running into with this graph is what's known as bad, or high, leverage points. Leverage is a concept in regression that refers to how much pull a value has over the regression line. When you get even a single value that is far away on the x-axis from the rest of the data, that value can have a disproportionate pull on your regression line. If you think of your regression line as a teeter-totter, on one end you have the data pulling it down one way, and on the other end you have this one outlier pulling it the other way — except this one outlier is basically an elephant compared to all the other data, which are just our little ants. Bad leverage points pull the regression model in a way that makes it unrepresentative of the data as a whole. Whenever you have a few extreme data points that are unrepresentative of the bulk of the data, these are what we refer to as outliers. And as we discussed back when we talked about the mean and the standard deviation, outliers can substantially affect your statistics and potentially make them a poor description of your data.
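For reference, a minimal sketch of the plot described above — the column names dns and word_ident are my reading of the description, so check them against the actual file:

```r
# Scatterplot of word identification against digit naming speed,
# with an ordinary least squares line overlaid.
# Assumes the data frame is called reading with columns dns and word_ident.
library(tidyverse)

ggplot(reading, aes(x = dns, y = word_ident)) +
  geom_point(size = 3) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 2)  # quick-and-dirty OLS line
  # (older ggplot2 versions use size = 2 instead of linewidth = 2)
```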
Now, one thing that's important to draw your attention to in this case is that the outliers we're looking at here are on the x-axis. The values are extreme in terms of digit naming speed, but they're not extreme in terms of word identification, which is on the y-axis. This is going to be an important point for how we deal with outliers later on, so it's important that you remember this aspect: in this example, the outliers are on the x-axis. I'm going to come back to that point later on.

Now, interestingly, statisticians actually have a way of quantifying the effect of outliers on a statistic, called the finite sample breakdown point. Don't worry, it's nowhere near as complicated as it sounds. Let's head over to our notes and define it: the finite sample breakdown point is the smallest proportion of n observations that can make a statistic arbitrarily large or small. As an example, the mean — which we all know and love — has a finite sample breakdown point of 1/n, one divided by n. The median, by contrast, has a finite sample breakdown point of 0.5. The mean's breakdown point of 1/n means that a single extreme value in your data set is enough to drag the mean arbitrarily far. The median's breakdown point of 0.5 is actually the highest breakdown point you can have, and it means that half of the data would need to consist of extreme values before the median can be pulled around like that. Ordinary least squares regression, it turns out, has the same breakdown point as the mean. In other words, with regression it only takes a single outlier to mess with your results, which kind of sucks.

Now, to deal with outliers we first need to identify them, and there are numerous statistical methods out there for detecting them. One method you'll see researchers use quite a lot — it's not really a method per se — is to simply eyeball the data and see if something looks extreme. To be honest, this works quite well for really obvious outliers, but in most cases it's not obvious what is and isn't extreme, which makes the whole process pretty subjective and prone to whatever biases the researcher might have. So we don't want to use that method.

Another common method you'll see used a lot by researchers is what we might call the two standard deviation rule, which is intended to be objective but is actually a pretty horrible rule. I'll describe it for you so you know not to use it in the future. The rule is simple: you declare any value in your data an outlier if it is more than two standard deviations from the mean. The problem with this method is hopefully obvious to you: one of the main reasons for flagging outliers in the first place is that they inflate your standard deviation and shift your mean. So if your rule is that anything two standard deviations away from the mean is declared an outlier, you're literally using the flawed statistics to try to identify the flaw in those statistics. It's like trying to inflate a balloon by letting the air out — it doesn't make any sense why you would attempt to do that.
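Circling back to the breakdown-point idea for a second, here's a tiny sketch with made-up numbers (not the class data) showing how a single extreme value drags the mean but barely moves the median:

```r
# One extreme value is enough to move the mean (breakdown point 1/n),
# but the median (breakdown point 0.5) barely budges.
# These numbers are made up purely for illustration.
x_clean   <- c(2, 3, 3, 2, 4)
x_outlier <- c(2, 3, 3, 2, 400)   # same data, but one value replaced by an extreme one

mean(x_clean);   mean(x_outlier)    # mean jumps from 2.8 to 82
median(x_clean); median(x_outlier)  # median stays at 3 in both cases
```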
So as an example, let's suppose this is your data set. Let's say we have x, and x is 2, 1, 3, 3, 2, 15, and 30. Okay, suppose this is your data. Now, 15 and 30 here are obviously both outliers. But if you use the two standard deviation rule, what you'll find is that only 30 ends up getting flagged as an outlier — and you can check this just by converting these values to z-scores. Only 30 gets flagged with the two standard deviation rule because both 15 and 30 are influencing the mean and the standard deviation. This is a problem in outlier detection research known as masking: the presence of outliers in your data actually causes those outliers to be overlooked. Let's write down that definition.

A much better method for detecting outliers is what we're going to call the MAD-median rule. But before we can use this rule to detect outliers, we first need to learn a brand new statistic we've not talked about in this course: the MAD statistic. MAD stands for median absolute deviation, so we'll write that down — median absolute deviation. The MAD is really just a measure of spread, in the same way that the standard deviation is a measure of spread, except that it's a robust measure of spread, meaning outliers aren't going to affect it all that much. So we'll say: a robust measure of spread — because it's based around the median, which we know has the highest finite sample breakdown point. So let's walk through how you calculate the MAD statistic, and luckily everything we need is literally in its name.

Step one: calculate the median deviations. The first step is to calculate deviation scores, like we did with the standard deviation, except that instead of taking deviations from the mean, we take deviations from the median. So we need to know what the median is: the median of the values we have here is 3. Let's write that down — the median is 3. What that means is we take each of our values and subtract the median: 2 minus 3 is negative 1; 1 minus 3 is negative 2; 3 minus 3 is 0; 3 minus 3 again is 0; 2 minus 3 is negative 1; and then 15 minus 3 and 30 minus 3, which are 12 and 27. Okay, so that's our first step: get the median deviations.

Based on the name, what do you guys think the next step is? Anyone? Oh, I didn't see hands raised. Sure. Yep — next, we take the absolute value. If you look at the deviations we've calculated, some are positive and some are negative. This happened with the standard deviation as well; there we squared the values, but here we take the absolute value. Recall that the absolute value of a number is just the distance that number is from zero, so if you have a negative value, all you need to do is make it positive — that's literally all that's required. The way absolute value gets notated is with vertical bars: if you write vertical bars around 2 minus 3, that indicates you're taking the absolute value of 2 minus 3. So the absolute value of negative 1 is just 1, the absolute value of negative 2 is 2, and we just continue this on.
So yeah, anytime you see vertical bars like this instead of regular parentheses, it just means you're taking the absolute value. So our absolute deviations are 1, 2, 0, 0, 1, 12, and 27. Okay, let's write this step down: step two, take the absolute values.

Anyone want to guess what step three is? It's all in the name. Nope? Let's see: we've calculated deviations and we've calculated absolute values — that's the D and the A. What's the M? Yes — we calculate the median of all of these values; that's how we get the median absolute deviation. So step three: calculate the median. If we take our absolute deviations and put them in order — which is how you calculate a median, right — what you end up with is 0, 0, 1, 1, 2, 12, 27. Then we just have to figure out what's in the middle. We have seven values, so the middle one is the fourth value, which is 1. So our MAD statistic is 1. That's the median absolute deviation.

Now, to use this number to detect outliers, we need to convert it into a variant that works with a normal distribution, which we'll call the MADN statistic. So we'll say: MADN statistic. Luckily, it's very easy to do this: all you need to do is divide your MAD by a particular decimal value. The MADN is just equal to the MAD statistic divided by 0.6745. In our case, that means we end up with a value of 1 divided by 0.6745, which is 1.4826.

Now we can take this MADN statistic and use it to spot outliers, and the way we do that is by creating a little rule — an outlier definition, we'll say. (Yes? You don't want to know — just accept it. That's the simplest answer. If you want, I can point you in the direction of some statistical proofs and such, but it's not important for our course.) We're going to define an outlier in the following way: x here is any value, and if you take that value, subtract the median — which we'll just write as "med" — take the absolute value of that, and divide it by your MADN statistic, then if that little equation comes out greater than 2.24, we declare x an outlier. You're probably wondering where this 2.24 comes from. It's a special number known as Hampel's identifier — not really something you need to dig into, but you do need it to apply the rule. The cutoff is always going to be 2.24, and you're always going to divide the MAD by 0.6745; those numbers aren't ever going to change.

So let's put this to work and see if, for instance, 15 here is an outlier. Actually, let's do 30 first, because we know 30 quite obviously is an outlier. We'll use a different color — red or something. So what you do is you take 30 minus the median, which is 3.
We take the absolute value of that, divide it by our MADN statistic, which is 1.4826, and that gives us a value of 18.21. And the question is: is 18.21 greater than 2.24? What do you guys think? If you say no, a little part of me will probably just die... Yes, 18.21 is obviously greater than 2.24, so we would flag that as an outlier — 30 we would flag as an outlier. Let's try 15 here: 15 minus 3, take the absolute value of that, divide by 1.4826, and that gives us a value of 8.09, which is also greater than 2.24. So according to this rule, both 15 and 30 are declared outliers.

So what we're going to do next is go through how you can actually use this rule in R, and we're going to see if we can spot any outliers in that ldoi data. But there's only a minute, maybe two minutes left, so I guess we'll have to leave that for Monday.
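Since we ran out of time, here's a rough preview sketch of what that R version might look like, using the small example data from the notes; the helper function name is_outlier is just an illustrative name, not something from class:

```r
# A sketch of the MAD-median rule on the small example from the notes.
x <- c(2, 1, 3, 3, 2, 15, 30)

# The two standard deviation rule, for comparison: only 30 gets flagged,
# because 15 and 30 inflate the mean and standard deviation (masking).
abs(x - mean(x)) / sd(x) > 2

# MAD-median rule: |x - median| / MADN > 2.24, where MADN = MAD / 0.6745
is_outlier <- function(x) {
  med  <- median(x)
  madn <- median(abs(x - med)) / 0.6745   # MAD, rescaled to work with a normal distribution
  abs(x - med) / madn > 2.24              # Hampel's identifier as the cutoff
}
is_outlier(x)   # flags both 15 and 30
```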