Transcript for:
Five-Number Summary and Outliers

What's the five-number summary? It's exactly what it sounds like; it's going to be five numbers consisting of the minimum, Q1, median, Q3, and max. All right, so let's go back to our example of those six quiz scores: 28, 24, 27, 30, 19, and 20. I want you guys to take a quick moment and enter this into List 1 of your calculator. All right, now let's go back and remember how do we find this five-number summary? Let's go back and remember we're going to go to the "Stat" button, go to the "Calc" option, and choose "1-VarStat". So let's do that together. All right, go to the "Stat" button, go to the "Calc" option, the middle column, and choose that first option of "1-VarStat". Keep in mind you're going to need to type in your List 1, so you'll have to hit "2nd" "1" to make sure that's in the list, and we hit calculate. And what I want you guys to see is ultimately at the top of your list you'll see things like the mean and the standard deviation, but if you scroll all the way down, all the way down, I want you to see that it will in fact give us the min, Q1, the median, Q3, and the max. All right, and so I just wanted you guys to note that is how we ultimately find that five-number summary. So in this case, it's 19, 20, 25.5, 28, and 30. Just wanted to do a quick example for you guys on how to practically find that five-number summary on your graphing calculator.
Why do I want to find this five-number summary? Remember that in section 3.2 we were looking at symmetric data, and we used the empirical rule, which applied to symmetric graphs, to develop the z-score to identify when a piece of data was unusual. Now the thing is that the concept of z-score, that concept of unusual, is only going to make sense when your graph of data is ultimately symmetric.
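As a cross-check on the calculator result for the six quiz scores, here is a minimal Python sketch of the same five-number summary. It assumes the quartiles are the medians of the lower and upper halves of the sorted data, which matches the 19, 20, 25.5, 28, 30 found above; the function name `five_number_summary` is made up for illustration, and other software may use a different quartile rule and give slightly different Q1 and Q3.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list with at least two values.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data, leaving out the middle value when the count is odd.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    # Skip the middle value when there is an odd number of data points.
    upper_half = values[half:] if n % 2 == 0 else values[half + 1:]
    return (values[0], median(lower_half), median(values),
            median(upper_half), values[-1])


quiz_scores = [28, 24, 27, 30, 19, 20]
print(five_number_summary(quiz_scores))  # (19, 20, 25.5, 28, 30)
```

Sorting first is what lets the sketch split the data into a lower half and an upper half, the same bottom 50% and top 50% idea that Q1 and Q3 are built on.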
Now what we're going to focus on here in section 3.5 is what happens when your data's not symmetric, meaning what happens when we have skewed data, or what happens when we have symmetric data but then you have that one random outlier. How can we identify those outliers? And so here in section 3.5 we are going to develop a new line in the sand, a formula much like the z-score, to help us determine when we have an outlier.
All right, what we're going to do is ultimately find potential outliers using fences. I love that we use the term fences to help identify an outlier, because when we think of the word fences, I want you to think of fences like in a baseball park, where ultimately around the baseball field are fences. Sure, fences to keep fans out, fences to also keep the baseballs in, but what are the outliers? What are the outliers in baseball? The outliers are when players hit home runs and ultimately the ball goes beyond the fence. So do you guys see what I just did there? I made a little sports analogy where I'm trying to emphasize that what the fences are doing is showing us when a ball, when a data value, is beyond the fence.
And so what we need to do is first and foremost calculate that fence. And so here we go, how do we calculate that fence? The first thing we need to do is calculate the IQR. All right, so yes, we're going to use those two numbers that we found from the five-number summary, Q1 and Q3, and we are going to use this value of IQR to help us create those fences. So let's think about it, let's look at my first fence, the lower fence. Let's go back and remember, the point of the lower fence is to create a fence for the lower values so that any value falling below it is considered an outlier. So we're looking at the lower half of my data. I'm looking at the bottom 50%, I'm looking at below the median, and let's go back and remember Q1 is the middle of the lower 50%.
And so because of that, the lower fence will begin with Q1. We'll begin with Q1, and what we're going to do is subtract, subtract from Q1. Why? Because ultimately, let's go back and remember that if we are trying to look for values that are outliers, it means you are looking at a value that's even beyond Q1. But how much beyond Q1? We're going to take one and a half times the IQR. That's just standard, using that number 1.5. If you go into other fields of study, sometimes that 1.5 might change to 2, it might change to 1, but in this particular class, we will use 1.5. So again, the idea of subtracting from Q1 is that it emphasizes we're going even further to the left from the center of my data.
In the same way, the upper fence is going to start with Q3. Why? Because let's remember that if you split your data in half, the upper 50% of data has a middle value of Q3. So we're going to start with the value of Q3, but to ensure that we are going even higher than Q3, we are going to add to it, and add to it the same value of 1.5 times the IQR. Right, and so here's the idea. The idea then is you would have found some value that's the lower fence, found some value that's the upper fence, and the idea is that any value that's less than the lower fence, so any value that's below the lower fence, or any value that's greater than the upper fence, is then considered an outlier. So this is a perfect example of someone hitting that home run out of the park. That's the three-step process for finding potential outliers.
And so let's do a practical example now of finding those fences. Let's go back to our friend, the CO2 emissions per capita. Again, I helped calculate these for you guys in the past. What was Q1? 1.61. And Q3? I calculated 2.68. We ultimately found the median was 2.65, and you guys can probably identify the minimum and maximum. What is the minimum here? Perfect, 1.01. What about the maximum? Yeah, 6.81.
We're gonna keep these in the back of our pocket because these max and mins are going to help us after we find the fences. So we found the five-number summary here. All right, now again, if you're like, "Wait, how do we find this if you don't give it to me, Shannon?" Remember, we just did that at the top of the page when it came to using "1-VarStat". But let's just suppose we typed this into our calculator and 1-VarStat ultimately spat this out for us.
Step one is finding the IQR, meaning you're going to take Q3 and subtract Q1 from it. Why don't you guys do that really quick and tell me what is going to be the IQR value here? Yeah, 1.07. I love it, perfect.
And ultimately what we're going to do is we are now going to find the lower and upper fence. So the lower fence, again, you start with your value of Q1, so we're going to start with a value of 1.61, and again, we want to go lower than that. So I emphasize this because that means you want to subtract, okay? And subtract by how much? You're going to take 1.5 times that value of the IQR. 1.5 times that value of the IQR. So in this case, can you guys type that into your calculator and tell me what lower fence are you going to get? Perfect, 0.005. And so the idea here is if we draw this out on a number line, we have our median of 2.65, we have our Q1 of 1.61, and we now have this lower fence of 0.005. The idea here is we want to then ask, are there any data values smaller than this? And can you guys look at your data set, look at your minimum value, and tell me is there any value smaller than this fence? Is there any value smaller than this fence? No, there isn't. And honestly guys, that's where the minimum value was super helpful. The fact that my minimum value is still bigger than my lower fence is emphasizing there are no outliers beyond the lower fence, and you know what?
That's okay guys, sometimes when you are working with fences, sometimes there won't be an outlier beyond it, and that is okay. There are no values less than 0.005, so we have no outliers coming from this lower fence.
But we can also calculate the upper fence, the upper fence where you begin with Q3, you begin with Q3, and you add to it. Why? Because ultimately the idea of starting at Q3 is that you are looking for values that are beyond it, and that's the reason why we're going to take that Q3 value of 2.68 and add to it, add to it that same value in pink of the 1.5 times 1.07. So guys, we now know what the upper fence is, it's 4.285, and again, just like the baseball example, the idea is you want to look for values that go beyond, values that go beyond that. And so I want you guys to look at the data and tell me, do you see any data values that are greater than this upper fence? Yeah, all right, perfect, you guys are seeing one, the 6.81. Yeah, we can see that max value is definitely going to be beyond it. Anything else? Yeah, the 5.22. See, both of these numbers then are going to be considered outliers, why? Because they are both greater than that upper fence. See, the idea again of these fences is that it is trying to help us identify values that live beyond it, because the values that live beyond the fences are then potential outliers. So in this case, that 5.22 tons per person as well as the 6.81 tons per person are both examples of potential outliers.
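To recap the three-step process with the CO2 numbers, here is a small Python sketch. The names `fences` and `potential_outliers` are made up for illustration, the multiplier defaults to the 1.5 used in this class, and since the full data set is not reproduced in this transcript, the list only holds the three data values actually named above (1.01, 5.22, and 6.81).

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) using the k * IQR rule."""
    iqr = q3 - q1                       # step 1: IQR = Q3 - Q1
    return q1 - k * iqr, q3 + k * iqr   # steps 2 and 3: lower and upper fence


def potential_outliers(data, q1, q3, k=1.5):
    """Return the values that fall below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3, k)
    return [x for x in data if x < lower or x > upper]


# Q1 and Q3 from the CO2 emissions per capita example.
q1, q3 = 1.61, 2.68
lower, upper = fences(q1, q3)
print(round(lower, 3), round(upper, 3))  # 0.005 4.285

# Only the data values named in the transcript, not the full data set.
values_mentioned = [1.01, 5.22, 6.81]
print(potential_outliers(values_mentioned, q1, q3))  # [5.22, 6.81]
```

The minimum of 1.01 stays inside the lower fence, so nothing is flagged on the low side, while 5.22 and 6.81 both land beyond the upper fence of 4.285.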