So far we've talked about measures of center and measures of variation. What these gave us, for the entire data set, was one number that represented the center or one number that represented how spread out the data was. One of the main objectives of this part of the lesson, measures of relative standing, is to compare a single value to everything else: if we choose one of our data points from the entire data set, how does that data point compare to the rest of the data set?
One of the benefits of this section is that we'll be able to compare two data points that come from different scales. So consider this example. SAT scores have a mean of 1518 and a standard deviation of 325. Scores on the ACT test have a mean of 21.1 and a standard deviation of 4.8. So these are tests that are on radically different scales: the mean is 1518 for the SAT versus 21.1 for the ACT.
We obviously couldn't just take two scores, one from the SAT and one from the ACT, and compare them directly, because the SAT score is almost guaranteed to be the higher number, but that doesn't mean it's necessarily a better score, because it's on a different scale. So we'll come back to that problem in a moment. We first need to talk about this idea of a z-score. For any data value, the z-score tells you how many standard deviations that value is above or below the mean. So if we know the mean and standard deviation of some data set and we want to calculate the z-score for one particular value, the z-score for a sample is calculated by taking that data value, subtracting the mean, and then dividing by the standard deviation of the set.
This resulting z-score will tell us how many standard deviations our data point is above or below the mean. For a population, it's the same formula. We just, of course, use the population symbols.
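Written out in symbols, the two formulas look like this. The symbols are the conventional ones and are an assumption here, since the slide itself is not part of this transcript: x-bar and s for the sample mean and standard deviation, mu and sigma for the population mean and standard deviation.

```latex
z = \frac{x - \bar{x}}{s} \quad \text{(sample)}
\qquad
z = \frac{x - \mu}{\sigma} \quad \text{(population)}
```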
So let's calculate the z-score for the number 10 in the following set of data. So we want this number's z-score, and that will tell us how many standard deviations above or below the mean 10 is. So the first thing we have to do in order to calculate the z-score is to calculate the mean and the standard deviation of the entire data set. The mean, of course, is just the data values added together, divided by 5, and I get 12.4. And for the standard deviation, of course, we have that formula, or by now you can use Excel to calculate it.
I get 3.05. So to calculate the z-score for 10, I simply take the number 10 that I'm calculating the z-score for, minus the mean 12.4, divided by the standard deviation 3.05. That gets me negative 0.79. So my z-score is negative 0.79, and if I were asked how many standard deviations 10 is from the mean, I would say it is 0.79 standard deviations from the mean. It happens to be below the mean because the z-score is negative.
But if I just want to know how many standard deviations from the mean I am, it's always going to be the positive version of my z-score, that is, its absolute value. So what the z-score will let us do is compare values that are on different scales. So again, looking at this SAT and ACT example, now I can compare a score of 1030 on the SAT with a score of 14.0 on the ACT. First of all, I'll calculate the z-score for 1030 on the SAT.
To calculate that, I simply take the number 1030 minus the mean of SAT scores, 1518, divided by the standard deviation of SAT scores, 325, and that gets me a z-score of negative 1.50. Now let me calculate the z-score for 14 on the ACT.
So I'll take 14 minus the mean of ACT scores, 21.1, divided by the standard deviation of ACT scores, 4.8. And that gets me negative 1.48. Now you have to be careful when you're comparing these.
You have to think about which of these is the higher z-score. And of course, with negative numbers, negative 1.48 is higher, or further to the right on the number line, than negative 1.50. So the better score here would be the ACT score.
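As a quick check of that comparison, here is a minimal Python sketch. The function name z_score is just an illustrative choice, not something from the lesson; the means, standard deviations, and scores are the ones quoted above.

```python
def z_score(x, mean, sd):
    """Return how many standard deviations x is above (+) or below (-) the mean."""
    return (x - mean) / sd

# Values quoted in the lesson.
sat_z = z_score(1030, mean=1518, sd=325)
act_z = z_score(14.0, mean=21.1, sd=4.8)

# The higher z-score (further right on the number line) is the better relative standing.
better = "ACT" if act_z > sat_z else "SAT"

print(round(sat_z, 2), round(act_z, 2), better)  # -1.5 -1.48 ACT
```

The same helper reproduces the earlier example: z_score(10, 12.4, 3.05) rounds to negative 0.79.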
So if I were taking both of these tests, and I had to choose one score to send off to colleges, I would send my ACT score.
Usual versus unusual values. This is similar to what we've talked about in previous lessons about usual and unusual values.
If you keep in mind that the z-score tells you how many standard deviations you are from the mean, in the previous lessons we said that the maximum usual value should be the mean plus two standard deviations, and the minimum usual value the mean minus two standard deviations. So if the z-score is the number of standard deviations from the mean, then usual values should have z-scores between negative 2 and 2, that is, higher than negative 2 and less than positive 2. Those will be our usual z-scores. An unusual z-score would be anything greater than 2 or less than negative 2, because that value would be more than 2 standard deviations from the mean. Just using intuition, which of the following could possibly have a z-score less than negative 2?
Well, of course, it's got to be the smallest number here, 5, because it is so far out of the range of the rest of the data. It's so much smaller. So it's the only one that could possibly have a z-score that is less than negative 2. Similarly, if we want to look at which of these data could possibly have a z-score that's greater than 2, 1239 is the only one that is so much larger than the rest of the data. That's our best candidate for having a z-score greater than 2. For both data sets, it would be good practice to double-check by actually calculating: in the last data set, that 5 has a z-score less than negative 2, and in this one, that 1239 has a z-score greater than positive 2. Somewhat interestingly, z-scores have no units.
What that means is that we can compare two z-scores without having to worry about what scale the original data was on.
Our next measure of relative standing is something called a percentile. Percentiles are a measure of location based on dividing a set of data into 100 groups with about 1% of the values in each group. We use P sub 10 to denote the 10th percentile, which has about 10% of the data below it.
And we use P sub 50 to denote the 50th percentile, which has about 50% of all the data below it. This is what we called the median before. To calculate the percentile of some value in your data, you take the number of values that are less than that value, divide it by the total number of values that you have, and then multiply by 100 to get a percentage. So let's look at an example.
Suppose we wanted to find the percentile for the value 41. As we go through these calculations, it's important to realize that if you look at other books, there will be slightly different ways to calculate percentiles. So stick with these methods for the homework. So first of all, I'm going to count the number of data values that are less than 41. And it's important to realize I've ordered my data from lowest to highest. If you haven't ordered your data, then who knows where 41 would fall, and counting up all the values that come before it wouldn't make any sense.
So be sure to order your data before you calculate percentiles. So I've got five numbers that are less than 41 in my data set. I've got 28 total numbers in my data set. So I'm going to take 5 divided by 28 and then multiply by 100, which gives me 17.86%.
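The arithmetic here, together with the round-up step that follows, can be sketched in Python. This is only a sketch of the counting rule used in this lesson (other books use slightly different rules). The function names and the short list of scores are made up for illustration and are not the lesson's 28 values.

```python
from bisect import bisect_left

def percentile_from_counts(below, n):
    """Percent of values below, rounded up to the next whole number."""
    # Integer ceiling division: same as ceil(100 * below / n), without floats.
    return (100 * below + n - 1) // n

def percentile_of(values, x):
    """Percentile of x, counting the values strictly less than x."""
    ordered = sorted(values)           # always order the data first
    below = bisect_left(ordered, x)    # number of values < x
    return percentile_from_counts(below, len(ordered))

# Counts from the lesson: 5 of the 28 values are below 41.
print(percentile_from_counts(5, 28))   # 18

# Hypothetical data, only to show the list-based helper.
scores = [20, 25, 30, 35, 40, 41, 50, 60, 70, 89]
print(percentile_of(scores, 41))       # 50
```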
Now I want to round that up to the next whole percentage, so that means 41 is in the 18th percentile. Find the percentile of the value 89. So in this case, I'm going to count up all the numbers less than 89, and since 89 is the largest of the 28 data points, 27 of them must be less than 89. So I'll have 27 divided by 28, times 100, which gives me 96.43, and I'm going to round that up to 97. So that means 89 is in the 97th percentile.
Let's go the other way. Now suppose we're given the percentile and we want to find what data value is at that percentile. So in this case, let's find the value of the 80th percentile, or P sub 80. To do that, I'm going to find the location of the data value first.
And then once I have the location, I can go to that location and find the actual data value I'm looking for. The location L is k divided by 100, times n. So n, of course, is the number of data values. k is the percentile that we're looking for.
So in this case, since I'm looking for the 80th percentile, my k will be 80. And L will tell me the location of the value that I'm looking for. So a couple of notes. If L happens to be a whole number, we're going to find the mean of the values in the Lth and the (L + 1)th positions. Otherwise, we're going to round L up to the next integer, and that will be our location.
So here's some data. We're going to find P sub 25, or the 25th percentile. So first of all, let's find the location of the 25th percentile. k, of course, is 25, and 25 divided by 100 times the number of data points I have, 28, gives me 7. So that means I have to go to the seventh spot, which is 42. And since L happened to be a whole number, I'm going to average the values in the seventh and the eighth spots.
So that means the 25th percentile is 45. Looking at another example, let's find the 20th percentile. So again, we'll start by finding the location. 20 divided by 100 times 28 gives me 5.6. So we're going to round up to 6, and the value in the sixth spot is 41. That means the 20th percentile is 41. So again, make sure that your data is ordered.
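The location rule used in these two examples can be sketched in Python as follows. This is a sketch of this lesson's method only, under a few assumptions: k is a whole number from 1 to 99, the function name percentile_value is invented for illustration, and the data is a stand-in list of the numbers 1 through 28 (not the lesson's data), chosen so that the locations match the ones found above.

```python
def percentile_value(values, k):
    """Value at the k-th percentile, using the location L = (k / 100) * n."""
    if not 0 < k < 100:
        raise ValueError("k must be a whole number from 1 to 99")
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)               # the data must be ordered first
    n = len(ordered)
    # L = whole + remainder / 100, kept in integers to avoid float rounding.
    whole, remainder = divmod(k * n, 100)
    if remainder == 0:
        # L is a whole number: average positions L and L + 1 (counting from 1).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round L up: position whole + 1, which is index `whole`.
    return ordered[whole]

data = list(range(1, 29))                  # hypothetical stand-in, 28 values
print(percentile_value(data, 25))          # L = 7   -> average of 7th and 8th -> 7.5
print(percentile_value(data, 20))          # L = 5.6 -> 6th position -> 6
```

Because the stand-in values equal their own positions, the outputs show directly which positions the rule picks.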
Quartiles are very similar to percentiles, except instead of having a hundred different percentiles, we're going to have only three quartiles. They divide your data set into four groups with about 25% of the values in each group. So your first quartile will have 25% to the left and 75% of your data to the right.
Your second quartile will have about 50% of the data to the left and 50% to the right. And your third quartile will have about 75% of the data to the left and 25% of the data to the right. So finding Q1 happens to be the same thing as finding P sub 25. And finding Q2 is the same thing as finding P sub 50. And finding Q3 is the same thing as finding P sub 75. So we'll use the same location formula, where, if we wanted to find Q1, k will be 25. So find Q3 in the following table of data. Since we're talking about Q3, which is the same thing as P sub 75, the k in our location formula will be 75. So L is 75 divided by 100 times 28, which comes out to be 21. That's a whole number, so my percentile will be the average of the values in the 21st and 22nd positions.
Those values are 70 and 72, so my 75th percentile, Q3, is 71.
So the box and whisker plot is a visual representation of Q1, Q2, and Q3, along with the minimum and maximum values. These five pieces of information are called the five-number summary. And drawing it out on a number line, you get a box with some whiskers off of it. The far line to the left represents the minimum value. The far line to the right represents the maximum value.
The left end of the box is Q1, or the 25th percentile. The right end of the box is Q3, or the 75th percentile, and this middle line here is Q2, or the 50th percentile. Now, one thing we can get from a box and whisker plot is some kind of summary of how our data is distributed and where the center is.
So if you look at this particular box and whisker plot and compare the 50% of the data that's to the left of 40 here with the 50% of the data to the right, we can see that to the left of 40, our data is all scrunched within a width of 20, between 20 and 40. Whereas to the right of 40, our data is more spread out, between 40 and 95.
So let's look at an example, creating a box and whisker plot for this set of data. Assuming this is already ordered, and you can verify that, the minimum is 20, Q1 or P25 is 45, and you can verify that by plugging 25 into our location formula. Q2 or P50 is 60, Q3 or P75 is 71, which we calculated a moment ago, and the maximum data value is 89. So, throwing this onto a box and whisker plot above a number line, my left whisker starts at 20, the left end of my box is 45, the middle line of my box is 60, the right end of my box is 71, and my right whisker ends at 89.
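Putting the pieces together, the five-number summary behind a box and whisker plot can be sketched in Python. As before, this is only a sketch of this lesson's location rule: the function names are invented for illustration, and the data is a hypothetical stand-in (the numbers 1 through 28), not the lesson's data set.

```python
def percentile_value(ordered, k):
    """k-th percentile of already-sorted data, using L = (k / 100) * n."""
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        # Whole-number location: average positions L and L + 1 (counting from 1).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round the location up to the next position.
    return ordered[whole]

def five_number_summary(values):
    """Minimum, Q1, Q2, Q3, and maximum, with Q1 = P25, Q2 = P50, Q3 = P75."""
    ordered = sorted(values)
    return {
        "min": ordered[0],
        "Q1": percentile_value(ordered, 25),
        "Q2": percentile_value(ordered, 50),
        "Q3": percentile_value(ordered, 75),
        "max": ordered[-1],
    }

data = list(range(1, 29))   # hypothetical stand-in, 28 values
summary = five_number_summary(data)
print(summary)  # {'min': 1, 'Q1': 7.5, 'Q2': 14.5, 'Q3': 21.5, 'max': 28}
```

The whiskers run from "min" to "max", the box runs from "Q1" to "Q3", and the line inside the box sits at "Q2".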