2.5 Descriptive Statistics

Allow me to paint the scene. I'm finishing my Scrabble turn and boom, it's the greatest achievement of my entire life. I managed the word flapjack. An eight-letter word, potentially two triple word scores, and a 50-point bonus for using all of my letters. I was in the lead with 463 points. Just let me take a second to relive my victory because in my mind, I was the greatest Scrabbler of all time. The GSOAT, if you will. You won't. But as I look back, I'm wondering, is 463 typical? Is it good, let alone great? And how on earth did my nemesis Nigel manage a score of 720? Every number tells a story, including whether or not something is super unusual and maybe kind of special or just pretty average. Hi, I'm Sabrina Cruz, the greatest Scrabbler of all time, and this is Study Hall: Real World Statistics.
[Music]
Managing expectations is important in life, but it usually requires context. Like a difference of one point in a top-tier soccer game seems like a bigger deal than it is in an NBA basketball game because soccer games are typically lower scoring in general. You just sit there for 2 hours waiting for one goal. Sports are painful. To find out if my incredible Scrabble score was one for the record books, I need a few ways to measure data.
Outliers are data points that look extreme compared to the rest of the data points, like a 13-goal soccer game in a sport where even five is considered a lot, or my tuna casserole consumption. At least according to my doctor. Sometimes outliers just reflect a rare reality, but they can also reveal a mistake. For example, if a measurement device like a thermometer breaks, you might suddenly see a reading of 27 million degrees Fahrenheit in this room. That's like if we were hanging out at the sun's core. Or someone can make a typo when entering data. 4,000 goats in my backyard? It's more likely than you think. It's important to find these mistakes because they can really impact our understanding.
Like if we're trying to describe the data, measures like the mean and standard deviation can get really thrown off by outliers, which can make things confusing and lead to incorrect conclusions. Luckily, these outliers, well, stand out. So, we can usually catch them early if we visualize or summarize our data. And if we know they're mistakes or measurement errors, we can take them out of the data set.
But sometimes outliers don't happen by accident. They're actually correct measurements. Extreme things do happen sometimes, and they're really interesting when they do, changing how we think about what's possible in everything from sports to scientific research, like when the US women's soccer team beat Thailand by 13 goals, or how black holes interact with the fabric of spacetime.
So, to decide if I'm an outlier and a Scrabble prodigy, we can look at the distribution of other winning Scrabble scores and compare it to my amazing score. Look, it's me right there, the genius that gave you flapjack. And there's Nigel. Now, we know that visualizing data can go a long way, but it doesn't tell us everything. After all, what I think looks extreme might not match what another researcher or Scrabble player thinks. Breaking up data into segments is a useful way to standardize what counts as extreme and save you a lot of time as you try to track down the answers to pressing questions like whether I am the supreme ultimate reigning champion of Scrabble, which I'm pretty sure I am.
If you wanted to break data into segments to get a quick sense of how those points are spread out, you might want to split it into quarters. Now, the points where you make those splits are called quartiles. To find a quartile using a spreadsheet, you can usually click on an empty cell, then type an equal sign and the word QUARTILE in all capital letters. This may change depending on the spreadsheet you use, but you can then calculate the quartile by entering the cells that hold your data and which quartile you want.
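
For anyone following along outside a spreadsheet, here is a minimal Python sketch of the same idea. In Excel and Google Sheets the formula takes a data range and a quartile number, something like `=QUARTILE(A2:A10, 1)`, where the cell range is only a placeholder. The nine scores below are invented for illustration (only 463 and 720 are values mentioned in the episode), and `method="inclusive"` is an assumption, chosen because, as far as I know, it interpolates the same way the spreadsheet function does. Other quartile conventions can give slightly different answers.

```python
from statistics import quantiles

# Hypothetical winning scores, invented for this sketch.
# Only 463 and 720 are values mentioned in the episode.
scores = [350, 372, 388, 401, 420, 441, 457, 463, 720]

# n=4 asks for the three cut points that split the data into quarters.
# method="inclusive" treats the list as the whole data set (an assumption).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```

With these made-up scores, the three cut points land at 388, 420, and 457.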
Each quartile is a single number that marks one of those dividing points. The quartiles split the data into four equal parts, each holding 25% of the data, and they mark the points between these parts with three numbers: the first, second, and third quartiles, or Q1, Q2, and Q3. Like if the quartiles of winning Scrabble scores are close together, that means that there isn't a lot of variation in scores that end up winning, like in soccer. But if the quartiles are further apart, that means that there are many more scores that can lead to a win, like in basketball.
Now, our middle data point is called the median since it's literally the middle point when the values are lined up from lowest to highest. That means that 50% of the data lie to the left of it and 50% of the data lie to the right of it. So, the median isn't typically swayed by outliers, which appear at the beginning or end of the lineup. That's why the median is a good measure to use here, because we can count on it to stay about the same regardless of extreme values. But it's also the second quartile, too. Half of the data fall on either side of it and it marks the middle of the distribution.
If we look at the portions of the data to the left and right of the median separately, the medians of those subsets are also quartiles. So the median of the bottom half of the data is the first or lower quartile, sometimes labeled Q1 for short. And that means that 25% of the data fall below Q1. Similarly, the median of the top half of the data is the third or upper quartile, sometimes labeled Q3. Now 75% of the data lie below Q3. So if we take any two segments, we'd have 50% of all the points in our data set. Specifically, the middle 50% of the data lies between Q1 and Q3 with Q2 right in the middle. The width of this part of the data, or the difference between Q3 and Q1, is called the interquartile range, or IQR for short. Like the standard deviation, the IQR is another measure of a distribution's spread, or how far a typical value is from the center of the data.
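
The median-of-halves recipe and the IQR can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's official method: the function name `quartiles_by_halves` is made up, the scores are invented (only 463 and 720 are values from the episode), and when the number of scores is odd the sketch includes the median in both halves. Some textbooks leave it out instead, which can shift Q1 and Q3 a little.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) by taking the median of each half of the data.

    Assumption: with an odd number of values, the overall median is
    counted in both the lower and the upper half.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2           # size of each half
    lower = ordered[:half]        # smallest values
    upper = ordered[n - half:]    # largest values
    return median(lower), median(ordered), median(upper)

# Hypothetical winning scores, invented for this sketch.
scores = [350, 372, 388, 401, 420, 441, 457, 463, 720]

q1, q2, q3 = quartiles_by_halves(scores)
iqr = q3 - q1  # width of the middle 50% of the data

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")
# prints: Q1 = 388, Q2 = 420, Q3 = 457, IQR = 69
```

Notice that Nigel's 720 sits in the list but barely matters here: swapping it for 7,200 would leave all three quartiles and the IQR unchanged.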
We would expect a bunch of teams' final scores in basketball to be more spread out than in soccer. So, basketball would have a bigger IQR, and that expectation impacts how we interpret any one score. But the IQR has a superpower. Like the median, it's a measure that's not easily affected by outliers. This means it won't be thrown off by that one 13-goal soccer game.
But let's not forget why we're learning about all of these measures in the first place. We want to know if I'm a special Scrabble player. Whoa, looks like I'm above the third quartile. That means that my score is higher than at least 75% of other winning Scrabble scores. But 75% is not 100%. And that's where I want to be. These quartiles were helpful in giving me a rough sense of where I fall in the winning Scrabble score distribution. But I don't want to just be really good. I want to be an outlier and ahead of Nigel.
Scrabble scores are quantitative, so we've been looking at a histogram of them to see what's going on with my score, but it would be nice if we had some reference values when visualizing a data distribution. Say we want to quickly know if a particular data point is above or below the midpoint. A histogram doesn't make that question easy to answer. However, a box plot is a type of graph that is actually built using quartiles. So it makes it simple for us to make a quick comparison. It's also sometimes called a box-and-whisker plot. And you'll start to see why as we build the graph from our summary values.
To form the box part of the plot, we place three lines at Q1, Q2, and Q3. Q1 and Q3 mark the boundaries of the box. The whiskers are lines that extend outward from Q1 and Q3, perpendicular to the edges of the box, and reach either the minimum and maximum values in the data or 1.5 times the interquartile range beyond the boundaries of the box, whichever comes first. See the whiskers up here? I just want to boop that little cat nose. Now, why 1.5 times?
That's not a hard-and-fast rule, but it gives us a place to start and makes sure that only a small proportion of the data could be flagged as an outlier. After all, if everything is unusual, then nothing is. If you look at our box plot, you can see that any data points that lie beyond those whiskers can be considered outliers. That's great news for me. Now I can finally get a sense for whether or not I'm an outlier. The interquartile range of this data set is 457 minus 388, which equals 69. If we add 1.5 times 69 to Q3, that gives us an upper whisker cutoff of 560.5. Since my Scrabble score of 463 is less than that, maybe I'm not an outlier after all. And even worse, Nigel is. This is absolutely devastating. This is the worst day of my entire life.
But let's say we're skeptical of Nigel's outlier score. We keep it in the data set to investigate and ultimately learn he cheated. Unlike the 13-nil soccer game, this outlier was too good to be true, and we know to get rid of it. And that's the thing with outliers. They can show us amazing and unique things, and also when something is way off. Identifying them will help you figure out which one you're looking at.
Backtracking a bit, there are also other ways to identify outliers. For now, we can just consider a data point extreme compared to the rest of the data if it is more than three standard deviations away from the mean. Why three? We'll get to that in future episodes, but it basically helps us only flag points that are really far away from the center as potential outliers. By that rule, my score would have to be even higher, over 584, to be considered extreme.
But even if I'm not an outlier, at least 75% of winning Scrabble scores are below mine. Maybe even more. But to get to that detail, I'll need something a little bit more fine-tuned than quartiles. The great news for me, and my perfectly healthy need to be the best, is that there are more ways to separate data than quartiles. Quartiles broke up our data into four equal parts.
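
The whisker arithmetic for the 1.5 times IQR rule can be double-checked with a short Python sketch. It uses the quartiles quoted in the episode (Q1 = 388, Q3 = 457) and the two scores mentioned (463 and 720). The function name `iqr_fences` and the lower cutoff are additions for illustration, since the episode only works out the upper one.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier cutoffs for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles quoted in the episode.
Q1, Q3 = 388, 457
lower, upper = iqr_fences(Q1, Q3)
print(f"IQR = {Q3 - Q1}, lower cutoff = {lower}, upper cutoff = {upper}")

# Scores mentioned in the episode.
for name, score in [("Sabrina", 463), ("Nigel", 720)]:
    is_outlier = score < lower or score > upper
    print(f"{name}: {score} -> {'outlier' if is_outlier else 'not an outlier'}")
```

The multiplier `k` is a parameter on purpose: 1.5 is a convention, not a law, so a stricter or looser cutoff is a one-argument change.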
But we can actually break our data into even more parts. Just like quartiles divide data into four equal parts, percentiles divide data into 100 equal parts, each containing 1% of the data. Basically, quartiles are special percentiles. Q1 is the 25th percentile, Q2 or the median is the 50th percentile, and Q3 is the 75th percentile. Percentiles give us a finer picture of the spread of the data. It turns out I represent the 78th percentile in this distribution of winning Scrabble scores. That means that 78% of other winners' scores are smaller than my current score, which is still not enough.
We actually hear a lot about percentiles in our daily life. As long as we ignore that Scrabble is my life. When we're kids, doctors track our height and weight in terms of percentiles to give our family members a sense of where we fall in comparison to other kids our age. And people often compare academic performance like this, too. Like if someone brags about graduating in the top 10% of their class, that means that their GPA is at or above the 90th percentile. You might also recognize percentiles from the news. Activists during the Occupy Wall Street movement in 2011 popularized the saying "We are the 99%." That's a reference to everyone below the 99th percentile of wealth.
All of these numbers tell stories, from activism and social problems to whether you are a worse person than me. In Scrabble, of course. But those numbers don't stand alone. Putting them in context helps us better understand what they mean and whether or not we should be surprised by them. Being able to identify data points that are extreme compared to the rest of the data is important, both for preventing errors and for appreciating when something new and different comes along.
If you're enjoying this series and are interested in taking the full Study Hall: Real World Statistics course and earning college credit from ASU, check out gostudyhall.com or click on the button to learn more.
And if you want to help us out, give this video a like. Comment your winning Scrabble strategy and smash that subscribe button. Thanks for watching.
