3.2 How to Use the Empirical Rule

Mariam is a manager at Oxford Outliers, a men's shoe store. And currently, her job is ruining her life. Her backroom is overflowing with shoes, so much so that she can barely breathe. She is being suffocated by shoes. Worse yet, even more new Oxfords have just arrived from her supplier. Now, Mariam has decided that she does not want to die under this pile of wing-tipped footwear. So she needs to decide what inventory to send back, and fast. At the same time, she has to hold on to enough shoes to cater to most of her customers. But after just a few seconds, and without so much as a calculator, Mariam confidently works out that she can return all sizes above 13 1/2 and below 7 1/2 and still cover about 95% of her future sales. Witchcraft? Or just a good grasp of data distributions? I'm Sabrina Cruz, and this is Study Hall: Real World Statistics. [Music]
The seemingly magical analysis Mariam did is simpler than it looks, and it comes down to the properties of my favorite shape: the bell curve, the normal distribution, the Gaussian, the specter that is definitely not haunting me. Whatever we call it, this shape shows the frequency distribution that data often fall into, like average shoe size or the typical height of a goat. When we plot the average goat heights of all of my friends' goat pastures, they fall into a curve like this, where lots of them are around here in the middle and fewer and fewer of them are out here on the edges. We don't see pastures full of goats as small as mice or as big as giraffes in the real world, because goats of that height are so incredibly rare. But I can dream.
Normal distributions can tell us a lot about the world. They come up in all kinds of different situations, and they're a quick way to make sense of a lot of data. They help us make decisions, like how many shoes to send back so we don't experience death by Oxford, or signaling to a doctor that our blood pressure needs another look.
But we can only make sense of them if we really understand where they come from and what to do once we find them. The central limit theorem tells us that under certain conditions, data are mathematically compelled to fall into a normal distribution. Those conditions require that the data come from a whole bunch of sample means, or averages, from different data sets about events that have the same odds of happening each time, and where the outcome of one event can't affect the others, like the sum of a whole bunch of shoe sizes or goat heights. Now, the dirty little truth is that these things aren't perfectly random, but they still come out pretty much normal.
Let's return to goats, and specifically the mean size of full-grown goats across a region. The probability that a particular goat grows to a certain final size is mostly independent of other goats, which all share the same probabilities. But if one goat grows fast, then maybe it steals more milk from its mother, blocking a smaller, runtier goat from getting the same sustenance and influencing how fast the littler one grows. Still, if goats across the region all share similar air and nutrients, that will have a combined influence on how large they grow. In other words, yes, we know the true odds for each goat aren't really independent or even identical, but given enough data about the average size of goats across the county, we still get something that looks like a normal distribution.
There's some deep and very involved math behind the scenes. But to put it simply, the closer conditions are to what the central limit theorem assumes, the more closely data will resemble a normal distribution. And the further we stray from those conditions, the less the data will appear normal. In other words, having somewhat independent and identical probabilities across events is enough for a normal-ish set of data to emerge, as it did for the goats. That's pretty great news.
There are lots of situations where we can expect data to look like a normal distribution. The challenge is working out what those situations are. And unfortunately, it's generally not enough to just plot your data and eyeball it. To figure out whether data are normally distributed, we can use some properties that will probably sound familiar. Like, we remember skewness, which describes which way the tails point, and kurtosis, which describes how big the tails are, or how many goats are way bigger or smaller than the mean, or whether you're dealing with tiny Cinderella-sized feet or a shoe so big an old lady could fit in it.
In general, properties that describe the shape of data are called cumulants. And the first four are the mean, variance, skew, and kurtosis. But the important thing to know is that cumulants can be calculated as a number from a set of data. And looking at those cumulants is how you figure out whether data are normally distributed. Now, skewness and kurtosis are general properties, but there are specific formulas that allow us to assign exact quantities for how skewed or fat-tailed a distribution is. We've left some details in the description if you'd like to get into the math, but for practical purposes, statistical software has got you covered, which is great because I love making the computer do the work for me. Thank you.
In Google Sheets, for instance, you can use the SKEW function on a list of data to produce a number. Positive values mean that the data skew to the right, while negative values mean the data skew to the left. Similarly, the KURT function gives us a number representing kurtosis. A positive kurtosis value means that the data have a sharper peak and fat tails, which means lots of extreme values. A negative kurtosis value means that the data have a sort of flat peak and light tails. Now comes some sorcery, the kind that will eventually help you avoid dying under a pile of shoes like our friend Mariam.
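If you'd rather see what the computer is doing when it turns a list of data into a skewness or kurtosis number, here is a minimal Python sketch. It assumes the sample-adjusted formulas commonly documented for the spreadsheet SKEW and KURT functions; the function names and the shoe-size list are made up for illustration and are not from the video.

```python
from math import sqrt


def sample_skewness(data):
    """Sample-adjusted skewness; needs at least 3 values."""
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation (divides by n - 1).
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    cubed = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed


def sample_excess_kurtosis(data):
    """Sample-adjusted excess kurtosis; needs at least 4 values.

    "Excess" means 3 has been subtracted, so a perfect normal scores 0.
    """
    n = len(data)
    mean = sum(data) / n
    s = sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical shoe sizes, symmetric around 10.5 (illustration only).
sizes = [9, 9.5, 10, 10, 10.5, 10.5, 10.5, 11, 11, 11.5, 12]
# Hypothetical data with one very large value, so a long right tail.
lopsided = [1, 1, 1, 2, 2, 3, 10]

print("skew of symmetric sizes:", sample_skewness(sizes))
print("skew of lopsided data:", sample_skewness(lopsided))
print("excess kurtosis of sizes:", sample_excess_kurtosis(sizes))
```

The symmetric list comes out with a skew of essentially zero, while the lopsided list comes out positive, matching the "positive means skewed right" reading described above.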
The normal distribution is the only distribution of data where convention holds that skewness and kurtosis are exactly zero. That means in a totally perfect normal distribution, the spread of values is symmetrical and evenly spread around the mean. As we said, though, real-world data aren't perfect. Even when the central limit theorem should apply, the data will have some amount of skew and kurtosis. The upshot is that calculating cumulants gives us a way of measuring how close to a normal distribution the data are. Specifically, if the skew and kurtosis of a distribution of data are small, we can treat the data like a normal distribution, which is key for working the sort of witchcraft Mariam did. But we'll get back to that in a second.
Now, there's no hard and fast rule about just how small counts as small, and different statisticians have suggested different safe limits for kurtosis and skew, but there are some quick and dirty rules we've learned over time. One is that if the skew is between -1 and 1 and the kurtosis is between -6 and 6, you can treat the data as being normally distributed. There are other methods we'll look at later on in this series, but the skew and kurtosis rules do a great job of checking that your data are broadly behaving like a normal distribution.
For instance, say we take a whole bunch of data on shoe sizes for Outlier Oxfords, the brand that Mariam was managing. Looking at the data, we can calculate the mean as 10 1/2 and the standard deviation as 1 1/2. That seems sensible. 10 1/2 is a pretty standard shoe size, and it seems reasonable for it to vary by a size and a half on average. And when we calculate skewness and kurtosis, the values are both tiny and much smaller than one. That tells us that we can confidently say we're very close to a normal distribution. This brings us back to Mariam's wizardry when she worked out that most of her stock for her shoes was in the 9 to 12 size range.
And that means with just a couple of cool tricks, she can do some data enchantments. The normal distribution is a frequency distribution, which means it's basically like a histogram of data. The area that the data occupy on the graph, as a proportion of the whole area, is also the fraction of data that fall within a given range. Like, if we look at the data in this range of the normal distribution representing men's shoe sizes, that's about 25% of the total area under the curve, which means that about 25% of all men's shoe sizes are between a 10 and an 11, which probably sounds enormous if you're Cinderella.
For a normal distribution, we can get even more specific. The proportion of data lying in a certain range is determined entirely by the shape of the curve. What's more, the spread of data in a normal distribution depends entirely on the standard deviation. After all, the mean and the standard deviation completely define the shape of the curve. Putting those facts together, we get the secret to Mariam's statistical magic. The proportion of the data contained within a certain number of standard deviations away from the mean is exactly the same for any normal distribution. So exactly the same that there's a rule for that, which some people call the 1-2-3 rule. They weren't feeling super creative that day. I get it. It's also called the empirical rule, which sounds a bit fancier, but no matter what you call it, this rule can turn a daunting task into something pretty easy.
The rule works like this. If you take a normal distribution, like shoe sizes, and look one standard deviation above and below the mean, the data contained in that range will always make up about 68% of the data. And we can go even further. If we look at the range of values above and below the mean by two standard deviations, it always contains about 95% of the data. And finally, if we look three standard deviations above and below the mean, it always contains about 99.7% of the data.
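To see that those proportions really don't depend on which normal distribution we pick, here is a small Python sketch using the standard library's `statistics.NormalDist`. The shoe-size mean and standard deviation come from the example; the second distribution and the helper name `share_within` are assumptions added purely for comparison.

```python
from statistics import NormalDist


def share_within(dist, k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


shoes = NormalDist(mu=10.5, sigma=1.5)   # shoe sizes from the example
other = NormalDist(mu=100, sigma=15)     # hypothetical, unrelated normal distribution

for k in (1, 2, 3):
    # Both columns agree: about 0.68, 0.95, and 0.997.
    print(k, round(share_within(shoes, k), 4), round(share_within(other, k), 4))

# Area under the shoe-size curve between a size 10 and a size 11,
# which works out to roughly a quarter of the total.
between_10_and_11 = shoes.cdf(11) - shoes.cdf(10)
print(round(between_10_and_11, 2))
```

The two distributions have completely different means and spreads, yet the share inside one, two, and three standard deviations is the same for both, which is the whole point of the rule.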
It comes down to how much data fall between two points on the graph. And the shape of the curve is determined by the standard deviation. It's the yardstick that tells us how far the data are spread from the mean. That's why proportions for data inside the normal curve are always the same when we use standard deviations.
In Mariam's case, she knows that the average men's shoe size is 10 1/2 with a standard deviation of 1 1/2. To hold on to what 95% of her customers would wear, she uses twice the standard deviation, which is three sizes. Looking three sizes above and below the mean of 10 1/2 gives her a range of 7 1/2 to 13 1/2 to hold on to in the storeroom, and also spares her from near-certain death by shoe. If she'd wanted to offload even more shoes, she could have just subtracted or added 1 1/2 from 10 1/2, giving her shoes in the 9 to 12 size range, which is about 68% of the data.
Now, the 1-2-3 rule can only tell you the ranges for three fairly specific percentages about the data: 68, 95, and 99.7. But those three numbers are still very helpful. The three standard deviation rule, for instance, helps us spot extreme outliers in the data. Knowing what proportion of data are contained within one, two, and three standard deviations from the mean is incredibly handy for making quick inferences about normal distributions. For Mariam, it tells her that just three in every 1,000 customers buying shoes will ever need a size below a 6 or above a 15, which are beyond three standard deviations. So it might be better to refer customers wanting those sizes to order online than to hold on to that stock in store. This all goes beyond shoes, too. Maybe you're a coach ordering jerseys for the school soccer team, or even a doctor trying to assess if a patient's blood pressure is suspiciously high or low.
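Mariam's back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The mean, standard deviation, and the three rounded percentages are the ones quoted in the example; the variable and function names are hypothetical.

```python
mean_size = 10.5   # average men's shoe size from the example
std_dev = 1.5      # standard deviation from the example

# Rounded empirical-rule shares for 1, 2, and 3 standard deviations.
coverage = {1: 0.68, 2: 0.95, 3: 0.997}


def size_range(k):
    """Sizes lying within k standard deviations of the mean, as (low, high)."""
    return (mean_size - k * std_dev, mean_size + k * std_dev)


for k, share in coverage.items():
    low, high = size_range(k)
    print(f"within {k} sd: sizes {low} to {high}, about {share:.1%} of customers")

# Customers per 1,000 expected to fall outside three standard deviations.
outside_3sd_per_1000 = round((1 - coverage[3]) * 1000)
print(outside_3sd_per_1000)
```

The two-standard-deviation row gives the 7 1/2 to 13 1/2 range she keeps in the storeroom, and the last line gives the three-in-1,000 figure for sizes below 6 or above 15.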
A quick sense check with the empirical rule can tell you the broad strokes of what proportions you could expect your data to fall into, with just the mean and standard deviation. Knowing these useful tricks and tools will help you size up data and check if the normal distribution fits. Like a shoe, if the shoe fits. But figuring out if your distribution is normal and mastering things like the empirical rule also helps us make sense of how data about one thing, whether it's a shoe or a goat, stand with respect to everything else in the same category. And that's going to help you perform spellbinding feats, or just make solid business decisions every day, in ways that will make your life way easier.
If you're enjoying this series and are interested in taking the full Study Hall Real World Statistics course and earning college credit from ASU, check out gostudyhall.com or click on the button to learn more. And if you want to help us out, give this video a like, smash that subscribe button, and comment your shoe size. Thanks for watching and see you next
