Transcript for:
Frequentist Hypothesis Testing

  • [Justin] In the nation of blobs, there's a popular game based around flipping coins. Each blob brings their own coin and they take turns flipping. When a coin comes up heads, the one who flipped it feels happy. And the other one feels sad. That's it. That's the game. It seems kind of simple, but these blobs are a simple folk. There's a rumor going around that some of the players are using trick coins that come up heads more than half the time. And that's just not fair, so we would like to catch these cheaters. As a warmup, let's have each of these blobs flip its coin five times. (playful electro-percussive music) Okay, you might be able to tell that this is an artificial sample. We have results ranging from zero heads, all the way up to five heads in a row. Which of these blobs, if any, would you accuse of being a cheater? If you'd like to try your hand at judging the blobs yourself there is an interactive version I introduced in the last video. Looking at the data from that game, when a blob got five heads out of five flips, it turned out to be a cheater only about 88% of the time. Because of the randomness, it's impossible to be completely sure whether a blob is a cheater. But some approaches are better than others. During the rest of this video, we're gonna build up one of the main methods of making decisions with limited data. If you like learning new vocabulary terms, you're in for a real treat. The name of this method is frequentist hypothesis testing. We're gonna design a test that this blob detective can use in its day-to-day work searching for cheaters. We want three things from this test. First, if a player is using a fair coin, we want to have a low chance of wrongly accusing them. Second, if a player is cheating, using an unfair coin, we want to have a high chance of catching them. And third, we want this test to use the smallest number of coins possible. We only have one blob detective, and we want it to be able to test as many players as it can. And it's also nice not to bother the players too much. We're gonna design that test together, but if you're feeling up for it, it can be good for learning to try things on your own first. So this is your chance to pause and take a moment to think of a test that might satisfy these goals. Okay, let's take it one flip at a time. It came up heads. The cheaters have heads come up more often, so this blob must be a cheater. Well, no, we can't just call them a cheater after one flip. I mean, we could, but with that policy, we'd wrongly accuse quite a lot of fair players. After all, even if the player is fair, there's a 50% chance that the first flip would come out heads. So let's see the second flip. Heads again! Cheater? Well, it's more suspicious for sure, but again, we should think about how likely it is for us to see this if the coin is actually fair. There are two possible outcomes for the first flip and two possible outcomes for the second flip. Two heads in a row is one of four possible outcomes that are all equally likely. So the probability of two out of two heads is one fourth or 25%. Another way to get that number is to multiply the probability values of the two events together. You do have to be careful about multiplying probabilities, since it only works if the two events are independent of each other, and that is the case here, because getting the first heads doesn't make the second heads more or less likely. Anyway, with a one in four chance of falsely accusing an innocent blob, it still feels a bit too early to accuse the player of cheating. After another heads, this probability is divided by two again. I'm starting to get pretty suspicious, but we'd still accuse one out of eight innocent blobs if we accused after three heads in a row. We want that rate of false accusations to be as low as we can get it but we're never gonna get it all the way to zero. It'll always be possible for an innocent blob to get an epic streak of heads and look suspicious. So we have to make a decision about what's good enough. The standard choice here is 5%, or one false accusation out of every 20 fair players. We could choose a different value if we wanted to, but we might as well start here. Okay, so at this point, we've crossed the threshold. There's only a one in 32 or 3.125% chance of seeing five heads in a row from a fair coin. So one possible test we could use would be, if a player gets five out of five heads, accuse them of being a cheater. Otherwise, decide they're innocent. So let's see how this test performs. We're gonna want a lot of data. So let's make a set of 1000 players where half of them are cheaters. Before we see the results, try making some predictions. How often will it wrongfully accuse fair players? And what fraction of the cheaters do you think it'll catch? Alright, we can divide these blobs into four categories. Fair players the test decided are fair. Fair players we wrongly accused of cheating. Cheaters who got away with it. And cheaters we caught. It looks like we achieved goal one with flying colors. Not only did we accuse fewer than 5% of the fair players, the test did even better than expected. When we use this test in the real world, we won't know how many fair players there are, but seeing how the test performed on this sample, combined with our analysis from before, it feels like we can be pretty confident that we would accuse fewer than 5% of players. We didn't catch very many cheaters, but that's not too surprising. We haven't even thought about them yet, so I'm sure we could do better. Before we make the next version of the test, I think it's worth mentioning some fancy statistics terms. They aren't always necessary, but you might see them around, and like any specialized words, they do make communication easier in some contexts. If a test result says not to accuse a blob of cheating, it's called a negative result, since nothing was found. And when the test does say to accuse a blob of cheating, it's called a positive result. Cheating is a bad thing, so not very positive, but the term positive here is referring to the test saying "Yes, the thing I'm looking for is here." The same is true for medical tests. If the test finds what it's looking for, the result is called positive, even though it's usually a bad thing. So we have positive and negative test results, but the results may not agree with reality. When we test a blob that's using a fair coin, the correct result would be negative. So if the test does come up negative, it's called a true negative. And if the test comes out positive, that's wrong, so it's called a false positive. And when we test a cheater, the correct result would be positive, so if the test does come up positive, we call it a true positive. And if the test incorrectly gives a negative result, that's a false negative. We can also rephrase that first goal using another new term, the false positive rate. This can be a dangerously confusing term though. It's easy to mix up what the denominator should be. False positive rate sounds like you're saying, out of all the positives, what fraction of those positives are false. Or even, out of all the tests, how many are false positives? But really, it's saying, out of all the fair players, how many of them are falsely labeled positive? I've known these words for quite a while, but my brain still automatically interprets it the wrong way basically every time. So to keep things as clear as possible for this video, we'll keep using the longer wording for goal one. Okay, let's go back to designing the test. We still need to figure out a way to achieve goal number two. Let's start by making the goal more precise. To do that, we need to pick a number for the minimum fraction of cheaters we want to catch. Using the terms from before, we could also call this the minimum true positive rate. But again, let's stick with the plain language. And to throw even more words at you, this minimum is sometimes called the statistical power of the test. It's the power of the test to detect a cheater. The standard target for statistical power is 80%. Just like the 5% number in the first goal, we could pick any value we want here. But let's run with 80% for now, and we'll talk about different choices later on. Now for calculating what we expect the true positive rate to be. What's the probability that a cheater would get five heads in a row? Take a moment to try that yourself. Okay, that was kind of a trick question. There's no way to calculate that number, since we haven't actually said anything about how often an unfair coin comes up heads. In that trial we just did with 1000 blobs, the cheaters were using coins that land heads 75% of the time. We don't know for sure if that's what the real blobs do. So this 75% is an assumption. But we need some number here to calculate the probabilities, so we gotta run with something. And yet another word, this is called the effect size. In this case, it's the effect of using an unfair coin. You might be getting annoyed that this is the third time I've said we should just run with some arbitrary number. But what can I tell ya? Some things are uncertain and some things are up to us. The important thing is to remember when we're making an assumption or making a choice. That way we can note our assumptions when we make any conclusions, and we can adjust the test for different choices if we change our minds later. But now that we have a number, let's do the calculation. If the probability of each heads is 0.75, the probability of five heads in a row is 0.75 to the fifth, or about 24%. So our existing test should catch about 24% of cheaters. And hey, that is pretty close to what we saw in the trial, so everything seems to be fitting together. But our goal is to catch 80% of cheaters. The current test is a little bit extreme. It requires 100% heads for a positive result. This does make false positives unlikely, which is good, but it also makes true positives unlikely, which is bad. So we're gonna have to think about a test that allows for a mixture of heads and tails. Calculating probabilities for something like this can be a bit confusing though. For example, if we make a new test that requires a blob to flip their coin 10 times, and accuses them of being a cheater if they get seven or more heads, the calculations in that situation are gonna be a lot harder. There are a bunch of ways for there to be seven heads out of 10 flips. And we also have to think about the possibilities of eight, nine, and 10 heads. To start making sense of this, let's go back to just two flips. With a fair coin, each of these four possible outcomes is equally likely. So the probabilities are one out of four for getting zero heads, two out of four for getting exactly one heads, and one out of four to get two heads. But with an unfair coin that favors heads, they're skewed toward results with more heads. With three flips, there are eight possibilities total, with four possible numbers of heads. As we add more and more flips, it quickly becomes quite a chore to list out all the possible outcomes and add up the probabilities. But there is a pattern to it, so thankfully there's a formula for cases like this called the binomial distribution. It's not as scary as it looks, but still a full explanation deserves its own video. I'll put some links about this in the description, but for now just know that this formula is what we're using to make these bar graphs, and it follows the same pattern we used for two flips and three flips. Now let's go back to our test rule from before, where we accuse a player if they get five out of five heads. We can show the rule on these graphs by drawing a vertical line that separates the positive test results on the right, from the negative test results on the left. On the fair player graph, the bars to the left represent the true negatives, or the innocent blobs we leave alone, and to the right are the false positives, the fair players we wrongfully accuse. And on the cheater graph, the bars to the left represent the false negatives, the cheaters who evade our detection, and the bars to the right are the true positives, the cheaters we catch. Just like before, we can see that this test satisfies our first goal of accusing less than 5% of the fair players we test on average. But it doesn't satisfy our second goal of catching at least 80% of the cheaters we test, again, on average. But now that we have these graphs, we can see what happens when we change the number of heads. If we lower the threshold to four or more heads out of five, we don't meet either requirement. If we keep lowering the threshold, it can allow us to meet goal two, catching more than 80% of the cheaters, but then we accuse even more fair blobs, so that won't work. Apparently, if we want to meet both goals at the same time, we're gonna need more flips. If we put these graphs right next to each other, we can see that the blue and red distributions overlap quite a lot. So it's impossible to make a test that reliably separates fair players from cheaters. But if we increase the number of flips to, say, 100, now there's a big gap between the distributions, so it's easy to find a line that separates them. But we also have this third goal of using as few coin flips as possible, so we should try to find a happy medium somehow. Since we already have the computer set up to run the numbers, we can go back to five flips and just keep trying different thresholds with more and more flips until we find a test rule that works. It turns out that the smallest test that meets our first two goals has a blob flip its coin 23 times, and the blob is accused of being a cheater if they get 16 or more heads. That's more than I would've guessed at the start, but it's not so, so huge, so, it'll do. Alright, let's use this to test a few blobs. This blob got 17 heads. That fits our rule of 16 or more, so according to that test, we should call this blob a cheater. There is another term worth mentioning here. Assuming this blob is innocent, the probability that they'd get 17 or more heads is about 1.7%. We call this 1.7% the P value for this test result. It's kind of like a false positive rate for a single test result. Kind of. 1.7% is below the 5% we set as our threshold, so according to the test, we call this one a cheater. And looking at it from the other direction, if the blob is cheating, using a coin that comes up heads 75% of the time, there's a 65% chance that they'd get 17 or more heads. Another way to say it is that they're in the top 65% of results we'd expect from cheaters. So if we wanna catch 80% of the cheaters we'd better call this one a cheater. Okay, let's try it with one more blob. This one got 13 heads. This is more than half of the 23 flips, so it's tempting to call it a cheater. But 13 is below the 16 heads the test requires for a positive result, so, we call it a fair player. The P value of this result is about 34%. So if we accuse players with results like this, we'd expect to wrongly accuse about 34% of players. That's well beyond our 5% tolerance, so we can't call it a cheater. And looking at it from the other direction, if it were a cheater, there would be about a 99% chance that they'd get this many heads or more. We don't have to catch 99% of the cheaters to hit our 80% goal, so we can still meet that goal if we let this one off the hook. Is the first one really a cheater? Is that second one really playing fair? We can't know for sure, but based on how we designed our test we should expect to catch at least 80% of the cheaters, and falsely accuse less than 5% of the fair players. So now let's see how this test does on another group of 1000 blobs. Like before, half the blobs in this group are using a trick coin that has a 75% probability of landing heads. Okay, the results do look about right. We accused less than 5% of the fair players, and we caught more than 80% of the cheaters. 5% and 80% are the normal numbers for historical reasons. So we could make different decisions if we like. Maybe we decide that we really do not want to bother the blobs who are playing fairly. So we wanna lower the false positive rate to 1%. To achieve this with 23 flips, we'd have to raise the heads threshold to 18 heads. This would lower the fraction of cheaters we catch to about 47% though. If we don't want to increase the number of flips, we could decide we're okay with that 47%, maybe we just want cheating to feel risky, so 47% is good enough. Or, if we still want to catch 80% of the cheaters we test, we could increase the number of flips until we find a test that achieves both of those goals. We could also be super hardcore and go for a 99% true positive rate, and a 1% false positive rate. But we'd have to flip the coin 80 times to get to that level. We'll always be able to set two of these goals in stone, but that'll limit how well we can do on the third goal. How to set these goals depends which trade-offs we're willing to make. For the rest of this video though, we're just gonna go with the standard 5% and 80%. Now that we've settled on the goals we're going for, and we have a test that seems to be achieving those goals, let's test one more set of blobs. To pretend these are real blobs and not some artificial sample, I'm not going to tell you anything about this group except that there are 1000 of them. How do you think this test will do on this more mysterious group? Will it manage to accuse fewer than 5% of the fair players? And will it catch 80% of the cheaters? At this point in the video, it would be easy to get lazy and not actually make the predictions. But if I'm asking you these questions yet again, something must be about to go wrong, right? Or, maybe I'm just pretending so you'll engage a little more. Who can say? But really, what do you think? Okay, so we labeled about a fifth of them to be cheaters, which is a bit less than before. If this were the real world, that's all you would get. You wouldn't get to see who was really cheating and who was really innocent to get confirmation that the test is working as expected. I mean, maybe you could, but it would take more testing. You couldn't do it with this test alone. But because this is a computer simulation, I do know the full truth. This group was 90% cheaters. We still accused less than 5% of the fair players, but we only caught about a quarter of the cheaters. Something went wrong. The problem is that we assumed that the cheater coins came up heads 75% of the time. And that assumption was wrong. The real world cheaters were using coins that came up heads 60% of the time. If we knew that from the beginning, we still could have designed a test to achieve our goals, but it would need 158 flips and require 90 heads to reach those same thresholds, which is honestly way more coin flips than I was expecting. But in hindsight, it's not that surprising that we need a lot of data to tease out that smaller difference. But we didn't design that test because we got the effect size wrong. I know, I know, I was the one who said we should assume 75%. But be honest with yourself. Did you remember that assumption when making your prediction? It's very easy to forget that assumptions are assumptions, and instead just treat them as facts. This concludes me tricking you to try to teach you a lesson, but they really are easy mistakes to make in real life. On the bright side, though, our test did succeed at accusing less than 5% of the fair players. The framework we built up here isn't just good for catching unfair coins. It's the dominant framework used in actual scientific studies. To summarize we take a yes or no question. In this case, our question was, is this particular blob using a biased coin? But it could be any question. Then we come up with a model for what kinds of results we'd expect if the answer is yes, and if the answer is no. Then we come up with a test that can do a decent job of telling those two situations apart, according to the models. The details are usually a bit more complicated, since most real world systems that we wanna study are more complicated than coin flips. But most scientific studies have this framework at their core. Like I mentioned at the beginning this is called frequentist hypothesis testing. There's another method called Bayesian hypothesis testing which we'll look at in the next video in this series. See you then.