[Justin] In the nation of blobs, there's a popular game
based around flipping coins. Each blob brings their own coin and they take turns flipping. When a coin comes up heads, the one who flipped it feels happy. And the other one feels sad. That's it. That's the game. It seems kind of simple, but
these blobs are a simple folk. There's a rumor going around
that some of the players are using trick coins that come up heads more
than half the time. And that's just not fair, so we would like to catch these cheaters. As a warmup, let's have each of these blobs
flip its coin five times. (playful electro-percussive music) Okay, you might be able to tell that this is an artificial sample. We have results ranging from zero heads all the way up to five heads in a row. Which of these blobs, if any, would you accuse of being a cheater? If you'd like to try your hand at judging the blobs yourself, there is an interactive version I introduced in the last video. Looking at the data from that game, when a blob got five
heads out of five flips, it turned out to be a cheater
only about 88% of the time. Because of the randomness, it's impossible to be completely sure whether a blob is a cheater. But some approaches
are better than others. During the rest of this video, we're gonna build up
one of the main methods of making decisions with limited data. If you like learning new vocabulary terms, you're in for a real treat. The name of this method is
frequentist hypothesis testing. We're gonna design a test
that this blob detective can use in its day-to-day
work searching for cheaters. We want three things from this test. First, if a player is using a fair coin, we want to have a low chance
of wrongly accusing them. Second, if a player is
cheating, using an unfair coin, we want to have a high
chance of catching them. And third, we want this test to use the smallest
number of coins possible. We only have one blob detective, and we want it to be able to
test as many players as it can. And it's also nice not to
bother the players too much. We're gonna design that test together, but if you're feeling up for it, it can be good for learning to
try things on your own first. So this is your chance to
pause and take a moment to think of a test that
might satisfy these goals. Okay, let's take it one flip at a time. It came up heads. The cheaters have heads
come up more often, so this blob must be a cheater. Well, no, we can't just call
them a cheater after one flip. I mean, we could, but with that policy, we'd wrongly accuse quite
a lot of fair players. After all, even if the player
is fair, there's a 50% chance that the first flip would come out heads. So let's see the second flip. Heads again! Cheater? Well, it's more suspicious
for sure, but again, we should think about how
likely it is for us to see this if the coin is actually fair. There are two possible
outcomes for the first flip and two possible outcomes
for the second flip. Two heads in a row is one
of four possible outcomes that are all equally likely. So the probability of two out of two heads is one fourth or 25%. Another way to get that number is to multiply the probability values of the two events together. You do have to be careful about multiplying probabilities,
since it only works if the two events are
independent of each other, and that is the case here,
because getting the first heads doesn't make the second
heads more or less likely. Anyway, with a one in four chance of falsely accusing an innocent blob, it still feels a bit too early to accuse the player of cheating. After another heads, this probability is divided by two again. I'm starting to get pretty suspicious, but we'd still accuse one
out of eight innocent blobs if we accused after three heads in a row. We want that rate of false accusations to be as low as we can get it, but we're never gonna get
it all the way to zero. It'll always be possible
for an innocent blob to get an epic streak of
heads and look suspicious. So we have to make a decision
about what's good enough. The standard choice here is 5%, or one false accusation out
of every 20 fair players. We could choose a different
value if we wanted to, but we might as well start here. Okay, so at this point,
we've crossed the threshold. There's only a one in 32 or 3.125% chance of seeing five heads in
a row from a fair coin. So one possible test
we could use would be, if a player gets five out of five heads, accuse them of being a cheater. Otherwise, decide they're innocent. So let's see how this test performs. We're gonna want a lot of data. So let's make a set of 1000 players where half of them are cheaters. Before we see the results,
try making some predictions. How often will it wrongfully
accuse fair players? And what fraction of the cheaters
do you think it'll catch? Alright, we can divide these
blobs into four categories. Fair players the test decided are fair. Fair players we wrongly
accused of cheating. Cheaters who got away with it. And cheaters we caught. It looks like we achieved
goal one with flying colors. Not only did we accuse fewer
than 5% of the fair players, the test did even better than expected. When we use this test in the real world, we won't know how many
fair players there are, but seeing how the test
performed on this sample, combined with our analysis from before, it feels like we can be pretty confident that we would accuse
fewer than 5% of fair players. We didn't catch very many cheaters, but that's not too surprising. We haven't even thought about them yet, so I'm sure we could do better. Before we make the next
version of the test, I think it's worth mentioning
some fancy statistics terms. They aren't always necessary,
but you might see them around, and like any specialized words, they do make communication
easier in some contexts. If a test result says not to
accuse a blob of cheating, it's called a negative result,
since nothing was found. And when the test does say
to accuse a blob of cheating, it's called a positive result. Cheating is a bad thing,
so not very positive, but the term positive here is referring to the test saying "Yes, the thing I'm looking for is here." The same is true for medical tests. If the test finds what it's looking for, the result is called positive, even though it's usually a bad thing. So we have positive and
negative test results, but the results may
not agree with reality. When we test a blob
that's using a fair coin, the correct result would be negative. So if the test does come up negative, it's called a true negative. And if the test comes out
positive, that's wrong, so it's called a false positive. And when we test a cheater, the correct result would be positive, so if the test does come up positive, we call it a true positive. And if the test incorrectly
gives a negative result, that's a false negative. We can also rephrase that first
goal using another new term, the false positive rate. This can be a dangerously
confusing term though. It's easy to mix up what
the denominator should be. False positive rate
sounds like you're saying, out of all the positives, what fraction of those
positives are false. Or even, out of all the tests,
how many are false positives? But really, it's saying,
out of all the fair players, how many of them are
falsely labeled positive? I've known these words for quite a while, but my brain still
automatically interprets it the wrong way basically every time. So to keep things as clear
as possible for this video, we'll keep using the longer
wording for goal one. Okay, let's go back to designing the test. We still need to figure out a way to achieve goal number two. Let's start by making
the goal more precise. To do that, we need to pick a number for the minimum fraction of
cheaters we want to catch. Using the terms from before, we could also call this the
minimum true positive rate. But again, let's stick
with the plain language. And to throw even more words at you, the true positive rate is sometimes called the statistical power of the test. It's the power of the
test to detect a cheater. The standard target for
statistical power is 80%. Just like the 5% number in the first goal, we could pick any value we want here. But let's run with 80% for now, and we'll talk about
different choices later on. Now for calculating what we expect the true
positive rate to be. What's the probability that a cheater would
get five heads in a row? Take a moment to try that yourself. Okay, that was kind of a trick question. There's no way to calculate that number, since we haven't actually said anything about how often an unfair
coin comes up heads. In that trial we just did with 1000 blobs, the cheaters were using
coins that land heads 75% of the time. We don't know for sure if
that's what the real blobs do. So this 75% is an assumption. But we need some number here
to calculate the probabilities, so we gotta run with something. And yet another word, this
is called the effect size. In this case, it's the effect
of using an unfair coin. You might be getting annoyed that this is the third time I've said we should just run with
some arbitrary number. But what can I tell ya? Some things are uncertain
and some things are up to us. The important thing is to remember when we're making an
assumption or making a choice. That way we can note our assumptions when we make any conclusions,
and we can adjust the test for different choices if
we change our minds later. But now that we have a number,
let's do the calculation. If the probability of each heads is 0.75, the probability of five heads
in a row is 0.75 to the fifth, or about 24%. So our existing test should
catch about 24% of cheaters. And hey, that is pretty close
to what we saw in the trial, so everything seems to
be fitting together. But our goal is to catch 80% of cheaters. The current test is a little bit extreme. It requires 100% heads
for a positive result. This does make false positives
unlikely, which is good, but it also makes true positives
unlikely, which is bad. So we're gonna have to think about a test that allows for a mixture
of heads and tails. Calculating probabilities
for something like this can be a bit confusing though. For example, if we make a new test that requires a blob to
flip their coin 10 times, and accuses them of being a cheater if they get seven or more heads, the calculations in that situation are gonna be a lot harder. There are a bunch of ways for there to be seven
heads out of 10 flips. And we also have to think
about the possibilities of eight, nine, and 10 heads. To start making sense of this, let's go back to just two flips. With a fair coin, each of these four possible
outcomes is equally likely. So the probabilities are one out of four for getting zero heads, two out of four for
getting exactly one heads, and one out of four to get two heads. But with an unfair coin that favors heads, they're skewed toward
results with more heads. With three flips, there are
eight possibilities total, with four possible numbers of heads. As we add more and more flips, it quickly becomes quite a chore to list out all the possible outcomes and add up the probabilities. But there is a pattern to it, so thankfully there's a
formula for cases like this called the binomial distribution. It's not as scary as it looks, but still a full explanation
deserves its own video. I'll put some links about
this in the description, but for now just know that this formula is what we're using to
make these bar graphs, and it follows the same pattern we used for two flips and three flips. Now let's go back to our
test rule from before, where we accuse a player if they get five out of five heads. We can show the rule on these graphs by drawing a vertical line that separates the positive
test results on the right, from the negative test
results on the left. On the fair player graph, the bars to the left
represent the true negatives, or the innocent blobs we leave alone, and to the right are the false positives, the fair players we wrongfully accuse. And on the cheater graph,
the bars to the left represent the false negatives, the cheaters who evade our detection, and the bars to the right
are the true positives, the cheaters we catch. Just like before, we
can see that this test satisfies our first goal
of accusing less than 5% of the fair players we test on average. But it doesn't satisfy our second goal of catching at least 80%
of the cheaters we test, again, on average. But now that we have these graphs, we can see what happens when
we change the number of heads. If we lower the threshold to
four or more heads out of five, we don't meet either requirement. If we keep lowering the threshold, it can allow us to meet goal two, catching more than 80% of the cheaters, but then we accuse even more
fair blobs, so that won't work. Apparently, if we want to meet
both goals at the same time, we're gonna need more flips. If we put these graphs
right next to each other, we can see that the blue
and red distributions overlap quite a lot. So it's impossible to make a test that reliably separates
fair players from cheaters. But if we increase the
number of flips to, say, 100, now there's a big gap
between the distributions, so it's easy to find a
line that separates them. But we also have this third goal of using as few coin flips as possible, so we should try to find
a happy medium somehow. Since we already have the computer set up to run the numbers, we
can go back to five flips and just keep trying different thresholds with more and more flips until we find a test rule that works. It turns out that the smallest test that meets our first two goals has a blob flip its coin 23 times, and the blob is accused of being a cheater if they get 16 or more heads. That's more than I would've
guessed at the start, but it's not so, so huge, so, it'll do. Alright, let's use this
to test a few blobs. This blob got 17 heads. That fits our rule of 16 or more, so according to that test, we should call this blob a cheater. There is another term
worth mentioning here. Assuming this blob is innocent, the probability that
they'd get 17 or more heads is about 1.7%. We call this 1.7% the P
value for this test result. It's kind of like a false positive rate for a single test result. Kind of. 1.7% is below the 5% we
set as our threshold, so according to the test,
we call this one a cheater. And looking at it from
the other direction, if the blob is cheating, using a coin that comes
up heads 75% of the time, there's a 65% chance that
they'd get 17 or more heads. Another way to say it is
that they're in the top 65% of results we'd expect from cheaters. So if we wanna catch 80% of the cheaters we'd better call this one a cheater. Okay, let's try it with one more blob. This one got 13 heads. This is more than half of the 23 flips, so it's tempting to call it a cheater. But 13 is below the 16 heads the test requires for
a positive result, so, we call it a fair player. The P value of this result is about 34%. So if we accuse players
with results like this, we'd expect to wrongly
accuse about 34% of fair players. That's well beyond our 5% tolerance, so we can't call it a cheater. And looking at it from
the other direction, if it were a cheater, there
would be about a 99% chance that they'd get this many heads or more. We don't have to catch 99% of the cheaters to hit our 80% goal, so we
can still meet that goal if we let this one off the hook. Is the first one really a cheater? Is that second one really playing fair? We can't know for sure, but based on how we designed our test we should expect to catch at
least 80% of the cheaters, and falsely accuse less
than 5% of the fair players. So now let's see how this test does on another group of 1000 blobs. Like before, half the blobs in this group are using a trick coin that has a 75% probability
of landing heads. Okay, the results do look about right. We accused less than
5% of the fair players, and we caught more than
80% of the cheaters. The 5% and 80% figures are just conventions
that stuck for historical reasons, so we could make different
decisions if we like. Maybe we decide that we really do not
want to bother the blobs who are playing fairly. So we wanna lower the
false positive rate to 1%. To achieve this with 23 flips, we'd have to raise the
heads threshold to 18 heads. This would lower the
fraction of cheaters we catch to about 47% though. If we don't want to increase
the number of flips, we could decide we're okay with that 47%, maybe we just want cheating to feel risky, so 47% is good enough. Or, if we still want to catch
80% of the cheaters we test, we could increase the number of flips until we find a test that
achieves both of those goals. We could also be super hardcore and go for a 99% true positive rate, and a 1% false positive rate. But we'd have to flip the coin 80 times to get to that level. We'll always be able to set
two of these goals in stone, but that'll limit how well
we can do on the third goal. How to set these goals
depends on which trade-offs we're willing to make. For the rest of this video though, we're just gonna go with
the standard 5% and 80%. Now that we've settled on
the goals we're going for, and we have a test that seems
to be achieving those goals, let's test one more set of blobs. To pretend these are real blobs and not some artificial sample, I'm not going to tell you
anything about this group except that there are 1000 of them. How do you think this test will do on this more mysterious group? Will it manage to accuse fewer
than 5% of the fair players? And will it catch 80% of the cheaters? At this point in the video,
it would be easy to get lazy and not actually make the predictions. But if I'm asking you
these questions yet again, something must be about
to go wrong, right? Or, maybe I'm just pretending so you'll engage a little more. Who can say? But really, what do you think? Okay, so we labeled about a
fifth of them as cheaters, which is a bit less than before. If this were the real world,
that's all you would get. You wouldn't get to see
who was really cheating and who was really innocent to get confirmation that the
test is working as expected. I mean, maybe you could, but
it would take more testing. You couldn't do it with this test alone. But because this is a computer simulation, I do know the full truth. This group was 90% cheaters. We still accused less than
5% of the fair players, but we only caught about
a quarter of the cheaters. Something went wrong. The problem is that we assumed that the cheater coins came
up heads 75% of the time. And that assumption was wrong. The real world cheaters were using coins that came up heads 60% of the time. If we knew that from the beginning, we still could have designed
a test to achieve our goals, but it would need 158 flips and require 90 or more heads to
reach those same thresholds, which is honestly way more coin
flips than I was expecting. But in hindsight, it's not that surprising that we need a lot of data to tease out that smaller difference. But we didn't design that test because we got the effect size wrong. I know, I know, I was the one
who said we should assume 75%. But be honest with yourself. Did you remember that assumption when making your prediction? It's very easy to forget that
assumptions are assumptions, and instead just treat them as facts. This concludes me tricking you
to try to teach you a lesson, but they really are easy
mistakes to make in real life. On the bright side, though,
our test did succeed at accusing less than
5% of the fair players. The framework we built up here isn't just good for catching unfair coins. It's the dominant framework used in actual scientific studies. To summarize, we take a yes or no question. In this case, our question was, is this particular blob
using a biased coin? But it could be any question. Then we come up with a model for what kinds of results we'd
expect if the answer is yes, and if the answer is no. Then we come up with a test
that can do a decent job of telling those two situations apart, according to the models. The details are usually
a bit more complicated, since most real world
systems that we wanna study are more complicated than coin flips. But most scientific studies have this framework at their core. Like I mentioned at the beginning this is called frequentist
hypothesis testing. There's another method called
Bayesian hypothesis testing which we'll look at in the
next video in this series. See you then.
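
The numbers quoted in this walkthrough can be checked with a short script. What follows is a minimal sketch, not the code used to make the video: the function names `tail_prob` and `smallest_test` and their parameters are made up for illustration. It assumes, as the video does, that a fair coin lands heads 50% of the time and a cheater's coin lands heads 75% of the time, and it uses the binomial distribution to add up the probability of getting at least a given number of heads.

```python
from math import comb


def tail_prob(n, k, p):
    """Probability of at least k heads in n flips of a coin that lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def smallest_test(p_cheat=0.75, max_false_accusations=0.05, min_caught=0.80, max_flips=300):
    """Search for the fewest flips that meet both goals.

    Returns (flips, heads_needed), or None if nothing up to max_flips works.
    All parameter names and defaults here are illustrative assumptions.
    """
    for n in range(1, max_flips + 1):
        for k in range(n + 1):
            # Find the lowest heads threshold that accuses fewer than
            # max_false_accusations of fair players (fair coin: p = 0.5).
            if tail_prob(n, k, 0.5) < max_false_accusations:
                # A higher threshold would only catch fewer cheaters, so this
                # is the best threshold for n flips. Check the second goal.
                if tail_prob(n, k, p_cheat) >= min_caught:
                    return n, k
                break
    return None


# Five heads out of five flips.
print(tail_prob(5, 5, 0.5))    # 0.03125, the 1-in-32 chance for a fair coin
print(tail_prob(5, 5, 0.75))   # 0.2373046875, about 24% of cheaters caught

# Smallest test under the 5% / 80% goals with a 75% cheater coin.
print(smallest_test())         # (23, 16)

# p-values for the two example blobs (17 heads and 13 heads out of 23).
print(f"{tail_prob(23, 17, 0.5):.3f}")  # 0.017
print(f"{tail_prob(23, 13, 0.5):.2f}")  # 0.34
```

Calling `smallest_test(p_cheat=0.6)` runs the same search under the 60% assumption, and the result can be compared against the 158-flip test mentioned above. Changing `max_false_accusations` or `min_caught` shows how the trade-off between the three goals shifts.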