Transcript for:
Exploring the Significance of Bayes' Theorem

The goal is for you to come away from this video understanding one of the most important formulas in all of probability, Bayes' theorem. This formula is central to scientific discovery, it's a core tool in machine learning and AI, and it's even been used for treasure hunting, when in the 1980s a small team led by Tommy Thompson, and I'm not making up that name, used Bayesian search tactics to help uncover a ship that had sunk a century and a half earlier, carrying what in today's terms amounts to $700 million worth of gold. So it's a formula worth understanding, but of course there are multiple different levels of possible understanding. At the simplest, there's just knowing what each one of the parts means, so that you can plug in numbers. Then there's understanding why it's true, and later I'm going to show you a certain diagram that's helpful for rediscovering this formula on the fly as needed. But maybe the most important level is being able to recognize when you need to use it. With the goal of gaining a deeper understanding, you and I are going to tackle these in reverse order.

So before dissecting the formula or explaining the visual that makes it obvious, I'd like to tell you about a man named Steve. Listen carefully now. Steve is very shy and withdrawn, invariably helpful but with very little interest in people or the world of reality. A meek and tidy soul, he has a need for order and structure, and a passion for detail. Which of the following do you find more likely: that Steve is a librarian, or that Steve is a farmer?

Some of you may recognize this as an example from a study conducted by the two psychologists Daniel Kahneman and Amos Tversky. Their work was a big deal, it won a Nobel Prize, and it's been popularized many times over in books like Kahneman's Thinking Fast and Slow or Michael Lewis's The Undoing Project. What they researched was human judgments, with a frequent focus on when these judgments irrationally contradict what the laws of probability suggest they should be. The example with Steve, our maybe-librarian-maybe-farmer, illustrates one specific type of irrationality. Or maybe I should say alleged irrationality; there are people who debate the conclusion here, but more on all of that later on.

According to Kahneman and Tversky, after people are given this description of Steve as a meek and tidy soul, most say he's more likely to be a librarian. After all, these traits line up better with the stereotypical view of a librarian than of a farmer. And according to Kahneman and Tversky, this is irrational. The point is not whether people hold correct or biased views about the personalities of librarians and farmers; it's that almost nobody thinks to incorporate information about the ratio of farmers to librarians into their judgments. In their paper, Kahneman and Tversky said that in the US that ratio is about 20 to 1. The numbers I could find today put it much higher, but let's stick with the 20 to 1 figure, since it's a little easier to illustrate and proves the point just as well. To be clear, anyone who is asked this question isn't expected to have perfect information about the actual statistics of farmers and librarians and their personality traits. The question is whether people even think to consider that ratio, enough to at least make a rough estimate. Rationality is not about knowing facts; it's about recognizing which facts are relevant.
Now if you do think to make that estimate, there's a pretty simple way to reason about the question which, spoiler alert, involves all of the essential reasoning behind Bayes' theorem. You might start by picturing a representative sample of farmers and librarians, say 200 farmers and 10 librarians. Then when you hear the meek and tidy soul description, let's say your gut instinct is that 40% of librarians would fit that description, and 10% of farmers would. If those are your estimates, it would mean that from your sample you would expect about 4 librarians to fit the description, and about 20 farmers to fit it. So the probability that a random person among those who fit this description is a librarian is 4 out of 24, or about 16.7%. So even if you think a librarian is 4 times as likely as a farmer to fit this description, that's not enough to overcome the fact that there are way more farmers. The upshot, and this is the key mantra underlying Bayes' theorem, is that new evidence does not completely determine your beliefs in a vacuum; it should update prior beliefs.

If this line of reasoning makes sense to you, the way that seeing evidence restricts the space of possibilities, and the ratio you need to consider after that, then congratulations! You understand the heart of Bayes' theorem. Maybe the numbers you would estimate would be a little different, but what matters is how you fit the numbers together to update your beliefs based on evidence. Now, understanding one example is one thing, but see if you can take a minute to generalize everything we just did and write it all down as a formula.

The general situation where Bayes' theorem is relevant is when you have some hypothesis, like "Steve is a librarian," and you see some new evidence, say this verbal description of Steve as a meek and tidy soul, and you want to know the probability that your hypothesis holds given that the evidence is true. In the standard notation, P(H|E), this vertical bar means "given that," as in we're restricting our view only to the possibilities where the evidence holds. Now remember the first relevant number we used: the probability that the hypothesis holds before considering any of that new evidence. In our example, that was 1 out of 21, and it came from considering the ratio of librarians to farmers in the general population. This number is known as the prior. After that, we need to consider the proportion of librarians that fit this description, the probability that we would see the evidence given that the hypothesis is true, P(E|H). Again, when you see this vertical bar, it means we're talking about some proportion of a limited part of the total space of possibilities. In this case, that limited part is the left side, where the hypothesis holds. In the context of Bayes' theorem, this value also has a special name: it's called the likelihood. Similarly, you need to know how much of the other side of the space includes the evidence, the probability of seeing the evidence given that the hypothesis isn't true, P(E|¬H). This funny little elbow symbol, ¬, is commonly used in probability to mean "not." So with the notation in place, remember what our final answer was: the probability that our librarian hypothesis is true given the evidence is the total number of librarians fitting the evidence, 4, divided by the total number of people fitting the evidence, 24. But where did that 4 come from?
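Before unpacking that, it may help to see the sample-counting reasoning so far condensed into a few lines of code. This is a minimal Python sketch; the 200/10 split and the 40%/10% figures are the illustrative gut-instinct estimates from above, not measured statistics:

```python
# Representative sample: 20 farmers for every librarian.
librarians = 10
farmers = 200

# Gut-instinct estimates (illustrative, not measured): what fraction
# of each group fits the "meek and tidy soul" description?
p_fits_given_librarian = 0.40
p_fits_given_farmer = 0.10

fitting_librarians = librarians * p_fits_given_librarian  # 4
fitting_farmers = farmers * p_fits_given_farmer           # 20

# Among everyone who fits the description, what fraction are librarians?
posterior = fitting_librarians / (fitting_librarians + fitting_farmers)
print(posterior)  # 0.1666..., i.e. about 16.7%
```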
Well, it's the total number of people times the prior probability of being a librarian, giving us the 10 total librarians, times the probability that one of those fits the evidence. That same number shows up again in the denominator, but we need to add in the rest: the total number of people, times the proportion who are not librarians, times the proportion of those who fit the evidence, which in our example gives 20. Now notice that the total number of people here, 210, gets cancelled out, and of course it should; that was just an arbitrary choice made for the sake of illustration. This leaves us finally with a more abstract representation purely in terms of probabilities, and this, my friends, is Bayes' theorem.

More often, you see this denominator written simply as P(E), the total probability of seeing the evidence, which in our example would be the 24 out of 210. But in practice, to calculate it, you almost always have to break it down into the case where the hypothesis is true and the one where it isn't. Capping things off with one final bit of jargon: this answer is called the posterior. It's your belief about the hypothesis after seeing the evidence.

Writing it out abstractly might seem more complicated than just thinking through the example directly with a representative sample. And yeah, it is. Keep in mind, though, the value of a formula like this is that it lets you quantify and systematize the idea of changing beliefs. Scientists use this formula when they're analyzing the extent to which new data validates or invalidates their models. Programmers will sometimes use it in building artificial intelligence, where at times you want to explicitly and numerically model a machine's belief. And honestly, just for the way you view yourself and your own opinions and what it takes for your mind to change, Bayes' theorem has a way of reframing how you even think about thought itself. Putting a formula to it also becomes more important as the examples get more and more intricate.

However you end up writing it, I actually encourage you not to try memorizing the formula, but to instead draw out this diagram as needed. It's sort of a distilled version of thinking with a representative sample, where we think with areas instead of counts, which is more flexible and easier to sketch on the fly. Rather than bringing to mind some specific number of examples, like 210, think of the space of all possibilities as a 1×1 square. Then any event occupies some subset of this space, and the probability of that event can be thought of as the area of that subset. For example, I like to think of the hypothesis as living in the left part of the square, with a width of P(H). I recognize I'm being a bit repetitive, but when you see evidence, the space of possibilities gets restricted, right? And the crucial part is that the restriction might not be even between the left and the right, so the new probability for the hypothesis is the proportion it occupies in this restricted, wonky shape. Now, if you happen to think that a farmer is just as likely to fit the evidence as a librarian, then the proportion doesn't change, which should make sense, right? Irrelevant evidence doesn't change your beliefs. But when these likelihoods are very different from each other, that's when your belief changes a lot. Bayes' theorem spells out what that proportion is, and if you want, you can read it geometrically.
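Since the formula itself appears on screen in the video rather than in this transcript, here it is written out for reference, in the notation from above (H for the hypothesis, E for the evidence, ¬H for "not H"):

```latex
\[
  P(H \mid E)
    = \frac{P(H)\,P(E \mid H)}
           {P(H)\,P(E \mid H) + P(\neg H)\,P(E \mid \neg H)}
    = \frac{P(H)\,P(E \mid H)}{P(E)}
\]
```

Here P(H) is the prior, P(E|H) is the likelihood, and P(H|E) is the posterior. Plugging in the example's numbers: (1/21 × 0.4) / (1/21 × 0.4 + 20/21 × 0.1) = 4/24 ≈ 16.7%, matching the count-based answer.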
Something like P(H) times P(E|H), the probability of both the hypothesis and the evidence occurring together, is the width times the height of this little left rectangle: the area of that region.

Alright, this is probably a good time to take a step back and consider a few of the broader takeaways about how to make probability more intuitive, beyond just Bayes' theorem. First off, notice how the trick of thinking about a representative sample with some specific number of people, like our 210 librarians and farmers, was really helpful. There's actually another Kahneman and Tversky result which is all about this, and it's interesting enough to interject here. They did an experiment that was similar to the one with Steve, but where people were given the following description of a fictitious woman named Linda. Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. After seeing this, people were asked what's more likely: (1) that Linda is a bank teller, or (2) that Linda is a bank teller and is active in the feminist movement. 85% of participants said that the latter is more likely than the former, even though the set of bank tellers who are active in the feminist movement is a subset of the set of bank tellers. It has to be smaller.

So that's interesting enough, but what's fascinating is that there's a simple way to rephrase the question that drops this error from 85% to zero. If instead participants were told that there are 100 people who fit this description, and were asked to estimate how many of those 100 are bank tellers, and how many are bank tellers who are active in the feminist movement, nobody makes the error. Everybody correctly assigns a higher number to the first option than to the second. It's weird: somehow a phrase like "40 out of 100" kicks our intuitions into gear much more effectively than "40%", much less "0.4", and much less than abstractly referencing the idea of something being more or less likely.

That said, representative samples don't easily capture the continuous nature of probability, so turning to area is a nice alternative, not just because of the continuity, but also because it's way easier to sketch out when you're sitting there with pencil and paper, puzzling over some problem. You see, people often think of probability as the study of uncertainty, and that is of course how it's applied in science, but the actual math of probability, where all the formulas come from, is just the math of proportions, and in that context turning to geometry is exceedingly helpful.

I mean, take a look at Bayes' theorem as a statement about proportions, whether that's proportions of people, of areas, whatever. Once you digest what it's saying, it's actually kind of obvious. Both sides tell you to look at the cases where the evidence is true, and then to consider the proportion of those cases where the hypothesis is also true. That's it, that's all it's saying; the right-hand side just spells out how to compute it. What's noteworthy is that such a straightforward fact about proportions can become hugely significant for science, for artificial intelligence, and really any situation where you want to quantify belief. I hope to give you a better glimpse of that as we get into more examples. But before more examples, we have a little bit of unfinished business with Steve.
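One quick aside before getting back to him: the geometric reading above condenses into just a few lines of code. This is a minimal sketch using the same illustrative estimates as before, where each term is a width times a height, an area of a rectangle in the 1×1 square:

```python
def posterior_by_area(prior, likelihood, likelihood_not):
    """P(H|E) read off the unit square: each term below is width * height."""
    hypothesis_and_evidence = prior * likelihood                 # left rectangle
    not_hypothesis_and_evidence = (1 - prior) * likelihood_not   # right rectangle
    return hypothesis_and_evidence / (hypothesis_and_evidence + not_hypothesis_and_evidence)

# Steve: prior 1/21, with likelihoods 0.4 (librarian) and 0.1 (farmer).
print(posterior_by_area(1 / 21, 0.4, 0.1))  # 0.1666...

# Irrelevant evidence: equal likelihoods leave the prior unchanged.
print(posterior_by_area(1 / 21, 0.4, 0.4))  # 0.0476..., which is 1/21
```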
As I mentioned, some psychologists debate Kahneman and Tversky's conclusion that the rational thing to do is to bring to mind the ratio of farmers to librarians. They complain that the context is ambiguous. I mean, who is Steve, exactly? Should you expect that he's a randomly sampled American? Or would it be better to assume that he's a friend of the two psychologists interrogating you? Or maybe someone you're personally likely to know? This assumption determines the prior. I, for one, run into way more librarians in a given month than I do farmers. And needless to say, the probability of a librarian or a farmer fitting this description is highly open to interpretation. For our purposes, understanding the math, what I want to emphasize is that any question worth debating here can be pictured in the context of the diagram. Questions about the context shift around the prior, and questions about the personalities and stereotypes shift around the relevant likelihoods.

All that said, whether or not you buy this particular experiment, the ultimate point, that evidence should not determine beliefs but update them, is worth tattooing in your brain. I'm in no position to say whether this does or does not run against natural human instinct; we'll leave that to the psychologists. What's more interesting to me is how we can reprogram our intuition to authentically reflect the implications of math, and bringing to mind the right image can often do just that.
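To put rough numbers on how much those shifts can matter, you can hold the likelihoods fixed and vary the prior. In this sketch, the even-odds prior is a purely hypothetical stand-in for readings like "Steve is someone you personally know"; only the 1-in-21 figure comes from the example above:

```python
def posterior(prior, likelihood, likelihood_not):
    # Bayes' theorem with the denominator expanded over H and not-H.
    return prior * likelihood / (prior * likelihood + (1 - prior) * likelihood_not)

# Same illustrative likelihoods (0.4 vs 0.1); only the prior changes.
scenarios = [
    ("random American, 20 farmers per librarian", 1 / 21),
    ("someone you know (hypothetical even odds)", 0.5),
]
for label, prior in scenarios:
    print(f"{label}: {posterior(prior, 0.4, 0.1):.1%}")
# -> about 16.7% under the first prior, 80.0% under the second.
```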