The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
Hello, everybody. Well, here we are at the last lecture. We're going to finish talking about statistical sins and then do a little bit of a wrap-up.
All right, let's look at a hot topic: global warming, fact or fiction. You've done a problem set related to temperatures in the U.S. Here's a generally accepted plot of the change in temperatures on the planet between 1880 and 2014. Now if we look at this plot, we can see it commits one of the statistical sins I complained about on Monday. Look where it starts the y-axis, way down here at 55. And you remember I told you to beware of charts where the y-axis doesn't start at zero. So maybe the people making claims about global warming are just deceiving us with this trick of the axis. Here's what happens if you start it at zero. And as you can see, or barely see, this axis runs from zero up to 110 as the average temperature.
And as you can see quite clearly, it's hardly changed at all. So what's the deal here? Which is the more accurate presentation of the facts, which conveys the more accurate impression? Let's look at another example, maybe a little less controversial than climate change.
Fever and flu. It's generally accepted that when you get the flu, you might run a fever. So here is someone who had the flu, and this is plotting their fever from the beginning to its peak. And it does appear, if we were to fit a curve to this, it would look pretty much like that. On the other hand, if we assume that somebody's temperature could range between 0 and 200, we can see that, in fact, your temperature doesn't move at all when you get the flu.
So the moral is pretty clear, I think. Even though on Monday I talked about being suspicious when people start the y-axis too far from zero, it's fine to truncate the axis to eliminate totally preposterous values, right? No living person has a temperature of zero degrees Fahrenheit. So don't truncate the axis just to make something look like what it isn't,
but don't expand it to deceive either. All right, let's return to global warming. This is a chart that was actually shown on the floor of the U.S.
Senate by a senator from Texas, who I shall not name. And obviously, the argument here was, well, sure, global warming bounces up and down, but if we go back... We can see here the date is... can we see it? I can see it. Maybe. 1986, I think. You can see that the argument here is that if you fit a trend line to this, as he's done, it hasn't changed at all.
And so even though we've had a lot of carbon emissions during this period, maybe global warming is not actually happening. This is in contradiction to the trend I showed before. Well, what's going on here? This is a very common way that people use statistics poorly.
They confuse fluctuations with trends. In any series of data, a time series or any other, you always have fluctuations, and those are not to be confused with the trend. In particular, when you're looking at a phenomenon, you need to choose an interval consistent with the thing being considered.
So we believe that climate change is something that happens over very long periods of time. And it's a little bit silly to look at it on a short period of time. Some of you may remember two years ago we had a very cold winter here. And there were people who were saying, well that shows we don't have global warming. Well you can't really conclude anything about climate change looking at a year.
Or probably not even looking at 10 years or 20 years. It's a very slow phenomenon. On the other hand, if you're looking at the change in somebody's heart rate, seeing if they have a heart condition, you probably don't want to look at it over a 10-year period. So you have to decide what you're doing and find an interval that lets you look at the trends rather than the fluctuations. At any rate, maybe even if we're having global warming, at least the Arctic ice isn't melting.
Though, apparently, I read in the paper this morning, they found a huge crack in it. So this was reported in the Financial Post on April 15, 2013. You can read it yourself. But the basic import of it is they took the period from April 14, 1989, to... April 15, 2013, and said, look, it's not changing. In fact, the amount of Arctic ice is unchanged.
Well, what's the statistical sin being committed here? If we look at this data, this is an anomaly chart.
I think you saw one of these in one of the problem sets, where you fix something at zero and then show fluctuations relative to that. So here it's the Arctic ice relative to a reference point. And what we see is that if you go and choose the right date in 1989, say this one, and you come over here and choose the right date in 2013, say this one, you can then draw a line and say, oh look, it hasn't changed.
This is something people frequently do: they take a whole set of data, find two points that are consistent with something they believe, fit a curve to those two points, and draw some conclusion. This is what we call cherry picking, I guess from the notion that when you go to pick cherries, you only pick the ripe ones and leave the others to ripen for a bit on the tree. It's really bad, and it's something that, unfortunately, the scientific literature is replete with: people look at a lot of data and pick the points that match what they want to prove. And so, as you can see, while the trend is quite clear, you could prove almost anything you wanted by selecting two points very carefully. I could also show that the ice is crashing much faster than people think it is by picking these two points.
You know, if I wanted to argue that it's catastrophic, I'd pick those two points and say, look at that. It's disappearing at an incredible rate. So you can lie in either direction with this data by careful cherry-picking. All right. As a service to you, I know the holidays are coming, and many of you have not bought presents for your parents, so here's a modest gift suggestion.
That, you know, the family that shoots together something or other. Well, all right. So we can ask, is this a good gift?
Well, probably. We can look at this statistic. It's not dangerous, at least. We see that 99.8% of the firearms in the U.S. will not be used to commit a violent crime. So guns apparently are not actually dangerous, or at least not in the hands of criminals.
Well, let's look at this. How many privately owned firearms are there in the U.S.? Anyone want to guess who hasn't looked ahead?
Yeah. Four hundred million. Four hundred million. Three hundred and forty million people and four hundred million guns is the guess.
More than one per person. You're certainly in the right order of magnitude. I think it's about three hundred million, but it's hard to count them. Maybe this doesn't count water pistols.
So if you assume there are three hundred million firearms and 0.2% of them are used to commit a violent crime every year, how many crimes is that? 600,000.
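Just to check the arithmetic, here's a back-of-the-envelope sketch using the lecture's round numbers (these are not official statistics):

```python
# The lecture's round numbers, not official statistics.
num_firearms = 300_000_000   # rough estimate of privately owned firearms
frac_in_crime = 0.002        # 0.2% used to commit a violent crime each year

crimes_per_year = num_firearms * frac_in_crime
print(int(crimes_per_year))  # 600000
```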
So in fact, it's not necessarily very meaningful to say that most of them are not used to commit a crime. Well, let's look at another place where a statistic comes up. Probably most of you don't even remember the scary swine flu epidemic. This was a big headline. People got so scared of the swine flu that they were doing things like closing schools to try to limit its spread. New York City closed some schools because of it, for example.
So is this a scary statistic? Well, maybe, but here's an interesting statistic. How many deaths per year from the seasonal flu in the U.S.? The ones we try and prevent with a flu shot, 36,000. So what we see is that it doesn't make a lot of sense to panic over 159 in the light of this number.
Okay. So the point here for both this and the issue about the firearms is that context matters. Yeah, I love this cartoon.
A number without context is just a number, and numbers by themselves don't mean anything. So to say that there were 159 deaths from the swine flu is not very meaningful without some context.
To say that only 0.2% of firearms are used to commit a violent crime is not very meaningful without context. Whenever you're presenting a statistic, or reading about one, and you just see a number that seems comforting or terrifying, try to put some context around it. A related question is: relative to what? Suppose I told you that skipping lectures increases your probability of failing this course by 50%. Well, you would all feel great because you're here, and you'd be laughing at your friends who are not, figuring that leaves much better grades for you. What does this mean, though? Well, if I told you that it changed the probability of failing from 0.5 to 0.75, you would be very tempted to come to lectures.
On the other hand, if I told you that it changed the probability from 0.005 to 0.0075, you might say, the heck with it, I'd rather go to the gym. Right. Again, this is an issue and this is something that we see all the time when people talk about percentage change.
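Here's a small sketch of the point, using the hypothetical numbers above (the function name is mine, just for illustration): the same relative change can describe wildly different absolute changes.

```python
def relative_increase(p_before, p_after):
    """Percentage change in a probability; note that it hides the absolute risk."""
    return 100 * (p_after - p_before) / p_before

# Both of these are a "50% increase" in the probability of failing,
# but the absolute risks differ by two orders of magnitude.
print(round(relative_increase(0.5, 0.75), 6))      # 50.0
print(round(relative_increase(0.005, 0.0075), 6))  # 50.0
```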
This is particularly prominent in the pharmaceutical field. You will read a headline saying that drug X for arthritis increases the probability of a heart attack by 1% or 5%. Well, what does that mean?
If the probability was already very low, increasing it by 5%, it's still very low. And maybe it's worth it not to be in pain from arthritis. So talking in percentages is, again, one of these issues of, it doesn't make sense without the context.
In order to know what this means, I need to know what regime I'm in. In order to make an intelligent decision about whether to attend lecture or not... well, it goes without saying you've all made the right decision. So beware of percentage change when you don't know the denominator, right? You get a percentage by dividing by something, and if you don't know what you're dividing by, then the percentage is itself a meaningless number. All right. While we're on medical things, let's look at cancer clusters to illustrate another statistical question. This is the CDC's definition of a cancer cluster: a greater than expected number of cancer cases that occurs in a group of people in a geographic area over a period of time.
And the key part of this definition is "greater than expected." About 1,000 cancer clusters per year are reported in the U.S., mostly to the Centers for Disease Control but also to other health agencies. Upon analysis, almost none of them pass this test. The vast majority, some years all of them, are deemed actually not to be cancer clusters. I don't know if anyone here has seen the movie Erin Brockovich?
Subsequent analysis showed that was actually not a cancer cluster. So it was a good movie, but turns out statistically wrong. This, by the way, is not a cancer cluster.
This is a constellation. All right. So let's look at a hypothetical example.
By the way, the other movie about cancer clusters was the one set in Massachusetts. What was the name? A Civil Action. Anyone see that? No.
That was a cancer cluster. All right. Massachusetts is about 10,000 square miles. And there are about 36,000 cancer cases per year reported in Massachusetts. Those two numbers are accurate.
And the rest of this is pure fiction. So let's assume we had some ambitious attorney who partitioned the state into 1,000 regions of 10 square miles each, and looked at the distribution of cancer cases in these regions, trying to find cancer clusters that he or she could file a lawsuit about. Well, you can do some arithmetic: if there are 36,000 new cancer cases a year and we have 1,000 regions, we should expect about 36 cancer cases per year per region.
Well, when the attorney looked at the data, this mythical attorney, he discovered that region number 111 had 143 new cancer cases over a three-year period. He compared that to three times 36 and said, wow, that's 32% more than expected. I've got a lawsuit.
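The attorney's arithmetic does check out, for what it's worth. A quick sketch:

```python
cases_per_year = 36_000  # new cancer cases per year in Massachusetts
num_regions = 1_000      # hypothetical 10-square-mile regions
num_years = 3
observed = 143           # cases reported in region 111

expected = num_years * cases_per_year / num_regions  # 3 * 36 per region
excess = 100 * (observed - expected) / expected
print(expected, round(excess, 1))  # 108.0 32.4
```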
So he went to tell all these people they lived in a cancer cluster. The question is, should they be worried? Well, another way to look at the question is, how likely is it that it was just bad luck, right?
That's the question we've always asked when we do statistical analysis. Is this result meaningful or is it just random variation that you would expect to see? So I wrote some code to simulate it, to see what happens.
So number of cases, 36,000. Number of years, three. So all of this is just the numbers I had on the slide. We'll do a simulation.
We'll take 100 trials. And then what I'm going to do is for t in the range number of trials, the locations, the regions if you will, I'll initialize each to zero, a thousand of them in this case. And then for i in the range number of years times number of cases per year, so this will be 3 times 36,000, at random I will assign the case to one of these regions.
Okay, this is random; nothing to do with cancer clusters. Just at random, this case gets assigned to one of the thousand regions. And then I'm going to check if region number 111 had greater than or equal to 143 cases, the number we assumed it had. If so, we'll increment the variable num_greater by one, saying, all right, in this trial of the hundred, indeed it had that many.
And then we'll see how often that happens. That will tell us how improbable it is that region 111 actually had that many cases. And then we'll print it. Does that make sense to everyone?
Here I am doing my simulation to see how probable it is that region 111 would have had this many cases. Any questions? Let's run it.
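Here's a reconstruction of the simulation just described. It's a sketch: the names and exact structure are mine, not necessarily those of the lecture's code, and the demo call uses fewer trials than the lecture's 100 to keep it quick.

```python
import random

def prob_region_111(num_trials=100, num_regions=1000,
                    cases_per_year=36_000, num_years=3, threshold=143):
    """Estimate the probability that region 111 in particular sees at
    least `threshold` cases when each case lands in a region uniformly
    at random."""
    num_greater = 0
    for _ in range(num_trials):
        locs = [0] * num_regions                      # initialize the regions
        for _ in range(num_years * cases_per_year):   # 3 * 36,000 cases
            locs[random.randrange(num_regions)] += 1
        if locs[111] >= threshold:                    # did region 111 hit 143?
            num_greater += 1
    return num_greater / num_trials

random.seed(0)
print(prob_region_111(num_trials=20))  # the lecture used 100 trials
```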
So here's the code we just looked at. Takes just a second. That's why I did only 100 trials instead of 1,000.
I know the suspense is killing you. It's killing me. I don't know why it's taking so long.
Alright, we'll finish. I wish I had the Jeopardy music or something to play while we waited for this. Anna, can you hum some music or something to keep people amused?
She will not. Wow. All right.
So here it is. The estimated probability of region 111 having at least 143 cases. Easier to read if I spread this out: 0.01.
So it seems, in fact, that it's pretty surprising. Unlikely to have happened at random. Do you buy it, or is there a flaw here?
Getting back to this whole question. Yes? [An audience member suggests the regions are equal in area rather than in population.] All right. Let's assume that, in fact, instead of by square miles, the populations are balanced. [The audience member replies, inaudible.] Exactly right. Exactly right. I'm sorry I forgot my candy bag today.
Just means there'll be more candy for the final. What we have here is a variant of cherry picking. What I have done in this simulation is I've looked at a thousand different regions, right?
What the attorney did, which is not in the simulation, is he looked at a thousand different regions, found the one with the most cancer cases, and said, aha, there are too many here, right? And that's not what I did in my simulation. My simulation didn't ask the question, how likely is it that there is at least one region with that many cases?
But it asked the question, how likely is it that this specific region has that many cases? If the attorney had had reason in advance to be suspicious of region 111, then maybe it would have been okay to just go check that. But having looked at 1,000 and then cherry-picked the best is not right. So this is a simulation that does the right thing. I've left out the initialization, but what you can see I'm doing here is I'm looking at the probability of there being any region that has at least 143 cases.
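A sketch of that corrected simulation (again, my names, not necessarily the lecture's exact code): the only change is asking whether any of the 1,000 regions reaches 143 cases, rather than checking region 111 alone.

```python
import random

def prob_any_region(num_trials=100, num_regions=1000,
                    cases_per_year=36_000, num_years=3, threshold=143):
    """Estimate the probability that SOME region, not a pre-chosen one,
    sees at least `threshold` cases under purely random assignment."""
    num_greater = 0
    for _ in range(num_trials):
        locs = [0] * num_regions
        for _ in range(num_years * cases_per_year):
            locs[random.randrange(num_regions)] += 1
        if max(locs) >= threshold:   # any region over the threshold?
            num_greater += 1
    return num_greater / num_trials

random.seed(0)
print(prob_any_region(num_trials=10))  # lecture recalls roughly 0.6 with 100 trials
```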
What this is called in the technical literature, what the attorney did is multiple hypothesis checking. So rather than having a single hypothesis that region 111 is bad, he checked a thousand different hypotheses and then chose the one that met what he wanted. Now there are good statistical techniques that exist for dealing with multiple hypotheses.
Things like the Bonferroni correction. I love to say that name. But you have to worry about it.
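The Bonferroni correction itself is essentially one line: to keep the overall chance of a false positive at some level alpha while testing m hypotheses, test each individual hypothesis at level alpha divided by m. A minimal sketch:

```python
def bonferroni(alpha, num_hypotheses):
    """Per-hypothesis significance level that keeps the family-wise
    error rate at most `alpha` across all the tests."""
    return alpha / num_hypotheses

# Checking 1,000 regions while holding the overall level at 5% means
# each individual region must be significant at the 0.005% level.
print(bonferroni(0.05, 1000))
```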
And in fact, if we go back to the code and comment out this one and run this one, we'll see we get a very different answer. All right, let me just comment this out. Yeah, this should work, right? Well, maybe you don't want to wait for it, but the answer you'll get is that it's actually very probable.
My recollection is it's a 0.6 probability that at least one region has that many cases. And that's really what's going on with this whole business of people reporting cancer clusters, right? It's just by accident, by pure randomness, some region has more than its share.
This particular form of cherry picking also goes by the name of the Texas sharpshooter fallacy. I don't know why people pick on Texas for this, but they seem to. The notion is you're driving down a road in Texas and you see a barn with a bunch of bullet holes in the wall, right in the middle of a target. But what actually happened was the farmer just shot at the barn at random, then got out his paintbrush and painted a target right around where the bullets happened to land. And that's what happens when you cherry-pick hypotheses. All right. What's the bottom line of all these statistical fallacies?
When drawing inferences from data, skepticism is merited. There are, unfortunately, more ways to go wrong than to go right. And there is literature arguing that more than half of published scientific papers are later shown to be wrong. You do need to remember, though, that skepticism and denial are different.
It's good to be skeptical, and I love Ambrose Bierce's description of the difference. If you have never read Ambrose Bierce, he's well worth reading. He wrote something called The Devil's Dictionary, among other things, in which he gives his own definitions of a lot of words. And he went by the nickname Bitter Bierce.
And if you read the Devil's Dictionary, you'll see why. But this, I think, has a lot of wisdom in it. All right.
Let's, in the remaining few minutes, wrap up the course. So what did we cover in 6.0002? A lot of things. If you look at the technical things, there were three major units. Optimization problems, stochastic thinking, and modeling aspects of the world.
But there was a big subtext running through all of it. There was a reason our problem sets were not pencil-and-paper probability problems but all coding: an important part of the course is making you a better programmer. We introduced a few extra features of Python, but more importantly, we emphasized the use of libraries, because in the real world, when you're trying to build things, you rarely start from scratch, and if you do start from scratch, you're probably making a mistake. And so we wanted to get you used to the idea of finding and using libraries. We looked at plotting libraries, machine learning libraries, and numeric libraries. Hopefully you got a lot of practice and are a way better programmer than you were six weeks ago. In a little more detail: for the optimization problems, probably the most important takeaway is that many important problems can be formulated in terms of an objective function that you either maximize or minimize, plus some set of constraints.
Once you've done that, there are lots of toolboxes, lots of libraries, that you can use to solve the problem. You wrote some optimization code yourself, but most of the time we don't solve these problems ourselves; we just call a built-in function that does it. So the hard part is not writing the code, but doing the formulation. We talked about different algorithms. Greedy algorithms are very often useful but often don't find the optimal solution. So, for example, we looked at k-means clustering.
It was a very efficient way to find clusters, but it did not necessarily find the optimal set of clusters. We then observed that many optimization problems are inherently exponential, but even so, dynamic programming often works and gives us a really fast solution. And the notion here is this is not an approximate solution. It's not like using a greedy algorithm.
It gives you an exact solution, and in many circumstances, it gives it to you quickly. And the other thing I want you to take away is, outside the context of dynamic programming, memoization is a generally useful technique. What we've done there is we've traded time for space.
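A minimal sketch of memoization, using Fibonacci as the stock example (illustrative code, not the course's exact implementation):

```python
def fib_memo(n, memo=None):
    """Fibonacci with memoization: cache each result so every
    subproblem is computed only once."""
    if memo is None:
        memo = {}
    if n < 2:
        return n
    if n not in memo:   # look it up before recomputing
        memo[n] = fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    return memo[n]

print(fib_memo(100))  # instant; naive recursion would never finish
```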
We compute something, we save it, and when we need it, we look it up. And that's a very common programming technique. And we looked at a lot of different examples of optimization, knapsack problems, several graph problems, curve fitting, clustering, logistic regression.
Those are all optimization problems, or can all be formulated as optimization problems. So it's very powerful, and it fits lots of needs. The next unit was stochastic thinking. Of course, I'm speaking as if these units were discrete in time, but they're not; we talked about optimization at the beginning of the term and again last week, so these topics were spread out. The basic notion of stochastic thinking is that the world is non-deterministic, or at least not predictable.
And therefore we need to think about things in terms of probabilities much of the time. Randomness is a powerful tool for building computations that model the world: if you think the world is stochastic, then you need ways to write stochastic programs if you're trying to model the world itself. The other point we made is that randomness is a computational technique.
It's useful even for problems that don't appear to involve any randomness. So we used it to find the value of pi. We showed you can use it to do integration. There's nothing random about the value of the integral of a function. Yet the easiest way to solve it in a program is to use randomness.
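For instance, here's a minimal Monte Carlo estimate of pi along the lines we discussed: drop random points in the unit square and count how many land inside the quarter circle. This is a sketch, not the lecture's exact code.

```python
import random

def estimate_pi(num_points=100_000):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that fall inside the quarter circle approximates pi/4."""
    in_circle = 0
    for _ in range(num_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            in_circle += 1
    return 4 * in_circle / num_points

random.seed(0)
print(estimate_pi())  # close to 3.14159, and closer with more points
```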
So randomness is a very powerful tool, and there's a whole area of randomized algorithms, both a research area and a practical one, devoted to using randomness to solve non-probabilistic problems. All right. Modeling the world. Well, we just talked about part of it.
Models are always inaccurate. They're providing some abstraction of reality. We looked at deterministic models, the graph theory models.
There was nothing non-deterministic about the graphs we looked at. And then we spent more time on statistical models. We looked at simulation models, in particular, spent quite a bit of time on Monte Carlo simulation.
We looked at models based on sampling. And there, as with simulation, I really hope I emphasized enough the notion that we need to be able to characterize how believable the results are. It's not good enough to just run a program and say, oh, it has an answer. You need to know whether to believe the answer, and the point we made is that it's not a binary question. It's not yes, it's right; no, it's wrong.
Typically what we do is we have some statement about confidence intervals and confidence levels, right? We use two variables to describe how believable the answer is. And that's an important thing. And then we looked at tools we use for doing that.
We looked at the central limit theorem. We looked at the empirical rule. We talked about different distributions.
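As a reminder of how the empirical rule turns trial results into a confidence interval (a sketch with made-up data, not the course's exact code): given many trial estimates, a 95% confidence interval is the mean plus or minus 1.96 standard deviations.

```python
import random
import statistics

def confidence_interval_95(estimates):
    """95% confidence interval via the empirical rule: assuming the
    estimates are roughly normal, about 95% lie within 1.96 standard
    deviations of the mean."""
    mean = statistics.mean(estimates)
    sd = statistics.stdev(estimates)
    return mean - 1.96 * sd, mean + 1.96 * sd

# Pretend these are the results of 1,000 trials of some simulation.
random.seed(0)
estimates = [random.gauss(100, 15) for _ in range(1000)]
low, high = confidence_interval_95(estimates)
print(round(low, 1), round(high, 1))
```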
And especially we spent a fair amount of time on the normal or Gaussian distribution. And then finally we looked at statistical models based upon machine learning. We looked at unsupervised learning, basically just clustering.
Looked at two algorithms, hierarchical and k-means. And we looked at supervised learning. And there we essentially focused mostly on classification.
And we looked at two ways of doing that, k-nearest neighbors and logistic regression. Finally, we talked about presentation of data: how to build plots, the utility of plots, and, in the last two lectures, good and bad practices in presenting results about data. All right. My summary is I hope you feel you've come a long way, particularly those of you... how many of you were here in September when we started 6.0001? All right, most of you.
This, by the way, was a very popular ad for a long time saying that finally women are allowed to smoke. Isn't this great? And Virginia Slims sponsored the Women's Tennis Tour to show how good it was that women were now able to smoke.
But anyway, I know not everyone in this class is a woman, so just for the men in the room, you too could have come a long way. I hope you think that if you look back at how you struggled in those early problem sets, I hope you really feel that you've learned a lot about how to build programs. And if you spend enough time in front of a terminal, this is what you get to look like.
All right, what might be next? I should start by saying this is a hard course. We know that many of you worked hard, and the staff and I really do appreciate it.
You know your return on investment. I'd like you to remember that you can now write programs to do useful things. So, you know, if you're doing a UROP, you're sitting in a lab, and you get a bunch of data from some experiments, Don't just stare at it.
Sit down and write some code to plot it, to do something useful with it. Don't be afraid to write programs to help you out. There are some courses I think you're now well prepared to take. I've listed the ones I know best, the courses in Course 6. 6.009 is a sort of introduction to computer science.
I think many of you will find it too easy after taking this course, but maybe that's not a downside. 6.005 is a software engineering course where they'll switch programming languages on you; you get to program in Java. 6.006 is an algorithms course in Python, actually quite interesting; students seem to like it a lot and learn about algorithms and implementing them. 6.034 is an introduction to artificial intelligence, also in Python. And I should have listed 6.036, another introduction to machine learning in Python. You should go look for an interesting UROP.
A lot of students come out of this course and go do UROPs where they use what they've learned here, and many of them have a very positive experience. So if you were worried that you're not ready for a UROP, you probably are ready, using what's been done here. You can minor in computer science; this is available for the first time this year.
But really, if you have time, you should major in computer science, because it is really the best major on campus. Not even close, as somebody I know would say. All right.
Finally, sometimes people ask me where I think computing is headed, and I'll quote one of my favorite baseball players: it's tough to make predictions, especially about the future. So instead of my predictions, I'm going to show you the predictions of some famous people. Thomas Watson, who was the chairman of IBM, a company you've probably heard of, said, I think there's a world market for maybe five computers. This was in response to whether they should become a computer company, which they were not at the time.
He was off by a little bit. Some few years later, there was an article in Popular Mechanics, which was saying computers are amazing. They're going to change enormously.
Someday, they may be no more than one and a half tons. You might get a computer that's no more than 3,000 pounds someday. We're still waiting for that, I guess.
I like this one, having written a book recently. This is from the editor in charge of books for Prentice Hall.
I have traveled the length and breadth of this country and talked with the best people. And I can assure you that data processing is a fad that won't last out the year. MIT had that attitude for a while, for about 35 years.
Computer science was in a building off campus because they weren't sure we were here to stay. Maybe that's not why, but that's how I interpret it. Ken Olsen, an MIT graduate, a Course 6 graduate I should say, was the founder, president, and chairman of Digital Equipment Corporation, which in 1977 was the second largest computer manufacturer in the world. It was based in Maynard, Massachusetts. None of you have probably ever heard of it. They disappeared, and this is in part why: because Ken said there's no reason anyone would want a computer in their home.
And totally missed that part of computation. All right, finally, since this is the end: some famous last words. Douglas Fairbanks Sr., a famous actor. This is true: the last thing he said before he died was, never felt better. Amazing. This picture is from the movie The Mark of Zorro. Scientists are better at this sort of prediction. Luther Burbank's last words were, I don't feel so good. And, well, I guess not. And this is the last one.
John Sedgwick was a Union general in the Civil War. This is a true story. He was riding behind the lines and trying to rally his men to not hide behind the stone walls, but to stand up and shoot at the enemy. And he said, they couldn't hit an elephant at this distance.
Moments later, he was shot in the face and died. I thought this was an apocryphal story, but in fact there's a plaque at the battlefield where this happened documenting the story, and apparently it's quite true. So with that, I'll say my last words for the course: I appreciate you all coming, and I guess you're the survivors, so thank you for being here.