Transcript for:
Histograms and Data Visualization

All right, let's talk about one method for visually displaying continuous data: histograms. Again, as I said at the conclusion of the last section, means, medians, and standard deviations don't tell the whole story. Differences in the shapes of distributions might be interesting, and they wouldn't be captured by those single-number summaries. So histograms are a way of displaying the distribution of a set of data by charting the number, or percentage, of observations whose values fall within predefined numerical ranges. A histogram looks more or less like a bar chart, except that each bar covers a range of numeric values rather than a category. So let's just give you an example. Consider the following data collected from the 1995 Statistical Abstract of the United States. This is older data, but for each of the 50 United States, the proportion of individuals over 65 years of age has been recorded. So here's the data in its raw form. So even though 50 isn't a particularly large size in the world of data sets, it's pretty hard to get the general gist of what's going on here by looking at 50 measures. And this is where a mean and standard deviation would be useful. But in addition, it may be worth looking at a picture as well. So here are just some interesting facts. Alaska had the smallest percentage, at 4.6%. You wouldn't necessarily be able to see that by looking at the table, but I'm telling you now. And Florida had the highest percentage of residents over 65 years of age, at 18.4%. So how could we visually display this to figure out what the smallest and largest values are and how the data in this group are distributed around their center? So here's what we're going to do. We're going to break the data into mutually exclusive, equally sized bins. I'm only going to do this once by hand. You're never going to do it by hand. And I will show you, in an optional add-on section to these lecture slides, how to do this stuff in Stata.
But here's how we would do it if we were on a desert island and Stata-less, but had data we wanted to analyze. We would break the data range into mutually exclusive, equally sized bins. So the width is somewhat arbitrary, and we'll talk a little bit more about that, but here they're each 1% wide, starting at 4% because the lowest value was in the range of 4-5%, and going up to 19%. And then we count the number of observations in each bin. So look at the notation I'm using for the first bin, 4-5. There is a bracket on the left side and a parenthesis on the right. What that means is that we include the number 4 in our interval but go up to just before 5, that is, up to 4.9999 and so on. And then the next interval starts at 5, includes exactly 5, and goes up to just before 6, so that these are mutually exclusive and there's no double counting of values at the endpoints. There's only one observation between 4% and 5%. No observations between 5% and 6%. No observations between 6% and 7%. In fact, we don't even get any action until we hit the interval between 8% and 9%. And then you can see how it plays out from there on in. But again, this summarizes the data somewhat. It's not a single-number summary like the mean or standard deviation shown below, but it is more concise than the 50 values themselves. But again, this can be better displayed as a graph. Here on this page, you see the histogram, the graphic that corresponds to that tabular summary. So the bars on the graph track the number of states that fall in each of those percentage bins, and the horizontal axis tracks the percentage over 65 years of age. You can see in the 4 to 5 bin on the left there, there's a bar that goes up to 1 because there's only one state. And then there's nothing until we get to the 8 to 9 bin, and again it goes up to 1, and then we start to see a little more action.
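The hand-binning procedure just described can be sketched in a few lines of Python. This is only an illustration, not the Stata approach the lecture refers to: the function name `count_in_bins` is made up for this sketch, and apart from Alaska's 4.6 and Florida's 18.4 the sample values are invented, since the full table of 50 states is not reproduced in this transcript.

```python
def count_in_bins(values, start, stop, width=1):
    """Count values in half-open bins [lo, lo + width) covering [start, stop)."""
    n_bins = round((stop - start) / width)
    counts = [0] * n_bins
    for v in values:
        if not (start <= v < stop):
            raise ValueError(f"{v} falls outside [{start}, {stop})")
        # Floor division puts a boundary value such as 5.0 in the bin that
        # starts at 5, never in the bin that ends at 5 (no double counting).
        counts[int((v - start) // width)] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Illustrative values only: 4.6 (Alaska) and 18.4 (Florida) come from the
# lecture; every other number is made up for this sketch.
sample = [4.6, 8.8, 10.0, 11.2, 12.5, 12.9, 13.0, 14.1, 15.7, 18.4]

table = count_in_bins(sample, start=4, stop=19, width=1)
for lo, hi, count in table:
    # One '#' per observation gives a rough sideways histogram.
    print(f"[{lo:>2}, {hi:>2})  {'#' * count}")
```

Each printed row is one bin in the bracket-and-parenthesis notation, so a value of exactly 10.0 shows up in the bin that starts at 10 and not in the bin that starts at 9.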
And then you can see that the majority of the data falls between 10 and 16 percent, and seems to be centered in the range of 12 to 13 percent, around the mean, basically. And then we have another lone point, the state of Florida, that falls in the 18 to 19 percent bin. So we get a nice summary, not only of where the center is and how much spread there is, which we also got from the mean and standard deviation, but of how these observations spread themselves out about the mean. You can see it's roughly symmetric, a little longer on the left than the right, but roughly symmetric around that center.
Okay, suppose we have blood pressure data on a sample of 113 men. And the sample mean is 123.6 millimeters of mercury, and here's the median. The sample standard deviation is 12.9 millimeters of mercury. So we kind of know where things lie in terms of the middle and how much they're spread around that mean, but we don't necessarily know the shape of the data. Is it symmetric about that measure of center? Is it skewed in one direction or the other? And that's where a graphic will come in. So here's a histogram of the systolic blood pressure for this sample. And here, the bin width in which we classify the observations was arbitrarily set to 5 mmHg. And again, here the height of each bar represents the number of men, among the total of 113, who fall in the bin. So you can see in the bin that covers 90 to 95, there's only one person. And then we see five people in the bin that's 100 to 105. And you can see this data actually looks relatively similar in shape to the previous data set, which has nothing to do with blood pressure, well, not explicitly anyway. It's pretty symmetric, relatively centered at its mean and median, and somewhat bell-shaped. Now, like I said, this bin size is somewhat arbitrary. We could make these bins wider, for example. Here's a histogram where each bar has a width of 20 millimeters of mercury.
But you notice that that actually is a little too crude to get the essence. It doesn't contradict what we saw in the other picture, but it makes it look more glompy. That's not really a word, but it is now. Please try and use that in your general lexicon. And it loses some of the detail; it's a little too crude. Here, on the other hand, is a histogram of the same blood pressure information on the 113 men, but here the bin width is 1 mmHg. You can see this is almost too fine to actually get the essence. Remember, we're about data summarization to some extent, and this may basically give us the same information as looking at the 113 numbers on their own. So you can see that we can oversimplify by making the bins too wide, or make it too complicated by making them too thin. And we'll talk about the optimal size, but rest assured the computer will choose that for you unless you specify otherwise. And just before we talk about the optimal bin size: instead of putting the number of observations on the vertical axis, we could put the percentage. So we can convert things from the raw number in each bin to the percentage of the 113 observations that fall in that bin. So here's another way to present this data. And if you added up the heights of the bars on this, they should add up not to the total sample size, but to 100%.
So how many intervals should you have in a histogram? Well, there's no perfect answer to this. It really depends on the sample size and the actual spread in your data. There's a rule of thumb if you're doing this by hand, which again you never will be, so you can pretty much ignore this, but it is interesting. The rule of thumb is that the number of intervals should be roughly equal to the square root of the sample size. So if you had 10 observations, you'd need about 3 bins. You basically estimate the width of each bin by taking the data range and dividing by 3. For 50 observations, about 7 bins; for 100, about 10; and so on.
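As a quick check on that square-root rule of thumb, here is a minimal Python sketch. The function names are made up for illustration, and rounding to the nearest whole number is an assumption, since the lecture only says "roughly equal".

```python
import math


def rule_of_thumb_bins(n):
    """Rule-of-thumb bin count: square root of the sample size, rounded.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return max(1, round(math.sqrt(n)))


def bin_width(values, n_bins):
    """Estimated width of each bin: the data range divided by the bin count."""
    return (max(values) - min(values)) / n_bins


# Sample sizes mentioned in the lecture, plus the 113 men in the
# blood pressure example.
for n in (10, 50, 100, 113):
    print(n, "observations ->", rule_of_thumb_bins(n), "bins")

# For the 50 states, the range runs from 4.6% (Alaska) to 18.4% (Florida),
# so only the two extremes are needed to estimate the bin width.
print(round(bin_width([4.6, 18.4], rule_of_thumb_bins(50)), 2))
```

For the state data this works out to bins about 2 percentage points wide, somewhat coarser than the 1% bins used in the hand-built table.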
But unless you tell it otherwise, the computer will choose this optimal number of bins and set the bin width for you. And you can see some evidence of that in the optional section. Histograms are a nice way to present a picture of continuous data, but there are other options which we'll explore in the next section.
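Two ideas from this section, rescaling bar heights from counts to percentages and widening the bins, can also be sketched in Python. Treat this as an illustration under stated assumptions: the helper names are hypothetical, and the bin counts are invented except for the two read off the histogram in the lecture (1 man in the 90 to 95 bin and 5 men in the 100 to 105 bin). The invented counts were simply chosen to add up to 113.

```python
def to_percentages(counts):
    """Rescale bin counts so the bar heights add up to 100."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations to convert")
    return [100 * c / total for c in counts]


def merge_bins(counts, factor):
    """Widen the bins by adding up each run of `factor` adjacent counts."""
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]


# Hypothetical counts for twelve 5 mmHg bins starting at 90. Only the first
# (90 to 95: 1 man) and third (100 to 105: 5 men) come from the lecture;
# the rest are made up so that the total is 113.
counts = [1, 0, 5, 9, 14, 20, 22, 18, 12, 7, 3, 2]

percentages = to_percentages(counts)
print([round(p, 1) for p in percentages])

# Merging four 5 mmHg bins at a time gives three 20 mmHg bins, which hides
# most of the shape visible at the finer width.
print(merge_bins(counts, 4))
```

The percentages carry the same shape as the counts, only on a different vertical scale, while the merged bins show how a 20 mmHg width throws away detail.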