In this lecture section, we'll look at a few other visual presentations for continuous data: stem and leaf plots and box plots. Stem and leaf plots aren't so commonly used in the literature, but box plots are very much in favor, and I like them very much, so I wanted to show them to you.
We'll use them throughout the rest of the course. Suppose we take another look at our random sample of 113 men and their blood pressure measurements. As we saw before, one tool for visualizing the data, above and beyond single summary measures like means, medians, and standard deviations, is the histogram.
And this is one of the histograms we featured in the last section. Another common tool for visually displaying continuous data is called the stem and leaf plot. And basically it actually looks very similar to a histogram, like a histogram you knocked on its side.
What it allows for above and beyond the histogram is the identification of individual values in the sample from the graphic. When you look at a histogram you can talk about the percentage of values or the number of values, depending on how it's displayed, that fall within a certain range, but you can't necessarily identify the exact numerical values that are in that range. Whereas with a stem and leaf you can do that as well. So here's our stem and leaf plot. Alright, everybody turn your head or turn your monitor and you'll see that it looks just like the histogram in shape, but instead of having solid bars, it's made up of numbers.
And there are two pieces to the display. On the left hand side, before that vertical dashed line, are some numbers followed by some symbols, and these are called the stems. Each of these represents a bin of fixed width here, and I'll explain more about that in a minute. On the right hand side of the dashed line are sets of individual numbers called the leaves. In order to recreate any single observation in the data set, you look for the stem row, and you take the number part of the stem and concatenate it with the number part of any single leaf.
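To make that concatenation rule concrete, here is a minimal sketch in Python. The function name `decode_row` is made up for illustration. The "8." row with a leaf of 9 and the five 6s on the "11." stem come from the plot discussed in this lecture (other leaves on those rows are left out), and the "9*" row is invented.

```python
def decode_row(stem, leaves):
    """Rebuild the observations in one stem-and-leaf row.

    The stem's trailing symbol ('*' or '.') only marks which bin the row
    belongs to, so it is dropped. The remaining stem digits are then
    joined with each leaf digit to give one observation per leaf.
    """
    stem_digits = stem.rstrip("*.")
    return [int(stem_digits + leaf) for leaf in leaves]


# Illustrative rows: the "9*" row is invented, the others are partial rows
# taken from the lecture's description of the plot.
rows = [("8.", "9"), ("9*", "024"), ("11.", "66666")]
for stem, leaves in rows:
    print(stem, "|", leaves, "->", decode_row(stem, leaves))
```

Each leaf turns into exactly one observation, which is what lets you read individual values off the plot in a way a histogram does not allow.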
So, for example, in that first row there where the stem says 8 dot and the leaf says 9, that 9 represents a single observation whose value is 89 millimeters of mercury. To actually get the value of the observation, you concatenate the stem and leaf. Additionally, you can see, for example, that there are five persons in this sample of 113 who have blood pressure measurements of 116. That's why in the stem that reads 11 dot, there are five leaves of value 6. What do the 8 dot, 9 star, and 9 dot mean? Well, the 9 star, 9 dot, 10 star, 10 dot just mean that it's breaking up the 90s into two bins, 90 to 94 and 95 to 99, and the hundreds into two bins, 100 to 104 and 105 to 109. You'll see different variants on this depending on how much data there is. But the fact that a number is repeated in the stem just means it's breaking up that set of tens or hundreds, depending on the units of the data, into bins, similar to a histogram.
Another common visual display tool is called the box plot. It gives good insight into the distribution in terms of the skewness and what we'll call outlying values: things that are extreme, values that are different from most of the rest of the data. We'll talk more about that.
It's a very nice tool for easily comparing the distribution of continuous data across multiple groups. Things can be plotted side by side nicely. So here's the box plot displaying the distribution for the 113 males. It's a little more concise in its summarization than either the histogram or the stem and leaf, but it conveys similar information about the center, shape, and spread.
So anybody who's seen Star Wars will recognize this shape for sure, this box plot here. But let's talk about what we're seeing in a box plot and what defines it. Notice the reason it's called a box plot is there's a box, a shaded box in the middle.
And the line that appears between the two sides of this box, if you trace it over to the vertical axis, the value it intersects gives the sample median blood pressure for these 113 men. So that line corresponds to the median. The sides of the box: the upper side gives what's called the 75th percentile of the sample, and the lower side gives the 25th percentile. The median is the 50th percentile, the middle of all the data. You can think of the 75th percentile in two ways.
If we were to order the data from lowest to highest, it's the value that cuts off 75% to the left and 25% to the right. Or if you were to look at the middle value between the median and the largest value, that would be your 75th percentile. Similarly, the 25th percentile is the value that cuts off 25% to the left and 75% to the right. Or it's the middle value between the lowest value and the median.
These wings, or whiskers, on the box plot, in this case represent the value of the largest observation in the data set, and the smallest, respectively. So what are we seeing here? We see that the 75th percentile and 25th percentile are a similar distance from the 50th percentile, the median, which suggests that the inner 50% is pretty well balanced or symmetric.
And similarly, the largest and smallest values are at a similar distance from the 75th and 25th percentiles respectively, which shows that the rest of the data is similarly symmetric. And so this picture here is an example of a box plot when the data is relatively symmetric and would correspond to a histogram that shows that symmetry. Let's look at a different type of data set and see what we get with histograms and box plots.
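Before moving on, the quantities that define the box plot can be sketched in a few lines of Python. This uses the "middle value of each half" description of the 25th and 75th percentiles given above. Statistical packages often interpolate a little differently, so treat the exact numbers as one convention among several. The function name and the readings are made up for illustration and are not the lecture's sample.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max).

    The 25th and 75th percentiles are taken as the medians of the lower
    and upper halves of the sorted data. With an odd number of values the
    middle observation is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]
    upper = data[n // 2 + n % 2:]
    return data[0], median(lower), median(data), median(upper), data[-1]


# Made-up blood pressure readings in mmHg.
sample = [89, 98, 104, 110, 116, 116, 123, 125, 131, 140, 151]
lo, q1, med, q3, hi = five_number_summary(sample)
print("min", lo, "25th", q1, "median", med, "75th", q3, "max", hi)
# Similar distances on both sides suggest a roughly symmetric distribution.
print("box:      median - 25th =", med - q1, "| 75th - median =", q3 - med)
print("whiskers: 25th - min    =", q1 - lo, "| max - 75th    =", hi - q3)
```

Comparing the two box distances and the two whisker distances is the numeric version of eyeballing whether the box plot looks balanced.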
Suppose we took a representative sample of discharge records from a thousand patients discharged from a large teaching hospital in a single year and we were interested in their length of stay. How could we visualize the distribution of length of stay values? Well, here's one approach, a histogram. What do you notice about this histogram? So far we've been spoiled.
Every histogram we've looked at has been symmetric and well balanced around the middle. Well here, that's not the case. We have something that we would characterize as right skewed, and we'll formally define skew in the next section. But you can see that the majority of the observations are smaller in value, with a tail of a few extremes that are greater.
And you can think of why that may be for length of stay data at a teaching hospital, and we can discuss that on the bulletin board or in Live Talk. Let's see what happens when we take something like this and transform it to the box plot presentation. Well, this box plot certainly looks different from the box plot on blood pressure data.
So let's home in on those key quantities we defined and talk about some we didn't see in that symmetric data. So here's our box, and the line that falls between the two sides of the box again represents the median length of stay. Similarly, the upper edge of the box is the 75th percentile, and the lower edge is the 25th percentile. And you can see here the median is no longer smack dab in the middle of those percentiles. It's closer to the 25th than it is to the 75th, which indicates that there's not symmetry about that measure of center.
The smaller values are closer to the median than the larger, and it gives a hint to that right skew. Here, these outer whiskers no longer necessarily correspond to the largest and smallest observations, because you can see dots beyond the upper whisker there that indicate individual data points.
When we have data that's not symmetric, some observations are termed outliers, meaning they fall outside the majority. And we saw in the histogram that the majority of the observations were low, small length of stay values, and then there were a few extreme larger values.
In situations where our data is not symmetric, the box plot will highlight these extreme values, and these whiskers now correspond to the largest non-outlier value and the smallest non-outlier value, respectively. And you can see here that there are some dots beyond that largest non-outlier value, and those are the ones that correspond to outliers or extreme points. You can see that in this data set, the only extremes are positive, i.e., greater than the majority of the observations; there are no small outliers. So that lower whisker, the smallest non-outlier, is also the smallest value, but the upper whisker is not the largest value, but the largest non-extreme value. For information about how the box plot determines what is an outlier and what isn't, check out the document I posted on the anatomy of a box plot. For those of you who are not interested in the technical details, you can ignore that. So we see a very different picture here, one that captures the essence of that skew we saw on the histogram: most of the values are small, and the ones that are extreme are relatively large.
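For the curious, here is a rough sketch of how whiskers and outliers can be computed. The lecture leaves the exact rule to the separate document, so the rule used below is an assumption: a widely used convention flags any value more than 1.5 times the box height (the distance between the 25th and 75th percentiles) beyond either edge of the box. The function name, the multiplier `k`, and the length of stay values are all made up for illustration.

```python
from statistics import median


def box_plot_parts(values, k=1.5):
    """Compute box edges, whisker ends, and outliers.

    Assumed convention (not confirmed by the lecture): values beyond
    k * IQR from the box edges count as outliers, with k = 1.5.
    """
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])
    q3 = median(data[n // 2 + n % 2:])
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median(data),
        "q3": q3,
        "lower_whisker": inside[0],   # smallest non-outlier
        "upper_whisker": inside[-1],  # largest non-outlier
        "outliers": outliers,         # drawn as individual dots
    }


# Made-up, right-skewed lengths of stay in days.
stays = [1, 1, 1, 2, 2, 3, 3, 4, 5, 6, 8, 21, 40]
parts = box_plot_parts(stays)
print(parts)
```

With these made-up values the lower whisker is the smallest observation, the upper whisker stops at the largest non-outlier, and the two longest stays are flagged as dots, which mirrors the pattern in the length of stay box plot.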
How about a stem and leaf plot? Well, we've got a thousand observations here, and here's my stem and leaf shrunk to the page. You can see that the leaves on the first set of stems have dot dot dots at the end, and then they give a number. That's because in this data set of a thousand records there were actually 298 whose length of stay was one day. So instead of showing 298 ones, it gives a few of them and then gives the raw number at the top. Right then and there, we're seeing that with this much data and this skewness, the stem and leaf is probably not a very efficient display, and it would really need to span a couple of pages to show it all.
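Here is a small sketch of how a display like that could be built, including the cut-off rows. It assumes non-negative whole-number data, splits each set of ten into a star bin (leaves 0 to 4) and a dot bin (leaves 5 to 9) as in the blood pressure plot, and skips empty bins for brevity. The function name, the `max_leaves` cutoff, and the way the count is printed are illustrative choices, not the software's actual format. Only the count of 298 one-day stays comes from the lecture. The other values are invented.

```python
from collections import defaultdict


def stem_and_leaf(values, max_leaves=20):
    """Return the text rows of a split-stem stem-and-leaf display.

    Rows with more than max_leaves leaves are cut off with '...' followed
    by the row's full count (an assumed rendering for illustration).
    """
    bins = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        # Second key element: 0 for the '*' bin (0-4), 1 for the '.' bin (5-9).
        bins[(stem, 0 if leaf < 5 else 1)].append(str(leaf))
    rows = []
    for (stem, upper), leaves in sorted(bins.items()):
        label = f"{stem}{'.' if upper else '*'}"
        text = "".join(leaves)
        if len(leaves) > max_leaves:
            text = text[:max_leaves] + f"... ({len(leaves)})"
        rows.append(f"{label:>4} | {text}")
    return rows


# 298 one-day stays (the count from the lecture) plus invented longer stays.
stays = [1] * 298 + [7, 9, 12, 15, 21, 40]
rows = stem_and_leaf(stays)
print("\n".join(rows))
```

Even with truncation, a long right tail still means many nearly empty rows, which is part of why the display gets unwieldy for large, skewed data sets.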
So that's just an FYI. The stem and leaf is interesting, if not all that useful in most datasets. Let me now show you what I think of as the true power of box plots, which maybe gives them a little edge over histograms, in my opinion, although I'll never force you to choose one or the other.
Suppose we wanted to do a side-by-side comparison of the distribution of length of stay values between two subgroups in our data set. For example, I wanted to see how the distribution of length of stay values compared for males and females. Well, we could do side-by-side histograms, and we see in this picture here that both of them have similarly non-symmetric distributions where the bulk of the observations are low and there's extremes that are large.
But look at this presentation of side-by-side box plots. I think it's a little easier to suss out what's going on. We can actually see that the medians and the 25th and 75th percentiles are all relatively similar between the males and females, as is the most extreme non-outlying value in the positive direction. But you can see that for females there's one really extreme outlier as compared to the males.
This is just interesting. I think it's a little easier to size up on a side-by-side basis than histograms are. And if we had three or four groups like type of insurance or that sort of thing we wanted to compare, then I think it would be much easier to do it using the boxplot approach.
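As a rough illustration of why the side-by-side comparison is easy to read, here is a sketch that computes the same handful of box plot numbers for each group and lines them up in a small table. The records are made up and are not the hospital data. The helper name `summarize` is hypothetical, and the percentiles use the "middle of each half" convention, which may differ slightly from what plotting software reports.

```python
from statistics import median


def summarize(values):
    """Min, 25th percentile, median, 75th percentile, and max of one group."""
    data = sorted(values)
    n = len(data)
    return {
        "min": data[0],
        "q1": median(data[: n // 2]),
        "median": median(data),
        "q3": median(data[n // 2 + n % 2:]),
        "max": data[-1],
    }


# Made-up (group, length of stay in days) records.
records = [("male", 1), ("female", 1), ("male", 2), ("female", 2),
           ("male", 3), ("female", 3), ("male", 5), ("female", 5),
           ("male", 9), ("female", 9), ("male", 14), ("female", 60)]

# Collect the lengths of stay for each group.
groups = {}
for group, days in records:
    groups.setdefault(group, []).append(days)

summaries = {g: summarize(v) for g, v in groups.items()}
print(f"{'group':<8}{'min':>6}{'q1':>6}{'median':>8}{'q3':>6}{'max':>6}")
for g, s in sorted(summaries.items()):
    print(f"{g:<8}{s['min']:>6}{s['q1']:>6}{s['median']:>8}{s['q3']:>6}{s['max']:>6}")
```

Adding a third or fourth group, such as type of insurance, only adds one more row here, just as it would add one more box to the plot.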