The other type of graph we're going to study is the histogram. More often than not, when we're looking at numerical variables, we are going to study a histogram. Why? Because histograms are the type of graph you want when you're looking at large sets of data. As we talked about at the very end of chapter one last week, the larger the dataset, the better. So, histograms are going to be the graph we make when we have large sets of data. Why? Well, because dot plots literally draw a dot for each object of interest. So if, say, we were looking at 50,000 students, you would have to draw 50,000 dots. And that's just painful. Super, super painful. Because of that, we've created a second type of graph for numerical data. In a histogram, instead of dots, we graph bars: rectangles that group our observations into equally sized intervals, which we will call bins.

So, let's now talk about how we're going to create a histogram. One of the biggest differences that comes with the histogram is that you now have both an x-axis (a horizontal axis) and a y-axis (a vertical axis). The horizontal axis is still going to be the same: it still represents your variable. So, once again, we are going to draw a number line, very similar to what we did on the dot plot, where that horizontal line represents your variable. But now the vertical axis is going to represent the frequency. Instead of stacking a bunch of dots on top of each other, we create a vertical axis to represent the frequency.

Let's take a look at that in a practical example. Example two: the number of home runs hit by each Major League Baseball team during the season. That is what these histograms are going to represent. So, in this case, I want you to see that what we are studying is Major League Baseball teams.
Those are my objects of interest. They're literally who I am talking to (the Giants, the A's, the Dodgers), and the variable we want to know about each team is: how many home runs did your team get that season? And so, I want you to note that on the horizontal axis is HR, representing home runs. Once again, we are creating a number line where the numbers represent the number of home runs. What's new, then, is the vertical axis, where each number represents how many teams got a specific home run amount. So, when you think of frequency, when you think of the vertical axis, it's representing the number of objects of interest we're looking at, which in this case is Major League Baseball teams. And so, before even talking about the blue histogram, I want you to see, in gray and in brown, just what our axes represent: the variable and the objects of interest.

Now, let's talk about the axes. Along the x-axis, what we're going to do is divide the x-axis into equal-width intervals to form bins. Why? Well, as I already told you earlier, in general, when you're working with number lines, you need to make sure your tick marks are equally spaced, and not just equally spaced, but increasing by the same amount from tick mark to tick mark. What amount am I increasing by from tick mark to tick mark? It's 50. We give this a name: this increase of 50 on the horizontal axis is called the bin width. So, we say here the bin width is 50. It emphasizes the fact that each bar in blue is going to range over 50 home runs.

So, then, what does the height of each rectangle mean? Let's talk about the height. Ultimately, what you're going to do is talk to each of your Major League Baseball teams and ask them, 'How many home runs does your team have, Giants? How many home runs does your team have, A's?
How many home runs do you have?' And you figure out how many home runs each team has. Then, from there, you identify which bin that home run count falls in. You talk to the A's, and they say, 'Okay, I got 92 home runs.' They go into the first bin. You talk to the Giants, and they say, 'Alright, I have 140 home runs.' They go into the second bin. What we're going to do is count how many observations fall into each bin, so that the height of the bar is the frequency: how many teams fall within that bin.

I want you guys to see that the first bin has a height, a frequency, of seven. That number seven represents the number of Major League Baseball teams. The height of the bar represents the number of objects of interest. So, we would say here that seven Major League Baseball teams are doing what? Well, again, my variable is home runs. So, in this case, they're hitting between two numbers of home runs. Which two numbers are we looking at? Well, when it comes to bins, you include the left number, but you do not include the right number. The first bin runs from 80 up to, but not including, 130, so we would say seven Major League Baseball teams hit between 80 and 129 home runs. In general, the reason we include only the left number is so that boundary values like 80, 130, and 180 fall into exactly one bin. If you have 80 home runs, you fall into the first bin. If you have 130 home runs, you fall into the second bin.

But one of the things I want to emphasize is that this bin width is not set. When it comes to histograms, you can absolutely change the width of the bin. So, for instance, in this second graph, I want you guys to know that this is the exact same Major League Baseball home run data we just collected. And in this case, we changed the bin width. Can you guys tell me what the bin width is for this graph? The bin width is 10.
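To make the bin rule concrete, here is a minimal Python sketch of the "include the left number, not the right number" idea. It assumes the first bin starts at 80 and the bin width is 50, as in the first histogram. The function names `bin_index` and `bin_interval` are made up for illustration and do not come from any particular software.

```python
def bin_index(home_runs, start=80, width=50):
    """Return the 0-based bin number for a home run count.

    Bins are left-inclusive: [80, 130), [130, 180), and so on.
    Assumes home_runs >= start (hypothetical helper, for illustration).
    """
    return (home_runs - start) // width


def bin_interval(index, start=80, width=50):
    """Return (left, right) for a bin; left is included, right is not."""
    left = start + index * width
    return left, left + width


# The two teams mentioned in the lecture, plus two boundary values.
for label, hr in [("A's", 92), ("Giants", 140), ("boundary", 80), ("boundary", 130)]:
    i = bin_index(hr)
    left, right = bin_interval(i)
    print(f"{label}: {hr} home runs -> bin {i + 1}, "
          f"from {left} up to but not including {right}")
```

Running it places 92 in the first bin and 140 in the second, and a boundary value like 130 lands in the second bin only, never in both.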
Yeah, I know the axis is only labeled in increments of 50, but notice how we have one, two, three, four, five bars in each increment of 50. 50 divided by 5 is 10. So we can see here that my bin width has changed from 50 to 10. And my question for you is: when the bin width changed, did the overall shape of the histogram change? Yes. They're totally different, which leads to that real, big, honest drawback of histograms: changing the width of the bin changes the shape of the overall histogram. That's one of the things you need to be aware of when you are studying histograms. The bin width matters because it changes the shape of the graph.

Now, on top of being able to count how many Major League Baseball teams got a range of home runs, we might instead want to compare percentages or decimals, meaning the ratio, the fraction of teams out of the overall sample. We call that relative frequency, which honestly sometimes frustrates me, because it's a new name for something you guys already know. Relative frequency is exactly the fractions that we learned back in chapter one. What we sometimes want to do when graphing is find the relative frequency for each bin and then form the corresponding histogram using those relative frequencies, meaning you put the relative frequency on the y-axis.

What do I mean by this? Well, if you guys look at the graph on the bottom, I want you to note that the variable is still home runs. And I want you to note that the bin width is still 50, meaning everything we had in the top graph is exactly what we have here in the bottom graph. I just literally used the copy and paste function of this count-table function. Alright, what I want to emphasize here is that what is the same is the x-axis, and what is the same is the bin width. Ultimately, when it comes to working with relative frequency, it's the y-axis that's going to be different.
See, now instead of using frequency, counting numbers like four, five, six, seven, eight, the vertical axis is going to be relative frequency, meaning it's going to be a decimal representing the ratio of teams in that bin to the overall total. What do I mean by that? Well, let's note the fact that my first bin is exactly the same in both histograms. They both run from 80 to 130. And so, to calculate the relative frequency, we take the number of teams having between 80 and 129 home runs, and we divide that by the total number of Major League Baseball teams surveyed here, n equals 30. You should start getting comfortable with the fact that n represents the total. And again, that total can sometimes be given to you. Remember, you can also find it by adding up all the frequencies. So, we can add up the frequencies of 7, 13, 9, and 1, and we can see here that that ultimately gives us... So again, it's not that the relative frequency is seven; it's that I'm forming the fraction to make the relative frequency 7 over 30, about 0.233. And I want you guys to see that the height of that first bar is 0.23.

I want you to note the y-axis is different. Instead of using counting numbers, we're now using decimals. But so many other things are still the same. The x-axis representing home runs is still the same, and the bin width of 50 is still the same. But there's one more thing I want to talk about that's the same, and that's the actual shape. We can see here we have the same shape.
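Here is a minimal Python sketch of that relative frequency arithmetic. It assumes the four bar heights are 7, 13, 9, and 1. The first count of 7 and the total n = 30 are stated in the lecture, while the two middle counts, 13 and 9, are an assumption consistent with that total. The variable names are made up for illustration.

```python
# Frequencies (bar heights) for the bins of width 50.
# The two middle counts are an assumption that matches n = 30.
counts = [7, 13, 9, 1]

# n is the total number of teams: add up all the frequencies.
n = sum(counts)

# Relative frequency = frequency in the bin / total number of teams.
relative_frequencies = [count / n for count in counts]

for count, rel in zip(counts, relative_frequencies):
    print(f"frequency {count:2d} -> relative frequency {count}/{n} = {rel:.3f}")

# Every bar is divided by the same n, so the bars keep the same
# proportions to each other, and the relative frequencies add up to 1.
print(f"total of relative frequencies: {sum(relative_frequencies):.3f}")
```

Because every count is divided by the same total, the tallest bar stays the tallest and the shape is unchanged. Only the scale on the y-axis changes.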