Transcript for:
T-Distribution and T-Statistic Overview

As we start talking about confidence intervals and hypothesis tests here in chapter nine, where we ultimately want to estimate and test for a mean, we are going to be using a new statistic to determine how weird my sample is, and that new statistic is the t-statistic.

Let's go back and remember chapter 7, when we were studying proportions, when we were studying categorical data: that standard error formula was this huge square root expression with a fraction inside of it, divided by n. Versus in section 9.2, when we were talking about means, when we were talking about numerical data, remember the standard error formula was a fraction where the square root of n was the only thing on the bottom. What I'm trying to emphasize here is that these two standard error formulas are totally different, and that difference between what we were doing with proportions and what we're doing with means is the reason why we now need a new statistic to understand just how weird my sample is.

This new t-statistic is still going to be based off of the z-score. The general z-score formula was observed minus center, divided by spread. These two formulas are pretty similar in that the observed value is once again the statistic coming from your sample, and the center is the parameter coming from your population. Once again, we are comparing what's going on in the population, mu, the population mean, with what's going on in my sample, in this case x-bar, my sample mean. So in a lot of ways the top of the fraction is very similar: you're taking the difference between the means. The big difference is the spread formula, because in this standard error formula we're no longer going to use sigma; rather, we are going to replace sigma with s, the sample standard deviation.

Why is this so critical? Well, it's because in this formula we're using x-bar, the sample mean, we're using s, the sample standard deviation, and we're using n, the sample size. Why is this formula so interesting? Because we are literally utilizing every aspect of the sample to calculate this t-statistic, and that alone is the reason why the t-statistic is not going to behave like the statistic we learned in chapter 7. By using the sample standard deviation and the sample size, we are now dividing by an estimate of the standard error, and because of that, the t-statistic is not going to follow the normal distribution. The bottom line is that this t-statistic does not follow a perfectly normal distribution like we've studied; rather, it follows a t distribution.

So what was the whole point of me explaining this paragraph right here? It was to emphasize the fact that when you are looking at numerical data, when you are studying population means, you are no longer in the realm of that perfect normal distribution; you are in the realm of a t distribution.
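Written out, the formulas described above look like this. This is a standard rendering of the verbal description; the notation is reconstructed, not copied from the slide:

```latex
% Chapter 7 (proportions) vs. section 9.2 (means): the two standard errors
\[
SE(\hat{p}) \;=\; \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}
\qquad\text{versus}\qquad
SE(\bar{x}) \;=\; \frac{s}{\sqrt{n}}
\]

% The t-statistic: observed minus center, divided by the estimated spread,
% with s (the sample standard deviation) standing in for sigma
\[
t \;=\; \frac{\bar{x}-\mu}{s/\sqrt{n}}
\]
```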
So what is a t distribution? Well, for starters, I want us to look at the graph at the bottom of the page. Notice how there's a blue curve. That blue curve is the normal curve we know and love, whereas the pink and orange curves are slightly off: a little lower at the top of the curve, a little fatter out on the edges of the curve. Those are what we call t distribution curves. Now, I want you to see the similarity, which is that all three of these graphs are symmetric and unimodal, and yet we can clearly see they're not the exact same graphs. So what are the differences? The biggest difference is that the tails are slightly bigger in the t distribution.

Let's zoom in on one of these tails. Remember, the blue curve is the normal curve, so what I'm shading in blue is the area of the tail of the normal curve. And remember, the orange curve is a t distribution curve. If we start the tail at the same place, I want you to see that the orange area is bigger than the blue area; I can literally highlight in orange the part of the tail that sits beyond the blue normal curve. The t distribution has a bigger tail area compared to the normal curve, which has a smaller tail area.

Why is this so critical? Remember that, in general, tails represent the extreme values, and the bigger the area, the more extreme values you can have. I'll say that one more time: the bigger the area, the more extreme values you can have. So why is it notable that the t distribution has a thicker tail? Because it's emphasizing that an extreme value, a value far away from the center, is more likely. Numerical data is frankly more variable than categorical data. Think of the numerical data we saw in chapter two, when we were asking students how many siblings they have or how many email addresses they have; in those particular problems there existed things like outliers. So what the t distribution is doing is allowing a higher chance of extreme values.

When it comes to a t distribution, we give a name to the shape, because as we can see here with the pink t distribution curve and the orange t distribution curve, they are clearly different, and they're different because of their sample sizes as well as the overall shape of the pink versus orange graphs. It's the sample size that dictates how we describe the shape of the t distribution. We describe the shape by taking the sample size n, subtracting one from it, and saying the t distribution has n minus one degrees of freedom. So for instance, my orange graph, which was created with a sample size of five, would have four degrees of freedom. And my pink graph, how many degrees of freedom does it have when its sample size was 15? Yeah, it would have 14 degrees of freedom.

I want you guys to know that as the degrees of freedom increase, as we go in essence from the orange graph to the pink graph, the t distribution gets closer and closer to the blue graph. As the sample size gets bigger, the t distribution becomes more and more normal; the quick calculation below makes this tail comparison concrete. So ultimately, what is this emphasizing? It's emphasizing that with a big enough sample, we're essentially back to looking at a normal curve. You're like, really? Yes.
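If you want to check the "bigger tails" claim numerically rather than by eye, here is a minimal sketch in Python using scipy. The lecture itself works from the graph and a calculator, so the tool choice and the cutoff of 2 are assumptions for illustration; the printed values are approximate:

```python
# Compare the right-tail area beyond the same cutoff on all three curves:
# the standard normal (blue) vs. t with 4 df (orange) and 14 df (pink).
from scipy import stats

cutoff = 2.0  # start every tail at the same place

normal_tail = stats.norm.sf(cutoff)   # survival function = 1 - CDF
t4_tail = stats.t.sf(cutoff, df=4)    # orange curve: n = 5, so 4 df
t14_tail = stats.t.sf(cutoff, df=14)  # pink curve: n = 15, so 14 df

print(f"normal tail beyond {cutoff}: {normal_tail:.4f}")  # ~0.023
print(f"t(4)   tail beyond {cutoff}: {t4_tail:.4f}")      # ~0.058
print(f"t(14)  tail beyond {cutoff}: {t14_tail:.4f}")     # ~0.033
```

Notice the ordering: the 4 df tail is the fattest, the 14 df tail is closer to normal, which is exactly the convergence the graphs are showing.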
And ultimately, this is just emphasizing that the bigger the sample size, the better. Pretty much the whole conversation about t distributions is really to allow us to look at numerical data when the sample sizes are pretty small, like five or 15. But the big point of this final bullet point, the reason why I even made it a fill-in-the-blank, is to emphasize the fact that if your sample size is huge, the t distribution is pretty much a normal curve, in which case everything we know and love about normal curves applies.

Again, what was the point of this entire page? First things first, it's to make us understand that when we are looking at numerical data, when we are studying means, at its baseline we are looking at a t distribution. Numerical data, studying means, will always live in the realm of the t distribution. And yet, as your sample size gets larger, that t distribution is practically going to become normal. So once again, this is emphasizing the importance of having a big sample size.

Now, within this t distribution, we can once again do statistical inference: taking a good sample and using it to understand my population. What I'm trying to emphasize here is that even though we are now in the realm of numerical data, even though we are now in the realm of t distributions, we can still do statistical inference; we can still use a good sample to understand my population. And we're going to do that using the same two tools we learned in chapters seven and eight: confidence intervals and hypothesis testing.

Now, why? Why am I making you learn that when you're looking at a numerical variable, when you're looking at means, you're in the realm of t distributions? Why am I focusing on this here? Well, it's because when we study confidence intervals and hypothesis testing here in chapter nine, the calculator functions we're going to use both begin with the letter T. That is the main reason why I'm covering this here: so that you understand that as we study numerical data, as we study means, which live in the t distribution, the calculator functions both begin with T. TInterval for when we are making confidence intervals, and T-Test for when we are doing hypothesis testing.

The whole reason I just gave this huge explanation is to help you guys understand that in the next two sections, section 9.3 where we look at confidence intervals and section 9.4 where we look at hypothesis testing, the reason we're going to use the calculator functions involving T is that numerical data comes from a t distribution. So if you are going to take anything away from this page, it's this little flowchart summary, everything I highlighted in red: the calculator functions we're going to use begin with T, because numerical data actually lives in the t distribution.
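For anyone working outside the calculator, here is a minimal sketch of those same two procedures in Python with scipy. The sibling-count sample is made up for illustration (it is not data from the lecture), and the null value of 1 is likewise an assumption:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of sibling counts (made up for illustration; n = 7)
siblings = np.array([0, 1, 1, 2, 2, 3, 5])

n = len(siblings)
x_bar = siblings.mean()    # sample mean
s = siblings.std(ddof=1)   # sample standard deviation
se = s / np.sqrt(n)        # estimated standard error, s / sqrt(n)

# "TInterval": a 95% confidence interval for the mean, t with n - 1 df
low, high = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")

# "T-Test": test H0: mu = 1 against the two-sided alternative
t_stat, p_value = stats.ttest_1samp(siblings, popmean=1)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```

Both computations use n minus one degrees of freedom, which is exactly the degrees-of-freedom idea from the graphs above.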