...and start talking about a new statistical test today: the analysis of variance, or ANOVA. This is one of several videos. This first video is going to focus on the theoretical background and on understanding how to use ANOVA, and the subsequent lectures will provide examples of how to actually use it with data. So when we're thinking about the analysis of variance, it provides a similar function to that of a two-sample t-test, in that we can determine whether different population means are different from one another or not. But the two-sample t-test is limited because we can only compare two different populations. What if we've got more than two populations to test?
Well, we can do so through the analysis of variance, which uses the F test as its foundation. As the name of this test indicates, we're going to be looking at differences in variance across these different populations to understand whether their means are different or not. So let's say that we're interested in aquaculture, and we want to make sure that the fish we're purchasing are from the highest-rated facility.
The ratings are based on a variety of different criteria. We can look across different states and note the mean rating of each state across different sampling periods. We notice that Texas has the highest mean rating, but is this significantly different from the other states? Does it really matter whether we purchase fish from Texas, California, or Maine based on these ratings?
And so ANOVA will provide us with the opportunity to determine this by testing the null hypothesis that the means are equal to one another. So in its simplest form, analysis of variance is going to test if means of different populations are equal or not. And so this is going to expand our t-test to more than two different populations.
And to do this, ANOVA is going to partition the variance of these populations into its respective components. It does so by looking at the differences among the populations themselves, as well as the variability of the samples within each of those populations.
And we'll clarify this as we move forward to make sure everybody's on the same page. So why don't we just use multiple t-tests? Well, the first reason is that this will increase the amount of time we need for our calculations, and we know our time is valuable. But more importantly, doing multiple t-tests is going to increase our type 1 error: rejecting the null hypothesis when in fact it's true in the real world. So let's focus on the first issue of increased calculation time.
Now, let's say that we're interested in testing the means of three different populations to see if they're equal or not. This leads us to do three separate t-tests: A compared to B, A compared to C, and B compared to C. Let's say that we're interested in testing the means of four different populations. Well, then we need to compare A and B, A and C, A and D, B and C, B and D, and C and D: six different t-tests. And as we increase the number of populations, we rapidly increase the number of tests that we need to do.
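To see how quickly this grows, here's a minimal Python sketch; the population counts are arbitrary, and the comparison count is just the number of pairwise combinations, k choose 2:

```python
from math import comb

# Pairwise t-tests needed for k populations: "k choose 2" comparisons.
for k in (3, 4, 5, 10):
    print(f"{k} populations -> {comb(k, 2)} pairwise t-tests")
# 3 -> 3, 4 -> 6, 5 -> 10, 10 -> 45
```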
So this is going to significantly increase the amount of time that we spend calculating these test statistics and comparing them to our t-table. More importantly, doing multiple t-tests is going to increase the likelihood that we reject a null hypothesis when it's actually true in the real world. Now, with our t-tests and any other tests, we set the significance level to limit the likelihood of a type 1 error, and our default has been 0.05. What this means is that if the null hypothesis is true, there's a 95% probability that a single test will correctly fail to reject it. But if we start doing multiple t-tests, say three of them, we start to decrease the likelihood that we make the correct inference.
What we find is that the probability of incorrectly rejecting a true null hypothesis is going to increase: the chance of making a wrong decision, a type 1 error, grows with every additional t-test we do. ANOVA alleviates this particular issue.
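As a quick illustration (with arbitrary test counts, and assuming the tests are independent), we can compute this family-wise error rate directly:

```python
alpha = 0.05  # per-test significance level, our class default

# Assuming m independent tests and a true null hypothesis, the chance of
# at least one false rejection is 1 - (1 - alpha)^m.
for m in (1, 3, 6, 10):
    print(f"{m} t-tests -> P(at least one type 1 error) = {1 - (1 - alpha) ** m:.3f}")
# 1 -> 0.050, 3 -> 0.143, 6 -> 0.265, 10 -> 0.401
```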
And so we have a couple of assumptions for ANOVA, and these are not new: the populations are normally distributed, and the data are independent of each other. And again, we're going to make these assumptions in this particular class unless we're told otherwise. Now, there are a couple of important terms that we need to be aware of, particularly as you're reading through the book and going through those examples. The first is a factor: the characteristic that distinguishes the different populations from one another.
Let's say that we're interested in different species, different bays, or different age classes. Within those factors, we'll have different levels, or different groups; these are our experimental treatments. So if we're interested in different species, these could be three different fish species that we're comparing to each other. If we're looking across different bays, these could be different estuaries that we have along the Texas coast; or we might compare different age classes of a particular animal.
So if we're comparing the diets of different sharks by age, our factor would be age class, and the levels, or groups, would be the respective age classes themselves. For comparing growth rates of oysters across different salinities, our factor would be salinity, and our levels, or groups, would be the specific salinities we're comparing. Now, if we think about the two-sample t-test, our null hypothesis is that the two means are equal to one another. ANOVA follows a similar format, in which we test against the null hypothesis that all of the means we're considering are equal (H0: μ1 = μ2 = … = μk).
And so our alternative hypothesis is that at least two of these means are different from one another. For our particular purposes, it's not critical that we identify which of those differences occur: it doesn't matter whether population one is different from population two, population two from population three, or population three from population four.
What we're concerned with is the null hypothesis, in that we're testing against all the means being equal. And just for reference, if we were to run an ANOVA on just two samples, we'd actually get the same result as a two-sample t-test. That doesn't excuse you from discerning when we should use a t-test versus an ANOVA, but it's reassurance that if you ran an ANOVA comparing two populations, you'd get the same result as the two-sample t-test.
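We can check this equivalence with a short Python sketch; the sample sizes, means, and spreads here are made up purely for illustration:

```python
import numpy as np
from scipy import stats

# Two hypothetical samples (arbitrary means and spread, 30 values each).
rng = np.random.default_rng(42)
a = rng.normal(10.0, 2.0, size=30)
b = rng.normal(11.0, 2.0, size=30)

f_stat, p_anova = stats.f_oneway(a, b)   # one-way ANOVA on two groups
t_stat, p_ttest = stats.ttest_ind(a, b)  # classic two-sample t-test

print(np.isclose(p_anova, p_ttest))    # True: identical p-values
print(np.isclose(f_stat, t_stat**2))   # True: F = t^2 with two groups
```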
Okay. So to test this null hypothesis, we're going to partition the variability of our data so that we can determine how much of that variability is attributable to differences in the means of our groups (the different species, or the different bays, that we're comparing), and how much of that variability can be attributed to the individuals within each of those respective groups.
We discern these as among-group variability and within-group variability. The greater the differences across those groups, the greater our likelihood of rejecting the null hypothesis. If we have big differences across our bay systems or across our salinity regimes, we're going to have more support for rejecting the null hypothesis.
But if we have more variability within these groups, within these salinity regimes or within these estuaries, then it might be that the differences we're observing are not due to the differences across the treatments, the groups, or the levels; it's more likely that we're just seeing inherent variability within our data set. And so we'll have less likelihood of rejecting that null hypothesis.
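Here's a minimal sketch of that partitioning in Python, using made-up numbers for three hypothetical groups; the point is just that the total variability splits exactly into the among-group and within-group pieces:

```python
import numpy as np

# Hypothetical measurements from three groups (say, three bay systems).
groups = [np.array([4.1, 4.5, 4.3]),
          np.array([5.0, 5.4, 5.2]),
          np.array([6.1, 5.9, 6.3])]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Among-group variability: how far each group mean sits from the grand mean.
ss_among = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variability: how far individuals sit from their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# The two components add up to the total variability in the data.
ss_total = ((all_data - grand_mean) ** 2).sum()

print(np.isclose(ss_among + ss_within, ss_total))  # True
```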
So let's look at an example to illustrate this point. Let's say that we're comparing the sizes of red drum across some of our bay systems here in Texas; we want to know if there's a difference in their size. For ANOVA, we need to know how different the fish are among the different bays.
So if we're comparing these fish, how different are the fish from Sabine Lake versus Galveston Bay versus Matagorda Bay? We also need to know how much variability the fish have within their respective bay systems. If there's a lot of difference among the fish from Sabine Lake, Galveston Bay, and Matagorda Bay, then we have a pretty strong argument to reject the null hypothesis.
We have quite a bit of evidence to do so. But if we have a lot of variability within particular bay systems, with lots of variance in the size of fish within one or more of these bays, we'll have less evidence and a greater likelihood of failing to reject the null hypothesis. This goes back to our premise of sampling distributions: each of these bays represents a different population, and within each of these populations there are a variety of different samples we could collect. Each of those samples will tend to differ from the others, but they'll tend to cluster around that central value.
So we can think about this in terms of a distribution curve, where the curve represents our respective population and is generated from the data in our different samples. And there are two extreme examples we can consider: one in which we get very similar samples every time we go out and collect data, and one in which we get very different samples.
And what we notice is that when the samples tend to be very similar to one another, the distribution curve for the population tends to be less variable and more peaked, whereas if we have lots of variability in our samples, the distribution curve spreads out and flattens. This affects our ability to discern whether there are actually differences among our populations: how similar our samples are to one another determines how much variability there is in our population estimates, which in turn affects our ability to detect differences among those populations. So...
Let's say that we go out and collect data from these red drum across various time periods in our respective ecosystems of interest, and we find that Sabine Lake red drum tend to be the smallest, Matagorda Bay red drum are somewhere in the middle, and Galveston Bay red drum are the largest. And let's say we have the first scenario, in which all of the samples are pretty similar within their respective ecosystems. Then we'd expect to see distributions that are spread out across the size range, with some variability and some overlap among these respective bay systems, but that overlap is pretty small.
And so this is going to provide some pretty strong support for showing differences across these respective bay systems. However, this isn't always the case; oftentimes we get quite a bit of variability in our respective samples. Let's say there's quite a bit of variability in the fish we sampled in Galveston Bay. Then the overlap between Galveston Bay and Matagorda Bay starts increasing, and this in turn might leave us unable to discern a difference in the sizes of fish between Matagorda Bay and Galveston Bay based on our statistical tests.
And so we can tell that Sabine Lake fish are smaller than the other two, but we can't really tell that Galveston Bay and Matagorda Bay are different. Now let's say that we start getting a lot more variability in the fish we're sampling in Sabine Lake as well. Then we're running into a high level of overlap among all of our different sampling distributions.
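We can mimic this whole scenario with a small simulation; the bay means, standard deviations, and sample sizes below are made up, but they show how inflating within-group variability alone weakens the test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
means = (450.0, 520.0, 560.0)  # made-up mean lengths (mm) for three bays

# Scenario 1: little variability within each bay -> minimal overlap.
tight = [rng.normal(m, 15.0, size=25) for m in means]
# Scenario 2: identical bay means, but far more within-bay variability.
noisy = [rng.normal(m, 150.0, size=25) for m in means]

print(stats.f_oneway(*tight))  # very large F statistic, tiny p-value
print(stats.f_oneway(*noisy))  # much smaller F, larger p: weaker evidence
```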
And as we think about this, every time we go out and collect data, we could add variability to our estimate of a particular population, but more data can also increase our confidence and narrow the confidence interval around our point estimate for that population. So in going out and collecting data, sample size is important, and this is particularly important when we're testing the means across these different groups.
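As a quick sketch of why (using an arbitrary made-up population with mean 500 and standard deviation 50), the standard error of the mean shrinks as sample size grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Standard error of the mean scales like sd / sqrt(n), so larger samples
# pin down the population mean and narrow its confidence interval.
for n in (5, 25, 100):
    sample = rng.normal(500.0, 50.0, size=n)  # made-up population values
    se = sample.std(ddof=1) / np.sqrt(n)
    print(f"n = {n:3d} -> estimated standard error of the mean = {se:.1f}")
```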
And so when we think about this, it goes back to that central limit theorem we talked about several lectures ago: increased sample size increases our capacity to describe a population from a sample. This lecture was very theoretical, but it's quite important to understand as we move forward. The next two lectures will focus on using data to illustrate how we can actually use ANOVA to test for differences across the means of populations, but it's really key to have this foundational understanding in your back pocket.