Transcript for:
Understanding Paired T-Tests in Research

Now, hey there. Today, the topic is paired t-tests. If you remember, last time we were talking about two-sample t-tests, and I told you there were two kinds. The first is the independent samples test: that's when you have two conditions comprised of different elements or subjects, maybe comparing children and adults on some test. That's the one we did last time. A paired t-test, which is what we're going to talk about today, is when you have two conditions comprised of the same elements or subjects. You measure the same observations under different testing conditions. So it might be pre- and post-observations of the same subjects. Think about some sort of treatment in a medical study: you measure subjects' symptoms prior to going through the treatment and then after going through the treatment, and you have the same subjects in both conditions. So that's the difference between an independent samples test and a paired t-test. As I said, sometimes a paired t-test is called a repeated measures or within-subjects design; that's just different terminology. Again, think about this as a pre and post study. So we could measure your stress prior to going through some sort of yoga experience and then after going through some yoga experience, or meditation, or any sort of intervention like that. We could look at your reaction time sober and your reaction time intoxicated and see how that
changes, to see the effect of alcohol, again measuring the same subjects. An independent samples t-test way to do this would be to have one group of sober subjects and one group of intoxicated subjects and compare their performance. In a paired test, the same subjects take the test sober, then repeat the test when they are intoxicated. We looked at an example, right, where we looked at students' sleep on weekends and weekdays. That was a within-subjects design amenable to a paired t-test, because they're the same subjects reporting their sleep habits both on the weekend and during the weekday. Now, there are advantages to a repeated measures design. You don't need as many subjects, because you have the same subjects in both conditions, so that cuts down on recruitment issues. It allows you to study changes over time or practice effects; so if I want to see how well people are doing in class, it allows me to measure something like that. And it removes individual differences between groups, which can protect against some third-variable issues. So if we go back and look at these examples, the sober and intoxicated reactions (I'm not sure why I'm doing jazz hands for these things, but we're just going to go with that): if we're doing that as an independent samples t-test, with different subjects, no jazz hands, in our two groups, it could be that one group just generally has better reaction time than the other group, right? We try and solve that problem through random assignment, but we can't always be sure that we have. If you have a paired subjects test, then you don't have to worry about controlling for things like baseline reaction time, these sorts of third variables, because you have the same subjects, who have the same reaction time. So "each subject serves as their own control" is a good way to think about the advantage of a repeated measures design. Now, the disadvantage of a repeated measures design is that time-related factors may influence results. So if you think about a health study
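The design distinction above can be sketched in a few lines of Python. The reaction times here are hypothetical, made up purely to show the shape of the data in each design, not numbers from the lecture.

```python
# Independent samples design: two separate groups of different subjects
sober_group = [210, 198, 225]  # hypothetical reaction times (ms), group 1
intox_group = [265, 240, 258]  # hypothetical reaction times (ms), different people

# Paired / repeated measures design: the same subjects measured twice
sober = [210, 198, 225]  # each subject tested sober...
intox = [265, 240, 258]  # ...then the same subject tested again, intoxicated

# In the paired design each subject serves as their own control, so the
# analysis works on within-subject difference scores, not group means
diffs = [i - s for s, i in zip(sober, intox)]  # [55, 42, 33]
```

The paired analysis never compares the two lists directly; everything downstream runs on the one list of difference scores.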
and you're looking at the efficacy of some treatment, as time passes, the subjects might be getting, let's say, sicker and sicker with a progressive disease. And so the treatment might look less effective than it otherwise is. Another thing that can sometimes interfere with a repeated measures design is practice effects. If I teach you how to do something like a problem-solving task, and I measure performance several times, same subjects, same conditions, you're going to get better at that problem-solving task simply because you've been practicing it. There are ways around all of these problems, but they are things you have to keep in mind when selecting the design of your experiment. Okay, let's look at an example of a paired samples design. I run a small widget-building empire. These are my widget-building factories spread all around western Massachusetts. I am a benevolent widget manufacturing mogul, so I'm concerned about safety at my factories. I want to reduce accidents and keep my workers safe. So we have these factories in these different towns. I measure the number of accidents per month, then I have my workers go through a factory safety training program, and we'll see if that safety program reduces accidents. Are there fewer accidents after the safety training program than there were before? You can see the mean before was about 54 accidents per month, 53.8 to be exact. That's a lot. That's concerning. And after, it's 48.6. You see the associated standard deviations for those two samples. The question is: is that change from 53.8 to 48.6 significant? Does it represent a larger change than we'd expect simply by chance variation? So for hypothesis testing, we're interested in the difference scores. In this case we don't have people, we have factories, but we're looking at the difference between the Amherst factory's accidents before and after, South Hadley's accidents before and after.
So we're going to focus on the difference scores: not the 45 and the 36, but the 9 that represents the difference; not the 73 and the 60, but the 13 that represents the difference, the reduction in the number of accidents in South Hadley, for example. That's going to tell us how much change there was. And we're testing whether the difference between pairs of scores is due to sampling error or represents a real change. So our null hypothesis for a paired test is going to be about the mean of the difference scores. That's what we're focused on, not the individual data points, but the difference scores. The null hypothesis is that the mean of the difference scores is zero: there's no effect of the factory safety training program. And the alternative hypothesis is that mu of the difference scores is not equal to zero. So what we'll do is look at the difference scores, and notice they're all negative, except for Springfield. Springfield had an increase in accidents from 33 to 35, but for everyone else it's a negative score. And the mean of the difference scores, if you sum these all up and take the average, is negative 5.2. So the question is, is that negative 5.2 significantly different from zero? Another way of putting it: we have a sampling distribution of mean differences, with every possible mean difference for every possible combination of samples, and the question is where does that negative 5.2 fall in that sampling distribution? If it's out in one of the tails, probably the negative tail here, we're going to conclude it's a significant difference. If it's more in the middle, we're going to conclude it's not a significant difference. When we have paired data, the formula we use to run the t-test is very similar to the formula we used before. The numerator is going to be the mean of the difference scores minus this d-sub-zero, which, for our purposes, is almost always going to be zero.
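The difference-score setup above can be sketched in Python. The three towns' before/after counts and the overall sum of the differences (negative 52 across all ten factories) are the numbers given in the lecture; the lecture doesn't list the other seven factories individually, so they're left out here.

```python
# Difference scores (after - before) for the three factories named in the lecture
before = {"Amherst": 45, "South Hadley": 73, "Springfield": 33}
after = {"Amherst": 36, "South Hadley": 60, "Springfield": 35}
diffs = {town: after[town] - before[town] for town in before}
# Amherst -9, South Hadley -13, Springfield +2 (the one increase)

# Across all ten factories the lecture gives sum(d) = -52, so the mean
# difference score, the quantity the null hypothesis says should be 0, is:
sum_d, n = -52, 10
mean_d = sum_d / n  # -5.2
```

Everything in the test is built from that one list of d values; the raw before and after columns only matter for producing it.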
But we could say, oh, we want to see if the number of factory accidents was reduced by at least 10 or 20 or 50, or whatever we decide. For all intents and purposes, though, it's going to be zero. So the numerator is the mean of the difference scores. The denominator is the standard error: we'd use sigma if we knew the population standard deviation, but most of the time we don't, so we're going to use t and the standard error based on the standard deviation of the sample, okay? And the degrees of freedom is going to be n minus 1, where n is the number of pairs; in this case, it's the number of factories. So if I have 10 factories, I have 20 data points, but I only have 10 difference scores, so the degrees of freedom is going to be 9 in that case. So we want to run this. We're going to run a two-tailed test: the mean of the difference scores is equal to 0, or it's not equal to 0. We set alpha equal to 0.05. We find our critical value with 9 degrees of freedom (again, 10 factories, 10 difference scores, 9 degrees of freedom), and the critical value for t is 2.262. We've already found the mean of the difference scores is negative 5.2. The next thing we have to do is calculate the standard deviation of those difference scores. You can see the sum here is negative 52, divided by 10; that's where the negative 5.2 comes from. So that's our sum, and I'm just using d here for the difference scores. Now we need to figure out the sum of d squared, and we're going to calculate the standard deviation using the formula that we've used before. So the sum of d is negative 52. The sum of d squared, you just have to trust me, is 420. That's this negative 9 squared (remember, negative times negative is positive), plus negative 13 squared, plus 2 squared, and so on. We'll get 420. You're going to math that math.
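The standard deviation of the difference scores, using the computational formula with the sums from the lecture (sum of d equals negative 52, sum of d squared equals 420, n equals 10), works out like this:

```python
import math

sum_d, sum_d2, n = -52, 420, 10

# Computational formula for the sum of squares of the difference scores:
# SS = sum(d^2) - (sum(d))^2 / n
ss = sum_d2 - sum_d**2 / n      # 420 - 2704/10 = 420 - 270.4 = 149.6

# Sample standard deviation uses n - 1 (degrees of freedom for 10 pairs)
s_d = math.sqrt(ss / (n - 1))   # sqrt(149.6 / 9) ≈ 4.08
```

Note that squaring negative 52 gives positive 2704, which is why the sign of the sum never makes the sum of squares negative.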
420, the sum of d squared, minus the sum of d, negative 52, squared, divided by n. Negative 52 squared is going to be a positive number; remember, the negative times the negative is a positive. Work through the calculations, and the standard deviation comes out to 4.08. So that's our standard deviation. We can then plug these values into our formula for the observed value of t. The mean difference score is negative 5.2; that's the numerator. The denominator is the standard error: 4.08, the standard deviation that we just calculated, divided by the square root of n, again the number of difference scores, so the square root of 10. 4.08 divided by about 3.16 is roughly 1.3, and negative 5.2 divided by 1.3 is about negative 4.02. Compare that with our rejection region. The rejection region is beyond plus and minus 2.262, I think it was. Obviously, negative 4.02 falls in the rejection region, so we're going to reject the null hypothesis. And our interpretation is going to be that there is a significant effect of the safety program: the number of accidents before, 53.8, was significantly greater than the number of accidents after the safety program, 48.6. So the safety program significantly reduced the number of accidents in the factories. t with nine degrees of freedom was negative 4.02, p less than 0.05. Okay, let's call it a lecture here, because this one's a little bit on the long side, and then we'll come back and talk about some other things.
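Putting the pieces together, the whole paired t-test from the lecture can be sketched end to end in Python; the 2.262 cutoff is the two-tailed critical t for 9 degrees of freedom at alpha = .05, as stated above.

```python
import math

mean_d = -5.2             # mean of the difference scores
s_d = 4.08                # standard deviation of the difference scores
n = 10                    # number of pairs (factories)

se = s_d / math.sqrt(n)   # standard error of the mean difference, ≈ 1.29
t = mean_d / se           # observed t, ≈ -4.03
t_crit = 2.262            # two-tailed critical value, df = n - 1 = 9

reject = abs(t) > t_crit  # True: the observed t falls in the rejection region
```

With the observed t well past the cutoff, the code reaches the same conclusion as the lecture: reject the null and report a significant drop in accidents, t(9) ≈ -4.03, p < .05.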