Transcript for:
Understanding Non-Parametric Statistical Tests

Hi guys, it's Justin Zeltser here from zstatistics.com for the final video in a series on statistical inference. This one is to do with non-parametric testing. Now, you don't need to have seen the other videos to understand this one; it stands alone in that regard.

But if you'd like to, they're up on the website, or if you're watching from university, each of the videos will be up on their respective module homepages. In this video we're going to be dealing with three particular non-parametric tests: the sign test, the Wilcoxon signed rank test, and also the Mann-Whitney U test, which is also called the Wilcoxon rank sum test, just to be confusing. We're going to start with a definition of what a non-parametric method actually is, have a little look at the history of non-parametric methods in medical research, and then we're going to go on to an example of each of these particular tests.

Now, as you may have noticed, there are two particular pathways we can go down depending on the type of data that we have. So, if you just have a single sample, then you're in this top arm, where we can do a sign test or a Wilcoxon signed rank test. In that case you might have a single numerical variable measured across a single set of respondents.

Alternatively, if we have matched pairs, we'll also be on this top arm up here. Now an example of a matched pair scenario might be where you've sampled the same people before and after some kind of intervention. So you technically have two samples of data, but because they're matched, you can actually construct the differences between the two.

So say the improvement in some numerical measure. And in that way, you're constructing almost a single sample of differences. So hopefully you can see that that would be very similar to just having a single sample in the first place.

So that's why these two scenarios are technically equivalent and we can deal with them the same way. However, if you have two independent samples, say males versus females on some measure, that would definitely be two independent samples, and we'll use the Mann-Whitney U test, or the Wilcoxon rank sum test.

Anyway, I've kept the theoretical content for this video to a minimum and we're just going to be diving straight into three separate examples, which I think will be quite helpful for you to get your head around the tests themselves, and also to be able to use this video as a template, potentially, for answering your own questions. So what does a non-parametric method actually mean? Here I've written that non-parametric methods allow statistical inference without making the assumption that the sample has been taken from a particular distribution.

In this case I've given a normal distribution as the example. You'll notice that in previous videos, or in your studies generally, you'll have dealt with means and standard deviations, and both of those are parameters. And you might also have recognized that in the testing we've done up to this point, we're very quick to jump to a normal distribution table to assess our inferences. In this case, we're going to be trying to deal without our trusty normal distribution table.

Well, almost. It kind of sneaks in towards the end, but for the most part, we'll be trying to deal without these parameters. Now the history of non-parametric methods is quite interesting. There's a dude here that I've highlighted, John Arbuthnot, or Arbuthno, I don't know. He was a Scottish-born guy, and in his paper An Argument for Divine Providence, from 1710, he was trying to assess whether there's a difference in the proportion of males and females being born in London over a period of about 80 years. In doing so, the paper has actually been credited as being the first paper in inferential statistics. And in the paper he uses the sign test, which we're just about to learn about in this video.

In doing so he showed that more males were being born than females. That was the sort of prime conclusion for this paper. And interestingly, a secondary conclusion was that polygamy is contrary to the law of nature and justice, which may have been a little less data-driven than perhaps the first conclusion, I think.

It's a sign of the times of the paper, I would suggest. But this guy, John Arbuthnot, is quite interesting. Apparently he was quite a polymath, well known for his poetry and other work as well.

Seemingly an interesting dude. Anyway, let's get on to having a look at some of these actual examples of these non-parametric tests, the first of which is the sign test itself. So as I said, this is where you have a single sample or matched pairs.

So here's the example. Hemoglobin levels in grams per deciliter were sampled from 10 female vegetarians to assess the prevalence of anemia. And here are those 10 observations in grams per deciliter.

We're asked here to assess whether the median hemoglobin level for female vegetarians is less than 13.0 grams per deciliter. Now, the reason why I've written median here is that, as we said, we're not really on terms with those parameters, the mean and standard deviation and all that, so we're technically using a non-parametric measure of the central location of this population. We can't use the mean, so we'll have to use the median here, which itself is a non-parametric measure. And I've used a different symbol, the Greek letter eta, instead of the mu you might have seen before. It's almost like an upside-down mu, isn't it?

But this is going to be our null and alternate hypothesis. The null hypothesis is that the median is 13 grams per deciliter, and the alternate is that the median is less than 13 grams per deciliter. Now, if you've done hypothesis testing before, you'll know that whatever we're trying to seek evidence for goes in our alternate hypothesis. So we're seeking evidence that the median hemoglobin level for female vegos is less than 13, and the null hypothesis turns out to be the complement of that. So you could write greater than or equal to 13 here if you like, but I'm just putting equal to 13. It's a convention thing.

It doesn't actually much matter how you phrase that. So my first question to you is: if the null hypothesis is true, so say the median is in fact equal to 13, how many observations in the sample would you expect to have a hemoglobin level under 13? Hmm, well, we have 10 observations, and if the median were in fact 13, you'd be correct in thinking there might be 5 observations below 13 and about 5 above 13. You'd allow for some random variation, but that's what you would be expecting from your sample.

What did we get? Well, I've highlighted the observations that are less than 13 here, and they're all in pink. So we get seven observations that are less than 13. So technically, that means we have a sample which is more extreme than we would expect if this null hypothesis is true. But how extreme is this sample?

Is the fact that we had 7 below 13 grams per deciliter enough information for us to reject this null hypothesis? Well, that depends. And that depends on a particular distribution. So let's have a look. This next distribution shows, from 10 observations, the possible number of what I've called negative observations; all I mean by negative is less than 13. So let's call the observations that are less than 13 negative, and those that are greater than 13 positive.

If indeed the median were 13, you'd expect five negative observations. But of course you could get 4 or 6, or 3 or 7, depending on the randomness of your sample. But if you were to get an extreme number like 0 or 10, that might start casting doubt on this null hypothesis, right?

If all 10 of your observations were less than 13, you would have quite a decent amount of doubt cast on that null hypothesis. So where does this distribution come from? Well it's simply a binomial distribution with n being 10 and the probability of a single event being 0.5.

So this will give us a good picture of how extreme our sample really is. And here it is. If we had seven negative observations out of 10, which we did, the p-value associated with that is 0.172.

That's just the sum of the heights of these discrete outcomes: 7, 8, 9, and 10 negative observations. So that's a measure of essentially how extreme our sample is under the null hypothesis. In that case, we can see that we're going to be unable to reject that null hypothesis here. Our sample's not extreme enough, certainly not at the five percent level of significance.

If we had eight out of ten observations that were less than 13, the p-value would be 0.055. So we're getting closer to being able to reject that null hypothesis. Of course, if we had 9 or 10, we would be able to reject that null hypothesis, because our p-value would be very small in that case.
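If you want to check those numbers yourself, here's a minimal sketch in Python using scipy; the only input taken from the example is the count of negative observations out of 10.

```python
from scipy.stats import binomtest

# Sign test: under H0 (median = 13), the number of observations below 13
# follows a Binomial(n = 10, p = 0.5) distribution. We observed 7 of 10.
result = binomtest(k=7, n=10, p=0.5, alternative='greater')
print(result.pvalue)  # 0.171875 -- the 0.172 quoted above

# With 8 negative observations, the p-value drops to about 0.055:
print(binomtest(k=8, n=10, p=0.5, alternative='greater').pvalue)  # 0.0546875
```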

But as we only have 7, we can say there's not enough evidence at the 5% level of significance to suggest the median hemoglobin level for vegetarian women is less than 13.0 grams per deciliter. And if I rewind back really quickly to the sample again, just to recap: we found that there were seven observations less than 13, or seven negative observations, which wasn't extreme enough. We needed nine or ten negative observations, observations less than 13, for us to reject that null hypothesis.

And that really is the sign test. So you'll notice that the only piece of information we were using there was whether each particular observation was positive or negative. In other words, is it on one side or the other side of this median value?

So it's their sign that we were interested in. What we're about to do is incorporate an extra piece of information in the Wilcoxon signed rank test. Just from the name you might be able to predict what piece of information that is.

But let's find out. Here's the exact same scenario again. It's the same example with the same 10 observations. And the same question.

Is the median hemoglobin level for female vegetarians less than 13? Now my question to you is: what if all the women above 13 grams per deciliter, so the ones in blue, were only marginally above that level, whereas the women below 13.0 grams happened to be well below that level? If we have a look at our sample, you'll notice that 13.1 is quite close to 13, as is 13.3.

14.0 is still pretty close to 13. But if you look at all of the pink numbers here, the ones that are less than 13, they sit a lot further below 13. Have a look at this: 10.5, 10.9, 10.1. That's almost three whole units less than this critical value of 13. So is there a way for us to incorporate some of this distance from that median level into our comparison? Yes, there is, obviously.

So we have the same null and alternate hypotheses here. But what we're going to have to do is rank the differences created from each of these observations. So here's a table I've created, where the first column is indeed those observations, the 10 observations we have in our sample. I've kept the color coding the same, so each pink row is from a woman whose hemoglobin level is less than 13 grams per deciliter.

And the blue ones are the positive ones. The second column here is the difference between each particular number and 13. So negative differences are obviously where x is less than 13, and positive where x is more than 13. What we're going to do is rank the absolute value of these differences. So this third column is in fact those absolute values, and the ranking starts at 1 for our lowest absolute difference and counts up to 10, for our 10 observations, at the highest absolute difference.

You'll notice there's actually a tied rank here for this 12.3 value, or I should say for the difference of 0.7. The way tied ranks work is that if you consider these to be ranks 3 and 4, you take the average of those two ranks, 3.5, because the values are the same. After finding these ranks, we then apply the sign back to them. So this ranking of 3.5 happened to be a negative one, whereas the ranking of 1 happened to be a positive one. And already we're getting a bit of a picture that the positive ranks happen to be the smaller ones, whereas the negative ranks, the 10, the 9, the 8, the 7, the 6, they're all the big ones.

They're all the larger differences from the number 13. Now, if you've got statistical software at your disposal, you obviously don't have to create this table. You could just press the button that says Wilcoxon signed rank test and it'll come out with a test statistic and a p-value. But I kind of want to show you what's under the hood, and it's actually not that difficult, especially with a small sample, to calculate this test statistic.
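In fact, here's what that under-the-hood calculation looks like as a Python sketch. One caveat: only eight of the ten values are legible in the example (10.1, 10.5, 10.9, the two 12.3s, 13.1, 13.3 and 14.0), so the 11.5 and 11.9 below are hypothetical fill-ins, chosen only so the signed ranks match the table we just built.

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Sample: 11.5 and 11.9 are hypothetical fill-ins (see note above);
# the other eight values are read straight off the example.
x = np.array([10.1, 10.5, 10.9, 11.5, 11.9, 12.3, 12.3, 13.1, 13.3, 14.0])

d = x - 13.0                      # differences from the hypothesised median
ranks = rankdata(np.abs(d))       # rank the absolute differences, averaging ties
signed = np.sign(d) * ranks       # re-apply the signs to the ranks
print(signed[signed > 0].sum())   # 8.0  -> sum of the positive ranks
print(-signed[signed < 0].sum())  # 47.0 -> sum of the negative ranks

# The one-button version. Note: with tied ranks scipy quietly switches
# to a normal approximation rather than the exact distribution.
stat, p = wilcoxon(d, alternative='less')
print(stat, p)                    # statistic = 8.0, p around 0.02
```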

Now, I'm going to take the first of these two possible test statistics. The first one is just the sum of all the positive ranks, and the next one's the sum of all the negative ranks. So you could look at this either way. But for the purpose of what we're about to use, I'm going to take this top one here, the sum of all the positive ranks, and I'm going to compare it to a particular critical value that'll tell us whether we can reject the null hypothesis or not.

So let's find out how to do that. I've repeated here that our test statistic is 8, because that's the sum of all those positive ranks: the 1, the 2, and the 5. And n is 10. To run this test, if n is small enough, we can compare T, our test statistic, to the exact distribution table. This is something you can find online if you just type something like "Wilcoxon signed rank critical values" into Google.

You'll get something that looks a bit like this. We're doing a one-tailed test, so we're over here in these final two columns. And if we're conducting a test at the 5% level of significance, then it's this particular column we're interested in. And scrolling down, you'll notice that n is on this left-hand column.

So with our number of observations being 10, the critical value supplied to us here is 10. Now, this is the value at or below which we can reject the null hypothesis. And keep in mind, we're always comparing the smaller of the two possible test statistics. Remember, the sum of the positive ranks was 8 and the sum of the negative ranks was 47. This is why I chose the sum of the positive ranks.

So we had a test statistic that was 8, right? So 8 is below 10, meaning that we can reject this null hypothesis. It means our sample is extreme enough for us to reject the null hypothesis, at the 5% level at least.

At the 1% level, you couldn't do it, because the critical value is 5, which is less than 8. Now, if n exists on this table, I would suggest using the exact distribution table. But if n is a little bit larger and this table doesn't extend to it, you can approximate this using a normal distribution. So remember, at the beginning of the video I said we're not going to be dealing with normal distributions and means and all that. When n gets large, of course, the normal distribution will be able to help us out if required.

I'm not going to delve too much into this formula, but you can just sub all the values into it, T being the test statistic. And technically, that could be either of the two test statistics here, because we have this absolute value situation. But let's just say we use the test statistic 8 here.

n is 10. So you can substitute away and find your z statistic is 1.99. In this case, we'll reject the null hypothesis if z is greater than 1.645. Where did I get that from? It's just the typical critical value for z when there's five percent in the upper tail; you could calculate it from your z tables if you so choose. Therefore, either way here, we're going to be rejecting our null hypothesis. If we recap, that means the seven observations we had that were less than 13 had ranks that were quite a deal higher than the three observations that were greater than 13. And the result of this test is quite interesting when read in the context of what happened in the previous test. You'll notice that with the sign test we couldn't reject the null hypothesis, but with the Wilcoxon signed rank test, we could.
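Before we move on, that substitution is quick to verify as a sketch. I'm assuming the usual large-sample formula here, with mean n(n+1)/4 and standard deviation sqrt(n(n+1)(2n+1)/24), which reproduces the 1.99 quoted above.

```python
import math

n, t = 10, 8                                    # sample size, sum of positive ranks
mean = n * (n + 1) / 4                          # 27.5, expected rank sum under H0
sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)  # about 9.81
z = abs(t - mean) / sd
print(round(z, 2))  # 1.99 > 1.645, so reject H0 at the 5% level
```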

So why is it possible to get these contradicting test results for the same data set? Well, it all comes down to what information we're feeding into it. For the sign test, we were just interested in whether the observations were less than 13 or greater than 13. But we incorporated some more information in the Wilcoxon signed rank test, and that was critical. We incorporated the ranks, not just the signs, of each of the observations. So, I don't know, I think that was a very good example to try to distinguish these two tests and how they might be used in a non-parametric context.

Alright, we ready to move on? Let's now look at the Mann-Whitney U test, also called the Wilcoxon rank sum test. And as I signposted earlier, we are going to have two distinct samples here. So: hemoglobin levels were sampled from 10 female vegetarians and 8 male vegetarians in this case. Is there evidence of a difference in median hemoglobin levels?

So here's our sample of females. It's the same sample as before for the females; I've just ordered it differently, from the minimum to the maximum. And here is a sample of eight male hemoglobin levels. So the question is: is there enough evidence to suggest that the median female hemoglobin level is different from the median male hemoglobin level?

And in this case, you'll see it's a two-tailed test. I'm just doing it two-tailed to make a point of difference from the last example. Now, much like the Wilcoxon signed rank test, we're still going to be ranking all of the data. But no longer are we interested in the differences from the particular value 13; we're just going to rank the observations from smallest to largest. So the smallest observation is that female 10.1.

So that gets rank 1. The next one goes up to 10.5, and that's rank 2. And then we get that male observation, 10.8, which is rank 3. And you can continue on for all of the observations.

We still have those tied ranks at 12.3, so they become tied ranks of 10.5 and 10.5, which would have been 10 and 11 respectively. Keep in mind that tied ranks could happen across the categories here; you could have a tie between a male value and a female value.

There's nothing stopping that happening, but in this case it hasn't. So again, the question is going to involve the sum of the ranks for females versus the sum of the ranks for males. And if you sum up all those numbers, you get 86 for the females and 85 for the males.

Now, you might look at those and think straight away, oh, they're very similar, so clearly there's not much difference between females and males. But don't forget, females have two extra observations, so it's not as simple as just comparing those two numbers. So let's find out how we're going to incorporate all of that information.

So in brackets here I've put T1 = 85, and what tends to happen is that we choose the sample with the fewer observations, which in this case is males, and that rank sum is going to be our test statistic. So T1 happens to be Tm, which is 85. Now, to find the expected value of T1, and when I say expected value I mean assuming the null hypothesis is true, we just use this formula here, where n1 is the number of observations in the smaller sample and n2 is the number of observations in the larger sample, so 8 and 10 respectively. Subbing all that in, we get 76. So, all things being equal, we would expect the sum of those male ranks to be 76, but they're not; they're 85. So how are we going to assess whether that's far enough away from 76, whether that's extreme enough for us to reject our null hypothesis?
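Before we answer that, here's the bookkeeping so far as a Python sketch. Same caveat as before: apart from the 10.8 mentioned above, the male values are hypothetical fill-ins, chosen only so the rank sums come out at the 86 and 85 quoted in the example.

```python
import numpy as np
from scipy.stats import rankdata, mannwhitneyu

females = np.array([10.1, 10.5, 10.9, 11.5, 11.9, 12.3, 12.3, 13.1, 13.3, 14.0])
males = np.array([10.8, 11.2, 11.7, 12.0, 13.2, 13.5, 13.7, 13.9])  # mostly fill-ins

ranks = rankdata(np.concatenate([males, females]))  # rank all 18 values together
t_m = ranks[:len(males)].sum()                      # 85.0, the test statistic T1
t_f = ranks[len(males):].sum()                      # 86.0

n1, n2 = len(males), len(females)                   # smaller sample is n1
expected = n1 * (n1 + n2 + 1) / 2                   # 76.0, E[T1] under H0
print(t_m, t_f, expected)

# scipy's version reports the Mann-Whitney U rather than the rank sum;
# the two are related by U = T1 - n1*(n1 + 1)/2 = 85 - 36 = 49.
u, p = mannwhitneyu(males, females, alternative='two-sided')
print(u, p)                                         # U = 49.0, p well above 0.05
```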

Well, again, it's going to differ depending on whether n is small or large. Always use the exact table if it's available to you. Even though I've written "if n is small", there's no harm in using this table whenever you can find an exact critical value for your particular combination of n1 and n2. But let's see how that works. Again, you can find this on the internet if you just type in "critical values for the Wilcoxon rank sum test".

Or you could probably write "Mann-Whitney U test" as well. You'll get something that looks like this. The first two columns here represent the two sample sizes, m being the smaller of the two sample sizes and n being the larger.

So we'll scroll down until we have our combination of 8 and 10: 8 males, 10 females. And W here provides us with the critical values for this hypothesis test. So if our test statistic lies outside this interval, so less than 53 or greater than 99, then we'll be able to reject the null hypothesis. And for those advanced players at home, you'll notice that the middle of this interval happens to be 76, which should be no surprise to you. But we have our test statistic of 85, so clearly in this case we're not going to be rejecting. And just checking we're in the correct column: this is for where we have a two-tailed test and alpha is 0.05, which is in fact what we have.

So that's good. Now, don't worry so much about what D represents. P is actually the p-value associated with this interval. Appreciate that, because this is a discrete distribution, this p-value won't always be exactly the value you would expect from these levels of significance.

It would be slightly less each time. But that's a bit of an advanced topic, so I'm not going to jump into it at the moment. The important bit is this interval here, which we've now dealt with.

And just for completeness, let's see what happens if n, or I should say n1 and n2 here, happen to be too large for this table. How would you deal with it? Well, again, it's the normal distribution to the rescue.

We know that in large samples these rank sums tend towards a normal distribution thanks to the central limit theorem. So you can just use this formula, subbing in T1, the expected value of T1, and all the n1s and n2s you can fill in there. And we find that the z-score here would be 0.8.

Now, technically I don't think n is large enough in our case to warrant a normal approximation, but I've just done it there for completeness. Unsurprisingly, it accords with our result from the exact table, because that z-score is quite low. There's no way we'll be rejecting a null hypothesis based on that z-score.
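That z-score is easy to verify by hand, assuming the standard rank-sum approximation z = (T1 - E[T1]) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12), which reproduces the 0.8 just quoted.

```python
import math

n1, n2, t1 = 8, 10, 85                        # sample sizes and the male rank sum
expected = n1 * (n1 + n2 + 1) / 2             # 76.0
sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # about 11.25
z = (t1 - expected) / sd
print(round(z, 2))  # 0.8 -- nowhere near the two-tailed 5% cutoff of 1.96
```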

So, quite rightly, we say do not reject H0. There's not enough evidence here to suggest that the median for females differs from the median for males. So that's it. That's non-parametric testing for statistical inference. Thanks for watching.

As I said at the beginning, all the other videos are up on zstatistics.com, along with a whole bunch of other statistical resources. There's a podcast which I've started which you might be interested in, and a few other things. So yeah.

Finally, don't forget to subscribe to this YouTube channel. Catch you around.