Before we get into the nitty-gritty details of null hypothesis significance testing, let us start with a research example to set the scene. The baby in the picture suffers from Hutchinson-Gilford progeria syndrome, an extremely rare and progressive genetic condition that produces rapid ageing in children. The average life expectancy of children with the syndrome is about 13 years, and cardiovascular diseases such as heart failure or stroke are common causes of death in the teenage years. Unfortunately, there is no cure for progeria at the moment, but research into treatments is ongoing. As part of that effort, a clinical study in 2012 examined the effect of treatment with a drug called lonafarnib on a number of physiological outcomes. Among other things, the researchers measured the pulse wave velocity (PWV) of 18 children diagnosed with the syndrome. PWV is often used as a measure of vascular stiffness, which is an important factor in cardiovascular health, and it is known to be abnormally high in children with the syndrome. In healthy children, the average PWV is about 6.6 metres per second or less. With that background, one of the research questions of the study can be stated as follows: is the PWV of children with progeria different from that of healthy children?

Even though I have made the research question simpler than what is on the slide, in general you can summarise the research questions of a study in more detail by identifying what is measured, why it is measured, and what results are expected. This applies not only when you conduct your own research but also when you read and critique other people's research. So what is the outcome measure here? What is being measured? It is pulse wave velocity, PWV, in metres per second. And why did they measure PWV? Because PWV is an indicator of arterial stiffness, which is abnormally elevated in children with progeria compared to healthy children and is linked to their accelerated cardiovascular ageing. So once everything is measured, the PWV of the progeria children as a group is expected to be higher than the given cutoff of 6.6 metres per second.

But the question is: by how much should they differ? Suppose you measure the average PWV of the progeria children and it turns out to be 6.7 metres per second. Can we then say the difference is large enough to call them abnormal? Maybe or maybe not, because there are issues such as the accuracy and precision of the measurement as well as individual variation in PWV. Given that the measurements are collected using the same method, meaning the accuracy and precision are held constant, what matters most is the sampling variation. Remember from the central limit theorem that every random sample will have a different sample mean, because the individual makeup of each sample is different? Since there are a million different ways to draw a sample of progeria patients, we need to take the sampling variation into account, at least for the given sample size. This is why people run null hypothesis significance testing as a formal way to test their research question quantitatively: it was developed exactly for that purpose.
In doing so, they can claim that their result is not just sample-specific but generalisable, because the sampling variation has been considered. Now we need to split the research question into two competing hypotheses, namely the null and the alternative hypotheses. Typically, the alternative hypothesis, or H1, is your research hypothesis, the one you support or expect to hold, set against the null hypothesis, also known as H0 or H naught. The null plays devil's advocate: it is the antithesis of the alternative hypothesis, representing the "what if there is nothing there" position. The reason we keep two competing hypotheses is that neither of them can be shown to be 100% true or false in the game of null hypothesis significance testing. Whichever hypothesis you choose at the end, you should always remember that you are only making a probabilistic statement, allowing for the possibility, however unlikely, that the hypothesis you did not select might still be true in reality. In that sense, you should never say that you have proven anything with null hypothesis significance testing. NEVER, EVER.

Having said all that, let's turn our research hypothesis into the alternative hypothesis: we expect the average PWV of the patients to be different from the normal value of 6.6 metres per second. More specifically, we expect the difference to be in the direction where the average PWV is greater than 6.6. (If it were less, we would have to suspect we had the wrong group of patients.) The former, stating only that the values will differ, is the default way of setting up an alternative hypothesis and is called a two-tailed or two-sided hypothesis. The latter, where you are interested in an effect in a specific direction only, is called a one-tailed or one-sided hypothesis. Even when the expected direction of the outcome seems crystal clear, as in the current example, two-tailed testing is still recommended, because it considers both directions at once and the direction of the outcome will become evident in the end anyway. There is a lot of controversy over which one to use, but I am not going to go over that in much detail here. When in doubt, just stick to two-tailed testing and you will be safe. However, I will use both one-tailed and two-tailed testing as we go, for illustration purposes, to help you better understand the mechanics of null hypothesis significance testing.

For the current example, let's set our H1 as one-tailed, as follows: on average, the PWV of the patients will be greater than 6.6 metres per second. Once you have your H1 set up, the null is simply the position of no effect: on average, the PWV of the patients will be the same as 6.6 metres per second. Be careful not to imply a direction when setting up the null, even when your alternative hypothesis is one-tailed. For example, suppose you set up H1 as above: on average, the PWV of the patients will be greater than 6.6 metres per second. The null is still stated the same way, because the null is about no difference and no effect, and there is no direction in which something can "not differ". Therefore, it would be wrong to state the null as "on average, the PWV of the patients will not be greater than 6.6 metres per second."
That is just not the right way to set up the null. Setting up the hypotheses in words can be quite a mouthful, so you can use symbols instead to make them simpler. Because we are making an inference about the population, we use Greek letters to represent the parameter, which in this case is the population mean. The mu (μ) here represents the population PWV we are trying to infer, and we want to see whether this parameter is the same as or different from the given testing population mean, which is generalised as mu zero (μ0). In our case, μ0 is 6.6 metres per second. Stating the null as μ = μ0 and stating it as μ − μ0 = 0 say the same thing: if you assume no difference between the population mean and some test value, then you expect the difference between the two to be zero. If your alternative hypothesis is two-tailed, you are saying the two values will simply differ: subtract one from the other and the difference will not be zero. It will be either positive or negative, but we do not know which. If instead you set up your alternative hypothesis one-tailed, as here, you expect the population mean in question to be greater than the cutoff value, so the difference μ − μ0 should be positive.

Once you have finished setting up the pair of hypotheses, you need a decision rule so you can later choose one of them. Typically, two probabilities are compared to make a decision. One is the P-value, the probability, under the sampling distribution of the test statistic assumed by the null, of observing a test statistic as big as the one calculated from the data, or a more extreme one. The other is the level of significance, or alpha, a probability fixed before data collection to be compared against the P-value. The choice is somewhat arbitrary, but the general consensus is to use 0.05, or 5%, unless noted otherwise. The rule is simple: when the P-value is smaller than the alpha of 0.05, your result is statistically significant and you reject the null. In other words, your data are in support of the research hypothesis. Otherwise, you fail to reject the null, and you do not have strong enough evidence to support the research hypothesis. We will talk about what the alpha of 0.05 and the P-value mean in more detail later.

So here is the sample of PWV data collected from the 18 patients. Let's calculate the mean and the standard deviation of this sample using Jamovi. There we go: looking at the distribution, it is a bit skewed. The mean is 12.4 metres per second and the standard deviation is 3.64 metres per second. The difference between this mean and the normal value of 6.6 seems big: the sample mean is almost twice the normal value. But the mean difference is only part of the story. No matter how big the difference is, it will be muddled if there is huge variability around it, although the standard deviation here does not look particularly large.
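Before we build the test statistic, it may help to restate the hypothesis pairs from above compactly in symbols (a LaTeX summary of what we just set up in words, with μ0 = 6.6 m/s in our example):

```latex
H_0:\ \mu = \mu_0 \quad\Leftrightarrow\quad \mu - \mu_0 = 0
H_1\ (\text{two-tailed}):\ \mu \neq \mu_0 \quad\Leftrightarrow\quad \mu - \mu_0 \neq 0
H_1\ (\text{one-tailed, as here}):\ \mu > \mu_0 \quad\Leftrightarrow\quad \mu - \mu_0 > 0
```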
Now let's calculate how unlikely it is to observe a sample mean as big as the one we have, or bigger, considering both the size of the difference and the variation of the test statistic together. Given the research context, the test statistic of choice would be the one-sample z-statistic, which we can calculate using the equation introduced previously:

z = (x̄ − μ0) / (σ / √n)

Remember that this equation describes how sample means vary around the mean of the population from which the samples are drawn. The statistic therefore tests how far from, or close to, the population mean μ0 a sample mean x̄ of size n is, given a known population standard deviation σ. The trouble is that σ is almost never known. In many instances, including this one, we do not know the true population standard deviation; that is exactly why we are using a sample in the first place. Otherwise there would be no reason to infer the population parameter from a sample, because we would already know everything about the population. Therefore, in practice, we use the sample standard deviation s in place of σ. When s is used in place of σ, the resulting statistic no longer follows the normal distribution. The sampling distribution of

t(df) = (x̄ − μ0) / (s / √n)

follows the t distribution with n − 1 degrees of freedom. The df inside the brackets is the degrees of freedom, here one less than the sample size, and the shape of the t distribution changes as a function of the degrees of freedom.

What we see in the graph are four different t distributions with different degrees of freedom. As you can see, the t distribution looks very much like a normal distribution except that it is leptokurtic: the centre looks slender and the tails are heavier than those of the normal distribution. For example, the black curve represents the t distribution with an infinite number of degrees of freedom, which is the same as the normal distribution. Compared to the black one, look at the yellow curve with one degree of freedom; the ν (nu) in the legend is the Greek symbol for degrees of freedom, and ν = 1 corresponds to a sample size of two, since df = n − 1. The yellow curve is slender in the middle and has fatter tails than the black one. That is the characteristic of the t distribution: with a large sample size, it approaches the normal distribution.

Now that we have all the numbers, let's calculate the t statistic with 18 − 1 = 17 degrees of freedom. The numerator is 12.4 − 6.6 = 5.8, and the denominator is 3.64 divided by the square root of 18, which is about 0.86:

t(17) = (12.4 − 6.6) / (3.64 / √18) = 5.8 / 0.86 ≈ 6.76
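To double-check that arithmetic, here is a minimal Python sketch (the summary numbers are taken from the lecture; only the standard-library math module is needed):

```python
from math import sqrt

# Summary statistics from the lecture example
xbar, s, n, mu0 = 12.4, 3.64, 18, 6.6   # sample mean, sample SD, n, test value

se = s / sqrt(n)             # standard error of the mean, ~0.86
t_obs = (xbar - mu0) / se    # one-sample t statistic

print(f"SE = {se:.3f}, t({n - 1}) = {t_obs:.2f}")   # SE = 0.858, t(17) = 6.76
```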
What we just did was standardise the sample mean by shifting it against the testing population mean of 6.6 and dividing the difference by the standard error. That way, the size of our t statistic tells us how close to, or far from, the centre of the sampling distribution the sample mean is. Now, how do we decide whether this t value is large enough to say that the sample mean is statistically different from 6.6 metres per second? As I said earlier, you compare the probability of observing the test statistic under the null distribution against a preset probability, the decision rule we put in place before calculating the t statistic. This preset probability is the level of significance, alpha, typically set at 0.05: the area under the sampling distribution representing the likelihood of observing the critical statistic or a more extreme one.

The figure shows the standardised sampling distribution of t with 17 degrees of freedom, representing the null distribution. The curve tells you the relative likelihood of observing a given test statistic under those degrees of freedom. The most likely observation under this curve is t = 0, which represents no difference between the sample mean and the testing population mean; this is the default position of the null, assuming no difference. However, because of sampling variation, other observations are also possible, with different likelihoods. Due to the symmetry of the t distribution around its centre, the further the observed statistic is from the centre, in either direction, the more quickly its likelihood drops off, and statistics falling in either tail end are very unlikely. So if the test statistic falls far out in a tail, obtaining a statistic that large is very unlikely, and it becomes more and more implausible that the population the sample came from has a mean equal to the testing population mean, because the difference is quite large.

But how unlikely should it be? We need to draw a line somewhere to make the call, and that line is your level of significance, alpha: 0.05, or the 5% chance level. It is the area at the tail end representing a 5% chance that a sample statistic falls beyond it. In our current example, alpha lies in the right tail of the curve, because our H1 only concerns whether our sample mean is greater, not less, than 6.6 metres per second. (The point t = 0 is where 6.6 metres per second lies.) Because we expect our statistic to be greater, we only look at the right tail, which represents a positive difference. If we had predicted our sample mean to be less, alpha would lie in the left tail, representing a negative difference. Just for the sake of illustration, you can think of alpha as the betting odds set before the game starts: you place your bet given the odds, you run the experiment knowing your alpha, and you can claim the prize money only if the result beats the odds. Otherwise, you don't.
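In symbols, with T17 denoting a t-distributed variable with 17 degrees of freedom, the preset one-tailed odds amount to:

```latex
\alpha \;=\; P\big(T_{17} \ge t_{\text{crit}} \mid H_0\big) \;=\; 0.05
```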
Just to reiterate: these are preset odds, fixed before you collect the data. You cannot change them after you see the data, just as you cannot change your bet after the game is over. By definition, the critical statistic is the boundary beyond which the area under the curve becomes 0.05, or 5%. So t crit, the critical statistic, is the line on the axis, while alpha 0.05 is the area under the curve from that boundary out to the tail end. In this case we only consider the right tail, but for a two-tailed hypothesis you would consider both tail ends, which means you would have a second critical statistic; we will talk about that later. The critical t statistic differs depending on the sample size, and we can use Jamovi to figure out what the value should be for our sample size.

So here is Jamovi again. You have probably noticed that the menu icons up here look different from yours. When you first install Jamovi, the first six menus are the basic bundled menus, but people have created additional modules to run specialised statistics for different research contexts. To calculate the area under the curve, or to find the critical statistic of a certain distribution, in our case the t distribution, you need a module called distrACTION. You can download and install modules by clicking the plus sign labelled Modules, which shows the jamovi library and the list of installed modules. Click "Manage installed" and you will see what is installed: jmv, the analyses bundled with Jamovi, provides the default six menus. Click "Available" and you will see the different modules you can install; for the current purpose you need distrACTION. It is already installed on my machine, but you will probably see an Install button, so if you have not done so, please click Install and make sure the module shows up in the menu.

Let me increase the font size and click on distrACTION. You will see the continuous distributions you can use: normal, t, chi-squared, and F. We need the t distribution, so click t. The menu is quite simple: you give Jamovi the necessary parameters of the distribution. df is the degrees of freedom, which for our example is 17, and lambda is the location parameter of the distribution, its centre, which in this case is zero because the t statistic is standardised, so we do not have to change it. There are two functions: the left one computes probabilities and the right one computes quantiles. If you click outside the menu, it draws the distribution, in this case the t distribution with 17 degrees of freedom.
We want to find the critical statistic in the right tail, beyond which the area under the curve becomes alpha, 0.05. The quantile function is based on the cumulative distribution, so tick the box to activate it; what we need is the cumulative quantile up to 95%. (I am really talking about percentiles here, which are a special kind of quantile.) Because the right tail end is 5%, the remaining area under the curve to the left is 95%, which is 1 minus alpha. By convention, the cumulative function calculates the area under the curve from left to right; this comes from the calculus convention where the integration of a function starts from negative infinity and moves to the right. So you enter 0.95 and compute the cumulative quantile. What it calculates is the 95th percentile, and it gives you the corresponding t value on the x-axis: the boundary at which the area to the left becomes 95%, so the remaining right tail is 5%, or 0.05. That boundary is 1.74. That is our t crit, the critical statistic: from that value out to the right tail end, the area under the curve is 5%, alpha 0.05.

To check this, we can type the value in and compute the probability. The capital P here represents the cumulative probability that X, any t value, is greater than or equal to the x1 value we enter. Tick the box, and the yellow area represents the probability when the statistic on the x-axis is 1.74: it is exactly 0.05. The two go hand in hand, so what we calculated is correct. Going back to our slide, the critical t statistic is 1.74 for the t distribution with 17 degrees of freedom. That is how you calculate the critical statistic in Jamovi from the area under the curve.

Finally, the P-value is the probability of observing a sample statistic as big as the one we have, or more extreme. In this case, our sample statistic becomes the boundary for calculating the area under the curve, and that area is the P-value of the statistic. Again, we can calculate the exact P-value from the statistic and the sample size using Jamovi. Back in Jamovi, untick the quantile function so we can compute the probability instead. What we are interested in now is the P-value for our test statistic, which was 6.76, so we plug in that number to find the probability of seeing a statistic as big as this or more extreme, out to the right end. Jamovi displays P = 0, which is not literally true: the value is just so small that the display is not showing the digits. It is not zero, and you cannot say it is zero, because no P-value ever equals zero under null hypothesis significance testing. What this means is that, if we go back to the slide, our critical statistic was at 1.74, and our test statistic, 6.76, is way out beyond it.
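If you prefer code to the distrACTION dialog, a minimal Python sketch (assuming SciPy is available) reproduces both numbers:

```python
from scipy import stats

df = 17                            # degrees of freedom, n - 1
t_crit = stats.t.ppf(0.95, df)     # 95th percentile: one-tailed critical value
p_val = stats.t.sf(6.76, df)       # upper-tail area beyond the observed t

print(f"t_crit = {t_crit:.2f}")    # ~1.74
print(f"p = {p_val:.2e}")          # tiny, but never exactly zero
```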
So it is far out in the tail, and the area beyond it is much smaller than the 5% beyond the critical statistic. In other words, our t statistic from the data is much greater than the critical t statistic, and the P-value, the probability of seeing a statistic this large or larger, is much smaller than alpha, 0.05. That means our test statistic is significant: it is significantly different from no difference.

Here is a summary table of how to make a decision with null hypothesis significance testing. The conventional way is to compare the P-value against alpha, 0.05. This is the more common approach in significance testing in general, because alpha basically never changes unless noted otherwise, so you do not have to calculate it each time. You do need the P-value for your statistic, but your statistical software will pretty much give it to you automatically, so all you have to do is compare the P-value against alpha. When the P-value is less than alpha, 0.05 (not 0.5!), you reject the null of no difference: it is not likely that your statistic equals the cutoff or test value you are comparing against, so your sample mean is statistically different from that value. Otherwise, you fail to reject the null. I know you may find it strange to make a decision this way, but it is all about the null, as the name of the game, null hypothesis significance testing, suggests. Some people phrase this as accepting or rejecting, but I do not like that convention: I only reject the null or fail to reject the null, and you might want to get used to making the decision this way. All you have to remember is to reject the null of no difference when P is less than alpha, 0.05. That means your alternative hypothesis is supported: you have strong evidence for your research hypothesis. Otherwise, you fail to reject the null, and you do not have strong enough evidence to support your alternative hypothesis.

The other way to make the same decision is to use the test statistic, in our case the t statistic. The two vertical lines around t denote the absolute value: you do not care about the sign of the value, only its size. If |t| is greater than the critical t, beyond which the area under the curve is 0.05, you reject the null; otherwise you fail to reject it. Note that the inequality goes in opposite directions between the two rules: with the P-value you reject the null when the P-value is less than alpha, 0.05, whereas with the test statistic you reject the null when |t| is greater than the critical statistic. If that is too confusing, just use the P-value rule. All you have to remember is that you reject the null when P is less than alpha, 0.05. That's all there is to it.
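As a minimal sketch of the two equivalent decision rules (hypothetical helper code using the numbers from our example, again assuming SciPy):

```python
from scipy import stats

alpha = 0.05
t_obs, df = 6.76, 17                  # observed t statistic and degrees of freedom

p_val = stats.t.sf(t_obs, df)         # one-tailed P-value
t_crit = stats.t.ppf(1 - alpha, df)   # one-tailed critical value, ~1.74

# Rule 1: reject the null when the P-value is LESS than alpha.
reject_by_p = p_val < alpha
# Rule 2: reject the null when |t| is GREATER than the critical statistic.
reject_by_t = abs(t_obs) > t_crit

print(reject_by_p, reject_by_t)       # both True here: reject the null
```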
The previous hypothesis test is called a directional, one-tailed, or one-sided test, because our alternative hypothesis concerned only a single direction of the difference, and that direction was made explicit in the alternative hypothesis. In many cases, however, you do not have a clear prediction about the direction of the difference; it is not always the case that you know the expected direction of the outcome, which is why two-tailed testing is safer. Just to reiterate: when you set up a pair of hypotheses, the null is always stated as "there will be no effect, no change, or no difference" between the values being compared. No direction is implied in the null. The direction of the difference is made explicit only in the alternative hypothesis, when it is set up one-tailed. So if you do have a directional prediction about the difference, your hypothesis is one-tailed or one-sided; if the direction does not matter, or you are not sure about it, you make your alternative hypothesis two-tailed or two-sided.

In our previous case, we were one-tailed in the right tail, because we expected μ to be greater than μ0, 6.6, and the P-value was the area under the curve to the right of the observed t (with the critical boundary at 1.74). If instead you expect your mean to be less than a certain value, you are concerned with the left tail, and the probability is the area under the curve from the negative t leftwards to negative infinity. When you have no directional hypothesis, your alternative becomes two-tailed: you predict that the population mean will simply differ from μ0, so the difference can be in either the positive or the negative direction. That is why there is an absolute-value sign around t: you take the area beyond −|t| on the left and beyond +|t| on the right and add them, which amounts to two times the one-tailed area, to get the two-tailed P-value.

This is pretty much all there is to playing the game of null hypothesis significance testing, and the principle applies to all the other statistical analyses: you run the statistic and make a decision by comparing the P-value of that statistic against alpha, 0.05. If your P-value is smaller than alpha, 0.05, you have something significant. Otherwise, you don't. That's all there is to it. Next time, we will talk about different test statistics and the research contexts in which each statistic is appropriate.
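To make the doubling concrete, here is a small Python sketch (again assuming SciPy; t_obs is our observed statistic from earlier):

```python
from scipy import stats

t_obs, df = 6.76, 17

p_one = stats.t.sf(abs(t_obs), df)   # one-tailed: area beyond |t| in the right tail
p_two = 2 * p_one                    # two-tailed: add the matching left-tail area

print(f"one-tailed p = {p_one:.2e}, two-tailed p = {p_two:.2e}")
```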