Okay, hello everyone. Just a quick announcement: I posted the next course project on Canvas, in the modules section. Hopefully you'll find this one a fair bit easier than the previous one. If you look at what you have to do, it says you have to calculate something called a one-sample t test and create a QQ plot; we're going to cover what those are, so don't worry about that. The rest should be pretty straightforward: making a histogram, calculating some basic statistics, things we've already done here quite a bit.

With that out of the way: at the end of the last lecture we were talking about sampling distributions, and it became clear to me that my explanation may not have been sufficiently clear to some of you. So I made you a present: I made this lovely graph in R to illustrate how sampling distributions work and what they are.

Imagine you have a population. As I say, it doesn't matter what that population is; it's just a population of something you're interested in measuring. And let's assume, for the sake of example, that the distribution of that population looks like this: a really skewed distribution. So that's the population. Now, here's the thing: when you go and do research, you collect a sample. So suppose a researcher goes out, collects a sample, and calculates the mean of that sample. Then a different researcher comes along, collects a sample from this same population, and calculates the mean of that. Then a third researcher comes along, collects another sample, and calculates the mean of their sample, and this continues ad infinitum: researchers keep coming along, collecting samples, and calculating the mean of their
samples. Then you take all of these means calculated by all of these researchers and dump them into a histogram, or in this case a density plot. What you end up with is called a sampling distribution of the mean: you're literally taking a bunch of means as your data and looking at how they're distributed. And what's really fascinating is that when you repeatedly sample from a population like that, the more samples you collect, in other words, the more means you dump into your sampling distribution, the more it approaches normality.

In this example, I took samples of size 10. So on this first sampling distribution of the mean, the number four means that four researchers came along, each sampled 10 observations from this population, and calculated a mean for their sample; there are four means inside that first distribution. And you can see it looks pretty rough: it's basically a weird multimodal distribution. On the next line, we have what happens when eight researchers come along and dump eight means into the sampling distribution. Then here we have 16 researchers, each with a sample of 10, so 16 means dumped into a distribution. As you keep increasing the number of samples taken, and thus the number of means calculated, the sampling distribution eventually becomes a normal distribution, and it becomes normal even though the underlying population being sampled from is skewed. Is that clear? Do you understand what's happening there? Yeah? All right. So let's head back to my R example.
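The progression in that figure can be sketched as a short simulation. This is a hypothetical reconstruction in Python with NumPy rather than the lecture's R code, and the exponential population here is just a stand-in for the skewed population in the figure:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical skewed population (exponential, mean about 1), standing
# in for the skewed population shown in the figure.
population = rng.exponential(scale=1.0, size=100_000)

# 4, 8, 16, ... "researchers" each take a sample of 10 observations
# and record its mean; the more means collected, the smoother and more
# normal the pile of means looks when plotted.
for n_researchers in (4, 8, 16, 1000):
    means = [float(rng.choice(population, size=10).mean())
             for _ in range(n_researchers)]
    print(n_researchers, round(float(np.mean(means)), 2))
```

Plotting `means` as a histogram at each stage reproduces the multimodal-to-normal progression described above.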
Recall what we did: on this first line, I created a population. Let's rerun this; where's the plot, there we go. You can see this population is really, really skewed. Then a researcher comes along, takes a sample of 50 values from this population, and calculates the mean; that's what's happening here. I'm just randomly sampling 50 values from the population and calculating the mean, and you can see we get a mean of 0.999, etc. Then another researcher comes along and does the same thing; they get a very similar mean, not quite the same, but pretty close. Then another researcher comes along, calculates it, and gets a slightly different mean, because they're getting different individuals in their sample, so it makes sense that the mean fluctuates.

Then I have this code here that you don't need to understand. All you really need to know is that it repeats the process I just showed you: pulling samples and collecting means. In this case, each individual sample that a single researcher collects is going to have 50 values in it; that's what I mean by sample size here, the size of each individual sample. And right here is the number of samples that are going to be taken. You can think of it as 10,000 researchers coming along, each taking a sample of 50 individuals from this population and calculating the mean for their sample, which gives us 10,000 means, and then I dump all of that into a distribution. That's all this code is doing. So if I run this, hopefully my computer handles it, because it
is recording video, and there are multiple monitors and all sorts of things happening, so hopefully this doesn't take too long. There we go. And look at that: we get this lovely, normally distributed result. Pretty nice, and this is really one of the reasons the normal distribution is so useful. Just FYI, df here, this data frame I've got, contains all the means that were calculated, in case you're curious; if you run this code at home, you can play around with it. As I say, this data frame contains every single mean: the first sample had a mean of 1.35, the second sample had a mean of 0.96, the third had a mean of 1, and so on.

As we talked about, the principle we're seeing here is the result of a mathematical principle known as the central limit theorem. Basically, the central limit theorem says that the sampling distribution of any statistic you care to name becomes normal as the number of repeated samples goes to infinity, and that's precisely what we see in the image I gave you: the more samples we add, the more normal the sampling distribution becomes. Now, the downside is that you never really know when the sampling distribution is going to be sufficiently normal. And there is really just one criterion for the central limit theorem to work: the samples need to be independent of one another, and the observations should be randomly collected, which is one of the reasons random sampling is so important.

So that's all review; we technically talked about that last class. We've got our population, and we have this sampling distribution, the thing that was spit out at the end here.
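The R script just described can be sketched as follows. This is a hypothetical Python/NumPy equivalent, not the lecture's actual code, and the exponential population stands in for the one created in R:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population (exponential, mean about 1), standing
# in for the population created on the first line of the R script.
population = rng.exponential(scale=1.0, size=100_000)

# 10,000 "researchers" each draw a sample of 50 and record its mean.
means = rng.choice(population, size=(10_000, 50)).mean(axis=1)

# Despite the heavily skewed population, the pile of 10,000 means is
# roughly symmetric and bell-shaped: centered near 1, with small spread.
print(round(float(means.mean()), 2))
print(round(float(means.std()), 2))
```

A histogram of `means` gives the bell-shaped plot shown in class.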
And what's kind of interesting is to examine the mean of this sampling distribution alongside the mean of the population distribution, and we can do that in R with this code. Recall we created the population right here, so we can look at the mean of that population: we'll say mean(population), and the mean of the population is 1. Now let's also take a look at the mean of the sampling distribution I generated; in other words, the mean of this histogram I made. So we'll say mean, and I have all of the data stored in df here, so we'll just grab all of those means and run that. Let's put the two side by side so it's easier to see. What you can see is that the means are almost identical, and you might imagine that if you had an infinite number of samples in your sampling distribution, these two numbers would be completely identical. Mathematicians have actually proven that when you use random sampling, that is exactly what occurs. More specifically, we would say: when random sampling is used, the average of all of the sample means is equal to mu, your population mean. And this is the reason sampling distributions are important, so let's write that down: when random sampling is used, the average of all the sample means is equal to mu, the population mean. In other words, the mean of this distribution right here is the same as the mean of that distribution right there, which is pretty remarkable.

In practical terms, what this means is that the mean of a sample is an unbiased estimate of your population mean. In other words, we can use the mean of a sample we collect to estimate the mean of the population, because over the long run, with repeated sampling, you're going to converge on a
result that is equal to the population mean, which is usually what we're interested in as researchers.

Now, if we look at the histogram of the sampling distribution here, we can get a pretty good sense of how much the mean fluctuates from sample to sample, and we can also see pretty clearly that it is normally distributed; as I say, mathematicians have shown that after an infinite number of samples, this will be perfectly normal. Here's the thing, though: if we know the sampling distribution is normally distributed, and we know the mean of this sampling distribution is equal to the mean of the population, then we can use this distribution to make an inference about the population mean, which is something we don't usually know. Specifically, we can determine how likely we are to obtain a mean of a certain size when we collect a sample from the population, which, if you're a researcher collecting samples, is a useful thing to know.

All we need is the standard deviation of this sampling distribution, which technically we can calculate right here, because we generated it and all the data is in memory. So we can just say sd(df$means), and we get a standard deviation of about 0.14. Now that we have the standard deviation, we can use it to say something meaningful. Recall that 95% of the data in a normal distribution falls within two standard deviations of its mean, or technically within 1.96 standard deviations. What that means is that we can calculate an interval, so why don't we do that real quick: we'll take the mean of df$means minus 1.96 times our
standard deviation, sd(df$means), and then we'll do the same thing with a plus. Okay, so we've just calculated an interval: basically 0.73 to 1.28. Between these two values, there is a 95% chance that the population mean falls; that is what we have just calculated. More technically, this is what's referred to as a 95% confidence interval. So let's define that: a confidence interval is a range of values that is likely to contain the true population mean. In other words, the interval we calculated tells us there's a 95% chance that the true population mean is somewhere between those two values, and the reason we're able to say that is because the mean of the population and the mean of the sampling distribution are equivalent.

Now, it should be obvious that the confidence interval I've just calculated is computed no differently than the 95% intervals we've calculated in the past. The reason we call this one a confidence interval, though, is that the standard deviation we're using here is the standard deviation of the sampling distribution. Because of that, it allows us to talk about populations. That's the key difference between what we've done here and what we've been doing previously: this interval tells us something about the population, whereas the intervals we calculated previously only told us about the data of a single sample. And this brings us to the concept of standard error. So let's go back here. Basically, when I calculated the standard deviation of the
sampling distribution, the number that was produced is what we call the standard error. So let's define it: the standard error is the standard deviation of a sampling distribution. That's all the standard error is. Now, the reason we call it error is that, if you think about it, that's really what it's quantifying. The histogram I plotted here is really just showing us the natural fluctuations of the mean that you see from sample to sample, and you can think of those fluctuations as error. The true population mean is what we're interested in as researchers, and the estimates we obtain from all the samples we collect deviate from that true population mean to greater or lesser degrees. Those deviations from the true population mean we can also just call error: the terms deviation and error mean the same thing; they're synonymous. So let's write that down: deviation = error.

Now, recall that I mentioned in the last lecture that if you take a bigger sample, your sampling distribution approaches normality faster. You can see the progression to normality in this image. If you take bigger samples, say samples of 100 individuals as opposed to 50, you get to normality faster, and vice versa: if you take smaller samples, you get there slower. For our histogram, we had 50 observations in each sample that was taken, and that gave us this standard error of 0.14. If I had had 100 observations in each sample, this standard error would
actually be smaller; in other words, our sampling distribution would have less spread around the mean. Whereas if we had, say, only 10 observations instead of 50, the standard error would be bigger: the variability would expand, our sampling distribution would be wider, with greater spread around the mean. And this is why sample size is so important in research: a larger sample size does a better job of approximating the true population mean than a smaller one. What this means is that a 95% confidence interval produced with a sample size of 50, which is what we used here, is going to be smaller than a confidence interval produced with a sample size of 10. And a smaller confidence interval is a good thing, because it means you have a more precise idea of where that true population mean is located.

I'm going to write this down for you, because this is a really important set of points. Bigger sample sizes shrink standard error, and thus produce smaller confidence intervals; you'll see why when we write out the formula for standard error. In other words, you have a more precise estimate of where the true population mean is. By contrast, smaller sample sizes increase standard error, and thus produce larger confidence intervals: that's the bad case, and the former is the good case. A larger sample size makes the spread of that sampling distribution, the spread of that green histogram, smaller, so you can better pinpoint where the true population mean actually falls. You can think of a large sample size as being equivalent to a more powerful microscope: it allows you to narrow in on something a lot better. Now, the actual mathematical relationship between the
standard error of the mean and sample size is actually pretty straightforward. Let's label this "standard error of the mean," just to be specific, and write out the formula; luckily, it's pretty simple. The way we notate the standard error of the mean is sigma with a subscript x-bar: σ_x̄. We're saying the standard deviation of the mean, or the standard error of the mean; it's the same thing in both cases. And the formula is your population standard deviation (that's the Greek sigma) divided by the square root of your sample size: σ_x̄ = σ / √n. When I say sample size, I mean the sample size of an individual sample, so in the case of our example it would be 50. Basically, in English, what this says is that the standard deviation of our sampling distribution is equal to the standard deviation of our population divided by the square root of the sample size of an individual sample.

And we can demonstrate this in R. Since we have the actual population (we artificially created it), I can calculate the standard deviation of the population, and if the formula I just gave you is correct, then the standard deviation of our population divided by the square root of 50 should give us basically the same number. Now, we don't have an infinite number of samples, so it won't be exact, but it should be pretty close. So if we say sd of our population... oh, sorry, this should be the
standard deviation of the means; there we go. So this is one way to calculate the standard error. Then, if we take the standard deviation of the population and divide by the square root of 50, like that, these two numbers should be basically the same, and you can see they're pretty darn close. If the sampling distribution of the means had contained an infinite number of values, you would find that these two numbers were identical.

Now, this is pure theory; it's not something you have to worry about calculating at this point, and I'll explain why in a bit. The only thing you really need to grasp here is that the standard error is the standard deviation of the sampling distribution, and this is how you calculate it. That's what I want you to take away from this. And using the standard error, we can calculate a confidence interval, which gives us a sense of where that true population mean is. I'm not really going to ask you a calculation question about this; it's more about understanding why sampling distributions are important. As we move on, you'll see how all of this ties together; I realize it's a little abstract right now, but there is a reason for talking about it.

So let's talk about that reason; let's talk about using this in the real world. There's a problem with everything I've just told you, and you may have caught on to it, or maybe you haven't. The problem is this: when you do research in the real world, you don't have access to a sampling distribution, let alone one with a theoretically infinite number of samples. It's almost always the case that you have one sample, and that is the sample you collected for your research. But if you look at our formula here, to calculate a confidence interval you need to know the standard error, and if you look at the formula for
the standard error, it requires knowing the population standard deviation, which is something you almost never know. Additionally, when we calculated this in R, we could calculate the standard error by taking all the values in the sampling distribution, but as I say, you never have that in real life. So if you never have access to all the values in the sampling distribution, and you never have access to the population standard deviation, what do you do? How do you calculate that confidence interval?

Let's write this down. Problem: in most cases, we can't determine the standard error, and thus can't calculate a confidence interval, for two reasons: on the one hand, we don't know the true population standard deviation, and on the other, we don't have access to the sampling distribution of that population. I kind of cheated by using R to create my own population, but obviously, in real life, you can't do that.

The way these problems are usually solved is by substituting the standard deviation of your sample for the population standard deviation. Basically, the standard error of the mean can be approximated (I'll use the squiggly equals sign, ≈, for approximation) by taking your sample standard deviation and dividing by the square root of n: s_x̄ ≈ s / √n. We conventionally abbreviate this as s with a subscript x-bar, to indicate that we're using the sample standard deviation in the formula. And the reason we're allowed to do this is that if we repeatedly sample, like we did for the mean in my example, but using standard deviations instead, say a researcher goes out, collects a sample, and calculates a standard deviation; another researcher goes out, collects a sample, and calculates a standard deviation;
another researcher goes out, collects a sample, and calculates a standard deviation, and so on, and you dump all of those standard deviations into a histogram, what you get is another normal distribution, and the mean of that normal distribution will be equal to the standard deviation of the population, which is pretty interesting.

This strategy of replacing the population standard deviation with the sample standard deviation was developed by a man named Pierre-Simon Laplace, who is a pretty famous figure in science. But there is a problem with his method, and the problem is this: when you have a large sample size, it works pretty well as a strategy, but as we discussed with the sampling distribution of the mean, a small sample size produces more fluctuation in your sampling distribution. At small sample sizes, this ends up being a very poor estimate of your standard error; so poor, in fact, that it just ends up messing with your conclusions. There is, however, a solution, and the solution is to use a different probability distribution: instead of using a normal distribution as the model of your sampling distribution, we would typically use something called a t distribution, which is going to be our next topic.

So let's write: Solution: use the sample standard deviation, s, as an estimate of the population standard deviation. However, as I just mentioned: Problem: this does a poor job of estimating the standard error when sample sizes are small. Solution: use a t distribution to model the sampling distribution.

So to recap, because I know that was a lot of material spread across two classes, which I don't really like, but it's what we had to deal with: in our discussion of sampling
distributions, we've been using a normal distribution to model the natural sampling error that occurs when you repeatedly sample from a population. We know that if you repeatedly sample from a population, calculate a statistic for each of those samples, and dump them into a histogram, you get a nice normal distribution, assuming you have a lot of samples, and the central limit theorem tells us this is going to become a normal distribution. And we know that if we calculate the standard deviation of this sampling distribution, in other words, the standard error (that's all the standard error is: the standard deviation of a sampling distribution), we can then calculate a confidence interval that tells us, with 95% certainty, where the true population mean falls. The problem, though, as I just mentioned, is that calculating the standard error depends on knowing the population standard deviation; that's what we see right here. The solution is to swap in the sample standard deviation for the population's, but this only works at large sample sizes, so what we need to do is use a different distribution to model our sampling distribution, and that distribution, as I say, is called the t distribution.

Now, the t distribution is fascinating. It came about from the work of a man named William Sealy Gosset, who was the quality-control engineer at a small brewing company you may have heard of called the Guinness Brewery. As you might imagine, testing the quality of beer on large samples of individuals is problematic, especially if quality control is an issue and you might poison them. So in a lot of his work, Gosset was forced to use very small samples of just three or four people, and Gosset's big insight was that when you use Laplace's method to calculate the standard error of
the sampling distribution, in other words, when you replace the population standard deviation with the sample standard deviation to get your standard error, you can't actually use a normal distribution anymore. This is only true for small samples, though. And the question is: what is a small sample? Basically, in our class, a small sample is anything less than a thousand. So if you have a sample size larger than a thousand, you're probably safe in using Laplace's method, but that almost never happens, and there's another reason you'd still want to use a t distribution even then.

It's interesting, actually: when Gosset initially discovered that replacing the population standard deviation with the sample standard deviation means you can no longer use the normal distribution, the Guinness Brewery wouldn't let him publish his findings. They did eventually let him publish, but they made him do it under a pseudonym, and the pseudonym he chose was "Student." So sometimes you'll hear what we're about to talk about referred to as Student's t distribution.

Visually, the t distribution looks just like a normal distribution. Let's add a page here, call it "t distribution," and I think I have an image of that; yep, there we go, we'll drop that into your notes. This is on Canvas, by the way, in case you're curious or want to use it. You can see the t distribution looks very similar to a normal distribution; it just has slightly heavier tails, that's all, and it only has heavier tails at small sample sizes.

What I think we should do, and I think this will help clarify everything, is consider an example where we might use a t distribution. Let me see; we've got five minutes. So let's suppose there's a researcher named Sigmund.
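Those heavier tails can be made concrete before we dig into the example. The 95% cutoff of a t distribution sits farther from the mean than the normal's 1.96, and the gap shrinks as the sample grows. These are standard textbook critical values, hardcoded here rather than computed from a library:

```python
# Standard two-sided 95% critical values of Student's t distribution
# at several degrees of freedom (textbook table values, hardcoded).
t_crit = {1: 12.706, 2: 4.303, 5: 2.571, 10: 2.228, 30: 2.042, 100: 1.984}
z_crit = 1.960  # the normal distribution's 95% cutoff

for df in sorted(t_crit):
    # Every t cutoff exceeds 1.96 (heavier tails than the normal),
    # and the excess shrinks as degrees of freedom grow.
    print(df, t_crit[df], round(t_crit[df] - z_crit, 3))
```

At a thousand observations the t and normal cutoffs are practically indistinguishable, which is the "small sample" boundary mentioned above.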
And let's say Sigmund claims that, on average, college students spend nine hours a day on their phones. So we'll write: a researcher named Sigmund claims that students have an average screen time of nine hours. In other words, he's saying the population mean is equal to 9 hours; on average, that is what the population of college students spends on their phones. And let's say you're suspicious of Sigmund's claim; you don't necessarily agree with it. So in order to check the claim, you decide to go and randomly ask people what their screen usage is. But of course, a lot of people are quite embarrassed about how much time they actually spend on their phones, and they're reluctant to hand that information over to you. Consequently, you're only able to get a sample of six students. So let's suppose this is the data you collect: one student had a screen time of 9.5 hours, another 3.7, another 6.8, another 7.2, another 5.7, and another 8.7. If you go to the trouble of calculating it, that gives you a mean screen time of 6.93 hours, and the sample standard deviation is 2.09. And the question is: is Sigmund being unreasonable by saying the average screen time of the population is 9 hours? That's what we want to determine here.

One simple way to answer that question is to compute a confidence interval, just like we did when we were discussing sampling distributions. The confidence interval, recall, tells us where we can be 95% confident the true population mean falls, so it'll give us a good sense of where that true population mean is. So what we're going to do is, we're going
to use two methods here: Laplace's method, and then Gosset's method. We'll start with Laplace's. The first thing we need to do is calculate the standard error of the sampling distribution; let's bold that. Now, formally, the standard error requires you to know what the population standard deviation is, but we don't know that, so we're going to swap in the sample standard deviation. The formula is s_x̄ = s / √n: our sample standard deviation divided by the square root of our sample size. In this case, we have a sample size of six, so this works out to 2.09 / √6, which gives us a value of 0.85. That is the standard error of the mean; actually, let's label it "standard error of the mean," just to be specific. Remember, standard error and standard deviation are the same thing; it's just that when we say standard error, we're talking about the standard deviation of the sampling distribution specifically.

Now, since we don't know what the mean of the sampling distribution is, we're going to use the mean of our sample as the center of our confidence interval; the mean of our sample is basically our best guess here. So if we want to be 95% confident, we'll put our sample mean at the center of our confidence interval, and we just need to see where approximately two standard errors above and below the mean fall. And I guess we'll do that on Friday.
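The arithmetic in this example fits in a few lines. This is a sketch in Python (the course's own code is in R) using the lecture's six data points; 1.96 is the normal cutoff for Laplace's method, and 2.571 is the standard table value of the t cutoff at n − 1 = 5 degrees of freedom, anticipating where Gosset's method will take this on Friday:

```python
import math

# Screen-time sample from the lecture's example (hours per day).
sample = [9.5, 3.7, 6.8, 7.2, 5.7, 8.7]
n = len(sample)

mean = sum(sample) / n                                         # ≈ 6.93
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # ≈ 2.09
se = s / math.sqrt(n)                                          # ≈ 0.85

# Laplace's method: normal cutoff of 1.96.
normal_ci = (mean - 1.96 * se, mean + 1.96 * se)

# Gosset's method: t cutoff of 2.571 for 5 degrees of freedom
# (standard table value), giving a wider interval at this tiny n.
t_ci = (mean - 2.571 * se, mean + 2.571 * se)

print([round(v, 2) for v in normal_ci])
print([round(v, 2) for v in t_ci])
```

Notice that the t-based interval is wider than the normal-based one: the heavier tails of the t distribution make the interval more honest about the uncertainty in a six-person sample.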