Transcript for:
Reviewing Sampling Distributions for AP Stats

What's up, my Stat Stars! Welcome to the Unit 5 summary review video for AP Statistics. This unit is over sampling distributions, and it might be one of the easiest units but also one of the most important, because it connects everything we've learned so far this year to what we're going to learn next, which is inference. So before we begin, let me of course remind you of two things. First, this is just a review video. It does not cover every teeny tiny detail that you learned in class; it just covers the really big concepts that you're going to need to know to prepare for your unit test or for the AP Stats exam in May. The second thing you need to do is get out that study guide if you haven't already. Download it, print it, whatever you do, and get ready to fill it in as you watch the video, or fill it in when you're all done. I don't really care, but that study guide is going to help you tremendously in terms of practicing all the different skills that we talk about in this video. So without further ado, let's get ready to rock and roll with sampling distributions. [Music]

Now before we dive too deep into sampling distributions, the very first part of Unit 5 is really a review over the normal distribution. It's actually called "The Normal Distribution, Revisited" because we want to quickly revisit the normal model and remind you of all its key aspects. The normal distribution can be used to model continuous random variables that follow a normal distribution, which many continuous random variables do. But I guess we should quickly remind you what a continuous random variable is. A continuous random variable is a variable that can take on any numerical value within a specified domain, and any interval within that specified domain has a probability associated with it. The one thing unfortunately you cannot do with the
continuous random variable is find a useful probability for one individual, specific numerical outcome (that probability is zero for any single exact value). With discrete random variables, yes, you can do that, but not with continuous random variables. All we can do is specify an interval, meaning a lower value to an upper value, and if the continuous random variable follows a normal distribution, we can use the normal distribution to tell us the probability that a value of that variable falls in the specified interval.

So here in this picture we see a normal distribution, and hopefully you remember that a normal distribution is entirely run by two values: the mean, smack dab in the middle, and the standard deviation. The normal distribution tells us that we can go up one, two, three standard deviations and down one, two, three standard deviations as well. We could go further up and further down; technically the normal distribution goes all the way from negative infinity to infinity. But what the normal distribution tells us is that about 99.7 percent of your outcomes fall within three standard deviations, that is, from a z-score of negative three to positive three. In this particular picture, what we're looking at is the area between z-score 1 and z-score 2, and the area in between represents the probability that an outcome of the continuous random variable will fall in that interval. Once again, we can find that probability using a z-table, if you know how to use those (they're kind of ancient, not too many people use them anymore), or you can use technology. There are plenty of websites out there, as well as the TI-84 calculator, which has the normalcdf feature that can find the probability between two z-scores.

Now, we've already covered a ton of the normal distribution way back in Unit 1, and I have tons more videos on my YouTube channel about normal model calculations, but let's take a look at an example right now so we can
quickly remind ourselves how the normal distribution works. Maxi has been saving money each month for a long time. The amount she contributes to her savings account each month follows a normal distribution with a mean of $55.20 and a standard deviation of $8.15. So here we're working with a continuous random variable X: how much money Maxi contributes to her savings account. We know the mean, we know the standard deviation, and we know that it follows a normal distribution, which means that life is going to be really easy for us.

The first question here is: what is the probability that next month she contributes more than $60? The first thing you have to do is figure out where $60 falls on the standard normal model, and that's of course by finding the z-score. So we take 60, the value that we're asked about, subtract the mean, divide by the standard deviation, and we get our z-score, in this case 0.589, which I have marked with the red line. Then we want to find the probability that a particular outcome, how much money she contributes, is greater than 60, which is equivalent to a z-score being greater than 0.589, which I have shaded in red. Again, you could use a z-table or your TI-84 calculator to find that probability, and we get 27.8 percent. So there's a 27.8 percent probability that in any given month she contributes more than $60.

Here's a second example following that same problem. This time we're asked what amounts of money would mark the top five percent and bottom five percent of contributions. Once again we have that normal distribution set up, and what we have to do first in this case is find the z-scores that represent that bottom five percent and that top five percent. How do you find those z-scores? Well, you could use a z-table backwards, or you could go ahead and use invNorm on your TI-84 calculator. Again, we already learned all that back in Unit 1, but if you know how
to do all that, you can quickly get the z-scores that represent that top five percent and bottom five percent, which are positive 1.645 and negative 1.645. Then what we can do is take those two z-scores and plug them into our z-score formula, where we already know the mean of $55.20 and the standard deviation of $8.15. So we can back-solve for x: multiply by the standard deviation, add the mean over, and we get the two individual values that mark the bottom and top five percent. So any value above $68.61 would be in the top five percent of contributions, and anything below $41.79 would be in the bottom five percent.

Now let's continue with another example. This time we're going to bring in another person, Cassandra. She contributes on average $62.45, but her standard deviation is $12.66. So a great question we could ask here is, let's combine them together: what is the probability that the total amount saved in a month by both Maxi and Cassandra is greater than $140? Now we're not just talking about Cassandra, and we're not just talking about Maxi; we're talking about both of them together. So the first thing we have to do is find the mean and standard deviation for their total contributions. For the mean it's really easy: all we've got to do is add the 55.20 for Maxi to the 62.45 for Cassandra, and we get 117.65 for the mean of the total. But what about the standard deviation? Well, hopefully you learned a lot in the random variables unit, so we know how to combine. We are not allowed to add standard deviations; what we are allowed to add is variances, as long as the two amounts are independent of each other. So Maxi's variance is 8.15 squared (that's just her standard deviation squared), plus Cassandra's variance, which is 12.66 squared. Add up all that variance, throw a square root around all of it to get back to a standard deviation, and now we have the standard deviation for the total: fifteen dollars and a little bit more than five cents, 15.056. I know that really with money you don't want to
keep a third decimal, but I'm going to do it anyway just for accuracy. So now that we know the mean and standard deviation for the total, we can answer the question: what is the probability that the total is greater than $140? Of course the first thing we need is the z-score for $140. So we take 140, subtract the mean, divide by the standard deviation that we just found, and we get a z-score of 1.484. So asking about the total being greater than 140 is the same thing as asking about a z-score on a standard normal model being greater than 1.484. Again, we can use our calculator, a program on our computer, or an old-school z-table to get the probability of 6.89 percent. So there's a 6.89 percent probability that Cassandra and Maxi together contribute more than $140.

All right, that's a pretty quick review over the normal distribution. I really hope it made a lot of sense and wasn't overly complicated. But again, if you're like, "I don't remember the normal model," go back to the Unit 1 review real quick, or go back to all the normal calculation videos I have on my YouTube channel to really hone in on those skills. It all involves using normalcdf and invNorm on your calculator, or really knowing how to use those z-tables, or, if you don't like any of that, you can also use programs on the computer, however your teacher is teaching you to make calculations with normal distributions.

Now let's dive into sampling distributions. But before we dive too far, let's talk about what samples are for. Yes, again, we learned about samples a long time ago, but let me remind you that we take samples and analyze the statistics of those samples in hopes of estimating a population parameter. That's kind of the whole goal, right? Take a sample statistic and use it to estimate a population parameter. Now let's quickly remind you that a sample statistic is any information that summarizes a sample. We've got the
mean, we've got the median, we've got the standard deviation, we've got the range, we've got a proportion from a sample. But with anything that we collect from a sample, the whole point of collecting it is that we want to learn what might be true for the population parameter that it represents. Every sample statistic is a point estimator of the corresponding population parameter. Think of it like this: a sample statistic should point us to what the population parameter is going to be. That's why a sample statistic is called a point estimate of the population parameter.

We can look at a couple of examples here. A sample mean, x-bar, we would hope points us to what the population mean, mu, is. A sample proportion, p-hat, will point us to what the population proportion, p, is. A sample median (sorry, we don't have any fancy symbol for a sample median) would point us to what the population median is; when we're working with medians we just use the word "median." And another example would be the sample range, which we would assume points us in the direction of the population range.

But the one thing we do have to remember is that sample statistics are almost never going to match a population parameter perfectly. We actually have a name for this: it's called sampling variability. The idea is that when you take samples, they are going to vary, not only amongst themselves but also from what the population parameter is.

Now imagine we start sampling, that is, we start taking repeated samples of the same size from the same population. There are a couple of rules that have to be true if everything I'm going to talk about in this video is going to make sense. First, the samples have to be random to avoid bias, and second, the samples must be independent of each other. Now the only way we can guarantee independence between
our samples is if we sample with replacement, which means that once we take individuals out, we put them back: take out, analyze, put back. But unfortunately we don't always do that. So if we are sampling without replacement, then we have to make sure that our sample size n is under 10 percent of the population. Now a lot of kids just remember the "under 10 percent" rule, but they don't really understand why. The idea is this: think about if you have a small bag of M&M's and you're going to take out samples of 10. Before you know it, you're not going to have any M&M's left in the bag to take out at all, and all of a sudden your samples are, well, not independent of each other. If you take out 10, what happens next is certainly going to be impacted by what you took out first, because you took out probably 50 percent of the bag at that point. However, if we go to the M&M's factory, where they might have barrels of millions of M&M's, and we take out 10, throw them away, take out 10, throw them away, take out 10, throw them away, well, yes, we technically don't have independence, because we are removing M&M's and not putting them back. But because we're removing such a small amount, under 10 percent of the population, any difference is considered negligible and it's really just not a big deal. So we're still allowed to assume that we have independence between our samples. That's the general idea of why we need our sample size to be under 10 percent of the population to assume independence when we are sampling without replacement.

So as long as we meet those rules, random samples and a sample size under 10 percent of the population, we can start sampling. So now imagine we've taken sample after sample after sample, and from every one of those samples we collected a statistic. Essentially, right now we have a bunch of statistics, and with all those statistics we can create a sampling distribution. A sampling distribution of a statistic is the distribution of values of the statistic for all
possible samples of a given size from a given population. Now you might be like, "Uh, huh? What was that?" Let me try to explain. For example, imagine we're analyzing a quantitative variable. We get a bunch of samples, and from every sample we calculate the sample mean. Now, all of these sample means are quantitative values, because each sample mean is a numerical value, so we can create a distribution of all the sample means. So again, picture it: we've got a whole bunch of sample means, we look at them all together, and that distribution of all the sample means would be a sampling distribution.

We could do the same thing with a categorical variable. If we have a categorical variable, we could collect a bunch of samples and find the sample proportion from each of those samples. So we'd essentially have a bunch of p-hats: p-hat after p-hat after p-hat. Again, every individual p-hat is a quantitative value itself, so we could take all those p-hats and create a distribution of them, and that distribution would be the start of a sampling distribution for a sample proportion.

We could do the exact same thing with medians: take a median from every single sample we get and then make a picture, or a distribution, of all of those sample medians. And again, we can do the same thing for ranges: take a whole bunch of samples, calculate the range from every sample, and then make a distribution of all those ranges. So again, that's what a sampling distribution is: it's the collection of sample statistics from all possible samples of a given sample size from a given population.

Now, there are lots of sample statistics out there, but by far the two most popular that we're going to analyze in this unit, and in more to come, are the mean of a sample, the sample mean x-bar, and the sample proportion p-hat. Let's take a look at a couple of specific examples for sample means and sample proportions. Now let's say
in a large city 65 percent of all registered voters are going to vote yes on an issue in the next election. We can simulate taking repeated random samples of size 150 from the population of all registered voters. Let's make sure that our rules are in play here. First, they have to be random samples. Second, we have to assume that 150 registered voters is under 10 percent of all registered voters in the city, so we can assume independence. Now, if we were to take sample after sample after sample, and from every one of those samples collect the sample proportion of people who are going to vote yes, well, we'd have a bunch of p-hats. If we put all those p-hats together, we could create a distribution. This distribution is the start of the sampling distribution. Every green dot represents an individual sample proportion, and you'll see that some could be 61 percent, some could be 62 percent, some could be 63 percent. They vary, and that's okay. Actually, that's the one thing samples do very well: they vary, and that idea is called sampling variability.

When we look at this start of a sampling distribution, we notice three things. The first thing I notice is the center. The center is the mean of all the p-hats, the mean of all the sample proportions. Notice it's about 65 percent, 0.65, right smack dab in the middle. When this occurs, we say the sample statistic is an unbiased estimator. If the samples as a group are pointing us to the truth, they must be unbiased. If the center wasn't 0.65, then the sample estimate would be biased. The next thing we notice is the fact that there is a spread here; they're not all the same. Again, that's what I already mentioned, sampling variability: not every sample is going to match the true population parameter, but as a collective group, the mean of them all will match up with the truth in the center. The third thing I notice is the shape, and the shape is, well, you guessed it, normal. This means we could measure
the variation we see in the sampling distribution with some probability. We're going to do some more examples of that later.

Now let's do an example with sample means. Let's say that the true mean weight of all student cell phones at Roosevelt High School, a large school, including the case, is 180 grams. We can simulate taking repeated samples of 45 cell phones from the population of all student cell phones. So again, we take a sample of 45, analyze the mean weight of those cell phones, then take another sample, and another, and another. We'd have a ton of x-bars, and if we take all of those sample means and create a distribution with them, we get the start of a sampling distribution.

Again, notice three things here. First, the center, the mean of all the sample means: notice they're all piling up around the true mean of 180 grams. That once again means our sample estimates are unbiased. If all of our sample means were piling up around, say, 160, well, that's not the truth. We already said the truth is 180, so that would mean that something is biased. The second thing we notice is that, once again, not every sample mean is 180 grams. Some are more, some are less; there's some variability there. We also notice the shape, and that is normal. This is where the central limit theorem, or CLT for short, comes into play. The central limit theorem says that when the sample size is sufficiently large, with at least 30 as the usual rule of thumb, the sampling distribution of the mean will be approximately normal. So regardless of what the population of all values looks like, the sampling distribution of the mean will be approximately normal if the sample size is 30 or larger.

Now, it is important to note that simulating, say, a thousand sample statistics is still not a true sampling distribution. That is because a sampling distribution needs to have the values of the statistic from all possible samples, and if your population is really, really large, like a lot of populations are, that's going to mean a lot, a ton, millions
of possible samples, which means millions of possible sample statistics, and we can't possibly simulate them all. But what we can do is create a model for what that sampling distribution could look like, and that model is going to be extremely helpful to us. To build a model of a sampling distribution, all we need to know is the population parameters and the sample size. In this unit we're going to take a look at four specific sampling distributions. Let's start looking at them now.

First, let's take a look at building a model for a sampling distribution for sample proportions. To create this model we need two things: the true population proportion p and the sample size n. When we build this model, it has three things: a center, a spread, and a shape. Let's talk about each of those now.

The center of a sampling distribution for sample proportions is the mean of all possible p-hats. Really make sure you understand that notation: it is a mean of all those p-hats. And remember, I already said this, the mean of all those p-hats should equal the truth, p, as long as we are unbiased. That is why we have a condition attached to this formula for the center, and that is that the samples must be random to avoid bias.

Next up we have the spread, or the standard deviation, of all those possible sample proportions. Remember, we noticed when we looked at those simulated sampling distributions that every sample is, well, not the same. There is always going to be some spread, which means there's going to be a standard deviation. The standard deviation of all those p-hats can be found by this simple formula: it is the square root of p, the true population proportion, times 1 minus p, all divided by the sample size n. In order for the standard deviation formula to work, we once again need those samples to be independent. And again, if we're sampling with
replacement we don't have to worry about that; they will be independent. But in most cases we're not, and that's why we have the condition I've already mentioned: the sample must be less than 10 percent of the population to assume independence.

Lastly we have the shape, and of course the shape is going to be approximately normal as long as our samples are big enough. How do you know if your sample is big enough? Well, when you're working with proportions, your sample is big enough as long as it is expected to have 10 or more successes and 10 or more failures; in other words, n times p and n times 1 minus p are both at least 10. Now when I say success or failure, I simply mean what we're looking for. If we're looking for people who are going to vote yes, we need 10 or more of those, and the others would be no's; that's what we're calling our failures, and we need 10 or more of those too. So the shape of a sampling distribution for sample proportions will be normal as long as the sample is big enough, so you've got to make sure you check that you're expected to have 10 or more successes and 10 or more failures. So that's how easy it is to build the sampling distribution for a sample proportion. All you need to know is the center, the spread, and the shape, but don't forget to check the three conditions that are attached to them.

All right, let's take a look at a quick example. Let's say that once again the true proportion of all registered voters in a large city who are going to vote yes on an issue in the next election is 65 percent. What would the sampling distribution of all possible sample proportions look like? Well, let's build it. It's this easy. First we need the center: the mean of all the p-hats is going to be the truth, so as long as the samples are unbiased, we should see a center of 65 percent, or 0.65. Now, some samples are going to be higher and some are going to be lower, so of course there is a spread. The standard deviation is going to be the square root of 0.65 times 0.35 divided by the sample size of 150, and we get 0.0389
as long as 150 is assumed to be under 10 percent of all registered voters in the city, so that we have independence. Lastly is the shape, and of course that shape is going to be normal, but we do have to confirm that we have 10 or more expected successes and failures. If we take 150 times 0.65, we have 97.5 people who are expected to say yes, and 150 times 0.35 is 52.5 people we expect to say no. Since both of those numbers are 10 or more, we are going to have an approximately normal sampling distribution.

I actually built the sampling distribution for you on this page. We see the shape, normal; the center, 0.65; and then I went up one, two, three and down one, two, three standard deviations of 0.0389, because remember, with a normal distribution you have the mean in the center and you go up one, two, three and down one, two, three standard deviations. So this model allows us to see what all possible p-hats are going to look like.

As I mentioned earlier, we can also examine the variation within the model using probability. For example, what is the probability that a sample proportion from 150 registered voters is less than 58 percent? To answer this question, the first thing we do is bring in our model. Then we need to identify where a sample proportion of 58 percent is. We see that it falls on the lower side, but more specifically we need to find the z-score for 58 percent. So we simply find the z-score for 0.58, and we see that it's negative 1.799. Now we can answer the question: the probability that a sample proportion p-hat is less than 58 percent is equivalent to the probability that a z-score is less than negative 1.799, and we can use our z-tables or our calculator to get that probability of 0.0360.

Another great question we could ask is: what sample proportion would mark the top five percent of all sample proportions? To do this, once again we start with that sampling distribution that we already built, and now we
have to figure out where the top five percent is. This is where you could use technology, your z-tables, or a computer program to find the z-score that represents the top five percent. Let me remind you, if you are going to use your TI-84 calculator, you want to use invNorm, and when you're using invNorm you want to put in 0.95, because the top five percent is equivalent to the bottom ninety-five percent. If we look at the picture here, we see five percent above and 95 percent below, and when you're using invNorm you need to put in the area below. So that's how we get the z-score of 1.645; that is the z-score that marks where the top five percent begins. Now we just have to use our z-score formula in reverse. We know the z-score, 1.645; we already know the mean, 0.65; and we know the standard deviation, about 0.039. What we don't know is the p-hat, the sample proportion that marks that top five percent. Then all we've got to do is multiply by the standard deviation, add the mean over, and we get a p-hat of 0.714. So if a sample proportion comes back 71.4 percent or higher, it's in the top five percent of all possible sample proportions.

Next we can model the sampling distribution for differences between sample proportions. Here we are looking at possible differences between two sample proportions taken from two different populations, let's say population 1 versus population 2.
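Quick aside before we get into differences: the single-proportion voter model worked through above can be sketched in a few lines of Python. This is only an illustration, not something shown in the video. The function name `proportion_sampling_model` and the variable names are made up here, and Python's built-in `statistics.NormalDist` stands in for normalcdf and invNorm on the calculator. The sketch assumes the conditions already discussed hold (random samples, and 150 voters being under 10 percent of the city's registered voters).

```python
from math import sqrt
from statistics import NormalDist


def proportion_sampling_model(p, n):
    """Center, spread, and large-counts check for the sampling distribution of p-hat.

    p is the true population proportion, n is the sample size.
    (Hypothetical helper written for this example.)
    """
    mean = p                          # center: mean of all possible p-hats
    sd = sqrt(p * (1 - p) / n)        # spread: sqrt(p(1 - p) / n)
    # shape: at least 10 expected successes and 10 expected failures
    large_enough = n * p >= 10 and n * (1 - p) >= 10
    return mean, sd, large_enough


# Voter example: p = 0.65, samples of n = 150
mean, sd, large_enough = proportion_sampling_model(p=0.65, n=150)
model = NormalDist(mu=mean, sigma=sd)

# Like normalcdf: probability that a sample proportion is below 0.58
prob_below_58 = model.cdf(0.58)

# Like invNorm with area 0.95: the p-hat where the top five percent begins
top_5_cutoff = model.inv_cdf(0.95)

print(f"large counts condition met: {large_enough}")
print(f"standard deviation of p-hat: {sd:.4f}")
print(f"P(p-hat < 0.58): {prob_below_58:.4f}")
print(f"top five percent starts at p-hat = {top_5_cutoff:.3f}")
```

Because the code keeps the unrounded standard deviation instead of 0.0389, the results can differ from the hand calculations in the last decimal place; they should land close to the 0.036 and 0.714 found above.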
Now to build this sampling distribution for the differences in sample proportions, we're going to need three things: the center, the spread, and the shape. Let's start with the center. The center is the mean of all possible differences between a sample proportion from population 1 and a sample proportion from population 2. If we think about it, the mean of all possible differences between the two sample proportions should be the true difference between the parameter from population 1 and the parameter from population 2. Of course, this is only going to be true if both samples are random, to avoid bias.

Next up is the formula for the spread, or the standard deviation. This is the standard deviation of all possible differences between a sample proportion from population 1 and a sample proportion from population 2. It's a little bit of a tricky formula, but it's given to you on the AP Stats formula sheet, so no need to memorize it. Again, this formula is only going to be usable if we know that the samples are independent. As long as our sample sizes are less than 10 percent of the populations they came from, we can assume independence. But not only do we need to assume independence within the sample from each population, we also need to make sure that our two samples, the one from population 1 and the one from population 2, are independent of each other.

Lastly is the shape, and of course it's going to be normal as long as both samples, the one from population 1 and the one from population 2, are big enough, which means we've got to check that both samples have 10 or more expected successes and 10 or more expected failures, so they're big enough to use a normal distribution.

Now let's look at an example that's going to make this whole scenario of looking at differences of proportions make a whole lot more sense. Let's look at an example that examines two school districts and the proportion of students in each school district who pass a state math test. School District A
has 80 percent pass and School District B has 76 percent pass, and we want to model the possible differences between the proportion that passed in a sample of 75 students from District A and in a sample of 100 students from District B. The model shows us what the possible differences between the two sample proportions could be. We've got the center, the spread, and the shape.

The center is of course going to be the mean of all possible sample differences, so again this should be the true difference. The true proportion of those who pass from District A is 80 percent, and the true proportion from District B is 76 percent. That means the true difference, the difference that we would expect to see between the samples, is four percent, or 0.04.

However, again, samples are going to vary, so we have our standard deviation. Here we see that giant square root: we have 0.8 times 0.2, the success rate and the failure rate from District A, divided by 75, the sample size taken from District A, plus 0.76 times 0.24, the success and failure rates for District B, divided by 100, the sample size taken from District B. You can put it all in the calculator at once; just make sure it's all on the inside of a giant square root. We get a standard deviation of 0.0629.

Finally we have our shape, of course normal, as long as we check, and let's actually check, because you do need to do this on the AP test, that we have those 10 or more successes and failures. In the sample of 75 from District A, 80 percent should pass; that's 60, more than 10. Twenty percent will not pass; that's 15, more than 10. And the same thing holds for District B, where out of 100 we expect 76 to pass and 24 not to pass. Since we have enough successes and failures, the shape will be normal. So right smack dab in the center we see the 0.04 that we expect, but we also see the difference could be a little bit higher or a little bit lower.

Now a common question here is: what is the probability that a sample of 100 from District B has a higher proportion passing than a sample of 75 from
District A? The model shows us that a sample from A is expected to have a four percent higher passing rate than the sample from B, but the model also shows us that the differences are going to vary. In fact, a negative difference would indicate that the proportion from B is higher than the proportion from A, because we built the model based on A minus B. So if B ends up being the bigger value, the difference ends up being negative. So we can examine the likelihood of a negative difference, which is any difference below zero.

The first thing we have to do is create our model, which we've already done. Then we've got to locate zero. A difference of zero would be, well, no difference: the sample proportion from A is the same as the sample proportion from B. So we find the z-score for 0, which is negative 0.636. Anything below zero would be a negative outcome, and a negative outcome in this situation tells us that the sample proportion from B is bigger than the sample proportion from A. So all we have to do is take that z-score and use some type of normal calculation, whether it's normalcdf on your calculator or a normal table, and look below that z-score of negative 0.636 to get 0.262. So if you take a sample from each, 75 from District A and 100 from District B, there's a 26.2 percent chance that the sample proportion from B will be higher than the sample proportion from A. The complement of that 26.2 percent chance, about 73.8 percent, is the probability that District A's sample proportion is higher, and District A is supposed to be higher, so that's what should be more likely anyway. Answering that kind of question, about one sample being higher than another, is very common when you're analyzing the difference between two sample proportions.

Next up is a sampling distribution for sample means. If we have a numerical variable and we are examining the mean of the sample, then a sampling distribution can be used to model what all possible sample means will look like. All we need is the population mean mu, the
population standard deviation sigma and of course the sample size n now to build this model we once again need three things the center the spread and the shape let's start off with the center this would literally be a mean of means so again we have a bunch of x-bars a bunch of sample means and if we were to take the mean of all those x-bars we should get the true mean smack dab in the center which would mean that we have an unbiased estimator then of course we have that condition that the samples must be random to avoid bias now next up we have the spread here is the formula for the standard deviation of the sampling distribution now there's two sigmas in this formula so some kids get it confused the sigma that doesn't have any subscript that is the standard deviation of the population the sigma on the left that has the subscript of x-bar that's the standard deviation of the sampling distribution which is the standard deviation of a bunch of x-bars that's why the x-bar is there it's emphasizing that this is the standard deviation not of the population but of the sampling distribution made up of a bunch of x-bars and to find that standard deviation we simply have to take sigma the standard deviation of the population and divide by the square root of n but once again we need our samples to be independent or at least we have to assume that they're independent so we need our sample size to be under 10 percent of our population to assume independence and lastly we have the shape of the sampling distribution which is going to be normal but this is where we have to be a little bit careful because there are no such things as successes and failures when you're working with quantitative data you have numbers right so here's the deal if we want the shape to be normal we actually have two different ways this can happen if the population that the samples came from is already said to be normally distributed then any sample size is large enough so that our sampling
distribution will be normal even a sample size of one is going to produce a sampling distribution that's normal if the population is already said to be normal but if the population shape is unknown or not normally distributed then we do need the central limit theorem that states that the sampling distribution will still be approximately normal if the sample size is 30 or larger so let's take a look at a quick example let's say that the population of all student cell phones at Roosevelt High School has a mean weight of 180 grams and a standard deviation of 15 grams and the shape of that population distribution is skewed to the right now what we want to do is analyze what the sampling distribution would look like for samples of size 45 cell phones all right we're going to need a center a spread and a shape so let's start off with the center the mean of all possible sample means would be 180 duh we should see the truth smack dab in the middle as long as the samples are random to avoid bias now naturally some samples are going to be more than 180 on average and some samples are going to be lower that's where the standard deviation comes into play so to find the standard deviation of our sampling distribution we're going to take 15 the standard deviation of the population and divide by the square root of 45 to get 2.236 now of course this is on the assumption that 45 is less than 10 percent of all cell phones at the high school to assume independence lastly the shape is going to be normal now this is where it's really important if you heard me right I said that the population was skewed to the right so you're like well wait a minute how could the shape of the sampling distribution therefore be normal well that's where the central limit theorem comes in the central limit theorem tells us that as long as your sample size is 30 or larger which 45 of course is then the sampling distribution will be approximately normal even though the population was not a really powerful theorem so once
again in this normal distribution we have 180 right smack dab in the middle then we could go up one two three and down one two three standard deviations now what kind of questions can I ask you based on this model well we could say what is the probability that a sample of 45 random cell phones has a mean greater than 182.5 grams so of course we have to take a look at our model locate 182.5 grams on our model but then we have to find the z-score for 182.5 grams and then once we have that z-score we could use normalcdf on our calculator or we could use z tables to find the probability or the proportion above it so the probability that any sample mean is greater than 182.5 is the same thing as asking about the probability that a z-score is greater than 1.118 and using technology we get a probability of 0.132 so there's about a 13 percent chance that a sample is going to come back with a mean of 182.5 grams or higher now that's not weird at all a 13.2 percent probability is not unlikely by any stretch of the imagination so if I did get a sample mean above 182.5 I'd be like oh hey guess what samples vary no big deal now here's another question that we could ask it's actually pretty popular create an interval for the middle 95 percent of all sample means so here what we're looking for is what sample mean value to what sample mean value would represent the middle 95 percent of all possible sample means so the first thing we have to do is find the z-scores for where that middle 95 percent starts and ends now if we have 95 percent in the middle that means we have five percent left out that's two and a half percent at the bottom and two and a half percent at the top so what we could do is we could use technology we could use invNorm on our calculator or we could use some type of website or we could use our z tables to determine the z-score that has two and a half percent below it which is negative 1.96 and then the other z-score technically has 97.5 percent below it that would be positive 1.96 but
it makes sense that one's positive and one's negative because of symmetry so once you know the bottom two and a half percent is at a z-score of negative 1.96 that instantly tells you that the top two and a half percent is at 1.96 and in between that bottom and top two and a half percent would be the 95 percent in the middle so now all we've got to do is take the z-scores and turn them into sample means by using our z-score formula and it's really that simple multiply the z-score by the standard deviation add the mean and we get the beginning and the end of our interval so any sample mean between 175.62 grams and 184.38 grams would be in the middle 95 percent of all sample means so of all possible sample means out there for 45 cell phones 95 percent of all those possible sample means are going to be in that interval the last sampling distribution we analyze in this unit is the sampling distribution for the differences in sample means here we're looking at the possible differences between a sample mean taken from population one and a sample mean taken from population two now to build this model you guessed it we're going to need the center the spread and the shape so let's take a look at them now the center would be the mean of all possible differences between a sample mean from population one and a sample mean from population two and the center or the mean of all of those possible differences should be the true difference between the population mean from population one and the population mean from population two now of course the samples have to be random to avoid bias now we have the spread or the standard deviation because of course the samples are going to vary so what is the standard deviation for all possible differences between a sample mean from population one and a sample mean from population two well again it's kind of an ugly formula but trust me it's on the AP stats exam formula sheet you don't have to memorize it but it's a giant square root and in that square root
we have the standard deviation from population one squared divided by the sample size for population one plus the standard deviation of population two squared divided by the sample size that we took from population two and once again to use this formula we need our samples to be independent of each other and we need both sample sizes to be under 10 percent of their populations so that we can assume independence all right lastly of course the shape is going to be normal but once again we have to make sure that our samples are big enough to be normal if the populations that the samples were taken from are already normally distributed then any sample size is big enough even three four or five but otherwise we need the central limit theorem so if our populations are unknown or not normally distributed then we need the central limit theorem that tells us as long as both samples are 30 or larger the sampling distribution will still be approximately normal let's look at an example that covers the differences of sample means in this example we're going to examine the sample mean weight of 15 male orca whales and the sample mean weight of 10 female orca whales now we start by being told that the weight of male orca whales follows a normal distribution with a mean of 12,000 pounds and a standard deviation of 800 pounds and the weight of female orca whales also follows a normal distribution with a mean of 10,000 pounds and a standard deviation of 900 pounds now of course we have to build the sampling distribution by creating the center spread and shape the center is going to be the mean of all possible differences between the mean of a sample of males and the mean of a sample of females and that is going to be the true difference twelve thousand for the males minus ten thousand for the females is two thousand so we would expect there to be a difference of two thousand pounds between a sample of males and a sample of females next up we have the standard deviation which is just going to be taking
the standard deviation for the males 800 squared divided by the sample size for males 15 plus the standard deviation for the females 900 squared divided by the sample size for the females 10 and putting all of that under a giant square root and we get 351.663 pounds and of course the shape is going to be normal as long as the samples are big enough but wait a minute our samples are kind of small 15 and 10 most kids are going to say oh those aren't bigger than 30 but go back remember I said that both male weights and female weights of orcas already follow a normal distribution so sample sizes of 15 and 10 yes they are under 30 but we don't even need the central limit theorem in this problem because the populations are already normal to begin with all right so we have our sampling distribution model centered at 2,000 pounds but the difference could be a little bit more and the difference could be a little bit less all right let's answer a question based on this what is the probability that a sample of 15 male orcas will have a sample mean that is 3,000 pounds or more above the sample mean of 10 female orcas okay so the first thing we have to do is of course locate a 3,000 pound difference on our sampling distribution so we see that it's pretty high on the scale there so we're going to need to identify the z-score which is 2.844 so now we have to find the probability that the mean of the sample of male orcas minus the mean of the sample of female orcas is greater than 3,000 pounds that's the same thing as finding the probability that a z-score is greater than 2.844 and we get a probability of 0.00223 which is pretty unlikely so it would be weird if we found a sample of males that averaged three thousand pounds more than a sample of females now another important thing to start talking about at the end of this unit is what happens when something unlikely actually occurs so if we look at this problem we just found
out that the probability that a sample of 15 male orcas has a mean weight 3,000 pounds or more above a sample of 10 female orcas is really really really unlikely well under one percent so what if a sample of male orcas really was three thousand pounds heavier on average than a sample of female orcas what would that tell us well here's what you have to understand in the world of statistics we do not believe in miracles so if this very unlikely event actually happened then that would give us reason to question something now what's the only thing that we could question well that would be the original information given to us that would be the average weight of the male orcas and the average weight of the female orcas so we were originally told twelve thousand and ten thousand but I would actually say that I don't know if those numbers are right anymore I would question whether those numbers were true in the first place because we should not see a sample of male orcas weigh on average three thousand pounds more than a sample of female orcas because the probability of that happening is so low so if it happened I wouldn't just be like oh that's weird I guess samples vary no because the probability of it happening was so low if it happened I would start to actually question if the original information given to us was correct or not now the last thing I want to mention in this unit is I want you to notice that in all of the formulas for standard deviation the sample size n was in the denominator now what that means is that as the sample size gets bigger our standard deviation gets smaller that's a really important concept to understand bigger samples vary less actually that should make complete sense just think about it if we're trying to use a sample statistic to estimate a population parameter wouldn't a bigger sample give us a more reliable estimate an estimate that's closer to what is true I mean just think about it if you're trying to find the
average weight of all bullfrogs and you only analyze the mean of two bullfrogs well it's only two bullfrogs who knows what you're gonna get you could get something really really high or something really really low and all of a sudden you have no idea what's true but if you were to analyze a sample of 2,000 bullfrogs well two thousand bullfrogs is a lot yeah you might have some big ones and some small ones but they're going to average each other out because there's 2,000 of them and you're going to get an overall statistic that's much much more reliable and much much closer to the true average of all bullfrogs so again bigger samples vary less always keep that in mind now here's an example to actually show this to you if we look at our cell phone problem where the standard deviation of all cell phone weights was 15 grams now if we have a sample size of 45 we would take that 15 divided by the square root of 45 and we get a standard deviation of 2.236 but if we were to increase our sample size to say a hundred we would take that 15 divided by the square root of 100 and now all of a sudden we get a standard deviation of 1.5 now again that's because samples of 100 cell phones are going to be more reliable they're going to give us values that are closer to the truth that's why the standard deviation is smaller so here we see the two sampling distributions the one in blue and the one in red and we see that the one in blue is more narrow because it has less spread to it which means its sample means are going to land closer to the truth both of them have the same center of 180 grams because that's what is supposed to be true no matter the sample size it doesn't even matter what your sample size is the truth should be in the center but we do see that with a larger sample size of 100 we have a more narrow more reliable less spread out distribution all right that's it for unit five hopefully it made a lot of sense to you and hopefully you don't think it's too too bad but keep in mind there are four
different sampling distributions that we explained to you and every one of those four different sampling distributions has a center a spread and a shape now all of those formulas are given to you on the AP stats formula sheet which is actually really easy to follow there's a table for proportions and there's a table for means and in the proportion table there's a row for one sample and there's a row for the difference of two samples and same thing with means so all these formulas are there for you you don't have to memorize any of them but you do need to truly understand what a sampling distribution is and how to find the center the spread and the shape of a sampling distribution based on the information given to you so you need those population parameters and you need that sample size but you also have to make sure that you check all those conditions because if some of those conditions fail your shape might not be normal you might not be able to use the standard deviation formula and you certainly might be biased which means your center is not going to be where you think it is so hopefully it all made sense and knowing how you can actually use those normal distributions to then ask probability questions and answer them is really really important all those types of things are going to be on the AP exam so hopefully unit 5 wasn't too bad but take a look at that study guide to really dive into the different problems I gave you on how to build the sampling distributions and how to calculate probabilities with them all right see you in the next video hopefully you enjoyed it talk to you later
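
The three worked examples from the video (the school districts, the cell phones, and the orcas) can be checked with a short Python sketch. This is only an illustration under stated assumptions: it uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of the calculator's normalcdf and invNorm, the helper names `diff_props_model`, `sample_mean_model`, and `diff_means_model` are made up here and do not come from the video, and it takes for granted that the random, 10 percent, and normality conditions described above have already been checked.

```python
from math import sqrt
from statistics import NormalDist


def diff_props_model(p1, n1, p2, n2):
    """Normal model for (sample proportion 1 - sample proportion 2)."""
    center = p1 - p2
    spread = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return NormalDist(center, spread)


def sample_mean_model(mu, sigma, n):
    """Normal model for the sample mean of n observations."""
    return NormalDist(mu, sigma / sqrt(n))


def diff_means_model(mu1, sigma1, n1, mu2, sigma2, n2):
    """Normal model for (sample mean 1 - sample mean 2)."""
    center = mu1 - mu2
    spread = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    return NormalDist(center, spread)


# School districts: A passes 80% (n = 75), B passes 76% (n = 100).
districts = diff_props_model(0.80, 75, 0.76, 100)
# B beats A when the difference A - B is below zero.
p_b_higher = districts.cdf(0)

# Cell phones: mu = 180 g, sigma = 15 g, samples of 45.
phones = sample_mean_model(180, 15, 45)
p_mean_above = 1 - phones.cdf(182.5)
# Middle 95% of sample means: cut off 2.5% in each tail.
low, high = phones.inv_cdf(0.025), phones.inv_cdf(0.975)

# Orcas: males N(12000, 800) with n = 15, females N(10000, 900) with n = 10.
orcas = diff_means_model(12000, 800, 15, 10000, 900, 10)
p_gap_3000 = 1 - orcas.cdf(3000)

print(f"districts: sd = {districts.stdev:.4f}, P(B higher) = {p_b_higher:.3f}")
print(f"phones: sd = {phones.stdev:.3f}, P(mean > 182.5) = {p_mean_above:.3f}")
print(f"phones: middle 95% from {low:.2f} to {high:.2f} grams")
print(f"orcas: sd = {orcas.stdev:.3f}, P(diff > 3000) = {p_gap_3000:.5f}")
```

Each helper returns the center and spread wrapped in a normal model, so the probability questions become a single `cdf` call and the middle 95 percent interval becomes two `inv_cdf` calls. The printed values should line up with the ones quoted in the video, up to rounding.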