okay so last time we were talking about sampling, and that's what we're going to continue with today. We got about a quarter of the way through the chapter six notes, and of particular relevance to our current work is this little segment of the formula. If you've taken a typical statistics class before, you may or may not have seen the finite population correction factor, sometimes just referred to as the FPC. If you've not seen it before, it's likely because you were dealing with an infinite population, and with an infinite population the standard deviation of x̄, where x̄ is our sample average, is simply evaluated as it appears right here: σ_x̄ = σ/√n, where σ is the population standard deviation.

Okay, so on to more details about the finite population correction factor, which is an interesting little thing. What is its purpose? Principally, where it comes from and what it's used for: the finite population correction factor is this term, √((N − n)/(N − 1)), where capital N is the population size and lowercase n is the sample size, and we use it when n/N is greater than or equal to 0.05.

Now, what's the reasoning for that cutoff? The reasoning is more complicated than what we choose to go into for this course, because it comes from the world of theoretical statistics, so I'll just make some notes here. It comes from the hypergeometric distribution, which we are not specifically looking into in this course, at least at the moment, and it's an application of the following: facts about the normal curve, the central limit theorem, and the Markov and Chebyshev inequalities.

Here's the basic idea. What is the hypergeometric distribution? It's a discrete probability distribution that describes the probability of k successes in n draws without replacement. In other words, you make some n number of draws, that's this little n right here, from a population of size capital N, and the distribution describes k, the number of successes. You'll notice that k doesn't appear in this formula, and the reason is that we're modeling the sampling itself, which follows the hypergeometric distribution. You then use the normal approximation coming from the central limit theorem, together with the Markov and Chebyshev inequalities, to reach an approximation for the following scenario: when your sample is less than five percent of your finite population, the correction can be ignored, and the reason it can be ignored is that this term, the finite population correction factor, nears one.
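Just to make that concrete, here's a minimal R sketch of the FPC and the five percent rule. The population sizes and the σ value below are numbers I'm making up purely for illustration:

```r
# Finite population correction factor: sqrt((N - n) / (N - 1))
fpc <- function(N, n) sqrt((N - n) / (N - 1))

# Standard error of x-bar, applying the FPC only when n/N >= 0.05
se_xbar <- function(sigma, N, n) {
  se <- sigma / sqrt(n)                 # infinite-population form
  if (n / N >= 0.05) se * fpc(N, n) else se
}

fpc(N = 100000, n = 100)   # ~0.9995: near one, correction ignorable
fpc(N = 1000,   n = 100)   # ~0.949:  n/N = 0.10 >= 0.05, so we apply it
se_xbar(sigma = 15, N = 1000, n = 100)
```

You can see directly that when n is a tiny fraction of N, the factor is essentially one and the two formulas agree.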
As that term approaches the value of one, the two formulas, the finite-population and infinite-population versions, approach equivalent values. So the whole idea of this correction is that when you are sampling a large percentage of your population, and in this instance "large" means greater than five percent, you can in some sense think of it as an oversampling, and you have to correct for how you have oversampled.

Okay, so continuing with this. We typically do not know the actual population standard deviation, and so as a result we more often than not use s, the estimated standard deviation. That requires a change in our two formulas, where the only difference is that we use s instead of σ. Just to highlight the distinction between these two things: σ is a population parameter, and s is a sample statistic.

Now, when the population has a normal distribution, there are a couple of things we can know, namely that the sampling distribution is normal. When the population does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution. Specifically, it says that when the size of our samples is large, the sampling distribution behaves like a normal distribution. That's the key fact of the central limit theorem, which really any of your prior statistics classes should have emphasized. The central limit theorem is very, very important to statistical work in general, but more specifically it's very important to business applications, because it basically says that more often than not, what we need to know about is the normal distribution.

Okay, so let's take a glance at a few pseudo-example distributions. This is an illustration, as it were, of three different distributions: population one, population two, and population three. These are the actual population distributions themselves, and we can sample from them, where the sampling distribution comes from choosing different values, say samples of size two. What we're looking at here is not a chart of a particular sample, but rather the shape you get from all possible samples of such a distribution. And the answer is that because this marked value right here is the mean, the values of the sampling distribution are centered around that mean value, and likewise for the second and third populations, where they are centered in some sense around their own means. What happens is that as the size of the samples you're picking goes up, and really as it tends toward the number 30, this is again what the central limit theorem says, the shape of the sampling distribution becomes a bell curve overall.
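Here's a small R sketch of that idea. I'm using an exponential population as a stand-in for a skewed, decidedly non-normal distribution; that choice, the seed, and the 5000 repetitions are mine, just for illustration:

```r
# Central limit theorem illustration: sample from a skewed (exponential)
# population and watch the sampling distribution of x-bar become bell-shaped.
set.seed(1)   # arbitrary seed for reproducibility
for (n in c(2, 10, 30)) {
  xbars <- replicate(5000, mean(rexp(n, rate = 1)))
  hist(xbars, breaks = 40,
       main = paste("Sampling distribution of x-bar, n =", n))
}
```

At n = 2 the histogram is still visibly skewed; by n = 30 it already looks like a bell curve, which is exactly the behavior the theorem describes.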
Okay, so let's look at an example of sampling: a binomial coin toss. Let's choose 100 as our sample size, it's a fair coin, 50-50, and we're going for the number of heads. What's the number of heads that we should expect? Let's generate a data set. Okay, so this is an example data set of what we're talking about: a hundred coins, heads, tails, heads, tails, and so on. We're going to run an experiment where we generate random samples; let's generate exactly 10 samples, with a set number of data points per sample. Right here we have samples 5 through 10, and you'll notice that each of these contains a hundred different coin tosses, because that's the way we've generated it.

Then we'll see, for all ten of these, what happens if you take the average of the numbers of heads and tails, because there's some certain number of heads and some certain number of tails in each of the samples we did. You'll notice there's a different number of heads here than there is here; this one has more heads than that one, and this one has more tails than that sample, 8 versus 9. We could then compute the average number of heads across the samples, but we don't have to do it by hand, so we'll let the software do it, and then we can observe the distribution. The distribution is that from our various samples, none exactly hit the overall mean of our distribution, but there was one particular sample with 71 heads out of 100, there was also one with 54 out of 100, there were five with 58 heads, and three with 67 heads. That's just where the randomness landed, and if you'll notice, counting the one sample, the three samples, the five, and the remaining one, that totals all ten samples. So this is what we mean when we say sampling: we're sampling from random tosses of 100 coins, and then we're looking at what should happen overall.

Now here's the thing: for a fair coin toss, we know the number of heads we should be reaching, which is 50. And if you continue to do this kind of thing, let's reset the coin toss, set 100 tosses again, generate the data set, and generate random samples, and this time let's do a larger number, let's do 30 samples. With those 30 samples generated, down here we can scroll through, say, samples 15 through 30.
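If you want to reproduce this experiment yourself, here's a hedged sketch in R; the seed is arbitrary, and the sample counts match the ones we just clicked through:

```r
# Coin-toss sampling experiment: each sample is 100 fair coin tosses,
# and we record the number of heads in each sample.
set.seed(42)                                  # arbitrary seed
n_tosses  <- 100
n_samples <- 10                               # try 30 or 1000 as well
heads <- replicate(n_samples,
                   sum(rbinom(n_tosses, size = 1, prob = 0.5)))
table(heads)   # how many samples landed on each head count
mean(heads)    # drifts toward 50 as n_samples grows
```

Rerunning with n_samples set to 30 and then 1000 mirrors what we're about to do on screen.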
Each of these samples has some number of heads in it based on random selection, and when we look at what this distribution looks like, you'll notice that our mean value here is 0.53, so the average number of heads across all these samples is 53. That's much closer to what we expect. In fact, let's run this one more time: binomial coin toss, set our number at 100 again, generate the data set, and it's going to look something like that, and then let's generate a much larger number of random samples, let's do a thousand. What do we expect to happen as we increase the number of samples? (Sorry, I have to generate my random samples first; now let's go with the mean here.) Okay, so you can notice that right here there are 196 samples with 48 heads, and the largest group of samples, 302 of them, has 52 heads out of the 100. So when I say that the central limit theorem says that sampling overall behaves like a normal distribution, I mean this: if you look at the shape of this distribution, it looks like a bell curve, and by looking like a bell curve it behaves like a normal distribution. Okay, so that's kind of the point of all of that.

So let's continue on. The normal distribution that is approximated by the bar graph we just looked at is based on the standard deviation of the population divided by the square root of n, where n is the sample size. And this is just an additional example of the difference between distributions and populations: when you have a larger sample size, you tend toward more of a normal-distribution appearance in your sampling distribution, so as you change the sample size, you get a different sampling distribution with each sample size you pick, even when sampling from the exact same population.

Now, we have already actually looked at this: our formula is p̄ = x/n, where x is the number of elements in the sample that possess the characteristic of interest and n is the sample size. We previously mentioned this and gave the formula earlier, and proportions are just one additional component of this process. You'll notice that this standard error also has the finite population correction factor; the thing that has changed is this portion right here, which is σ_p̄, where p̄ is the sample proportion, the expected proportion from the samples. So this is where we're doing proportions, and proportions are percentages: if I say proportion or percentage, it's the same thing. We will ultimately see some examples of this, and similarly, when we look at sampling distributions of proportions, they also tend to be bell-shaped and therefore approximately normally distributed when we look at the average proportion from samples. Again, the key idea is that you have an estimated standard deviation of p̄, and you have two formulas, one for a finite population and one for an infinite population.
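Here's the same kind of minimal R sketch for the proportion case, again with made-up values for p, N, and n:

```r
# Standard error of the sample proportion p-bar, with the same finite
# population correction as before.
se_pbar <- function(p, N, n) {
  se <- sqrt(p * (1 - p) / n)            # infinite-population form
  if (n / N >= 0.05) se * sqrt((N - n) / (N - 1)) else se
}

se_pbar(p = 0.5, N = 1e6, n = 100)   # ~0.05, FPC ignorable
se_pbar(p = 0.5, N = 500, n = 100)   # FPC applied since n/N = 0.20
```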
This leads us to the topic of interval estimation, which I previously mentioned as well, because we are now moving on to confidence intervals, which are the popular notion of an interval estimate. Because a point estimator cannot be expected to provide the exact value of a population parameter, interval estimation is frequently used to generate an estimate of the value of a population parameter. This allows us to bring probabilities into the mix, and as a result, it allows us to attach percentage likelihoods when describing a point estimator.

So the question becomes: what is the general form of an interval estimate? In interval notation it's (θ̂ − m(X), θ̂ + m(X)), or interchangeably the notation is written θ̂ ± m(X), and sometimes θ̂(X) ± m(X). Now, what are these terms? X here is the data, and by that I mean a data set; θ̂ is a point estimate; and m is a margin of error. The general form of an interval estimate of a population parameter has to do with giving an estimate of how close a particular point estimate is to the value of the population parameter.

So for instance, an interval estimate of a population mean would be of the form x̄ ± z_(α/2) · σ/√n, where α is the level of significance, divided by two so that it's split over the symmetric halves of the bell curve. This is for the case where σ is known.

The question then becomes: what happens if σ is unknown? If an estimate of the population standard deviation cannot be developed prior to sampling, we use the sample standard deviation s to estimate σ. In this case the interval estimate for μ, because μ is what is being estimated here, is based on the t distribution as opposed to the normal distribution. And of course, if we remember our statistics from previous classes, the t distribution is a bell-shaped distribution. The only difference in the formula is that we use x̄ ± t_(α/2) · s/√n, again with α/2 for the symmetric halves, where we have n − 1 degrees of freedom. We will ultimately see more to do with this coming up.

But it's now about time to ask: what about the case where we have a population proportion? The population proportion p is estimated with p̄ ± z_(α/2) · √(p(1 − p)/n) where p is known; where p is unknown, the p's become p̄'s, same as usual. We typically use a normal distribution here as well, though I will say that what is oftentimes standard is to use an estimate of the z value rather than an exact one. Now, sometimes different correcting methods are used, and in some instances we use the empirical rule: in place of z_(α/2) we use a 3. Why a 3? Well, because at z = 2.8, which is really close to three, we already capture over 99 percent of a bell curve, so it's an approximation process where we often replace the quantile with a three. In that instance the formula looks like p̄ ± 3 · √(p̄(1 − p̄)/n), and that's a relatively standard procedure.
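Here's a hedged sketch of both mean intervals in R; x̄ = 82, n = 36, σ = 12, and s = 12 are numbers I'm inventing just so the code runs:

```r
# Interval estimates for a population mean at 95% confidence (alpha = 0.05)
xbar <- 82; n <- 36; alpha <- 0.05

# Case 1: sigma known -> normal (z) distribution
sigma <- 12
zval <- qnorm(1 - alpha / 2)               # 1.959964..., the familiar 1.96
c(xbar - zval * sigma / sqrt(n), xbar + zval * sigma / sqrt(n))

# Case 2: sigma unknown, estimated by s -> t distribution, n - 1 df
s <- 12
tval <- qt(1 - alpha / 2, df = n - 1)      # slightly wider than the z value
c(xbar - tval * s / sqrt(n), xbar + tval * s / sqrt(n))
```

Notice the t interval comes out a bit wider than the z interval, which is the price of estimating σ from the sample.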
This is again what I was talking about with the empirical rule: for a normally distributed variable, the exact z values we use are 1.645 for 90 percent of the data, 1.960 for 95 percent, and 2.576 for 99 percent; and if you go out to z = 2.8, you're already capturing roughly 99.5 percent of the data, which is why rounding up to 3 is a safe approximation.

So for instance, when we're talking about the sampling idea, the way we can imagine this is: if we mark the value of μ right here, then we could have one particular sample whose average value is here, one whose average value is there, one whose average is way over here, another here, another here, and so on. The average of those values, however, needs to fall in line with what we specified before as being a normal distribution.

Now, as I said, in the event that we don't actually know some of the information, what we use is a t distribution instead. The t distribution is a family of probability distributions similar to the normal distribution; the shape of each specific one depends on a parameter referred to as the degrees of freedom, which we have seen before. As the degrees of freedom increase, the t distribution narrows and becomes taller as well. Right here, for instance, is a t distribution with 10 degrees of freedom, this is a t distribution with 20 degrees of freedom, and this is the standard normal distribution. As the degrees of freedom tend toward infinity, the t distribution becomes matched with the normal distribution, meaning that it is a limiting process. So if you have had calculus, when I say limit I mean exactly that, a limit like from calculus: the limit of these functions is the normal distribution.

Now, when we are talking about interval estimation with the t distribution, as the degrees of freedom get closer and closer to 30, the t distribution approaches the normal distribution with a remarkable degree of accuracy, giving essentially the same classical empirical values that we read off from the normal distribution. For this reason, when we have an n of approximately 30, and sometimes the cutoff is even boiled down to 27 or thereabouts, various rules float around. These rules are not rules in the same way we'd interpret legal rules, where you cannot break them; rules in statistics are guidelines, and that's typically how you need to think of them. More often than not I do say "guidelines," but a lot of people read them off as if they were hard-and-fast rules. So when our n for a t distribution is close to 30, our values are very close to normal distribution values.

What we can see right here is a chart of some example samples, which is a fun sort of tongue twister, and where their interval estimates lie. Each of these is an interval estimate based on a particular sample, and these various intervals do not all capture the actual true population mean, but the majority of them do include it. Sample 3, for instance, does not include the population mean; however, again, as we said, the majority of them do.
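We can simulate exactly this chart in R. The population values, seed, and repetition count below are my own made-up choices; the point is the coverage fraction at the end:

```r
# Draw many samples from a normal population, build a 90% t-interval
# from each, and count how many intervals contain the true mean mu.
set.seed(7); mu <- 50; sigma <- 10; n <- 30; alpha <- 0.10
covers <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  m <- qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)  # margin of error
  (mean(x) - m <= mu) && (mu <= mean(x) + m)
})
mean(covers)   # lands near 0.90: "90 percent confident"
```

Note that qt(0.95, df = 29) returns 1.699, the exact t value we're about to see.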
And therefore, with some particular percentage of "accuracy," accuracy in quotation marks, we can say that the samples with their confidence intervals on average include the population mean. You'll notice that if we reduce the size of each of those intervals, more of them will be cut out from including the actual population mean. So, because approximately 90 percent of all the intervals constructed will contain the population mean, we say that we are approximately 90 percent confident that the interval will include the population mean. This is the idea of confidence: more of the intervals constructed from the samples will contain the population mean than not, and therefore we have a certain amount of confidence. And by the way, the 90 percent was coming from the fact that −1.699 and +1.699 are exactly the values we would always compute on a t distribution, to the left and to the right. This is a t distribution here, let's make that clear; I think I did comment on that, but let's be precise: this is a t distribution.

So confidence is a measurement of the percentage of sample-defined interval estimates that will include the actual population parameter. The level of significance, by contrast, is the probability that the interval estimation procedure will generate an interval that does not contain the population mean. So there is a back and forth: confidence is (1 − α) × 100 percent, and α is what we refer to as the level of significance. If α = 0.05, then (1 − α) × 100 = 95 percent; that's some quick math I'm not going to break down.

That then tells us about this α/2 value we were referring to: there are two places where it happens. You see how it occurs on one side right here; well, there's a symmetric portion that gets filled in over here that matches it, so that α/2 and α/2 together sum to α, which is the total percentage. Keep in mind that a distribution, and this is a t distribution in this particular instance, always talks about probability: the area underneath the graph of the distribution is a probability amount.

Now we can look at an example problem where, for instance, we have credit card balances for a sample of 70 households. We can compute the mean, standard error, median, mode, and so on from all of these, and then come up with point estimates for the mean credit card balance, including the upper and lower limits, which are the left and right endpoints of the confidence interval, that is, θ̂ − m and θ̂ + m. Going back to our interval estimation formula right here, this underlined part is the margin of error, and this part, graphically, is what we refer to as the margin of error. We will see actual computations of this quite soon, maybe some in Excel, but more particularly we'll use R to do our calculations.
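Since we don't have the actual credit-card data set in front of us, here's a hedged R sketch with simulated balances standing in for the real ones; the mean and spread I chose are invented:

```r
# Credit-card example, sketched with simulated balances (made-up data):
# 70 households, point estimate and 95% interval limits.
set.seed(99)
balances <- round(rnorm(70, mean = 9000, sd = 4000))  # stand-in data
mean(balances)                                 # the point estimate
t.test(balances, conf.level = 0.95)$conf.int   # lower and upper limits
```

The conf.int component of t.test is exactly the pair of upper and lower limits we just described, computed via the t distribution with 69 degrees of freedom.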
Now, it's a very similar procedure when we talk about a population proportion; the only things that have changed are our formula for the margin of error and the population parameter we're estimating, which right here is p̄, the sample proportion. And right over here, all of this, the z_(α/2) × √(p̄(1 − p̄)/n) term, becomes the margin of error, and we have the exact same procedure with a left and a right value, or what we would refer to as the lower and upper bounds. Right here, this is the lower bound, and right here, this is the upper bound: to get to the lower value it's p̄ minus z_(α/2) × σ_p̄, where σ_p̄ is that √(p̄(1 − p̄)/n) term. So it really is remarkably similar, and we can do an example of a 95 percent confidence interval for proportions; the formulas are remarkably similar, and it's really just an easy Excel or R calculation to reach those figures (there's a short R sketch of exactly that appended below). So in our next lecture we are going to begin talking about hypothesis tests, and that will kind of work us toward the end. Have a great one.
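Here's that appended sketch of the 95 percent proportion interval; the counts, 220 "successes" out of n = 400, are made-up numbers purely for illustration:

```r
# 95% confidence interval for a proportion, computed by hand.
pbar <- 220 / 400; n <- 400
m <- qnorm(0.975) * sqrt(pbar * (1 - pbar) / n)   # margin of error
c(pbar - m, pbar + m)                             # lower, upper bound
```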