Transcript for:
Understanding Normal Distribution and CLT

So in this module we're going to talk about the standard normal distribution and introduce the concept of the central limit theorem. We've talked about how data displays can be used to assess distributions of sample data, assess the quality of sample data, and make quantitative statements about sample data, and how the relationships among the measures of central tendency can describe the general shape of distributions. For example, in a normal distribution the mean, the median, and the mode are all in the same location in the distribution, while if the distribution is skewed, the order of the mean, median, and mode can tell you whether it is left skewed or right skewed.

So we want to talk a little bit more about the standard normal distribution as an example of how we can use distributions. Remember, the standard normal distribution is that special bell-shaped curve where the mean is equal to zero and the standard deviation and the variance are equal to one. Data from a normal distribution can be converted to the standard normal distribution by doing something called centering and standardization. Converting normally distributed data to the standard normal distribution takes two steps. The first step is to center the data: centering simply means subtracting the average of the data from each individual value, so you compute the overall sample mean and subtract that value from each of the observations. Step two involves dividing the centered data by the standard deviation of the data. When we do this, each new value is called a z-score, and this conversion essentially converts data from its native units to standard-deviation units.

Let's look at our height data as an example. We have our normally distributed height data shown here on the right, and in this particular case the average of the data is 65.24 with a standard deviation of 5.008. To compute the individual z-scores, for each observed x sub i we subtract off the mean of 65.24 and divide by the standard deviation of 5.008, and that gives us a new z value for that observation. When we do that, we get basically the same shape of distribution, as you can see here below, but now the statistics shown on the right say that the mean is zero and the standard deviation is one. Whenever you standardize data, the mean should be zero or close to zero and the standard deviation should be one or close to one. The data are now represented as how many standard deviations each value is above or below the mean.

So let's look at an example. Say we have an individual who is 5 feet 3 inches tall, or 63 inches. How many standard deviations away from the mean is this person? Well, we simply need to convert their 63 inches into a z-score based on the information from the previous slide. We take our observed value of 63, subtract off the mean, and divide by the standard deviation, and we get a z-score of minus 0.4473. So this individual who is five feet three inches tall is approximately 0.45 standard deviations below the mean of 65.24, and we know it's below the mean because of the negative sign: the z-score is minus 0.4473, so it falls below the mean.

So why do we care about this? Remember, the normal distribution has specific relative frequencies, the approximate percentages in the normal distribution. If you go plus or minus one standard deviation away from the mean, that covers approximately 68.26 percent of the area under the curve; plus or minus two standard deviations covers 95.44 percent of the area; and plus or minus three standard deviations covers approximately 99.74 percent of the area under the curve. So we can use the area under the standard normal distribution to estimate areas under the curve. For example, what if we had the question: what
proportion of people falls between the mean and 0.45 standard deviations below the mean? Well, all we need to do is compute the area under the curve between the mean and minus 0.45 standard deviations, because we're working below the mean here. Graphically, there's the z value of minus 0.45; we want to know the area between the mean and that z value, and if we compute that area under the curve, it's 0.1736, or 17.36 percent.

Let's look at another example. Say a student gets a score of 6 on her test, and the mean on the test was 3 with a standard deviation of 2. What is her z-score, just based on that information? And then, assuming the distribution of scores is approximated by a normal distribution, what proportion of students scored above a six? We compute her z-score: her score was a six, the average score was a three, and the standard deviation is 2, so her z-score is 1.5. Then we can use the standard normal distribution to compute the proportion of students who scored above a 6, which corresponds to a z-score of 1.5. Here's our z-score of 1.5; we want to know who scored higher than that, so we compute the area under the curve to the right, and that's 0.0668, or 6.68 percent. So roughly 6.7 percent of the students scored a six or higher.

Let's look at one more example. Say a student scores a 3.5 on the very same test as in the previous example. What's this student's z-score? And again assuming the distribution of scores is approximated by the normal distribution, what proportion of students scored within one standard deviation above and below this student? In order to figure that out, we first have to compute this student's z-score: this student scored a 3.5, the mean was 3, and the standard deviation was 2.
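Before going further, here's a minimal Python sketch of the z-score and area calculations from the examples so far, using only the standard library (the standard normal CDF can be built from math.erf; the function names here are my own, not from the lecture):

```python
import math

def z_score(x, mean, sd):
    """Center and standardize: express x in standard-deviation units."""
    return (x - mean) / sd

def norm_cdf(z):
    """Standard normal CDF, P(Z < z), built from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Height example: 63 inches against a mean of 65.24 and an SD of 5.008
print(round(z_score(63, 65.24, 5.008), 4))       # -0.4473

# Area between the mean (z = 0) and z = -0.45
print(round(norm_cdf(0) - norm_cdf(-0.45), 4))   # 0.1736

# Proportion of students above a z-score of 1.5
print(round(1 - norm_cdf(1.5), 4))               # 0.0668
```

The same two functions cover every area calculation in this module; only the z values change from example to example.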
So this student's z-score is 0.25. Now we want to know what proportion of the students scored within one standard deviation above and below this, so basically we want to know what the area under the curve is between minus 0.75 and 1.25, because those are one standard deviation below and above this student's z-score. There's minus 0.75, there's 1.25, and we compute the area in between those and get 0.6677, so 66.8 percent of the students had scores within one standard deviation above and below this student's particular score.

For calculating areas related to a bell curve in this course we'll be using the following website, provided by Bognar at the University of Iowa. You go to the website and click on "Normal Distribution" under "Continuous Distributions" on the left-hand side of the page. You'll be allowed to use the Bognar website to compute areas, p-values, and other things on both the midterm and the final exam.

This is what the page looks like when you select the normal distribution. It provides a graphical interface, so it shows you what the graph is going to look like for whatever you're computing. The tool handles all types of normal distributions, not just the standard normal: at the top you enter your distribution characteristics, so a mean of 0 and a standard deviation of 1 gives you the standard normal distribution, but you could set up other distributions, say a mean of ten and a standard deviation of two. Then there are two boxes where you enter either a p-value or a z value; the z value goes in the box labeled x, and the other box is your p. You enter one or the other: you can enter a p-value to compute the z, or enter the z to get a p.

The drop-down menu in the middle lets you select which part of the distribution you want to look at, and we'll go into this in a little more detail in just a second. From the drop-down menu you can choose the area greater than the observed value (in other words, your z value), the area below the z value, the two-sided value, or the area in between the values. Let's look at those four options with a z value of 1.96. If I choose the area above 1.96, graphically the blue section shows you where 1.96 is and the red shows you the area, which corresponds to 0.025. If I choose the area below my observed z value, so everything below 1.96, again the blue part shows you where 1.96 is, and it computes all of the red area, which is 0.975. Remember, the area under the curve for this distribution sums to 1.
So this is basically saying that 97.5 percent of the area falls below 1.96; alternatively, the area above 1.96 is 0.025, or 2.5 percent of the area. The two-sided option looks at the area on either side of the distribution: you can see it's 2p, twice the probability, so half of the area is below minus 1.96 and half of the area is above 1.96, giving you a total area of 5 percent. The last option calculates the area in between: if you go 1.96 above and below the mean, what is that area? That's what's computed here, and it's 0.95. So this figure is just showing the four different options you can select from that drop-down menu in the middle to decide what you're computing, and we'll go into this in more detail when we talk about p-values.

Going back to the proportions of the standard curve, let's use the Bognar website to reproduce the area under the curve for plus or minus one standard deviation, which is supposed to give us 68.26 percent of the area. I go to the Bognar website and enter the characteristics for the distribution: I'm doing the standard normal, so the mean is zero and the standard deviation is one. I want plus or minus one standard deviation, so I enter 1 for x, and I want the area between the negative and positive values, so I choose the option for the probability that X falls between minus x and x, which calculates that area under the curve. So again, here's minus one, there's positive one, and it computes this area as 0.68268.
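The four drop-down options map onto four simple tail-area formulas. Here's a sketch in Python (stdlib only; the variable names are mine, not the site's), reproducing the 1.96 examples and the plus-or-minus-one-standard-deviation area:

```python
import math

def norm_cdf(z):
    """Standard normal CDF, P(Z < z), built from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.96
above     = 1 - norm_cdf(z)               # P(Z > 1.96),  about 0.025
below     = norm_cdf(z)                   # P(Z < 1.96),  about 0.975
two_sided = 2 * (1 - norm_cdf(abs(z)))    # P(|Z| > 1.96), about 0.05
between   = norm_cdf(z) - norm_cdf(-z)    # P(-1.96 < Z < 1.96), about 0.95

# Reproducing the plus-or-minus-one-standard-deviation area:
print(round(norm_cdf(1) - norm_cdf(-1), 4))   # ~0.6827, the 68.26 percent rule
```

Note that `above`, `below`, and `two_sided` always sum the full unit of probability with `between`'s complement, matching the "area sums to 1" point above.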
The major limitation of the Bognar website is that it only reports up to five decimal places in its output, so very small p-values get reported as zero. Now, there is no such thing as zero probability, so if you go to Bognar and it shows a zero, you need to report that as less than 0.00001, the lower limit of the website. This is generally true no matter what tool you use: if you're using SPSS or R or SAS or whatever, and your software has some limitation in precision, you need to report your value as less than whatever the lower limit of that tool is. Really good software like R will use scientific notation and give you an exact value no matter what.

Okay, so with that, let's talk about the central limit theorem. Remember we said that as sample sizes increase, the shape of the t distribution approaches the standard normal distribution, and that had to do with the degrees of freedom in the t distribution. If you recall, we showed this figure before: as the degrees of freedom, which are related to the sample size, increase, the shape of the t distribution approaches the standard normal distribution. The exact statement of the central limit theorem is beyond the scope of this course, but the basic idea is that as the sample size n gets larger, the distribution of the sample mean approaches a normal distribution for most numeric variables. Specifically, if the population mean is mu and the population standard deviation is sigma, then the sample mean x-bar is approximately normal with mean mu and standard deviation equal to sigma divided by the square root of n. So the distribution of the sample mean is centered at the population mean, with the population standard deviation corrected for the sample size; that's where the square root of n comes into play. And the sample mean minus the population mean, divided by sigma over the square root of n, is approximately standard normal; this quantity is called the standardized sample mean.

For example, suppose you have data that looks like what's over here on the left: two clusters of data with a gap in between where there's no data, plus a spike over here, so it looks like a bimodal set of data. This is definitely not normally distributed by any stretch of the imagination. If we calculate the population mean and standard deviation, the population mean is 6.82 and the population standard deviation is 5.23. Let's see what happens when we take larger and larger samples with replacement (we draw a sample, return it, and draw again; that's what sampling with replacement means) and look at the possibilities for the sample mean and the standardized sample mean. Remember, to standardize we subtract off the mean and divide by the standard deviation to get a z-score. So if we take our current data, which has this funky distribution, and subtract the mean, then instead of the mean value being 6.82 it's now zero; then we divide by the standard deviation, so things are now on a standard-deviation scale. If we expand the standardized data, you can see that all of the data fall within plus or minus two standard deviations.

Here are the distributions of the sample mean and the standardized sample mean for n equals one: we draw just one sample, do that a bunch of times, and this is what the distribution looks like; if we convert those values to z-scores, that's what it looks like on the right. Now consider a sample size of two. We go in and grab two samples from the population: we draw
that sample, and then we take the average. So in the example, say we got a five and a ten; we take the average and get seven point five. Now we're going to do this a million times: go in a million times, grab two samples, and calculate the average, so we end up with a million averages based on a sample size of two, and then we draw a histogram, which looks like this. You'll notice this distribution looks a little different, because there's no longer that gap between the two modes, and now it looks trimodal instead of bimodal; our big spike is no longer there. Again, these are the raw means from our n-of-two samples, and over here on the right is the standardized distribution. The bulk of the data still falls within plus or minus two standard deviations, though you'll notice we're now going a little beyond two standard deviations.

Now if we increase the sample size to three, drawing three samples from the distribution, computing the mean, and doing that a million times, you'll notice the shape of the distribution changes. If we go to a sample size of four, the distribution changes even more, and as the sample size increases, the distribution starts to look more and more like a standard normal; in fact, at 20, at 50, and by the time we get to 100, it looks very much like a standard normal distribution.

The fancy statistical term for this is asymptotic behavior, and asymptotics apply for what we call sufficiently large n, a sufficiently large sample size. What "sufficiently large" means will vary depending on the particular context, but the general idea is that if the sample size is sufficiently large, then the distribution of the sample statistic will look like its limiting distribution. So the t distribution is asymptotically standard normal, the CLT says that sample means are asymptotically normal, and many types of tests on categorical data are asymptotically chi-square (not normal, but chi-square). But the basic idea of the central limit theorem is that as long as the sample size is sufficiently large, the distribution of sample means will approach a normal distribution. So if your sample size is sufficiently large, then there's a predictable distribution of the sample means, and you can rely on that to do certain statistical tests.
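The sampling experiment described above can be sketched in a few lines of Python. The population below is made up for illustration, since the lecture's actual data values aren't given; what matters is the with-replacement sampling, the means centering on mu, and the sigma-over-root-n spread the CLT predicts:

```python
import random
import statistics

random.seed(42)

# Made-up bimodal population (a stand-in for the lecture's data set,
# whose exact values aren't given in the transcript).
population = [1, 2, 2, 3, 3, 3, 11, 12, 12, 13, 13, 14]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population standard deviation

def sample_means(n, reps=20_000):
    """Draw `reps` samples of size n with replacement and return their means."""
    return [statistics.fmean(random.choices(population, k=n)) for _ in range(reps)]

def standardized_sample_mean(xbar, n):
    """(x-bar - mu) / (sigma / sqrt(n)): approximately standard normal for large n."""
    return (xbar - mu) / (sigma / n ** 0.5)

for n in (1, 2, 4, 30):
    means = sample_means(n)
    # The means center on mu, their spread shrinks like sigma / sqrt(n),
    # and their histogram looks more and more normal as n grows.
    print(n, round(statistics.fmean(means), 2),
          round(statistics.stdev(means), 2), round(sigma / n ** 0.5, 2))
```

Plotting a histogram of `means` for each n reproduces the progression of shapes described in the lecture; bumping `reps` toward a million just smooths the picture.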