Hello, students. Now we begin Chapter 5. In this chapter we're going to discuss variability: various ways of measuring how much the scores in a distribution vary. This is probably one of the more important chapters, because variability is the basis for many of the statistical concepts you'll be presented with in later chapters.

Let's look at what we'll be covering. We'll talk about what the concept of variability refers to; the standard deviation and the variance as measures of variability; how to interpret them; and how to compute them, both when describing a sample and when describing a population. We'll also cover some new notation and, importantly, the concept of accounting for variability, which will be discussed in future chapters as well. As is the case with many of the concepts here, you get an introduction in this chapter, but you'll see references to these ideas throughout the textbook, so it's important to have a good understanding of them at this point.

First, some notation. The first expression, Σx², indicates the sum of the squared x's: you square each x, that is, each score in the distribution, and then sum the squares. The second, (Σx)², is the squared sum of x: you sum the x scores first and then square the total. These give different results, so keep them straight; you'll see both again in a minute.

So what are some measures of variability? Variability refers to how much the scores in a distribution vary, that is, how much they differ from one another. I'll give you some examples in just a second. It speaks to how consistent the scores are and how accurately the distribution is described by a measure of central tendency. For instance, how accurate is the mean in describing a distribution? That depends on how varied the scores are, on how spread out the distribution is. Let's take a look at some of these concepts.
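Before the examples, a quick numerical check of the two notations above may help. This is an illustrative sketch with made-up scores, not data from the textbook:

```python
scores = [1, 2, 3]

# Sum of the squared x's: square each score first, then add them up.
sum_of_squares = sum(x ** 2 for x in scores)   # 1 + 4 + 9 = 14

# Squared sum of x: add the scores first, then square the total.
squared_sum = sum(scores) ** 2                 # (1 + 2 + 3)^2 = 36

print(sum_of_squares)  # 14
print(squared_sum)     # 36
```

The two quantities are generally different, which is why the computing formulas later in the chapter are careful to distinguish them.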
Look at samples A, B, and C. Each is a small set of scores, and in each case the mean is 6, but the distributions look quite different as you move from A to B to C. In sample A the scores are 0, 2, 6, 10, and 12, so the scores are very spread out, ranging all the way from 0 to 12, and again the mean is 6. In sample B the mean is also 6, and the scores are 4, 5, 6, 7, and 8; here the variability is much smaller, with scores ranging only from 4 to 8. In the third sample, C, again with a mean of 6, everyone scored 6, so there is no variability at all: everyone has the same score.

Think about this for a second: in which of these cases, A, B, or C, is the mean a better descriptor of the scores, or a better predictor of how people will perform in the future on whatever task generated these scores? With less variability, the mean is a better predictor; hopefully that's intuitive.

You can also look at the spread of scores using the normal distribution. These are variations of a normal curve, and it's really the same concept you just saw: distributions A, B, and C differ in how spread out the scores are. In all three cases the mean is 50, and remember that the vertical axis shows frequency, represented by f. In distribution A there is less spread of scores: a much greater frequency of people score right around that middle value of 50. In distribution B there is a little more spread, and in distribution C there is a much greater spread still. Again, all three distributions have a mean of 50.
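To make the comparison concrete, here is a small Python sketch using the three samples from the slide. The spread is measured with the defining formula for the sample variance that this chapter introduces shortly, so treat this as a preview:

```python
def mean(scores):
    return sum(scores) / len(scores)

def sample_variance(scores):
    # Defining formula: the average of the squared deviations from the mean.
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)

sample_a = [0, 2, 6, 10, 12]
sample_b = [4, 5, 6, 7, 8]
sample_c = [6, 6, 6, 6, 6]

for name, s in [("A", sample_a), ("B", sample_b), ("C", sample_c)]:
    print(name, mean(s), sample_variance(s))
# A 6.0 20.8
# B 6.0 2.0
# C 6.0 0.0
```

All three means are 6, but the variance separates the samples immediately, which is the sense in which the mean describes sample C best and sample A worst.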
So again I ask the question: in which case does the mean better describe where most scores fall, or better predict how people will score in the future? In this case it is distribution A, of course.

Moving on to some actual measures of variability, we'll start with the range. The range is not used that often in statistics, but it is a measure of variability, and it does give you a sense of how spread out the scores are. The range is a single value, by the way; students sometimes confuse that term and, when asked for the range, report both the highest and the lowest score. The range is the difference between the highest and the lowest score: you identify the highest score, you identify the lowest score, and you take the difference. It really doesn't tell you much more than that. You could have one very high score and one very low score but most scores right around the center; your overall variability would be quite small, yet the range would make the spread look quite large.

A better measure, and one that is more frequently used, is the variance together with the standard deviation. When our variable is measured on an interval or ratio scale, the mean, as you saw in the last chapter, is often used as our measure of central tendency, and in that case it is appropriate to compute the variance and the standard deviation to describe the variability. As you also saw in the last chapter, the mean is the appropriate measure of central tendency when scores are normally distributed, again with interval or ratio scores.

As we discussed in previous chapters, researchers deal with samples and populations. The population is the group of individuals we are interested in when we do research, and from that population we typically draw a sample. Not always; sometimes we can look at the entire population, but typically we work with a sample.

So what do the sample variance and the sample standard deviation look like? The sample variance, and we saw a bit of this in the last chapter, is the average of the squared deviations of scores around the sample mean. Remember, in the last chapter we took the sample mean and looked at the difference between it and each individual score; those differences are the deviations of the scores around the sample mean. And remember that when you add up those deviations, the total will equal zero by definition, because the deviations above the mean exactly balance the deviations below it. Since that sum is always zero, we calculate the sample variance with a slight variation: we square each deviation and then add the squares. The formula for the sample variance is simply the sum of those squared deviations from the mean, not the deviations themselves but the squared deviations, divided by N, the number of cases:

S²x = Σ(X − X̄)² / N

This is referred to as the defining formula for the sample variance. Later in the chapter, and in other chapters as well, you'll see this distinction: some formulas are described as defining formulas and others as computing, or computational, formulas. What's the difference? Defining formulas are the ones that help you understand, that define, if you will, the concept we're measuring. The formula above is very intuitive: the squared deviations from the mean, summed, then divided by the number of cases; it's like the average squared deviation from the mean, so it makes sense. In real life, however, computing the sample variance this way is very cumbersome when you have many cases. The defining formula is presented so that the concept of variance is easy to understand; later in this chapter you'll see a computing formula for the sample variance that gives exactly the same number but is simply easier to compute. We start with the defining formula.

The notation for the sample variance, as you can see (you might not be able to tell at this point), uses an uppercase S: S²x is the variance of these x scores. Now, what about the standard deviation? The sample standard deviation is simply the square root of the sample variance:

Sx = √( Σ(X − X̄)² / N )

Again, this is the defining formula for the sample standard deviation. Take a look at this formula and compare it with the previous one: it's exactly the same except for the square root. The standard deviation is the square root of the variance; that's the only difference. So don't be confused; you're going to be presented with a lot of formulas, and that is the whole difference here. Notice that the symbol is different too: Sx is the standard deviation, S²x is the variance. With the variance we square the deviations; taking the square root of the variance brings us back to the original units of measurement and gives a number that is meaningful as, roughly, the average deviation from the mean.

Earlier in the book, when we first introduced the normal distribution, I made a statement about how the normal distribution allows us to make statements about the percentage of scores that fall under the curve at various locations, and the standard deviation is really what allows us to do that. If the data are normally distributed, we can state the percentage of scores that fall between the mean and, for instance, plus or minus one standard deviation. Looking at this normal distribution, the mean is 80 and the standard deviation is 5 for this set of data. That means about 34% of the scores fall between 75 (one standard deviation below the mean) and 80, and about 34% of the scores fall between 80 and 85 (one standard deviation above the mean). So about 68% (34% plus 34%) of the scores fall between 75 and 85. You also have information here about the percentage of scores that fall beyond one standard deviation in either direction: about 16% of the scores fall more than one standard deviation below the mean, and about 16% fall more than one standard deviation above it. These percentages hold no matter what your data look like, as long as the distribution is normal. The curve will look a little different if the standard deviation is smaller or larger, more or less spread out, but for a normal distribution you will always have about 68% of the scores falling within plus or minus one standard deviation of the mean. That will always work, and it's a nice theoretical property for working with various statistics we'll encounter later on.

As I said, the actual curve will look a little different depending on the standard deviation. Here you have a distribution with a standard deviation of 4; the mean is 50 in all three cases. That means 68% of the scores fall within plus or minus 4 of 50, and it should be intuitive from the shape of the curve that most of the scores are clustered right in the middle. The second distribution is a little more spread out because its standard deviation is 7; the concept is the same, with 68% of the scores falling within one standard deviation of the mean, but the distribution is wider. And in distribution C, because the standard deviation is 12, the scores are even more spread out, but again the concept is the same: 68% of the scores fall within plus or minus one standard deviation.

All right, so that's the sample standard deviation and the sample variance, and that's how you calculate them when you have the scores for a sample. But what about the population? If you have information on the whole population, the population mean and all the scores, can you calculate the variance and standard deviation? Absolutely. The population variance is the true, or actual, variance of the population of scores, and again what follows is the defining formula; the concept is the same. You take each score minus μ, the population mean, square the difference, sum all of those squared deviations, and divide by N, the number of cases:

σ²x = Σ(X − μ)² / N

The symbol is different: σ² (sigma squared) instead of S². Again, this applies when you have all the scores in the population and the population mean. And, perhaps not surprisingly, the population standard deviation, σx, is simply the square root of the population variance.

Now, it's not often the case that we have that information about the population. More often in statistics we take a sample and estimate the population values from it. So now we come to the estimated population variance and the estimated population standard deviation, computed from our sample. The sample variance with the uppercase S that we were just exposed to is actually what's referred to as a biased estimator of the population variance: it allows us to estimate the population variance, but the estimate is biased. The sample variance and sample standard deviation both tend to underestimate the true population parameters. What do we do about that? We simply divide by n − 1 instead of n in the denominator:

s²x = Σ(X − X̄)² / (n − 1)

This should look very much like the sample variance formula you saw earlier in the chapter, with a couple of differences. First, we divide by n − 1 instead of n: when we estimate the population variance from sample data, dividing by n − 1 creates an unbiased estimator of the population variance. Notice too that this s is lowercase: we use a lowercase s when estimating the population variance, and an uppercase S when simply describing the sample variance. Again, not surprisingly, the estimated population standard deviation, sx, is simply the square root of the estimated population variance; again you see the lowercase s, because we're estimating the population standard deviation from the sample, and again you see the n − 1. These are still defining formulas, meant to help you understand the concept: you take each value, subtract the mean, and square the difference, which should make sense as a measure of variability.

So what is this n − 1 that gives an unbiased estimator? It's called the degrees of freedom, something you'll run into many times with the statistical tools we'll look at later in the semester, because we are so often estimating the population from the sample. The degrees of freedom is not always n − 1, but in this particular case, estimating the population variance and standard deviation from a sample, the degrees of freedom is N, the number of cases, minus 1, and it is often symbolized with a lowercase df.

How do we keep all these symbols straight? As we've moved through the chapter I've tried to describe the differences, but here's a summary. Uppercase S describes the variability of a sample: the first symbol, S², is the variance and the second, S, is the standard deviation of the sample. When we use a sample to estimate the population, we use lowercase s: s² is the estimate of the population variance and s is the estimate of the population standard deviation, and we use n − 1, the degrees of freedom, in the denominator.

Here's a chart to help you out: an organizational chart of descriptive and inferential measures of variability. Let me talk about those terms, descriptive and inferential, for a second. When we describe a sample, that is a descriptive statistic; when we use a sample to make an inference about the population, that is an inferential measure. That's the basic distinction. In describing variability, or differences among scores, descriptive measures are used to describe a known sample or population of scores: if we have all the scores, we're not making an inference, and, following the flowchart down, the formulas divide by N, not n − 1.
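The claim from a moment ago, that dividing by N underestimates the population variance while n − 1 corrects the bias, can be checked with a small simulation. This is an illustrative sketch, not from the textbook; the population parameters, sample size, and number of trials are all made up:

```python
import random

random.seed(42)

# A made-up population with mean about 50 and variance about 100.
population = [random.gauss(50, 10) for _ in range(100_000)]
mu = sum(population) / len(population)
sigma_sq = sum((x - mu) ** 2 for x in population) / len(population)

n = 5
trials = 20_000
biased_total = 0.0
unbiased_total = 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)  # sum of squared deviations
    biased_total += ss / n          # uppercase S squared: divide by N
    unbiased_total += ss / (n - 1)  # lowercase s squared: divide by n - 1

print(round(sigma_sq, 1))                 # true population variance, near 100
print(round(biased_total / trials, 1))    # noticeably below sigma_sq
print(round(unbiased_total / trials, 1))  # close to sigma_sq
```

On average the divide-by-N estimate comes out near (n − 1)/n of the true variance, about 80% of it here, while the n − 1 version lands close to the true value. Python's built-in statistics module makes the same distinction: statistics.pvariance divides by N and statistics.variance divides by n − 1.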
Here is the symbol for describing the sample variance, S²x, and if you take the square root you have the sample standard deviation, Sx. Here is the symbol for describing the population variance, σ²x; take the square root and you have the population standard deviation, σx. If instead you are using inferential measures, taking a sample and making an inference about the values in the population, you use the degrees of freedom, the n − 1, in your formulas: to estimate the population variance, notice the lowercase s²x, and you take the square root, sx, for the estimated population standard deviation. Again, this is review, but the organizational chart summarizes it.

Now, previously I said there are two different kinds of formulas, the defining formulas and the computing formulas, so we turn to the computing formulas. As I said, when you have a small set of data it's pretty easy to use the defining formula, and it helps you understand the concept, but as your sample size gets larger it becomes very cumbersome. The computing formula for the sample variance gives you the exact same value as the defining formula; it's just a lot easier to use:

S²x = ( Σx² − (Σx)² / N ) / N

We'll do some of these in class. Basically, instead of taking the difference between every value and the mean, you set up two columns and work with the squared sum of x, (Σx)², the sum of the squared x's, Σx², and N; those are the only values you need to calculate, which is much easier than the defining formula. And of course the sample standard deviation is simply the square root of that.

The computing formula for the estimated population variance should look very similar to the one for the sample variance, but note that the s here is lowercase and you have n − 1 in the denominator:

s²x = ( Σx² − (Σx)² / n ) / (n − 1)

Again you're dealing with the squared sum of x and the sum of the squared x's, which is much easier than subtracting each score from the mean, squaring, and so on. And, not surprisingly, you just take the square root of that for the estimated population standard deviation.

Now, one more concept I'm going to discuss here, as an introduction, because it will become more meaningful to you once it appears in context: the amount of variance accounted for. Typically when we do a study we're looking at variability in scores and trying to explain that variability. Let's take an example. Say we think that eating blueberries helps people jump higher; a silly example. Say we run an experiment where 100 people get a bowl of blueberries to eat each day for a month, another 100 people get no blueberries, and after the month we measure how high each person jumps. Within the blueberry-eating group there's going to be variability; some people will simply jump higher than others. Within the non-blueberry group, if you will, there will be variability as well. A strong relationship would show up as little variability within the groups but great variability between the groups. If blueberries have an impact on how high you can jump, you'd expect a large difference between the groups, say a much greater average jump height among the blueberry eaters than among the non-eaters. So you want the variability between the groups to be high, but you want the variability within each group to be relatively small compared to the variability between the groups. The extent to which blueberries can explain how high you can jump is measured, or looked at, in terms of variability: how much does blueberry consumption explain the variability in people's jump heights? That is the notion, the concept, of accounting for variability, and it's important in statistics: you want to be able to explain variability.

Sometimes you have a correlational kind of analysis instead, where you have x and y scores for each person. Along the same lines, say we're still interested in blueberries and jumping, but with a slightly different design: we go out to the public, ask people how many blueberries they eat per month on average, and measure how high they can jump. For each person we then have two scores, the number of blueberries they consume in a month and how high they can jump, and we run a correlational analysis between the two. What we'd be interested in there is how well blueberry consumption explains the variability in jump height. That is the general idea behind accounting for variance: explaining the variance. Can it be explained by blueberry consumption, or perhaps explained better by something else? Again, this is just an introduction to the concept, and we will see more of it, guaranteed, in other chapters.

That's all for the lecture part; we'll work on some problems in class. I hope that was helpful. See you at the next lecture.
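For the correlational design, the usual summary of variance accounted for is the squared correlation, r²; that statistic is developed in a later chapter, so treat this as a preview rather than material from this chapter. The paired scores below are invented purely for illustration:

```python
def mean(xs):
    return sum(xs) / len(xs)

def pearson_r(xs, ys):
    # Pearson correlation between paired x and y scores.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up paired scores: monthly blueberry consumption (x)
# and jump height in centimeters (y).
blueberries = [0, 10, 20, 30, 40, 50]
jump_height = [30, 33, 31, 36, 38, 40]

r = pearson_r(blueberries, jump_height)
print(round(r, 2))      # correlation, about 0.94 for this made-up data
print(round(r * r, 2))  # r squared, about 0.88: the proportion of the
                        # variability in jump height "accounted for" by
                        # blueberry consumption in this toy example
```

An r² near 0.88 would mean most of the variance in jump height is accounted for by blueberry consumption, and the remaining variance would have to be explained by something else; that is the sense of "accounting for variance" previewed above.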