Transcript for:
Understanding Variance and Standard Deviation (Lecture 4, Dispersion 3)

In the previous lecture we discussed the range of our data. In this lecture we're going to focus on variance and standard deviation. These calculations will be a bit more challenging than the previous lecture's, but they're very important, and we're going to work through an example to make sure everyone feels comfortable. We'll also spend plenty of time in class covering this, so that we're all on the same page. When we think about statistics, we are using our data to better understand our population. We've collected all this data, so it's best to use all of it to describe the patterns we're observing. The range measures dispersion using only the maximum and minimum values. Quartiles use a few more data points, but again we're only using a few data points in order to capture how much our data vary. What we're going to talk about today, variance and standard deviation, will actually use all of our data, so it will provide a better measure of dispersion around our central tendency. That is why this material is extra important to pay attention to and to feel comfortable calculating. Now, going back to the previous lecture, if we think about the sum of the deviations of our data from the mean, we know that some of our data are greater than the mean and some are less than the mean, and when we take the summation of all these differences, it's going to be zero. So that sum doesn't really provide us with much information on how much our data vary from the central tendency. But we can add a small wrinkle to this formula: we square each of these deviations. We then get a value, and the larger that value is, the more variability there is between our data and the central tendency, which indicates greater dispersion across the sample. This particular formula is called the sum of squares.
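The two ideas above, that raw deviations cancel to zero while squared deviations do not, can be sketched in a few lines of Python. The numbers here are made up purely for illustration, not taken from the lecture:

```python
# Hypothetical data set illustrating the sum-of-squares idea.
data = [4.0, 7.0, 10.0]
mean = sum(data) / len(data)          # mean = 7.0

# Plain deviations from the mean cancel out to (approximately) zero...
deviations = [x - mean for x in data]
total = sum(deviations)               # -3 + 0 + 3 = 0

# ...but squaring each deviation first gives a useful measure of spread.
sum_of_squares = sum((x - mean) ** 2 for x in data)   # 9 + 0 + 9 = 18
```

A larger `sum_of_squares` means the observations sit farther, on average, from the central tendency.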
This sum of squares is going to be useful both for this lecture and for many others moving forward, so you'll want to pay particular attention to it. The sum of squares is just the summation of the squared differences between our data points and the mean, and it works really well for describing how much variability we have in our data. But there is one particular problem we need to address: we've squared our data. This can be complicated if we're thinking about something like length or weight. What does length squared mean? What does weight squared mean? To account for this, we can use the basic mathematical function of the square root to get back to the original units we started with. So let's see how this works. We've got our sum of squares, in which we take the values in our data set, subtract the mean from each, square each of those differences, and then take the summation. To calculate the variance, which is our first step, we divide by the number of observations that we have. Something you'll notice here is that we've got Greek symbols: we've got mu, we've got sigma. This indicates that this is the formula for the population variance. We can then take the square root of the population variance, and we get our population standard deviation. So squaring our deviations gets rid of those pesky negative signs and lets us actually get a value besides zero, and taking the square root gets us back to our original units. We can use the variance and the standard deviation to get the average amount of variability in our respective data; that is, how much, on average, our data points deviate from the mean itself.
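The population formulas described here, sum of squares divided by N for the variance, then a square root for the standard deviation, look like this in Python. The four measurements are hypothetical:

```python
import math

# Hypothetical population of N = 4 measurements (e.g., lengths in cm).
population = [2.0, 4.0, 6.0, 8.0]
N = len(population)
mu = sum(population) / N                                 # population mean (mu) = 5.0

# Population variance (sigma squared): sum of squares divided by N.
variance = sum((x - mu) ** 2 for x in population) / N    # 20 / 4 = 5.0

# Population standard deviation (sigma): the square root returns us
# to the original units of measurement.
std_dev = math.sqrt(variance)
```

Note that `variance` is in squared units (cm²), while `std_dev` is back in the original units (cm).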
This means that our variance and our population standard deviation can range from zero to a particularly large value; the larger the value, the more variability we have. If we're thinking about a population that has a lot of variability among the different individuals and observations, then it's going to have a large variance and a large standard deviation, and its distribution is going to be very spread out and flattened. On the opposite side, if most of our observations are very similar to one another, then our variance and our standard deviation are going to be quite small, and our population is going to be represented by a more peaked, much narrower distribution. Now, when we talked about deviance and dispersion last lecture, we discussed the importance of sample size. The reason for this is that the denominator of both of these formulas is our respective sample size. So as the number of observations, our sample size, increases, the computed values for variance and standard deviation go down; there's less variability in our estimate. However, as the number of observations decreases, our variance is going to increase. This is why it's important in science that we're not only going out and collecting information on all the variables we're interested in, but that we have a relatively large sample size in order to represent the population. Now, one of the tricks here is that these two formulas are for our population, but rarely do we have information on every individual in the entire population. Usually we're using a sample, so we need to make a slight adjustment to our respective formulas for variance and standard deviation. If we compare our sample variance formula to our population variance formula, we'll notice a couple of key differences.
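The spread-out-versus-peaked contrast can be demonstrated with two small made-up data sets that share the same mean but differ in dispersion. `statistics.pstdev` is the standard library's population standard deviation (it divides by N):

```python
import statistics

# Two hypothetical data sets with the same mean (10) but different spread.
spread_out = [1, 5, 10, 15, 19]   # flattened, highly variable
peaked     = [9, 10, 10, 10, 11]  # narrow, observations very similar

# pstdev computes the *population* standard deviation (divides by N).
print(statistics.pstdev(spread_out))  # larger value: more dispersion
print(statistics.pstdev(peaked))      # smaller value: less dispersion
```

Both lists center on 10, yet the first produces a much larger standard deviation, matching the flattened-versus-peaked picture above.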
One is that we're comparing our data to the sample mean rather than the population mean. This is good, because regardless of our sample size we can calculate the sample mean, rather than relying on a value from the literature or not knowing what the population mean is. Another key difference is the denominator: for the sample variance, the denominator is n minus one rather than n. This change is a correction to account for the uncertainty we have based on our sampling. It's unlikely that our sample perfectly represents the population, so this makes our estimate a bit more conservative. And this is important for statistics, in that we're using sample data to make inferences about the population. We know that our sampling is imperfect, but we do the best that we can, and in doing so we make sure that our estimates are a bit more conservative than they might otherwise be, so that we are making our assessments appropriately. In this particular case, we'll use measures of central tendency to make estimates about the population from our sample, and then our sample variance and standard deviation will tell us how much variability there is within our data set, which will influence our ability to estimate that population mean. So our sample variance formula is the sum of squares of our sample with respect to the sample mean, divided by n minus one, and our sample standard deviation is the square root of that variance. Our mean provides us information on central tendency for our population, and our sample variance and standard deviation tell us how much variability we have, which will influence our confidence in that estimate. We'll begin talking about that in the next lecture. So how do we actually go through and calculate this variance and standard deviation?
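The n-versus-(n minus 1) distinction is built into Python's standard library, which makes it easy to see the correction in action on a small hypothetical sample:

```python
import statistics

# Hypothetical sample of n = 5 observations.
sample = [12.0, 15.0, 11.0, 14.0, 13.0]

# Population formulas divide by n; sample formulas divide by n - 1,
# which makes the sample estimate slightly larger (more conservative).
pop_var    = statistics.pvariance(sample)   # divides by n      -> 2.0
sample_var = statistics.variance(sample)    # divides by n - 1  -> 2.5

# Sample standard deviation: square root of the sample variance.
sample_sd = statistics.stdev(sample)
```

The sample variance always comes out a bit larger than the population formula applied to the same numbers, reflecting the conservative correction discussed above.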
Let's go through an example here together, and we'll spend some time in class doing so as well. Say we're interested in the length of these sharks. We've gone out and taken some length measurements, as we can see in our observations, and we've also calculated the sample mean at this point. The first step in calculating the variance is to calculate the deviation of each one of our data points from the mean. So we take each data point, subtract the mean from it, and record that value. We do so for each of our data points, and what we notice is that some of the values are positive and some are negative. This is an important note to make: if you're calculating these deviations and they come out all positive or all negative, then something has gone wrong. Either your calculation of the mean is off, or you're making a miscalculation in the deviations. You should have a relatively equal proportion of positive and negative values, and when you take their summation they should equal approximately zero. Before we calculate our variance, we need to calculate the sum of squares. We need to take those deviations and square them, because at this point all we have is the difference between each observation and the mean. So we go through and square these deviations for each one of our observations. Notice that all of these values are positive. This is an important checkpoint too: if you have any negative values, you want to check your calculations and recompute those values. Anything that's squared is going to be positive, whether it started as a positive number or a negative number. Now, another key point here is that following the order of operations is going to be key.
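The two checkpoints described here, deviations summing to roughly zero and squared deviations all non-negative, can be written as explicit sanity checks. The observations below are hypothetical, not the shark measurements from the lecture:

```python
# Hypothetical observations (not the shark data from the lecture).
data = [3.0, 8.0, 5.0, 4.0]
mean = sum(data) / len(data)                 # 5.0

# Step 1: deviation of each data point from the mean.
deviations = [x - mean for x in data]        # mix of positive and negative

# Checkpoint 1: raw deviations should sum to approximately zero.
assert abs(sum(deviations)) < 1e-9, "check your mean or your subtractions"

# Step 2: square each deviation.
squared = [d ** 2 for d in deviations]

# Checkpoint 2: every squared deviation must be non-negative.
assert all(s >= 0 for s in squared), "a squared value can never be negative"
```

If either assertion fails on your own data, revisit the mean calculation or the individual subtractions before moving on.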
You want to make sure that you square these deviations before you take the sum. If you take the sum and then square it, your value is going to be really close to, if not equal to, zero, and that's not going to be accurate. After you get the sum of squares, the summation of the squared-deviation column, you divide it by the number of observations minus one. In this particular case we've got 13 observations, so we divide by 12 and we get our variance of 131.58 and some change. To get our standard deviation, we take the square root of our variance, and we get a value that indicates how much variability we have in our data set. To report these values, we would report the mean plus or minus 11.47. The reason we have plus or minus is that we're taking into account that the data are both larger than and smaller than the mean itself, and this gives us an understanding of how much dispersion there is around that central tendency. We're going to spend quite a bit more time in class going through examples of how to calculate variance and standard deviation, but I encourage you to check out section 4.3.3 of the text, to work through practice exercise 4.1, and to check your answers. The next video will move forward with measures of dispersion, focusing on standard error and the coefficient of variation.
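The full workflow, deviations, sum of squares, divide by n minus 1, square root, then report mean plus or minus standard deviation, fits in one short script. The lengths here are a small hypothetical data set, not the 13 shark measurements from the lecture:

```python
import math

# Hypothetical lengths (cm); NOT the shark data from the lecture.
lengths = [10.0, 12.0, 14.0, 16.0, 18.0]
n = len(lengths)
mean = sum(lengths) / n                                  # 14.0

# Order of operations matters: square each deviation FIRST, then sum.
sum_of_squares = sum((x - mean) ** 2 for x in lengths)   # 40.0

# Sample variance: divide the sum of squares by n - 1.
sample_variance = sum_of_squares / (n - 1)               # 40 / 4 = 10.0

# Sample standard deviation: square root of the sample variance.
sample_sd = math.sqrt(sample_variance)

# Report as mean plus-or-minus standard deviation.
print(f"{mean:.2f} ± {sample_sd:.2f}")
```

Reversing the order of operations, summing the deviations first and then squaring, would give approximately zero, which is exactly the mistake the lecture warns against.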