Hello and welcome, all of you, to today's lecture. I hope you have had the time to go through our discussion in the previous lecture, where we focused on three metrics for quantifying data: mean, median and mode. Today, we will begin by briefly recapping what we discussed and then go on to another important aspect of quantifying data, which is to quantify the variation in data. So let us begin with the arithmetic mean. We had discussed that you can compute the arithmetic mean simply by writing $\bar{x} = \sum x_i / n$, where $n$ is the number of observations in the sample.
The summation runs from $i = 1$ to $i = n$, so it is simply $(x_1 + \dots + x_n)/n$. As per the usual notation, $\bar{x}$ is for a sample; if you are doing this for a population, then you replace $\bar{x}$ by $\mu$, and it is $\sum x_i / N$ with capital $N$. We had then discussed the different kinds of transformations that can be applied to the data and seen how the arithmetic mean changes.
For example, if you have $y = a x$, then we come to the conclusion that $\bar{y} = a \bar{x}$. To prove it: if $y_i = a x_i$, then $\bar{y}$, defined as $\sum y_i / n$, is equal to $\sum a x_i / n$. Since $a$ is a constant, we can take it out of the summation and write $\bar{y} = a \sum x_i / n = a \bar{x}$. In other words, when a prefactor $a$ operates on $x$ and that is how you calculate $y$, you simply multiply $\bar{x}$ by $a$ to obtain the value of $\bar{y}$.
In the case of $y = c + x$, $\bar{y}$ is nothing but $c + \bar{x}$, since the average of a constant is the constant itself. The third case is the general case, where $y = c + a x$; this gives the formula $\bar{y} = c + a \bar{x}$. These transformations are particularly helpful when we are doing calculations by hand.
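The three transformation rules for the mean can be checked numerically. The following is a minimal sketch; the sample `x` and the constants `a` and `c` are hypothetical values chosen only for illustration, not data from the lecture.

```python
from statistics import mean

# Hypothetical sample and constants, chosen only for illustration.
x = [2, 4, 6, 8]
a, c = 3.0, 10.0

x_bar = mean(x)

# Apply each transformation to every observation, then average.
mean_scaled = mean([a * xi for xi in x])        # y = a x
mean_shifted = mean([c + xi for xi in x])       # y = c + x
mean_general = mean([c + a * xi for xi in x])   # y = c + a x

# Each pair printed below should agree.
print(mean_scaled, a * x_bar)
print(mean_shifted, c + x_bar)
print(mean_general, c + a * x_bar)
```

In each case the mean of the transformed values matches the transformed mean, so only $\bar{x}$ needs to be computed once.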
So, this is the arithmetic mean, and we found that one of its main caveats shows up when you have a wide variation in your values. Let us take one particular example. Let us say my values are 1, 1, 1, 2, 2 and 20.
I have 6 values, so $\bar{x} = (1 + 1 + 1 + 2 + 2 + 20)/6 = 27/6 = 4.5$. We can clearly see that my numbers are 1, 1, 1, 2, 2 and then I have this outlier, 20, which completely shifts my average to a value of 4.5, which really does not have any relevance to how my data looks. In other words, this is one of the main deficiencies of the arithmetic mean: it is sensitive to outliers. One of the alternatives to the arithmetic mean is the geometric mean, which is nothing but the $n$th root of the product of the $x_i$.
So, the geometric mean is $\left(\prod x_i\right)^{1/n}$, where $\prod x_i$ means $x_1 \cdot x_2 \cdot x_3 \cdots x_n$. Now consider these particular values that I have chosen: 15, 10, 5, 8, 17, 100. You can clearly see that most of the values lie within 17, except for this one number, which is 100. If I calculate the arithmetic mean for this sample, it turns out to be 25.8 and, as with the previously discussed case, 25.8 is much bigger than 17. Alternatively, if we calculate the geometric mean, it gives me the value 14.7, which is much closer to the bulk of this sample. So this is an example which shows that the geometric mean is much less sensitive to outliers than the arithmetic mean. In this particular case that we worked out, you find that the geometric mean is less than the arithmetic mean. Is that true for any data set? It can be shown as follows. Let us take two non-negative numbers, $a$ and $b$.
My arithmetic mean is defined by $(a + b)/2$ and my geometric mean is $\sqrt{ab}$, taking $a$ and $b$ to be non-negative. Can I say anything as to how the geometric mean and the arithmetic mean relate to each other? I can write $(a + b)^2 = a^2 + b^2 + 2ab$.
So if I divide by 2 before squaring, $\left(\frac{a+b}{2}\right)^2 = \frac{a^2 + b^2 + 2ab}{4}$. Subtracting $ab$, which is the square of the geometric mean, gives $\left(\frac{a+b}{2}\right)^2 - ab = \frac{a^2 + b^2 - 2ab}{4} = \left(\frac{a-b}{2}\right)^2$. Since this quantity is a square, it is never negative, so the square of the arithmetic mean is greater than or equal to the square of the geometric mean. And since both means are non-negative, in the general case the arithmetic mean is greater than or equal to the geometric mean. For the case $a = b$, the arithmetic mean is equal to $a$, which is also equal to the geometric mean. So I have this relation, which says that the arithmetic mean is always greater than or equal to the geometric mean.
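The comparison between the two means can be reproduced with a short sketch. The sample is the one quoted in the lecture (15, 10, 5, 8, 17, 100); the helper names `arithmetic_mean` and `geometric_mean` and the pair `a, b = 4, 9` are illustrative choices, and the geometric mean as written assumes strictly positive values.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def geometric_mean(values):
    """n-th root of the product; assumes all values are strictly positive."""
    return math.prod(values) ** (1.0 / len(values))

sample = [15, 10, 5, 8, 17, 100]
am = arithmetic_mean(sample)
gm = geometric_mean(sample)
print(round(am, 1), round(gm, 1))  # 25.8 14.7

# Two-number check of the identity ((a + b) / 2)^2 - ab = ((a - b) / 2)^2.
a, b = 4, 9  # arbitrary illustrative pair
lhs = ((a + b) / 2) ** 2 - a * b
rhs = ((a - b) / 2) ** 2
print(lhs, rhs)  # 6.25 6.25
```

The single large value 100 pulls the arithmetic mean well above 17, while the geometric mean stays close to the bulk of the sample.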
So, the geometric mean never exceeds the arithmetic mean, and, as we saw, it is much less sensitive to extreme values. Now we have the next concept, the median. The median of a set of $n$ measurements is the value that falls in the middle position when the measurements are ordered from smallest to largest. In other words, the median position is $0.5 \times (n + 1)$. If you have 5 numbers, let us say 1, 2, 3, 4, 5, which is a sorted data set, then the median position is 3.
So you can see that 3 is my median. But if you have 6 numbers, let us say 1, 2, 3, 4, 5, 6, then the median position is 3.5, which falls in between 3 and 4. So in the first case the median is equal to 3, and in the second case my median is equal to $(3 + 4)/2 = 3.5$.
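The position rule $0.5 \times (n + 1)$ can be sketched directly in code. The function name `median_by_position` is an illustrative choice, and the sketch assumes a non-empty list of numbers.

```python
def median_by_position(values):
    """Median via the 1-based position 0.5 * (n + 1) in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = 0.5 * (n + 1)
    lower = int(position)  # whole part of the position
    if position == lower:
        # Odd n: the position is a whole number, so one value sits in the middle.
        return ordered[lower - 1]
    # Even n: the position falls between two values, so average them.
    return (ordered[lower - 1] + ordered[lower]) / 2

print(median_by_position([1, 2, 3, 4, 5]))     # 3
print(median_by_position([1, 2, 3, 4, 5, 6]))  # 3.5
```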
So, you have to find the median position as $\frac{1}{2}(n + 1)$ and then see whether you have to average two numbers, if the size of your data set is even, or whether you have a unique value, if it is odd. The third metric we had discussed was the mode. The mode is the most frequently occurring value. As an example, this is the data for the number of visits to a dental clinic in a typical week.
So, how do you calculate the mode? You first find out the frequency distribution. We can see that the values which occur are 1, 2, 3, 4, 5, 6, 7, 8 and 9.
So this is $x$ and this is my frequency $f$. I can see that the value 1 occurs 2 times. Without going through the entire list: the number 5, for example, occurs 7 times; 6 is of course much less frequent, with 4 occurrences; 4 occurs 5 times; and, if I am not mistaken, 7 is the maximum frequency.
So the number 5 occurs the greatest number of times, 7; in other words, 5 is your mode. This brings us to the question of which of these three values, mean, median or mode, you should really consider. Generally, the mode is used when you describe large data sets.
Mean and median can both be used for small as well as large data sets. And, as we discussed, the arithmetic mean is of course sensitive to outliers, whereas the median is less sensitive to outliers. Let us do a few examples.
In this particular example, let us say you have the numbers 1, 2, 3, 2, 4, 2, 8, 3, 6, 3, 2, 5, 45, 36, 89. If I arrange them in order, there is one 1, there are four 2s, three 3s, just one 4, then you have 5, then 6, then 8, and then 36, 45, 89. My total $n$ is 15, so my median position is $(15 + 1)/2 = 8$. Counting up to the 8th value in the sorted list, my median happens to be 3. When I take the average, because of the presence of the three large numbers 36, 45 and 89, my mean is of course going to be much greater than the median, and the mode is the most frequently occurring value, which is 2. So the mode is equal to 2, the median is equal to 3, and the mean is greater than the median, which is greater than the mode. In this example that we worked out, we came to the conclusion that the mode was less than the median, which was less than the mean.
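The ordering mode < median < mean for this data set can be verified with Python's standard `statistics` module. This is a small sketch using the fifteen values as read out in the lecture.

```python
from statistics import mean, median, mode

# The fifteen values from the worked example.
data = [1, 2, 3, 2, 4, 2, 8, 3, 6, 3, 2, 5, 45, 36, 89]

print(sorted(data))
print(len(data))            # 15, so the median position is (15 + 1) / 2 = 8
print(mode(data))           # 2
print(median(data))         # 3
print(round(mean(data), 2)) # 14.07
```

The three large values 36, 45 and 89 pull the mean to about 14, far above both the median and the mode.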
In this next example, if you look at the data set and arrange the values in order, you have 2, another 2, then five 3s, no 4, one 5, one 6, and then 29, 36, 36, 39, 40, 41. What you clearly see is that I can partition this data set into two groups.
Compared to the previous case, where there were only three numbers which were huge, here you have 6 numbers which are reasonably large. This gives us the idea that you really have two subpopulations in this whole set. In this case neither the mean nor the median nor the mode of the whole set would make sense. In fact, if I were to group them separately, for the first group I can work out my median.
The median is 3, the mode is 3, and the mean will be only slightly greater than 3. For the second set you have the numbers 29, 36, 36, 39, 40, 41. For these numbers my median is going to be $(36 + 39)/2 = 37.5$, the mode is equal to 36, and the mean will be somewhere in between. So, unlike the previous case, where it seemed that there were only three outliers, in this particular case there are clearly two different sets.
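As a quick check of the figures for the second subpopulation, here is a minimal sketch using only the six large values quoted in the lecture. The variable name `upper_group` is an illustrative choice; the first group is left out because its full list of values is not given in the transcript.

```python
from statistics import mean, median, multimode

# The six large values that form the second subpopulation.
upper_group = [29, 36, 36, 39, 40, 41]

print(median(upper_group))          # 37.5
print(multimode(upper_group))       # [36]
print(round(mean(upper_group), 1))  # 36.8
```

Within this group the three measures sit close together, which is what makes them meaningful once the two subpopulations are treated separately.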
This raises the question of what would be the best way of quantifying this kind of data. Again, let us take this particular example, where you have symmetric versus asymmetric distributions. On 3 different days, in terms of height profiles, you can have 3 different distributions. What you see is that on day 3 the data is very symmetric, on day 1 it is skewed to the left, and on day 2 it is skewed to the right.
So it tells us that, in addition to quantifying the mean, median and mode, there must be other ways of capturing the variation in the data. One of the measures which is very frequently used is the range, which is nothing but maximum minus minimum. In this particular case my minimum is 40 and my maximum is 100, so the range is equal to $100 - 40 = 60$. But, as you can see, the range by itself would not have any meaning unless the values are also put in context.
For example, if I have the numbers 1 and 2 and then I add these other numbers, 40, 60, 75, 90 and 100, then my range is 99, versus 60 in the other case. The range has to be thought about with respect to the minimum or the maximum. Similarly, you can have data going from 1000 all the way to 5000, or from 1 to 5000. So your range has to be thought about in the context of your maximum or minimum. And if you again have outliers, then the range is too broad; it does not give a clear picture of where the bulk of the data is situated.
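The effect of a couple of extreme values on the range can be sketched as follows. It is an assumption here that the first data set consists of exactly the five values 40, 60, 75, 90 and 100; the lecture only states its minimum and maximum. The function name `value_range` is illustrative.

```python
def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

# Assumed to be the original data set (only min = 40 and max = 100 are stated).
original = [40, 60, 75, 90, 100]
# The same data with the two small values 1 and 2 added.
with_small_values = [1, 2] + original

print(value_range(original))           # 60
print(value_range(with_small_values))  # 99
```

Two added values are enough to move the range from 60 to 99, even though the bulk of the data has not changed.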
Another way of measuring variability is the mean absolute deviation. The mean absolute deviation is $\sum |x_i - \bar{x}| / n$. We can work out this particular example.
In this particular case, let us say I have the data set 1, 2, 5, 8, 12, 8, 1, 7, 5, 42. I first have to calculate my $\bar{x}$. Adding the values up one by one, the running total goes 3, 8, 16, 28, 36, 37, 44, 49, 91, so $\bar{x} = 91/10$, approximately 9, let us say.
For the mean absolute deviation, the first term is $|1 - 9| = 8$, then $|2 - 9| = 7$, then $|5 - 9|$, then $|8 - 9|$, and so on and so forth. So I can calculate the mean absolute deviation as $(8 + 7 + 4 + 1 + 3 + 1 + 8 + 2 + 4 + 33)/10$. The running total goes 15, 19, 20, 23, 24, 32, 34, 38, 71, so it comes to $71/10$, which is roughly 7.
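The hand calculation can be repeated in a few lines. This sketch computes the mean absolute deviation twice: once with the exact mean 9.1, and once with the rounded mean 9 used in the lecture. The variable names are illustrative.

```python
data = [1, 2, 5, 8, 12, 8, 1, 7, 5, 42]
n = len(data)

x_bar = sum(data) / n  # 91 / 10

# Mean absolute deviation about the exact mean.
mad_exact = sum(abs(x - x_bar) for x in data) / n

# Same calculation with the mean rounded to 9, as done by hand in the lecture.
mad_rounded = sum(abs(x - 9) for x in data) / n

print(x_bar)                # 9.1
print(round(mad_exact, 2))  # 7.16
print(mad_rounded)          # 7.1
```

Rounding the mean to 9 changes the result only slightly, so "roughly 7" is a fair summary either way.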
So, as you can clearly see, $\bar{x}$ is 9 and the mean absolute deviation is 7. The reason is that your values spanned a wide range, from 1 all the way to 42. When your $\bar{x}$ and the mean absolute deviation are comparable, that means you have wide heterogeneity in your data. The most widely used measure of variation, however, is the standard deviation. So let us come to the formula for the standard deviation.
Instead of taking the absolute deviations, you square the differences. So whether your $x$ value is less than the population mean or greater than the population mean, the square is always positive. You add them up, then you divide by the total number of observations, and you take the square root because you had squared the differences while adding: $\sigma = \sqrt{\sum (x_i - \mu)^2 / N}$. This is your definition of the standard deviation for a population. For the standard deviation of a sample it is pretty much the same, except for one notable difference: you use $\bar{x}$ in place of $\mu$ and, instead of dividing by capital $N$, you divide by $n - 1$. So there is a small difference in how you define the standard deviation for a population and for a sample.
This division by $n - 1$ for a sample simply takes into account that, when your sample size is small, dividing by $n - 1$ gives a better estimate of the standard deviation of the whole population. As for the variance, $\sigma^2$ is the variance of the population and $s^2$ is the variance of the sample.
So $\sigma^2$ is nothing but $\sum (x_i - \mu)^2 / N$, and $s^2$ is nothing but $\sum (x_i - \bar{x})^2 / (n - 1)$. These are called variances. What you can clearly see is that the variance is never negative and it is the square of the standard deviation.
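The two definitions differ only in the divisor, which a short sketch makes concrete. The data set is reused from the mean absolute deviation example purely for illustration; the function names are hypothetical, and the results are cross-checked against `pstdev` and `stdev` from Python's standard `statistics` module.

```python
import math
import statistics

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [1, 2, 5, 8, 12, 8, 1, 7, 5, 42]

sigma = math.sqrt(population_variance(data))  # population standard deviation
s = math.sqrt(sample_variance(data))          # sample standard deviation

# The standard library should agree with the hand-written versions.
print(sigma, statistics.pstdev(data))
print(s, statistics.stdev(data))
```

Because the divisor is smaller, the sample standard deviation always comes out a little larger than the population formula applied to the same values.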
So, how do we interpret this variation? What you can clearly see here are two distributions. One of them has a much more prominent peak in the middle, and the other values are less prevalent, whereas the second distribution is much broader. In other words, if we calculate the standard deviation, it will turn out that the standard deviation for the first, sharply peaked population is smaller than the standard deviation of the second, broader population. This is what the variability conveys. Now there is just one small mathematical trick.
When I talk of $\sum (x_i - \bar{x})^2$, I can expand it. Each term I can write as $x_i^2 - 2 x_i \bar{x} + \bar{x}^2$. I can then split the sum and write $\sum x_i^2 - \sum 2 x_i \bar{x} + \sum \bar{x}^2$.
Each of these summations runs from $i = 1$ to $n$. The first term remains as $\sum x_i^2$, but in the second term, since $\bar{x}$ is the mean and therefore a constant, I can take it out.
So I can take out $2\bar{x}$ and write $2\bar{x} \sum x_i$, plus $\sum \bar{x}^2$. Now $\sum x_i$ is nothing but $n \bar{x}$, and $\bar{x}^2$ summed up $n$ times is $n \bar{x}^2$. The expression then becomes $\sum x_i^2 - 2\bar{x} \cdot n\bar{x} + n\bar{x}^2$.
So the final expression is $\sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2$. This is a useful formula when we are doing the calculation by hand. This is what I have written here: the sum in your measures of variability can be simplified to this form. So, as opposed to taking the difference of each $x_i$ from the mean, you can just add up the $x_i^2$, calculate the sample mean or population mean, and subtract $n \bar{x}^2$ to obtain this particular value.
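The shortcut $\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$ can be confirmed numerically. This sketch reuses the ten-value data set from the mean absolute deviation example purely as an illustration; the variable names are hypothetical.

```python
data = [1, 2, 5, 8, 12, 8, 1, 7, 5, 42]
n = len(data)
x_bar = sum(data) / n

# Direct form: subtract the mean from every value, square, and add up.
direct = sum((x - x_bar) ** 2 for x in data)

# Shortcut form: add up the squares, then subtract n times the squared mean.
shortcut = sum(x * x for x in data) - n * x_bar ** 2

# The two agree up to floating-point rounding.
print(direct, shortcut)
```

The shortcut needs only two running totals, $\sum x_i$ and $\sum x_i^2$, which is why it is convenient for hand calculation.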
Now let us do some transformations with the standard deviation. We again come to the same three cases. Let us say $y = a x$. The question is, how are $s_y$ and $s_x$ related?
What is the relationship between $s_y$ and $s_x$? The way to do it is to start from $s_y^2$, or rather from $(n - 1) s_y^2$.
$(n - 1) s_y^2$ is nothing but $\sum (y_i - \bar{y})^2$. Putting in $y_i = a x_i$ and $\bar{y} = a \bar{x}$, this is $\sum (a x_i - a \bar{x})^2$, and $a$ can be taken out as a common factor, giving $a^2 \sum (x_i - \bar{x})^2$. So I can write $(n - 1) s_y^2 = a^2 (n - 1) s_x^2$.
I can cancel the $(n - 1)$ terms, which leaves $s_y^2 = a^2 s_x^2$. So $s_y$ is nothing but $a \, s_x$ for this particular case (strictly $|a| \, s_x$, since a standard deviation cannot be negative). Next, let us say my $y$ is defined by $c + x$, where $c$ is a constant; then I can write similarly.
$(n - 1) s_y^2$ is equal to $\sum (y_i - \bar{y})^2$, but in this case $y_i$ is $c + x_i$ and $\bar{y}$ is $c + \bar{x}$. So the $c$ cancels in each difference, which leaves nothing but $\sum (x_i - \bar{x})^2$, and this is nothing but $(n - 1) s_x^2$.
This gives me $s_y^2 = s_x^2$, that is, $s_y = s_x$. So when you have a constant added to each value, it does not change the final standard deviation. In other words, the standard deviation is insensitive to any added constant.
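Both results, scaling and shifting, can be checked with a minimal sketch. The data and the constants `a` and `c` are hypothetical; `a` is deliberately negative to show that the scale factor enters as $|a|$, since a standard deviation is never negative.

```python
from statistics import stdev

# Hypothetical data and constants, chosen only for illustration.
x = [1, 2, 5, 8, 12, 8, 1, 7, 5, 42]
a, c = -2.5, 100.0

s_x = stdev(x)                             # sample standard deviation of x
s_scaled = stdev([a * xi for xi in x])     # y = a x
s_shifted = stdev([c + xi for xi in x])    # y = c + x

print(s_scaled, abs(a) * s_x)  # scaling multiplies the spread by |a|
print(s_shifted, s_x)          # shifting leaves the spread unchanged
```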
In the most general case, when $y = c + a x$, by combining the previous results we arrive at $s_y = a \, s_x$ (or $|a| \, s_x$ if $a$ is negative), because $c$ does not come into play while computing the standard deviation. Now this can be extended to find the standard deviation for grouped data. By grouped data I mean that you have values $x_i$ and a frequency $f_i$ for each of the corresponding values.
So $x_1$ has frequency $f_1$, $x_2$ has frequency $f_2$, and so on and so forth. Then I know my $\bar{x}$ is $\sum f_i x_i / N$, where capital $N$ is $\sum f_i$. All I need to do is to compute these frequencies and put them in place to get the final value.
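A sketch of the grouped-data calculation follows. The frequency table is hypothetical (it is not the dental clinic data, which is not reproduced in the transcript), and the grouped sum of squares $\sum f_i (x_i - \bar{x})^2$ is the standard frequency-weighted extension of the earlier formulas rather than something stated explicitly in the lecture.

```python
import math

# Hypothetical frequency table: values[i] occurs freqs[i] times.
values = [1, 2, 3, 4, 5]
freqs = [2, 3, 5, 3, 2]

N = sum(freqs)  # total number of observations

# Grouped mean: sum of f_i * x_i divided by N.
grouped_mean = sum(f * x for x, f in zip(values, freqs)) / N

# Frequency-weighted sum of squared deviations from the mean.
sum_sq = sum(f * (x - grouped_mean) ** 2 for x, f in zip(values, freqs))

pop_sd = math.sqrt(sum_sq / N)           # population form, divide by N
sample_sd = math.sqrt(sum_sq / (N - 1))  # sample form, divide by N - 1

# Cross-check the mean by expanding the table back into raw observations.
raw = [x for x, f in zip(values, freqs) for _ in range(f)]
raw_mean = sum(raw) / len(raw)

print(N, grouped_mean, raw_mean)
print(pop_sd, sample_sd)
```

Working from the table avoids listing every observation, which matters when the same value repeats many times.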
So, that is all about the basics of the standard deviation. To summarize: we saw how the mean, median and mode compare and what kind of values they give. The arithmetic mean is of course sensitive to outliers, and the median is much less sensitive to outliers. When you have a bimodal distribution, none of these measures is meaningful on its own; it is better to split the data into two different distributions and then separately calculate the mean, median or mode for each. From there we went on to discuss the standard deviation, and I hope you are convinced that the standard deviation is a very important metric for quantifying how your values are dispersed.
The mean by itself does not convey the picture of how dispersed your data is. Outliers will indeed have an effect on the standard deviation. One important point to note is that, when you calculate the standard deviation for the population, you divide by capital $N$, whereas for the sample you divide by $n - 1$. This is because, when your sample size is small, dividing by $n - 1$ gives a much better estimate of the population's standard deviation.
And we ended by doing a few transformations, just as we did when calculating the mean. I hope you have seen that an added constant does not have any impact on the standard deviation. But when you have a prefactor $a$ in front of $x$, then your $s_y$ is simply $s_x$ multiplied by $a$ (by $|a|$ if $a$ is negative). With that, I would like to thank you for your attention, and we will meet again in the next lecture. Thank you.