Transcript for:
L7

Hello and welcome to today's lecture. I will begin with a brief recap of what we discussed in the last class, mainly ways of quantifying dispersion in a population or a sample. One of the most widely used metrics for characterizing this variation in the data is the standard deviation. You use $\sigma$ for the standard deviation of a population, given by $\sigma = \sqrt{\sum (x - \mu)^2 / N}$, or $s$ for a sample, given by $s = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$. Once again, note this minus one: for a sample, dividing by $n - 1$ gives a better estimate of the standard deviation of the population.

So what exactly is the practical significance of the standard deviation? This we discussed in the last class, and it is what Chebyshev's theorem tells us. It says that, given a number $k$ greater than one and a set of $n$ measurements, you are guaranteed that at least a proportion $1 - 1/k^2$ of the measurements will lie within $k$ standard deviations of their mean. If I substitute $k = 2$, that becomes $1 - 1/4$, so at least 75% of the measurements lie within two standard deviations of the mean, that is, between $\bar{x} - 2s$ and $\bar{x} + 2s$ (or $\mu \pm 2\sigma$ for a population).

As an example, take $n = 26$, mean 75 and variance 100. With $\bar{x} = 75$ and variance 100, the standard deviation is $s = \sqrt{100} = 10$. So within $75 \pm 20$, which is 55 to 95, you have at least three-fourths of the measurements. Similarly, I can do the same for three standard deviations: $75 \pm 30$, which is 45 to 105, will contain at least $1 - 1/9 = 8/9$ of the measurements, roughly 89%. But, as I stated in the last class, for a generic
distribution Chebyshev's theorem is actually a very conservative estimate. So this is your $\bar{x}$, $\bar{x} - 2s$ and $\bar{x} + 2s$, and what we calculated is that at least 75% of the data lie in this range; in the generic case, between $\bar{x} - 3s$ and $\bar{x} + 3s$ you have at least about 89% of the population. But Chebyshev's theorem is very conservative in its approach, because it makes no assumptions about how the data are distributed.

In contrast, for a mound-shaped distribution, that is, a Gaussian or normal distribution, you will see about 68% of the data within plus or minus one standard deviation, a range for which Chebyshev's theorem gives no useful guarantee, since $1 - 1/k^2$ is zero at $k = 1$. Within plus or minus two standard deviations you have about 95% of the data, as opposed to the at least 75% guaranteed by Chebyshev's theorem, and within three standard deviations about 99.7% of the data.

This also brings us to the concept of the z-score, which is a measure of relative standing. The z-score is defined as $z = (x - \text{mean}) / \text{standard deviation}$. For a particular experiment, if your mean is 25, your standard deviation is 4 and $x$ is 30, then the z-score is $(30 - 25)/4 = 1.25$.

You can use the z-score to estimate whether a particular data point is an outlier or not. This can be clearly gleaned from the example we worked out in the last class. If you look at the data points, all of them are clustered between 1 and 4 except for one particular value, which is 15. So it seems to us that 15 is an outlier, or very close to being one. We had worked out its z-score; I do not remember the value, but you can do the calculation and check, as per this statement, whether the z-score comes out greater than three or not. Another way of characterizing relative standing is using the
concept of a percentile. The $p$-th percentile is the value which is greater than $p$% of the measurements. So a person who is in the 99th percentile has performed better than 99% of the class.

You can use these positions to determine how to calculate the first quartile or the third quartile. The first quartile, Q1, is the value below which 25% of the data points lie, and the third quartile, Q3, is the value below which 75% of the data points lie. The second quartile, at position $0.5 \times (n + 1)$, is nothing but the median. The interquartile range is defined as Q3 minus Q1.

Using these values one can draw what is called a box plot. In a box plot the lowest value is your minimum; the box outlines Q1, the first quartile, Q2, the median, and Q3, the third quartile; and this is your maximum. What you also see are points which may lie outside the box. If you take this point, it coincides with the maximum value of the distribution, but this point, or for that matter this point, is really far outside the box limits. These points are examples of outliers, and it is perhaps not completely surprising that many experimental data sets do have outliers.

This square inside the box denotes the mean. In this particular population, if you look at the y-axis, you have values which vary all the way from around 20, 30 or 50 up to 600. When you take an average, that 600 is going to have a much greater effect than a value of 50, which is why in this case the mean is shifted slightly above the position of the median. If you take this other example, it is the other way around, where the median is here and the mean is here. So, based on this, if you were to plot the data as histograms, as opposed to having a
distribution like this, where the positions of the mean, median and mode all coincide, the distribution might be shifted either to the left or to the right.

The way to detect outliers is using this formula. You construct fences, where the lower fence is given by $Q_1 - 1.5 \times \text{IQR}$ and the upper fence by $Q_3 + 1.5 \times \text{IQR}$. Let us work out a sample case of how we will actually draw a box plot. Let me write down the points: 340, 300, 520, 340, 320, 290, 260 and 330.

The first step, of course, is to sort in ascending order. My lowest value here is 260, then 290, then 300, 320, 330, two 340s, and 520. We can already see that, as opposed to all the other points, which are clustered together, this data point, 520, seems to lie apart. But let us do the necessary calculations. Once again, you have 260, 290, 300, 320, 330, 340, 340, 520, so the total number of measurements is $n = 8$.

The median position is $0.5 \times (n + 1) = 4.5$, so the median lies between the fourth and fifth values, and its value is the average of 320 and 330, which is 325. The position of Q1 is $\tfrac{1}{4}(n + 1) = 9/4 = 2.25$, so it comes after the second value. The position of Q3 is $\tfrac{3}{4} \times 9 = 27/4 = 6.75$, so Q3 lies somewhere between the sixth and seventh values.

So my Q1 value will be $290 + 0.25 \times (300 - 290) = 290 + 2.5 = 292.5$. For Q3, at position 6.75, counting one, two, three, four, five, six, the sixth value is 340 and the seventh is also 340, so Q3 sits at this particular position.
Its value is $340 + 0.75 \times (340 - 340)$, which is nothing but 340. So we have calculated the values of Q1 and Q3. For this distribution I have Q1 = 292.5 and Q3 = 340, which means that the IQR, Q3 minus Q1, is 47.5.

Now the lower fence is $Q_1 - 1.5 \times \text{IQR} = 292.5 - 1.5 \times 47.5 = 292.5 - 71.25 = 221.25$. If you look at our points once more, the lowest value is 260, which is above the lower fence, so this implies that there are no lower outliers.

I can similarly calculate the upper fence, $Q_3 + 1.5 \times \text{IQR} = 340 + 1.5 \times 47.5$. If I approximate 47.5 as 50, this is approximately $340 + 75 = 415$; the exact value is 411.25. So this implies that there is an upper outlier, and it is the value 520.

If I were to construct the plot, it would look something like this. The minimum is 260, this is your lower fence, and your upper fence is somewhere here, and this value, 520, lies far above it. So you have an error bar, the whisker, which sticks out of the box, but this point is well outside it. There are actually no data points in this region, because after 340 you go directly to 520. With that I have shown you how to generate a box plot.

Now we come to another interesting concept, that of moments. Pearson was the first statistician to make use of moments to
describe data. Now, how is a moment defined? The $r$-th moment about zero is defined as $m_r^* = \sum y^r / n$, where $r$ can take the value 1, 2, 3 and so on. In general, the $r$-th moment about any value $a$ is defined as $\sum (y - a)^r / n$.

Now let us see what the moments convey. The moment about zero is $m_r^* = \sum y^r / n$. It is obvious that if I put $r = 1$, then $m_1^* = \sum y / n$, which is nothing but $\bar{y}$. So the first moment about zero is your mean. What about $r = 2$? Then $m_2^* = \sum y^2 / n$. Now, I know that the way the standard deviation or variance is defined, you have a term $\sum (y - \bar{y})^2 / n$. In other words, what we want is the $r$-th sample moment about the mean, which is $m_r = \sum (y - \bar{y})^r / n$. If I put $\bar{y} = 0$ in this expression, I get back $\sum y^r / n$, the moment about zero.

So, with $m_r = \sum (y - \bar{y})^r / n$, $m_1$ will be $\sum (y - \bar{y}) / n$, which is $(\sum y - \sum \bar{y}) / n$. Now $\sum y$ is equal to $n \bar{y}$, and $\sum \bar{y}$ is also $n \bar{y}$, so this gives me a value of zero. The first moment about the mean is zero.

This brings us to the second case: what is the second moment about the mean? It is defined by $m_2 = \sum (y - \bar{y})^2 / n$. As you can see, if this were for a population, $m_2$ is nothing but the variance. I am going to make that approximation, because for a sample you would have to divide by $n - 1$, but this is very simply
equal to the variance. So $m_2$ is, I can say, the population variance.

Similarly, I can calculate $m_3 = \sum (y - \bar{y})^3 / n$. Now let us consider a symmetric distribution. If I look at how $m_3$ is defined, symmetry means that for every value to the left of the mean there is a value at the same distance to the right of it with the same frequency. Let us say this is $\bar{y}$, this is $y_1$ and this is $y_2$. The frequency of $y_1$ and the frequency of $y_2$ are equal, and that is why it is called a symmetric distribution. Let us say the distance is the same, so $y_1 - \bar{y} = -\Delta y$ and $y_2 - \bar{y} = +\Delta y$.

If I do this summation, it means that for every $y_1$ to the left of $\bar{y}$, whose contribution is negative, there is another point, equidistant on the positive side and with the same frequency, that gives a positive contribution. If you cube a negative number the cube is negative, and if you cube a positive number the cube is positive. So if you add these two terms you get $f \times (-\Delta y)^3 + f \times (\Delta y)^3$, and together these equal zero.

This means that $m_3$ will return a value of zero for symmetric distributions. The same holds for any $m_r$ about the mean when $r$ is odd. In other words, $m_1$ is zero, $m_3$ is zero, $m_5$ is zero, and so on. So, clearly, for all symmetric distributions the odd moments about the mean return a value of zero.

Now let us say the variable $y$ that we are
measuring is actually some quantity. It is not just a number; it is a quantity, let us say temperature or height. Then each of these moments has different units. If $y$ represents height in metres, then the unit of $m_1$ is the metre, the unit of $m_3$ is metre cubed and the unit of $m_5$ is metre to the fifth power. In other words, these units are not the same. Can there be a way of compressing this information and coming up with a non-dimensional parameter? That is what the measure of skewness gives us.

Skewness is defined in a slightly different way: $a_3 = \dfrac{\sum (y - \bar{y})^3 / n}{\left[\sum (y - \bar{y})^2 / n\right]^{3/2}}$, which I can rewrite as $a_3 = m_3 / m_2^{3/2}$. As you can clearly see, $m_3$ has units of metre cubed, and $m_2$ has units of metre squared, which raised to the power $3/2$ gives metre cubed, so the ratio is a dimensionless number. This parameter $a_3$ is called the skewness.

As is obvious from the word skew itself, for any symmetric distribution my skewness $a_3$ has to be zero. It is skewed neither in this direction nor in this direction. So this is what skewness is about.

Now, can $a_3$ be negative? Let us look at our definition of $a_3$ and take a particular distribution which is lopsided, with its peak towards the left. This is going to be my mode, this will be where my mean lies, and this will be where my median lies. When I do this computation for $a_3$, there are many points which are less than my mean, which is my $\bar{y}$ value, and for all these values the component $(y - \bar{y})^3$ will return a negative value, and only for a few of the others is this quantity going to return a positive value, so
when I actually do this calculation I might expect to get a value of $a_3$ which is negative for this kind of distribution. But remember that the deviations are cubed, so a few points far from the mean can outweigh many points close to it. We will do one sample calculation to see whether what we think holds up.

In the other case, if the distribution is lopsided in the other direction, like this, this is your mode and here is where your mean will lie. For all these points which are to the right of the mean, $y - \bar{y}$ is going to be positive, and by the same token I would expect a value of $a_3$ which is positive.

So let us take a sample example where we calculate the skewness of a distribution. Let me write down some numbers which portray the first picture: 1, 1, 1, 2, 3, 4. You have three 1s, one 2, one 3 and one 4, so $\bar{y} = (3 + 2 + 3 + 4)/6 = 12/6 = 2$. Now I can calculate $y - \bar{y}$. For the values 1, 1, 1, 2, 3, 4 it is $-1, -1, -1, 0, 1, 2$. In this case $(y - \bar{y})^3$ gives me $-1, -1, -1, 0, 1, 8$, since 2 cubed is 8.

So in this particular case, even though most of the points lie to the left of the mean, $\sum (y - \bar{y})^3 = -3 + 1 + 8 = 6$. $a_3$ is still giving me a value which is positive, because the one point far to the right of the mean, once cubed, outweighs the three points just to the left of it. You can see that the sign is decided by the points far from the mean: if you mirror these numbers, so that the extreme value lies far to the left of the mean and the bulk of the data just to the right of it, the sum becomes negative.

So with that I conclude today's class. What we
have done is come up with this metric of skewness. Starting from the standard deviation, we went on to how to measure relative standing using the z-score, and from there we went on to see how you can calculate moments and use them to come up with metrics that characterize the shape of a distribution. For any symmetric distribution, skewness will return a value of zero. If the distribution is lopsided, the skewness can be negative or positive: it is negative when the long tail of the data lies to the left of your mean and positive when the long tail lies to the right. With that I conclude today's lecture. Thank you for your attention.
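
As a recap of the worked calculations in this lecture, here is a minimal sketch in Python; it is not part of the lecture itself. The function names (`quartile`, `fences`, `central_moment`, `skewness`) are hypothetical helpers chosen for illustration. The quartile rule assumes the $(n + 1)$ position convention with linear interpolation used in the box plot example, which may differ from the defaults of statistics libraries.

```python
import math


def quartile(values, fraction):
    """Value at position fraction * (n + 1), counted from 1, as in the lecture.

    Assumption: positions between two data points are linearly interpolated.
    """
    data = sorted(values)
    n = len(data)
    pos = min(max(fraction * (n + 1), 1), n)  # keep the position inside 1..n
    lower = int(math.floor(pos))
    upper = min(lower + 1, n)
    frac = pos - lower
    return data[lower - 1] + frac * (data[upper - 1] - data[lower - 1])


def fences(values):
    """Lower and upper fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1 = quartile(values, 0.25)
    q3 = quartile(values, 0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def central_moment(values, r):
    """r-th moment about the mean: sum((y - ybar)**r) / n."""
    n = len(values)
    ybar = sum(values) / n
    return sum((y - ybar) ** r for y in values) / n


def skewness(values):
    """Dimensionless skewness a3 = m3 / m2**(3/2)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


# Box plot example from the lecture
points = [340, 300, 520, 340, 320, 290, 260, 330]
low, high = fences(points)
print(quartile(points, 0.25), quartile(points, 0.5), quartile(points, 0.75))
# 292.5 325.0 340.0
print(low, high)  # 221.25 411.25
print([p for p in points if p < low or p > high])  # [520]

# Skewness example from the lecture
sample = [1, 1, 1, 2, 3, 4]
print(central_moment(sample, 1))  # 0.0, first moment about the mean
print(central_moment(sample, 3))  # 1.0, that is 6 / 6
print(round(skewness(sample), 3))  # about 0.65, a positive value
```

Mirroring the skewness sample to `[4, 4, 4, 3, 2, 1]` flips the sign of the result, which is one way to check that the sign follows the side on which the far-out values lie.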