Transcript for:
Measures of Central Tendency and Dispersion

Okay, so welcome back. This is this week's last lecture. Tomorrow is a holiday, so no lecture tomorrow, which means whatever is finished today will be the last sections for this week to be tested on next Monday. I'm almost done with yesterday's lecture, so that should be up later on tonight, and I'm planning to upload today's lecture later on tonight as well. Let us pick up where we left off.

Okay, so finding medians. The median, as a reminder, is where you have your collection of data and the median is what splits the data into two equal parts. So, find the median of this distribution. Now notice this is grouped data: we have a value that's assigned a frequency. All this means is that within the data there are four ones, three twos, a pair of threes, six fours, and eight fives.

Solution. The position of the median comes from the sum of the frequencies (remember that this symbol means "sum of"). If we add up all these frequencies we get 23. Then add one, you're always adding one to the sum, and cut it in half: 23 plus 1, over 2, is 12, so the 12th value is the median. Now there are a couple of ways of doing this, and the fact that you have grouped data in one sense makes it quicker. Remember, with the median you line up all the values, you order them from smallest to largest, and then you pick out the exact middle position. One is the smallest value, but there are only four of them, so the ones take us up to the fourth position. The position that we're looking for is the 12th, so the 12th position is beyond this one. We take the four and add the three, so that's seven: the best we can do with the first two values is the seventh position. Go on: 4 + 3 + 2. Had we gone through the first three values, the best we can do is the ninth position. That means the 10th, 11th, and 12th lie within this value of four. So the median is the 12th item, which is a four. Now, you could go the other way. This is going
from lowest to highest. You could start with the highest and say, oh wait a minute, what if we went this way? Starting from the highest and working your way down, the best we can do with five is the 16th position, or I should say the 16th through 23rd positions, which again forces us into the value of four: 15, 14, 13, and our 12th position is inside number four. So even if the data is grouped, you can still work out where that middle position is. The best that the first one can do is first through fourth; the twos have the fifth through seventh positions; the next one has the eighth and ninth positions; so 10th, 11th, 12th. Even with grouped data you can work out where that middle position is.

Mode. The mode of a data set is the value that occurs most often. Sometimes a distribution is bimodal, literally two modes. In a large distribution this term is commonly applied even when the two modes do not have exactly the same frequency, but you do have two values that stand out over all the others. It could be something that looks like this. This would be bimodal: these two stand out over the rest of the values, and even though this one is higher than this one, this is still counted as bimodal. Now if you have something like that but with more bumps in the distribution, then you can say it's multimodal.

Example: 10 students in a math class were polled as to the number of siblings in their individual families, and the results were these numbers. Find the mode of the number of siblings. Solution: notice that three is the one that stands out the most. The next one that comes up is two, and I mean, you could count this as bimodal, but with such a small set of data one mode is sufficient. So the mode for the number of siblings is three.

Find the mode for the distribution. This is the same distribution we saw a few slides earlier, and this one is fairly simple. We just need to
pick the one with the highest frequency. The five has a frequency of eight, which is bigger than all the rest, so eight is the highest frequency. Solution: the mode is five, since it has the highest frequency, that is, eight.

Central tendency from stem-and-leaf displays. With grouped data you can work out certain things. Remember, a stem-and-leaf display has one advantage over a grouped histogram: you can actually see the actual data. Below is a stem-and-leaf display for some data; find the mode and the median. If you take a look at this, among the tens there are two; among the 20s there are three, which makes five; among the 30s there are six pieces of data; among the 40s there are six pieces of data; and among the 50s there are four pieces of data. Add them all up as you go: 2, 5, 11, 17, 21. For the median, the position is going to come from this formula: the sum of the frequencies plus one, divided by two. When you plug that in, that's 21 + 1 over 2, and that gives you 11. So now you just need to look for the 11th position. It makes sense that it would be an actual value, because there are 21 values in this distribution, so the exact middle is an actual value, which then splits the data into two equal groups of 10 data values each. Now, in the stem-and-leaf diagram everything is in order already, so you just need to count off: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, and 11. There we go. You can start backwards, that's fine: this is going to be the 21st, 20th, 19th, 18th, 17th, 16th, 15th, 14th, 13th, 12th, 11th. It doesn't matter how you do it. So this leaf of seven is the median.

Now for the mode. Because everything is in order, you can simplify this whole process by asking, hey, wait a minute, within a single row, are there any ones digits that occur most often? I'm not saying all together. If you notice, there are two eights here and there's an eight here, but this doesn't count, because these two eights in the bottom row essentially represent 58 while this eight
in the second row represents 28. On the other hand, in the 40s row you have these three twos together, and so this looks like the most frequent value. You can try other things out: you have a pair of nines here, that doesn't count; a pair of sixes, that's still too small; a pair of eights, that's still too small. It's only this group of three twos. So the mode is going to be 42, and the median is going to be 37.

Symmetry in data sets. The most useful way to analyze a data set often depends on whether the distribution is symmetric or non-symmetric. In a symmetric distribution, as we move out from a central point, the pattern of frequencies is the same, or more or less the same, to the left and to the right. It doesn't have to be a perfect symmetry; it just needs to be a roughly symmetric pattern. In a non-symmetric distribution the patterns to the left and to the right are different.

We'll look at some examples. A is a uniform distribution, so the name gives it away: all the values are uniform. Here we have a binomial distribution, as opposed to a bimodal distribution. Actually, not all binomial distributions are symmetrical. The binomial distribution was from chapter 11.4, the whole thing about n choose x, p to the x, q to the n minus x. If p is roughly 1/2, then yes, it can be symmetric. Bimodal, so two modes: here are your two modes, and they don't have to be perfect. With a uniform distribution, you can have one that looks something like this, you know, kind of jaggedy, but roughly all the values are the same. There are no humps, there are no valleys, so even if it's jaggedy it's still considered uniform.

Non-symmetric. There are different kinds
of non-symmetric. A non-symmetric distribution with a tail extending out to the left, shaped like a J, is called skewed to the left; if the tail extends out to the right, it is skewed to the right. In example A the tail is going out to the left, so this is skewed to the left, and B is skewed to the right. Now note, these days some people may say, oh wait a minute, isn't A skewed to the right and B skewed to the left? No. To skew is from French, with the understanding of emptiness. So in graph A, where is the emptiness? The emptiness is to the left. And in example B, where's the emptiness? The emptiness is to the right. So don't let popular understanding direct you as to what means what. Remember, this is statistics, so here skewness may have the opposite meaning from what you've heard. And in C, a bimodal distribution. Notice this is not a perfect distribution: you have one rounded peak and you have one tall peak, so this is still bimodal. But in each of these cases they are all non-symmetric: you don't have a peak where things conform to more or less the same shape on the left and the right.

Summary of common measures of central tendency. The mean of a set of numbers is found by adding the values in the set and dividing by the number of values. The median is a kind of middle number; if you think about it, the name gives it away: median, middle; middle, median. To find the median, first rank the values. For an odd number of values, the median is the exact middle value in the list. For an even number of values, the median is the mean of the two middle values. Example: suppose you had 1, 1, 1, 2, 5, 7, 8, 8, 8. Here n is 10, which is also the sum of the frequencies. The position of the median is going to be the sum of the frequencies plus one, all over n
which is 10 + 1 all over, I'm sorry, not n, over two: 10 + 1 over 2, which is 11 over 2, so 5.5. There is no 5.5th position, so count: 1, 2, 3, 4, 5. The median is going to be here; its position is between the fifth and the sixth values. So we take the surrounding values: 5 + 7 over 2, which equals 6, and this is the median of this collection. That's how it would look if you had an even number of values.

Mode. The mode is the value that occurs with the greatest frequency. Some sets of numbers have two most frequently occurring values and are called bimodal. Other sets have no mode at all, if no value occurs more often than any of the others, or if more than two values occur most often. So that takes care of that section. Let me bring up the next one.

Measures of dispersion. We'll find the range of a data set, calculate the standard deviation of a data set, interpret measures of dispersion, and calculate the coefficient of variation. Sometimes we want to look at a measure of the dispersion, or spread, of data. The two most common measures of dispersion are the range and the standard deviation. For example, you may have two sets of data measuring the same thing, like weights of bears, and maybe you want to know how the weights of bears from California compare to the weights of bears in, say, Alaska. So you go through the same process of weighing in both cases and collect all your data. In the case of California, you plot all your data and it looks kind of like this. You look at Alaska, and the weight data when you plot it looks something like this. You may notice, here's the mean and here's the mean, but you might also notice, oh wait a minute, the California bears' data is spread over this many values, while the Alaskan bears are spread over a broader range of values. That is just as important as, say, finding the mean or any other measure: how spread out is the data? Even
if the means were the same for both states, the fact that you have a wider spread in Alaska than in California tells you something about the bears in Alaska. If you're a statistician, that would clue you in to ask, hey, why is there a wider spread? When you're dealing with means, it is often helpful to ask: besides the mean, do you have the standard deviation for this data? That would add twice as much information to your knowledge compared with just having the mean on hand.

Range. This is the simpler of the two. The range of a set of data is given by this: find the greatest value in the set and subtract from it the smallest value in the set. That's it. The two sets below have the same mean and median. If they have the same mean and median you might suspect, oh, this is the same data. Well, no. Find the range of each set. Solution: for set A the highest value is 13 and the lowest value is 1, so the range here is 12. The range of B is 4. So A covers a wider range of values than B, even though they have the same mean and median.

Now, something that's more useful is the standard deviation, which is based on deviations from the mean of the data. Find the deviations from the mean for all the data values of this sample. Solution: number one, the mean is seven. That's just 1 + 2 + 8 + 11 + 13, all divided by 5. Right: 24, 32, 35, yep, seven. Now subtract to find the deviations: we take each data value and we subtract seven from it. Now, the sum of the deviations will always be zero, and that's a problem. To make it more helpful, we jump to the variance. The variance is found by summing the squares (the squares of what? the squares of the deviations) and dividing that sum by n minus one. The n minus one indicates that it is a sample; if it were a population, you would divide by n. The square root of the
variance gives a kind of average of the deviations from the mean, which is called the standard deviation. It's denoted by the letter s. Since this s is a Latin, English letter, that tells you this involves a sample. The standard deviation for a population is denoted by sigma, which is the lowercase Greek letter equivalent to s.

So let a sample of n numbers be x1 through xn, with a mean of x-bar. Then the sample standard deviation s is the number given by this formula: the square root of the sum of the squared deviations, x minus x-bar squared, over n minus one. Again, that fork-looking figure means "the sum of". There is our deviation that we were looking at a couple of slides earlier. The problem with the deviations is that if you add them all up, some are positive, some are negative, and it'll always come out to be zero. So what we're going to do is introduce a square. We take each deviation and square it, so now everything is forced to be positive. We take the sum of all those numbers, divide it all by n minus one, and then take the square root of that end result.

The steps involved in this calculation are as follows. Step one: calculate the mean of the numbers. Step two: find the deviations from the mean. Step three: square each deviation. Step four: sum up all these squared deviations. Step five: take that sum from step four and divide it by n minus one. Step six: take the square root of that quotient from step five. That is your standard deviation.

So we'll find the sample standard deviation for this collection of values. The mean is seven: add up all the numbers, divide by five, that gives you seven. And now, looky here. In the first row we have our data set: 1, 2, 8, 11, 13. Then the deviations: we subtract the mean, that is, seven, from each of the data values, so these are all your deviations. Then you square them. The square of -6 is -6 times -6; negative times negative is positive, 6 times 6 is 36, so positive 36. Negative 5 squared, or the square of negative 5: so negative 5
times negative 5; negative times negative gives you positive, and 5 times 5 gives you 25, so positive 25. 1 squared is 1, 4 squared is 16, 6 squared is 36. You add up all of these squared deviations, so you get 114. You take that 114 and divide it by n minus one, so 114 divided by 4 is 28.5, and then you take the square root of 28.5, and that gives you 5.34.

The main use of dispersion is to compare the amounts of spread in two or more data sets. A common technique in inferential statistics is to draw comparisons between populations by analyzing samples that come from those populations. The more characteristics you can draw from a sample, the more descriptive you can be about the population it was drawn from, and one characteristic to look at is how spread out the data is.

Two companies, A and B, sell small packs of sugar for coffee. The mean and standard deviation for samples from each company are given below. Which company consistently provides more sugar in its packs? Which company fills its packs more consistently? For company A you have this mean; for company B you have another mean, but notice it's slightly smaller than company A's. The standard deviation for company A is 0.0021, and if you notice, the standard deviation for company B is slightly smaller. Remember, the higher the standard deviation, the more spread out the data. So even though the standard deviation for A is only slightly bigger, that's good enough, because we're dealing with very precise values. Solution: we infer that company A most likely provides more sugar than company B, since it has the greater mean. On the other hand, we also infer that company B is more consistent than company A, because it has the smaller standard deviation.

So this is one area where standard deviation comes in. Say a company comes in with a machine that fills up bottles of ketchup, and it's supposed to fill each bottle that passes by up to a certain amount.
One day the machine breaks down and has to be replaced, so it's replaced with a new machine to do the same job. Now you need to ask yourself, wait a minute, is this new machine doing as good a job as the previous machine? One way to look at that is to say, let's look at the mean and the standard deviation of its fills. A good business person would have that information for the past machine and would start generating the same information for the new machine. I'm pretty sure these days the machines probably do it themselves, but anyway.

So, company A and company B. Company A's values cover 0.0021, company B's 0.0018, so the spread of B is smaller, which means, hopefully you can see, that it's more consistent in filling up those bags of sugar. This one has a wider spread, so company A's machine is not as consistent, because it's just as likely to fill to this small value as to this large value. Now, this is not to say this is the range. Standard deviation is not the same thing as a range, even though I treated it as a range in this example. This is just to make it clear; I don't want to go into specifically how the standard deviation is used. At this point I'll just say this measure of spread is an additional quality of the sample that is often sought out. It does have uses, and if you take the Stats 300 class you can find more uses for it, but we probably won't be delving too far into it in this class.

For any set of numbers, regardless of how they are distributed (note: regardless of how they are distributed, so they could be bimodal, uniform, left skewed, right skewed, symmetrical, non-symmetrical, it doesn't matter), the fraction of them that lie within k standard deviations of the mean, where k is greater than one, is at least this: one minus one over k squared. This is a fraction. So let's look at an example. What is the minimum percentage of the items in a data set that lie within
three standard deviations of the mean? Solution: with k equal to 3, we calculate 1 - 1 over 3 squared. That gives us 1 - 1 over 9, which is 8/9, which is roughly 0.889, which converts to 88.9%. At least 88.9% of the data lie within three standard deviations of the mean. So here's your number line, somewhere is the mean, and we go out three standard deviations this way and three standard deviations that way. Remember, standard deviation is represented by s. That means that within this range is at least 88.9% of the data.

Coefficient of variation. This expresses the standard deviation as a percentage of the mean. It is not necessarily a measure of dispersion, but it does combine the ideas of central tendency and dispersion. For any set of data, the coefficient of variation is given by these formulas; if we're talking about a sample, V is your coefficient of variation, the standard deviation s over the mean x-bar, times 100%. So the idea with the coefficient of variation is: what percentage of the mean is the standard deviation? That gives you another tidbit of characteristics. If you have a large mean but a small standard deviation, the coefficient of variation says, holy cow, this data doesn't vary very far from the mean. On the other hand, if you have a large standard deviation but a small mean, it's like, holy cow, that data really varies quite a bit around the mean.

Compare the dispersions in the two samples A and B. A is this, B is this. Notice that A's values are in the low tens, while B's are all over 100. Solution: the values on the next slide are computed using a calculator and the formula from the previous slide. The mean of A is this, the mean of B is this, the standard deviation of A is this, the standard deviation of B is this. So 3.125 over 16.167 gives you roughly 0.193, which gives you this; and 25.294 over 153.167 gives you roughly 0.165, which gives you this. So sample B has a larger dispersion than
sample A, but sample A has a larger relative dispersion. In absolute terms, sample B has a larger dispersion than A, because that's what the standard deviation tells you, but the coefficient of variation tells you that, comparing the two together, relatively speaking, A's dispersion is bigger than B's. Many times comparing A and B is like comparing apples to oranges. One value may be bigger than the other, but, you know, no big deal, it doesn't necessarily mean anything. On the other hand, the coefficient of variation allows you to meaningfully compare two different samples. The data may have units tied to it, which forces the standard deviation to have the same units: if the data are all in pounds, well, the standard deviation is also in pounds. But if one set is in pounds and the other is in seconds, the coefficient of variation takes away all the units, and you're just looking at a unitless value that you can now compare with other unitless values. That does it for that section, so this may actually be a short day.

Okay, so measures of position. Here is another way to derive characteristics from a population or sample. You collect all the orderable data, things that you can rank, and you put them in order; there are additional qualities that you can derive from the ordering of the data. This gets you into the area of, holy cow, this person is in the top 10% of this, or that athlete is in the 90th percentile of all runners, or that plane is in the bottom 50%, you know, it's a horrible plane. These are the kinds of things that we're looking at. In this section we'll be looking at several things: compute and interpret z-scores, compute and interpret percentiles, compute and interpret deciles and quartiles, and construct and interpret box plots.

In some cases we are interested
in certain individual items in the data set, rather than the set as a whole. We need a way of measuring how an item fits into the collection, how it compares to other items in the collection, or even how it compares to an item in another collection. There are several common ways of creating such measures, and they are usually called measures of position.

The first thing we're going to look at is the z-score. If x is a data item in a sample with mean x-bar and standard deviation s, then the z-score of x measures the number of standard deviations by which x differs from the mean. It is calculated as follows: z equals x minus x-bar, divided by s.

Two students who take different history classes had exams on the same day. Jen's score was 83, Joy's score was 78. Which student did relatively better (please note the word "relatively"), given the class data shown below? Jen's class had a class mean of 78 and a standard deviation of 4. Joy's class had a class mean of 70 with a standard deviation of 5. Calculate the z-scores. Jen's z-score: 83 minus 78, divided by 4, gives her 1.25. Joy's z-score is 78 minus 70, divided by 5, so 1.6. Since Joy's z-score is higher, she was positioned relatively higher within her class than Jen was in her class. Notice that Jen has the higher test score, but Jen compared to her class did not do as well as Joy did within her class.

Let us talk for a moment about Jen's z-score of 1.25. What this means is: here's the mean of 78, here's Jen's score of 83. The standard deviation was, what was it, four, so if we jump over four we get to 82. That's one standard deviation for the class. But in order to reach Jen's score we still have this one extra unit. How many standard deviations is that extra jump? Well, 1 divided by 4, so that gives you 0.25 standard deviations. So 1 plus 0.25, that's 1.25 standard deviations. And notice that this is positive, so it's a
higher number. So it would take 1.25 standard deviations from the mean to reach Jen's score, but in Joy's class it would take 1.6 standard deviations to go from the mean to Joy's score. Jen's class had a higher mean than Joy's class, and Jen's class had a lower standard deviation, so the class was more compact around the mean than Joy's was. Those kinds of things give you a clue that, okay, it's possible that Joy did do relatively better than Jen. In absolute values, yeah, Jen did better, 83 versus 78. But relatively speaking, the z-score tells you how Jen did within her class, the z-score tells you how Joy did within her class, and then you can compare those relative values.

Percentiles. When you take a standardized test taken by large numbers of students, your raw score is usually converted into a percentile score, which is defined on the next slide. If approximately n% of the items in a distribution are less than the number x, then x is the nth percentile of the distribution, denoted by P sub n, P for percentile.

The following are test scores, out of 100, for a particular math class. There are all the scores, and note that they are already listed in order. Find the 40th percentile, or P sub 40. Solution: the 40th percentile can be taken as the item below which 40% of the items are ranked. There are 30 pieces of data, so 40% of 30 is 0.4 times 30, which is 12. So this here is the position-12 number, but we're not interested in that; we move to the neighboring value. We take the 13th position, or 75, as the 40th percentile. If we look back at this, it's all ordered: 44 is the first, then 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, 9th, 10th, 11th, 12th, 13th. So the 13th is 75, and P sub 40 equals 75. Now, the reason why we add one is that we don't want the number itself to be part of the 40%. We want to be able to say, oh, 75, that means 40% are
below this value of 75. So to avoid including the number within the percentage that we're looking at, we add one to make sure it's excluded from that percentage.

Now what if you want P sub 45? Oh, let me take a look at a couple of things. If we looked at P sub 42, that means 0.42 times 30, which gives us 12.6; and 0.47 times 30 is going to give you 14.1. There is no 12.6th position, so this 12.6 is somewhere between the 12th and 13th positions. If there is a decimal, we're going to round up. By round up I don't mean round off. Rounding off is where you compare the digit to five: if it's equal to or bigger than five, you bump up the previous position, and if it's smaller than five, you just drop it. That's not what we're talking about. If there is any decimal coming after the number, it doesn't matter how big it is, doesn't matter how small it is, if it's there, we round up. So P sub 42 is still 75, and for P sub 47 the 14.1 rounds up to the 15th position. Please note: if you have a small amount of data, it is possible for one number to occupy more than one percentile position. It's only when you have a large amount of data that the percentiles have more meaning.

I'm going to end it here, primarily because, number one, tomorrow's Independence Day, so might as well cut it off early, like 50 minutes early, and number two, it gives me a little more time to catch up on the videos. So we'll pick up Monday with deciles and quartiles. Have a good Fourth of July, good luck on the next exam, and we'll see you on Monday.
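
Review sketch. The calculations worked in this lecture can be tried out with a few lines of Python. This is only a minimal sketch for checking arithmetic, not something shown in class: every function name below is made up for illustration, the percentile rule is written as "whole part of n times p percent, plus one", which is one reading of the "add one, or round up" procedure described above, and the "at least" fraction is the bound usually called Chebyshev's theorem.

```python
import math


def median_from_frequencies(freq):
    """Median of grouped data given as {value: frequency}."""
    # Expand the frequency table into the full ordered list of values.
    values = sorted(v for v, f in freq.items() for _ in range(f))
    position = (len(values) + 1) / 2  # (sum of frequencies + 1) / 2
    # A whole-number position picks one value; a .5 position averages
    # the two values on either side of it.
    lower = values[math.floor(position) - 1]
    upper = values[math.ceil(position) - 1]
    return (lower + upper) / 2


def sample_std(data):
    """Sample standard deviation s (divides by n - 1)."""
    mean = sum(data) / len(data)
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / (len(data) - 1))


def chebyshev_minimum_fraction(k):
    """Least fraction of any data set within k standard deviations, k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def coefficient_of_variation(s, mean):
    """Standard deviation as a percentage of the mean."""
    return s / mean * 100


def z_score(x, mean, s):
    """Number of standard deviations by which x differs from the mean."""
    return (x - mean) / s


def percentile_position(n, p):
    """1-based position of the p-th percentile among n ordered items.

    Assumes p is a whole-number percent with 0 <= p < 100.
    """
    if not 0 <= p < 100:
        raise ValueError("p must satisfy 0 <= p < 100")
    return (n * p) // 100 + 1


if __name__ == "__main__":
    # Grouped data from the median example: four 1s, three 2s, two 3s, six 4s, eight 5s.
    print(median_from_frequencies({1: 4, 2: 3, 3: 2, 4: 6, 5: 8}))  # 4.0
    print(round(sample_std([1, 2, 8, 11, 13]), 2))                  # 5.34
    print(round(chebyshev_minimum_fraction(3), 3))                  # 0.889
    print(z_score(83, 78, 4), z_score(78, 70, 5))                   # 1.25 1.6
    print([percentile_position(30, p) for p in (40, 42, 47)])       # [13, 13, 15]
```

The percentile function returns only a position; to get the percentile itself you would still look up that position in your own ordered list of scores, as was done by counting to the 13th score above.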