Hi everybody, welcome to Chapter 2, Numerical Descriptors. The chapter objectives are: describing distributions with numbers using the measures of center, which are the mean and median; the measures of spread, which are the quartiles and standard deviation; the five-number summary and box plots, or box-and-whisker plots; the IQR and outliers; dealing with outliers and choosing among summary statistics; and organizing statistical problems.

Okay, the mean is also known as the arithmetic average, so this is a measure of center. When we're looking at histograms we are guesstimating the center, and the mean is one way of doing that. To calculate the mean of a data set, you add up all of the values and then you divide by the number of individuals. It's the center of mass of the data, which you would hope is where the majority of the data sits. It's denoted, or indicated, by x bar: the x with a line over it is called x bar. You add up all the values in your sample, so you have x1 plus x2 plus dot dot dot plus xn, over n, where n is the total number of values in your sample. We could also write this as x bar equals 1 over n times the summation, which is that capital sigma sign, of x i, where i starts at 1 and goes up to n, the nth value.

Okay, so the center can also be measured through the median, and ideally it's good to look at both the mean and the median. If the distribution is truly symmetrical, you should expect these values to be pretty close to each other. The median is the midpoint of a distribution, a number such that half of the observations are smaller and half are larger. To find it, first you sort the observations from smallest to largest, where n equals the number of observations, and second, the location of the median is n plus 1 divided by 2 in the sorted list. If n is odd, the median is the value in the center of the observations. So if you have 25 values, n equals 25, and 25 plus 1 is 26, divided by 2 is 13.
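As a rough sketch of the two measures of center described so far, the mean formula and the (n + 1) / 2 location rule for the median could be written in Python as follows. The function names and the small sample lists are hypothetical, chosen only for illustration; the even-n branch averages the two center values.

```python
def mean(values):
    """Arithmetic mean: add up all the values, then divide by n."""
    return sum(values) / len(values)


def median(values):
    """Median using the (n + 1) / 2 location rule on the sorted list."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2  # 1-based position in the sorted list
    if n % 2 == 1:
        # Odd n: the location is a whole number, e.g. n = 25 gives 13.
        return ordered[int(location) - 1]
    # Even n: the location ends in .5, e.g. n = 24 gives 12.5, so we
    # average the values just below and just above that position.
    lower = ordered[int(location) - 1]
    upper = ordered[int(location)]
    return (lower + upper) / 2


# Hypothetical data, not from the lecture.
odd_sample = [3, 1, 4, 1, 5, 9, 2]
even_sample = [1, 2, 3, 4]

print(mean(odd_sample))     # 25 / 7
print(median(odd_sample))   # 4th value of the sorted list
print(median(even_sample))  # average of the 2nd and 3rd values
```

Python's standard library also offers `statistics.mean` and `statistics.median`, which give the same results for lists like these.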
So we find the 13th value in our sorted data set and that's our median. If n is even, then the median is the mean of the two center observations. So if we have 24 values, 24 plus 1 is 25, divided by 2 is 12.5, and location 12.5 means the average of the 12th and 13th values. So we take 3.3 plus 3.4 and divide by 2 to get 3.35 for the median.

Okay, the median is a measure of center that is resistant to skewing and outliers; the mean is not. So here we can see the mean and the median are approximately the same only if the distribution is symmetric. The mean is not resistant to skewing and outliers because the mean is computed using all of the numerical values in the data set. The median only requires finding the middle value, so it's not directly affected by values on the very ends, the outliers of the distribution.

Okay, so here we have a study of freely forming groups in bars all over Europe. It recorded the group size, the number of individuals in the group, of all 501 groups in the study that were naturally laughing. The median laughter group size is (a) 2, (b) 2.5, (c) 3, (d) 3.5, or (e) 4? And the average laughter group size is (a) smaller than the median, (b) about the same, or (c) larger than the median? So try to calculate the median and estimate the mean. I'll give you a minute to do that. You have 254 twos, 168 threes, 52 fours, 21 fives, and 6 sixes. Just remember your n value, and then try to find the median.

Okay, so there are 501 groups, so the median location is 501 plus 1 divided by 2, which is the 251st group in the sorted list, and that falls among the 254 groups of size two. So our median is 2. Now when we add up all of these numbers and divide by the total, which is 501, we're going to see that the mean is larger than 2, right, because 2 is the smallest value, so every other group size pulls the mean up. It's going to be larger than 2.
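To check the laughter-group answer numerically, here is a minimal sketch that expands the frequency counts quoted in the lecture into a full list and uses Python's standard `statistics` module. The variable names are illustrative assumptions; only the counts come from the lecture.

```python
import statistics

# Group size -> number of groups, as read out in the lecture.
counts = {2: 254, 3: 168, 4: 52, 5: 21, 6: 6}

# Expand the frequency table into one entry per group.
sizes = [size for size, count in counts.items() for _ in range(count)]

n = len(sizes)                          # 501 groups in total
median_size = statistics.median(sizes)  # value at location (n + 1) / 2 = 251
mean_size = statistics.mean(sizes)      # 1361 / 501

print(n, median_size, round(mean_size, 2))
```

The mean comes out at about 2.72, above the median of 2, which matches the reasoning that a right-skewed distribution pulls the mean toward the tail.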
So here we know that the mean is going to be larger than the median. Okay, good job.

Okay, so we are going to start looking at the five-number summary. The five-number summary includes the median, which we just learned how to do, and the quartiles, along with the minimum and maximum. The first quartile, Q1, is the median of the values below the median in the sorted data. So here the median is the 13th value, so values 1 through 12 are the values below the median. The middle location for those is between the 6th and 7th values, or 6.5, so Q1 is the mean of 2.1 and 2.3, which is 2.2. Then the third quartile, or Q3, is the median of the values above the median in the sorted data. That would be values 14 through 25, where the middle location is 19.5, so we take 4.2 plus 4.5 and divide by 2 to get 4.35.

So let's try it out. How fast do skin wounds heal? Here are the skin healing rates from 18 newts, measured in micrometers per hour. We put the values in order from smallest to largest and then we try to find the median and the quartiles, quartile 1 and quartile 3.
So take a minute and try to sort that out. The first thing to do is to find the n value. What's the sample size? You need to count up all the numbers: how many numbers are there? There are 18 values, so n equals 18. Our median location is n plus 1 divided by 2, so 18 plus 1 divided by 2 is 9.5, and we count over to between the 9th and 10th values: one, two, three, four, five, six, seven, eight, nine, ten. The 9th and 10th values are 26 and 27, so we take the mean of those two numbers and that's our median, 26.5.

Now for the quartiles we're going to look at the median of the numbers below the median and the median of the numbers above it. The numbers below the median are values one through nine, so that's 11 through 26, right? That's nine values, so nine plus one equals ten, divided by two is five, and we take the fifth value: one, two, three, four, five. So quartile one, Q1, equals 22. Now for quartile three we take the numbers above location 9.5, and that's the 10th value through the 18th value, right? That's also nine values, so it's going to be the fifth value again, counting within those: one, two, three, four, five, and that's 33. So Q1 equals 22, Q2 or the median equals 26.5, and Q3 equals 33.
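The quartile procedure just walked through (median of the lower half, median of the upper half, leaving out the middle value when n is odd) can be sketched as below. The transcript does not list the 18 healing rates, so the `newt_rates` values are an assumption: they were chosen to agree with every figure quoted in the lecture (smallest value 11, 9th and 10th values 26 and 27, Q1 = 22, Q3 = 33). The function names are hypothetical.

```python
def median_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the lecture's rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # values below the median location
    # For odd n, skip the middle value itself when forming the upper half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median_sorted(lower), median_sorted(ordered), median_sorted(upper)


# ASSUMED data: consistent with the numbers quoted in the lecture,
# but not confirmed by the transcript itself.
newt_rates = [29, 27, 34, 40, 22, 28, 14, 35, 26,
              35, 12, 30, 23, 18, 11, 22, 23, 33]

q1, q2, q3 = quartiles(newt_rates)
print(q1, q2, q3)
```

Statistical software often uses slightly different quartile conventions, so values from other tools may differ a little from this hand method on small data sets.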
Okay, so spread can also be measured through the standard deviation. The standard deviation measures spread by looking at how far each observation is from the mean. So if you have a class average of 75 percent and I say the standard deviation is plus or minus 5 percent, you can gather from that that the majority of the class got around a C: 75 percent plus or minus 5 percent is centering around C's. However, if it's 75 percent plus or minus 25, that could mean that half the class got an F and half the class got an A. So the standard deviation tells us how tightly the data are clustered around the mean.

Okay, to calculate the standard deviation of a sample, we first calculate the variance, s squared, where s squared equals 1 over n minus 1, where n equals the sample size and n minus 1 is the degrees of freedom, times the summation of x i minus x bar, quantity squared. So here we're looking at the distance between each observation x i and the mean: how far away is that data point from the overall mean? And then to get the standard deviation, s, we take the square root of that entire formula.

Okay, so let's try it out. We have a person's metabolic rate, which is the rate at which the body consumes energy. So try to find the mean and standard deviation for the metabolic rates of a sample of seven men. Take a second; if you have to pause the video, go ahead and pause it.

Okay, so here we have all of the observations added up, so that's the summation of the x i's, where i starts at 1 and goes to 7, divided by 7, so x bar, the sample mean, equals 1,600. To find the standard deviation we have to find how much each value deviates from the sample mean, so x i minus x bar, quantity squared, added together. So we have all these values: 192 quantity squared equals 36,864, plus 4,356, and so forth. The sum of those is 214,870.
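A minimal sketch of the variance and standard deviation formulas described above, with s squared as the sum of squared deviations divided by n minus 1. The transcript does not list the seven metabolic rates, so the `rates` values here are an assumption, chosen because they reproduce the figures quoted in the lecture (mean 1,600, first squared deviation 36,864, sum of squares 214,870). The function names are hypothetical.

```python
import math


def sample_variance(values):
    """s^2 = (1 / (n - 1)) * sum of (x_i - x_bar)^2."""
    n = len(values)
    x_bar = sum(values) / n
    sum_of_squares = sum((x - x_bar) ** 2 for x in values)
    return sum_of_squares / (n - 1)  # divide by the degrees of freedom


def sample_std(values):
    """s is the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# ASSUMED data: consistent with the totals quoted in the lecture,
# but not confirmed by the transcript itself.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

print(sum(rates) / len(rates))            # sample mean
print(round(sample_variance(rates), 1))   # variance
print(round(sample_std(rates), 1))        # standard deviation
```

`statistics.variance` and `statistics.stdev` in the standard library use the same n minus 1 divisor, so they can serve as a cross-check.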
So now for the variance: s squared equals 1 over the degrees of freedom, which is n minus 1, and we know n equals 7, so 7 minus 1 equals 6, times the summation of x i minus x bar, quantity squared. So that's 214,870 divided by 6, which gives s squared equals 35,811.7. This is the variance. To find the standard deviation we take the square root of that value, and we have 189.2. So this is the typical deviation around the mean of 1,600.

Okay, with these values known, the minimum, the maximum, quartile one, quartile two or the median, and quartile three, we can create a box plot, or box-and-whisker plot. We make the box from the quartile 1 value to the quartile 3 value, we put the median inside the box, and then we have the whiskers out to the minimum and maximum. The interquartile range, or IQR, is the distance between the first and third quartiles, which equates to the length of the box in the box plot.

An outlier is an individual value that falls outside the overall pattern. So how far outside the overall pattern does a value have to be to be considered a suspected outlier? We have a formula for that, so you don't just guess whether or not something's an outlier. You first find the interquartile range, quartile 3 minus quartile 1, and multiply it by 1.5. Then you take quartile 1 and subtract 1.5 times the interquartile range; anything less than that is a suspected outlier. For outliers on the higher end, you take quartile 3 and add 1.5 times the IQR; anything greater than that is considered a suspected outlier.

All right, so let's try to calculate this. We know how to find the median, that's Q2, which is 3.4. Q1 and Q3 we found already too, so those were 2.2 and 4.35. Now to find the interquartile range, the IQR is 4.35 minus 2.2
which equals 2.15. To find the outliers we're going to find the lower and upper cutoffs. The lower one is Q1 minus 1.5 times the IQR, and 1.5 times 2.15 equals 3.225, so 2.2 minus 3.225 would be less than zero, right, and there are no values there. Then we have Q3, 4.35, plus 1.5 times the IQR, so that would be 7.575, and there is one value above that, at 7.9. So 7.9 is our outlier, which makes the new maximum 5.6. We would have to modify our box-and-whisker plot, changing the upper whisker to end at 5.6 and changing 7.9 to an asterisk to denote that it is an outlier.

Okay, so for this one we can see BMI and frequency, and this is really nice because a modified box plot with the asterisks helps us distinguish between points that are part of a skewed pattern and the presence of an outlier. Here we have three possible outliers around 33 to 35 that are close to the rest of the pattern and appear to simply be part of a skew, whereas the largest value is clearly an outlier that is far away from the rest of the data.

Okay, what should you do if you find outliers in your data? Well, it depends in part on what kind of outliers they are. If they're human error in recording information, we should fix that. If they're human error in experimentation or data collection, we should look at that more closely, because maybe all of the data are affected. If they're unexplainable but apparently legitimate wild observations, then you would ask the question: are you interested in looking at all of the individuals, or are you only interested in looking at the typical individuals? But you don't discard outliers just to make your data fit a model. You don't act like they don't exist.

Okay, so how to choose a summary statistic. Because the mean is not resistant to outliers or skewing, you use it to describe distributions that are fairly symmetrical and don't have outliers. You plot the mean and use the standard deviation
for error bars, so that's the mean plus or minus the standard deviation, or SD. Okay, so this is for data that are not highly skewed and don't have lots of extreme outliers. Otherwise you can use the median and the five-number summary, which can be plotted as a box plot.

So here we have an example: deep-sea sediments. Phytopigment concentrations in deep-sea sediments collected worldwide show a very strong right skew, so the tail is on the right side. Which of the two values is the mean and which is the median: 0.015 and 0.009 grams per square meter of bottom surface? 0.015 is larger than 0.009, so which one would be the mean and which one would be the median? And the second question is, which would be a better summary statistic for these data?

Okay, so we know the mean is not a robust measure of center and it's influenced by outliers and skewed data. We know that the data are skewed to the right, very skewed to the right, so that means the mean is probably being pulled to the right, meaning it's being pulled toward a larger value. So 0.015 would be the mean and 0.009 would be the median, and since these data are highly skewed, we would say the median is probably the better summary statistic to use.

Here's another example. Researchers grafted human cancerous cells onto 20 healthy mice. Ten of the mice were injected with tumor-specific antibodies while the other ten were not. Here's a table of the raw data. What summary statistic would you use for each of these two variables? The two variables are the presence of metastases and the number of metastases. Presence of metastases, as we know, is categorical data, and number of metastases is quantitative data. For categorical data we want to compute the count of the mice with metastases, or the proportion of the mice with metastases, for each group. For quantitative data we can compute the mean and the standard deviation of the number of metastases for each group,
and we could do a five-number summary, but with just 10 values in each group it wouldn't summarize the data much.

So organizing a statistical problem includes four steps. One is state: what's the practical question, in the context of a real-world setting? Two is plan: what specific statistical operations does this problem call for? Three is solve: make the graphs and carry out the calculations needed for the problem. And four is conclude: give your practical conclusion in the real-world setting. So state, plan, solve, and conclude. These are basically the four steps we're going to work through, and we're going to apply different statistical tests based on the type of data. Okay, that's it for Chapter 2. Thank you for listening. Bye bye.