Transcript for:
Understanding Descriptive Statistics and Analysis

In this module we're going to start talking about descriptive statistics. We previously discussed ways of displaying data and examining sample data in terms of shape, distribution, and other characteristics, looking at things like symmetry, skewness, modality, normality, and the presence of outliers, and we introduced tools such as histograms, stem-and-leaf plots, and bar plots. Plots and histograms are an essential first step in any data analysis: you always need to look at the data, you always need to do data cleaning, and you have to make sure the data fit the assumptions of your planned test. For example, as we start to talk about different statistical tests, we'll discuss the situations in which a test is valid or invalid, and part of that decision involves whether your data look normally distributed. So looking at your data and making sure you understand what they look like is an important step.

In addition to visual inspection, though, we also need, especially for quantitative data, some sort of numeric summary, and this is where measures of central tendency and variability become important. Measures of central tendency describe the typical value in your data, in other words, where the signal is. Measures of variability or dispersion tell you about the subject-to-subject differences in the values, which is essentially the noise in your data. We can also use percentages and percentiles to describe the data.

Let's go over some basic notation. Suppose x represents a variable, something we've measured, and we've made that measurement on a sample of n subjects, where the sample size n is the number of subjects we've collected. The measurements of x on the n subjects are then denoted x1, x2, x3, and so on, up to xn. The sum of those n observations is written with the capital Greek letter sigma and the notation i = 1 to n: sigma means to add up, and i = 1 to n says that, using i as an index, we add up x1, x2, x3, and so forth until we reach xn.

Based on that, the mean or average is the sum of the x's divided by the sample size n. The median is the middlemost observation of the ordered data: you order your data from lowest to highest and then find the middle observation. If n is odd, the median is the (n + 1)/2 largest observation; if n is even, the median is the average of the n/2 and n/2 + 1 largest observations, so the median is computed differently depending on whether you have an odd or an even number of observations. The mode is the most frequently occurring observation, the value that occurs most often. All three of these, the mean, the median, and the mode, are measures of what we call central tendency: they tell us something about how the data cluster, about where the signal in the data is.
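To make those three definitions concrete, here's a minimal sketch in Python, written directly from the formulas above using only the standard library (the function names are just illustrative):

```python
from collections import Counter

def mean(x):
    # sum of the observations divided by the sample size n
    return sum(x) / len(x)

def median(x):
    s, n = sorted(x), len(x)
    if n % 2 == 1:                     # n odd: the (n + 1)/2-th value
        return s[(n + 1) // 2 - 1]     # minus 1 because Python indexes from 0
    return (s[n // 2 - 1] + s[n // 2]) / 2   # n even: average the middle two

def mode(x):
    # most frequently occurring value (returns a single value if there are ties)
    return Counter(x).most_common(1)[0][0]
```

For example, median([110, 115, 115, 135, 170, 193]) averages 115 and 135 to give 125, exactly the even-n rule described above.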
Let's look at an example. Say we're interested in patients with kidney stones: we go into that population, sample six kidney stone patients, and measure their systolic blood pressure in millimeters of mercury. Here are our six values: x1 = 170, x2 = 193, x3 = 110, and so on, until our last sample, x6 = 115. The sum over i = 1 to 6 is 838; that's just adding up all of our values. The mean is then 838 divided by 6, the sum of the x's divided by the sample size n, which gives 139.7 millimeters of mercury. Remember that every measurement you make has units; blood pressure is measured in millimeters of mercury, so our average blood pressure is 139.7 millimeters of mercury.

For the median, we order the data from lowest to highest. Because we have an even number of observations, we take the average of the two middlemost values, the n/2 and n/2 + 1 values, which are 115 and 135, so the median blood pressure is 125 millimeters of mercury. Note that 125 is not part of our original data; it's the average of those two middle values. The mode is 115 millimeters of mercury, because 115 appears twice in our data while the other values appear only once.

We can produce this kind of summary in SPSS. Descriptive statistics can be derived in a number of ways: under Analyze > Descriptive Statistics you can choose Frequencies, Descriptives, or Explore, and each provides some set of summary statistics depending on the options you choose. Pay close attention to the options available for each approach, because they're not all the same, and choose the one that works for your situation. For example, if we choose Frequencies with our systolic blood pressure data, the output gives us the mean of 139.67, the median of 125, and the mode of 115.
The Frequencies output also produces a table showing each distinct observation: the values 110, 115, 135, 170, and 193, and how many times each occurs. You can see immediately why the mode is 115: it appears twice in our data, whereas all the other observations appear only once. The table also tells you what percentage of the data each frequency represents, so 110 is 16.7 percent of the data, 115, because there are two values, is 33.3 percent, and so on. The last column is the cumulative percentage, which just sums the percentages going down the table: we start at 16.7, add 33.3 to get 50, add 16.7 to get 66.7, and so forth. So the Frequencies output gives you a nice summary of the frequencies in the data in addition to those summary statistics.

You'll also notice that under N in the top table there's a row called Valid and a row called Missing, and next to the Percent column there's a column called Valid Percent. Valid is the number of observations used in the table, and Missing tells you how many missing observations there are. Remember from when we talked about manipulating data in SPSS that you might have a missing value code when some piece of data wasn't collected for an individual; here the missing count is zero, but it could be something else. The Percent column is always calculated from the total sample size, including missing values, while Valid Percent is based only on the valid observations. So if we had had 10 observations with four missing values, the percentages would have been based on a sample size of 10, but the valid percent would be based on the six real observations.
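If you wanted to reproduce that kind of frequency table outside of SPSS, here's one way it might look in Python; treating None as the missing-value code is just an assumption for illustration:

```python
from collections import Counter

def frequency_table(x):
    """Print each value's frequency, percent (of all rows), valid percent
    (of non-missing rows only), and cumulative valid percent; None marks
    a missing observation."""
    valid = [v for v in x if v is not None]
    n_total, n_valid = len(x), len(valid)
    print(f"valid = {n_valid}, missing = {n_total - n_valid}")
    cumulative = 0.0
    for value, count in sorted(Counter(valid).items()):
        percent = 100 * count / n_total    # uses all rows, including missing
        valid_pct = 100 * count / n_valid  # uses non-missing rows only
        cumulative += valid_pct
        print(f"{value:>5}  n = {count}  {percent:5.1f}%  "
              f"valid {valid_pct:5.1f}%  cumulative {cumulative:5.1f}%")

frequency_table([170, 193, 110, 115, 135, 115])
```

With no missing values, as here, the percent and valid percent columns agree; add a few None entries and they diverge exactly as described above.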
Now, percentiles, or more generally quantiles, are a generalization of the median. The median represents the 50th percentile, but you can calculate a percentile for any value you want: the 10th percentile, the 20th percentile, the 3rd percentile. The pth percentile is the value xp such that p percent of the data values are less than or equal to xp and 100 - p percent of the data values are greater than or equal to xp. To compute the pth percentile, order your n data points from smallest to largest. If n times p/100 is not an integer, then xp is the kth smallest observation, where k is the smallest integer greater than n times p/100. If n times p/100 is an integer, then xp is the mean of the (n times p/100)th and (n times p/100 + 1)th smallest observations. This looks very similar to our calculation of the median: with an odd number of observations the median was the (n + 1)/2 observation, and with an even number it was the average of the n/2 and n/2 + 1 observations.

Let's look at an example. Say we have a data set with 16 observations, and we order them from smallest to largest; notice this is an even number of observations. To get the 25th percentile, n times p/100 is 16 times 25 divided by 100, which equals 4. That's an integer, so the 25th percentile is the average of the fourth and fifth observations in our ranked data. The fourth and fifth observations are 13 and 14, and taking their average gives a 25th percentile of 13.5.

Now let's say we only had 15 observations: take the same data and drop the 18. To get the same 25th percentile in this new data set, n times p/100 is 15 times 25 divided by 100, or 3.75. That's not an integer, so the 25th percentile is the fourth ranked observation, 4 being the smallest integer greater than 3.75, which means our 25th percentile is 13.

The median, which represents the second quartile, is the 50th percentile of the data, p = 50 in the formula. The first and third quartiles are the 25th and 75th percentiles, p = 25 and p = 75 in the formula from the previous slide. The 25th and 75th percentiles are sometimes called the lower and upper quartiles, respectively, and are denoted QL = Q25 and QU = Q75.

There is an alternative way of calculating the lower and upper quartiles. First find the overall median of the data set using the formula described earlier; then the median of the values to the left of the overall median is approximately the 25th percentile, and the median of the values to the right of the overall median is approximately the 75th percentile. More specifically, you can write n = 2k if the sample size is even and n = 2k + 1 if it's odd; then the median of the k smallest values is approximately the 25th percentile and the median of the k largest values is approximately the 75th percentile. This alternate method gives roughly the same answer as the method from the previous slide about 75 percent of the time.

Going back to our examples: for n = 16 we have 2 times 8, so k = 8, and the 25th percentile is the median of the eight smallest numbers, which is 13.5 as before; that subsample size is even, so we average the middle two values, 13 and 14. For n = 15, which is odd, 2 times 7 plus 1 gives k = 7, so the 25th percentile is the median of the seven smallest numbers; because seven is odd, we take the fourth observation, which is 13.
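Here's a short sketch of that by-hand percentile rule in Python; the two data sets are hypothetical stand-ins, since the transcript only tells us that the fourth and fifth smallest values are 13 and 14 and that one observation is an 18:

```python
import math

def percentile_by_hand(x, p):
    """pth percentile using the class rule: if n*p/100 is not an integer,
    take the k-th smallest value, k being the smallest integer greater
    than n*p/100; if it is an integer, average the (n*p/100)-th and the
    (n*p/100 + 1)-th smallest values."""
    s = sorted(x)
    pos = len(s) * p / 100
    if pos != int(pos):
        k = math.ceil(pos)           # smallest integer greater than pos
        return s[k - 1]
    k = int(pos)
    return (s[k - 1] + s[k]) / 2

data16 = [5, 8, 11, 13, 14, 16, 17, 18, 19, 21, 23, 24, 26, 28, 30, 32]
data15 = [v for v in data16 if v != 18]      # drop the 18, as in the example

print(percentile_by_hand(data16, 25))   # 4 is an integer -> (13 + 14)/2 = 13.5
print(percentile_by_hand(data15, 25))   # 3.75 -> 4th smallest value = 13
```

With p = 50 this reduces to the median rule from earlier.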
Now if I run those same two examples through SPSS, where v1, variable 1, is our sample with 16 values and v2 is our sample with 15 values, and I ask it to compute the 25th, 50th, and 75th percentiles, you can see the results for both samples. Note that for the 25th percentile in the n = 16 case, SPSS reports 13.25, which doesn't agree with the 13.5 we calculated; in fact, its 75th percentile of 25.25 wouldn't match our hand calculation either. There are a couple of reasons for this. First, SPSS uses a slightly different algorithm than what we're presenting in class: it uses what's called a weighted mean. When estimating percentiles by hand, your data will often not fall exactly on the pth percentile, and in those cases we're telling you to simply take the mean of the two neighboring observations. Most statistical software, including SPSS, instead uses a weighted mean of the observations that fall near the pth percentile. A weighted mean is an average in which the individual data points don't contribute equally to the calculation: each observation xi has a weighting factor wi assigned to it. So instead of the mean being the sum of the x's divided by the sample size, you multiply each value xi by its weighting factor wi, sum those products across the sample from i = 1 to n, and divide by the sum of all the weights. That partly explains the difference we see between the SPSS output and our hand calculation.
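As a sketch of how a weighted scheme can produce 13.25 instead of 13.5: one common definition, and it's an assumption here that this is the variant SPSS applies, places the pth percentile at position (n + 1) times p/100 in the sorted data and takes a weighted mean of the two flanking observations:

```python
def weighted_percentile(x, p):
    """Weighted-mean percentile: the pth percentile sits at position
    (n + 1) * p / 100 in the sorted data, and the two neighboring
    observations are weighted by how close that position is to each."""
    s = sorted(x)
    pos = (len(s) + 1) * p / 100
    k = int(pos)                 # rank of the lower neighbor (1-based)
    frac = pos - k               # weight given to the upper neighbor
    if k < 1:
        return s[0]
    if k >= len(s):
        return s[-1]
    return (1 - frac) * s[k - 1] + frac * s[k]
```

For the 16-observation example, the position is 17 times 25/100 = 4.25, so the result is 0.75 times 13 plus 0.25 times 14, or 13.25, matching the SPSS output above.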
Now, outliers are very important because they can have undue influence on results. An outlier is a data point that's very different from the rest of your data: maybe all of your data cluster around 100 and suddenly you have a value of 200. Outliers can have a big effect on results, whether that's the mean or some statistical analysis, and therefore you want to make sure that you identify outliers and deal with them. Say we have our six blood pressure observations again, 170, 193, and so forth, and we compute our measures of central tendency: a mean of 139.7, a median of 125, and a mode of 115. Now let's add an additional data point of 400. That point looks very different from the rest of the data, which are all in the hundreds. If we do the same thing, order the data and calculate our measures of central tendency, the mean is now 176.9, the median is 135, and the mode is 115.

So the mean changes quite a bit because of the influence of this outlier, the median changes a little, from 125 to 135, and the mode doesn't change at all, because the mode is the most frequent observation and there are no other 400s in the data. You can see that the median and the mode are more robust to outlier effects than the mean; the mean is very sensitive to outlier values.

Next let's talk about measures of variability or dispersion; there are a number of ways you can quantify variability in your data. Let's start with the range, which is just the difference between the largest and the smallest observations; it's often displayed as the two values in parentheses. The range gives you an idea of how widespread the data are, since you have the lowest and the highest values. Then there's the interquartile range, or IQR, which is the difference between the 75th and the 25th percentiles. The IQR is often misreported in a way that looks like the range: people put the 25th and 75th percentiles in parentheses and call that the IQR, but the IQR is actually the difference between those two numbers and is supposed to be reported as a single value, the distance between the 25th and 75th percentiles. The reason for reporting the interquartile range rather than the range is that the IQR is more resistant to the effects of outliers: outliers tend to fall outside the 25th and 75th percentiles and therefore don't influence the IQR, whereas the range can include outlier values and misrepresent how widespread the data really are.

Then there's the sample variance, represented as s squared (the population variance would be sigma squared, but since we're drawing samples from the population, we work with the sample variance). The sample variance is the sum of the squared differences between the individual observations xi and the sample mean: you compute the mean, subtract it from each observation, square that value (the squaring is only there to get rid of the negative signs), sum those up, and divide by n - 1. The sample standard deviation, s (sigma for the population value), is the square root of the variance. The standard error of the mean, usually written SE or SEM, is a measure of variability about the mean: the standard deviation divided by the square root of the sample size n. So the variance, the sample standard deviation, and the standard error of the mean are three calculated measures of variability, along with the IQR and the range. Some of you may have heard of the coefficient of variation, or CV, which describes variability in the data relative to the mean: the CV is the standard deviation divided by the mean. The coefficient of variation is unitless, is typically reported as a percentage, and can be used to compare the spread of data sets with different units or different means. The coefficient of variation is commonly used to describe the precision and reproducibility of assays.
For example, you might read in the methods section of a paper that insulin was measured using a radioimmunoassay and the coefficient of variation was seven percent; that gives you a sense of how well the assay works.

Let's look at measures of dispersion and variability using our kidney stone blood pressure data again. We can compute the variance: we sum up the squared differences between each observed value and the mean, so 170 minus the mean (remember the mean was 139.7), squared, and likewise for 193, 110, and so on; we sum all of those and divide by the sample size minus one. We have six observations, so n - 1 is 5, and that gives us a variance of 1176.7 millimeters of mercury squared. Again, all your data have units, and because we're squaring these values, what you're working with is millimeters of mercury squared. You don't usually see people reporting the variance in papers or grants, because nobody thinks in terms of millimeters of mercury squared; that's why we usually report the standard deviation, which is the square root of the variance, here 34.3 millimeters of mercury, a much more interpretable quantity.

One thing I should point out right now: when doing these calculations by hand, never round during intermediate calculations; always round at the end. For example, even though we've been saying the mean is 139.7, the actual mean of these six values is 139.66666..., and that's what you should use to calculate the standard deviation, the coefficient of variation, the standard error of the mean, or the variance. So the calculation should really use 139.66666..., and then we can round to 1176.7 at the end. Please do that whenever you do any calculations for this class: round at the end, not in the middle. Rounding in the middle introduces what's known as rounding error, and there's a very good chance you'll end up with a slightly different answer. Whenever we do statistics, precision and accuracy are very important, so always carry the significant digits until the very end and round at the end.

With the standard deviation of 34.3, we can calculate the coefficient of variation, the standard deviation divided by the mean times 100, which gives 24.56 percent, and the standard error of the mean, the standard deviation divided by the square root of n, which gives 14 millimeters of mercury. Again, the standard deviation and the coefficient of variation tell us something about the variation or dispersion in the data, while the standard error of the mean gives us a sense of the accuracy and precision of our estimate of the mean.

I can do the same thing in SPSS, in this case with the Descriptives option and my systolic blood pressure data. The output shows n = 6 observations, the mean of 139.67, the standard error of 14.004, the standard deviation, and the variance, so you can see the difference between what SPSS computes, carrying all the significant digits, versus what we're reporting from our hand calculations. The N column here is the number of observations for the variable, and Valid N (listwise) is the number of rows in the data for which all variables are non-missing.
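Here's a short sketch of those dispersion calculations in Python, carrying full precision throughout and rounding only when printing, in line with the round-at-the-end rule:

```python
import math

bp = [170, 193, 110, 115, 135, 115]    # systolic blood pressure, mm Hg

n = len(bp)
mean = sum(bp) / n                     # 139.66666..., not the rounded 139.7
variance = sum((x - mean) ** 2 for x in bp) / (n - 1)
sd = math.sqrt(variance)               # standard deviation
sem = sd / math.sqrt(n)                # standard error of the mean
cv = 100 * sd / mean                   # coefficient of variation, percent

print(f"variance = {variance:.1f} mm Hg squared")   # 1176.7
print(f"sd = {sd:.1f} mm Hg")                       # 34.3
print(f"sem = {sem:.3f} mm Hg")                     # 14.004
print(f"cv = {cv:.2f} percent")                     # 24.56
```

If you redo the variance with the mean pre-rounded to 139.7, the later digits shift, which is exactly the rounding error warned about above.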
Now let's go back to talking about outliers. We discussed outliers briefly in terms of measures of central tendency; now consider measures of variability. The standard deviation of the original blood pressure data was 34.3, but if we add our outlier value of 400, the standard deviation becomes 103.3. So this single value of 400 has a huge impact on the estimate of the standard deviation, and in general the standard deviation, and therefore the variance, the coefficient of variation, and the standard error of the mean, will be even more sensitive to outliers than the mean is. This is again why looking through your data visually to identify outliers like this is very important.

An easy way to summarize data is with a box and whiskers plot, also simply called a box plot, which provides a graphical representation of the five-number summary of a data set: the minimum and the maximum, the 25th and 75th percentiles, and the median (remember, the median is the 50th percentile), all in a single, easily viewable graph. Here's a box plot of the percent body fat data we showed in previous modules. It's called a box and whiskers plot because of the box, which represents the bulk of your data, and the whiskers extending from it. To interpret the figure: the line in the middle of the box is the median; the two ends of the box are the 75th and 25th percentiles, so the length of the box is the interquartile range; and the vertical lines with the little horizontal caps on top, the whiskers, mark the minimum and the maximum values. This gives you an instant visual check: if your data are skewed, the median will sit off-center in the box and one tail may be much longer than the other. And if you have different groups, like the percent body fat data whose distribution we saw differed between males and females, you can draw a separate box plot for each group side by side and see the shift in the overall data and the differences in spread.
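As a sketch of how you might draw one of these outside of SPSS, here's a minimal matplotlib example; the simulated skewed sample is just a placeholder, not the course data:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=10.0, size=200)   # placeholder skewed data

fig, ax = plt.subplots()
ax.boxplot(sample, whis=1.5)   # whiskers at 1.5 * IQR; points beyond are flagged
ax.set_ylabel("measurement")
plt.show()
```

The whis=1.5 setting mirrors the 1.5 times IQR convention described next.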
Notice in this figure that the minimum and maximum are in quotation marks: they represent the smallest and largest non-outlier values. SPSS has an automatic algorithm to identify outliers and extreme values, and when it draws a box plot it flags those points and reports the minimum and maximum only for the non-outlier values. Specifically, for a box and whiskers plot in SPSS, and this is only for SPSS, the regular data points are those between the 25th percentile minus 1.5 times the IQR and the 75th percentile plus 1.5 times the IQR; that's the rule SPSS uses to define the regular data. Outliers, marked with a small circle, are values that fall either between the 25th percentile minus 3 IQRs and the 25th percentile minus 1.5 IQRs, or between the 75th percentile plus 1.5 IQRs and the 75th percentile plus 3 IQRs. Extreme values, denoted on the plot by an asterisk, are values smaller than the 25th percentile minus 3 IQRs or greater than the 75th percentile plus 3 IQRs. These are the rules that define regular data, outliers, and extreme values, and they are specific to SPSS; other software may do similar things, but you need to read the documentation to figure out exactly what rules it uses to identify outliers and extreme values. (A small code sketch of these fences follows the next example.)

Let's look at an example using the insulin data, which, remember, were very skewed, and make a box and whiskers plot. You can see that the data are very skewed: the median sits low in the box, there's a short whisker for the lower values and a longer whisker for the larger values, and there are a number of outlier values and a number of extreme values, including one extreme value all the way up around 90 microunits per mL. You'll also notice a number beside every observation labeled as an outlier or an extreme value; there's an 864 next to that extreme value. The number displayed is the row number of that observation in the data set, so it's easier to identify when you go back to your data: you know that the 90 microunits per mL point is row number 864.
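Here's that small sketch of the fence rules in Python; classify is just an illustrative name, and to reproduce SPSS plots exactly the quartiles would have to be computed with SPSS's own percentile algorithm:

```python
import numpy as np

def classify(x):
    """Label each value 'regular', 'outlier', or 'extreme' using the
    1.5 * IQR and 3 * IQR fences around the quartiles, following the
    SPSS box plot convention described above."""
    q25, q75 = np.percentile(x, [25, 75])
    iqr = q75 - q25
    inner = (q25 - 1.5 * iqr, q75 + 1.5 * iqr)   # regular-data fences
    outer = (q25 - 3.0 * iqr, q75 + 3.0 * iqr)   # extreme-value fences
    labels = []
    for v in x:
        if inner[0] <= v <= inner[1]:
            labels.append("regular")      # inside the inner fences
        elif outer[0] <= v <= outer[1]:
            labels.append("outlier")      # between inner and outer fences
        else:
            labels.append("extreme")      # beyond the outer fences
    return labels
```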
Now, just to show you, remember we talked about statistical transformations. If I take that same insulin data and log transform it, the data look more normal, and if you look at the box and whiskers plot, the median is now more centered in the box, the whiskers are more even about the box, and there are only two outlier values as identified by the algorithm SPSS uses. This shows how much a statistical transformation really changes things: we had very skewed data with all those outliers and extreme values, we do a transformation, and nearly all of them disappear, leaving just two. And here's our 90 microunits per mL value from the last slide: it's still considered an outlier, but it's now much closer to the bulk of the data than it was before.

So how do you choose which summary statistics are best for your data? A good rule of thumb: if the data are highly skewed or there are a lot of outliers, then the median and the IQR are generally the best measures of central tendency and dispersion. If the data look relatively symmetric, like a bell-shaped curve, and there are very few outliers, then the mean and the standard deviation are generally the best way to capture central tendency and dispersion. This is very important because, again, measures of central tendency like the mean, and our measures of dispersion, are highly sensitive to outlier values, so if you use the wrong summary statistic you can misrepresent the distribution of your data. Applying this rule of thumb is very important.
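To close the loop on that transformation idea, here's a minimal sketch: generate a skewed placeholder sample (not the course insulin data), log transform it, and count how many points fall outside the 1.5 times IQR fences before and after:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=2.0, sigma=0.8, size=500)   # placeholder skewed data

def n_outside_fences(v):
    # count candidate outliers beyond the 1.5 * IQR fences
    q25, q75 = np.percentile(v, [25, 75])
    iqr = q75 - q25
    return int(np.sum((v < q25 - 1.5 * iqr) | (v > q75 + 1.5 * iqr)))

print("raw:            ", n_outside_fences(x))          # many flagged points
print("log-transformed:", n_outside_fences(np.log(x)))  # few or none
```

Because a lognormal sample becomes normal under the log, most of the flagged points disappear after the transformation, mirroring what the SPSS box plots showed.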