Transcript for:
Understanding Data Variability and Spread

So we have covered how to use numbers to express the central tendency of distributions. In this mini lecture we'll be covering how to use numbers to describe the spread of data. Now, if you remember, when we first talked about spread we noted that distributions can show different spreads of data. In this video we will talk about a few ways that you can use numbers to describe the spread of data. What we are really trying to describe is the variability between people in our sample. In this distribution people are quite similar to each other: we have low variability in scores. Whereas in this distribution people are quite different from each other: there is high variability in the data. So we're going to be talking about four types of measures of spread: maximum and minimum, quartiles, variance, and standard deviation.

So the data we're using here is data collected on the length of wings of houseflies, published in 1955 by Sokal and Hunter. So here are the wing lengths measured from a hundred flies, and here is the histogram of the data. This data is freely available from the Quantitative Environmental Learning Project, here's the link, but the data file will also be on the module site so that you can have a play with it.

So let's start with the minimum and maximum, probably the easiest measure of spread to calculate: quite simply, the lowest and the highest values. This gives an indication of the range over which scores occur, here between 36 and 55. However, let's imagine what would happen if we accidentally included a Gauromydas heros fly, the largest known species, with a wing length of around four and a half centimeters. Well, our maximum, and so our range, changes substantially despite there being only one outlier. This is similar to the mean in the way that it is not robust to outliers.

Next up we have quartiles. Now, quartiles have similar principles to the median, as they rely on ordering the values from low to high. The median is the center of these ordered values; here, as we have an even number of
data points (100), that is between the 50th and 51st values, which here are 45 and 46. Our median is 45.5, and it splits the data such that half of the values are under the median and half of the values are above the median. Quartiles mark quarters of the data, so the first quartile, or Q1, cuts between the lowest value and the median, which is also known as the second quartile or Q2, and the third quartile, Q3, cuts between the median and the highest value. Quartiles effectively split the data into quarters, hence the name, with 25% of the data in each bin. With 100 values, that means that there are 25 values in each of the bins.

So let's add Gauromydas heros back in to see how the quartile measures cope with outliers. If you remember from when we were talking about central measures, the mean can be quite affected by outliers. Here the mean would move from 45.5 to 49.5 with the inclusion of a single fly. The median is far less affected by the outlier: it shifts up only very slightly, to 46. Now that there are 101 values, the positions of Q1 and Q3 move slightly to capture 25% of the data, now 25.25 values in each bin. However, the quartile scores in this case do not change. So, like the median, quartiles are fairly robust to outliers and can therefore be a useful measure of spread.

Next up we have variance. Now, when we calculated the mean, we noted that the mean was a balance point and that the sum of the deviations of each point from the mean was zero. Variance works on a similar principle, by looking at the mean of these deviations. So let's take our wing length data and calculate deviations for each score by subtracting the mean, 45.5, from each data point. Now, as the mean is the balance point, the sum and the mean of these deviation scores is, and always will be, zero, so it's not very useful. That is because we have negative and positive values. What we can do is square the values, multiplying each value by itself
to make all of the values positive, as a negative value multiplied by another negative value is always going to be positive. By doing this the sum won't equal zero. So the variance is the sum of these squared deviation values divided by another number, and this number depends on the sample of participants we have tested (you should go back and revise the central tendency video if this sounds unfamiliar). If we have tested a population, in other words every person we were interested in, we divide the sum of squared deviations by the number of people in the population. However, if we have tested a sample of the population, then we divide by the number of values minus one, and this method is far more common as we don't usually exhaustively test populations.

So why do we take one away from the number of values? Well, it turns out that this is actually quite an important mathematical principle used in statistics, called the degrees of freedom, and this is useful when we're trying to estimate features of a population from a sample. So if we look at our deviations for four participants, we have three of the values, for P1, P2 and P3. What is P4 equal to? Well, we know that the deviations have to sum to zero, so the value must equal three: at the moment we have a value of minus three from P1 plus P2 plus P3, and we need to cancel that out with a positive deviation. So because these have to sum to zero, only three of the four values are free to vary. The degrees of freedom refers to how many numbers can vary given the mathematical constraints.

We'll work through an example of that. Here are some data on a number line; each data point represents one person's score. Let's imagine that this is the entire population that we were interested in studying. The population mean here is 0, just to keep things nice and simple, and we'll square the deviations of each data point from the population mean. The sum of these squares is 682, and the sum of squared deviations divided by the number
of data points, 12, equals 56.83. So our population variance is 56.83, and we'll pop that up here for the next part. Now let's imagine we can't test our whole population, but we're going to randomly sample from that population. We can look at the sample means, and they vary around the population mean: some of them are above zero, some of them below zero. When we calculate our variance by dividing by the total number of participants, the estimates are typically well under the population variance, but if we use n minus 1 these values are increased and bring us closer to the population variance. So, because sampling from a population cannot capture all of the variance, using n gives us an estimate of variance that is biased to be lower than the population variance. Using n minus 1 gives us an estimate of variance that is unbiased and tends to be closer to the population variance.

So, back to variance. If we look at the mathematical versions of the formulas, population variance is denoted by the Greek symbol sigma squared, and this is the sum of each value minus the population mean, squared, so that's giving us our squared deviation values, and we divide that by capital N, the number of people in the population. The sample variance is denoted by s squared, and we take the sum of each value minus the sample mean, x-bar, squared, divided by n minus 1. For our fly wing data that translates to 1,521 divided by 99, which equals 15.36 (times 0.1 millimeters) squared. And this squared part is a slight weakness of the variance measure: it isn't really immediately clear what 0.1 millimeters squared translates to in real terms, and that is where standard deviation becomes useful.

The standard deviation takes the variance score and takes the square root of that value. To remind you of the square root function: when we square 4, we times it by itself and get 16; the square root of 16 is therefore 4; the square root
of 25 is 5, of 36 is 6, of 10 is 3.16, and of 20 is 4.47, etc. The standard deviation of our fly data is therefore the square root of 15.36, which equals 3.9. So the standard deviation is a rough measure of the average amount by which scores deviate from the mean.

So on our histogram of data, here is the mean of 45.5. Our standard deviation, here shortened to SD, was 3.9, so each standard deviation is 3.9 away from the mean. So one standard deviation above the mean would be 49.4, and one standard deviation below the mean would be 41.6. Plus two standard deviations from the mean and minus two standard deviations from the mean are further out again. What is true for roughly bell-shaped data like this is that the majority of the data will fall within one standard deviation of the mean, and that very little of the data will be outside two standard deviations of the mean. So our Gauromydas heros, at 450 times 0.1 millimeters, would be about 104 standard deviations above the mean, which means we can identify it as an outlier. Generally, this understanding of how people on average vary from the mean is very useful in helping us identify outliers in data. If we have a data point that's more than two or three standard deviations from the mean, we may look at it and try to work out whether we've entered the data correctly, or whether the data point really belongs in our population.

We can also use a rule called the 1.5 IQR rule. IQR here stands for interquartile range. Now, how does this work? Well, we take the interquartile range as the distance between Q1 and Q3. So with our fly data, for example, we had a Q3 of 48 and a Q1 of 43. This gives us an interquartile range of 5: there are five points between Q1 and Q3. Then we multiply this by 1.5 for our 1.5 IQR rule, which gives us a value of 7.5. Now, how do we use this? Well, what we want to do is use this value to set a maximum cutoff, so we
want to say that any data beyond the maximum cutoff could be suspect and could be a potential outlier. Here we take Q3 plus our interquartile range times 1.5, so Q3 plus 7.5 gives us a maximum cutoff of 55.5. Any fly that we measure that's above 55.5 we may be suspicious of, and try to work out whether it might be an outlier. The minimum cutoff is Q1 minus the interquartile range times 1.5, so it's 43 minus 7.5, which gives us a value of 35.5.

So what does it mean to have an outlier? Well, maybe you've just made a typo when entering the data. With anything where we manually put numbers into Excel or SPSS, it's possible we miss a decimal point or add an extra zero where there is none. If you have an outlier, it's always worth checking that you haven't made an error copying data from one place to another. Another option is that you've made a measurement error. Again, whenever there is a chance of human error, it is important to check that we haven't accidentally read inches instead of centimeters for a participant, or something similar. Perhaps we've measured the wrong individual. The giant fly would be an example of this: it isn't a housefly, but if you mistakenly measured it as one, it does not belong in the housefly population and you have measured the wrong individual. Or lastly, maybe they're legitimate measurements. There are extreme values in all distributions, and there is always the possibility that you have just measured one of these extreme individuals.

So what should you do if you identify an outlier? Well, there are pros and cons of leaving outliers in or removing them. Why would you leave them in? Well, you can't pick and choose your data. If it is a validly massive person, they do exist in the population, extreme or not. As a pro, you're considering all of your data set. However, measures such as the mean and standard deviation are influenced by outliers, so these measures will be affected. So what if we remove the outlier? Well, we can do this if
we believe the individual isn't really representative of the population. For example, those who are competitively growing their nails can be considered systematically different from the population, as they are skewing nail measurement data in an intentional way, that is, to get the record. This would have the positive effect of removing the influence of these outliers from the mean and standard deviation. However, by removing this participant, any statistics that are used cannot be said to exhaustively apply to all individuals. So really, arguments can be made both ways, and the removal of outliers needs to be considered on a case-by-case basis, and the reasons for inclusion or exclusion explained when reporting any analysis.

So in summary: like the central tendency, we can use numbers to describe the spread of a distribution. Measures of spread describe the variability of participants. The minimum and maximum give a measure of the range of values. Quartiles allow us to bin our data into quarters to measure how spread out data points are. Variance can be calculated to compare individuals to the mean. Standard deviations are similar to variance but can give more easy-to-understand scales to our measurements. Measures of spread can be used to identify outliers in data, which we can remove or leave in on a case-by-case basis.
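
The calculations described in this lecture can be sketched in a few lines of Python. This is only an illustration, not the analysis used in the lecture: the function names (spread_summary, iqr_fences, sd_distance) and the eight-value sample are made up for the example, and the real wing-length data file is not reproduced here. The quartiles (43 and 48), mean (45.5), standard deviation (3.9) and giant-fly value (450) are the figures quoted in the lecture.

```python
import math
import statistics


def spread_summary(values):
    """Return min, max, mean, variances and sample SD for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Deviations from the mean always sum to zero, so we square them first.
    sum_sq = sum((x - mean) ** 2 for x in values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "population_variance": sum_sq / n,    # divide by N for a whole population
        "sample_variance": sum_sq / (n - 1),  # divide by n - 1 for a sample
        "sample_sd": math.sqrt(sum_sq / (n - 1)),
    }


def iqr_fences(q1, q3, k=1.5):
    """Return the (minimum, maximum) cutoffs of the 1.5 IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def sd_distance(x, mean, sd):
    """Return how many standard deviations x lies from the mean."""
    return (x - mean) / sd


# Hypothetical sample in units of 0.1 mm (NOT the real housefly data).
sample = [41, 43, 44, 45, 46, 47, 48, 50]
summary = spread_summary(sample)
print(summary)

# Cross-check against the standard library, which also divides by n - 1.
print(statistics.variance(sample))

# Figures quoted in the lecture: Q1 = 43, Q3 = 48, mean 45.5, SD 3.9.
print(iqr_fences(43, 48))
print(sd_distance(450, 45.5, 3.9))
```

The quartiles are passed in rather than computed because software packages use slightly different conventions for locating Q1 and Q3, so a computed value may not match a hand calculation exactly.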