Fundamentals of Statistics and Data Analysis

Let's start with this problem: find the mean, median, mode, and range of the following data set. The first thing I would recommend doing is arranging the numbers in increasing order. The lowest number is 7, and we have two of them. After that, the next number is 10, and then 14, 15, 23, and 32.
Now, to calculate the mean, we need to take the sum of the numbers and divide it by the seven numbers that are in the data set. So this is going to be 7 plus 7 plus 10 and so forth, and then we divide by 7. The mean is basically the average of those numbers. The total sum that I got is 108. If you take 108 and divide it by 7, that gives you approximately 15.43 after rounding. So that is the mean, and that's how you can find it.
Now what about the median of this data set? The median is basically the middle number. What I like to do is eliminate the first and the last number, and then, working towards the middle, eliminate the next two numbers until I'm left with the middle number. As we can see in this example, the median is equal to 14. It's simply the middle number of the data set.
Now what about the mode? The mode is simply the number that occurs most frequently in the data set. Notice that 7 appears twice in this data set, so 7 is the mode.
Now what about the range? The range is simply the difference between the highest number and the lowest number, so it's going to be 32 minus 7, which is 25.
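The four calculations above can be checked with a short Python sketch. This is just one way to do it, using the standard `statistics` module; the helper name `summarize` is made up for illustration and is not something from the video.

```python
from statistics import mean, median, multimode

def summarize(values):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    ordered = sorted(values)              # arrange in increasing order first
    return {
        "mean": mean(ordered),            # sum divided by the count
        "median": median(ordered),        # middle value (or average of the two middle values)
        "modes": multimode(ordered),      # every value tied for the highest count
        "range": ordered[-1] - ordered[0],  # highest minus lowest
    }

data = [7, 7, 10, 14, 15, 23, 32]         # the first data set, already sorted
summary = summarize(data)
print(round(summary["mean"], 2))          # 15.43
print(summary["median"], summary["modes"], summary["range"])
```

`multimode` returns a list, so a data set with two equally frequent values would report both of them rather than picking one.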
And so now you know how to find the mean, median, mode, and range of a data set. Now let's move on to number two. This is going to be a similar problem, but it's not exactly the same. As you can see, we have eight numbers in our data set, as opposed to the seven that we had previously. Let's begin by putting these numbers in order. The lowest number is 11, then we have 15 twice, and then 21, 37, 41, and 59 twice.
Let's begin by calculating the mean. The mean, represented by the symbol x̄ (x bar), is equal to the sum divided by the number of numbers in the data set. So let's add up the eight numbers and then divide the sum by 8. We have 11 plus 15 plus another 15, and then 21, 37, 41, and two 59s. I got a sum of 258. If we divide that by 8, you'll find that the mean is 32.25. So that's the answer for the first part of the problem.
Now let's move on to the second part and calculate the median. As we said before, the median is simply the middle number. So let's eliminate the first and the last numbers, and then the next two, and the next two. Notice that this time we don't have one number in the middle; we have two. What should we do in a situation like this? You need to take the average of those two middle numbers, so add them up and divide by 2. 21 plus 37 is 58, and if you divide that by 2, you get 29. So 29 is the median of this data set.
Now what about the mode in this example? As we said before, the mode is the most frequent number in the data set. But this problem is different from the last one, because we have two numbers that appear twice: 15 and 59. So which of these is the mode? It turns out that they both represent the mode, so the modes are 15 and 59. What we have is something known as a bimodal data set, because there are two modes instead of
one; it's not unimodal. Now what about the range? There's nothing different about the range in this problem compared to the last one. It's simply the highest number minus the lowest number, so you can say H minus L. The highest number is 59 and the lowest number is 11, so 59 minus 11 is 48. So now you know how to find the mean, median, mode, and range of a data set.
Now let's talk about finding the quartiles and the interquartile range. What I'm going to do right now is make a number line with a beginning and an end, and let's say this number line represents our data. The lowest value is known as the minimum; the highest value in our data set is the maximum. Now we're going to break this number line into four parts, each holding about a quarter of the data. The first mark is Q1, the first quartile; the second is Q2, the second quartile; and then this is Q3, the third quartile. In terms of percentiles, the minimum is at the 0% level, Q1 is at 25%, Q2 is at 50%, Q3 is at 75%, and the maximum is at 100%. So you can see how the quartiles are related to each other.
Now how do we go about finding Q1, Q2, and Q3? Q2 is basically the median of the entire data set. Q1 is the median of the lower half of the data set, and Q3 is the median of the upper half. The interquartile range, represented by IQR, is the difference between Q3 and Q1. So once you find Q1 and Q3, you can calculate the interquartile range.
The next thing I want to mention is how to identify whether a number in a data set is an outlier. Here's what you need to know: a number is not an outlier if it's within this range. The lowest it can be is Q1 minus 1.5 times the IQR, and the highest it can be is Q3 plus 1.5 times the IQR. If you have a number that is within this range, it is not an outlier, but if you have a number in the data set that is outside of this range, then that
number is an outlier. So let's work on an example. Let's say we have the numbers 7, 11, 14, 5, 8, 27, 16, 10, 13, 17, and 16. Go ahead and identify Q1, Q2, and Q3, calculate the interquartile range (the IQR), and determine if there are any outliers in this data set. Feel free to pause the video and use what you know to try it.
As always, the first thing we should do is organize the data. The lowest number is 5, and then we have 7, 8, 10, 11. Perhaps it's best if we cross the numbers off as we go along. The next number is 13, and then 14, 16 (there are two 16s), then 17, and then 27.
Now what do you think is our next step in order to find the interquartile range and the three quartiles? The best thing to do at this point is to determine the median of the entire data set, which is going to be Q2. We can eliminate the first and the last number, and then the next two, until we get a number in the middle. Notice that 13 is in the middle, so 13 is going to be Q2. What I like to do is set this number aside for now and put a line between the left side and the right side, to separate the lower half of the data set from the upper half. The 13 is still part of the data, though, so just keep that in mind. That's our Q2 value.
Now, Q1 is the median of the lower half of the data set. What is the median of those five numbers? It's simply the middle number of those five, so Q1 is 8.
Now what is the median of the upper half of the data set? Notice that the middle number is 16, so that is Q3. We have a total of 11 data points, and that's why 13 is not included in the lower half or the upper half: if it were, one side would have five numbers and the other side would have six. That's why I chose to write it up here, so that the lower half is the same size as the upper half. They both contain five numbers.
Now let's go ahead and calculate the interquartile range, the IQR. We said it's the difference between the third quartile and the first quartile, so it's going to be 16 minus 8, which is 8. And that's how you can find the interquartile range of a data set.
Now what about the presence of any outliers? Looking at these numbers, do you think we have a number that really stands out, that doesn't belong? Right now, 27 appears to be very far off from all the other numbers. So do you think 27 is an outlier? Let's find out. Let's write down what we know. The presence of an outlier is based upon this range: Q1 minus 1.5 times the IQR to Q3 plus 1.5 times the IQR. We know that Q1 is 8, the interquartile range is also 8, and Q3 is 16. So the range goes from 8 minus 1.5 times 8 up to 16 plus 1.5 times 8. What is 1.5 times 8? 1 times 8 is 8, 0.5 times 8 is 4, and 8 plus 4 is 12. So this is going to be 8 minus 12, and this is 16 plus 12. Now, 8 minus 12 is negative 4, and 16 plus 12 is 28.
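The quartile steps from this example can be sketched in Python. This follows the "median of each half" approach described here, where the middle value is left out of both halves when the count is odd; other quartile conventions exist and can give slightly different answers. The function name `quartiles` is made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the median-of-halves method."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values")
    half = len(ordered) // 2
    lower = ordered[:half]       # lower half (middle value excluded when the count is odd)
    upper = ordered[-half:]      # upper half, same size as the lower half
    return median(lower), median(ordered), median(upper)

data = [7, 11, 14, 5, 8, 27, 16, 10, 13, 17, 16]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)           # 8 13 16 8
```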
So now, looking at what we have, is 27 an outlier based on this range? Because 27 is between negative 4 and 28, 27 is not an outlier. If we had 29, that would be an outlier. So now you know how to determine whether a point is an outlier.
Now let's talk about how we can create a box and whisker plot. The reason we want to talk about this now is that it's related to the values of Q1, Q2, and Q3. Typically, a box and whisker plot looks something like this, assuming there are no outliers. On the left is the lowest value, the minimum, and on the right we have the highest value, the maximum. This line here represents the value of the first quartile, which is the 25th percentile. In the middle we have Q2, the second quartile, which is the 50th percentile, and then this is Q3, the 75th percentile. That's the basic shape of a box and whisker plot.
Now what if we have an outlier? Let's say we have one to the right. Then this end of the whisker will no longer be the maximum; the outlier is shown as a point outside of the box and whisker plot. If it's on the left side, this end will no longer be the minimum, and the outlier will be a point there.
So let's work on an example. Let's say we have the numbers 16, 18, 28, 13, 50, 31, 25, 22, another 18, 23, 29, and 38. So we have 12 numbers. Using this data set, go ahead and find Q1, Q2, Q3, and the interquartile range, identify the presence of any outliers, and then, using all of that information, construct a box and whisker plot. Feel free to pause the video if you want to try that yourself.
The first step, as always, is to write the numbers in increasing order. We have 13, 16, 18, then there's another 18, and then 22, 23. Next is 25, and then 28, 29, and the last three are 31, 38, and 50.
Now what I like to do is break up the data into four quarters, or four sections. Because we have an even number of data points (a total of 12), we can put a line right in the middle. Now this is the lower half of the data set, and here we have the upper half. Let's determine the median for the entire data set. Because the count is even, we can't just immediately identify the median; it's going to be the average of two numbers. If we eliminate the first two, and then the next two, and so forth, we eventually get to these two numbers, 23 and 25. So what is the average of 23 and 25? If you add up those two numbers and divide by 2, you get the midpoint of 23 and 25, which is 24. So 24 is the second quartile; this is the Q2 value.
Now let's focus on the lower half of the data. What is the median of the lower half? Notice that we have six data points in that section, and because that's even, the median is going to be the average of the two middle numbers. So let's put a line there. The average of 18 and 18 is 18, so that's our Q1 value.
Now what about the median of the upper half of the data? Because we have six numbers here, we're going to put a line right in the middle, so we have three on the left and three on the right. The median is simply the average of these two numbers: the average of 29 and 31 is the number in the middle, which is 30.
And this is why I like to use these lines: now we have three numbers in each of the four sections of our data. So now that we know Q1, Q2, and Q3, what is the interquartile range? The IQR is the difference between the third quartile and the first quartile, so it's going to be 30 minus 18, which is 12 in this example.
Our next step is to determine whether we have any outliers in this problem. Remember, we need to create a range. The lowest point of it will be Q1 minus 1.5 times the IQR, and the highest point will be Q3 plus 1.5 times the IQR. Let's write down the information that is important: Q1 is 18, Q2 is 24, and Q3 is 30. Our minimum, the lowest value, is 13, and the maximum, our highest value, is 50. You may want to write that down in case we need to go back to it.
So let's plug what we know into this expression. Q1 is 18 and the IQR is 12 (that was 30 minus 18), so the lower end is 18 minus 1.5 times 12, and the upper end is 30 plus 1.5 times 12. Now what is 1.5 times 12? It's basically 12 plus half of 12, which is 6, and 12 plus 6 is 18. 18 minus 18 is 0, and 30 plus 18 is 48. So do we have any outliers? Are there any numbers outside of this range? It turns out that there is one: the maximum, 50, is not between 0 and 48, so it is an outlier.
Now, going back to our original data, I'm just going to rewrite it so you can see everything. At this point, let's go ahead and make a number line. 0 is going to be our lowest point, and we're going to go up to 50.
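The outlier check just worked out can be sketched in Python as well. This assumes Q1 and Q3 have already been found by hand, as in the example; the names `outlier_fences` and `find_outliers` are made up for illustration.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Values that fall outside the fences."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]

data = [16, 18, 28, 13, 50, 31, 25, 22, 18, 23, 29, 38]
low, high = outlier_fences(18, 30)      # Q1 = 18 and Q3 = 30 from the example
outliers = find_outliers(data, 18, 30)
print(low, high, outliers)              # 0.0 48.0 [50]
```

Running the same check on the earlier 11-number example, with Q1 = 8 and Q3 = 16, gives limits of -4 and 28 and an empty list of outliers, which agrees with 27 not being an outlier.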
So let's go by tens, and let's put a mark in the middle of each interval to represent the fives. Let's begin by drawing a box. We need to draw a box ranging from Q1 to Q3, so it's going to start at 18 (this is a rough estimate; it's not going to be perfect), and the end of the box will be at 30. Now, 13 is the minimum, which is approximately around this area, so that's the end of the left whisker. 50 is the maximum, but that's an outlier, so we're going to put a point at 50. 38 is the second-highest value, which is not an outlier, so that's going to be part of the box and whisker plot, where the right whisker ends; 38 is in this region, so we'll say it's over there. Now, Q1 is 18 and Q3 is 30, but we also need to mark where Q2 is. Q2 is 24, which is around here. So this is Q1 (that's 18), Q2, and Q3. The length of the box represents the interquartile range, which is 12; that's 30 minus 18. Here is the minimum, and this is the maximum, which is the outlier. That's how you can construct a box and whisker plot given a data set.
Now the next topic we need to talk about is skewness. Let's say we have this representation of our data, and notice that it is symmetrical. This line represents the median, and if you have data that is perfectly symmetrical, the mean (this is the sample mean) is going to be equal to the median. The box and whisker plot will look something like this: Q2 is going to be right in the middle of Q1 and Q3. Notice that the box is evenly divided, so the left side is the same as the right side, and the whisker lines are also equal in length. That represents a symmetric distribution.
Now what if the data is not symmetrically distributed? There are two possibilities: the data can be skewed to the right, or it can be skewed to the left. So which one would you say this particular shape represents? Notice that we have a tail that extends towards the right, so this particular
data, or this graph, is skewed to the right. Now what is the relationship between the sample mean and the median in this case? The median, the middle portion of the data, will be somewhere in this region, and the sample mean will be to the right of the median, since the data is skewed to the right. So the mean will be greater than the median in this case. By the way, whenever you have a shape that's skewed to the right, some textbooks will refer to this as a positive skew, and it makes sense, because positive numbers on a number line are on the right side.
Now you need to be familiar with the box and whisker plots for this type of distribution. Here's one example. Notice that the right side of the box is longer than the left side, which tells us that Q3 minus Q2 is greater than Q2 minus Q1. In that case, looking at the box and whisker plot, you can see that it's skewed to the right. Sometimes the two halves of the box may be equal in length, but the right whisker might be longer than the left one. So even if the two halves of the box are the same length, if the right whisker is longer, the data is also skewed to the right.
Now what if it's skewed to the left? In this case the graph is going to look something like this. Notice that the tail is on the left side, so we have a negative skew, or we can say that it is skewed to the left. The median will be somewhere in this region, and the mean is going to be to the left of the median, since the data is skewed to the left. So the sample mean is less than the median. Now how can we represent this using a box plot? Here's one possibility. In this case Q2 will be closer to Q3, so as you can see, the left side of the box is longer than the right side, and we can say that Q2 minus Q1 is greater than Q3 minus Q2. If the two halves of the box are equal in length, you can still tell that we have a negative skew if the left whisker of the box plot is longer than the right one. That's another indication that the data is skewed to the left.
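The mean-versus-median comparison described here can be tried out with a small Python sketch. It is only the rule of thumb from this section, not a formal skewness statistic, and the function name `skew_direction` is made up for illustration. The data is the 12-number set from the box and whisker example.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule-of-thumb skew check: compare the mean with the median."""
    avg, mid = mean(values), median(values)
    if avg > mid:
        return "right"      # positive skew: mean pulled towards the right tail
    if avg < mid:
        return "left"       # negative skew: mean pulled towards the left tail
    return "symmetric"      # mean equals median

box_plot_data = [13, 16, 18, 18, 22, 23, 25, 28, 29, 31, 38, 50]
print(skew_direction(box_plot_data))   # right
```

For that data set the mean is about 25.9 and the median is 24, so the check reports a right skew, which lines up with the outlier at 50 stretching the right side of the plot.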
Now there are some other things that you need to know if you're going to take a statistics course, and one of them is how to create a dot plot. Let's say you have the numbers 5, 8, 3, 7, 1, 5, 3, 2, 3, 3, 8, 5. With this information, how can we construct a dot plot? We can begin by drawing a number line: 0, 1, 2, 3, 4, and so forth. We can stop at 8, since 8 is the highest number. The first number is a 5, so all we need to do is draw a dot above the number 5. Let's put the dots in one at a time. The next number is an 8, so we draw a dot at 8. Then it's a 3, so let's put a dot there. Next we have a 7, then a 1, and then a 5. Notice that this is a second 5, so all we need to do is draw another dot above the first one. Then it's a 3, a 2, another 3, and another 3, and then 8, and then 5. That's how you can make a dot plot.
Now, using this dot plot, which number is the mode? If you recall, the mode is the number in the data set that occurs most frequently, so in this case it's the number with the most dots. The mode for this data set is 3.
Now let's talk about how we can make a stem-and-leaf plot. Let's say we have the numbers 4, 9, 13, 13, 17, 21, 36, 38, another 38, and then 56.
How can we make a stem-and-leaf plot with this data? The first thing we need to do is write two columns: on the left is the stem, and on the right is the leaf. The first number is 4, so for the stem we write 0 and for the leaf we put 4. The next one is 9, which we represent as 09, so we have 0 on the left and we put 9 on the right. The next number is 13, so the first digit, 1, goes in the stem column and the second digit, 3, goes in the leaf column. We have another 13, so all we have to do is add another 3. For 17 we need 1 and 7; we already have the 1, so we write the 7 on the right side. For 21 we put a 2 in the stem column and a 1 in the leaf column. Next we have 36, so we need a 3 in the stem column and a 6 in the leaf column. Notice that we have 38 twice, so we add two 8s to the leaf column. And finally 56: we don't have anything in the 40s, so we write a 4 in the stem column but don't put anything next to it, and for 56 we write a 5 in the stem column and a 6 in the leaf column. That's how you can make a stem-and-leaf plot. It's always good to have a key, so we could say that 2 | 1 represents 21. That way, if someone looks at the stem-and-leaf plot, they know what you mean.
Let's try another example. Let's say we have the numbers 78, 85, 89, 92, 106, 107, and 119. Go ahead and make a stem-and-leaf plot with those numbers. For the first number, 78, we write a 7 in the first column and an 8 in the second column. Next we have 85 and 89, so we put an 8 for the first digit, and then 5 and 9 for the second. Then it's 92, so we write 9 and 2.
Now for 106 we put a 10 in the stem column and a 6 in the leaf column, and for 107 we just add a 7 there. There are usually no commas between the leaves, so let's get rid of that. Finally, for 119 we write 11 in the stem column and 9 in the leaf column. For the key, for example, 9 | 2 represents 92, and 10 | 6 would represent 106. A key matters because sometimes you can have decimal values. For instance, let's say we have 1.2, 1.6, 1.8, 2.1, 2.3, 2.3, and 2.5. We can construct the stem-and-leaf plot like this. We start with a stem of 1, and to write 1.2 we just put a 2 in the leaf column. For 1.6 we put a 6 in the leaf column, and for 1.8 just an 8. Now we can move on to the twos: we have 2.1, 2.3, 2.3, and 2.5, so the leaves are 1, 3, 3, and 5. That's how you can make a stem-and-leaf plot using decimal values. We can say 1 | 6 represents 1.6 in this particular example.
Now the next thing we're going to talk about is something called a frequency table. Given a data set, how can we make a frequency table? Let's say we have the numbers 5, 9, 8, 7, 8, 12, 9, 8, 10, 8, 9, 7. We're going to use two columns: the first column will represent the number, and the second column will represent the frequency. Let's put down what we have. The first number is a 5. How many 5s do we have? There's only one, so the frequency is 1. The next number is a 7, and notice that we have two 7s, so the frequency is 2. Next is 8; we have a total of four 8s, so that gives us a frequency of 4. After that we have 9, and I've spotted three 9s. Then there's one 10 and one 12. That's a simple way to make a frequency table.
Now here's another question for you: how can we use the frequency table to calculate the sample mean? How can we calculate the average without just adding all of those numbers up and dividing by the number of data points in the set?
Let's add another column, and we're going to call this the sum. We have one 5, so 5 times 1 is 5. We have two 7s, and 7 plus 7 is 14. We have four 8s, and 8 times 4 is 32. 9 times 3 is 27, 10 times 1 is 10, and 12 times 1 is 12. Now we add up the sum column to get the total sum: 5 plus 14 plus 32 plus 27 plus 10 plus 12 is 100. We also add up the frequency column, which gives us the total number of values: 1 plus 2 is 3, plus 4 is 7, plus 3 is 10, plus the last two 1s is 12. So we have a total of 12 numbers. The mean is the sum divided by the total number of points in our set, so it's 100 divided by 12, which is 8.3 repeating. That's how you can use the frequency table to calculate the mean of a data set.
Now the next thing you need to know is how to create a histogram. A histogram looks like a bar graph, but unlike a bar graph, a histogram has its bars connected to each other. Here's an example of a histogram on the left side, and on the right side I'm going to draw a bar graph. As you can see, they're very similar, but the bars in the histogram are adjacent to each other; there's no space in between. So how do we go about taking a data set and making a histogram? Let's say we have the test scores of students in a typical class: 65, 72, 93, 68, 76, 98, 84, 85, 79, 88, 90, 82, 83, 87, and 78. The first thing we need to do is create a frequency distribution table. This time, rather than describing the frequency of each number, we're going to break the data up into categories, or classes. On the left side we'll have the grade, and on the right side the frequency. Now I'm going to
categorize the grades in intervals of 10. A D would be 60 to 69, a C would be 70 to 79, a B would be 80 to 89, and an A is going to be 90 to 100. Those are our four categories, or four classes. Now, how many students received a grade between 60 and 69? Notice that there are two students; we have the grades 65 and 68, so the frequency is 2. How many students received a C on their exam, a grade somewhere between 70 and 79? We have one, two, three, four, so four students got a C. What about a B? We have 84, 85, 88, 82, 83, and 87, so I counted a total of six. What's left over are those who got an A: a 93, a 98, and a 90, so three students.
Now that we have our frequency distribution table, we can make a histogram. On the y-axis we plot the frequency, and on the x-axis we put the grades. The grades will be marked at 60, 70, 80, 90, and 100, because the classes are separated by intervals of approximately 10. The highest frequency is 6, so let's go up by ones: 1, 2, 3, 4, 5, 6. Let's plot the first bar. Between 60 and 69 (which is close to 70), the frequency is 2, so it's going to look like that. Between 70 and 79, four students got a grade in that region. Between 80 and 89, six students fall in that category, and between 90 and 100, only three students received an A. Let's label the grades: two students got a D, four students got a C, six students received a B, and three students received an A. That's how you can create a histogram from a data set like this using a frequency distribution table.
Now there's something else that we need to go over, and that is making a table with the frequency, the relative frequency, and the cumulative relative frequency. Let's say we have the numbers 2, 3, 5, 3, 6, 8, 7, 8, 3, 3, 5, 3, 7, 3, 8, 5, 2, 7, 8, 3. We're going to have four columns. The first column is going
to be the value, the second column will be the frequency, the third column is going to be the relative frequency, and the last one is going to be the cumulative relative frequency. The lowest value in our list is 2, and notice that we have two 2s, so the frequency for that number is 2. Next we have 3, and we have one, two, three, four, five, six, seven 3s, so that's the frequency. The next number in the list is 5; we have one, two, three 5s. We only have one 6. Next is 7, and there are one, two, three 7s. And finally we have some 8s: one, two, three, four 8s. Now let's take the sum of the frequency column: 2 plus 7 is 9, plus 3 is 12, plus 1 is 13, plus 3 is 16, plus 4 is 20. So we have a total of 20 numbers in our set.
Now how do we calculate the relative frequency? For each entry in that column, take the frequency and divide it by the total number of values in the data set. So the relative frequency is basically the frequency divided by n. For the first one it's going to be 2 over 20, which is 1 out of 10, and so that's 0.10. For the next one it's 7 divided by 20, which is 0.35, and for the one after that it's 3 divided by 20, which is 0.15. Then it's 1 divided by 20.
That's 0.05. 3 over 20 is 0.15 again, and then 4 out of 20 is 0.20. If you add up all of these numbers, you should get 1.
Next we have the cumulative relative frequency. We start with 0.10, and then we add these two numbers: 0.10 plus 0.35 is 0.45. Now 0.45 plus 0.15 is 0.60, and if we add 0.60 and 0.05, that's 0.65. Adding 0.65 and 0.15 gives us 0.80, and then 0.80 plus 0.20 gives us 1. That's how you can complete the cumulative relative frequency table given a data set.
Now let's talk about how we can use this information. Using this table, what is the value of the 60th percentile? What we need to do is look at the cumulative relative frequency. The 60th percentile ends here, at the row for 5. Notice that the entry is exactly 0.60, which corresponds to 60%. When that happens, to find the 60th percentile you need to average this value with the next one, so you have to do 5 plus 6 divided by 2. The average of 5 and 6 is 5.5, and that's the 60th percentile. Now what about the 80th percentile? Notice that we have exactly 0.80 in the cumulative relative frequency column, so we average these two numbers: the average of 7 and 8 is 7.5.
Now what if the percentile is not listed in the cumulative relative frequency table? For instance, let's say we want to find the 20th percentile. What value corresponds to that? The 20th percentile is between 0.10 and 0.45. It's important to understand that after 0.10 you move past the value 2 and into the 3s, and after 0.45 you move from the 3s to the 5s. So the 20th percentile, because it's past the 2s but still within the 3s, is going to be 3. There are no values in between if you look at the data that we have, so the 20th percentile falls on this number. To explain it better, it's important to understand that between 0 and 0.10 the value is 2.
Between 0.10 and 0.45 the value is 3. If your percentile falls between 0.45 and 0.60 (not including 0.45 and 0.60 themselves), then the value is 5. It's going to be 6 if it's between 0.60 and 0.65, and 7 if it's between 0.65 and 0.80. So let's say we want to determine the 75th percentile. The 75th percentile is between 0.65 and 0.80, so because it's more than 0.65, we're not going to get the value 6; we're going to get 7. That's the 75th percentile.
Now let me help you see this visually so that it makes more sense. Let's begin by arranging the numbers in increasing order. We have two 2s, a total of seven 3s, three 5s, one 6, three 7s, and four 8s. I'm writing these in pairs, and you'll see why. We have a total of 20 numbers, and if you take 20 and divide it by 2, you get 10 equal parts. So this is the 10th percentile, here's the 20th percentile, the 30th, and so forth, and this would be the 100th percentile; there's nothing higher than that.
The first thing we went over was the 60th percentile, which is here. Notice that we have a 5 on the left and a 6 on the right, so we need to average 5 and 6, and that's why we said the 60th percentile was 5.5. The second one was the 80th percentile, which is here. Notice that it's between two different numbers, 7 and 8, so if we average 7 and 8 we get 7.5. After that we talked about the 20th percentile, and the two numbers that it's between are identical, so the 20th percentile is just 3. The last one was the 75th percentile, which is between the 70th and 80th percentile marks.
The two numbers there are both 7, and so the 75th percentile has to be 7. So now you can visually see the values that correspond to the different percentiles. If you want to, you can mark the 0th percentile as well; I forgot to do that. But that's basically it for this video. Hopefully I gave you a good introduction to statistics. There's a lot of other material that you'll learn in this course, but these are just some of the basics. Thanks again for watching.
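The last table in the video (frequency, relative frequency, cumulative relative frequency) and the percentile lookups can be sketched in Python. The data is rebuilt from the tallies stated in the video (two 2s, seven 3s, three 5s, one 6, three 7s, four 8s). The `percentile` helper follows the pairing rule shown on the number line: if the percentile lands exactly between two sorted values, average them; otherwise take the value it falls on. That is only one of several percentile conventions, and both function names are made up for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Rows of (value, frequency, relative frequency, cumulative relative frequency)."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]                  # running total of frequencies
        rows.append((value, counts[value], counts[value] / n, running / n))
    return rows

def percentile(values, p):
    """Value at the p-th percentile (p is a whole number from 0 to 100)."""
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(p * n, 100)             # how many values sit at or below the cut
    if remainder == 0 and 0 < k < n:
        return (ordered[k - 1] + ordered[k]) / 2  # cut falls exactly between two values
    return ordered[min(k, n - 1)]                 # cut falls on a single value

data = [2] * 2 + [3] * 7 + [5] * 3 + [6] + [7] * 3 + [8] * 4
table = frequency_table(data)
for row in table:
    print(row)
print(percentile(data, 60), percentile(data, 80))   # 5.5 7.5
print(percentile(data, 20), percentile(data, 75))   # 3.0 7.0
```

The cumulative column is computed from the running count divided by n rather than by adding the relative frequencies one by one, which avoids small floating-point drift.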
