In this lesson, we're going to talk about some statistics for describing, exploring, and comparing data. The first thing we're going to talk about is some ways to measure the center of data. So if you have a lot of data, you want to know where the middle of that data is. Before we begin, let's talk about summation notation. You probably saw this in an algebra class, but we're going to review it really quickly.
So if you have this notation, a capital sigma with i equals 1 underneath and 5 on top, it tells you to start at the number 1, and then for every number from 1 through 5, and we're specifically talking about integers here, add all those numbers together. So in other words, the summation as i goes from 1 to 5 of i would be 1 plus 2 plus 3 plus 4 plus 5. So i represents these numbers going from 1 to 5, and the summation sign here, this capital sigma, means add all these together. If we have something a little more complicated, now we're going to start at 4. So i is going to start at 4, and it's going to end at 7, and we're going to hit every integer between those.
But instead of just adding those numbers up like we did in the first example, we're going to take each number, so starting with 4, square it, and then add 1 to it. And we do that for every i from 4 to 7. So starting with 4, I'll have 4 squared plus 1. The next number will be 5 squared plus 1, the next is 6 squared plus 1, and the last is 7 squared plus 1. And we're adding all these together because this is a summation. To finish this one off, we go with PEMDAS, which you may remember from algebra: parentheses, exponents, multiplication, division, addition, and subtraction.
So now we have 16 plus 1, 25 plus 1, 36 plus 1, and 49 plus 1. Let me add inside each of those parentheses: 17, 26, 37, 50. And finally, adding all those up, you get 130. Let's look at another example. In this case, notice I've changed my variable here. Instead of i, I'm going to use k.
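As a quick check on the first two sums, here is a minimal Python sketch. The variable names are just illustrative, and the only thing to watch is that Python's `range` stops one short of its upper bound.

```python
# Sum of i for i = 1, 2, ..., 5 (range's upper bound is exclusive, so use 6).
first = sum(i for i in range(1, 6))

# Terms of the second sum: i squared plus 1, for i = 4, 5, 6, 7.
second_terms = [i**2 + 1 for i in range(4, 8)]
second = sum(second_terms)

print(first)         # 15
print(second_terms)  # [17, 26, 37, 50]
print(second)        # 130
```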
It doesn't matter what variable you use. It doesn't change the problem. So now k is going to go from 1 to 3. First of all, I'm going to multiply 4 times k, subtract 1, and then square that whole thing. Add all those up, and then finally the last step will be to multiply by 5. So in other words, I do all the summation first, and then whatever is multiplying on the outside here gets multiplied in at the end.
So I would start this out: k is going to go from 1 to 3, so I need a 2 in there. Each time I'm going to have 4 times k, minus 1, all squared.
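Written out, the expansion being described would look something like this (a reconstruction from the spoken description, with the 5 kept outside the brackets until the end):

```latex
5\sum_{k=1}^{3}(4k-1)^2
  = 5\left[(4\cdot 1-1)^2 + (4\cdot 2-1)^2 + (4\cdot 3-1)^2\right]
```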
So again, going with PEMDAS, let's figure out what's inside these innermost parentheses first. So 4 times 1 gives me 4, 4 times 2 gives me 8, 4 times 3 gives me 12. Next I'll subtract 1 in each of those parentheses, which gives me 3, 7, and 11, and then PEMDAS tells me to do the exponents next: 9, 49, and 121. Then I've got to add all these up inside the outside parentheses, so I've got 179. 5 times 179 gives me 895. So that's a quick crash course in summation notation. Now we're going to see how we're going to need this in measuring the center of data and in the next section.

So, measures of center. A measure of center is a value at the center or middle of a data set. There are several ways to calculate a measure of center, and what we're looking for will determine which measure of center we'd use. The one most people are familiar with, and the one that comes up most often, is the mean. The mean is also more formally called the arithmetic mean, or less formally known as the average.
So the mean of a data set is found by adding the data values together and dividing by the number of data values. So suppose these are the cents portions of seven PG&E bills. In other words, if my PG&E bill last month was $71.12, this is the cents portion. If I wanted to find the mean, or the arithmetic mean, of these seven cents portions, I would simply add those cents portions together and divide by the total number of data values I have. So divide by seven in this case.
That gives me 105 sevenths, which is a fine number. We're allowed to leave it like this, or you can convert it to a decimal, in which case we would get 15. I guess we shouldn't leave it as 105 sevenths, because that's a fraction that can be reduced.
But in general, in this class, you are allowed to leave something as an improper fraction, where the numerator is larger than the denominator. So this is the first place we're going to use sigma notation, or summation notation, and we're going to be a little less formal with it.
So in this case, the x's are going to represent all the data in my data set. And the n will represent the total number of data in my data set. But I'm not going to write it as i equals 1 to 7, which is the more official way to write it.
We're going to be a little less formal and just write it with this sigma symbol. And we're going to assume that means add up all the data in the data set.
There's a shorthand version of this. Whenever we talk about the mean of a sample set of data, we can write it as x-bar, and we'll see this a lot. But all x-bar means is add up all the data and divide by n, which is the same thing as the mean. So you'll see x-bar a lot in this course. That just means the mean of a data set.
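As a small sketch of what "sigma x over n" means in practice, here it is in Python. The function name `mean` and the sample values are made up for illustration; they are not the data from the lesson.

```python
def mean(values):
    """Arithmetic mean: add up all the data values and divide by n."""
    n = len(values)          # n is the number of data values
    return sum(values) / n   # sigma x over n

# Made-up sample, for illustration only.
sample = [3, 8, 5, 12, 7]
x_bar = mean(sample)
print(x_bar)  # 7.0
```

Python's standard library also has `statistics.mean`, which does the same calculation.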
So here's some data in this column, and it looks like we've got seven numbers again. So n will be seven in this case. I have three questions here.
I say calculate sigma of x over n, calculate x bar, and calculate the mean of the numbers. But of course, all three of these questions are asking the same exact thing. These are all questions asking you to find the mean of the data. So really, this is just one question, three different ways to write it. So let's go with the first one, calculating the sum on top, divided by n.
n we said was 7. Otherwise, we just have to add up the data on top of the fraction and then calculate this. So I could leave my answer as 30 sevenths. We're going to be more partial to decimals in this course, but there's nothing wrong with leaving it as an improper fraction. Now, I use this approximately-equal symbol to suggest that 30 sevenths isn't exactly this decimal. The decimal for 30 sevenths actually goes on longer. However, we're going to round most of our data to four or fewer decimal places.
So we say it's approximately 4.2857. Another measure of the center of some data is the median. The median of a data set is found by arranging the values in order and choosing the value in the middle.
So suppose this is our data set. I'm going to first rearrange everything in order. So the second row is the same data, just rearranged in order from lowest to highest.
Now to find the one in the middle, I just have to look in the very middle of that data set. In this case, the median will be 15. So it's true that the median of the first row is 15. It's just harder to see when the data is not arranged in order. Sometimes we'll use the symbol x tilde as the median.
However, this is not as common. So we'll use it in this section, in this lesson, but we won't see it a whole lot later on. Suppose you had an even number of data in your data set. So now there is no data in the middle.
And notice I've already ordered this data. There's no number right in the middle. So what do you do?
Well, take the two numbers that are in the middle, 58 and 83 in this case, and find their arithmetic mean. So then we'll call the median the mean of the two middle data. So in this case, it'll be 70.5.
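Both cases, an odd and an even number of data values, can be sketched in a few lines of Python. The function name `median` and both samples are made up for illustration; they are only chosen so the answers match the 15 and 70.5 above.

```python
def median(values):
    """Sort the values, then take the middle one, or the mean of the
    two middle ones when there is an even number of values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [19, 7, 15, 23, 11]          # made-up; middle value after sorting is 15
even_sample = [12, 40, 58, 83, 90, 97]    # made-up; middle pair is 58 and 83

print(median(odd_sample))   # 15
print(median(even_sample))  # 70.5
```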
The third measure of center is the mode. The mode is the value in your data that occurs most frequently. So suppose we have this as our data set.
What is the mode? In other words, what is the value that occurs most frequently? So clearly 53 comes up four times and nothing else comes up that often.
So the mode in this case is 53. What happens if you have more than one number occurring most often? So look at the first data set here. We have 2 that comes up twice, and 3 comes up twice, but 4, 5, and 9 only come up once.
So the mode is 2 and 3, and this is what we call bimodal. Similarly, if we had three numbers come up most often, so in this case, 1 comes up twice, 3 comes up twice, 4 comes up twice, and 5 only comes up once, we call this trimodal or multimodal. So anytime there are more than two numbers coming up most often, we'll call it multimodal. If no number comes up most often, in other words, if all numbers come up only once, we say there's no mode.
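Here is one way these mode rules could be sketched in Python. The helper name `modes` is made up, the first two lists are rebuilt from the counts just described (the original ordering may differ), and the third list is invented to show the no-mode case.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s). An empty list means no mode,
    following the convention that a mode must occur more than once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 2, 3, 3, 4, 5, 9]))  # [2, 3]     bimodal
print(modes([1, 1, 3, 3, 4, 4, 5]))  # [1, 3, 4]  trimodal / multimodal
print(modes([4, 8, 15, 16, 23]))     # []         no mode
```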
Mid-range is another measure of center. Consider this data set. The mid-range is going to be the average between the largest data value, the maximum, and the smallest data value, the minimum. So in other words, it's the mean or the arithmetic mean of the largest and smallest data values.
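That definition is short enough to sketch directly in Python. The function name `midrange` and the values below are made up for illustration, chosen only so that the smallest value is 10 and the largest is 20.

```python
def midrange(values):
    """Mean of the largest and smallest data values."""
    return (max(values) + min(values)) / 2

# Made-up values with minimum 10 and maximum 20.
sample = [12, 10, 17, 20, 14]
print(midrange(sample))  # 15.0
```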
So it looks like in this case, 20 is our largest data value and 10 is our minimum, or smallest, data value. So the mean of those is 15.

Here's a tricky example. Suppose we list the M&Ms in a fun-size pack. One is going to represent red, two is going to represent green, three is going to represent yellow, four brown, five blue, six orange.
Which measure of center gives meaningful information in this situation? So here's our fun-sized pack of M&Ms. And so again, two means that a green M&M came up, one means a red M&M came out of the bag, one is another red, six was an orange, etc. What measure of center gives meaningful information in this situation?
Well, consider finding the arithmetic mean, or the mean of all this data. We get 3.5714 if we add up all the data and divide by the total number of data. What does the mean tell you about this?
Well, if you think about it, we arbitrarily decided 1 would represent red and 6 would represent orange. We could have just as easily switched those or switched any of these numbers, or we could have called red 10 and green 27 and yellow 4. We arbitrarily chose these numbers. So the mean tells us nothing, because the numbers we chose to represent the colors were arbitrary. So we're not going to use the mean. How about the median?
4 is the number that was in the middle. But again, since this was arbitrary, the median doesn't tell us anything. Just because 4 represents brown, we could have just as easily had 4 represent green or 4 represent orange. So the median doesn't necessarily tell us anything interesting.
How about the mode? We have 1, 5, and 6 as the modes. And this does tell us something interesting: 1, 5, and 6 came up the most. So we can tell that, out of this pack of M&Ms, red, blue, and orange came out the most.
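To see why the mode survives an arbitrary coding while the mean does not, here is a rough Python sketch. The pack contents are made up, not the pack from the lesson, and the second coding uses the red 10, green 27, yellow 4 idea with invented codes for the other colors. The helper name `modal_colors` is hypothetical.

```python
from collections import Counter
from statistics import mean

# Coding from the lesson.
coding_a = {"red": 1, "green": 2, "yellow": 3, "brown": 4, "blue": 5, "orange": 6}
# A different, equally arbitrary coding (brown, blue, orange codes are invented).
coding_b = {"red": 10, "green": 27, "yellow": 4, "brown": 1, "blue": 2, "orange": 3}

# Made-up pack contents, for illustration only.
pack = ["green", "red", "red", "orange", "blue", "blue", "orange", "brown", "yellow"]

def modal_colors(colors, coding):
    """Encode the colors as numbers, find the modal codes, decode back to colors."""
    codes = [coding[c] for c in colors]
    counts = Counter(codes)
    top = max(counts.values())
    decode = {code: color for color, code in coding.items()}
    return sorted(decode[code] for code, n in counts.items() if n == top)

mean_a = mean([coding_a[c] for c in pack])
mean_b = mean([coding_b[c] for c in pack])

print(mean_a == mean_b)              # False: the mean depends on the coding
print(modal_colors(pack, coding_a))  # ['blue', 'orange', 'red']
print(modal_colors(pack, coding_b))  # ['blue', 'orange', 'red']
```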
If we switched the numbers representing them, then our mode would switch, but whatever number was representing red would still come up as the mode. So the mode does tell us something interesting, namely what colors come up the most. So here's a general rule of thumb for rounding when you're calculating the mean and the median. In general, use one more decimal place than is in the set of values.
So in this case, money spent at a convenience store, we have two decimal places in the first data, one in the second, one in the third, two, two, one, and two decimal places. That means if I calculate the mean or the median for this data set, the general rule of thumb here is to use three decimal places. So suppose I wanted to calculate the mean, I'm going to add up all the data and divide by the number of data.
So if I add up all my data, I get 44.11, and there are seven numbers in my data set, so I'll divide by seven. That gets me 6.3014285714, etc., etc., but this general rule is saying round to three decimal places. So I'm going to leave it at 6.301. Suppose we had a frequency distribution, but we didn't have the original data set.
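Before moving on, the rounding step in the convenience store example can be checked with a short Python sketch. It uses only the total and the count stated above, since the individual amounts are not listed here.

```python
total = 44.11   # sum of the seven amounts, as stated above
n = 7           # number of data values

x_bar = total / n

# The data have at most two decimal places, so the rule of thumb says three.
rounded = round(x_bar, 3)
print(rounded)  # 6.301
```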
If we don't have the original data values, we still can calculate an estimate of the mean. So suppose this is our frequency distribution and we want to calculate the mean of this frequency distribution. So as an estimate of the mean of the original data set, since we don't have those original data values, we're going to assume that the data values are the class midpoints.
So in other words, I know that three of my data values were between 0 and 4. Since I don't know exactly what those data values are, I'm going to assume they were at the class midpoint, or 2. So in other words, I'm assuming that I had three values in my data set that were the number 2, and those are the values between 0 and 4. So the numerator of my mean will be calculated by multiplying the frequencies times the class midpoints and adding those products up. So in this case, x will represent the class midpoint and f will represent the frequency.
So again, you only use this formula if you don't have the original data and all you have is the frequency distribution. Then, to find the denominator, you simply add up the frequencies. This is equivalent to the number of data values that were in the original data set. So here, I'm going to take 3, which is my first frequency, times the class midpoint, 2, of my first class.
And then 5 times 7, 4 times 12, 1 times 17, and 0 times 22. That gives me my numerator. My denominator is simply 3 plus 5 plus 4 plus 1 plus 0. That comes out to 106 over 13, which is about 8.2, rounded to one decimal place.
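The same estimate, sigma of f times x over sigma of f, can be sketched in Python using the frequencies and class midpoints read off above. The variable names are just illustrative.

```python
frequencies = [3, 5, 4, 1, 0]     # f: how many values fell in each class
midpoints = [2, 7, 12, 17, 22]    # x: class midpoints standing in for the data

# Numerator: multiply each frequency by its class midpoint and add them up.
numerator = sum(f * x for f, x in zip(frequencies, midpoints))
# Denominator: the total number of data values.
denominator = sum(frequencies)

estimate = numerator / denominator
print(numerator, denominator)  # 106 13
print(round(estimate, 1))      # 8.2
```

Because every value in a class is treated as if it sat at the midpoint, this is an estimate of the mean, not the exact mean of the original data.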