What's up my stat stars, welcome to AP Statistics Unit 1 Summary Video. In this video, we're going to go all the way through Unit 1, exploring one variable data, talking about all the major themes and all the major concepts to make sure that you are ready either for your Unit 1 test or to help prepare you for the AP test in May. Now, before we begin, I want to mention two really, really important things.
First, this is just a review video. We're not going to cover every single teeny tiny topic in extreme detail. That's what a class was for.
The purpose of this video is to take everything that your teacher threw at you the last couple weeks and put it into one digestible video that covers the big major themes of it all. Now, if you are looking for much more specific videos that cover every single topic in Unit 1 and all the other units of AP Statistics, please check out my YouTube channel. I've got videos for every single topic explaining everything in much more detail than this review video.
Or if you're really looking for a lot of great information that can help you prepare for your unit test or the AP test, please check out the Ultimate Review Package using the link in the description. At the Ultimate Review Package, you can get a free trial to take a look at every single unit. You get study guides, practice sheets, practice multiple choice, and you also get these awesome review videos as well. And the best part is you even get answer keys to those study guides and those practice sheets to make sure that you're doing everything okay.
At the very, very end, you can even do a full-length practice AP exam. And the second thing I want to mention is, yes, you heard me right, study guide. While you're at the Ultimate Review Packet, please make sure to download my study guide for Unit 1. I also got study guides for all the other units. And you can use that study guide while you're watching this video. You could pause, fill parts in, hit play, pause, fill some more parts in, or you could watch the whole video and fill it all in at the end.
But the best thing is you get access to that answer key, so you can check all the answers at the end and make sure that you're doing everything okay. And if you want even more practice to prepare you for the exam, you can also check out my practice sheets.
All right, let's get into Unit 1. Unit 1 is all about exploring one-variable data. We're really going to learn how to analyze one variable, or how to take one variable and compare it across multiple samples or multiple groups. Now, listen, understanding how to analyze data is super important.
It may seem kind of boring and not that fun at the beginning, but what we do with analyzing data later on in statistics is so crucial to the big important concepts that are probably going to be the most challenging for you. So if you understand how to analyze data now, it's going to pay off big time at the end when we do some really important stuff. Now, listen, this unit is really broken down into two things: categorical data and quantitative data.
And I'm not going to lie to you. Categorical data is way easier, way faster, way shorter. In fact, only a small percentage of this entire unit is even about categorical data.
A much, much bigger part of the unit is about quantitative data. But regardless of categorical or quantitative variables, there's something really important that you need to understand. Anytime you select a sample and from that sample you collect data, any summary information that you learn from that sample data is called a statistic. Whereas if you collect data from an entire population, then anything you learn from that population is called a parameter. It's really easy to memorize these things because, here's the idea.
Statistics starts with an S and so does samples and statistics come from samples. Parameter starts with a P and parameters come from populations, which also start with a P. So it's pretty easy to remember that concept.
Now, we collect data from individuals, and individuals can be, well... honestly, anything. It could be a person, a chair, a tree, a lake, a state, a country; it could even be a day, for that matter.
Really, an individual could be anything. Now, here's the most important part. A variable is any characteristic that can change from one individual to another.
So if you just think about a person, or maybe multiple people, think about any characteristic that can change from one to another. Eye color, hair color, weight, height, just to name a few. Now, the reason why we like analyzing data so much is because individuals vary.
If individuals didn't vary, well, then honestly, we wouldn't even need this course, and the world would be a pretty boring place. Now, here's the deal with variables. There's only two types.
All variables in the world can be categorized into two types, either categorical variables or quantitative variables. A categorical variable takes on values that are category names or group labels, like eye color or hair color, whereas a quantitative variable takes on numerical values that are either measured or counted, like the weight of a frog or how many candies are in a bag. To try to keep it really simple, a categorical variable value is simply going to be a word, whereas a quantitative variable value is typically going to be a number.
Now, there are a couple exceptions to that rule, namely zip code. Zip code is a number, but it's not measured and it's not counted. That doesn't make it quantitative.
A zip code is simply a number that tells your mail where to go, which means it simply puts your mail into a specific category for your city's post office. So that's why zip code is one of those weird exceptions that's a number, but it's technically a categorical variable. But to be honest, in most cases, it's pretty straightforward. Categorical variables are words, quantitative variables are numbers.
Let's start off with categorical data, because it really is shorter and much faster to talk about. There's just not a whole lot there. Now, let's say that we take a sample of 89 lemurs, and one of the variables that we want to analyze from those lemurs is the type of lemur it is.
Whether it's a sifaka, an aye-aye, a ring-tailed, or a mouse lemur, and I'm probably pronouncing some of those wrong. But again, those are all words, which makes this a categorical variable. Now, if we just have all that data collected, it's probably going to be a really long, boring list of all those different categories.
So the first thing we'd like to do is organize that into what we call a frequency table. Frequency is just a fancy word for counts. Here we list each of the categories, and we simply count how many of the lemurs fit into each of those categories.
Now we could also take a look at what's called the relative frequency. The relative frequency is just the proportion of lemurs that fell into each category. So for example, we take the number of ringtail lemurs that we have, we divide by 89 and we get the proportion. Now keep in mind that...
A relative frequency, a percentage, or a rate all tell us the exact same information that a proportion does. They're all really basically the same thing. However, we really do like using proportions. I'm not trying to say that we're never going to use frequencies at all, but we like relative frequencies a lot because when we are comparing two samples, especially two samples that have different sizes, using relative frequencies is a much more fair way to compare them. When it comes to making graphs of categorical data, we really have two options.
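The frequency and relative frequency idea can be sketched in a few lines of code. The category counts below are hypothetical; the video only tells us the sample has 89 lemurs, not the exact breakdown.

```python
# Hypothetical counts for the sample of 89 lemurs (the video doesn't give
# the actual breakdown, just the sample size).
counts = {"sifaka": 20, "aye-aye": 9, "ring-tailed": 35, "mouse": 25}
total = sum(counts.values())  # 89 lemurs in the sample

# Relative frequency = frequency divided by the sample size.
relative = {kind: n / total for kind, n in counts.items()}

for kind in counts:
    print(f"{kind:12s} frequency={counts[kind]:3d} relative={relative[kind]:.3f}")

# Relative frequencies are proportions, so they add up to 1.
print(round(sum(relative.values()), 10))  # 1.0
```

Because relative frequencies are proportions of each sample's own total, they let you compare two samples fairly even when the sample sizes differ.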
pie charts, or what some people call circle graphs, and bar graphs. Now, a bar graph could also be turned into what's called a relative bar graph. So instead of the heights of each bar showing the frequency or the number of lemurs that fall into each category, it simply shows the proportion. Whereas a circle graph only shows proportions because the idea is each slice is a proportion of the whole circle. Now, when we look at a pie chart or a bar graph, one thing that you might be asked to do is to describe the distribution of that variable.
Now, what is a distribution? Because that's a really important word for this entire unit. A distribution of data is basically what values the data takes on and how often it takes on those values. So if we're asked to talk about the distribution of categorical data, really all we can say is maybe which category had the most, which category had the least, and then we could even mention all the different categories that are available to us, but there's not a whole lot we can say. Oftentimes, the best thing we can do with either a bar graph or a pie chart is compare
two different samples. So for example, here we see a pie chart for the lemurs in forest one and a pie chart for the lemurs in forest two. And because pie charts are based on proportions, it's really easy to see some important differences. Like we notice in forest two, there's a much higher proportion of sifakas than there is in forest one.
And we simply know that by seeing that the piece of that pie is much bigger in forest two than in forest one. Now, what's going to be expected of you on the AP exam when it comes to categorical variables is really, again, like I said, just describing the distribution, reading a bar graph, and noticing whether it's a relative bar graph so you can see what proportion or what percentage of the data falls into each category.
Now let's move on to quantitative variables, which are going to take up way more time in this video. First, we have two different types of quantitative variables: discrete and continuous. A discrete quantitative variable takes on values that are countable and finite.
For example, the number of goals that you could score in a soccer game: well, that's going to be zero, one, two, three, four, five, and so on. You may say, well, I guess it could be infinite; you could have a million goals in a game.
But again, realistically, no, you can't. So typically with a discrete quantitative variable, we're thinking whole numbers only. And if we think about it, you could make a list of all possible outcomes that wouldn't necessarily go on forever. Whereas a continuous quantitative variable takes on values that are not countable and basically theoretically could be infinite.
For example, the weight of a frog. If you think about the weight of a frog, it really could take on infinitely many values, especially when you have a really good measuring tool. Because if you have a good measuring tool that maybe goes to, say, five decimal places, well, even if you're talking about between 10 pounds and 11 pounds, which actually would be a pretty big frog, let's shrink that down a little bit.
Let's say between five and six pounds. Realistically, right? You have to understand, and I hope you all know this, between five and six pounds there's an infinite number of values, right? Now, if you say, well, we're only going to go to two decimal places, then okay, there's not an infinite number of values, but there's still a lot, and you wouldn't want to sit and count them all.
But again, hypothetically, from five to six pounds there is an infinite number of possibilities, especially if you have some really precise measuring tool. So discrete, we're thinking a countable, set number of outcomes that are typically whole numbers. Whereas continuous, we've got way too many of them to even count, because we've got decimals upon decimals upon decimals that make for a truly continuous variable that can take on infinite outcomes, even if it's really not infinite. Quantitative data can also be organized into what we call a frequency table or a relative frequency table. But because we don't have categories or names, we have numbers, the first thing we have to do is create bins, basically intervals, right?
So each bin or interval has to be equal in size. So here we have data from a sample of trees. And from every tree, we measured the tree's height.
And we have bins of 20 to 30 feet, 30 to 40 feet, and so forth. These bins are what we call left-handed bins, which means each bin includes the number on the left and goes up to, but does not include, the number on the right. So that first bin is for any tree from 20 up to 29.99999999 feet. If a tree was exactly 30 feet tall, it would go into the next bin.
So again, once we set up our bins (and you can set the bins however you want; you can choose whatever interval width you want, it just has to be consistent), then you just go through your data and count: okay, how many trees were 20 to 30 feet? Count them up, and that's again the frequency. Or you could obviously take that value, divide it by the 174 total trees in the sample, and get the relative frequency as well.
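The binning process can be sketched like this. The tree heights here are made up for illustration (the video's actual sample has 174 trees); the key detail is that a height of exactly 30 lands in the 30-40 bin, not the 20-30 bin.

```python
# Hypothetical tree heights in feet, just a handful for illustration.
heights = [24.5, 31.0, 30.0, 45.2, 67.8, 29.99, 88.1, 52.3]

# Left-handed bins: include the left edge, exclude the right edge.
bins = [(lo, lo + 10) for lo in range(20, 90, 10)]  # (20,30), (30,40), ...

freq = {b: 0 for b in bins}
for h in heights:
    for lo, hi in bins:
        if lo <= h < hi:          # left edge included, right edge excluded
            freq[(lo, hi)] += 1
            break

# Relative frequency table: divide each count by the sample size.
rel = {b: n / len(heights) for b, n in freq.items()}

print(freq[(30, 40)])  # 2 -- the 31.0 tree AND the exactly-30.0 tree
```

Notice the exactly-30-foot tree is counted in the 30-40 bin, exactly as the left-handed convention requires.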
Now, there are four types of graphs that can be made from quantitative data. A dot plot, a stem and leaf plot, a histogram, and a cumulative graph. Now, let's look at our sample of 174 trees.
And from every tree, we measured its height, which is a quantitative variable. First off, because it's a number. Technically, it'd be continuous because the height of a tree, if you've got a really precise measuring tool, could be any value.
But again, you get the idea. Now, here's an example of a stem and leaf plot. The cool thing about a stem and leaf plot is you can actually see all the individual values, and they just stack up so you can see the distribution. Then we have a dot plot, which puts a dot for each individual tree, so we can also see where the values stack up.
We see there are far fewer trees on the left, far fewer trees on the right, and most trees kind of in the middle, around 80 feet. Then what we have is called a histogram. I'd probably say that a histogram is the number one preferred graph for quantitative data in all of statistics.
Once again, across the x-axis, we see those bins or intervals: 20 to 30, 30 to 40, 40 to 50. And then we simply count how many trees fall into each bin, and then we make a bar that goes up to that count or that frequency.
You can also make it a relative frequency histogram as well where that bar goes up to the proportion instead of the count. Now, listen, I know it looks like a bar graph. It might smell like a bar graph. It might even taste like a bar graph, but it's not a bar graph. Bar graphs are for categorical data.
Don't ever call a histogram a bar graph. You'll offend a statistician somewhere. The really cool thing, whether it's a stem and leaf plot, a dot plot, or a histogram, is that you can see the distribution. Remember, the distribution is what values your variable can take on and how often it takes them on. So by looking at these distributions, we can clearly see where there's less data and where there's more data, what heights are most common versus what heights are least common. Now, the fourth type of graph is called a cumulative graph.
These are really cool graphs that you actually don't see too often, but they're really, really valuable. Now, here we see a bunch of dots connected by lines. Now, every dot has an X and it has a Y.
For example, there's a dot at 80 on the X, that's 80 feet, and 0.45 on the Y. Now, what that means is that 45% of all the trees in our sample were below 80 feet. So again, every dot tells you the proportion of data below that particular height.
Now, if we look in between, we see that the slopes of the lines connected to the dots are different. A steeper slope simply means that there's more data in that range. So we see that there's a large amount of data from 60 to 70, and also a large amount from 70 to 80, because that's where we see steeper lines.
If the line is horizontal, like we see between 0 and 10, or 10 to 20, that means there is no data in those bins whatsoever, because there was no change from one to the other. These are great graphs as well to see some really important information about how the data builds up, where there's a lot of data and where there's a little, all through this idea of looking at the steepness of the lines and understanding that each point tells you the proportion of data below that particular height.
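The cumulative graph idea is just a running total of relative frequencies. The bin counts below are hypothetical, chosen so that roughly 45% of the 174 trees fall below 80 feet, matching what the video describes; a flat stretch corresponds to an empty bin, and a steep stretch to a full one.

```python
# Hypothetical bin counts for the sample of 174 trees.
edges  = [10, 20, 30, 40, 50, 60, 70, 80]   # right edge of each bin, in feet
counts = [0, 0, 5, 8, 10, 12, 20, 23]       # trees per bin
total  = 174

# Cumulative relative frequency: proportion of data below each right edge.
cum = []
running = 0
for c in counts:
    running += c
    cum.append(running / total)

# The y-value plotted at x = 80 is the proportion of trees below 80 feet.
print(round(cum[-1], 3))   # roughly 0.45, i.e. about 45% of the trees

# The first two bins have count 0, so the graph is flat there (no data);
# the 60-70 bin has 20 trees, so the line is steep across that stretch.
```

Each dot on the graph is one of these cumulative proportions, which is why the slope between dots tells you how much data sits in that bin.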
Make sure that you know how to analyze these different graphs and be able to answer questions about them. For example, if we look at the histogram, I could say, hey, how many trees are greater than 70 feet? How many trees are less than 70 feet?
How many trees are between 100 and 120 feet? You've got to be able to answer all those questions. It's pretty simple: you just add up the bins.
Make sure you get a rough count as to how many are in each bin. But also make sure, if you're looking at a histogram: is it a frequency histogram, where it shows how many trees are in each bin? Or is it a relative frequency histogram, where it shows what proportion are in each bin? So it's really important to use all those kinds of facts and ideas to answer questions about these different graphs. But for the most part, they're pretty easy questions.
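Answering those histogram questions is just adding up bins, which can be sketched directly. The bin counts here are hypothetical but deliberately sum to the video's 174 trees.

```python
# Hypothetical frequency histogram for the 174 trees.
# Keys are (left, right) bin edges in feet; values are tree counts.
hist = {(20, 30): 5, (30, 40): 8, (40, 50): 10, (50, 60): 12,
        (60, 70): 20, (70, 80): 23, (80, 90): 30, (90, 100): 28,
        (100, 110): 18, (110, 120): 12, (120, 130): 6, (130, 140): 2}

total = sum(hist.values())
print(total)  # 174

# "How many trees are 70 feet or taller?" -- add up the bins from 70 on up.
tall = sum(n for (lo, hi), n in hist.items() if lo >= 70)
print(tall)           # 119

# The relative-frequency version of the same question:
print(round(tall / total, 3))
```

The same sums answer "between 100 and 120 feet" (just restrict which bins you include), and dividing by the total converts any frequency answer into a relative frequency one.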
In this unit, one of the most important things that you're going to be asked to do is to describe the distribution of a quantitative variable by looking at a graph. Now, when you do this, there's four things that you have to mention. The shape, the center, the spread, and any outliers or other unusual features.
Now, when we look at shape, there's lots of different things we could say. Unimodal, bimodal, gap, clusters, symmetric, skewed left, skewed right. When we talk about the center, you're looking for one value that you think best summarizes all the data. Spread is really analysis of how the data varies. And then again, outliers are data values that are very far away from all the other values, whether it be far to the left or far to the right.
Let's take a look at several graphs that I've made for you that will enable us to, well, talk about the distributions. Now, every single graph represents a sample of trees selected from a different part of a forest. Every single sample had roughly 174 trees, and we're going to see how each sample shook out. Now, in these first two graphs, we see a shape that's symmetric, but they're both symmetric in different ways.
Now, the pink graph is symmetric with most of the data in the middle, so it's going to have a smaller spread. Yes, the overall data does go from 20 to 140, but the majority of data is clustered in the middle, near the center of around 80 to 85 feet. Whereas the graph on the bottom also has a center of 80 to 85 feet, but that would be called bimodal because we see a big chunk of data on the left and another big peak of data on the right.
Now, even though 80 is probably a good center of the data, it's actually not really a good description of the data because there's actually two centers. It looks like we have two clusters of data. So we've got a bunch of smaller trees centered maybe around 35 feet, and a bunch of larger trees centered maybe around 120 feet. This one's going to be way more spread out.
It's going to vary much, much more because we got so many different trees on the left and so many different trees on the right end of the scale. Whereas the graph in pink has a much smaller spread because the majority of data is all, well, clumped together in the middle. Here we see two more samples of trees.
The one in purple is clearly skewed to the left, where the majority of the data is on the right, so the center is probably around, I don't know, 110 to 120 feet. And in the one in blue, we see it's skewed to the right, which gives us a center of maybe 35 to 40 feet. Now, they both have similar spreads, but again, the majority of the data in purple is at the higher end, whereas the majority of the data in blue is at the lower end.
Here we have two more graphs that are both symmetric, but with the biggest difference between these two graphs is how spread out they are. The one in green is far less spread out than the one in purple. In green, we have a center of 80, but it's all clustered together from 60 feet to 100 feet. Whereas in purple, we also have a center probably around 80 feet, but it's very evenly spread from 20 all the way up to 140. When your data is very evenly spread like this, we typically call it uniform. In this last example, we see a very unusual feature of a huge gap.
We have a couple trees ranging from 20 to 40 feet at the bottom. Then we have an enormous gap where there are no trees at all. And then we have a bunch of trees from 80 all the way to 130, with a couple there above 130 feet.
Now here we can also say that this graph is maybe slightly skewed to the left. And again, describing the center is kind of tough because you might want to jump and say something like 70, but there's not a single tree at 70. A better center here would be looking at maybe 110. Yes, there's a couple trees at the very bottom, but typically trees in this sample are about 110 feet, maybe even say 115. Now, we don't know for sure, but we'll learn a little bit more about this in a couple moments about outliers, because trees at the bottom definitely look like they could be outliers. Now, in any of these graphs that we've just taken a look at, we've got to make sure that we describe the distribution in context.
So if you go back and pause, you can read my descriptions and see how I give a quick explanation of the shape, the center, the spread, and any unusual features in every graph. It really doesn't take a whole lot to describe a distribution, but you've got to make sure you mention those four key details. Now, I've got to be honest: when you just have the graph of a distribution of a quantitative variable, there's really not a whole lot you can say about the distribution, so you kind of have to be a little bit vague.
But if you actually have all the individual values, there's so much more we can do. Let's start off by talking about measures of center. Here, we're talking about the mean and the median. Now, these are both the most famous measures of center. The mean is found simply by adding all the values together and dividing by how many you have. It's a pretty simple formula. But the mean is easily influenced by outliers. Remember, the mean is trying to balance everything out.
And if there's one really, really large outlier, the mean is going to move up a little bit because of it to keep it balanced. That one large outlier might only be one value, but it weighs just as much as a bunch of the other small values. Now the median is simply the middle value, no matter what.
If you have an odd amount of data points, then there is an exact median in the middle. If you have an even number of data points in AP statistics, we just take the average of the middle two values. Now there is no formula to tell you what the median is. You simply have to put your data in order and find the middle.
But there is one really cool thing you can do that's going to help you, and that is by using the formula n plus 1 divided by 2. This formula will not tell you what the median is, but it will tell you the location of the median if your data is in order. For example, if you have 19 pieces of data, 19 plus 1 is 20, 20 divided by 2 is 10. That means that the median is the 10th value. If you have 20 pieces of data, 20 plus 1 is 21, divided by 2 is 10 and a half.
That means that the median is located between the 10th and the 11th value. So find the 10th value, find the 11th value, and average them together to get your median. Now, the median is not influenced by outliers because you could have an absolutely enormous outlier on the far left or the far right, and the median doesn't care at all because he or she is just sitting pretty right in the middle. That value on the left could go as far as it wants away, and it's not going to affect the median at all, but it will affect the median. Now, what's really important for you to know when it comes to the mean and the median for AP statistics is this.
When your data is roughly symmetric, the mean and the median will be pretty close together. So even if you don't have a picture of your data and you're like, I don't know what the shape is, but you do have the mean and the median and they're really, really close to each other, then that's telling you that your data is symmetric. When you are skewed to the left, the mean is going to be smaller than the median. When you're skewed to the right...
the mean is going to be larger than the median. We can actually see this pretty clearly in these four graphs. And the top two graphs are both symmetric, albeit in different ways, but because they're symmetric, the mean and the median are going to be about the same place. The arrow represents the mean and the M represents the median. Now the official symbol that we have for a mean of a sample is X bar.
It's an X with a little bar over top of it. We don't really have any official symbol for the median. We just maybe use an M or write out the word median. Now, when data, again, like I already mentioned, is skewed to the left, like this purple graph, the mean, the arrow, is going to be a little bit less than the median.
And when your data is skewed to the right, like in blue, the mean, the arrow, is going to be a little bit greater than the median. Now, let's talk about why very quickly. Well, for example, in that blue graph, yes, the majority of data is at the bottom to the lower values.
But those higher trees, because remember, this is our tree data, even though there's only a couple of them at that far right, they are heavier. They're worth more, right? They're of bigger value to the data set, and the mean has to take them into account. So even though there's only a couple of them, they have more weight to them.
If that makes sense, that's going to pull the mean higher. Now we also have what are known as measures of position. These are values that tell you where you are in the data. Now, probably one of the most famous is what's called a percentile.
You might hear this all the time, especially working with SAT or ACT scores. A percentile, or a particular value's percentile, is the percentage of data at or below that score. So for example, maybe you take the SAT and you find out that you scored at the 95th percentile.
That means that 95% of other students scored at your level or below, which means 5% were above you. So that tells you your position in the data is pretty good here at the high end. Now we also have what's known as the first quartile.
The first quartile is known as the 25th percentile. Think of it as the middle of the bottom half of your data. 25% of data is below it, 75% of data is above it. The median, which we already know is the middle of our data, is actually known as the 50th percentile, because 50% of data is below it, 50% is above it.
And the third quartile, also known as Q3, is known as the 75th percentile. It has 75% of data below it, 25% of data above it. So these are just some important percentiles, but really a percentile can be any value. For example, the 42nd percentile has 42% of data at...
or below it. But again, percentiles really specifically tell you where you fall in the data. Next up, we have measures of spread.
There are three measures of spread. Range, which is simply your max minus your min. Now that's going to be very easily influenced by outliers.
So if you have an outlier in your data, it's going to make your range look huge, whereas realistically the overall spread of your data might not be that big; it just looks that way because of that outlier. Then we have what's known as the IQR.
That stands for interquartile range. This is the range of the middle 50% of your data, from Q1 to Q3. And finding it is really easy.
Just take the third quartile and subtract the first quartile. Lastly, we have probably the most common, most used, and most famous measure of spread: the standard deviation. The standard deviation has a pretty complicated formula, which you see here.
But honestly, you're always going to use technology to find it for the most part, or you'll be given it. What's more important is that you know what the standard deviation represents. It represents roughly how far the majority of the data is from the mean.
So if you have a very large standard deviation, that tells you most of your data is typically very far from the mean, whether above or below. If you have a very small standard deviation, that means most of your data is very close to the mean in the middle, not too far above, not too far below. Now, could there still be some data further and further away, whether it be above or below? Of course.
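Here's a quick sketch of all three measures of spread on a made-up data set. Note that the quartile convention below (medians of the lower and upper halves, excluding the middle value when n is odd) is the one most AP classes use, but different tools can give slightly different quartiles.

```python
# Range, IQR, and sample standard deviation on hypothetical data.
def quartiles(xs):
    """Q1 and Q3 as medians of the lower and upper halves
    (excluding the overall median when n is odd)."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    def med(v):
        m = len(v)
        return v[m // 2] if m % 2 else (v[m // 2 - 1] + v[m // 2]) / 2
    return med(s[:half]), med(s[half + (n % 2):])

data = [20, 35, 40, 45, 50, 60, 140]   # note the suspicious value at 140

rng = max(data) - min(data)            # range: max minus min
q1, q3 = quartiles(data)
iqr = q3 - q1                          # spread of the middle 50%

m = sum(data) / len(data)
# Sample standard deviation: divide the squared deviations by n - 1.
sd = (sum((x - m) ** 2 for x in data) / (len(data) - 1)) ** 0.5

print(rng, q1, q3, iqr)   # 120 35 60 25
```

Notice the range (120) is inflated by the single 140-foot value, while the IQR (25) ignores it entirely, which is why the IQR is the outlier-resistant measure of spread.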
But again, it's speaking to where the majority of the data falls. Lastly, we have outliers. Now, when you're looking at a graph, you might just kind of vaguely say, that looks like it could be an outlier or maybe it's not.
But now we actually have specific ways to measure or determine if you have outliers in your data. Now, there are two of them, and which one to use really depends upon what information you have. If you have your quartiles, then what you'll use is called the fence method.
So we basically find the upper fence and the lower fence. The upper fence is found by taking Q3, the third quartile, and adding 1.5 times the IQR. And if any value in your data is above that number, which you just calculated as your upper fence, then it is an outlier. You could have one, you could have none, you could have five or six, who knows. To find the lower fence, you take Q1, the first quartile, subtract 1.5 times your IQR, and that gives you your lower fence.
Any value in your data set below that number is considered an outlier. Again, you could have none, one, two, or more, however many you've got. Now, the second way that you can determine outliers is using your mean and standard deviation.
Now, remember, we know that the majority of data is within one standard deviation of the mean, because that's, well, what's typical. So we identify as outliers any values that are more than two standard deviations either above or below the mean. So if you take your mean and add two standard deviations, and then take your mean and subtract two standard deviations, you get an interval. Any values in your data that are outside of that interval would be considered outliers.
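Both outlier rules can be sketched side by side. The data and the Q1/Q3 values here are hypothetical (the quartiles are supplied by hand rather than computed, just to keep the focus on the fences themselves).

```python
# Two ways to flag outliers, on hypothetical tree heights.
data = [25, 60, 65, 70, 72, 75, 78, 80, 85, 150]

def fence_outliers(xs, q1, q3):
    """Fence method: anything beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR."""
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in xs if x < lower or x > upper]

def sd_outliers(xs):
    """Mean/SD method: anything more than 2 standard deviations from the mean."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
    return [x for x in xs if abs(x - m) > 2 * sd]

# For this data set, Q1 = 65 and Q3 = 80 under the usual AP convention.
print(fence_outliers(data, 65, 80))   # [25, 150]
print(sd_outliers(data))              # [150]
```

Interestingly, the two rules can disagree: here the fence method flags both 25 and 150, while the mean/SD method only flags 150, because the outliers themselves inflate the standard deviation. That's a good reminder that these are conventions, not absolute truths.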
Now, I'm going to be honest with you. The fence method is probably the most famous method to find outliers, but the mean and standard deviation method certainly works. But again, it all depends on what you have.
If you don't know the mean and standard deviation, and all you have is your quartiles, then you're going to use the fence method. If you have your mean and your standard deviation, then you can certainly use that method as well to determine if you have any outliers in your data. Now that we've very quickly gone over all the different summary statistics, let's talk about how they can be transformed if your data is transformed.
Now, there are two different ways to transform your data. First, we could take every single data value that we have and add a value to them all, subtract a value from them all, or multiply all the values by some number. Now, how does that impact the different summary statistics that we just learned?
Well, addition and subtraction affect measures of center and measures of position. If you add 5 to all your values, your mean is going to go up 5, your median is going to go up 5, the third quartile is going to go up 5, the 25th percentile is going to go up 5, the 42nd percentile is going to go up 5. But what will not change are the measures of spread: range, standard deviation, and IQR. They are not affected at all by adding or subtracting values to all of your data.
However, if you multiply all of your data by a specific value, that will affect all of your summary statistics. It's going to affect measures of center.
So if you multiply all your data by 0.2, for example, mean, median are going to multiply by 0.2, range, IQR, standard deviation, they're going to multiply by 0.2 and same with all your measures of position. Basically, everything will be multiplied by 0.2. Now, if you're going to transform them in two ways, maybe you're going to multiply and then add.
Just note that the multiplication affects everything: measures of center, measures of spread, and measures of position. But the measures of spread will not pick up whatever that added constant is; only the multiplication changes them. Now, the second way we can transform data is by adding data to our data set or taking data away. Now, it's really important for you to understand that what matters is where that new value falls.
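Before moving on, the add-versus-multiply rules above can be verified with a quick sketch in plain Python:

```python
from statistics import mean, stdev

data = [60, 70, 80, 90, 100]
shifted = [x + 5 for x in data]    # add 5 to every value
scaled = [x * 0.2 for x in data]   # multiply every value by 0.2

# Adding a constant moves center and position but leaves spread alone
print(mean(shifted) - mean(data))      # shifts by exactly 5
print(stdev(shifted) - stdev(data))    # spread is unchanged

# Multiplying by a constant scales center, position, AND spread
print(mean(scaled) - mean(data) * 0.2)    # center scales by 0.2
print(stdev(scaled) - stdev(data) * 0.2)  # spread scales too (difference ~0 up to float rounding)
```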
So if you have a data set and you add a huge, enormous outlier on the far right, well, your median is not going to change much at all. It might move over a little bit because you are adding a new value to your data set, but it's not going to change much. Whereas the mean is definitely going to get bigger because of that really big outlier. Remember, the mean has to take every value into account.
If you add a value that carries a whole lot of weight, it's going to pull the mean up. Now, if you add a new value and it's just like all the other values, it's kind of right in the middle, then once again, your median's not going to change a whole lot and your mean's not going to change much either. All right, that's it for summary statistics. There's a lot going on there and a lot of new things we learned, but...
you know, feel free to take the time to make sure you review it all and that it all makes sense to you. Now, taking together the min, Q1, the median, Q3, and the maximum are known as the five-number summary. And what we could do with the five-number summary is create a box plot, which is a really cool graphical representation of our summary statistics.
Now, what we do is we make a box around Q1 and Q3 with the median somewhere in between there. Then in AP Statistics we use what's called a modified boxplot. So first we identify outliers using our fence method. We put asterisks at those outliers, then the whiskers of the boxplot go to the next highest or lowest values that were not outliers.
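The five-number summary itself is simple to compute. Here's a sketch in Python — again with the caveat that statistics.quantiles may split quartiles slightly differently than the convention your class or calculator uses:

```python
from statistics import quantiles

def five_number_summary(data):
    # (min, Q1, median, Q3, max) -- everything a boxplot is built from
    q1, med, q3 = quantiles(data, n=4)
    return min(data), q1, med, q3, max(data)
```

For example, five_number_summary([1, 2, 3, 4, 5, 6, 7]) returns (1, 2, 4, 6, 7).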
Here we see an example of a boxplot, and the most important thing is that each section of that boxplot represents 25% of our data. Now note that I have an outlier there on the far right, and that whisker went to the next value in my data that was not deemed an outlier.
Now the five-number summary breaks the data down into 25% chunks. A wider whisker on the far right does not mean more data. It just means that that section of the data is more spread out. So each chunk below Q1, between Q1 and the median, between the median and Q3, and from Q3 all the way to that outlier represents 25% of the data. Wider simply means more spread out.
It doesn't mean more data. Now the cool thing is that through a box plot you can also see the shape. You clearly see the shape of this data is skewed to the right, because 50% of the data is towards the bottom, kind of clustered together, and then the upper 50% of the data is way more spread out.
So you can visualize that as a skewed-right graph. Here we see two more box plots that are symmetric. This is going back to those pink and orange graphs that were both symmetric in different ways. And now you can actually see that in these box plots. The first one is spread out with some outliers on the left and outliers on the right.
But we see our whiskers are about the same size. That means they have about equal spread on the left and right. Now, the median is not right smack dab in the middle of the box. And that's OK, but it's still pretty evenly balanced. It represents symmetry.
Then the bottom graph, we see that the data is way more spread out. Look at that middle 50% in the box is way more spread out. That's because...
to grab the majority of the data, the box has to go way to the left and way to the right, because again, look at the histogram. The majority of data is way to the left and way to the right, so the middle 50% is going to be way wider to capture that data. Now that we've learned all the different summary statistics for a quantitative variable, we can see how they all kind of fit together and really tell us a lot about the data.
And one thing that the AP Statistics exam loves to do is give you a set of summary statistics and have you complete some tasks with it. So here we're going to take a look at another set of 174 trees where the height of each tree was measured. Now across the top we see the summary statistics: the mean, the median, the min, Q1, Q3, the max, and the standard deviation. And the first thing I notice is that the mean is lower than the median, so the data has a shape that is skewed left.
Also, the median is closer to the third quartile than it is to the first quartile. Now, more distance between the first quartile and the median does not mean there's more data; it just means that section is more spread out. We also notice that the third quartile is closer to the max than the first quartile is to the min, meaning that the distance between the first quartile and the min is extremely far, which again is showing that that side of the data, the left side, is more spread out.
All signs point to the bottom 50% of the data being more spread out than the top 50%, which makes our data skew to the left. Another very common question has you analyze the standard deviation. The standard deviation tells us the majority of trees in this sample are within 28.96 feet of the mean of 104.82 feet. Remember, the standard deviation tells you how far typical data is from the mean and within means plus or minus.
So if we take our mean and we add 28.96, we subtract 28.96, that tells us where the majority of our data falls. Now that standard deviation is kind of large to be quite honest, which is again, another sign that our data is fairly spread out. Now, they also love asking you to talk about outliers.
So remember, we have two different outlier formulas. In red, I have the fence method. Here, we're taking the third quartile of 125. We're adding 1.5 times the IQR, which is Q3 minus Q1, and we get 185. Now, the first thing I noticed is the max is only 135, which means that there is obviously no values bigger than 185, so there's no upper outliers.
Now, the lower fence is Q1, 85, minus 1.5 times the IQR, and we get 25 here. Now the min is 22, which is below 25. So we know for sure that we have at least one outlier, the 22-foot tree.
But the idea here is without knowing every single individual data point, there could be more outliers. Like for example, there could be a tree that's 23 or a tree that's 24 feet. That would again be below 25. But we don't know all those values.
We only know the min. So that's why it's important to emphasize that there's at least one outlier and that there could be more, but without the data we don't know. We could also use our mean and standard deviation method by taking the mean and adding and subtracting two standard deviations. Here we get an interval where we know a large, large majority of our data falls, and any values outside this interval would be deemed outliers. Now the top of that interval is 162.74 feet, and again, with our max of only 135, there are no upper outliers by this method either.
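Both outlier calculations from this tree example can be replayed with plain arithmetic, using the summary statistics given above (Q1 = 85, Q3 = 125, mean = 104.82, SD = 28.96, min = 22, max = 135):

```python
# Summary statistics from the tree-height example
q1, q3 = 85, 125
mean_h, sd = 104.82, 28.96
min_h, max_h = 22, 135

# Fence method
iqr = q3 - q1                       # 40
lower_fence = q1 - 1.5 * iqr        # 25.0
upper_fence = q3 + 1.5 * iqr        # 185.0

# Mean plus/minus two standard deviations
lower_2sd = mean_h - 2 * sd         # about 46.9
upper_2sd = mean_h + 2 * sd         # about 162.74

print(min_h < lower_fence and min_h < lower_2sd)  # True: the 22-ft tree is an outlier by both rules
print(max_h > upper_fence or max_h > upper_2sd)   # False: no upper outliers by either rule
```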
So I actually had all this data, and there was one other tree at 23 feet. That's why you see two dots there, 22 and 23. And then that whisker goes to the next value. It looks to be about 30. That was not an outlier.
Now, again, we see where Q1, the median, Q3, and that max value fall. Now, sometimes on the AP exam, if you don't actually have all the data and you only have your five-number summary, feel free to just make a regular box plot not showing any outliers, because you could certainly do that with the five-number summary alone. All right, that's it for this example.
Hopefully, that made a lot of sense. Now another really important task that very often comes up on the AP Stats exam, whether it be a multiple choice or an FRQ, is comparing two different distributions. Maybe we have two histograms, two box plots, or even two stem and leaf plots, which we could call a back-to-back stem and leaf plot.
But through any of these things, we want to make sure we compare. And when we compare, please make sure that we use comparative language like greater than, less than, bigger, smaller, higher, lower, all those different things, or even say they're just flat out the same.
Now when you're comparing, we want to compare the centers, we want to compare the shapes, we want to compare the spreads, and we want to compare the presence or absence of outliers. Let's take a look at an example. Here we see what we call parallel box plots. There are two box plots that are parallel and on the same x-axis. Now oftentimes what we're going to be asked to do is compare.
So the top is trees from the west side of the forest, and the bottom box plot is trees from the east side. So what could we say about the shapes?
Well, we'd say they're both approximately skewed to the right. We see that the bottom 50% on both graphs is much less spread out than the upper 50% on both graphs, so they're both a little bit skewed to the right. We also identify that neither graph has any outliers, and then we could also look at the medians. The median for the east trees is 20 feet, where the median for the west trees is 33, so the west side clearly has a higher center.
We can also look at the middle 50%. The IQR for the top west trees is way more spread out than the IQR or the middle 50% for the east trees. So when we're looking at these different graphs, we want to talk about shape.
Maybe the shape is the same, as in this case. Then you'd write about center, the median being higher for one than the other, and about spread as well. Being able to compare two distributions really is vital.
It comes up almost every single year on the FRQ section of the AP exam. So make sure you take your time with it. Use comparative language and speak in context.
Don't just say, oh, the one has a center of 33 and the other has a center of 20. 33 what? Fish? Inches? Centimeters? Seconds?
No, trees from the west are a little bit taller than trees from the east. One has a center of around 33 feet, the other has a center of around 20 feet. Use things like that to make sure you speak in context, especially when you're comparing two distributions. In this last section of Unit 1, things take a pretty cool, crazy twist. Now here's the deal.
Some sets of data can be modeled with what we call a density curve. A density curve is used to model a set of data to give us some insight as to what the population that that sample data came from could possibly look like. Some sets of data can be described as approximately normally distributed.
This is the most famous type of density curve there is, the normal distribution. Now the normal distribution is unimodal, mound-shaped, and symmetric, and it can be described with the parameters of the population mean and the population standard deviation.
So here we see that normal curve, again, mound-shaped and symmetric, right smack dab in the middle is the mean, and then as we move to the right, we go up one, up two, up three standard deviations, down one, down two, down three standard deviations. Now, the normal model is used for continuous quantitative variables, which again, remember, have infinite possibilities all the way up towards positive infinity and all the way down towards negative infinity. So why do we stop the normal model at three standard deviations above the mean and three standard deviations below the mean?
Because honestly, a huge chunk of the data is within three standard deviations if it's normally distributed. There's just very little data above three standard deviations or below three standard deviations for us to even worry about. Please note that not all data sets follow a normal distribution. Furthermore, a sample might look unimodal, mound-shaped, and symmetric, and you might want to say that it is a normal distribution. But remember, only a population can officially be modeled with a normal distribution. Now, here's what's really cool about normal distributions: they're actually very predictable.
We know that 68% of data in a population is within one standard deviation of the mean. 95% of data within a population is within two standard deviations of the mean. and 99.7% is within three standard deviations of the mean.
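Those three percentages aren't arbitrary; they fall straight out of the normal curve. A quick sketch using Python's math.erf, the exact-area function behind the normal CDF:

```python
from math import erf, sqrt

def within(k):
    # Proportion of a normal population within k standard deviations of the mean
    return erf(k / sqrt(2))

print(round(within(1), 4))  # 0.6827 -> the "68%"
print(round(within(2), 4))  # 0.9545 -> the "95%"
print(round(within(3), 4))  # 0.9973 -> the "99.7%"
```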
That is why, even though a normal distribution is continuous all the way down towards negative infinity and up towards positive infinity, we usually stop drawing it at negative three and positive three standard deviations, because it's so unlikely for data to be outside of that. Most data, pretty much all of it, 99.7%, is within three standard deviations of the mean. We actually call this the empirical rule. Let's say that a large forest has trees that do in fact follow a normal distribution when it comes to their heights.
They would have a mean of 80 feet and a standard deviation of 18 feet. Here is what that normal distribution would look like in this scenario. Now remember, tree height is a continuous quantitative variable. So technically the height of a tree can be anything as low as negative infinity or as high as positive infinity if you want to look at it that way. But when we draw the normal model, there is no reason for us to go below 26 feet or above 134 feet.
because that is three standard deviations above and three standard deviations below the mean, which is where 99.7% of trees in this forest are going to fall anyway. Now the formula for standardized scores or z-scores is actually really really simple. To find a z-score you simply take your individual value, in this case a tree height, subtract the mean mu, and divide by the standard deviation sigma.
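The z-score formula is a one-liner in code. A sketch in Python, using the tree scenario above (mean 80 feet, SD 18 feet):

```python
def z_score(x, mu, sigma):
    # Standardized score: how many SDs x sits above (+) or below (-) the mean
    return (x - mu) / sigma

print(round(z_score(100, 80, 18), 2))  # 1.11 -> a 100-ft tree is 1.11 SDs above the mean
print(z_score(62, 80, 18))             # -1.0 -> a 62-ft tree is 1 SD below the mean
```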
I just want to make sure I emphasize that when you're going to use a calculator to do this, do the numerator first, hit enter, and then divide by the standard deviation. Now, once again, a z-score measures how many standard deviations above or below the mean you are. So z-scores can be negative, and z-scores can be positive. But again, don't forget the idea that most data is within 3. So getting a z-score of negative 4, negative 5, positive 7, positive 18, those are extremely crazy z-scores, because most data will fall within 3 standard deviations of your mean. Here we see a standard normal model,
which is only labeled by the z-scores. Zero is in the middle because the mean is zero standard deviations from itself. Then we go up one, up two, up three standard deviations, down one, down two, down three standard deviations.
Now what's really cool about the standard normal model and z-scores is it allows us to compare anything. So for example, you might think it'd be impossible to compare the height of a tree to the weight of a bear. But if you standardize their scores, giving the z-scores for a tree and the z-score for a bear, putting them onto the same standard normal model, then you could really figure out, oh, that bear has a Z-score of 1.3, where that tree only has a Z-score of 0.9. Clearly, that is a bigger bear.
So even though bears and trees are things that you wouldn't seemingly ever compare, you actually can if you standardize their scores. So let's go back to our 100-foot tree question. The first thing we can do is standardize the score for a 100-foot tree.
So we're going to take 100, subtract 80, and divide by the standard deviation to get a z-score of 1.11. So we see that spot indicated on our standard normal model. That is where the 100-foot tree is, because it's 1.11 standard deviations above the mean.
So one question we could ask is what proportion of trees are below 100 feet? So now that we have the z-score for 100 feet, we could again use technology. So here I'm showing you how to use a TI-84 calculator.
You're going to hit second, vars, and go to normalcdf. The lower value is negative 99. That's essentially acting as negative infinity. We don't have an infinity button on the calculator.
So we're just going to use an extremely low z-score. And the upper value is going to be that 1.11. Again, it works left to right: lower on the left, upper on the right.
So if we're looking at the shaded region, trees below 100 feet, or below a z-score of 1.11, we're going to start at negative 99, way down below, and go up to 1.11. And the TI-84 calculator tells us 0.867. So 86.7% of trees are below 100 feet, as long as it's a normal distribution.
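If you don't have a TI-84 handy, the same left-tail area can be sketched in Python with math.erf; this mirrors what normalcdf(-99, z) computes:

```python
from math import erf, sqrt

def normal_cdf(z):
    # P(Z < z) for the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

# Proportion of trees below 100 ft (mean 80, SD 18)
z = (100 - 80) / 18
print(round(normal_cdf(z), 3))  # 0.867, matching the calculator
```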
Now we could also use Desmos. Desmos makes it pretty easy to do normal distribution calculations with a built-in command. Here it is.
We're just looking at negative infinity to a Z score at the top, max of 1.11. And we also get 86.7% of trees that are below 100 feet. And here's an example of what we use, one of those standard normal tables.
Now, a lot of teachers might not even teach these anymore because they're a little bit old school, but these are tables where you actually look up your z-score, and the inside of the table gives you the proportion below it on a standard normal distribution. So on the left side, we look up the first decimal place.
That's the 1.1. So we have the 1 and then the 0.1. And then across the top, we find that second decimal, which is also a 1 in this case.
We're going to go to the 0.01 column. So in total, that'd be 1.11. And then we just cross that row and column together and we get 0.867.
Once again, telling us that 86.7% of trees in this forest are below 100 feet. We could also use this exact same procedure to find the proportion of trees that are greater than 100 feet. Again, it's the same z-score, 1.11, but when we go to our TI-84 calculator, now the lower value is going to be 1.11 and the upper value is going to be 99. We could also use Desmos, or we could use a standard normal table. Just be careful: standard normal tables only give you the proportion below the particular z-score that you look up. So if the question is asking about going greater than 1.11, you have to first look up the value in the z-table that represents the proportion below, and then simply take 1 minus that proportion. Then you get the opposite of it, which would be the proportion of trees above 100 feet, or above a z-score of 1.11.
Either way, we get the same proportion for trees that are greater than 100 feet. We can even find the proportion of trees that are between 70 and 100 feet. We've got to get the z-score for both 70 and 100. Then we can use normalcdf on our TI-84 calculator to look in between those two z-scores.
Or we can even use Desmos to look in between those two z-scores as well. Once again, you could also use a standard normal table. It just involves a little bit more work, because you have to look up the proportion of data below the higher z-score.
Then you have to look up the proportion of data below the smaller z-score and subtract them to get the proportion in between. Most people don't use standard normal tables anymore, but if you're trained on how to use them, it's still pretty easy. The normal distribution even allows us to work backwards to solve some really cool problems.
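Before we go backwards, here's a quick sketch of the forward calculations covered so far — below, above, and between — again leaning on Python's math.erf, with the tree numbers of mean 80 feet and SD 18 feet:

```python
from math import erf, sqrt

def normal_cdf(z):
    # P(Z < z) for the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 80, 18
z70 = (70 - mu) / sigma    # about -0.56
z100 = (100 - mu) / sigma  # about 1.11

below = normal_cdf(z100)                       # P(height < 100)
above = 1 - normal_cdf(z100)                   # P(height > 100): 1 minus the area below
between = normal_cdf(z100) - normal_cdf(z70)   # P(70 < height < 100): subtract the two areas

print(round(above, 3))    # 0.133
print(round(between, 2))  # 0.58
```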
Here we could be given the area, or the proportion, under a standard normal curve, and what we can do is use technology or a standard normal table to actually find the z-score that represents that particular area. Let's look at this through an example. In the forest with trees whose heights follow a normal distribution, with a mean of 80 feet and a standard deviation of 18 feet, what height would mark the 80th percentile? So remember what a percentile is: it's the percentage of data at or below a particular value. So what we're asking here is what tree height would represent the position with 80% of trees less than it, which would simultaneously mean 20% above it.
We could use technology or a standard normal table to get the z-score that represents this position with 80% below it. Here's how it works. On your TI-84 calculator, you're going to use the invNorm command. Now in the invNorm command, you have to enter the area you're looking for, and the area that you type in is the proportion below, the area to the left.
So I'm going to type in 0.8 because that's what we're trying to find: the z-score that has 80% to the left of it, or below it. And when we use the invNorm command, we get a z-score of 0.842. Now, we could also use the standard normal table.
What we have to do is actually use it in reverse. So we're going to actually look inside the table and find approximately 0.80. Now, we don't actually see it exactly, but we see two numbers that are really close. We see 0.7995 and 0.8023.
To be honest, you could probably use either one of them and be okay. But technically, 0.7995 is really, really close to 80%, or 0.8. So from the table, the z-score that has 0.7995 below it is 0.84. Now, I'm going to be a little bit more precise and use the 0.842 from my calculator. Now, what we're going to do is take our z-score formula.
We know the mean is 80. We know the standard deviation is 18. But now, we know the z-score that represents this 80th percentile is 0.842. Then, we could just work backwards, multiply the standard deviation over, that's 18, and then add the 80. And this gives us the height of a tree that would represent the 80th percentile. In this case, we've got 95.156 feet. So 80% of trees in the forest are below 95 feet and 20% are above it.
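The invNorm idea — turning an area into a z-score — can be sketched with a simple bisection search over the normal CDF. This is a stand-in for the calculator command, not how the TI-84 actually implements it:

```python
from math import erf, sqrt

def normal_cdf(z):
    # P(Z < z) for the standard normal distribution
    return 0.5 * (1 + erf(z / sqrt(2)))

def inv_norm(area, lo=-10.0, hi=10.0):
    # Bisection: find the z-score with the given area to its LEFT,
    # like the TI-84's invNorm command
    for _ in range(80):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# 80th percentile of tree heights (mean 80 ft, SD 18 ft)
z = inv_norm(0.80)
print(round(z, 3))            # 0.842
print(round(80 + z * 18, 2))  # 95.15 -- about the 95.156 ft found with the rounded z
```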
Another example might ask us something like, what tree height represents the top 5% of all trees in the forest? So once again, we got to get some technology here to figure out what Z-score represents that top 5%. But keep in mind that when you're talking about the top 5%, you're simultaneously talking about the bottom 95% below it. It's the same value.
So once again, we could go to our TI-84 calculator and type in 0.95, because that's the area to the left, or the area below, which represents 95% below, the same thing as saying 5% above, and we get a z-score of 1.645. We could also use our standard normal table and look up 0.95. Once again, this is one of the drawbacks of using the tables: you're not going to find it precisely. But we can get pretty close here, and we see about 0.9495 or 0.9505.
Both are equally close to 0.95, so the table gives a z-score of 1.64 or 1.65. Now again, I'm always going to try to use technology to be a little bit more accurate here. So I'm going to go with the 1.645 as that z-score.
Once again, substituting that in for z, I know the mean is 80 and I know the standard deviation is 18. Multiply the standard deviation over, add 80, and I get the height of a tree that represents that 95th percentile. So 5% of trees in the forest are taller than 109.61 feet, and 95% of trees are below that. Now, I have to be honest with you, there are so many more normal distribution calculation problems that can be done other than the ones I just went over.
In fact, there's a huge plethora of different types of problems that all involve the normal distribution. Some are really easy, like the ones we talked about in this video, and some can be much more complex. So here's the deal. I don't have time in this review video to go over all of them, but what you can do is visit my YouTube channel where I have a playlist for the normal distribution. I have tons of videos that go over all the different types of problems that can come up when you're doing normal distribution calculations.
They can be really fun in my opinion, but they can also be a little bit challenging, which hopefully makes it fun. But please check out my YouTube playlist where you can learn much, much more about the normal distribution and all the different calculations that can be done with it. All right, well, that's a wrap on Unit 1, Exploring One-Variable Data. It's a pretty thick unit with lots of information in it, so hopefully I didn't review it too fast. But please take a look at that study guide in the Ultimate Review Packet.
You also have the answer key, and just doing that study guide, checking out your answers is going to really help prepare you, not just for the Unit 1 test you have in class, but it's also going to really help prepare you for the AP Stats exam in May. Now, Unit 1 really kind of sets the foundation for the entire course because everything starts with analyzing data. If you know how to analyze data, talk about data, understand summary statistics, and how everything ties together, it's going to really help prepare you for everything else in this course.
So best of luck. Hopefully you learned a lot. I can't wait to see you in the next video.