Now let's talk a little bit more about the sample and population dilemma and what samples look like when you take them randomly from the population. So in this app I have three different properties: height, age and IQ of people.
And my population is huge. It's 50,000 people. It's like a small city. So what you see here is the distribution of one property.
In this case it's height in the population. And then you can see that this distribution is bimodal. So it means that we have two groups in our population with different heights. For example, it can be males and females or maybe some other groups.
So you can see that this is one group with an average height of approximately 178 centimeters. And this is the second group with a mode here, and the most frequent value in this group is approximately 160. What happens if you take a sample from this population? Of course, the result depends on the sample size.
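To make the setup concrete, here is a minimal sketch in Python of a bimodal population like the one described, with one simple random sample drawn from it. It is not the code behind the app. Only the population size and the two centres (178 and 160) come from the talk; the equal group sizes, the spreads, and the function name `make_height_population` are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

POPULATION_SIZE = 50_000


def make_height_population(size=POPULATION_SIZE):
    """Build a bimodal population as a mixture of two normal groups.

    Assumed for illustration: equal group sizes and spreads of 7 and 6 cm.
    Only the centres (178 and 160 cm) are taken from the talk.
    """
    half = size // 2
    group_a = [random.gauss(178, 7) for _ in range(half)]
    group_b = [random.gauss(160, 6) for _ in range(size - half)]
    return group_a + group_b


population = make_height_population()
sample = random.sample(population, 6)  # simple random sample without replacement

print("population mean:", round(statistics.mean(population), 1))
print("sample of 6:", [round(h, 1) for h in sample])
print("sample mean:", round(statistics.mean(sample), 1))
```

Changing the seed and rerunning plays the role of pressing the "new sample" button: the population stays the same, the six values do not.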
And you can see that by default my sample size is 6. You can see this sample here. And then you can also see the corresponding plots. So this is a box plot for the population, and this is a box plot for the sample. And this plot is rather new, but you already know percentiles. We discussed them in the previous app, right?
So a percentile tells you what percentage of the values are smaller than a given value. This curve shows the percentiles for the population, while this curve shows the percentiles for the sample, so you can see how close the sample is to the population. The point is that if you take a new sample, it will be quite different from the previous one. For example, this sample now is a very good approximation of the population.
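The percentile comparison can be sketched in a few lines of Python. This is an illustration under assumptions, not the app's implementation: the population here is a hypothetical normal stand-in (mean 170, standard deviation 10), and `percent_below` is a helper name introduced for this example. For each sampled value it reports the percentage of smaller values within the sample and within the population.

```python
import bisect
import random

random.seed(2)


def percent_below(sorted_values, x):
    """Percentage of values in sorted_values that are strictly smaller than x."""
    return 100.0 * bisect.bisect_left(sorted_values, x) / len(sorted_values)


# Hypothetical stand-in population; the real app uses its own data.
population = sorted(random.gauss(170, 10) for _ in range(50_000))
sample = sorted(random.sample(population, 6))

for rank, value in enumerate(sample):
    # With the sample sorted, 'rank' values are smaller than this one.
    in_sample = 100.0 * rank / len(sample)
    in_population = percent_below(population, value)
    print(f"{value:6.1f} cm  sample: {in_sample:5.1f}%  population: {in_population:5.1f}%")
```

When the two percentage columns are close for every value, the sample curve lies close to the population curve.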
You can see all these points lying pretty close to the theoretical line, and the box plot is also pretty close: the median and quartiles are pretty close to what you have in the population. The whiskers are almost always shorter, simply because the sample size is so small and values in the tails are unlikely to occur. When you have a small sample size, it's unlikely that you will cover the whole range. So it means that the variation of your sample will almost always be smaller than the variation of your whole population. And then simply try to take more and more samples. So for example, this one is an almost perfect sample.
You can see that, although the sample size is only 6, we now have almost perfect correspondence between the sample and the population. Right?
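The point about a small sample not covering the whole range can be checked with a short simulation. The sketch below is an illustration only: it assumes a normal stand-in population (mean 170, standard deviation 10) and measures, over many samples of size 6, what share of the population range a sample spans on average.

```python
import random

random.seed(3)

# Hypothetical stand-in population; any shape would make the same point.
population = [random.gauss(170, 10) for _ in range(50_000)]
pop_range = max(population) - min(population)

REPEATS = 2_000
SAMPLE_SIZE = 6

total_share = 0.0
for _ in range(REPEATS):
    s = random.sample(population, SAMPLE_SIZE)
    # A sample can never span more than the population it was drawn from.
    total_share += (max(s) - min(s)) / pop_range

share_6 = total_share / REPEATS
print(f"a sample of {SAMPLE_SIZE} spans about {100 * share_6:.0f}% of the population range")
```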
You can also see it here. Sometimes it will be pretty extreme. Let me take, for example, this one.
Look how different it is. Right? Sometimes it will be pretty close.
The problem is that in practice you always take only one sample. You don't work with an infinite number of samples. And therefore you never know how close your sample is to the population. So be prepared that your sample can be pretty extreme, especially if the sample size is small.
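In a simulation, unlike in real life, you can draw as many samples as you like and see how much they disagree. The sketch below does that for the sample mean, again with an assumed normal stand-in population (mean 170, standard deviation 10) rather than the app's data.

```python
import random
import statistics

random.seed(4)

# Hypothetical stand-in population.
population = [random.gauss(170, 10) for _ in range(50_000)]
pop_mean = statistics.mean(population)

# Draw 1,000 independent samples of size 6 and keep each sample's mean.
sample_means = [statistics.mean(random.sample(population, 6)) for _ in range(1_000)]

print("population mean:", round(pop_mean, 1))
print("smallest sample mean:", round(min(sample_means), 1))
print("largest sample mean:", round(max(sample_means), 1))
print("spread (sd) of sample means:", round(statistics.stdev(sample_means), 2))
```

Any single one of those 1,000 means could be the one you end up with, and nothing in the sample itself tells you whether it is a typical or an extreme one.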
So let me just change the sample size down to three. Look. In this case, you can see that the samples can vary a lot. And the smaller the sample size, the more extreme it can be. For example, in this case you can see that it's very much shifted to the left, right?
And you can also see it from here. So the smaller the sample size, the less likely it is that your sample will resemble your population well, and the less likely it is that your sample will be representative of your population. But you can increase your sample size; for example, in this case I can go up to 30. Now look how close the points are. And also the boxplot, right?
If I take a new sample, you can see that almost always the points are pretty close to this theoretical line and the boxplot corresponds very well to the boxplot for the population. So it means that even with a sample size of 30, which is not a huge sample, you have a pretty good approximation in this case.
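One way to put a number on "how close the points are to the theoretical line" is the largest vertical gap between the sample's percentile curve and the population's. The sketch below averages that gap over many samples for sizes 3, 6 and 30. It is an illustration under assumptions: the bimodal population uses guessed spreads, and `max_gap` and `typical_gap` are names introduced here, not part of the app.

```python
import bisect
import random
import statistics

random.seed(5)

# Assumed bimodal population: centres from the talk, spreads are guesses.
population = sorted(
    [random.gauss(178, 7) for _ in range(25_000)]
    + [random.gauss(160, 6) for _ in range(25_000)]
)


def max_gap(sample):
    """Largest gap, in percentage points, between sample and population percentile curves."""
    s = sorted(sample)
    n = len(s)
    gap = 0.0
    for i, x in enumerate(s):
        pop_pct = 100.0 * bisect.bisect_left(population, x) / len(population)
        # The sample curve steps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, abs(pop_pct - 100.0 * i / n), abs(pop_pct - 100.0 * (i + 1) / n))
    return gap


def typical_gap(n, repeats=500):
    """Average max_gap over many random samples of size n."""
    return statistics.mean(max_gap(random.sample(population, n)) for _ in range(repeats))


gaps = {n: typical_gap(n) for n in (3, 6, 30)}
for n, g in gaps.items():
    print(f"n={n:2d}: typical largest gap is about {g:.0f} percentage points")
```

The gap shrinks as the sample size grows, which is the numerical counterpart of the points hugging the population curve at a sample size of 30.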
You have a very representative sample. Of course, to ensure that it's representative, the sample should be completely random, but in this case it is. And this is true regardless of which distribution the population has. You can see, for example, that for age I have a uniform distribution, right? So it means that all ages have approximately the same number of people in the population, right?
And then you can see that in this case the percentile curve is a straight line because the distribution is uniform. And then if you take a new sample you can see that most of the time this sample is pretty close to the population, right? So you can see that the points lie close to the population line and also that the boxplot is very similar to the boxplot for the population.
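Why the percentile curve of a uniform distribution is a straight line can be checked directly: the percentage of people younger than a given age grows in proportion to that age. A small sketch, assuming ages spread uniformly between 0 and a hypothetical upper bound `MAX_AGE` of 80 (the talk does not state the actual range):

```python
import bisect
import random

random.seed(6)

MAX_AGE = 80  # assumed upper bound, not stated in the talk
ages = sorted(random.uniform(0, MAX_AGE) for _ in range(50_000))

percent_younger = {}
for age in (20, 40, 60):
    percent_younger[age] = 100.0 * bisect.bisect_left(ages, age) / len(ages)
    straight_line = 100.0 * age / MAX_AGE  # what a straight line through 0 and 100% predicts
    print(f"age {age}: {percent_younger[age]:.1f}% are younger, straight line predicts {straight_line:.1f}%")
```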
And then also for IQ. IQ is normally distributed. In this case you can see extreme values here, extreme values here, but we know that the boxplot has a certain rule for detecting extreme values. I didn't mention it in the previous example, but in a boxplot of normally distributed data, approximately 99.3 percent of the points fall inside the whiskers and the remaining 0.7 percent or so are flagged as outliers.
So if you have, for example, ten thousand points, you can expect around 70 of them to be flagged as outliers. In this case I have 50,000 points, so I can expect approximately 350 outliers, right? But again, if you take samples, you can see that if the sample size is large, the behavior of your sample points is close to the behavior of your population points. But if the sample size is small, let me make it 5, for example, you can see how extreme it can be.
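The share of points a boxplot flags in normally distributed data can be computed rather than memorised. The sketch below assumes the usual whisker rule (fences at 1.5 times the interquartile range beyond the quartiles) and an exactly normal distribution; under those assumptions the share comes out at roughly 0.7 percent, whatever the mean and standard deviation are.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal; the result does not depend on mean or sd

q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1

# Usual boxplot rule (assumed here): anything beyond 1.5 * IQR from the box is an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outside = z.cdf(lower_fence) + (1 - z.cdf(upper_fence))

print(f"fences at {lower_fence:.2f} and {upper_fence:.2f} standard deviations")
print(f"share outside the whiskers: {100 * outside:.2f}%")
print(f"expected outliers among 50,000 points: {50_000 * outside:.0f}")
```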
For example this one, look. Of course, I mean, it's a bad idea to make a boxplot for 5 values, because boxplot requires 5 statistics, right? So you need some more values.
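To see why five values are too few for a boxplot, consider what the five statistics become. In the sketch below, `five_number_summary` is a helper written for this example, and the quartile convention is an assumption: it uses the "inclusive" method of Python's `statistics.quantiles`, which is only one of several conventions in use. With that convention and exactly five values, the five statistics are simply the five values themselves, so the boxplot summarises nothing.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


iq_sample = [85, 95, 100, 110, 120]  # made-up sample of five IQ scores
summary = five_number_summary(iq_sample)
print("sample: ", sorted(iq_sample))
print("summary:", summary)
```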
But look at how far the percentile curve, the percentile polygon in this case, is from the theoretical curve. So the smaller the sample size, the more extreme your sample can be, and the less representative, so to speak.
So the uncertainty will be larger. So just play with it and then hopefully you will understand it a little bit better.