Hi everyone, this is Matt Teachout with Intro Stats, and today we're continuing our discussion of the various ways of collecting data. This is the part two video. In our last video we looked at collecting data with a census, a convenience sample, a voluntary response sample, and a simple random sample, and we're going to finish up with just a few more ways that people collect data.

So we're on method five, which is the cluster sample. A cluster sample means you're collecting data from multiple groups of people in your population instead of one person at a time. Obviously, if you do a simple random sample and you're collecting from individual people or objects one at a time, it can take a long time to collect the data, and it can also be very expensive. But as we mentioned before, a simple random sample is worth the time and money, because you usually get a dataset that's pretty representative. Clustering is a way of getting data from multiple groups: you're finding little pockets of people in your population that you can collect data from. So instead of collecting data from one person at a time, maybe I'm collecting data from 20 people at a time.

Now, one of the keys that we saw last time was that if you want to minimize bias, you need to have some kind of random technique. In other words, if you're going to collect data from groups of people in a cluster sample, you want to pick the groups randomly. For example, let's suppose that our population is all students at the college. Instead of picking student ID numbers and picking one student at a time, maybe I can have the computer randomly select the section numbers of classes. Or if I had a column of data that had all the classes, I could just have a computer randomly select cells out of that column. Once that's done, I could go to those classrooms, as long as their teachers wouldn't mind, and get data from
all the students in those classes. So maybe the computer picked 20 classes at the college, and I got data from most of the students in almost every one of those 20 classes. That would be a way for me to collect data quicker, and it usually saves money too. It also would be pretty good: it would still be a random sample. Every student at the college would have a chance of being chosen; it's just that their class was chosen. Now, they wouldn't all have an equal chance of being chosen, but they still would have a chance. So it is pretty good. To me, a simple random sample is still a little bit better, but clustering is used sometimes.

Now, what if I didn't choose the groups randomly? What if, instead of picking classes randomly with a computer, I just selected the five or six classes that I teach? Well, it's still clustering, because I'm still getting data from groups of students, but it wouldn't be very good. Now not every student at the college has a chance of being chosen. I've kind of taken clustering and convenience and smashed them together, and I've done something very convenient for me: it's very easy for me to get data from students that I actually teach and am in a physical class with.

So that's the cluster sample. Now, one mistake people make: remember, it should be multiple groups. I always get someone who says, "Well, I just went to one class, and I got data from just the class that I'm in. Is that a cluster?" And I would say no, that sounds more like convenience data. You just went to the class that you sit in and got data from the people you were sitting next to. A cluster sample usually means multiple groups: you have to go to multiple groups of people or objects for it to be considered a cluster sample. And remember, we do want the groups to be chosen randomly. That's a real key.

Number six is another method, called a stratified sample. So a stratified
sample is a comparison study. I like to think of it as a comparison study. Stratified samples are one of the most common studies we do in statistics, because we're always trying to compare: we're comparing people that took the medicine to people that took the placebo, or we're trying to compare people from one state with people from another. So if you want to compare two big groups, or, as I like to think of it, multiple populations, what you usually do is take a simple random sample from each group. I take a simple random sample from each of those two populations and then compare the simple random samples. So it's almost like taking two or more simple random samples.

Let me give you an example. Suppose I want to compare the average salary of working adults in California to the average salary of working adults in Arizona. Arizona and California are pretty close to each other, and suppose I want to do a comparison study. Well, I could just take a simple random sample of people in California and then a simple random sample of people in Arizona, and then compare them. By the way, they don't have to have the same sample size. A lot of people think you have to collect the exact same amount of data from each group. You actually don't; as long as they're decently large random samples, you're pretty much okay.

So think of stratified as a comparison study: am I trying to compare one thing to another? With clustering, there's usually just one population you're dealing with, and you're trying to get just one sample from that population, but you're collecting data from groups of people in that one population instead of one at a time. With stratified, think of it as comparing big groups or comparing populations. That's a good way to think of it.

Now, what if I don't choose randomly? What if I just chose my friends? What if I chose my friends that live in California as my sample for
California, and my friends that live in Arizona as my sample for Arizona? Well, first of all, that wouldn't be good. It would have a lot of bias, and it wouldn't reflect California and Arizona very well. So again, the main thing to keep in mind: random minimizes bias. It doesn't totally eliminate it; we'll get into that in our next video, where we'll be talking about bias and other ways you can mess up datasets. But if you don't have a random sample, you're usually going to have quite a bit of bias.

All right, let's look at the last one. Our last method is called systematic. This is where you use a system of some kind to select the people or objects to collect data from. So you're collecting a sample, but it isn't necessarily a random sample; you're using some kind of system. For example, people in a business might tell their employees, "Every fifth person that comes into the store today, ask them this question." By the way, that probably would not be very good. That'd be borderline convenience, because it's so easy, and it wouldn't reflect all of your customers, just the customers that came in on that day. So that would probably have a lot of bias.

A pretty good systematic sample, actually: what if I had a list of my entire population? Say I had a thousand COC students at my college, and I had a list of all their names, an alphabetical list maybe, and then I just took every 50th person on the list. Maybe I did number 50, and then number 100, and 150, and 200, and so on. That would probably be pretty good; the list does have the entire population of the college on it. Now, the only issue with that is it really would not be a random sample, because if you think about it, numbers 1 through 49 on the list had no chance of being chosen,
only number 50 and number 100 and number 150 and so on. But sometimes you'll see people randomize the first choice before they apply the system. A very common thing that you'll see data miners and data scientists do is have a computer randomly select a number between 1 and 50. Let's suppose the computer chose 17. They would go to the 17th person on the list, and then they would take every 50th person from there: 17, 67, 117, 167, and so on. So they basically have a random choice for the first pick only, and then use their system after that. That actually would be a random sample, because everybody on the population list would have a chance of being chosen. In fact, with a list of a thousand, everybody would have the same 1-in-50 chance, although strictly speaking it is not quite a simple random sample, since two people next to each other on the list can never both be chosen. So there are some good and bad systematic samples, but that's another technique that is sometimes used when people collect data.

So one of the takeaways from this discussion is that the way you collect data matters. It matters a lot in terms of what you can say about populations. Collecting data in a very biased way can have bad consequences, and that's why we're going over some of the good and bad ways of collecting data.

All right, so this has been various ways to collect data. This was our part two video. This is Matt Teachout with Intro Stats, and I will see you next time.
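
To make the cluster, stratified, and systematic ideas from this video a little more concrete, here is a minimal Python sketch using only the standard library. Everything in it is a made-up assumption for illustration: the 1,000-student roster, the 50 sections of 20 students each, the placeholder California and Arizona lists, and the function names themselves. It is one possible way to set these up, not the only one.

```python
import random


def cluster_sample(sections, n_clusters, rng):
    """Randomly pick whole groups (clusters), then keep everyone in them.

    `sections` maps a section number to the list of students in it.
    """
    chosen = rng.sample(sorted(sections), n_clusters)
    students = [s for sec in chosen for s in sections[sec]]
    return chosen, students


def stratified_sample(strata, sizes, rng):
    """Take a separate simple random sample from each group being compared.

    The sample sizes do not have to match across groups.
    """
    return {name: rng.sample(members, sizes[name])
            for name, members in strata.items()}


def systematic_sample(population, step, rng):
    """Pick a random start among the first `step` entries, then every `step`-th."""
    start = rng.randrange(step)  # 0-based position of the first pick
    return population[start::step]


if __name__ == "__main__":
    rng = random.Random()  # pass a seed, e.g. random.Random(1), to reproduce a run

    # Hypothetical roster: 1,000 student IDs split into 50 sections of 20.
    roster = [f"S{i:04d}" for i in range(1, 1001)]
    sections = {sec: roster[sec * 20:(sec + 1) * 20] for sec in range(50)}

    chosen, students = cluster_sample(sections, n_clusters=5, rng=rng)
    print("cluster: sections", chosen, "->", len(students), "students")

    # Hypothetical placeholder IDs for the two populations being compared.
    strata = {
        "California": [f"CA{i}" for i in range(300)],
        "Arizona": [f"AZ{i}" for i in range(100)],
    }
    picked = stratified_sample(strata, {"California": 30, "Arizona": 20}, rng)
    print("stratified:", {name: len(ids) for name, ids in picked.items()})

    every_50th = systematic_sample(roster, step=50, rng=rng)
    print("systematic:", len(every_50th), "students, first is", every_50th[0])
```

The systematic function uses the randomized first choice, so nobody on the list is shut out the way numbers 1 through 49 are when you always start at number 50. The cluster function only randomizes which sections get picked; once a section is chosen, everyone in it is included.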