Transcript for:
Data Generation Techniques in Statistics

Hello, ladies and gentlemen, this is Muskoka, and we're back in AP Statistics, this time in Chapter 4. We have learned so much in Chapters 1 through 3 about how to plot data, summarize data, describe data, and analyze data. What we haven't talked about yet, and what we will be doing in Chapter 4, is figuring out where all of this data has been generated from. So we're going to learn about different types of data generation techniques, the first being in section 4.1, samples and surveys.

After this section we should be able to figure out, within a study, which is the population and which is the sample. We want to talk about ways that data shouldn't be produced, namely voluntary response and convenience samples, and how data can be generated using random samples: simple random samples and other types such as stratified and cluster sampling. We'll discuss a little bit about when some of them are more convenient than the others. And then we also want to be able to explain undercoverage, nonresponse, question wording, and other forms of bias.

So let's talk a little bit about population. When we have a specific group of interest, every single individual in the group is our population. A sample is going to be a subset of that. When we collect data from the entire population, that's a census. For example, let's say we wanted to find the age of every single President of the United States at inauguration: that would be a census, because we would be collecting the age data from every single President. If we were looking at a subset of the population, that would be a sample of the population, and that's going to be some smaller number than all the individuals in our group of interest, or our population.

So this is a great graphic. You can see the population as the whole, kind of like a Venn diagram, and the sample is a subset of it. When we collect a sample, or collect data from a sample, our goal is to be able to infer about the population, to make some kind of a determination based on the small subset of
data that we get.

Okay, so how can we choose a sample? The first thing that we need to do is define the population of interest. The next thing that we need to do is figure out what specific data we want: whether it's a numerical variable or a categorical variable, and what we want to measure. Then we'll need to come up with some type of data collection mechanism, such as a survey. And last of all, we need to decide how to choose the sample. So from the population, what is the sample that we're going to be using? Are we going to put names in a hat? Are we going to use a random number generator on our calculator? There are lots of different things that we can do, but we want to make sure that when we collect the data, we have collected it in a way that is going to allow us to make the inferences that we want to make.

Some data collection mechanisms are going to be less appropriate because they introduce bias, and one of those is what we would call a convenience sample. This is kind of like if you want to have information about people's opinions about recycling, and you go out to the beach because you know it's a beautiful Saturday and there are gonna be plenty of people out at the beach, and you stand at the beach and ask people the questions in your survey all about recycling. Well, yes, it's convenient. But on the other hand, the data that we're gonna get is from people who are already in the outdoors, probably enjoying it. You know, that's why they've gone out there. They might even be doing a beach cleanup. And their idea of recycling is going to be very different than if we go out to a mall and ask the people who spend that beautiful Saturday indoors. So a convenience sample doesn't always give us unbiased data, which is why we don't want to do that.

You may want to pause the video somewhere along the way so that you can take notes, by the way.

Another thing that we don't want to do is voluntary response. With voluntary response, what we're doing is we're throwing a
question out there, such as on the radio, on the television, or in an online survey, and we ask people to call in, or to write in, or to fill out an online survey. What we find is that we end up having data that is biased, because the only people who voluntarily represent themselves in these data collection mechanisms are people who have super strong opinions. So we often have very one-sided results from people who have very strong opinions. We don't want that, because whatever information we get from our survey, we want it to be representative of our entire population of interest. If we're only looking at the people who have super strong opinions in one direction, then it doesn't look like our whole population; it looks just like that one segment of the population.

So a better way of sampling is what we call a simple random sample, and this involves a chance process, such as names being chosen out of a hat. That's kind of the typical example that we keep on going to. What we call a simple random sample, not just a random sample but a simple random sample, works like this: imagine that we're going to choose n individuals out of our population to be included in our sample. A simple random sample gives every individual, and every group of n individuals, the same probability, or an equal chance, of being selected and being a part of the sample. So it's got to have both of those requirements, by the way: not just that every individual has the same probability of being chosen, but also that every group of size n has the same probability of being chosen.

Now this could happen a lot of different ways. We can use a low-tech approach or a high-tech approach. If we were going to use calculators, we would list out all of the individuals from whom we will be sampling, and we can generate random integers using our calculator to select the appropriate number, however many we need in our sample. If we're using Table D, which is going to be given to you in test
questions or on the exam at the end of the year, we're going to start out the same way: we're going to list out, or number, each of the members of the population from whom we will be selecting. We want to use as few digits as possible, so if we need 10, we're going to number them 0 through 9, not 1 through 10, because 0 through 9 uses only one digit. The next thing that we're going to do is start at a randomly selected line on our table of random digits. We're going to select the appropriate number of digits at a time, ignoring repeats and ignoring values that are out of range, and we're going to continue until we get the sample size that we wanted.

This is an example of how to use Table D. We have numbered all of our different hotels 01 through 28. We need to have 28 different hotels numbered, so that means we're going to need two digits. We're gonna select two digits at a time from our random digit table; that's the line in gray. The first one is 69: that's out of range. The next one is 05: that one is in range and represents Beach Castle. The next number is 16, also in range; that represents the Radisson. The next number is 48: that's out of range. The next number is 17, the Ramada. The next number is 87: that's out of range. Now notice we come to 17 again. 17 is a repeat. We've already chosen the Ramada, so we can't choose the Ramada again, and every single time we have a 17 we're gonna need to ignore it now, because we already chose 17. Now, this is not always going to be the case, but when you're representing specific individuals with this numbered list, then you will ignore repeats. So our last value that is going to be in range will be number 20, which is the Sea Club hotel. That's how you use the table of random digits. So we have our four hotels chosen: Beach Castle, Radisson, Ramada, and Sea Club.

We also have something called a stratified random sample, and sometimes that one is super, super convenient.
We would use this because there's a specific characteristic shared by different individuals that is meaningful to our survey. So what we do is we start out by stratifying, or separating the population into groups of similar individuals. Then from each stratum, each of those different groups that we've separated out, we're going to perform a simple random sample. When we collect the individuals that have been sampled from each of the different strata, those are going to be combined, and that's going to create our sample.

The other method that we're going to be using is called cluster sampling, and this one is often geographically based. Now, stratified can also be geographic, but here's the key difference. When we're doing cluster sampling, we're going to split up into groups just like we did for our stratified, but this time we're going to randomly select clusters through a simple random sample, and then we're going to survey, or collect data from, every individual within each chosen cluster. In stratified, we did a simple random sample within each stratum; in cluster, we're gonna collect data from every individual within the cluster. Key difference.

Okay, so what do we want to gain from this? Well, the whole purpose of performing the sampling is so that we can draw inferences about our population. So we want to make sure that we have avoided bias in selecting our samples; that's why we avoid convenience samples and we avoid voluntary response. And we want to make sure that we're able to infer by having a large enough sample, so that we have some good information, and also so that if we have things like nonresponse, people who refuse to answer our survey, we don't have such a small sample size that we can't really make any inferences.

Now, I used the terminology of undercoverage, so that leads us to what can go wrong when you're running a survey. One of the things that can happen is you have undercoverage. Undercoverage is when you have specific members of the population that there's no way they can be part of your sample. You've forgotten about them, basically; you're not choosing them. One example would be if you're sending out mail-out surveys and you want to get opinions of adults in your county. If you forget to include people who are in prison, or college students, or, depending on where you got your addresses from, people who are not in the telephone book or have unlisted numbers, or things like that, you're gonna systematically prohibit some of those individuals from being a part of your survey. So that's undercoverage: when they cannot be chosen in the sample. There's no way, because you haven't included them in your selection method.

Another thing that can happen is this idea of nonresponse, where you send out a hundred surveys but you only get, let's say, ten surveys back. Nonresponse is when they either can't be contacted or they refuse to participate. So if you do a phone survey or a mail-in survey, those individuals who choose not to respond, or who have literally moved away so that you can't contact them, would be a nonresponse type of situation.

Anytime you have a problem with your survey method and you have a systematic pattern of responses, either for or against any particular issue, what you've done is you've created response bias. Very often we see surveys with wording that will, purposefully, cause a systematic response in a specific way, and there are plenty of examples in your textbook about this. This is an example where you want people to express an opinion in a specific way because it benefits your cause, and what you do is you set the wording in the question so that it points them in a specific direction. This is something we have to be wary of, and we have to be cautious using the data. If we know what the questions are, we can be a little bit more confident in the results, but if we don't know what the
questions are, we don't know if they were biased questions, so we need to be really cautious about that.

We have hit all of our objectives. We will come back and take a look at section 4.2, which is all about experiments. See you then.
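
For anyone who wants to try the sampling methods from this section on a computer, here is a minimal Python sketch. It is not from the video or the textbook. The function names, the seed, and the little twelve-person population are all made up for illustration. The digit string is also an assumption: its first seven pairs (69, 05, 16, 48, 17, 87, 17) are the ones read out in the hotel walkthrough, but it then jumps straight to 20 instead of reproducing the real line of Table D.

```python
import random


def sample_from_digit_table(digits, max_label, sample_size, width=2):
    """Read `width` digits at a time; labels are assumed to run 1..max_label."""
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        label = int(digits[start:start + width])
        if not 1 <= label <= max_label:
            continue  # out of range, e.g. 69 when labels run 01 through 28
        if label in chosen:
            continue  # repeat, e.g. the second 17
        chosen.append(label)
        if len(chosen) == sample_size:
            break
    return chosen


def simple_random_sample(population, n, rng):
    """Every group of n individuals is equally likely to be the sample."""
    return rng.sample(population, n)


def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample inside every stratum, then combine them."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters, then keep every individual in them."""
    picked = rng.sample(sorted(clusters), n_clusters)
    sample = []
    for name in picked:
        sample.extend(clusters[name])
    return sample


# Assumed digits: the pairs 69 05 16 48 17 87 17 come from the walkthrough;
# jumping straight to 20 afterwards is a shortcut, not the real table line.
DIGITS = "6905164817871720"
print(sample_from_digit_table(DIGITS, max_label=28, sample_size=4))  # [5, 16, 17, 20]

rng = random.Random(4)  # fixed seed (arbitrary) so reruns pick the same people
# Hypothetical population: 12 people split into three groups of four.
groups = {
    "north": ["N1", "N2", "N3", "N4"],
    "south": ["S1", "S2", "S3", "S4"],
    "west": ["W1", "W2", "W3", "W4"],
}
everyone = [person for members in groups.values() for person in members]

print(simple_random_sample(everyone, 4, rng))
print(stratified_sample(groups, 2, rng))  # two people from every group
print(cluster_sample(groups, 1, rng))     # all four people from one group
```

Running it shows the key difference described above: the stratified sample always contains someone from every group, while the cluster sample contains everyone from the chosen group and no one from the others.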