Transcript for:
Understanding Big Data and Statistical Inference

Okay, so we are now on lecture four: big data, statistical inference, and practical significance. So let's jump right in.

Sampling error is the deviation of the sample from the population that results from random sampling. If repeated independent random samples of the same size are collected from the population of interest using probability sampling techniques, then on average the samples will be representative of the population. That's true. But the random collection of sample data does not ensure that any single sample will be perfectly representative of the population of interest, so any individual sample that you take is going to have a sampling error. Sampling error is unavoidable when collecting a random sample, by definition of random. (There is a small simulation sketch a little further below that illustrates this.)

Part of this slide got cut off, which is not fun, but thankfully I know what that sentence says; the term that has been cut off is non-response error. Anyways, non-sampling error: deviations of the sample from the population that occur for reasons other than random sampling are referred to as non-sampling error. If the research objective and the population from which a sample is to be drawn are not aligned, the data collected will not help accomplish the research objective; this type of error is referred to as coverage error. Even when the sample is taken from the appropriate population, non-sampling error can occur when segments of the target population are systematically underrepresented or overrepresented in the sample; this type of error is referred to as non-response error.

Now, care must be taken to minimize the introduction of non-sampling error, and that involves carefully defining the target population, carefully designing the data collection process, and training the data collectors. Pre-test the data collection procedure to identify potential sources of error. What that means in practice is that if you are collecting data from a manufacturing process, you actually try to collect some of the data, and if it is a survey, you actually run the survey; pre-testing the data collection procedure to identify potential sources of error means you try to see what breaks. Use stratified random sampling when population-level information about an important qualitative variable is available. Use cluster sampling when the population can be divided into heterogeneous subgroups or clusters; heterogeneous means of differing types, and the opposite, all of the same type, would be homogeneous. Use systematic sampling when population-level information about an important quantitative variable is available. And recognize that every random sample will suffer from some degree of sampling error. You cannot get rid of it; it's just going to happen.

Recent estimates state that approximately 2.5 quintillion bytes of data are created worldwide each day. That's a lot, a lot. When the data sets that are generated are so large or complex that current data processing capacity and/or analytical methods are not adequate for analyzing the data, these are examples of big data. Big data is a term that is thrown around a lot, and most people don't actually understand what we mean when we say big data. Big data doesn't mean that you have a hundred samples; big data doesn't mean you have a thousand samples, although a thousand samples is a large data set. Big data is where you have hundreds of thousands or millions of individual cells in what you would imagine your table to be.
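As a quick aside before moving on to sources of big data, here is a minimal R sketch of the sampling error idea above. All of the numbers in it are made up purely for illustration: it builds a pretend population of website visit times, draws a few random samples, and shows that each sample mean misses the true population mean by a different amount.

```r
# Minimal sketch (made-up population, purely illustrative): sampling error in action.
set.seed(42)

# A pretend "population" of 100,000 customer visit times (in seconds).
population <- rgamma(100000, shape = 2, rate = 0.05)
true_mean  <- mean(population)

# Draw 5 independent random samples of size 50 and compare each sample mean
# to the population mean -- every sample misses by a different amount.
sample_means <- replicate(5, mean(sample(population, size = 50)))

round(true_mean, 2)
round(sample_means, 2)
round(sample_means - true_mean, 2)  # the sampling error of each individual sample
```

No matter how carefully the samples are drawn, those deviations never disappear; they only shrink, on average, as the sample size grows, which is exactly where big data enters the picture.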
So, sources of big data: sensors, mobile devices, internet activity, digital processes, and social media. A few years ago a petabyte of data seemed almost unimaginably large, but we now routinely describe data in terms of yottabytes. This of course raises the question: what is a yottabyte? A yottabyte is 10 raised to the 24th power bytes, which is an extremely, extremely large number. Some common terminology for describing the size of data sets is the following; note that the number of bytes as it appears on the left of the slide's table does not exactly correspond to the powers of ten as they would traditionally appear. We have kilobyte, megabyte, gigabyte, terabyte, petabyte, exabyte, zettabyte, and yottabyte. So, very large things, and there are also measurements beyond that.

Understanding what big data is: the processes that generate big data can be described by four attributes or dimensions. Volume, the amount of data generated; variety, the diversity in types and structures of data generated; veracity, the reliability of the data generated; and velocity, the speed at which the data are generated. These are common attributes you can describe for any data set. Big data can be tall data, a data set that has so many observations that traditional statistical inference has little meaning; we will talk about that. Big data can also be wide data, data that has so many variables that simultaneous consideration of all variables is infeasible. Now, we don't mean theoretically infeasible; we mean practically infeasible, computationally infeasible.

So, how the standard error of the sampling distribution of the sample mean time spent by individual customers when they visit a website decreases as sample size increases: that is an example of big data and sampling error. The standard error of the sampling distribution of the proportion of the sample that clicked on any of the ads featured on some website also decreases as the sample size increases. What you're looking at, when you're looking at these kinds of things, is that the standard error gets smaller and smaller for any kind of reasonable error term, and so our standard error term scrunches down and, quickly in terms of powers of ten, gets smaller and smaller. What that does is that when we're looking at terms like the standard error of the proportion, that standard error shrinks down to a very small size, and you'll notice that the slide's example uses a sample size of 1 billion. So, practically speaking, for a very reasonable data set for a company, you could be talking about a practically zero standard error.

Now, confidence intervals are powerful tools for making inferences about population parameters, but the validity of any interval estimate depends on the quality of the data used to develop the interval estimate. Confidence intervals of the population mean and population proportion become narrower as the size of the sample increases, and so they become less and less meaningful; they start losing their "intervalness" in some sense. The potential sampling error also decreases as the sample size increases, so you have your inherent error built in. At a given confidence level, here 95% under the t distribution, your margin of error term for your confidence interval shrinks and shrinks. Recall that a confidence interval is composed of two parts: the margin of error term and the point estimate.
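Here is a minimal R sketch of that shrinking behavior. The standard deviation and proportion used are assumptions chosen just for illustration, not values from the slides; the point is only to show how the standard errors and the 95% margin of error collapse as n grows toward a billion.

```r
# Minimal sketch (assumed sigma and p, chosen only for illustration):
# the standard errors and the 95% margin of error shrink as n grows.
sigma <- 20     # assumed population std. dev. of time on site (seconds)
p     <- 0.51   # assumed population proportion clicking an ad
n     <- c(1e2, 1e4, 1e6, 1e9)

se_mean <- sigma / sqrt(n)         # standard error of the sample mean
se_prop <- sqrt(p * (1 - p) / n)   # standard error of the sample proportion

# 95% margin of error for the mean, using the t distribution
moe_mean <- qt(0.975, df = n - 1) * se_mean

data.frame(n, se_mean, se_prop, moe_mean)
```

With these assumed inputs, by n = 1 billion both standard errors and the margin of error are down in the thousandths or smaller, which is the "practically zero" situation described above.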
Well, the plus-and-minus piece, when the margin of error term goes to zero, kind of loses any practical significance; you're not really varying the estimate by much. And it's even worse when you're talking about the normal distribution, where the margin of error works out to something like 0.00003 (a decimal point, four zeros, and a three), which is a very small number from a computational standpoint when it comes to most things in regards to probability.

So, implications of big data for confidence intervals. There are three possible reasons for a sample mean to differ from last year's population mean: sampling error, non-sampling error, and the population mean having actually changed since last year. That last one needs to be an acceptable possibility; population means do change. For instance, if we're talking about global height, average height actually does have a tendency to change as people's protein sources and things like that change globally.

That slide got cut off too, which is a bummer, but continuing with the implications of big data for confidence intervals: even if the sample has provided reliable evidence that the population mean has changed since last year, you must still consider the potential impact of the difference between the sample mean and the mean from last year. For example, if a 0.1-second difference has a consequential effect on what PDT can charge for advertising on its site, this result could have practical business implications for PDT (where PDT is an example company). Otherwise, there may be no practical significance to the 0.1-second difference. Confidence intervals are extremely useful, but they are only effective when properly applied. Interval estimates become increasingly precise as the sample size increases, and extremely large samples will yield extremely precise estimates, but no interval estimate, no matter how precise, will accurately reflect the parameter being estimated unless the sample is relatively free of non-sampling error. When using interval estimation, it is always important to carefully consider whether a random sample of the population of interest has been taken.

The standard errors of the associated sampling distributions also decrease as the sample size increases. When the sample size n is very large, almost any difference between the sample mean x-bar and the hypothesized population mean mu-zero results in rejection of the null hypothesis. For instance, on this table the p-value associated with a given difference between a point estimate and a hypothesized value of a parameter decreases as the sample size increases, and you can observe that for both a t-distribution computation and a z-distribution computation (there is a small sketch of this behavior after the closing remarks below). So the results of any hypothesis test, no matter the sample size, are only reliable if the sample is relatively free of non-sampling error. As an implication, if non-sampling error is introduced in the data collection process, the likelihood of making a Type I or Type II error may be higher than if the sample data are free of non-sampling error. When testing a hypothesis, it is always important to think carefully about whether a random sample of the population of interest has been taken. No business decision should be based solely on statistical inference; practical significance should always be considered in conjunction with statistical significance.

In our next lecture, which is actually going to be a sort of lecture snippet, we're going to look at using R for doing some of these calculations and computations. See you in the next lecture.
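As referenced above, here is a minimal R sketch of the shrinking p-value behavior. Every number in it (the hypothesized mean, the 0.1-second observed difference, and the sample standard deviation) is an assumption chosen only to make the pattern visible, not a value from the slides or from PDT's data.

```r
# Minimal sketch (all numbers assumed for illustration):
# the same 0.1-second difference yields smaller and smaller p-values as n grows.
mu0  <- 84      # hypothesized population mean time on site (seconds), assumed
xbar <- 84.1    # observed sample mean: a 0.1-second difference, assumed
s    <- 20      # sample standard deviation, assumed
n    <- c(1e3, 1e5, 1e7, 1e9)

t_stat <- (xbar - mu0) / (s / sqrt(n))

# Two-tailed p-values from the t distribution and from the standard normal (z)
p_t <- 2 * pt(-abs(t_stat), df = n - 1)
p_z <- 2 * pnorm(-abs(t_stat))

data.frame(n, t_stat, p_value_t = p_t, p_value_z = p_z)
```

With these assumed numbers the p-value drops from roughly 0.87 at n = 1,000 to essentially zero by n = 10 million, even though the practical difference being tested is still only 0.1 seconds, which is exactly why practical significance has to be judged alongside statistical significance.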