Lecture 4: Big Data, Statistical Inference, and Practical Significance

Sampling Error

Definition: Deviation of the sample from the population due to random sampling.
Independent random samples are generally representative of the population.
Unavoidable: Every random sample will have some sampling error.
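
Example: a minimal simulation sketch of sampling error, using only the Python standard library. The population (10,000 normally distributed values), the sample sizes, and the function name sampling_errors are illustrative assumptions, not part of the lecture.

```python
import random
import statistics

def sampling_errors(population, sample_size, n_samples, seed=0):
    """Return (sample mean - population mean) for repeated simple random samples."""
    rng = random.Random(seed)
    pop_mean = statistics.fmean(population)
    errors = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)  # drawn without replacement
        errors.append(statistics.fmean(sample) - pop_mean)
    return errors

# Hypothetical population: 10,000 values from a normal distribution.
_rng = random.Random(42)
population = [_rng.gauss(50, 10) for _ in range(10_000)]

small = sampling_errors(population, sample_size=25, n_samples=200)
large = sampling_errors(population, sample_size=2500, n_samples=200)

# Average absolute deviation of the sample mean from the population mean.
print("typical error, n=25:  ", statistics.fmean(abs(e) for e in small))
print("typical error, n=2500:", statistics.fmean(abs(e) for e in large))
```

Every sample mean misses the population mean by some amount; the larger samples typically miss by less, but the error does not vanish.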

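Non-Sampling Error

Definition: Deviation of the sample from the population that is not due to random sampling.
Coverage Error: The population from which the sample is drawn does not align with the research objective.
Non-Response Error: Segments of the target population are systematically under-represented or over-represented in the sample.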
Minimizing Non-Sampling Error:
- Define target population carefully.
- Design data collection process meticulously.
- Pre-test data collection methods.
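
Example: a sketch of why non-response error calls for careful design rather than more data. The satisfaction-score population and the response model (people with higher scores respond more often) are assumptions made up for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: satisfaction scores on a 0-100 scale.
population = [rng.uniform(0, 100) for _ in range(200_000)]
true_mean = statistics.fmean(population)

def survey(n):
    """Contact n random people and return the scores of those who respond."""
    contacted = rng.sample(population, n)
    # Assumed response model: probability of responding rises with the score.
    return [x for x in contacted if rng.random() < 0.2 + 0.6 * x / 100]

for n in (100, 10_000, 100_000):
    responses = survey(n)
    bias = statistics.fmean(responses) - true_mean
    print(f"contacted={n:>7}  responded={len(responses):>6}  bias={bias:+.2f}")
```

Under these assumptions the estimate stays too high no matter how many people are contacted, because the error is systematic rather than random.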

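Stratified Random Sampling: Use when population-level information about an important qualitative variable is available, so the sample can be made representative with respect to that variable.
Cluster Sampling: Use when the population can be divided into heterogeneous subgroups (clusters).
Systematic Sampling: Use when population-level information about an important quantitative variable is available, so the sample can be made representative with respect to that variable.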
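
Example: a minimal sketch of stratified and systematic sampling (cluster sampling is omitted for brevity). The customer records, the function names, and the choice of proportional allocation are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(units, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (proportional allocation)."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(units, key, n, rng):
    """Order units by a quantitative variable, then take every k-th from a random start.

    Assumes 1 <= n <= len(units).
    """
    ordered = sorted(units, key=key)
    step = len(ordered) // n
    start = rng.randrange(step)
    return ordered[start::step][:n]

rng = random.Random(7)
# Hypothetical customers: (region, annual_spend).
regions = ["north"] * 1500 + ["south"] * 1000 + ["east"] * 500
customers = [(r, rng.uniform(0, 1000)) for r in regions]

by_region = stratified_sample(customers, lambda c: c[0], 0.05, rng)
by_spend = systematic_sample(customers, lambda c: c[1], 150, rng)

print(Counter(region for region, _ in by_region))  # same proportions as the population
print(len(by_spend), "customers spread across the full range of spend")
```

The stratified sample matches the population on the qualitative variable (region); the systematic sample spreads evenly over the quantitative variable (spend).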

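Standard Error: Decreases as sample size increases; for a sample mean it is the population standard deviation divided by the square root of the sample size.
Confidence Intervals:
- Larger samples give narrower intervals.
- Extremely narrow intervals can be less meaningful: they reflect sampling error only, so any non-sampling error in the data is not captured by the interval.
Margin of Error: The half-width of a confidence interval; it shrinks toward zero as the sample size grows.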
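
Example: a small sketch of how the standard error and margin of error of a sample mean shrink with sample size. It assumes a known population standard deviation (sigma = 10) and a 95% confidence level (z = 1.96); both values and the function names are illustrative.

```python
import math

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

def margin_of_error(sigma, n, z=1.96):
    """Half-width of the confidence interval for the mean (z = 1.96 for 95%)."""
    return z * standard_error(sigma, n)

def confidence_interval(sample_mean, sigma, n, z=1.96):
    """Interval estimate: sample mean plus or minus the margin of error."""
    moe = margin_of_error(sigma, n, z)
    return sample_mean - moe, sample_mean + moe

sigma = 10.0  # assumed known population standard deviation
for n in (100, 10_000, 1_000_000):
    low, high = confidence_interval(50.0, sigma, n)
    print(f"n={n:>9}  SE={standard_error(sigma, n):.4f}  interval=({low:.4f}, {high:.4f})")
```

Multiplying the sample size by 100 divides the standard error, and hence the interval width, by 10.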

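A difference between sample means may be due to sampling error, non-sampling error, or a real change in the population mean.
Business Implications: Even a small difference can have a significant effect in practice, for example when it applies across a very large number of customers or transactions.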

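Very Large Samples: Almost any difference between the sample estimate and the hypothesized value, however small, may lead to rejection of the null hypothesis.
P-Value: For a given nonzero difference, decreases as the sample size increases.
Non-Sampling Errors: Increase the risk of Type I or Type II errors, and a larger sample does not remove them.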
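
Example: a sketch of how the p-value for a fixed, small difference falls as the sample grows. It assumes a two-sided z-test with known sigma; the numbers (hypothesized mean 100, observed mean 100.1, sigma 10) are made up for illustration.

```python
import math

def two_sided_p_value(sample_mean, mu0, sigma, n):
    """Two-sided z-test p-value for H0: population mean equals mu0."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

mu0, observed, sigma = 100.0, 100.1, 10.0  # hypothetical values
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(observed, mu0, sigma, n)
    print(f"n={n:>9}  p-value={p:.3g}")
```

The same 0.1 difference is far from significant at n = 100 and overwhelmingly significant at n = 1,000,000; whether a difference of that size matters is a question of practical significance, not of the test.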

Big Data

Definition: Data sets too large or too complex to be handled by standard data-processing techniques and typical desktop software.

Sources: Sensors, mobile devices, internet activities, digital processes, social media.

Size Terminology

Units (each 1,000 times the previous): Kilobyte (10^3 bytes), Megabyte (10^6), Gigabyte (10^9), Terabyte (10^12), Petabyte (10^15), Exabyte (10^18), Zettabyte (10^21), Yottabyte (10^24).
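
Example: a small helper sketch for moving between raw byte counts and the units above. It assumes decimal (powers of 1,000) units; the function names are illustrative.

```python
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def bytes_in(unit):
    """Number of bytes in one of the given unit (decimal convention)."""
    return 1000 ** UNITS.index(unit)

def human_readable(n_bytes):
    """Return (value, unit) using the largest unit that keeps value >= 1."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return value, unit
        value /= 1000

print(human_readable(1500))          # 1.5 kilobytes
print(human_readable(10**12))        # 1 terabyte
print(bytes_in("petabyte"))          # 10**15 bytes
```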

Attributes of Big Data

Volume: Amount of data.

Variety: Diversity in types and structures.

Veracity: Reliability of the data.

Velocity: Speed of data generation.

Types of Big Data

Tall Data: Many observations (rows).

Wide Data: Many variables (columns).