Understanding Big Data and Statistical Inference

Oct 2, 2024

Lecture 4: Big Data, Statistical Inference, and Practical Significance

Sampling Error

  • Definition: Deviation of the sample from the population due to random sampling.
  • Independent random samples are generally representative of the population.
  • Unavoidable: Every random sample will have some sampling error.
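As an illustration of the last point, the short simulation below draws several simple random samples from one population and reports how far each sample mean lands from the population mean. The population (10,000 normally distributed values), the sample size, and the function name sampling_errors are all assumptions made for this sketch, not part of the lecture.

```python
import random
import statistics

def sampling_errors(population, sample_size, n_samples, seed=0):
    """Return (sample mean - population mean) for repeated simple random samples."""
    rng = random.Random(seed)
    pop_mean = statistics.fmean(population)
    errors = []
    for _ in range(n_samples):
        # rng.sample draws without replacement, i.e. a simple random sample
        sample = rng.sample(population, sample_size)
        errors.append(statistics.fmean(sample) - pop_mean)
    return errors

# Hypothetical population: 10,000 values with mean near 100 and SD near 15.
pop_rng = random.Random(42)
population = [pop_rng.gauss(100, 15) for _ in range(10_000)]

errors = sampling_errors(population, sample_size=50, n_samples=5)
for i, err in enumerate(errors, start=1):
    print(f"sample {i}: sampling error = {err:+.2f}")
```

Each sample is drawn the same way from the same population, yet every one misses the population mean by a different amount. That deviation is the sampling error.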

Non-Sampling Error

  • Deviations not due to random sampling.
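  • Coverage Error: The population actually sampled does not align with the research objective (for example, the sampling frame omits part of the target population).
  • Non-Response Error: Some segments of the population are more or less likely to respond, so they are systematically under- or over-represented in the sample.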
  • Minimizing Non-Sampling Error:
    • Define target population carefully.
    • Design data collection process meticulously.
    • Pre-test data collection methods.
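A small simulation can show why non-response error is not cured by collecting more data. The sketch below assumes a hypothetical population of satisfaction scores from 1 to 10 and assumes that customers scoring 6 or higher respond three times as often as the rest. Both the scores and the response rates are invented for illustration.

```python
import random
import statistics

def respondent_mean(population, response_prob, n_contacted, rng):
    """Mean among respondents when the chance of responding depends on the value."""
    contacted = rng.sample(population, n_contacted)
    respondents = [x for x in contacted if rng.random() < response_prob(x)]
    return statistics.fmean(respondents)

def response_prob(score):
    # Assumption for this sketch: satisfied customers respond more often.
    return 0.6 if score >= 6 else 0.2

rng = random.Random(7)
# Hypothetical population of 100,000 satisfaction scores (1 to 10).
population = [rng.randint(1, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

estimates = {}
for n in (1_000, 50_000):
    estimates[n] = respondent_mean(population, response_prob, n, rng)
    print(f"contacted {n:>6}: estimate = {estimates[n]:.2f}, "
          f"population mean = {true_mean:.2f}")
```

Contacting fifty times as many people tightens the estimate around the wrong value. The bias comes from who responds, not from how many are contacted.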

Sampling Techniques

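  • Stratified Random Sampling: Divide the population into strata and draw a random sample from each. Use when qualitative population-level information is available to define the strata.
  • Cluster Sampling: Divide the population into clusters and sample entire clusters. Works best when each cluster is heterogeneous, i.e., a small-scale version of the population.
  • Systematic Sampling: Select every k-th element from an ordered list after a random start. Use when quantitative population-level information is available.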
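The three techniques can be sketched in a few lines each. The code below is a minimal illustration using only Python's standard library. The function names, the (id, region, store) records, and the choice of region as stratum and store as cluster are hypothetical and chosen only to make the mechanics visible.

```python
import random

def stratified_sample(units, stratum_of, per_stratum, rng):
    """Draw a simple random sample of `per_stratum` units from every stratum."""
    strata = {}
    for u in units:
        strata.setdefault(stratum_of(u), []).append(u)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(units, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every unit inside them."""
    clusters = sorted({cluster_of(u) for u in units})
    chosen = set(rng.sample(clusters, n_clusters))
    return [u for u in units if cluster_of(u) in chosen]

def systematic_sample(units, step, rng):
    """Take every `step`-th unit of the list after a random start."""
    start = rng.randrange(step)
    return units[start::step]

rng = random.Random(1)
regions = ["North", "South", "East", "West"]
# Hypothetical population of 200 records: (id, region, store number)
units = [(i, regions[i % 4], i % 10) for i in range(200)]

strat = stratified_sample(units, lambda u: u[1], per_stratum=5, rng=rng)
clus = cluster_sample(units, lambda u: u[2], n_clusters=2, rng=rng)
syst = systematic_sample(units, step=20, rng=rng)
print(len(strat), len(clus), len(syst))
```

Stratified sampling guarantees every region is represented, cluster sampling keeps all records from the chosen stores, and systematic sampling spreads the picks evenly through the list.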

Big Data

  • Definition: Data sets too large or too complex to be handled by standard data-processing techniques and software.
  • Sources: Sensors, mobile devices, internet activities, digital processes, social media.

Size Terminology

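  • Units (decimal, each 1,000 times the previous): Kilobyte (10^3 bytes), Megabyte (10^6), Gigabyte (10^9), Terabyte (10^12), Petabyte (10^15), Exabyte (10^18), Zettabyte (10^21), Yottabyte (10^24).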
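To make the scale of these units concrete, the helper below converts a raw byte count into the largest unit that keeps the value at 1 or more. It assumes the decimal convention, in which each unit is 1,000 times the previous one. The function name describe_size is hypothetical.

```python
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

def describe_size(n_bytes):
    """Express a byte count in the largest decimal unit with a value of at least 1."""
    value = float(n_bytes)
    idx = 0
    # Step up one unit for every factor of 1,000, stopping at the yottabyte.
    while value >= 1000 and idx < len(UNITS) - 1:
        value /= 1000
        idx += 1
    return f"{value:.1f} {UNITS[idx]}s"

print(describe_size(512))
print(describe_size(2_500_000_000_000))
```

Some systems use binary multiples of 1,024 instead, so reported sizes can differ slightly from what this sketch prints.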

Attributes of Big Data

  • Volume: Amount of data.
  • Variety: Diversity in types and structures.
  • Veracity: Reliability of the data.
  • Velocity: Speed of data generation.

Types of Big Data

  • Tall Data: Many observations.
  • Wide Data: Many variables.

Standard Error and Confidence Intervals

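  • Standard Error: Standard deviation of a sample statistic; for the sample mean it is s/√n, so it decreases as sample size increases.
  • Confidence Intervals:
    • Narrower intervals with larger samples.
    • With very large samples the interval can become so narrow that it conveys little beyond the point estimate, and it still reflects sampling error only, not non-sampling error.
  • Margin of Error: Half-width of the confidence interval (point estimate ± margin of error); shrinks as sample size grows.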
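The effect of sample size on these quantities can be checked numerically. The sketch below computes the standard error, margin of error, and a confidence interval for a mean using the normal approximation (a z value rather than a t value, which is reasonable only for large samples). The sample mean of 50 and standard deviation of 10 are made-up inputs, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation interval for a mean.

    Returns (standard_error, margin_of_error, lower, upper).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    se = sample_sd / math.sqrt(n)
    margin = z * se
    return se, margin, sample_mean - margin, sample_mean + margin

rows = []
for n in (100, 10_000, 1_000_000):
    se, margin, lower, upper = mean_confidence_interval(50.0, 10.0, n)
    rows.append((n, se, margin))
    print(f"n={n:>9,}: SE={se:.4f}  margin={margin:.4f}  "
          f"CI=({lower:.4f}, {upper:.4f})")
```

Every 100-fold increase in n cuts the standard error and the margin of error by a factor of 10, which is why intervals built from very large samples become extremely narrow.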

Implications for Confidence Intervals

  • Sample means may differ due to sampling error, non-sampling error, or changes in population mean.
  • Business Implications: With big data, even a small difference can have a meaningful effect when it applies across many customers or transactions.

Hypothesis Testing

  • Very Large Samples: Almost any difference may lead to rejection of the null hypothesis.
  • P-Value: For a given difference between the sample statistic and the hypothesized value, the p-value decreases as sample size increases.
  • Non-Sampling Errors: Increase the risk of Type I or Type II errors.
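The link between sample size and the p-value can be seen by holding the observed difference fixed and varying n. The sketch below uses a two-sided z-test for a mean with a known standard deviation, which is a simplifying assumption. The numbers (hypothesized mean 50.0, sample mean 50.1, standard deviation 10) are invented for illustration.

```python
import math

def two_sided_p_value(sample_mean, hypothesized_mean, sd, n):
    """Two-sided z-test p-value for H0: population mean equals hypothesized_mean."""
    z = (sample_mean - hypothesized_mean) / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

p_values = {}
for n in (100, 10_000, 1_000_000):
    p_values[n] = two_sided_p_value(50.1, 50.0, 10.0, n)
    print(f"n={n:>9,}: p-value = {p_values[n]:.3g}")
```

The same difference of 0.1 is far from significant at n = 100 but overwhelmingly significant at n = 1,000,000, even though nothing about the size of the difference has changed.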

Practical vs. Statistical Significance

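  • Statistical significance shows that a difference is unlikely to be due to sampling error alone; practical significance asks whether the difference is large enough to matter. Business decisions should consider both.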
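One way to keep the two ideas separate is to check them separately. The sketch below is only an illustration: it assumes a z-test for statistical significance and a hypothetical threshold, min_meaningful_diff, that the decision maker chooses as the smallest difference worth acting on. Neither the function nor the threshold comes from the lecture.

```python
import math

def assess_difference(observed_diff, standard_error, min_meaningful_diff, alpha=0.05):
    """Report statistical and practical significance as two separate checks."""
    z = observed_diff / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided z-test p-value
    return {
        "p_value": p_value,
        "statistically_significant": p_value < alpha,
        # Practical check: is the difference at least as large as the chosen threshold?
        "practically_significant": abs(observed_diff) >= min_meaningful_diff,
    }

# Hypothetical big-data case: tiny standard error, small observed difference.
result = assess_difference(observed_diff=0.1, standard_error=0.01,
                           min_meaningful_diff=1.0)
print(result)
```

Here the difference is statistically significant but falls short of the chosen threshold. Whether to act on it is a judgment about costs and benefits, not about the p-value.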

Next Steps

  • A future lecture will cover using R for these calculations.