Statistics Concepts Overview

Jun 24, 2025

Overview

This lecture provides an intuitive, non-mathematical introduction to key statistics concepts, focusing on data types, distributions, sampling, estimation, hypothesis testing, p-values, and the pitfalls of p-hacking.

Types of Data

  • Data is divided into categorical (categories) and numerical (numbers).
  • Categorical data splits into nominal (no order, e.g., team names) and ordinal (ordered, e.g., player positions).
  • Numerical data splits into discrete (countable, e.g., missed shots) and continuous (measurable, e.g., height).
  • Proportion/percentage data is a numerical summary of repeated nominal outcomes (e.g., the share of shots made).
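
As a rough illustration of the last point, the sketch below turns a list of nominal outcomes into a discrete count and a proportion. The shot log is made up for this example and does not come from the lecture.

```python
from collections import Counter

# Hypothetical shot log: each attempt is a nominal outcome with no order.
shots = ["made", "missed", "made", "made", "missed", "made", "missed", "made"]

counts = Counter(shots)                        # tally per category
missed_shots = counts["missed"]                # discrete numerical value (a count)
made_proportion = counts["made"] / len(shots)  # proportion summarizing the nominal outcomes

print(f"missed shots: {missed_shots}")
print(f"proportion made: {made_proportion:.3f}")
```

The individual outcomes stay categorical; only the summary (count or proportion) is numerical.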

Distributions

  • A distribution shows how values of a variable are spread (e.g., heights in the NBA).
  • Common distributions include the normal (bell curve), uniform (even spread), bimodal (two peaks), and skewed (asymmetric).
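  • The probability density function (PDF) shows how likely different values are relative to one another; for a continuous variable, the probability of landing in a range is the area under the curve over that range.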
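
A minimal sketch of a PDF in practice, using `statistics.NormalDist` from the Python standard library. The mean of 200 cm and standard deviation of 9 cm are assumed values chosen for illustration, not real NBA figures.

```python
from statistics import NormalDist

# Assumed, illustrative parameters: heights in cm, roughly bell-shaped.
heights = NormalDist(mu=200, sigma=9)

# The PDF is highest at the mean and falls off toward the tails.
density_at_mean = heights.pdf(200)
density_in_tail = heights.pdf(230)

# Probabilities come from areas under the PDF, computed here via the CDF.
prob_within_one_sd = heights.cdf(209) - heights.cdf(191)

print(f"density at 200 cm: {density_at_mean:.4f}")
print(f"density at 230 cm: {density_in_tail:.4f}")
print(f"P(191 <= height <= 209): {prob_within_one_sd:.3f}")
```

The density values are not probabilities themselves; only the area over an interval is.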

Sampling & Estimation

  • We use samples to estimate unknown population values (parameters, e.g., true shooting ability).
  • Sample statistics (like proportion or mean) estimate parameters, but always with some uncertainty.
  • Larger samples produce more precise (less variable) estimates.
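
The effect of sample size can be checked with a small simulation. In this sketch the "true shooting ability" of 0.55 is a hypothetical parameter that we pretend not to know, and the function names are invented for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_ABILITY = 0.55  # hypothetical parameter; unknown in a real study


def sample_proportion(n_shots):
    """Simulate n_shots attempts and return the proportion made."""
    made = sum(random.random() < TRUE_ABILITY for _ in range(n_shots))
    return made / n_shots


def spread_of_estimates(n_shots, n_repeats=2000):
    """Standard deviation of the sample proportion across repeated samples."""
    estimates = [sample_proportion(n_shots) for _ in range(n_repeats)]
    return statistics.stdev(estimates)


spread_small = spread_of_estimates(20)
spread_large = spread_of_estimates(500)

print(f"spread with 20 shots:  {spread_small:.3f}")
print(f"spread with 500 shots: {spread_large:.3f}")
```

Estimates from 500 shots cluster much more tightly around the true value than estimates from 20 shots, which is what "more precise" means here.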

Parameters & Sample Statistics

  • Common population parameters: μ (mean), σ (standard deviation), π or θ (proportion), ρ (correlation), β (regression slope).
  • Corresponding sample statistics: x̄ (sample mean), s (sample standard deviation), p (sample proportion), r (sample correlation), b (sample slope).
  • Greek letters represent unknown, fixed values; Roman letters denote sample-based estimates.

Hypothesis Testing

  • Hypothesis tests assess whether sample data provide enough evidence to reject a null hypothesis.
  • Null hypothesis (H₀): assumes no effect or status quo (e.g., player shoots ≤ 50%).
  • Alternative hypothesis (H₁): what you seek evidence for (e.g., player shoots > 50%).
  • Results are framed as rejecting or not rejecting H₀; never "proving" or "accepting."

P-values & Significance

  • The p-value is the probability, assuming H₀ is true, of observing a result at least as extreme as the one in the sample.
  • If p-value < significance level (commonly 0.05), reject H₀; otherwise, do not reject.
  • The smaller the p-value, the stronger the evidence against H₀.
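
To make the decision rule concrete, here is a sketch of a one-sided exact binomial test for the shooting example (H₀: the player shoots ≤ 50%). The record of 16 makes in 20 attempts is hypothetical, and `binomial_p_value` is a helper written for this example rather than a library function. The p-value is evaluated at the boundary of H₀ (exactly 50%).

```python
from math import comb


def binomial_p_value(made, attempts, null_rate=0.5):
    """One-sided p-value: P(X >= made) when X ~ Binomial(attempts, null_rate)."""
    return sum(
        comb(attempts, k) * null_rate**k * (1 - null_rate) ** (attempts - k)
        for k in range(made, attempts + 1)
    )


ALPHA = 0.05               # significance level
made, attempts = 16, 20    # hypothetical observed record

p_value = binomial_p_value(made, attempts)
decision = "reject H0" if p_value < ALPHA else "do not reject H0"

print(f"p-value: {p_value:.4f} -> {decision}")
```

"At least as extreme" here means 16 or more makes out of 20, which is why the sum runs from the observed count up to the number of attempts.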

P-Hacking & Research Integrity

  • P-hacking occurs when multiple tests are run and only significant results are reported, increasing false discoveries.
  • Good research defines hypotheses before data collection and tests only those.
  • Testing many hypotheses increases the likelihood of finding “significant” but spurious results.
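
The simulation below sketches why testing many hypotheses is risky. Every simulated player is a true 50% shooter, so H₀ holds in every test, yet some tests still come out "significant." The number of players, the number of attempts, and the helper function are all assumptions made for this example.

```python
import random
from math import comb

random.seed(1)  # fixed seed so the sketch is repeatable


def binomial_p_value(made, attempts, null_rate=0.5):
    """One-sided p-value: P(X >= made) when X ~ Binomial(attempts, null_rate)."""
    return sum(
        comb(attempts, k) * null_rate**k * (1 - null_rate) ** (attempts - k)
        for k in range(made, attempts + 1)
    )


ALPHA = 0.05
N_PLAYERS = 200   # hypothetical number of hypotheses tested
ATTEMPTS = 30     # shots observed per player

false_positives = 0
for _ in range(N_PLAYERS):
    # H0 is true for every player: each shot goes in with probability 0.5.
    made = sum(random.random() < 0.5 for _ in range(ATTEMPTS))
    if binomial_p_value(made, ATTEMPTS) < ALPHA:
        false_positives += 1

print(f"'significant' results among {N_PLAYERS} true-null tests: {false_positives}")
```

Reporting only the players who crossed the threshold would present chance results as discoveries, which is the core problem with p-hacking.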

Key Terms & Definitions

  • Categorical data — Data divided into groups or categories (nominal or ordinal).
  • Numerical data — Quantitative data represented by numbers (discrete or continuous).
  • Sample space — All possible outcomes for a variable.
  • Parameter — Unknown, fixed value in a population (e.g., μ, σ, π, θ, ρ, β).
  • Statistic — Value calculated from a sample used to estimate a parameter (e.g., x̄, s, p, r, b).
  • Null hypothesis (H₀) — Default assumption for statistical tests.
  • Alternative hypothesis (H₁) — The claim you seek evidence for; it is supported only when H₀ is rejected.
  • P-value — Probability, under H₀, of obtaining a result at least as extreme as the one observed.
  • Significance level — Threshold for rejecting the null hypothesis, often 0.05.

Action Items / Next Steps

  • Practice classifying data types (categorical vs. numerical, nominal vs. ordinal, etc.).
  • Review confidence intervals and hypothesis testing in detail.
  • Be cautious of p-hacking when interpreting research results.
  • Optional: Explore the mathematics of distributions, standard deviation, and regression.