Overview
This lecture provides an intuitive, non-mathematical introduction to key statistical concepts: data types, distributions, sampling, estimation, hypothesis testing, p-values, and the pitfalls of p-hacking.
Types of Data
- Data is divided into categorical (categories) and numerical (numbers).
- Categorical data splits into nominal (no order, e.g., team names) and ordinal (ordered, e.g., player positions).
- Numerical data splits into discrete (countable, e.g., missed shots) and continuous (measurable, e.g., height).
- Proportion/percentage data is a numerical summary of repeated nominal outcomes.
Distributions
- A distribution shows how values of a variable are spread (e.g., heights in the NBA).
- Common distributions include the normal (bell curve), uniform (even spread), bimodal (two peaks), and skewed (asymmetric).
- The probability density function (PDF) shows the relative likelihood of sampling different values; for a continuous variable, probabilities correspond to areas under the curve.
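The shapes listed above can be made concrete with a small simulation. The following is a minimal sketch using only Python's standard library; the centers, spreads, and bounds are made-up illustrative numbers, not real NBA data. Comparing the mean and median is one rough way to see symmetry versus skew.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# All numeric settings below are hypothetical, chosen only for illustration.
samples = {
    # Bell curve: values cluster symmetrically around the center
    "normal": [rng.gauss(200, 9) for _ in range(n)],
    # Even spread between two bounds
    "uniform": [rng.uniform(180, 220) for _ in range(n)],
    # Two peaks: an equal mix of two bell curves
    "bimodal": [
        rng.gauss(190, 4) if rng.random() < 0.5 else rng.gauss(212, 4)
        for _ in range(n)
    ],
    # Right-skewed: long tail toward large values
    "skewed": [rng.expovariate(1 / 10) for _ in range(n)],
}

for name, values in samples.items():
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name:8s} mean={mean:7.2f} median={median:7.2f}")
```

For the symmetric shapes the mean and median land close together, while for the right-skewed sample the long tail pulls the mean above the median.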
Sampling & Estimation
- We use samples to estimate unknown population values (parameters, e.g., true shooting ability).
- Sample statistics (like proportion or mean) estimate parameters, but always with some uncertainty.
- Larger samples produce more precise (less variable) estimates.
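The claim that larger samples give less variable estimates can be checked by simulation. This is a sketch under stated assumptions: the true shooting ability `TRUE_P` is a hypothetical value that would be unknown in practice, and the helper names are invented for illustration.

```python
import random
import statistics

def sample_proportion(true_p, n, rng):
    """Simulate n shots and return the fraction made (the sample statistic)."""
    made = sum(rng.random() < true_p for _ in range(n))
    return made / n

def spread_of_estimates(true_p, n, repeats, rng):
    """Standard deviation of the sample proportion across repeated samples."""
    estimates = [sample_proportion(true_p, n, rng) for _ in range(repeats)]
    return statistics.stdev(estimates)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
TRUE_P = 0.55           # hypothetical true shooting ability (the parameter)

spreads = {n: spread_of_estimates(TRUE_P, n, 2000, rng) for n in (10, 100, 1000)}
for n, sd in spreads.items():
    print(f"n={n:5d}  spread of estimates ~ {sd:.3f}")
```

Each sample size is drawn 2000 times; the spread of the resulting estimates shrinks as the number of shots per sample grows.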
Parameters & Sample Statistics
- Common population parameters: μ (mean), σ (standard deviation), π or θ (proportion), ρ (correlation), β (regression slope).
- Corresponding sample statistics: x̄ (sample mean), s (sample standard deviation), p (sample proportion), r (sample correlation), b (sample slope).
- Greek letters represent unknown, fixed values; Roman letters denote sample-based estimates.
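To connect the notation to actual numbers, the sketch below computes each sample statistic from a small made-up dataset. The data (minutes played, points scored, shot outcomes) are hypothetical, and the correlation and slope are computed from their textbook formulas rather than from a statistics package.

```python
import math
import statistics

# Hypothetical sample: minutes played and points scored in five games
minutes = [20, 25, 30, 35, 40]
points = [8, 12, 13, 18, 19]
# Hypothetical shot outcomes (1 = made, 0 = missed)
shots = [1, 0, 1, 1, 0, 1, 0, 1]

x_bar = statistics.mean(points)   # sample mean, estimates mu
s = statistics.stdev(points)      # sample standard deviation, estimates sigma
p = sum(shots) / len(shots)       # sample proportion, estimates pi (or theta)

# Sums of squares and cross-products around the sample means
mx, my = statistics.mean(minutes), statistics.mean(points)
sxy = sum((x - mx) * (y - my) for x, y in zip(minutes, points))
sxx = sum((x - mx) ** 2 for x in minutes)
syy = sum((y - my) ** 2 for y in points)

r = sxy / math.sqrt(sxx * syy)    # sample correlation, estimates rho
b = sxy / sxx                     # sample slope of points on minutes, estimates beta

print(f"x_bar={x_bar}, s={s:.2f}, p={p:.3f}, r={r:.3f}, b={b:.2f}")
```

A different sample from the same population would give different values for every one of these statistics, while the parameters they estimate stay fixed.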
Hypothesis Testing
- Hypothesis tests assess if sample data provide enough evidence to challenge a null hypothesis (status quo).
- Null hypothesis (H₀): assumes no effect or status quo (e.g., player shoots ≤ 50%).
- Alternate hypothesis (H₁): what you seek evidence for (e.g., player shoots > 50%).
- Results are framed as rejecting or not rejecting H₀; never "proving" or "accepting."
P-values & Significance
- The p-value is the probability, assuming the null hypothesis is true, of observing a sample at least as extreme as the one actually observed.
- If the p-value is below the significance level (commonly 0.05), reject H₀; otherwise, do not reject it.
- The smaller the p-value, the stronger the evidence against H₀.
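The shooting example from the hypothesis-testing notes (H₀: player shoots ≤ 50%, H₁: player shoots > 50%) can be worked through with an exact one-sided binomial calculation. This is a minimal sketch, not necessarily the test used in the lecture; the observed result of 14 makes in 20 attempts is a made-up figure, and `binomial_p_value` is a hypothetical helper name.

```python
from math import comb

def binomial_p_value(made, attempts, p0=0.5):
    """One-sided p-value: P(X >= made) when X ~ Binomial(attempts, p0).

    p0 is the boundary value of the null hypothesis (here 50%).
    """
    return sum(
        comb(attempts, k) * p0**k * (1 - p0) ** (attempts - k)
        for k in range(made, attempts + 1)
    )

ALPHA = 0.05  # significance level

# Hypothetical observation: 14 makes in 20 attempts (70%)
p_value = binomial_p_value(made=14, attempts=20)
decision = "reject H0" if p_value < ALPHA else "do not reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

Here the p-value comes out at about 0.058, so even a 70% success rate over 20 shots is not enough to reject H₀ at the 0.05 level. With 15 makes the p-value drops to about 0.021 and H₀ would be rejected.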
P-Hacking & Research Integrity
- P-hacking occurs when multiple tests are run and only significant results are reported, increasing false discoveries.
- Good research defines hypotheses before data collection and tests only those.
- Testing many hypotheses increases the likelihood of finding “significant” but spurious results.
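The effect of running many tests can be quantified. The sketch below assumes every null hypothesis is true and that the tests are independent, so each p-value behaves like a uniform draw between 0 and 1 (which holds for a continuous test statistic). The function names and the choice of 20 tests are illustrative.

```python
import random

def family_wise_error(alpha, k):
    """Chance of at least one false positive in k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k

def simulate(alpha, k, studies, rng):
    """Fraction of simulated studies with at least one 'significant' result.

    Assumption: every H0 is true, so each p-value is uniform on [0, 1].
    """
    hits = sum(
        any(rng.random() < alpha for _ in range(k))
        for _ in range(studies)
    )
    return hits / studies

rng = random.Random(2)  # fixed seed so the sketch is repeatable
expected = family_wise_error(0.05, 20)
observed = simulate(0.05, 20, 10_000, rng)
print(f"formula: {expected:.3f}  simulation: {observed:.3f}")
```

With 20 tests at the 0.05 level, the chance of at least one spurious "significant" result is roughly 64%, even though no real effect exists in any of them.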
Key Terms & Definitions
- Categorical data — Data divided into groups or categories (nominal or ordinal).
- Numerical data — Quantitative data represented by numbers (discrete or continuous).
- Sample space — All possible outcomes for a variable.
- Parameter — Unknown, fixed value in a population (e.g., μ, σ, π, θ, ρ, β).
- Statistic — Value calculated from a sample used to estimate a parameter (e.g., x̄, s, p, r, b).
- Null hypothesis (H₀) — Default assumption for statistical tests.
- Alternate hypothesis (H₁) — The claim you seek evidence for, in contrast to H₀.
- P-value — Probability, assuming H₀ is true, of obtaining a result at least as extreme as the one observed.
- Significance level — Threshold for rejecting the null hypothesis, often 0.05.
Action Items / Next Steps
- Practice classifying data types (categorical vs. numerical, nominal vs. ordinal, etc.).
- Review confidence intervals and hypothesis testing in detail.
- Be cautious of p-hacking when interpreting research results.
- Optional: Explore the mathematics of distributions, standard deviation, and regression.