Statistics Concepts Overview

Overview

This lecture provides an intuitive, non-mathematical introduction to key statistics concepts, focusing on data types, distributions, sampling, estimation, hypothesis testing, p-values, and the pitfalls of p-hacking.

Types of Data

Data is divided into categorical (categories) and numerical (numbers).
Categorical data splits into nominal (no order, e.g., team names) and ordinal (ordered, e.g., player positions).
Numerical data splits into discrete (countable, e.g., missed shots) and continuous (measurable, e.g., height).
Proportion/percentage data is a numerical summary of repeated nominal outcomes.
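
As a small illustration of the last point, the sketch below turns a list of nominal outcomes into a proportion. The shot log is made up for the example and does not come from the lecture.

```python
from collections import Counter

# Hypothetical shot log: each entry is a nominal outcome (no order, just a label).
shots = ["make", "miss", "make", "make", "miss", "make", "miss", "make"]

counts = Counter(shots)                        # tally each category
proportion_made = counts["make"] / len(shots)  # numerical summary of the categories

print(dict(counts))
print(f"Proportion made: {proportion_made:.3f}")
```

The individual outcomes stay categorical; only the summary (5 makes out of 8) is a number.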

Distributions

A distribution shows how values of a variable are spread (e.g., heights in the NBA).
Common distributions include the normal (bell curve), uniform (even spread), bimodal (two peaks), and skewed (asymmetric).
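The probability density function (PDF) shows how likely values are relative to one another for a continuous variable; the probability of landing in a given range is the area under the curve over that range.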
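
A minimal sketch of reading a PDF, assuming heights follow a normal distribution. The mean of 200 cm and standard deviation of 8 cm are invented for illustration and are not real NBA figures. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Hypothetical parameters, chosen only to make the example concrete.
heights = NormalDist(mu=200.0, sigma=8.0)  # centimetres

# Density is higher near the centre of the bell curve than out in the tail.
density_at_mean = heights.pdf(200.0)
density_in_tail = heights.pdf(220.0)

# A probability is an area under the curve: here, the chance of a height
# falling within one standard deviation of the mean (192 cm to 208 cm).
prob_within_one_sd = heights.cdf(208.0) - heights.cdf(192.0)

print(f"density at 200 cm: {density_at_mean:.4f}")
print(f"density at 220 cm: {density_in_tail:.4f}")
print(f"P(192 <= height <= 208): {prob_within_one_sd:.3f}")
```

The density values themselves are not probabilities; only areas over a range are, which is why the last line uses the cumulative distribution function.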

Sampling & Estimation

We use samples to estimate unknown population values (parameters, e.g., true shooting ability).
Sample statistics (like proportion or mean) estimate parameters, but always with some uncertainty.
Larger samples produce more precise (less variable) estimates.
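
The effect of sample size can be seen in a short simulation. This is a sketch under stated assumptions: a hypothetical player whose true shooting rate is exactly 0.5, independent shots, and a helper name (`spread_of_estimates`) invented for this example.

```python
import random
import statistics

def spread_of_estimates(true_rate, n_shots, n_repeats=1000, seed=0):
    """Repeat the same experiment many times and return the standard
    deviation of the resulting sample proportions."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        makes = sum(rng.random() < true_rate for _ in range(n_shots))
        estimates.append(makes / n_shots)
    return statistics.stdev(estimates)

# The parameter (0.5) never changes; only the sample size does.
for n in (10, 100, 1000):
    print(f"n = {n:4d}: spread of estimates = {spread_of_estimates(0.5, n):.3f}")
```

The spread shrinks as the number of shots grows, so an estimate from 1000 shots sits much closer to the true rate, on average, than one from 10 shots.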

Parameters & Sample Statistics

Common population parameters: μ (mean), σ (standard deviation), π or θ (proportion), ρ (correlation), β (regression slope).
Corresponding sample statistics: x̄ (sample mean), s (sample standard deviation), p (sample proportion), r (sample correlation), b (sample slope).
Greek letters represent unknown, fixed values; Roman letters denote sample-based estimates.
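
The sample statistics listed above can be computed directly from data. The sketch below uses a small made-up sample (heights, points per game, and a make/miss record); none of the numbers come from the lecture. The correlation and slope are written out by hand so the formulas are visible.

```python
import statistics

# Hypothetical sample of five players.
heights = [185.0, 190.0, 198.0, 203.0, 211.0]   # centimetres
points = [12.0, 15.0, 14.0, 20.0, 22.0]         # points per game
made = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]           # 1 = make, 0 = miss

x_bar = statistics.mean(heights)    # sample mean, estimates mu
s = statistics.stdev(heights)       # sample standard deviation, estimates sigma
p = sum(made) / len(made)           # sample proportion, estimates pi

# Sums of squares and cross-products around the means.
y_bar = statistics.mean(points)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, points))
sxx = sum((x - x_bar) ** 2 for x in heights)
syy = sum((y - y_bar) ** 2 for y in points)

r = sxy / (sxx * syy) ** 0.5        # sample correlation, estimates rho
b = sxy / sxx                       # sample regression slope, estimates beta

print(f"x_bar = {x_bar:.1f}, s = {s:.2f}, p = {p:.2f}, r = {r:.3f}, b = {b:.3f}")
```

A different sample of five players would give different values of each statistic, while the population parameters they estimate would stay fixed.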

Hypothesis Testing

Hypothesis tests assess whether sample data provide enough evidence to reject a null hypothesis (the status quo).
Null hypothesis (H₀): assumes no effect or status quo (e.g., player shoots ≤ 50%).
Alternative hypothesis (H₁): what you seek evidence for (e.g., player shoots > 50%).
Results are framed as rejecting or failing to reject H₀, never as "proving" or "accepting" it.

P-values & Significance

The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed.
If p-value < significance level (commonly 0.05), reject H₀; otherwise, do not reject.
The smaller the p-value, the stronger the evidence against H₀.
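
To make the decision rule concrete, here is a minimal sketch of an exact one-sided binomial test for the shooting example. The data (15 makes in 20 attempts) and the function name `binomial_p_value` are assumptions for illustration. The null hypothesis "shoots ≤ 50%" is evaluated at its boundary value of 0.5, and shots are assumed independent.

```python
from math import comb

def binomial_p_value(makes, attempts, null_rate=0.5):
    """Probability of seeing `makes` or more successes in `attempts`
    independent shots if the true rate were `null_rate`."""
    return sum(
        comb(attempts, k) * null_rate**k * (1 - null_rate) ** (attempts - k)
        for k in range(makes, attempts + 1)
    )

alpha = 0.05                                      # significance level
p_value = binomial_p_value(makes=15, attempts=20)
decision = "reject H0" if p_value < alpha else "do not reject H0"

print(f"p-value = {p_value:.4f} -> {decision}")   # p-value = 0.0207 -> reject H0
```

With 12 makes in 20 attempts the same function returns a p-value well above 0.05, so H₀ would not be rejected; that is not the same as showing the player shoots 50% or worse.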

P-Hacking & Research Integrity

P-hacking occurs when multiple tests are run and only significant results are reported, increasing false discoveries.
Good research defines hypotheses before data collection and tests only those.
Testing many hypotheses increases the likelihood of finding “significant” but spurious results.
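
The growth in spurious findings can be sketched numerically. The example assumes every null hypothesis is true and the tests are independent, so each test has a 0.05 chance of a false positive and each p-value behaves like a uniform random number between 0 and 1. Both function names are invented for this illustration.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of `n_tests` independent tests
    comes out 'significant' when no real effect exists."""
    return 1 - (1 - alpha) ** n_tests

def simulated_chance(n_tests, alpha=0.05, n_studies=10_000, seed=1):
    """Estimate the same probability by simulating many studies in
    which every p-value is drawn uniformly from [0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies

for m in (1, 5, 20):
    print(f"{m:2d} tests: formula = {chance_of_false_positive(m):.3f}, "
          f"simulation = {simulated_chance(m):.3f}")
```

Under these assumptions, running 20 tests gives roughly a 64% chance of at least one "significant" result even though nothing real is going on, which is why reporting only the significant ones is misleading.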

Key Terms & Definitions

Categorical data — Data divided into groups or categories (nominal or ordinal).
Numerical data — Quantitative data represented by numbers (discrete or continuous).
Sample space — All possible outcomes for a variable.
Parameter — Unknown, fixed value in a population (e.g., μ, σ, π, θ, ρ, β).
Statistic — Value calculated from a sample used to estimate a parameter (e.g., x̄, s, p, r, b).
Null hypothesis (H₀) — Default assumption for statistical tests.
Alternative hypothesis (H₁) — The claim you seek evidence for, set against H₀.
P-value — Probability, under H₀, of obtaining a result at least as extreme as the one observed.
Significance level — Threshold for rejecting the null hypothesis, often 0.05.

Action Items / Next Steps

Practice classifying data types (categorical vs. numerical, nominal vs. ordinal, etc.).
Review confidence intervals and hypothesis testing in detail.
Be cautious of p-hacking when interpreting research results.
Optional: Explore the mathematics of distributions, standard deviation, and regression.