Statistics Overview for Beginners

Jul 25, 2025

Overview

This lecture provides a beginner-friendly overview of statistics, focusing on key data types, distributions, estimation, hypothesis testing, and the importance and pitfalls of p-values, all with minimal math and practical examples.

Types of Data

  • Data is divided into categorical and numerical types.
  • Categorical data includes nominal (no order, e.g., team names) and ordinal (ordered categories, e.g., player positions).
  • Numerical data includes discrete (countable values, e.g., free throws missed) and continuous (any value within a range, e.g., height).
  • Proportions aggregate nominal data into a numerical summary, like shooting percentage.
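
As a small illustration of the last point, the sketch below turns a list of nominal outcomes into a proportion. The shot log is hypothetical data made up for this example, not data from the lecture.

```python
from collections import Counter

# Hypothetical shot log: each entry is a nominal outcome, not a number.
shots = ["made", "missed", "made", "made", "missed", "made", "made", "missed"]

counts = Counter(shots)                     # tally each category
shooting_pct = counts["made"] / len(shots)  # share of shots in the "made" category

print(counts)
print(f"Shooting percentage: {shooting_pct:.1%}")
```

The individual outcomes stay categorical; only the aggregate (5 of 8 shots made) is a number.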

Distributions

  • Distributions show how data values are spread across possible outcomes.
  • Common distributions: normal (bell curve), uniform (equal probability), bimodal (two peaks), and skewed (tail on one side).
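  • The normal distribution is symmetric, with most values clustered around the mean.
  • Probability density functions describe the relative likelihood of values of a continuous variable; the probability of landing in a range is the area under the curve over that range.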
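
The shapes above can be explored with simulated data. The following is a minimal sketch using Python's standard library; the sample size, seed, and distribution settings are arbitrary choices for illustration. It compares the mean and median for three shapes, then computes a probability as an area under the normal curve.

```python
import random
from statistics import NormalDist, mean, median

random.seed(0)  # fixed seed so the sketch is repeatable
n = 10_000

samples = {
    "normal": [random.gauss(0, 1) for _ in range(n)],             # bell curve
    "uniform": [random.uniform(0, 1) for _ in range(n)],          # flat
    "right-skewed": [random.expovariate(1.0) for _ in range(n)],  # long right tail
}

for name, values in samples.items():
    print(f"{name:>12}: mean={mean(values):.3f}  median={median(values):.3f}")

# For a continuous variable, probability is area under the density curve.
standard_normal = NormalDist(mu=0, sigma=1)
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(f"P(-1 < Z < 1) = {within_one_sd:.4f}")
```

In the symmetric cases the mean and median nearly coincide, while the long right tail pulls the mean above the median. The last line shows that roughly 68% of a normal distribution lies within one standard deviation of the mean.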

Sampling and Estimation

  • Samples are taken to estimate unknown population parameters, such as a player's true shooting percentage (written here as theta, θ).
  • The larger the sample size, the less variable the sample statistic is.
  • Sample statistics (like proportions) estimate population parameters, but always with some uncertainty.
  • Confidence intervals quantify this uncertainty: a 95% confidence interval comes from a procedure that captures the true value in about 95% of repeated samples, so it is reasonable to be highly confident that the true value lies within it.
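
To make the link between sample size and uncertainty concrete, here is a rough sketch of a 95% confidence interval for a proportion. It assumes the common normal-approximation formula, p ± 1.96 × sqrt(p(1 − p)/n), which the lecture does not specify. The function name `proportion_ci` and the shot counts are hypothetical.

```python
import math

def proportion_ci(made, attempts, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation, an assumption)."""
    p_hat = made / attempts                          # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / attempts)   # standard error shrinks as n grows
    return p_hat, (p_hat - z * se, p_hat + z * se)

# Hypothetical shot counts, all with the same 55% sample proportion.
for made, attempts in [(11, 20), (110, 200), (1100, 2000)]:
    p_hat, (low, high) = proportion_ci(made, attempts)
    print(f"n={attempts:>4}: p={p_hat:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With 20 shots the interval runs from about 33% to 77%. With 2,000 shots it narrows to roughly 53% to 57%, even though the sample proportion is the same.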

Parameters and Sample Statistics

  • Common parameters: mu (μ, mean), sigma (σ, standard deviation), pi (π, proportion), rho (ρ, correlation), beta (β, gradient).
  • Sample statistics: x-bar (sample mean), s (sample standard deviation), p (sample proportion), r (sample correlation), b (sample gradient).
  • Greek letters represent population parameters; Latin letters represent the sample statistics that estimate them.
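
Each sample statistic above can be computed directly from data. The sketch below uses made-up numbers (heights, rebounds, and a shot log for a handful of players) and the usual textbook formulas. The variable names mirror the Latin letters in the list.

```python
from statistics import mean, stdev

# Hypothetical sample: heights (cm) and rebounds per game for five players.
heights = [185, 190, 198, 203, 210]
rebounds = [3.1, 4.0, 5.2, 6.8, 8.9]
shots_made = [1, 0, 1, 1, 0, 1, 0, 1]  # 1 = made, 0 = missed

x_bar = mean(heights)                  # sample mean, estimates mu
s = stdev(heights)                     # sample standard deviation, estimates sigma
p = sum(shots_made) / len(shots_made)  # sample proportion, estimates pi

# Sums of squares and cross-products used for correlation and slope.
y_bar = mean(rebounds)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(heights, rebounds))
sxx = sum((x - x_bar) ** 2 for x in heights)
syy = sum((y - y_bar) ** 2 for y in rebounds)

r = sxy / (sxx * syy) ** 0.5           # sample correlation, estimates rho
b = sxy / sxx                          # least-squares slope, estimates beta

print(f"x-bar={x_bar:.1f}, s={s:.2f}, p={p:.3f}, r={r:.3f}, b={b:.3f}")
```

None of these numbers is the parameter itself. A different sample of players would give slightly different values of each statistic.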

Hypothesis Testing

  • Hypothesis testing assesses if sample evidence supports a claim about a population parameter.
  • Null hypothesis (H₀): a default claim (e.g., shooting percentage ≤ 50%).
  • Alternate hypothesis (H₁): the claim being tested (e.g., shooting percentage > 50%).
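  • Rejection region: the range of sample outcomes deemed too extreme to be consistent with H₀. With a 5% significance level and a one-sided test such as the example above, it is the most extreme 5% of outcomes expected under H₀.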
  • Never say "prove" or "accept" the null; only "reject" or "do not reject" based on evidence.
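
One way to picture the rejection region is to compute it for the shooting example. The sketch below assumes a hypothetical sample of 20 shots and an exact binomial model, neither of which is stated in the lecture. It finds the smallest number of made shots that falls in the top 5% of outcomes when the true percentage is 50%.

```python
from math import comb

def upper_tail(k, n, pi0):
    """P(X >= k) when X ~ Binomial(n, pi0)."""
    return sum(comb(n, i) * pi0**i * (1 - pi0) ** (n - i) for i in range(k, n + 1))

n, pi0, alpha = 20, 0.5, 0.05  # hypothetical: 20 shots, H0 boundary at 50%

# Smallest count whose upper-tail probability under H0 is at most alpha.
critical = next(k for k in range(n + 2) if upper_tail(k, n, pi0) <= alpha)

print(f"Reject H0 if the player makes {critical} or more of {n} shots")
print(f"P(X >= {critical} | H0) = {upper_tail(critical, n, pi0):.4f}")
```

Under these assumptions the rejection region is 15 or more made shots. Because counts are discrete, the region covers about 2% of outcomes rather than exactly 5%; 14 or more would cover slightly more than 5%.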

P-values

  • A p-value is the probability, assuming H₀ is true, of a sample result at least as extreme as the one observed.
  • Small p-values (conventionally below 0.05) count as evidence against H₀ and justify rejecting it.
  • If the p-value is larger, do not reject H₀: the evidence for H₁ is insufficient.
  • A p-value below the significance level means the sample statistic falls in the rejection region.
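
The decision rule can be sketched for the shooting example. The code below assumes an exact one-sided binomial test with H₀ at 50% and a hypothetical sample of 20 shots. The helper name `p_value_upper` and the observed counts are illustrative only.

```python
from math import comb

def p_value_upper(made, attempts, pi0=0.5):
    """One-sided p-value: P(X >= made) when X ~ Binomial(attempts, pi0)."""
    return sum(
        comb(attempts, i) * pi0**i * (1 - pi0) ** (attempts - i)
        for i in range(made, attempts + 1)
    )

alpha = 0.05  # significance level

# Two hypothetical results from 20 shots each.
for made in (14, 16):
    p = p_value_upper(made, 20)
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"{made}/20 made: p-value = {p:.4f} -> {decision}")
```

Making 14 of 20 gives a p-value of about 0.058, just above 0.05, so H₀ is not rejected. Making 16 of 20 gives about 0.006, so H₀ is rejected.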

Pitfalls: P-hacking

  • P-hacking occurs when many hypotheses are tested, and only significant results are reported.
  • Repeated testing increases the probability of finding significant results by chance.
  • Good research tests a pre-specified effect rather than searching through many effects for one that happens to be significant.
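
The effect of repeated testing can be put in numbers. As a rough sketch, assume every null hypothesis is true and the tests are independent, so that each p-value behaves like a uniform random number between 0 and 1. The chance of at least one "significant" result among k tests is then 1 − (1 − α)^k. The function name and simulation settings below are illustrative.

```python
import random

alpha = 0.05

def chance_of_false_positive(num_tests, alpha=0.05):
    """P(at least one p < alpha) across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20):
    print(f"{k:>2} tests: {chance_of_false_positive(k):.3f}")

# Simulation check: draw 20 uniform "p-values" per study, many times over.
random.seed(1)
trials, num_tests = 10_000, 20
hits = sum(
    any(random.random() < alpha for _ in range(num_tests))
    for _ in range(trials)
)
simulated = hits / trials
print(f"Simulated share of studies with a false positive: {simulated:.3f}")
```

With 20 tests and no real effects, about 64% of studies would still turn up at least one result below 0.05. This is why reporting only the significant ones is misleading.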

Key Terms & Definitions

  • Categorical Data — Data grouped into categories (e.g., team, position).
  • Numerical Data — Data expressed as numbers, either discrete or continuous.
  • Sample Statistic — Number calculated from a sample, used to estimate a parameter.
  • Parameter — Unknown, fixed value describing the whole population.
  • Confidence Interval — Range estimating a parameter with a stated confidence level.
  • Hypothesis Test — Procedure to assess evidence for or against a claim.
  • Null Hypothesis (H₀) — Default assumption in a statistical test.
  • Alternate Hypothesis (H₁) — Claim tested against the null.
  • Rejection Region — Outcomes where H₀ is rejected.
  • P-value — Probability, assuming H₀ is true, of observing a result at least as extreme as the sample.
  • P-hacking — Manipulating analysis to find statistically significant (but possibly misleading) results.

Action Items / Next Steps

  • Review examples of data types and distributions.
  • Practice identifying null and alternate hypotheses for different scenarios.
  • Read about confidence intervals and calculating p-values.
  • Be cautious when interpreting p-values, and avoid p-hacking in research.