Statistics Overview for Beginners

Overview

This lecture provides a beginner-friendly overview of statistics, focusing on key data types, distributions, estimation, hypothesis testing, and the importance and pitfalls of p-values, all with minimal math and practical examples.

Types of Data

Data is divided into categorical and numerical types.
Categorical data includes nominal (no order, e.g., team names) and ordinal (ordered categories, e.g., player positions).
Numerical data includes discrete (countable values, e.g., free throws missed) and continuous (any value within a range, e.g., height).
Proportions aggregate nominal data into a numerical summary, like shooting percentage.
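
As a minimal sketch of that last point, the snippet below turns a list of nominal outcomes into a proportion. The shot log and the helper name `proportion` are made up for illustration.

```python
# Hypothetical shot log: each entry is a nominal outcome, "make" or "miss".
shots = ["make", "miss", "make", "make", "miss",
         "make", "miss", "make", "make", "miss"]


def proportion(values, category):
    """Return the fraction of values that equal the given category."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(1 for v in values if v == category) / len(values)


shooting_pct = proportion(shots, "make")
print(f"Shooting percentage: {shooting_pct:.0%}")  # 6 makes out of 10 shots
```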

Distributions

Distributions show how data values are spread across possible outcomes.
Common distributions: normal (bell curve), uniform (equal probability), bimodal (two peaks), and skewed (tail on one side).
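The normal distribution is symmetric, with most values clustered around the mean.
Probability density functions describe how likely values are relative to one another for continuous data; the probability of landing in a range is the area under the curve over that range.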
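
One rough way to see these shapes without plotting is to compare the mean and median of simulated data: they roughly agree for symmetric distributions and separate when there is a long tail. The sketch below uses Python's standard library; the distribution parameters, the sample size, and the choice of an exponential distribution as the skewed example are arbitrary assumptions. The bimodal case is left out for brevity.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable
N = 10_000       # arbitrary number of simulated values per distribution

samples = {
    "normal": [random.gauss(0, 1) for _ in range(N)],
    "uniform": [random.uniform(0, 1) for _ in range(N)],
    # An exponential distribution stands in for a right-skewed shape.
    "right-skewed": [random.expovariate(1.0) for _ in range(N)],
}

summary = {}
for name, values in samples.items():
    summary[name] = (statistics.mean(values), statistics.median(values))
    mean, median = summary[name]
    print(f"{name:>12}: mean = {mean:.3f}, median = {median:.3f}")
```

In the right-skewed sample the long tail pulls the mean above the median, while in the two symmetric samples the two summaries nearly coincide.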

Sampling and Estimation

Samples are taken to estimate unknown population parameters like a player's true shooting percentage (theta, θ).
The larger the sample size, the less variable the sample statistic is.
Sample statistics (like proportions) estimate population parameters, but always with some uncertainty.
Confidence intervals quantify this uncertainty: a 95% confidence interval comes from a procedure that, over repeated samples, captures the true value about 95% of the time.
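
The simulation below is a sketch of both ideas, not a definitive method. It assumes a "true" shooting percentage of 55% (which would be unknown in practice), shows that the sample proportion varies less as the sample grows, and computes an approximate 95% interval using the normal approximation (often called the Wald interval), which is only one of several ways to build such an interval. All names are illustrative.

```python
import math
import random

random.seed(0)
TRUE_THETA = 0.55  # assumed true shooting percentage, unknown in practice


def sample_proportion(n):
    """Simulate n shots and return the fraction made."""
    return sum(random.random() < TRUE_THETA for _ in range(n)) / n


def spread(n, repeats=2000):
    """Standard deviation of the sample proportion across repeated samples."""
    props = [sample_proportion(n) for _ in range(repeats)]
    mean = sum(props) / repeats
    return math.sqrt(sum((p - mean) ** 2 for p in props) / (repeats - 1))


def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)


spread_small = spread(20)
spread_large = spread(500)
print(f"Spread of sample proportion, n=20:  {spread_small:.3f}")
print(f"Spread of sample proportion, n=500: {spread_large:.3f}")

# Example: a player makes 60 of 100 shots.
low, high = wald_interval(0.60, 100)
print(f"Approximate 95% CI: ({low:.3f}, {high:.3f})")
```

The normal approximation behaves poorly for very small samples or proportions near 0 or 1, so treat the interval as a rough guide in those cases.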

Parameters and Sample Statistics

Common parameters: mu (μ, mean), sigma (σ, standard deviation), pi (π, proportion), rho (ρ, correlation), beta (β, gradient).
Sample statistics: x-bar (sample mean), s (sample standard deviation), p (sample proportion), r (sample correlation), b (sample gradient).
Greek letters represent population parameters; Latin letters represent the sample statistics that estimate them.
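
To make the notation concrete, the sketch below computes x-bar, s, r, and b from a small made-up sample. The data (practice hours and points per game) and variable names are hypothetical; r and b are computed from the sample covariance so the arithmetic stays visible.

```python
import statistics

# Hypothetical sample: weekly practice hours (x) and points per game (y).
hours = [2, 4, 5, 7, 8]
points = [6, 9, 11, 14, 15]

x_bar = statistics.mean(hours)   # sample mean, estimates mu
y_bar = statistics.mean(points)
s_x = statistics.stdev(hours)    # sample standard deviation, estimates sigma
s_y = statistics.stdev(points)

# Sample covariance, using the same n - 1 divisor as statistics.stdev.
n = len(hours)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, points)) / (n - 1)

r = cov_xy / (s_x * s_y)   # sample correlation, estimates rho
b = cov_xy / s_x ** 2      # sample gradient of y on x, estimates beta

print(f"x-bar = {x_bar:.2f}, s = {s_x:.2f}, r = {r:.3f}, b = {b:.3f}")
```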

Hypothesis Testing

Hypothesis testing assesses whether sample evidence supports a claim about a population parameter.
Null hypothesis (H₀): a default claim (e.g., shooting percentage ≤ 50%).
Alternate hypothesis (H₁): the claim being tested (e.g., shooting percentage > 50%).
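Rejection region: the range of sample results deemed too extreme to be consistent with H₀, typically the most extreme 5% of outcomes under H₀ (the significance level).
Never say a test "proves" a hypothesis or that you "accept" the null; only "reject" or "do not reject" H₀ based on the evidence.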
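
As an illustration of a rejection region, the sketch below takes the shooting example with an assumed sample of 20 shots and finds the smallest number of makes that would be too extreme for H₀ at the 5% level. It uses exact binomial probabilities evaluated at the boundary value of 50%; the sample size and function names are assumptions made for this example.

```python
from math import comb


def upper_tail(k, n, theta):
    """P(X >= k) when X counts makes in n shots, each made with chance theta."""
    return sum(comb(n, i) * theta ** i * (1 - theta) ** (n - i)
               for i in range(k, n + 1))


def critical_value(n, theta0=0.5, alpha=0.05):
    """Smallest number of makes whose upper-tail probability under H0 is at most alpha.

    Returns None if even n makes out of n would not be extreme enough.
    """
    for k in range(n + 1):
        if upper_tail(k, n, theta0) <= alpha:
            return k
    return None


critical = critical_value(20)
print(f"Reject H0 if the player makes at least {critical} of 20 shots")
print(f"P(X >= {critical}) under H0 = {upper_tail(critical, 20, 0.5):.4f}")
```

Because the number of makes is discrete, the rejection region here covers somewhat less than 5% of outcomes under H₀ rather than exactly 5%.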

P-values

A p-value is the probability, assuming H₀ is true, of observing a sample result at least as extreme as the one obtained.
A small p-value (below the chosen significance level, commonly 0.05) is evidence against H₀ and justifies rejecting it.
If the p-value is at or above the significance level, do not reject H₀; the evidence for H₁ is insufficient.
A p-value below the significance level means the sample falls in the rejection region.
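
A short worked example, under the same assumptions as the shooting scenario (H₀: shooting percentage ≤ 50%, with a hypothetical sample of 20 shots), shows how a p-value is computed and compared with the significance level. The function name and the two observed counts are illustrative.

```python
from math import comb

ALPHA = 0.05  # significance level


def p_value_upper(makes, shots, theta0=0.5):
    """P(at least `makes` successes in `shots` attempts) if the true rate is theta0."""
    return sum(comb(shots, i) * theta0 ** i * (1 - theta0) ** (shots - i)
               for i in range(makes, shots + 1))


for makes in (14, 16):
    p = p_value_upper(makes, 20)
    decision = "reject H0" if p < ALPHA else "do not reject H0"
    print(f"{makes}/20 made: p-value = {p:.4f} -> {decision}")
```

Making 14 of 20 gives a p-value of about 0.058, just above 0.05, so H₀ is not rejected even though the sample proportion is 70%; making 16 of 20 gives about 0.006, which falls in the rejection region.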

Pitfalls: P-hacking

P-hacking occurs when many hypotheses are tested, and only significant results are reported.
Repeated testing increases the probability of finding at least one significant result by chance alone.
Good research tests a pre-specified effect rather than searching across many effects for one that reaches significance.
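
The arithmetic behind this can be sketched as follows, assuming the tests are independent and every null hypothesis is actually true. With a 5% significance level, the chance of at least one false positive among k tests is 1 − 0.95^k. The simulation relies on the further assumption that p-values are uniformly distributed under a true null, which holds for continuous test statistics; the function names are illustrative.

```python
import random

random.seed(1)
ALPHA = 0.05


def chance_of_false_positive(num_tests, alpha=ALPHA):
    """Probability that at least one of num_tests independent true-null tests is significant."""
    return 1 - (1 - alpha) ** num_tests


def simulate(num_tests, studies=10_000):
    """Fraction of simulated studies with at least one p-value below ALPHA.

    Each p-value is drawn uniformly from [0, 1], as expected under a true null.
    """
    hits = sum(
        any(random.random() < ALPHA for _ in range(num_tests))
        for _ in range(studies)
    )
    return hits / studies


for k in (1, 5, 20):
    print(f"{k:>2} tests: chance of a false positive = {chance_of_false_positive(k):.3f}")

simulated = simulate(20)
print(f"Simulated rate for 20 tests: {simulated:.3f}")
```

With 20 tests, a "significant" result shows up in roughly 64% of studies even though no real effect exists, which is why reporting only the significant test is misleading.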

Key Terms & Definitions

Categorical Data — Data grouped into categories (e.g., team, position).
Numerical Data — Data expressed as numbers, either discrete or continuous.
Sample Statistic — Number calculated from a sample, used to estimate a parameter.
Parameter — Unknown, fixed value describing the whole population.
Confidence Interval — Range estimating a parameter with a stated confidence level.
Hypothesis Test — Procedure to assess evidence for or against a claim.
Null Hypothesis (H₀) — Default assumption in a statistical test.
Alternate Hypothesis (H₁) — Claim tested against the null.
Rejection Region — Outcomes where H₀ is rejected.
P-value — Probability, assuming H₀ is true, of observing a result at least as extreme as the sample.
P-hacking — Manipulating analysis to find statistically significant (but possibly misleading) results.

Action Items / Next Steps

Review examples of data types and distributions.
Practice identifying null and alternate hypotheses for different scenarios.
Read about confidence intervals and calculating p-values.
Be cautious when interpreting p-values, and avoid p-hacking in research.