Overview
This lecture provides a beginner-friendly overview of statistics, focusing on key data types, distributions, estimation, hypothesis testing, and the importance and pitfalls of p-values, all with minimal math and practical examples.
Types of Data
- Data is divided into categorical and numerical types.
- Categorical data includes nominal (no order, e.g., team names) and ordinal (ordered categories, e.g., player positions).
- Numerical data includes discrete (countable values, e.g., free throws missed) and continuous (any value within a range, e.g., height).
- Proportions aggregate nominal data into a numerical summary, like shooting percentage.
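As a small illustration of the last point, the sketch below turns a list of nominal outcomes into a single proportion. The shot log is invented for the example and is not data from the lecture.

```python
from collections import Counter

# Hypothetical shot log: each attempt is a nominal outcome ("made" or "missed").
shots = ["made", "missed", "made", "made", "missed", "made", "missed", "made"]

counts = Counter(shots)                      # tally each category
shooting_pct = counts["made"] / len(shots)   # proportion of attempts that were made

print(counts["made"], "of", len(shots), "made")
print(f"shooting percentage: {shooting_pct:.1%}")
```

The categories themselves have no numeric meaning; only the count of each category, and the ratio built from those counts, is numerical.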
Distributions
- Distributions show how data values are spread across possible outcomes.
- Common distributions: normal (bell curve), uniform (equal probability), bimodal (two peaks), and skewed (tail on one side).
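- The normal distribution is symmetric, with most values clustered around the mean.
- Probability density functions describe the relative likelihood of values; for continuous data, the probability of a range of values is the area under the curve over that range.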
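One way to see how a density relates to probability is to compute areas under a normal curve. The sketch below uses Python's standard-library `NormalDist`; the mean of 200 cm and standard deviation of 10 cm are assumed purely for illustration.

```python
from statistics import NormalDist

# Assumed for illustration: heights are normal with mean 200 cm and sd 10 cm.
heights = NormalDist(mu=200, sigma=10)

# For a continuous variable, probability is the area under the density over a range.
p_within_1sd = heights.cdf(210) - heights.cdf(190)
p_within_2sd = heights.cdf(220) - heights.cdf(180)

# The density itself is the height of the curve, not a probability.
density_at_mean = heights.pdf(200)

print(f"P(190 < height < 210) = {p_within_1sd:.3f}")
print(f"P(180 < height < 220) = {p_within_2sd:.3f}")
print(f"density at the mean   = {density_at_mean:.4f}")
```

Roughly 68% of values fall within one standard deviation of the mean and roughly 95% within two, which is what "most values around the mean" looks like in numbers.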
Sampling and Estimation
- Samples are taken to estimate unknown population parameters like a player's true shooting percentage (theta, θ).
- The larger the sample size, the less variable the sample statistic is.
- Sample statistics (like proportions) estimate population parameters, but always with some uncertainty.
- Confidence intervals quantify this uncertainty: a 95% confidence interval comes from a procedure that captures the true value in about 95% of repeated samples, so we can be fairly confident the true value lies within it.
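The effect of sample size, and one common way to build a confidence interval, can be sketched with a short simulation. Everything specific here is an assumption for illustration: the "true" shooting percentage of 0.55, the sample sizes, the 58-of-100 result, and the choice of the normal-approximation (Wald) interval, which is only one of several interval methods.

```python
import math
import random
import statistics

random.seed(0)          # fixed seed so the sketch is repeatable
TRUE_THETA = 0.55       # hypothetical true shooting percentage (unknown in practice)

def sample_proportion(n):
    """Simulate n shots and return the fraction made."""
    return sum(random.random() < TRUE_THETA for _ in range(n)) / n

# Spread of the sample proportion across 1000 repeated samples of each size.
spread = {}
for n in (10, 100, 1000):
    spread[n] = statistics.stdev([sample_proportion(n) for _ in range(1000)])
    print(f"n={n:5d}  sd of sample proportion = {spread[n]:.3f}")

def wald_interval(made, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = made / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

low, high = wald_interval(made=58, n=100)
print(f"58 of 100 made: 95% CI = ({low:.3f}, {high:.3f})")
```

The spread shrinks as n grows, and the interval for 58 of 100 still includes 0.5, so a sample of that size cannot rule out a 50% shooter.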
Parameters and Sample Statistics
- Common parameters: mu (μ, mean), sigma (σ, standard deviation), pi (π, proportion), rho (ρ, correlation), beta (β, gradient).
- Sample statistics: x-bar (sample mean), s (sample standard deviation), p (sample proportion), r (sample correlation), b (sample gradient).
- Greek letters denote population parameters; Latin letters denote the sample statistics that estimate them.
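To make the notation concrete, the sketch below computes several of these sample statistics from a tiny invented data set (minutes played and points scored for five hypothetical players). The correlation and gradient are computed from sums of squares rather than a library call so that each step is visible.

```python
import math
import statistics

# Hypothetical data for five players: minutes played and points scored.
minutes = [10, 20, 30, 40, 50]
points = [4, 9, 15, 18, 24]

x_bar = statistics.mean(points)      # sample mean, estimates mu
s = statistics.stdev(points)         # sample standard deviation (divides by n - 1), estimates sigma

# Sums of squares and cross-products about the means.
m_bar = statistics.mean(minutes)
s_xx = sum((x - m_bar) ** 2 for x in minutes)
s_yy = sum((y - x_bar) ** 2 for y in points)
s_xy = sum((x - m_bar) * (y - x_bar) for x, y in zip(minutes, points))

r = s_xy / math.sqrt(s_xx * s_yy)    # sample correlation, estimates rho
b = s_xy / s_xx                      # sample gradient of points on minutes, estimates beta

print(f"x-bar={x_bar}, s={s:.2f}, r={r:.3f}, b={b:.2f}")
```

Each value is a Latin-letter statistic computed from the sample; the corresponding Greek-letter parameter for the whole population remains unknown.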
Hypothesis Testing
- Hypothesis testing assesses if sample evidence supports a claim about a population parameter.
- Null hypothesis (H₀): a default claim (e.g., shooting percentage ≤ 50%).
- Alternate hypothesis (H₁): the claim being tested (e.g., shooting percentage > 50%).
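- Rejection region: the range of sample results deemed too extreme to be consistent with H₀. Its size is the significance level, usually 5%; for the one-sided example above, that is the top 5% of outcomes expected under H₀.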
- Never say "prove" or "accept" the null; only "reject" or "do not reject" based on evidence.
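A rejection region for the shooting-percentage example might be worked out as follows. This is a sketch under stated assumptions: a hypothetical sample of 20 shots, independent attempts (so the count of made shots is binomial), and a 5% significance level.

```python
from math import comb

def upper_tail(k, n, theta):
    """P(X >= k) when X counts successes in n independent shots with success rate theta."""
    return sum(comb(n, i) * theta**i * (1 - theta) ** (n - i) for i in range(k, n + 1))

N_SHOTS = 20     # hypothetical sample size
THETA_0 = 0.5    # boundary value of H0: shooting percentage <= 50%
ALPHA = 0.05     # significance level

# Smallest count whose upper-tail probability under H0 is at most ALPHA.
critical = next(k for k in range(N_SHOTS + 1) if upper_tail(k, N_SHOTS, THETA_0) <= ALPHA)

print(f"reject H0 if at least {critical} of {N_SHOTS} shots are made")
print(f"P(X >= {critical} | theta = 0.5) = {upper_tail(critical, N_SHOTS, THETA_0):.4f}")
```

Because the count of made shots is discrete, the region cannot hit exactly 5%; the sketch picks the largest region whose probability under H₀ does not exceed the significance level.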
P-values
- A p-value is the probability, assuming H₀ is true, of a sample result at least as extreme as the one observed.
- A small p-value (below the significance level, commonly 0.05) is evidence against H₀ and justifies rejecting it.
- If the p-value is larger than the significance level, do not reject H₀: there is insufficient evidence for H₁.
- A p-value below the significance level means the sample result falls in the rejection region.
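The decision rule above can be sketched for the one-sided shooting example. The two results (12 and 16 made out of 20) are hypothetical, and the binomial model assumes independent shots.

```python
from math import comb

def p_value_upper(made, n, theta_0):
    """One-sided p-value: P(X >= made) if the true rate were theta_0."""
    return sum(comb(n, i) * theta_0**i * (1 - theta_0) ** (n - i) for i in range(made, n + 1))

ALPHA = 0.05
for made in (12, 16):   # two hypothetical results out of 20 shots
    p = p_value_upper(made, 20, 0.5)
    decision = "reject H0" if p < ALPHA else "do not reject H0"
    print(f"{made} of 20 made: p-value = {p:.4f} -> {decision}")
```

Making 12 of 20 is unremarkable for a 50% shooter, so H₀ is not rejected; making 16 of 20 would be very unusual under H₀, so it is rejected.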
Pitfalls: P-hacking
- P-hacking occurs when many hypotheses are tested, and only significant results are reported.
- Repeated testing increases the probability of finding significant results by chance.
- Good research tests a pre-specified effect rather than searching through multiple effects for one that reaches significance.
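How quickly repeated testing inflates chance findings can be sketched with a short calculation. It assumes the tests are independent and that every null hypothesis is actually true, which is a simplification; the function name is introduced here for illustration.

```python
ALPHA = 0.05   # significance level used for each individual test

def chance_of_false_positive(num_tests, alpha=ALPHA):
    """P(at least one significant result) across independent tests when every H0 is true."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 100):
    print(f"{k:3d} tests: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, running 20 tests gives roughly a 64% chance of at least one "significant" result even when no real effect exists, which is why reporting only the significant ones is misleading.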
Key Terms & Definitions
- Categorical Data — Data grouped into categories (e.g., team, position).
- Numerical Data — Data expressed as numbers, either discrete or continuous.
- Sample Statistic — Number calculated from a sample, used to estimate a parameter.
- Parameter — Unknown, fixed value describing the whole population.
- Confidence Interval — Range estimating a parameter with a stated confidence level.
- Hypothesis Test — Procedure to assess evidence for or against a claim.
- Null Hypothesis (H₀) — Default assumption in a statistical test.
- Alternate Hypothesis (H₁) — Claim tested against the null.
- Rejection Region — Outcomes where H₀ is rejected.
- P-value — Probability, assuming H₀ is true, of observing results at least as extreme as the sample.
- P-hacking — Manipulating analysis to find statistically significant (but possibly misleading) results.
Action Items / Next Steps
- Review examples of data types and distributions.
- Practice identifying null and alternate hypotheses for different scenarios.
- Read about confidence intervals and calculating p-values.
- Be cautious when interpreting p-values, and avoid p-hacking in research.