Statistics Overview and Key Concepts

Types of Data

Data is divided into categorical and numerical types.
Categorical data: Nominal (no order, e.g., team names) and ordinal (ordered, e.g., player positions).
Numerical data: Discrete (countable, e.g., missed free throws) and continuous (measurable, e.g., height).
Proportions summarize nominal data into numerical form (e.g., 3-point shooting percentage).
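As a minimal sketch of that last point, the snippet below turns a list of nominal outcomes into a proportion. The shot outcomes are made up for illustration and do not come from any real player.

```python
from collections import Counter

# Hypothetical outcomes of ten 3-point attempts (nominal data: two unordered categories).
attempts = ["make", "miss", "miss", "make", "make",
            "miss", "make", "miss", "miss", "make"]

counts = Counter(attempts)                    # tally each category
proportion = counts["make"] / len(attempts)   # fraction of attempts that were makes

print(f"3-point shooting percentage: {proportion:.1%}")
```

The categories themselves have no numeric meaning; only the count of one category divided by the total does.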

A probability distribution shows how values of a variable are spread across all possible outcomes.
Common distributions: Normal (bell curve), uniform (equal probability), bimodal (two peaks), skewed (long tail).
Many real-world measurements (like NBA heights) are approximately normally distributed, but plenty of data (such as salaries) are skewed instead.
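One quick way to tell a symmetric distribution from a skewed one is to compare the mean with the median. The sketch below simulates both shapes; the parameters (a mean of 200 with a standard deviation of 9, and an exponential with mean 10) are assumptions chosen only for illustration, not real NBA figures.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical parameters, chosen only for illustration.
normal_sample = [rng.gauss(200, 9) for _ in range(10_000)]        # bell-shaped
skewed_sample = [rng.expovariate(1 / 10) for _ in range(10_000)]  # long right tail

for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(f"{name}: mean = {mean:.2f}, median = {median:.2f}")
```

In the normal sample the mean and median land close together; in the right-skewed sample the long tail pulls the mean above the median.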

A sample is a subset of data from a larger population.
Sample statistics (like sample mean or proportion) estimate population parameters (unknown true values like long-term shooting percentage).
Larger sample sizes yield more precise (less variable) estimates.
Confidence intervals give a range of plausible values for the true parameter; for a 95% confidence interval, the method used to build it captures the true parameter in about 95% of repeated samples.
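To see how sample size affects precision, here is a rough sketch of a 95% confidence interval for a proportion using the normal approximation (one common method among several). The function name `proportion_ci` and the shot counts are hypothetical.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Approximate 95% CI for a proportion (normal approximation, z = 1.96)."""
    p_hat = successes / n                      # sample proportion
    se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error
    return p_hat - z * se, p_hat + z * se

# Same sample proportion (40%), two hypothetical sample sizes.
small = proportion_ci(40, 100)
large = proportion_ci(400, 1000)

print(f"n = 100:  ({small[0]:.3f}, {small[1]:.3f})")
print(f"n = 1000: ({large[0]:.3f}, {large[1]:.3f})")
```

Both intervals are centered on the same sample proportion, but the one built from 1000 shots is noticeably narrower. The approximation is unreliable for very small samples or proportions near 0 or 1.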

Population parameters use Greek symbols: μ (mean), σ (standard deviation), π or θ (proportion), ρ (correlation), β (regression slope).
Sample statistics use Roman letters: x̄ (mean), s (standard deviation), p (proportion), r (correlation), b (slope).
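The distinction between s and σ shows up in how the standard deviation is computed. A small sketch using Python's standard `statistics` module, with made-up heights in centimeters:

```python
import statistics

heights_cm = [198, 203, 191, 211, 206]  # hypothetical sample of five players

x_bar = statistics.mean(heights_cm)     # sample mean (x̄)
s = statistics.stdev(heights_cm)        # sample standard deviation (s), divides by n - 1
sigma_if_population = statistics.pstdev(heights_cm)  # formula for σ, divides by n

print(f"x̄ = {x_bar:.1f}, s = {s:.2f}, population formula = {sigma_if_population:.2f}")
```

`stdev` is the right choice when the data are a sample used to estimate σ; `pstdev` applies only when the data are the entire population.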

Hypothesis tests assess evidence for or against a claim about a population parameter.
Null hypothesis (H₀): default assumption (e.g., θ ≤ 0.5).
Alternate hypothesis (H₁): what you seek evidence for (e.g., θ > 0.5).
The outcome of a test is either to “reject” or “fail to reject” the null hypothesis, never to “prove” or “accept” anything.
The level of significance (α, often 0.05) defines the rejection region: H₀ is rejected when the p-value falls below α, meaning the evidence is strong enough.
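As a sketch of the θ > 0.5 example, the snippet below runs an exact one-sided binomial test. The shot counts and the helper name `binomial_p_value` are hypothetical, and the p-value is computed at the boundary value θ = 0.5, the case within H₀ that is hardest to reject.

```python
from math import comb

def binomial_p_value(successes, n, theta0=0.5):
    """P(X >= successes) when X ~ Binomial(n, theta0): a one-sided p-value."""
    return sum(
        comb(n, k) * theta0**k * (1 - theta0) ** (n - k)
        for k in range(successes, n + 1)
    )

alpha = 0.05                                     # level of significance
p_value = binomial_p_value(successes=15, n=20)   # hypothetical: 15 makes in 20 shots
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

With 15 makes in 20 attempts the p-value is about 0.02, so H₀ is rejected at the 0.05 level. With 12 makes in 20 it would be about 0.25, and the test would fail to reject H₀.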

P-hacking: testing many hypotheses and only reporting those with p < 0.05, inflating false-positive findings.
Proper research tests predefined hypotheses; improper research looks for any significant finding post hoc.
Multiple testing increases the chance of finding a false “significant” result.
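The effect of multiple testing can be worked out directly. A minimal sketch, assuming the tests are independent and every null hypothesis is actually true (the function name is hypothetical):

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests of true nulls."""
    return 1 - (1 - alpha) ** num_tests

for m in (1, 5, 20):
    print(f"{m:>2} tests: {familywise_error_rate(m):.1%} chance of a false positive")
```

With 20 such tests at α = 0.05, the chance of at least one false "significant" result is roughly 64%, which is why reporting only the hits is misleading.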

Categorical Data: data sorted into groups or categories.
Numerical Data: data measured as numbers, either discrete or continuous.
Proportion: a summary metric showing the fraction of occurrences.
Population Parameter: an unknown, fixed value describing the whole population.
Sample Statistic: a value calculated from a sample used to estimate a parameter.
Hypothesis Test: a procedure to evaluate evidence about a population parameter.
Null Hypothesis (H₀): the default assumption, typically that there is no effect or difference.
Alternate Hypothesis (H₁): what we seek evidence for in a test.
P-Value: the probability, assuming the null hypothesis is true, of observing a sample statistic at least as extreme as the one actually observed.
Confidence Interval: a range of plausible values for a parameter, computed from the sample data.
P-Hacking: misuse of statistical testing by running many unplanned comparisons and reporting only the significant ones.

Review the definitions of data types, parameters, and hypothesis test procedures.
Consider further readings or videos on the normal distribution, regression, and standard deviation for deeper understanding.
Be aware of p-hacking and the importance of predefining hypotheses in research.