SWiRL: learning statistics module

Oct 15, 2024

Introduction to Statistics Lecture Notes

Introduction

  • Presenter: Justin Zeltzer from zstatistics.com
  • Challenge: Explain statistics in under half an hour without maths.
  • Goal: Develop intuition about statistics for beginners or the curious.
  • Examples themed around the NBA.

Types of Data

  • Categorical Data
    • Nominal: Categories with no inherent order (e.g., NBA teams)
    • Ordinal: Categories with some natural order (e.g., player positions)
  • Numerical Data
    • Discrete: Countable values (e.g., number of free throws missed)
    • Continuous: Any value within a range (e.g., player's height)
  • Proportions: Aggregate nominal data into a numerical summary (the share of observations in each category).
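
As a rough illustration of turning nominal data into proportions, the sketch below counts categories and divides by the total. The team list and the helper name `proportions` are made up for the example and do not come from the lecture.

```python
from collections import Counter

# Hypothetical nominal data: the team of each player in a small sample.
teams = ["Warriors", "Lakers", "Warriors", "Celtics", "Warriors", "Lakers"]

def proportions(values):
    """Return the share of observations that fall in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

team_shares = proportions(teams)
for team, share in team_shares.items():
    print(f"{team}: {share:.2f}")
```

The categories themselves stay unordered; only the shares are numbers that can be compared or averaged.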

Distributions

  • Probability Density Function: Describes how values are distributed across a population.
  • Normal Distribution: Occurs commonly; symmetric around the mean, with extreme values increasingly rare.
  • Other Distributions
    • Uniform Distribution: Equal probability across values.
    • Bimodal Distribution: Two peaks in data.
    • Skewed Distribution: Longer tail on one side (left or right skew).
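
One informal way to get a feel for these shapes is to draw random values and compare the mean with the median. The sketch below covers three of the shapes (normal, uniform, right-skewed); every parameter in it is an illustrative assumption rather than a real NBA figure, and the exponential distribution is used only as a convenient stand-in for right skew.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
n = 10_000

# All parameters below are illustrative assumptions.
draws = {
    "normal": [rng.gauss(200, 10) for _ in range(n)],
    "uniform": [rng.uniform(0, 1) for _ in range(n)],
    "right-skewed": [rng.expovariate(1.0) for _ in range(n)],
}

# Mean and median for each set of draws.
summary = {
    name: (statistics.mean(values), statistics.median(values))
    for name, values in draws.items()
}

for name, (mean, median) in summary.items():
    print(f"{name:>12}: mean={mean:.3f}  median={median:.3f}")
```

For the symmetric shapes the mean and median land close together; for the right-skewed draws the long tail pulls the mean above the median.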

Sampling Distributions

  • Sampling: Selecting a group to infer about a population.
  • Distribution of Sample Means: Narrower than the population distribution, because averaging smooths out extreme individual values.
  • Variance Reduction: Larger samples yield less variance in sample means.
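
The narrowing of the sampling distribution can be checked with a small simulation. This is a minimal sketch that assumes a made-up, roughly normal population of heights; the function name `sample_means` and all numbers are hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical population: 20,000 heights in cm, roughly normal.
population = [rng.gauss(200, 10) for _ in range(20_000)]

def sample_means(sample_size, n_samples=2_000):
    """Draw many samples of the given size and return the mean of each."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

spread_population = statistics.pstdev(population)
spread_n5 = statistics.stdev(sample_means(5))
spread_n50 = statistics.stdev(sample_means(50))

print(f"population sd:          {spread_population:.2f}")
print(f"sd of means (n = 5):    {spread_n5:.2f}")
print(f"sd of means (n = 50):   {spread_n50:.2f}")
```

The spread of the sample means shrinks as the sample size grows, roughly in proportion to one over the square root of the sample size.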

Estimation

  • Sample Statistic: An estimate of a population parameter.
  • Theta (θ): Symbol for an unknown parameter.
  • Confidence Intervals: Give a range of plausible values for a parameter at a stated confidence level (e.g., 95%).
  • Example: Comparing Steph Curry's and Meyers Leonard's three-point shooting.
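
To make the idea concrete, here is a sketch of a 95% confidence interval for a shooting proportion using the normal approximation (often called the Wald interval). The shot counts are hypothetical and are not the players' real statistics; the lecture itself avoids formulas, so the choice of this particular interval is an assumption.

```python
import math
from statistics import NormalDist

def proportion_ci(made, attempts, confidence=0.95):
    """Normal-approximation (Wald) interval for a shooting proportion."""
    p_hat = made / attempts
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / attempts)
    return p_hat - margin, p_hat + margin

# Hypothetical shot counts: same 40% success rate, very different sample sizes.
high_volume = proportion_ci(made=400, attempts=1000)
low_volume = proportion_ci(made=8, attempts=20)

print(f"1000 attempts: {high_volume[0]:.3f} to {high_volume[1]:.3f}")
print(f"  20 attempts: {low_volume[0]:.3f} to {low_volume[1]:.3f}")
```

Both shooters have the same sample proportion, but the one with few attempts gets a much wider interval, which is why a hot percentage on a handful of shots says little about the underlying parameter.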

Parameters and Estimation

  • Common Parameters
    • Mu (μ): Mean
    • Sigma (σ): Standard deviation
    • Pi (π): Proportion
    • Rho (ρ): Correlation
    • Beta (β): Gradient (regression)
  • Sample Statistics: Estimates for parameters (e.g., x-bar for mean).
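
Each parameter above has a sample statistic that estimates it. The sketch below computes them by hand on a tiny made-up data set; the names `x_bar`, `s`, `p_hat`, `r`, and `b` follow common textbook notation, which is an assumption here since the notes only mention x-bar.

```python
import math

# Hypothetical data for four players: minutes played and points scored.
minutes = [10, 20, 30, 40]
points = [4, 9, 13, 18]
# Hypothetical free-throw outcomes: 1 = made, 0 = missed.
free_throws = [1, 0, 1, 1, 0]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

x_bar = mean(points)         # estimates mu (mean)
s = sample_sd(points)        # estimates sigma (standard deviation)
p_hat = mean(free_throws)    # estimates pi (proportion)

# Sums of squares and cross-products for correlation and regression slope.
mx, my = mean(minutes), mean(points)
sxy = sum((x - mx) * (y - my) for x, y in zip(minutes, points))
sxx = sum((x - mx) ** 2 for x in minutes)
syy = sum((y - my) ** 2 for y in points)

r = sxy / math.sqrt(sxx * syy)   # estimates rho (correlation)
b = sxy / sxx                    # estimates beta (gradient of points on minutes)

print(f"x_bar={x_bar:.2f} s={s:.2f} p_hat={p_hat:.2f} r={r:.3f} b={b:.2f}")
```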

Hypothesis Testing

  • Hypothesis Test: Tests evidence against a null hypothesis (H0).
  • Alternate Hypothesis: What you seek evidence for.
  • Rejection Region: Range of sample outcomes extreme enough that the null hypothesis is rejected.
  • Significance Level: Probability of rejecting the null hypothesis when it is true; commonly set at 5%.
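
A rejection region can be sketched for a simple one-sided test of a proportion. The scenario is hypothetical: H0 says a player makes 50% of shots, the alternative says more than 50%, and the normal approximation is assumed to hold for 100 attempts. The function names are illustrative.

```python
import math
from statistics import NormalDist

def critical_value(p0, n, alpha=0.05):
    """Boundary of the upper-tail rejection region for a sample proportion."""
    se = math.sqrt(p0 * (1 - p0) / n)      # standard error if H0 is true
    z = NormalDist().inv_cdf(1 - alpha)    # about 1.645 for alpha = 0.05
    return p0 + z * se

def reject_null(p_hat, p0, n, alpha=0.05):
    """True when the observed proportion lies in the rejection region."""
    return p_hat > critical_value(p0, n, alpha)

cutoff = critical_value(p0=0.5, n=100)
print(f"reject H0 when the sample proportion exceeds {cutoff:.3f}")
print("60 of 100 made:", reject_null(0.60, p0=0.5, n=100))
print("55 of 100 made:", reject_null(0.55, p0=0.5, n=100))
```

Making 55 of 100 is above 50% but not far enough above to rule out chance at the 5% level; making 60 of 100 is.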

P-Values

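  • Definition: The probability of seeing a sample at least as extreme as the one observed, assuming the null hypothesis is true.
  • Interpretation: A lower p-value indicates more evidence against the null hypothesis.
  • Relation to Hypothesis Tests: At a 5% significance level, a p-value below 0.05 means the sample falls in the rejection region.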
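
The link between p-values and the significance level can be illustrated with a one-sided test of a proportion. As before, this is a sketch under assumptions: a hypothetical H0 of 50% shooting, 100 attempts, and a normal approximation for the sample proportion.

```python
import math
from statistics import NormalDist

def p_value(p_hat, p0, n):
    """One-sided p-value: chance of a sample proportion >= p_hat if H0 is true."""
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    z = (p_hat - p0) / se               # how many standard errors above p0
    return 1 - NormalDist().cdf(z)

p_strong = p_value(0.60, p0=0.5, n=100)
p_weak = p_value(0.55, p0=0.5, n=100)

alpha = 0.05
for label, p in [("60 of 100", p_strong), ("55 of 100", p_weak)]:
    verdict = "reject H0" if p < alpha else "do not reject H0"
    print(f"{label}: p = {p:.4f} -> {verdict}")
```

The more extreme sample gets the smaller p-value, and comparing the p-value with 0.05 gives the same decision as checking whether the sample lies in the 5% rejection region.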

P-Hacking

  • Definition: Manipulating data or repeating analyses until a statistically significant result appears.
  • Problem in Research: Testing multiple hypotheses increases false positives.
  • Good vs. Bad Research: Good research tests a single hypothesis; bad research tests many and reports only significant findings.
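
The false-positive problem can be quantified. If every null hypothesis is true and the tests are independent, the chance that at least one of k tests comes out significant at level alpha is 1 - (1 - alpha)^k. The sketch below checks that figure by simulation; it assumes that p-values are uniformly distributed under a true null (which holds for continuous test statistics), and the function names are hypothetical.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent true-null tests is significant."""
    return 1 - (1 - alpha) ** n_tests

def simulate(n_tests, alpha=0.05, n_studies=10_000, seed=1):
    """Share of simulated studies that find at least one 'significant' result."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under a true null hypothesis, each p-value is uniform on [0, 1).
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies

analytic = chance_of_false_positive(20)
simulated = simulate(20)
print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

With 20 hypotheses tested, a study has roughly a 64% chance of producing at least one significant result even when nothing real is going on, which is why reporting only the significant findings is misleading.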

Conclusion

  • Statistics deals with uncertainty and inference rather than proof.
  • Importance of understanding potential for misuse in research (p-hacking).
  • Additional resources and in-depth discussions available on zstatistics.com.