Introduction to Statistics by Justin Zeltzer

Jun 14, 2024

Lecture Overview

  • Lecturer: Justin Zeltzer from zstatistics.com
  • Objective: Explain statistics in under half an hour without using math
  • Target Audience: Beginners enrolled in a statistics course or those curious about statistics
  • Theme for Examples: NBA basketball

Types of Data

Categorical Data

  • Definition: Data that can be divided into categories
  • Subtypes:
    • Nominal: No intrinsic order (e.g., 'What team does Steph Curry play for?')
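    • Ordinal: Categories with a meaningful order (e.g., 'What position does Steph Curry play?', since positions are conventionally numbered 1 for point guard through 5 for center)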

Numerical Data

  • Definition: Data that represents numbers
  • Subtypes:
    • Discrete: Countable values, typically whole numbers (e.g., 'How many free throws has Steph missed?')
    • Continuous: Any value within a range, measurable to arbitrary precision (e.g., 'What is Steph's height?')

Special Case: Proportions

  • Example: 'What is Steph's three-point percentage this season?'
  • Explanation: Each shot is a nominal outcome (made or missed). A proportion aggregates many such outcomes into a single numerical summary.
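
To make the idea concrete, here is a minimal sketch of a proportion computed from nominal outcomes. The shot log and the helper name `proportion` are hypothetical, made up for illustration rather than taken from the lecture or from real season data.

```python
# Hypothetical shot log: each entry is a nominal outcome, not real data.
shots = ["made", "missed", "made", "made", "missed", "missed", "made", "missed"]


def proportion(outcomes, category):
    """Share of outcomes that equal the given category."""
    return sum(1 for outcome in outcomes if outcome == category) / len(outcomes)


three_point_pct = proportion(shots, "made")
print(f"Three-point percentage: {three_point_pct:.1%}")
```

The individual entries are categories, but the summary is a number, which is why proportions sit between the two data types.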

Distributions

Types of Distributions

  • Normal Distribution (Bell Curve): Common in statistics; bulk of data is in the middle.
  • Other Distributions:
    • Uniform Distribution: Equal probability across values
    • Bimodal Distribution: Two peaks
    • Skewed Distribution: Long tail on one side (left-skewed has the tail on the left, right-skewed has it on the right)

Probability Density Function (PDF)

  • Describes how likely a continuous variable is to fall within any range of values
  • Example: The heights of all NBA players, which are roughly symmetrical with most players near the middle

Sampling Distributions

  • Concept: If we draw a random sample of players, what is the probability distribution of the sample's average height?
  • Key Point: Larger sample sizes reduce the variance of the sample mean, making extreme sample means less likely.
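
The key point can be checked with a small simulation. This is a sketch under stated assumptions: the population of 500 heights is randomly generated (not real NBA data), and `sample_mean_spread` is a hypothetical helper name.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 500 heights in cm, roughly bell-shaped.
population = [random.gauss(200, 9) for _ in range(500)]


def sample_mean_spread(sample_size, repeats=2000):
    """Standard deviation of the sample mean across many random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


spread_small = sample_mean_spread(5)
spread_large = sample_mean_spread(50)
print(f"Spread of sample means, n=5:  {spread_small:.2f} cm")
print(f"Spread of sample means, n=50: {spread_large:.2f} cm")
```

The spread for samples of 50 should come out clearly smaller than for samples of 5, which is the narrowing of the sampling distribution described above.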

Sampling and Estimation

Sample Statistics vs. Population Parameters

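  • Example: Steph Curry's three-point percentage so far is a sample statistic, calculated from the shots we have observed
  • Theta (θ): Represents the true, long-term value we are trying to estimate (the population parameter)
  • Confidence Intervals: A range around the sample statistic that quantifies the uncertainty in our estimate of θ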

Example: Confidence Intervals

  • Steph Curry: More data (many shot attempts), so a narrower confidence interval
  • Meyers Leonard: Less data (few shot attempts), so a wider confidence interval
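
The effect of sample size on interval width can be sketched as follows. This uses the normal-approximation interval for a proportion, which is one common choice and not necessarily the method used in the lecture. The shot counts are hypothetical, not real figures for either player.

```python
import math


def proportion_ci(made, attempts, z=1.96):
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    p = made / attempts
    margin = z * math.sqrt(p * (1 - p) / attempts)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical counts with a similar success rate but very different volume.
high_volume = proportion_ci(made=430, attempts=1000)
low_volume = proportion_ci(made=13, attempts=30)

for label, (low, high) in [("high volume", high_volume), ("low volume", low_volume)]:
    print(f"{label}: {low:.3f} to {high:.3f} (width {high - low:.3f})")
```

Both shooters have a sample proportion near 43%, but the interval built from 30 attempts is several times wider than the one built from 1000.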

Parameters and Sample Statistics

Common Parameters

  • Mu (μ): Mean of a numerical variable
  • Sigma (σ): Standard deviation of a numerical variable
  • Pi (π): Proportion of a categorical variable
  • Rho (ρ): Correlation between two variables
  • Beta (β): Gradient (slope) of the relationship between two variables
  • Theta (θ): General parameter

Sample Statistics Notation

  • x-bar (x̄): Sample mean
  • s: Sample standard deviation
  • p: Sample proportion
  • r: Sample correlation
  • b: Sample gradient
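
As an illustration of the notation, the sketch below computes x̄, s, r and b by hand for a tiny data set. The paired values (minutes played and points scored) are hypothetical, and b is taken here to be the least-squares slope of y on x, which is an assumption about what the lecture means by "gradient".

```python
import math

# Hypothetical paired data: minutes played (x) and points scored (y).
x = [20, 25, 30, 35, 40]
y = [8, 12, 15, 18, 24]

n = len(x)
x_bar = sum(x) / n  # sample mean of x
y_bar = sum(y) / n  # sample mean of y

# Sample standard deviations use n - 1 in the denominator.
s_x = math.sqrt(sum((v - x_bar) ** 2 for v in x) / (n - 1))
s_y = math.sqrt(sum((v - y_bar) ** 2 for v in y) / (n - 1))

# Sample covariance, then correlation r and least-squares gradient b.
cov_xy = sum((a - x_bar) * (c - y_bar) for a, c in zip(x, y)) / (n - 1)
r = cov_xy / (s_x * s_y)
b = cov_xy / s_x**2

print(f"x-bar = {x_bar}, s = {s_x:.2f}, r = {r:.3f}, b = {b:.2f}")
```

Each of these sample statistics is an estimate of the matching parameter: x̄ for μ, s for σ, r for ρ and b for β.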

Hypothesis Testing

Concept

  • Example: Is Meyers Leonard shooting above 50%?
  • Null Hypothesis (H0): The default assumption we test against, e.g., θ ≤ 0.5
  • Alternative Hypothesis (H1): The claim we are seeking evidence for, e.g., θ > 0.5
  • Rejection Region: The range of sample statistic values extreme enough that, if our sample falls within it, the null hypothesis is rejected

Key Points

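  • P-value: The probability of seeing a sample at least as extreme as the one observed, assuming the null hypothesis is true
  • Level of Significance (usually 5%): If the p-value is less than this, reject the null hypothesis; otherwise, fail to reject it
  • Interpretation: Never use the words 'prove' or 'accept' in hypothesis testing conclusions; a test can only 'reject' or 'fail to reject' the null hypothesis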
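
The test of θ > 0.5 can be sketched with an exact one-sided binomial calculation. This is one possible way to obtain a p-value, not necessarily the method used in the lecture. The record of 20 makes from 30 attempts is hypothetical, and `binomial_p_value` is an illustrative helper name.

```python
from math import comb


def binomial_p_value(successes, trials, null_rate=0.5):
    """P(X >= successes) when X ~ Binomial(trials, null_rate)."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical record: 20 makes from 30 attempts.
# H0 is evaluated at its boundary value, theta = 0.5.
p_value = binomial_p_value(successes=20, trials=30)
alpha = 0.05  # level of significance

if p_value < alpha:
    print(f"p = {p_value:.4f}: reject H0")
else:
    print(f"p = {p_value:.4f}: fail to reject H0")
```

Note that the conclusion is worded as "reject" or "fail to reject", never "prove" or "accept".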

P-Hacking

Concept

  • Misuse of p-values by testing multiple hypotheses and reporting only favorable outcomes
  • Good Research: Theorize an effect and test it specifically
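  • Bad Research: Collect data first and then test for various effects. At a 5% level of significance, roughly 1 in 20 tests of a nonexistent effect will look significant by chance, so this approach produces false positives.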
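
A short simulation shows why testing many effects produces false positives. It is a sketch under stated assumptions: every simulated player truly shoots exactly 50%, so the null hypothesis is true in every test, and any rejection is a false positive. The player count, shot count and helper name are hypothetical.

```python
import random
from math import comb

random.seed(1)  # fixed seed so the sketch is repeatable


def binomial_p_value(successes, trials, null_rate=0.5):
    """P(X >= successes) when X ~ Binomial(trials, null_rate)."""
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


n_players, n_shots, alpha = 2000, 30, 0.05
false_positives = 0
for _ in range(n_players):
    # Simulate one player's shots; each is made with probability 0.5.
    makes = sum(random.random() < 0.5 for _ in range(n_shots))
    if binomial_p_value(makes, n_shots) < alpha:
        false_positives += 1

rate = false_positives / n_players
print(f"{false_positives} of {n_players} tests were 'significant' ({rate:.1%})")
```

Reporting only the players who came out "significant" would look like a discovery, even though no player in the simulation is actually better than 50%.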

Real-World Implications

  • P-hacking leads to invalid scientific research, because reported findings may be nothing more than chance results
  • Proper methodology is crucial for valid conclusions

Conclusion

  • Wrap-Up: Summarized the key points of statistics without intensive math
  • Additional Resources: More in-depth discussions and videos available on zstatistics.com