Understanding ANOVA and GLM Framework

Aug 11, 2024

Crash Course Statistics: ANOVA and General Linear Model Framework

Introduction

  • Presenter: Adriene Hill
  • Topic: Testing differences between multiple groups using ANOVA (Analysis of Variance)
  • Context: Previously discussed t-tests for comparing two groups; now expanding to more than two groups.

General Linear Model (GLM) Framework

  • Concept: Partitions data into two piles
    • Explained by the model: Represents the way we think things work
    • Error: Amount of information the model fails to explain

ANOVA (Analysis of Variance)

  • Purpose: Compare measurements of more than two groups
  • Similarity to Regression: Uses categorical variables to predict a continuous one
  • Example: Predicting number of yards a soccer player runs based on position
  • Model Building: Similar to regression models but with categorical variables

Example: Bunny Count Model

  • Model: Number of bunnies seen based on weather (rainy vs. sunny)
    • Prediction: 1 bunny on rainy days, 5 on sunny days
    • Error Calculation: Difference between prediction and actual observation
  • Regression Interpretation: Slope represents the difference in means between two groups
    • Slope Calculation: Expected value on a sunny day minus rainy day value (5-1=4)
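
The bunny model can be sketched as a regression with a 0/1 weather variable. The counts below are made up for illustration and chosen only so that the group means come out to 1 (rainy) and 5 (sunny); the variable names are likewise assumptions, not from the video. The sketch fits the line by ordinary least squares and splits each observation into a prediction plus an error.

```python
from statistics import mean

# Hypothetical bunny counts (illustration only): means are 1 (rainy) and 5 (sunny).
rainy = [0, 1, 2, 1]
sunny = [4, 6, 5, 5]

# Dummy-code the weather: 0 = rainy, 1 = sunny.
x = [0] * len(rainy) + [1] * len(sunny)
y = rainy + sunny

# Ordinary least squares with a single predictor.
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# GLM view: every observation = model prediction + error.
predictions = [intercept + slope * xi for xi in x]
errors = [yi - pi for yi, pi in zip(y, predictions)]

print(f"intercept (rainy-day mean) = {intercept:.1f}")
print(f"slope (sunny mean - rainy mean) = {slope:.1f}")
```

With a 0/1 predictor, the intercept is the mean of the group coded 0 and the slope is the gap between the two group means, which is why the notes describe the slope as 5-1=4.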

Chocolate Bar Rating Example

  • Dataset Source: Kaggle.com, by Brady Brelinski
  • Groups: Chocolate bars made from Criollo, Forastero, or Trinitario beans
  • Rating Scale: 5 (highest) to 1 (mostly unpalatable)

ANOVA Calculation Steps

  1. Sums of Squares Total (SST): Sum of squared differences between each rating and the overall mean
  2. Partition Variation: Split SST into Model Sums of Squares (SSM) and Sums of Squares for Error (SSE), so that SST = SSM + SSE
    • SSM: Variation explained by the model (group means)
    • SSE: Variation not explained by the model
  3. F-statistic Calculation: Divide SSM and SSE by their degrees of freedom, then take the ratio: F = (SSM / df model) / (SSE / df error)
    • Degrees of Freedom (Model): k-1, where k is the number of groups (3 groups -> 2 degrees of freedom)
    • Degrees of Freedom (Error): n-k, where n is the sample size (790 data points - 3 groups = 787)
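
The steps above can be sketched in a few lines of Python. The ratings here are hypothetical (four made-up values per bean type, not the Kaggle data), so the resulting F-statistic says nothing about real chocolate. The point is only to show how SST splits into SSM and SSE and how the degrees of freedom enter the ratio.

```python
from statistics import mean

# Hypothetical ratings on the 1-5 scale; NOT taken from the Kaggle dataset.
groups = {
    "Criollo": [2.75, 3.0, 3.0, 3.25],
    "Forastero": [3.25, 3.5, 3.5, 3.75],
    "Trinitario": [3.25, 3.5, 3.25, 3.5],
}

all_ratings = [r for ratings in groups.values() for r in ratings]
grand_mean = mean(all_ratings)
n = len(all_ratings)  # total number of observations
k = len(groups)       # number of groups

# Step 1: total variation around the overall mean.
sst = sum((r - grand_mean) ** 2 for r in all_ratings)

# Step 2: partition into model (between-group) and error (within-group) variation.
ssm = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
sse = sum((r - mean(g)) ** 2 for g in groups.values() for r in g)

# Step 3: scale each piece by its degrees of freedom and take the ratio.
df_model = k - 1
df_error = n - k
f_stat = (ssm / df_model) / (sse / df_error)

print(f"SST = {sst:.4f}, SSM + SSE = {ssm + sse:.4f}")
print(f"df model = {df_model}, df error = {df_error}, F = {f_stat:.4f}")
```

Printing SST next to SSM + SSE is a quick sanity check: the two should agree up to floating-point rounding.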

Interpretation of Results

  • F-statistic: Value of 7.7619
  • P-value: 0.000459 (small enough to reject the null hypothesis)
  • Conclusion: Evidence that bean type influences chocolate ratings
  • Omnibus Test: Significant F-statistic indicates some difference but not where
    • Follow-up: Conduct multiple t-tests to pinpoint significant differences
      • 3 T-tests: To compare each unique pair of categories
      • Findings: Criollo beans have lower ratings compared to Trinitario and Forastero; no significant difference between Trinitario and Forastero
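
As a rough check on the numbers above, the p-value can be recovered from the F-statistic and the two degrees of freedom. This sketch relies on a shortcut that holds only when the model has exactly 2 degrees of freedom, where the upper tail of the F distribution has a simple closed form; for other cases a statistics library would be the usual choice. It also lists the three pairs that the follow-up t-tests compare. Variable names are illustrative.

```python
from itertools import combinations

f_stat = 7.7619   # F-statistic reported in the notes
df_model = 2      # k - 1 with 3 bean types
df_error = 787    # n - k with 790 ratings

# Upper-tail probability of the F distribution.
# This closed form is valid ONLY for df_model == 2.
assert df_model == 2
p_value = (1 + df_model * f_stat / df_error) ** (-df_error / 2)
print(f"p = {p_value:.6f}")

# Every unique pair of bean types gets its own follow-up t-test.
beans = ["Criollo", "Forastero", "Trinitario"]
pairs = list(combinations(beans, 2))
for a, b in pairs:
    print(f"t-test: {a} vs. {b}")
```

The result should land very close to the reported 0.000459.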

Historical Example: Fisher's Potato Study

  • Context: 12 different potato varieties, effect of fertilizers
  • ANOVA Table: Summarizes degrees of freedom, sums of squares, mean squares, F-statistic, and p-value
  • Results: Significant F-statistic indicating differences among potato varieties
    • Follow-up: Multiple tests needed to identify specific differences
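
To get a feel for how much follow-up work a significant result can imply, this small sketch counts the unique pairs among a set of groups. It assumes the follow-up consists of comparing every pair of varieties, which is one common approach and not necessarily what Fisher did.

```python
from math import comb

# Number of unique pairwise comparisons among g groups is "g choose 2".
for label, g in [("bean types", 3), ("potato varieties", 12)]:
    print(f"{g} {label}: {comb(g, 2)} pairwise comparisons")

pairs_for_potatoes = comb(12, 2)
```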

Key Takeaways

  1. Similarity of Statistical Models: ANOVAs and Regressions both use GLM to model the world
  2. Filtering: ANOVAs help determine if there’s an overall effect before delving into specifics
    • Efficiency: Saves time by skipping pairwise follow-up tests when the omnibus test finds no overall effect