Crash Course Statistics: General Linear Models

Jun 27, 2024

Introduction

  • Presenter: Adriene Hill
  • Topic: General Linear Model (GLM)
  • Key Idea: GLMs are flexible; one framework takes many forms across statistical analyses, much like a Transformer changing shape.

General Linear Models (GLMs)

  • Basic Concept: Data can be explained by the model and some error.
  • Model Equation: Often a line, Y = mx + b (equivalently Y = b + mx), where b is the intercept and m is the slope
    • Example: Predicting trick-or-treaters based on local middle school enrollment
    • Model: Baseline of 25 trick-or-treaters plus 0.01 additional trick-or-treaters per enrolled middle school student
    • Reality vs. Model: Predicted 35, actual 42, so error = 42 - 35 = 7
  • Error: Deviation from the model, not necessarily something 'wrong'
    • Sources: Unaccounted variables, random variation
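
The "data = model + error" idea can be sketched in a few lines of Python. The intercept, slope, prediction, and observed count come from the example above; the enrollment of 1000 students is an assumption, inferred from the predicted value of 35 rather than stated in the notes, and the function name is made up for illustration.

```python
# Coefficients taken from the trick-or-treater example in the notes.
BASELINE = 25.0      # intercept b: expected trick-or-treaters at zero enrollment
PER_STUDENT = 0.01   # slope m: additional trick-or-treaters per enrolled student


def predict_trick_or_treaters(enrollment):
    """Model part of 'data = model + error': Y = b + m * x."""
    return BASELINE + PER_STUDENT * enrollment


# Assumption: 1000 students, inferred from the predicted value of 35.
enrollment = 1000
predicted = predict_trick_or_treaters(enrollment)
actual = 42                  # observed count from the notes
error = actual - predicted   # the part of the data the model does not explain

print(f"predicted={predicted:.0f}, actual={actual}, error={error:.0f}")
```

The error is simply what is left over after the model has made its prediction; it is computed, not assumed.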

Types of GLMs

1. Linear Regression

  • Use: Predicts a continuous outcome from a continuous predictor
    • Example: Predicting YouTube video likes based on comments
  • Data Plotting: Visually check for linearity and outliers
    • Whether outliers are kept or removed changes the fitted regression line

Creating the Regression Model

  • Assumption: The relationship is linear
  • Fitting the Model: Usually done by computers
  • Regression Line: Minimizes the sum of squared vertical distances between the data points and the line (least squares)
    • Equation includes the y-intercept (e.g., 9104 likes for 0 comments) and the slope (e.g., 6.5 additional likes per comment)
  • Residuals/Error: Difference between observed and predicted values (observed minus predicted)
    • Ideal: Residuals scattered evenly around zero, with no visible pattern
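
As a rough illustration of what the computer does when it fits the line, here is a minimal least-squares sketch for one predictor. The data points are hypothetical, invented for this example (they are not the video's data), and `fit_line` is an illustrative helper name, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical (comments, likes) data, invented for illustration only.
comments = [100, 250, 400, 600, 900]
likes = [9800, 10700, 11600, 13100, 14900]

intercept, slope = fit_line(comments, likes)
predicted = [intercept + slope * x for x in comments]
# Residual = observed minus predicted, one per data point.
residuals = [y - y_hat for y, y_hat in zip(likes, predicted)]

print(f"likes = {intercept:.1f} + {slope:.2f} * comments")
print("residuals:", [round(r, 1) for r in residuals])
```

With an intercept in the model, least-squares residuals always sum to zero, so a residual check looks at their pattern and spread rather than their total.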

Statistical Tests on Regression Coefficients

F-test

  • Purpose: A test statistic that quantifies how well the data fit the null hypothesis, by comparing the variation the model explains with the variation it does not
  • Null Hypothesis: No relationship between predictor and outcome
    • Expected slope of 0 for the regression line
    • Scatter plot would look like a blob
  • Notation and Calculations:
    • Y-hat: Predicted value
    • Y-bar: Mean value
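    • Total Variation: Sum of Squares Total (SST), the sum of squared differences between each observed Y and Y-bar; dividing it by n - 1 gives the variance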
    • SSR (Sums of Squares for Regression): Variation explained by the model
    • SSE (Sums of Squares for Error): Variation not explained by the model
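
The three sums of squares can be computed directly from Y, Y-hat, and Y-bar. The sketch below uses hypothetical data invented for illustration and fits the line by least squares; for such a fit with an intercept, the total variation splits exactly into the explained and unexplained parts (SST = SSR + SSE).

```python
# Hypothetical (comments, likes) data, invented for illustration only.
xs = [100, 250, 400, 600, 900]
ys = [9800, 10700, 11600, 13100, 14900]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n  # Y-bar: mean of the observed values

# Least-squares slope and intercept for a single predictor.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * x for x in xs]  # Y-hat: predicted values

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by the model
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # not explained by the model

print(f"SST={sst:.1f}  SSR={ssr:.1f}  SSE={sse:.1f}  SSR+SSE={ssr + sse:.1f}")
```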

Degrees of Freedom

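  • SSE Degrees of Freedom: n - 2 (sample size minus the two estimated coefficients, intercept and slope)
  • SSR Degrees of Freedom: 1 (one slope coefficient)
  • Division by Degrees of Freedom: Dividing each sum of squares by its degrees of freedom turns it into a mean square, which puts the two on a comparable scale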

Calculating the F-statistic

  • Formula Components:
    • Numerator: SSR divided by its degrees of freedom
    • Denominator: SSE divided by its degrees of freedom
  • p-value: Found using an F-distribution
    • Example: A very small p-value (below the usual 0.05 threshold) leads to rejecting the null hypothesis of no relationship
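
Putting the pieces together, the sketch below computes the F-statistic and its p-value for a one-predictor regression using only the standard library. The data are hypothetical and the function names are illustrative. The p-value helper is a stand-in technique, not the video's method: it relies on the fact that an F distribution with 1 numerator degree of freedom is the square of a t distribution, and integrates the t density numerically with Simpson's rule. In practice a statistics library would normally handle this step (for example, SciPy's `scipy.stats.f.sf`).

```python
import math


def p_value_one_slope(f_stat, df_error, steps=20000):
    """P(F > f_stat) for an F(1, df_error) distribution.

    Uses F(1, v) = t(v)^2 and Simpson's rule on the t density.
    `steps` must be even.
    """
    v = df_error
    log_c = (math.lgamma((v + 1) / 2) - math.lgamma(v / 2)
             - 0.5 * math.log(v * math.pi))

    def pdf(t):
        return math.exp(log_c - (v + 1) / 2 * math.log1p(t * t / v))

    upper = math.sqrt(f_stat)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_area = 2 * total * h / 3  # P(-upper < T < upper)
    return max(0.0, 1.0 - central_area)


def f_test_simple_regression(xs, ys):
    """Return (F, p_value) for H0: slope = 0 with one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * x for x in xs]

    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation
    df_regression, df_error = 1, n - 2

    # Ratio of the two mean squares.
    f_stat = (ssr / df_regression) / (sse / df_error)
    return f_stat, p_value_one_slope(f_stat, df_error)


# Hypothetical (comments, likes) data, invented for illustration only.
comments = [100, 250, 400, 600, 900]
likes = [9800, 10700, 11600, 13100, 14900]

f_stat, p_value = f_test_simple_regression(comments, likes)
print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
print("reject H0 at 0.05" if p_value < 0.05 else "fail to reject H0 at 0.05")
```

A large F means the model explains far more variation per degree of freedom than it leaves unexplained, which is what drives the p-value down.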

Conclusion

  • Relation to Real Life:
    • GLMs explain data through the model and error
    • Examples: Deviations from a budget, predicting a roommate's anger
  • Future Topics: More on the importance of F-tests
  • Applications: Widely used in various fields (science, economics, political science) to model relationships and make predictions

Remember: Regression describes an association between variables; on its own it does not establish causation.