Crash Course Statistics: General Linear Models (GLM)
Introduction
- Presenter: Adriene Hill
- Topic: General Linear Model (GLM)
- Key Idea: Flexibility of GLMs in statistical analysis, similar to how a Transformer can change forms.
General Linear Models (GLMs)
- Basic Concept: Data can be explained by the model and some error.
- Model Equation: Generally in the form Y = mx + b (or Y = b + mx)
- Example: Predicting trick-or-treaters based on local middle school enrollment
- Baseline of 25 trick-or-treaters + 0.01 increase per middle school student
- Reality vs. Model: Predicted 35, actual 42, so error = 7
- Error: Deviation from the model, not necessarily something 'wrong'
- Sources: Unaccounted variables, random variation
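The trick-or-treater example can be worked through as a small sketch. The enrollment figure of 1000 is not stated in the notes; it is inferred from the prediction of 35 (25 + 0.01 × 1000), and the function name is made up for illustration.

```python
# Model from the notes: a baseline of 25 trick-or-treaters,
# plus 0.01 more for each enrolled middle school student.
BASELINE = 25.0
PER_STUDENT = 0.01


def predict_trick_or_treaters(enrollment: float) -> float:
    """Return the model's prediction, Y = b + m * x."""
    return BASELINE + PER_STUDENT * enrollment


# Assumption: enrollment of 1000, inferred from the predicted value of 35.
enrollment = 1000
predicted = predict_trick_or_treaters(enrollment)

# Error is the observed value minus the model's prediction.
actual = 42
error = actual - predicted

print(f"predicted={predicted:.0f}, actual={actual}, error={error:.0f}")
```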
Types of GLMs
1. Linear Regression
- Use: Predicts an outcome from a continuous (rather than categorical) predictor variable
- Example: Predicting YouTube video likes based on comments
- Data Plotting: Visually check for linearity and outliers
- Decision on outliers affects the regression line
Creating the Regression Model
- Assumption: The relationship is linear
- Fitting the Model: Usually done by computers
- Regression Line: Minimizes the sum of squared vertical distances between the data points and the line
- Equation includes y-intercept (e.g., 9104 likes for 0 comments) and slope (e.g., 6.5 likes per comment)
- Residuals/Error: Difference between observed and predicted values
- Ideal: Evenly spaced residuals without patterns
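As a rough illustration of what the computer does when it fits the line, here is a minimal least-squares sketch in plain Python. The comment and like counts are made up for illustration and are not the data behind the 9104 and 6.5 figures; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustrative data: comments and likes for five videos.
comments = [100, 250, 400, 800, 1200]
likes = [9800, 10700, 11600, 14500, 16800]

intercept, slope = fit_line(comments, likes)

# Residual = observed value minus predicted value, one per data point.
residuals = [y - (intercept + slope * x) for x, y in zip(comments, likes)]

print(f"likes = {intercept:.1f} + {slope:.2f} * comments")
print("residuals:", [round(r, 1) for r in residuals])
```

Plotting the residuals against the predictor is one way to check for the patterns mentioned above.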
Statistical Tests on Regression Coefficients
F-test
- Purpose: Quantifies how well the data fit the distribution expected under the null hypothesis
- Null Hypothesis: No relationship between predictor and outcome
- Expected slope of 0 for the regression line
- Scatter plot would look like a blob
- **Notation and Calculations:**
- Y-hat: Predicted value
- Y-bar: Mean value
- Total Variation: Sum of Squares Total (SST), the sum of squared differences between each observed value and Y-bar (the numerator of the variance)
- SSR (Sums of Squares for Regression): Variation explained by the model
- SSE (Sums of Squares for Error): Variation not explained by the model
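Written out in symbols, the three quantities look like this. The index i and the sample size n are notation added here; the last line holds for a least-squares line that includes an intercept.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2        && \text{total variation} \\
SSR &= \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2  && \text{variation explained by the model} \\
SSE &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2      && \text{variation not explained by the model} \\
SST &= SSR + SSE
\end{align*}
```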
Degrees of Freedom
- SSE Degrees of Freedom: n (sample size) - 2
- SSR Degrees of Freedom: 1 (one slope coefficient)
- Division by Degrees of Freedom: Scales each sum of squares by the amount of independent information behind it, so the two can be compared fairly
Calculating the F-statistic
- **Formula Components:**
- Numerator: SSR divided by its degrees of freedom
- Denominator: SSE divided by its degrees of freedom (n - 2)
- p-value: Found using an F-distribution
- Example: A p-value below the chosen significance level (e.g., 0.05) leads to rejecting the null hypothesis
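The pieces above can be put together in a short sketch that computes the F-statistic for a one-predictor regression. The data are made up for illustration, and `f_statistic` is a hypothetical helper, not something from the video.

```python
def f_statistic(xs, ys):
    """Return (F, SSR, SSE) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Least-squares slope and intercept.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Y-hat: the predicted value for each observation.
    y_hat = [intercept + slope * x for x in xs]

    # SSR: variation explained by the model. SSE: variation left over.
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

    df_regression = 1   # one slope coefficient
    df_error = n - 2    # sample size minus two

    f = (ssr / df_regression) / (sse / df_error)
    return f, ssr, sse


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

f, ssr, sse = f_statistic(xs, ys)
print(f"SSR={ssr:.2f}, SSE={sse:.2f}, F={f:.2f}")
```

The p-value is left out to keep the sketch dependency-free. If SciPy is available, `scipy.stats.f.sf(f, 1, len(xs) - 2)` should give the upper-tail probability from the F-distribution.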
Conclusion
- **Relation to Real Life:**
- GLMs explain data through the model and error
- Examples: Budget deviation, predicting roommate's anger
- Future Topics: More on the importance of F-tests
- Applications: Widely used in various fields (science, economics, political science) to model relationships and make predictions
Remember: Regression shows correlation, not causation.