Crash Course Statistics: General Linear Model (GLM)
Introduction
- Presenter: Adriene Hill
- Topic: Introduction to the General Linear Model (GLM)
- Key Point: Flexibility of GLM in creating different models to describe the world
General Linear Model (GLM)
- Concept: Data can be explained by two things: the model and some error
- Model Form: Typically Y = mx + b (or Y = b + mx), where b is the intercept (baseline) and m is the slope
Example: Predicting Trick-or-Treaters
- Baseline: 25 trick-or-treaters
- Model: Number of trick-or-treaters increases by 0.01 for each middle school student
- Prediction: 35 trick-or-treaters (25 + 0.01 × 1000, based on 1000 middle school students)
- Reality: 42 trick-or-treaters (Error = 42 - 35 = 7)
- Key Point: Error indicates deviation from the model, not necessarily that something is wrong
- Sources of Error: Unaccounted variables and random variation
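The arithmetic in this example can be sketched in a few lines of Python. The function and variable names here are illustrative and do not come from the episode; the numbers are the ones listed above.

```python
def predict_trick_or_treaters(students, baseline=25, per_student=0.01):
    """Model of the form Y = b + m * x (b = baseline, m = per_student)."""
    return baseline + per_student * students

predicted = predict_trick_or_treaters(1000)  # what the model expects
observed = 42                                # what actually happened
error = observed - predicted                 # the part the model did not explain

print(round(predicted), round(error))
```

The observed value is the model's prediction plus the error, which is the "data = model + error" idea behind the GLM.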
Importance of Models
- Purpose: Allows making inferences (e.g., number of credit card frauds in a year)
- GLM Components: Information explained by the model and information that can't be explained
Linear Regression
- Concept: A type of GLM that predicts data using a continuous (rather than categorical) predictor variable
- Example: Predicting YouTube video likes based on the number of comments
Steps in Linear Regression
- Plot Data: Check if data fits a straight line and look for outliers
- Handle Outliers: Decide whether to keep or exclude them based on clear criteria, since outliers can strongly influence the regression line
- Check Linearity: Ensure the relationship is linear
- Fit Regression Model: Usually done by a computer
- Interpret Regression Line: The fitted line is the one that minimizes the sum of squared vertical distances from each point to the line
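As a minimal sketch of the "fit" step, the least-squares slope and intercept can be computed directly from the data. The helper name `fit_line` and the five data points are made up for illustration; they are not figures from the episode.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = co-variation of x and y divided by the variation in x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: comments (x) and likes (y) for five videos.
comments = [1, 2, 3, 4, 5]
likes = [2, 4, 5, 4, 5]

intercept, slope = fit_line(comments, likes)
print(f"likes = {intercept:.1f} + {slope:.1f} * comments")
```

In practice a statistics package would do this step; the hand-rolled version is only meant to show what "minimizing squared distances" works out to for a single predictor.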
Components of Regression Line
- Y-intercept: Expected likes for a video with zero comments (may not make practical sense)
- Slope (coefficient): Indicates how much likes increase per additional comment
- Error (Residuals): Differences between observed and predicted values
Residuals
- Residual Plot: Should ideally be an evenly spaced cloud
- Concern: A pattern in the residuals indicates that the size of the error depends on the value of the predictor variable
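Residuals can be computed by subtracting each predicted value from the observed one. The sketch below uses a small made-up data set (not from the episode) together with the least-squares line for those five points.

```python
# Hypothetical data: comments (x) and likes (y) for five videos.
comments = [1, 2, 3, 4, 5]
likes = [2, 4, 5, 4, 5]

# Least-squares line for these five points: likes = 2.2 + 0.6 * comments.
intercept, slope = 2.2, 0.6

predicted = [intercept + slope * x for x in comments]
residuals = [y - y_hat for y, y_hat in zip(likes, predicted)]

# A crude text version of a residual plot: one row per video.
for x, r in zip(comments, residuals):
    print(f"comments={x}  residual={r:+.1f}")
```

For a least-squares line with an intercept, the residuals sum to zero. What the residual plot adds is a check on whether they scatter evenly across the range of the predictor.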
Statistical Tests: F-test
- Purpose: Quantify how well the data fit the distribution expected under the null hypothesis
- Null Hypothesis: No relationship between comments and likes
- Observed Model vs. Null Model: Compare the model fitted to the actual data with the model in which the null hypothesis is true
F-test Calculation
- Notation: Y-hat (predicted value), Y-bar (mean value)
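- Sum of Squares Total (SST): Total variation in the data, the sum of (Y - Y-bar)^2 over all points
- Sum of Squares for Regression (SSR): Variation explained by the model, the sum of (Y-hat - Y-bar)^2
- Sum of Squares for Error (SSE): Variation not explained by the model, the sum of (Y - Y-hat)^2; SST = SSR + SSE
- Degrees of Freedom: Reflects the amount of independent information; with one predictor, 1 for the regression and n - 2 for the error
- F-statistic: Ratio of explained to unexplained variation, F = (SSR / df regression) / (SSE / df error)
- P-value: Probability of obtaining an F-statistic as large as or larger than the observed one if the null hypothesis is true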
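The calculation above can be sketched for a single-predictor regression as follows. The function name `f_test_parts` and the data are hypothetical. Converting F to a p-value needs the F distribution, which is left out here to keep the sketch to the standard library; SciPy's `scipy.stats.f.sf(f, 1, n - 2)` is one option if it is available.

```python
def f_test_parts(xs, ys):
    """Return (sst, ssr, sse, f) for a least-squares line with one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * x for x in xs]

    sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained by the model
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # left unexplained

    df_regression = 1   # one predictor
    df_error = n - 2    # n points minus two estimated parameters
    f = (ssr / df_regression) / (sse / df_error)
    return sst, ssr, sse, f

# Hypothetical data: comments (x) and likes (y) for five videos.
comments = [1, 2, 3, 4, 5]
likes = [2, 4, 5, 4, 5]

sst, ssr, sse, f = f_test_parts(comments, likes)
print(f"SST={sst:.1f} SSR={ssr:.1f} SSE={sse:.1f} F={f:.2f}")
```

A larger F means the model explains more variation per degree of freedom than it leaves unexplained.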
Conclusion from F-test
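- Result: Reject the null hypothesis if the p-value is small (below the chosen significance threshold); this indicates a statistically significant relationship
- Comparison to t-test: For a regression with one predictor, the two tests are equivalent; squaring the t-statistic for the slope gives the F-statistic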
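The link between the two statistics can be checked numerically for a one-predictor regression. This is a sketch on made-up data, not figures from the episode; all variable names are illustrative.

```python
import math

# Hypothetical data: comments (x) and likes (y) for five videos.
comments = [1, 2, 3, 4, 5]
likes = [2, 4, 5, 4, 5]

n = len(comments)
x_bar = sum(comments) / n
y_bar = sum(likes) / n
sxx = sum((x - x_bar) ** 2 for x in comments)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(comments, likes))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

y_hat = [intercept + slope * x for x in comments]
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((y - yh) ** 2 for y, yh in zip(likes, y_hat))
mse = sse / (n - 2)                  # error variance estimate

t = slope / math.sqrt(mse / sxx)     # t-statistic for the slope
f = (ssr / 1) / mse                  # F-statistic with 1 and n - 2 df

print(f"t^2 = {t ** 2:.3f}, F = {f:.3f}")
```

The two printed values agree up to floating-point rounding, which is why the two tests lead to the same conclusion in this setting.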
Applications of Regression
- Fields: Science, economics, political science
- Examples: Relationship between taxes and cigarette purchases, heart rate and blood pressure
- Caution: Regression shows correlation, not causation; a significant slope alone does not mean one variable causes the other
Summary
- GLM Framework: Explains data with model and error
- Practical Examples: Predicting gas budget, roommate's anger
Next Steps: Further understanding of F-tests in future episodes.