Overview
This lecture covers the fundamentals of regression analysis, explaining simple and multiple linear regression, logistic regression, the use of dummy variables, assumptions for linear regression, and how to interpret and perform these analyses.
Introduction to Regression Analysis
- Regression analysis predicts a dependent variable based on one or more independent variables.
- The dependent variable is the variable you want to predict; independent variables are the predictors.
- Regression is used to measure influences or make predictions.
Types of Regression
- Simple linear regression uses one independent variable to predict a metric dependent variable.
- Multiple linear regression uses two or more independent variables to predict a metric dependent variable.
- Logistic regression is used for categorical dependent variables (e.g., yes/no outcomes).
Simple Linear Regression
- Predicts a dependent variable based on one independent variable.
- The relationship is visualized in a scatter plot; the regression line is calculated using the least squares method.
- Model: y = a + bx + ε, where a is the intercept and b is the slope.
- The error term (epsilon) is the difference between the actual and the predicted value; the least squares method chooses a and b to minimize the sum of squared errors.
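The least squares formulas above can be sketched directly in a few lines of Python. The data values here are invented purely for illustration:

```python
# Least-squares fit of y = a + b*x on a small hypothetical dataset.
x = [1, 2, 3, 4, 5]
y = [2.1, 4.0, 6.2, 8.1, 9.9]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope: b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
# Intercept: a = mean_y - b * mean_x
a = mean_y - b * mean_x

# Residuals (epsilon): actual value minus predicted value
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
```

For this data the fitted line is roughly y = 0.15 + 1.97x; the residuals show how far each observation lies from that line.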
Multiple Linear Regression
- Uses several independent variables to predict the dependent variable.
- Model: y = a + b1x1 + b2x2 + ... + bkxk + ε.
- Each coefficient bi shows the expected change in the dependent variable for a one-unit change in xi, holding the other predictors constant.
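A minimal sketch of fitting this model with NumPy's least-squares solver, using two invented predictor columns (the intercept is handled by prepending a column of ones to the design matrix):

```python
import numpy as np

# Hypothetical data: predict y from two predictors x1 and x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])  # constructed as y = 1 + 2*x1 + 1*x2

# Design matrix: a column of ones for the intercept, then the predictors.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b1, b2 = coef  # intercept, slope for x1, slope for x2
```

Because the example data were generated without noise, the solver recovers the coefficients a = 1, b1 = 2, b2 = 1 exactly.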
Interpreting Regression Results
- Multiple correlation coefficient (R) measures the strength of the relationship.
- Coefficient of determination (R²) shows the proportion of variance explained.
- Adjusted R² penalizes the number of predictors, so R² cannot be inflated simply by adding more variables.
- The standard error of the estimate reflects the average size of the prediction error.
- The F-test checks whether the model as a whole explains a significant share of the variance.
- A coefficient's p-value below 0.05 indicates that the predictor contributes significantly to the model.
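R² and adjusted R² follow directly from the residual and total sums of squares. A self-contained sketch with invented actual and predicted values (n = 6 observations, k = 2 predictors):

```python
import numpy as np

# Hypothetical actual vs. fitted values from some regression model.
actual    = np.array([3.0, 5.0, 7.0, 6.0, 9.0, 12.0])
predicted = np.array([3.5, 4.5, 7.5, 6.5, 8.5, 11.5])
n, k = len(actual), 2

ss_res = np.sum((actual - predicted) ** 2)      # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares

r_squared = 1 - ss_res / ss_tot
# Adjusted R² penalizes the number of predictors k:
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
```

Here R² = 0.97 while adjusted R² drops to 0.95, illustrating the penalty for using two predictors on only six observations.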
Assumptions for Linear Regression
- Linear relationship between dependent and independent variables.
- Errors (residuals) are normally distributed.
- No multicollinearity (independent variables not highly correlated).
- Homoscedasticity (constant variance of residuals).
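The multicollinearity assumption can be screened with a simple pairwise correlation check. This sketch uses invented predictor columns and a common (but not universal) rule of thumb that flags |r| > 0.8:

```python
import numpy as np

# Hypothetical predictor columns; x3 is nearly a copy of x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([5.0, 1.0, 4.0, 2.0, 3.0])
x3 = np.array([1.1, 2.0, 2.9, 4.2, 5.1])

# Pairwise correlation matrix of the predictors (each array is one variable).
corr = np.corrcoef([x1, x2, x3])

# Flag pairs whose absolute correlation exceeds the 0.8 rule of thumb.
high_pairs = [(i, j) for i in range(3) for j in range(i + 1, 3)
              if abs(corr[i, j]) > 0.8]
```

Only the (x1, x3) pair is flagged, suggesting one of the two near-duplicate predictors should be dropped or combined before fitting the model. A fuller check would use variance inflation factors, which also catch collinearity spread across several variables.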
Dummy Variables
- Dummy variables represent categorical predictors with two or more values.
- For a variable with n categories, use n-1 dummy variables.
- Each dummy is coded as 0 or 1 to indicate category membership.
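The n-1 coding rule can be sketched in plain Python. For a hypothetical color variable with three categories, "red" is chosen as the reference category, so only two dummies are created:

```python
# n-1 dummy coding for a three-category predictor ("red" is the reference).
colors = ["red", "green", "blue", "green", "red"]
non_reference = ["green", "blue"]  # every category except the reference

dummies = [{f"is_{c}": int(color == c) for c in non_reference}
           for color in colors]
# "red" is encoded as all zeros: it is the baseline that the
# dummy coefficients are compared against.
```

Libraries such as pandas automate this (e.g. with a drop-first option), but the encoding they produce is the same 0/1 scheme shown here.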
Logistic Regression
- Used for categorical outcome variables (e.g., yes/no).
- Predicts the probability of the outcome using the logistic function (values between 0 and 1).
- Model uses maximum likelihood estimation to determine coefficients.
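The logistic function itself is easy to sketch. The intercept and slope below are invented for illustration; in practice they would come from maximum likelihood estimation:

```python
import math

def logistic(z):
    """Map a linear score z = a + b*x into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted coefficients: intercept a and slope b.
a, b = -4.0, 1.0
p = logistic(a + b * 4.0)  # probability of the "yes" outcome at x = 4
```

At x = 4 the linear score is exactly 0, so the predicted probability is 0.5; larger x values push the probability toward 1, smaller values toward 0, but it never leaves the (0, 1) interval.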
Key Terms & Definitions
- Dependent Variable — variable being predicted in regression.
- Independent Variable — variable used to predict the dependent variable.
- Simple Linear Regression — regression with one predictor for a metric outcome.
- Multiple Linear Regression — regression with two or more predictors for a metric outcome.
- Logistic Regression — regression for a categorical outcome.
- Dummy Variable — binary variable for categorical predictors.
- Homoscedasticity — equal variance of residuals across all values of the predictors.
- Multicollinearity — high correlation between independent variables.
- Coefficient of Determination (R²) — proportion of variance explained by the model.
Action Items / Next Steps
- Practice calculating regression analyses using an online tool like DataTab.
- Check assumptions before interpreting regression results.
- Prepare dummy variables for categorical predictors with more than two categories.