Simple linear regression models the relationship between two numeric variables.
Example: Predicting spending at Kroger based on yearly income.
Provides prediction equations and 95% confidence intervals.
Transition to Multiple Linear Regression
Multiple Linear Regression: Predicts a dependent variable (Y) using multiple independent variables (X1, X2, ..., XK).
Example predictors for household expenditure at Kroger:
Income
Distance to nearest Kroger
Number of kids
Number of pets
Number of cars
Structure of Multiple Linear Regression Model
Model Formulation:
Y = β0 + β1X1 + β2X2 + ... + βKXK + ε
Where β0 is the intercept and ε is the disturbance term.
Using several predictors together typically yields better predictions than any single predictor alone; a sketch of this model in R follows.
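A hypothetical sketch of how this model looks in R's formula syntax (the data frame kroger_households and its column names are invented for illustration):

    # Hypothetical data frame with one row per household
    m.kroger <- lm(spend ~ income + distance + kids + pets + cars,
                   data = kroger_households)
    summary(m.kroger)  # estimates of β0 through βK with standard errors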
Predictive Analytics with Multiple Regression
Predictions can focus on:
Average value of Y across populations with specific X values.
Individual Y values based on specific X combinations.
Example: Predicting Tip Percentage
Data Setup:
Dependent variable: Tip Percentage
Independent variables: Bill Amount, Party Size
Model Fitting:
Use R's lm() function.
Example command: m.tips <- lm(tip_percentage ~ bill + party_size, data = ex2_tips).
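Once fit, the estimates listed under Interpretation below can be read off the model object (a minimal sketch, assuming m.tips as above):

    coef(m.tips)     # intercept and slopes for bill and party_size
    summary(m.tips)  # full coefficient table, standard errors, R-squared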
Interpretation:
Coefficients:
Intercept: 19.9%
Bill: -0.27
Party Size: +6
Making Predictions:
Example predictions for specific bill and party sizes.
Confidence vs. prediction intervals: a confidence interval bounds the average tip percentage across all parties with a given bill and party size; a prediction interval bounds the tip percentage of one individual party and is therefore wider. Both are shown in the sketch below.
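A minimal sketch of both interval types in R, assuming the fitted m.tips and invented values for bill and party size:

    new_party <- data.frame(bill = 60, party_size = 4)  # hypothetical inputs
    predict(m.tips, newdata = new_party, interval = "confidence")  # avg tip % for all such parties
    predict(m.tips, newdata = new_party, interval = "prediction")  # tip % for one such party (wider)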
Interpretation of Coefficients in Multiple Regression
Interpreting coefficients is subtler than in simple regression:
Simple Linear Regression Interpretation:
The expected change in Y associated with a one-unit change in X.
Multiple Linear Regression Interpretation:
The expected difference in Y between observations that differ by one unit in X, holding all other predictors constant. For example, in the tips model the bill coefficient of -0.27 means that, comparing parties of the same size, a bill one dollar higher (assuming bill is measured in dollars) is associated with a tip percentage about 0.27 points lower.
Example: Grades Based on Assignments and Attendance
Issue of Lurking Variables:
If grades are predicted from assignment scores alone, attendance acts as a lurking variable: students who attend more both score higher on assignments and earn higher grades, so the assignment coefficient absorbs part of the attendance effect.
Multiple Regression Solution:
Including both assignment scores and attendance separates the two effects, as the simulation sketch below illustrates.
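A minimal simulation sketch of this; every number here is made up purely to illustrate the bias:

    set.seed(1)
    n <- 200
    attendance  <- runif(n)                                 # fraction of classes attended
    assignments <- 50 + 40 * attendance + rnorm(n, sd = 5)  # attendance drives assignment scores
    grade <- 40 + 30 * attendance + 0.3 * assignments + rnorm(n, sd = 5)
    coef(lm(grade ~ assignments))               # slope inflated: absorbs the attendance effect
    coef(lm(grade ~ assignments + attendance))  # assignment slope now near the true 0.3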
Challenges in Interpretation
Case of Salary Prediction:
Salary models built on experience alone can mislead when education is omitted, since the two are related; comparing models with and without education shows why all relevant factors belong in the model.
Misinterpretation risks arise from not accounting for all relevant predictors (e.g., gender, tenure).
Analyzing Correlation and Variance Inflation
High correlation among predictors leads to multicollinearity.
Variance Inflation Factor (VIF):
Indicator of multicollinearity, computed as VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor X_j on all the other predictors.
VIF > 5 (or, more leniently, 10) signals potential problems; see the R sketch after this list.
Consequences of Multicollinearity:
Coefficient estimates become unstable, with inflated standard errors.
Model predictions may still be accurate despite poor coefficient interpretation.
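A minimal sketch of computing VIFs in R with the car package (assumes car is installed and the m.tips model from above):

    library(car)  # provides vif()
    vif(m.tips)   # one VIF per predictor; values above ~5-10 flag multicollinearity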
Conclusion
Multiple linear regression is powerful for predictive analytics but requires careful interpretation of coefficients and consideration of multicollinearity.
Students should practice interpreting coefficients while accounting for overlapping relationships among predictors.