Understanding Multiple Linear Regression Concepts

Aug 7, 2024

Lecture Notes on Multiple Linear Regression

Introduction to Regression

  • Simple linear regression models the relationship between two numeric variables: one predictor (X) and one response (Y).
  • Example: Predicting spending at Kroger based on yearly income.
  • Provides prediction equations and 95% confidence intervals.

Transition to Multiple Linear Regression

  • Multiple Linear Regression: Predicts a dependent variable (Y) using multiple independent variables (X1, X2, ..., XK).
  • Example predictors for household expenditure at Kroger:
    • Income
    • Distance to nearest Kroger
    • Number of kids
    • Number of pets
    • Number of cars

Structure of Multiple Linear Regression Model

  • Model Formulation:
    • Y = β0 + β1X1 + β2X2 + ... + βKXK + ε
    • Where β0 is the intercept, β1, ..., βK are the coefficients (slopes) of the predictors, and ε is the disturbance (error) term.
  • Often allows better predictions than a single predictor, because it combines information from several predictors.
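
As a minimal sketch of the model above, the following R code simulates data from a two-predictor version of the equation and fits it with lm(). The variable names, sample size, and true coefficient values are made up for illustration and are not from the lecture.

```r
# Simulate data from Y = b0 + b1*X1 + b2*X2 + e (all values hypothetical)
set.seed(1)
n   <- 500
x1  <- rnorm(n, mean = 50, sd = 10)
x2  <- rnorm(n, mean = 5,  sd = 2)
eps <- rnorm(n, mean = 0,  sd = 3)   # disturbance term
y   <- 10 + 2 * x1 - 1.5 * x2 + eps

sim   <- data.frame(y, x1, x2)
m.sim <- lm(y ~ x1 + x2, data = sim)

# Estimated intercept and slopes; they should land near 10, 2, and -1.5
coef(m.sim)
```

Because the data are simulated, the fitted coefficients can be checked against the values used to generate Y, which is not possible with real data.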

Predictive Analytics with Multiple Regression

  • Predictions can focus on:
    1. The average value of Y among all individuals in the population with a specific combination of X values.
    2. An individual Y value for a specific combination of X values.
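
These two targets correspond to two kinds of interval from R's predict(): interval = "confidence" for the average Y at given X values, and interval = "prediction" for an individual Y. A minimal sketch on simulated data follows; the data frame, variable names, and numbers are hypothetical and only loosely modeled on the Kroger example.

```r
set.seed(2)
n        <- 200
income   <- runif(n, 30, 120)   # hypothetical yearly income, in $1000s
distance <- runif(n, 0.5, 15)   # hypothetical distance to the store, in miles
spend    <- 1000 + 40 * income - 50 * distance + rnorm(n, sd = 400)
kroger   <- data.frame(spend, income, distance)

m.kroger <- lm(spend ~ income + distance, data = kroger)
new.x    <- data.frame(income = 75, distance = 3)

# 1. Interval for the average spend among all households with these X values
conf.int <- predict(m.kroger, newdata = new.x, interval = "confidence", level = 0.95)
# 2. Interval for one individual household with these X values
pred.int <- predict(m.kroger, newdata = new.x, interval = "prediction", level = 0.95)

conf.int
pred.int   # same point estimate ("fit"), but a wider interval
```

Both calls return the columns fit, lwr, and upr; only the interval bounds differ.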

Example: Predicting Tip Percentage

  1. Data Setup:
    • Dependent variable: Tip Percentage
    • Independent variables: Bill Amount, Party Size
  2. Model Fitting:
    • Use R's lm() function.
    • Example command: m.tips <- lm(tip_percentage ~ bill + party_size, data = ex2_tips)
  3. Interpretation:
    • Coefficients:
      • Intercept: 19.9%
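      • Bill: -0.27 (percentage points per one-unit increase in bill amount, holding party size constant)
      • Party Size: +6 (percentage points per additional person, holding bill amount constant)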
  4. Making Predictions:
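    • Example predictions for specific bill amounts and party sizes.
    • Confidence vs. prediction intervals: a confidence interval estimates the average tip percentage over all parties with the given bill and party size; a prediction interval covers a single party's tip percentage and is wider.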
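
With the coefficients reported above, a point prediction is just arithmetic on the fitted equation. The sketch below hard-codes those three values; the function name predict_tip and the example bill and party size are made up, and the units of bill are assumed to match whatever the lecture's data set uses.

```r
# Coefficients as reported in the notes above
b0      <- 19.9    # intercept (tip percentage)
b.bill  <- -0.27   # change in tip percentage per one-unit increase in bill
b.party <- 6       # change in tip percentage per additional person

# Point prediction from the fitted equation
predict_tip <- function(bill, party_size) {
  b0 + b.bill * bill + b.party * party_size
}

# Hypothetical input: a bill of 40 for a party of 2
predict_tip(bill = 40, party_size = 2)
# 19.9 - 0.27 * 40 + 6 * 2 = 21.1
```

With the fitted model object itself, predict(m.tips, newdata = data.frame(bill = 40, party_size = 2), interval = "prediction") would give the same kind of point estimate together with an interval; that call is not run here because the ex2_tips data are not included in these notes.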

Interpretation of Coefficients in Multiple Regression

  • Understanding coefficients can be complex:
    • Simple Linear Regression Interpretation:
      • Change in Y for a one-unit change in X.
    • Multiple Linear Regression Interpretation:
      • Difference in Y when X changes by one unit, holding all other predictors constant.
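
One way to see what "holding all other predictors constant" means is to compare two predictions that differ by one unit in a single predictor. A minimal sketch on simulated data (all names and values are hypothetical):

```r
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)             # predictors are allowed to be correlated
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
d  <- data.frame(y, x1, x2)

m <- lm(y ~ x1 + x2, data = d)

# Two cases with the same x2, differing by one unit in x1
cases <- data.frame(x1 = c(0, 1), x2 = c(2, 2))
p     <- predict(m, newdata = cases)

diff(p)         # difference in predicted Y ...
coef(m)["x1"]   # ... equals the x1 coefficient
```

The equality holds for any fixed value of x2, because the model is linear with no interaction term.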

Example: Grades Based on Assignments and Attendance

  • Issue of Lurking Variables:
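    • When grades are predicted from assignment scores alone, attendance is a lurking variable: it is related to both assignment scores and grades, so it distorts the apparent effect of assignments.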
  • Multiple Regression Solution:
    • Including both assignment scores and attendance captures a fuller picture.
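
The effect of a lurking variable can be sketched with simulated data. Everything below (variable names, effect sizes, noise levels) is invented for illustration and is not the lecture's data set.

```r
set.seed(4)
n           <- 400
attendance  <- runif(n, 50, 100)                          # percent of classes attended
assignments <- 20 + 0.7 * attendance + rnorm(n, sd = 8)   # tied to attendance
grade       <- 10 + 0.3 * assignments + 0.5 * attendance + rnorm(n, sd = 5)
grades      <- data.frame(grade, assignments, attendance)

m.one  <- lm(grade ~ assignments, data = grades)
m.both <- lm(grade ~ assignments + attendance, data = grades)

coef(m.one)["assignments"]    # overstated: absorbs part of attendance's effect
coef(m.both)["assignments"]   # close to the simulated value of 0.3
```

In the one-predictor model, assignments gets credit for some of what attendance contributes; adding attendance separates the two.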

Challenges in Interpretation

  • Case of Salary Prediction:
    • Modeling salary from experience and education together shows the importance of including all relevant factors.
  • Omitting relevant predictors (e.g., gender, tenure) risks misinterpreting the coefficients of the predictors that remain.

Analyzing Correlation and Variance Inflation

  • High correlation among the predictors is called multicollinearity; it makes individual coefficients hard to estimate and interpret.
  • Variance Inflation Factor (VIF):
    • Indicator of multicollinearity.
    • A VIF above 5 (or, by a looser rule of thumb, above 10) signals potential problems.
  • Consequences of Multicollinearity:
    • Coefficient estimates become less reliable: their standard errors grow, so estimates can change noticeably from sample to sample.
    • Model predictions may still be accurate despite poor coefficient interpretation.
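
For a given predictor, the VIF equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal sketch computes this with base R on simulated data; the helper name vif_manual and the variables are hypothetical, and packages such as car offer a vif() function that works directly on a fitted model.

```r
set.seed(5)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # nearly a copy of x1: strong collinearity
x3 <- rnorm(n)                  # unrelated to the others
X  <- data.frame(x1, x2, x3)

# VIF for each column: regress it on the remaining columns
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

vif_manual(X)   # x1 and x2 come out well above 5; x3 stays near 1
```

A VIF of exactly 1 means the predictor is uncorrelated with the others; the value grows without bound as R² approaches 1.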

Conclusion

  • Multiple linear regression is powerful for predictive analytics but requires careful interpretation of coefficients and consideration of multicollinearity.
  • Students should practice interpreting coefficients while accounting for overlapping relationships among predictors.