Stata 16 Lasso for Model Selection

Aug 8, 2024

Stata 16: Lasso for Prediction and Model Selection

Overview

  • New feature in Stata 16: Lasso for prediction and model selection.
  • Includes three estimators:
    • Lasso
    • Square-root lasso
    • Elastic net
  • Models that can be fitted:
    • Linear
    • Logit
    • Probit
    • Poisson

Features

  • Selects lambda by cross-validation, adaptive lasso, a plugin formula, or a user-specified value.
  • After fitting a lasso model, you can:
    • Create cross-validation function plots.
    • Generate coefficient path plots.
    • Select different lambda values.
    • Create tables listing variable entry/exit.
    • Tabulate measures of fit by lambda.
    • Compare fit across multiple lasso models.

Using the Lasso Dialog Box

  • Accessed under Statistics > Lasso.
  • Allows specification of:
    • Dependent variable
    • Variables to always include in the model
    • List of variables for lasso selection
    • Selection method
    • Cross-validation options

Example with Fake Survey Data

  • Dataset: fakesurvey (loaded with webuse fakesurvey).
  • Contains responses to 161 questions + demographic data.
  • Objective: Select variables predicting response to Question 104.
  • Traditional variable selection is challenging with 160 questions.

Variable Grouping

  • Use the vl commands to define variable groups:
    • vl set classifies variables as categorical or continuous.
    • Categorical: variables with 4 or fewer unique values (option categorical(4)).
    • Continuous: variables with more than 4 unique values.
    • Specify the uncertain(0) option so that no variables are left in the uncertain groups.
  • Output:
    • 115 categorical variables
    • 47 continuous variables
  • Global macros created:
    • vlcategorical (referenced as $vlcategorical)
    • vlcontinuous (referenced as $vlcontinuous)
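
The grouping step above can be sketched as follows. This is a minimal sketch, assuming the dataset is available through webuse and that the cutoffs are the ones described in this section. The vl dir and vl list calls are added here only to inspect the result.

```stata
* Load the example survey data (assumes internet access for webuse)
webuse fakesurvey, clear

* Classify variables: integer-valued variables with at most 4 levels
* become categorical; uncertain(0) leaves the uncertain groups empty
vl set, categorical(4) uncertain(0)

* Inspect the system-defined lists and their sizes
vl dir

* The lists are also available as global macros
vl list vlcontinuous
display "$vlcontinuous"
```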

Fitting Lasso Models

  1. Split Sample
    • Split data into training (group one) and testing (group two) datasets.
  2. Fit Linear Lasso Model
    • Command: lasso linear with dependent variable q104.
    • Specify covariates using the global macros.
    • Add if sample == 1 to fit the model on the training data only.
    • Set a random-number seed (the rseed() option) for reproducibility.
  3. Model Fit
    • With the default cross-validation selection, lasso fits 23 models in this example, one per lambda value.
    • Model 19: Largest out-of-sample R-squared and smallest cross-validation mean prediction error.
    • Best prediction: lambda = 0.17.
  4. Visualizations
    • Use cvplot to graph the cross-validation function against lambda.
    • Confirm the minimum at lambda = 0.17.
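
The four steps above might look like this in a do-file. It is a sketch under stated assumptions: the seed 1234 and the variable name sample are illustrative, and the two macro-list lines are a defensive addition (not from the walkthrough) that removes q104 from the covariate lists in case vl set placed it there. Model counts and the selected lambda may differ from the values quoted above if the seed differs.

```stata
webuse fakesurvey, clear
vl set, categorical(4) uncertain(0)

* Defensive step (assumption): make sure the outcome is not a covariate
local depvar q104
local cats : list global(vlcategorical) - depvar
local cont : list global(vlcontinuous) - depvar

* Split after vl set so that the new variable is not in the vl lists;
* sample takes the values 1 (training) and 2 (testing)
splitsample, generate(sample) nsplit(2) rseed(1234)

* Linear lasso on the training half; cross-validation is the default
* selection method, and rseed() makes the CV folds reproducible
lasso linear q104 i.(`cats') `cont' if sample == 1, rseed(1234)

* Plot the cross-validation function against lambda
cvplot
```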

Model Selection and Evaluation

  • Store model results: estimates store cv.
  • Use lassoknots to create a model information table:
    • Model number
    • Lambda value
    • Nonzero coefficients count
    • Out-of-sample R-squared
    • BIC value
  • Selection: Use lassoselect to choose the model with the lowest BIC (Model 14).
  • View the cross-validation plot for the selected model using cvplot.
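
A possible version of this selection step is sketched below. The setup lines are repeated so the block runs on its own. The stored name minBIC is an assumption, and ID 14 is the value reported in this walkthrough, so check the lassoknots table from your own run before selecting.

```stata
* Setup repeated so the block is self-contained (seed is illustrative)
webuse fakesurvey, clear
vl set, categorical(4) uncertain(0)
local depvar q104
local cats : list global(vlcategorical) - depvar
local cont : list global(vlcontinuous) - depvar
splitsample, generate(sample) nsplit(2) rseed(1234)
lasso linear q104 i.(`cats') `cont' if sample == 1, rseed(1234)

* Keep the cross-validation-selected model
estimates store cv

* Knot table: nonzero coefficients, out-of-sample R-squared, and BIC
lassoknots, display(nonzero osr2 bic)

* Switch to the model with the smallest BIC (ID taken from the table)
lassoselect id = 14

* cvplot now marks the newly selected lambda
cvplot

* Keep the minimum-BIC model under its own name (hypothetical name)
estimates store minBIC
```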

Adaptive Lasso Model

  • Fit an adaptive lasso model with the selection(adaptive) option:
    • Model 78 is selected as the best fit, based on the smallest CV mean prediction error.
    • Store results as adaptive.
  • Use lassocoef to compare the variables selected by each stored model.
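
As a rough sketch, the adaptive fit and the coefficient comparison could be run as follows. The setup is repeated so the block runs on its own. The seed and the sort() option are illustrative choices, and only the cv and adaptive models are compared here to keep the example short.

```stata
* Setup repeated so the block is self-contained (seed is illustrative)
webuse fakesurvey, clear
vl set, categorical(4) uncertain(0)
local depvar q104
local cats : list global(vlcategorical) - depvar
local cont : list global(vlcontinuous) - depvar
splitsample, generate(sample) nsplit(2) rseed(1234)

* Ordinary lasso with cross-validation, stored for comparison
lasso linear q104 i.(`cats') `cont' if sample == 1, rseed(1234)
estimates store cv

* Adaptive lasso on the same training sample
lasso linear q104 i.(`cats') `cont' if sample == 1, ///
    selection(adaptive) rseed(1234)
estimates store adaptive

* Table marking which variables each stored model selected,
* ordered by the size of the standardized coefficients
lassocoef cv adaptive, sort(coef, standardized)
```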

Goodness of Fit Assessment

  • Use lassogof to evaluate fit on training vs. testing samples:
    • Sample 1: Training data
    • Sample 2: Testing data
  • Results indicate that the minimum-BIC model has the smallest mean squared error and the largest R-squared in the testing data.
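
The comparison above can be sketched end to end as follows. The stored names, the seed, and ID 14 are assumptions carried over from the earlier steps of this walkthrough, so the numbers from a given run may not match those quoted here.

```stata
* Setup repeated so the block is self-contained (seed is illustrative)
webuse fakesurvey, clear
vl set, categorical(4) uncertain(0)
local depvar q104
local cats : list global(vlcategorical) - depvar
local cont : list global(vlcontinuous) - depvar
splitsample, generate(sample) nsplit(2) rseed(1234)

* Model 1: lambda chosen by cross-validation
lasso linear q104 i.(`cats') `cont' if sample == 1, rseed(1234)
estimates store cv

* Model 2: minimum-BIC lambda (check the ID against lassoknots output)
lassoselect id = 14
estimates store minBIC

* Model 3: adaptive lasso
lasso linear q104 i.(`cats') `cont' if sample == 1, ///
    selection(adaptive) rseed(1234)
estimates store adaptive

* Mean squared error and R-squared for each model, reported separately
* for the training (sample = 1) and testing (sample = 2) observations
lassogof cv minBIC adaptive, over(sample)
```

The rows for sample 2 are the ones to compare, because those observations played no part in fitting any of the three models.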

Additional Resources

  • For more information on lasso for prediction and model selection, see the Stata Lasso Reference Manual on the Stata website.