Transcript for:
Stata 16 Lasso for Model Selection

Hi, Chuck Huber here for the Stata YouTube channel. Today I'd like to show you another new feature in Stata 16: lasso for prediction and model selection. Lasso for prediction and model selection includes three estimators: lasso, square-root lasso, and elastic net. You can fit linear, logit, probit, and Poisson models using cross-validation, adaptive lasso, plugins, and other user-specified criteria. After you fit a lasso model, you can create cross-validation function plots and coefficient path plots, select a different lambda, create tables of variables as they enter and leave a model, tabulate measures of fit by lambda, and compare fit across multiple lassos.

The dialog box for lasso is located under Statistics > Lasso, and here I'm just going to select "Lasso". This opens the lasso dialog box, where I can specify what kind of model I'd like to fit. Here at the top I can select my dependent variable, select which variables will always be included in the model, and specify a list of variables from which lasso actually selects. Here at the bottom I can specify the selection method, and I can also select the cross-validation options if I like. In the following example I'm just going to use commands in the Do-file Editor, but you can fit the models using the dialog box if you like.

So let's take a look at a quick example using some fake survey data. You can open this dataset by typing webuse fakesurvey. This dataset contains responses to 161 questions along with some demographic data. Our goal is to select variables that predict the response to question 104. Obviously, it would be challenging to assess the importance of the remaining 160 questions using traditional variable-selection techniques, so we'll use lasso to select which variables are important predictors. Working with a large number of variables can be challenging, so I'm going to use the new vl collection
of commands to define groups of variables. I'm going to begin by using vl set to identify categorical and continuous variables. The option categorical(4) tells Stata that variables with four or fewer unique values will be assigned to the categorical group; variables with more than four unique values will be assigned to the continuous group. The uncertain(0) option omits the uncertain group, and we'll discuss this in a later video. The output shows us that 115 variables were assigned to the group of categorical variables and 47 variables were assigned to the group of continuous variables. We can now refer to these groups using the global macros vlcategorical and vlcontinuous. You can use vl substitute to create a group named ifactors that adds an i. prefix to each variable in the list of categorical variables.

Now we're ready to fit our lasso models. We begin by using splitsample to split our data into two groups: group 1 is our training dataset, which we will use to select our model, and group 2 is our testing dataset, which we will use to test the prediction. Next, we'll use lasso linear to fit a linear lasso model for the dependent variable q104, or question 104. Note that the covariates are specified using the global macros that we defined using vl. I've also included the condition if sample==1 so that we fit our model using the training data, and I've specified a random-number seed so that our results are reproducible.

By default, lasso fits models using different values of lambda; here it fit 23 of them. Model 19 had the largest out-of-sample R-squared and the smallest cross-validation mean prediction error. This suggests that the model with lambda equal to 0.17 is the best for prediction. We can type cvplot to create a graph with lambda on the horizontal axis and the cross-validation function on the vertical axis. This graph confirms that the cross-validation function is minimized where
lambda equals 0.17. I can store the results of this model in memory by typing estimates store cv.

Next, I can type lassoknots to create a table of information about each of the models that were fit. The first column displays the model number, the second column displays the value of lambda for that model, the third column displays the number of variables with nonzero coefficients, the fourth column displays the out-of-sample R-squared, and the fifth column displays the BIC. So maybe we'd like to select the model with the lowest BIC, which in this case is model 14. You can type lassoselect id = 14 to select model 14, and then you can view a cross-validation plot by typing cvplot. This plot shows that the cross-validation function is slightly higher for model 14, with a value of lambda equal to 0.27, but model 14 is more parsimonious, with only 28 coefficients rather than the 49 coefficients when lambda equals 0.17. Let's store the results of this model as minBIC.

Next, let's use the option selection(adaptive) to fit an adaptive lasso model. Adaptive lasso did two lassos and selected model 78 as the best-fitting model because it has the smallest cross-validation mean prediction error. Let's store these estimates with the name adaptive.

Now we can use lassocoef to view a table of the variables that were selected by our three models. The first column lists the variables, the second column includes an X if the variable was selected using the default cross-validation method, the third column includes an X if the variable was selected using the minimum-BIC method, and the fourth column displays an X if the variable was selected using adaptive lasso. Note that the rows are sorted so that the variables with the largest standardized coefficients are displayed at the top; the most important variables are listed first.

And we can use lassogof to assess the goodness of fit in our training sample and in our testing sample. Recall that sample 1
is our training data and sample 2 is our testing data, which we created using splitsample. The results show that the model with the minimum BIC has the smallest mean squared error and the largest R-squared in the testing dataset.

If you would like to learn more about lasso for prediction and model selection, you can download the manual at our website. I hope this was helpful. Thanks for stopping by.
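The session walked through in this video can be sketched as a single do-file. This is a sketch, not the video's exact code: the rseed values are placeholders, and the specific model IDs (19, 14, 78), lambdas, and coefficient counts quoted above come from the run shown in the video, so your numbers may differ.

```stata
* Load the fake survey data used in the video
webuse fakesurvey, clear

* Variables with <= 4 unique values go to the categorical group,
* the rest to the continuous group; omit the "uncertain" group
vl set, categorical(4) uncertain(0)

* Create $ifactors: the categorical variables with an i. prefix
vl substitute ifactors = i.vlcategorical

* Split into training (sample==1) and testing (sample==2) halves
splitsample, generate(sample) nsplit(2) rseed(1234)

* Linear lasso with the default cross-validation selection,
* fit on the training data only
lasso linear q104 $ifactors $vlcontinuous if sample == 1, rseed(1234)
cvplot                               // CV function vs. lambda
estimates store cv

* Knot table: lambda, nonzero coefficients, out-of-sample R2, BIC
lassoknots, display(nonzero osr2 bic)

* Select the minimum-BIC model (model 14 in the video's run)
lassoselect id = 14
cvplot
estimates store minBIC

* Adaptive lasso on the same training data
lasso linear q104 $ifactors $vlcontinuous if sample == 1, ///
    selection(adaptive) rseed(1234)
estimates store adaptive

* Which variables each method selected, largest standardized
* coefficients first
lassocoef cv minBIC adaptive, sort(coef, standardized)

* Goodness of fit in the training and testing samples
lassogof cv minBIC adaptive, over(sample) postselection
```

Storing each fit with estimates store is what lets lassocoef and lassogof compare the three selection methods side by side at the end.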