Transcript for:
Stata 16 Lasso for Model Selection

Hi, Chuck Huber here for the Stata YouTube channel. Today I'd like to show you another new feature in Stata 16: lasso for prediction and model selection. Lasso for prediction and model selection includes three estimators: lasso, square-root lasso, and elastic net. You can fit linear, logit, probit, and Poisson models using cross-validation, adaptive lasso, plugins, and other user-specified criteria. After you fit a lasso model, you can create cross-validation function plots and coefficient path plots, select a different lambda, create tables of variables as they enter and leave a model, tabulate measures of fit by lambda, and compare fit across multiple lassos.

The dialog box for lasso is located under Statistics > Lasso, and here I'm just going to select "Lasso". This opens the lasso dialog box, where I can specify what kind of model I'd like to fit. Here at the top I can select my dependent variable, select which variables will always be included in the model, and specify a list of variables from which lasso actually selects. Here at the bottom I can specify the selection method, and I can also select the cross-validation options if I like. In the following example I'm just going to use commands in the Do-file Editor, but you can fit the models using the dialog box if you like.

So let's take a look at a quick example using some fake survey data. You can open this dataset by typing webuse fakesurvey. This dataset contains responses to 161 questions along with some demographic data. Our goal is to select variables that predict the response to question 104. Obviously, it would be challenging to assess the importance of the remaining 160 questions using traditional variable-selection techniques, so we'll use lasso to select which variables are important predictors. Working with a large number of variables can be challenging, so I'm going to use the new vl collection
of commands to define groups of variables. I'm going to begin by using vl set to identify categorical and continuous variables. The option categorical(4) tells Stata that variables with four or fewer unique values will be assigned to the categorical group; variables with more than four unique values will be assigned to the continuous group. The uncertain(0) option omits the uncertain group, and we'll discuss this in a later video. The output shows us that 115 variables were assigned to the group of categorical variables and 47 variables were assigned to the group of continuous variables. We can now refer to these groups using the global macros vlcategorical and vlcontinuous. You can use vl substitute to create a group named ifactors that adds an i. prefix to each variable in the list of categorical variables.

Now we're ready to fit our lasso models. We begin by using splitsample to split our data into two groups: group 1 is our training dataset, which we will use to select our model, and group 2 is our testing dataset, which we will use to test the prediction. Next, we'll use lasso linear to fit a linear lasso model for the dependent variable q104, or question 104. Note that the covariates are specified using the global macros that we defined using vl. I've also included the condition if sample==1 so that we fit our model using the training data, and I've specified a random-number seed so that our results are reproducible.

By default, lasso fits models using different values of lambda; here it fit 23 of them. Model 19 had the largest out-of-sample R-squared and the smallest cross-validation mean prediction error. This suggests that the model with lambda equal to 0.17 is the best for prediction. We can type cvplot to create a graph with lambda on the horizontal axis and the cross-validation function on the vertical axis. This graph confirms that the cross-validation function is minimized where
lambda equals 0.17. I can store the results of this model in memory by typing estimates store cv.

Next, I can type lassoknots to create a table of information about each of the models that were fit. The first column displays the model number, the second column displays the value of lambda for that model, the third column displays the number of variables with nonzero coefficients, the fourth column displays the out-of-sample R-squared, and the fifth column displays the BIC. So maybe we'd like to select the model with the lowest BIC, which in this case is model 14. You can type lassoselect id = 14 to select model 14, and then you can view a cross-validation plot by typing cvplot. This plot shows that the cross-validation function is slightly higher for model 14, with a value of lambda equal to 0.27, but model 14 is more parsimonious, with only 28 coefficients rather than the 49 coefficients when lambda equals 0.17. Let's store the results of this model as minBIC.

Next, let's use the option selection(adaptive) to fit an adaptive lasso model. Adaptive lasso did two lassos and selected model 78 as the best-fitting model because it has the smallest cross-validation mean prediction error. Let's store these estimates with the name adaptive.

Now we can use lassocoef to view a table of the variables that were selected by our three models. The first column lists the variables, the second column includes an X if the variable was selected using the default cross-validation method, the third column includes an X if the variable was selected using the minimum-BIC method, and the fourth column displays an X if the variable was selected using adaptive lasso. Note that the rows are sorted so that the variables with the largest standardized coefficients are displayed at the top; the most important variables are listed first.

And we can use lassogof to assess the goodness of fit in our training sample and in our testing sample. Recall that sample 1
is our training data and sample 2 is our testing data, which we created using splitsample. The results show that the model with the minimum BIC has the smallest mean squared error and the largest R-squared in the testing dataset.

If you would like to learn more about lasso for prediction and model selection, you can download the manual at our website. I hope this was helpful. Thanks for stopping by.
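The session walked through in this video can be sketched as a single do-file. This is a sketch, not the video's exact code: the rseed values are placeholders, and the specific model IDs (19, 14, 78), lambdas, and coefficient counts quoted above come from the run shown in the video, so your numbers may differ.

```stata
* Load the fake survey data used in the video
webuse fakesurvey, clear

* Variables with <= 4 unique values go to the categorical group,
* the rest to the continuous group; omit the "uncertain" group
vl set, categorical(4) uncertain(0)

* Create $ifactors: the categorical variables with an i. prefix
vl substitute ifactors = i.vlcategorical

* Split into training (sample==1) and testing (sample==2) halves
splitsample, generate(sample) nsplit(2) rseed(1234)

* Linear lasso with the default cross-validation selection,
* fit on the training data only
lasso linear q104 $ifactors $vlcontinuous if sample == 1, rseed(1234)
cvplot                               // CV function vs. lambda
estimates store cv

* Knot table: lambda, nonzero coefficients, out-of-sample R2, BIC
lassoknots, display(nonzero osr2 bic)

* Select the minimum-BIC model (model 14 in the video's run)
lassoselect id = 14
cvplot
estimates store minBIC

* Adaptive lasso on the same training data
lasso linear q104 $ifactors $vlcontinuous if sample == 1, ///
    selection(adaptive) rseed(1234)
estimates store adaptive

* Which variables each method selected, largest standardized
* coefficients first
lassocoef cv minBIC adaptive, sort(coef, standardized)

* Goodness of fit in the training and testing samples
lassogof cv minBIC adaptive, over(sample) postselection
```

Storing each fit with estimates store is what lets lassocoef and lassogof compare the three selection methods side by side at the end.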