Supplemental Information 5. Analyses using the Least Absolute Shrinkage and
Selection Operator (LASSO)
The LASSO technique met our analytic requirements because it can incorporate multiple
imputation, accommodates sampling weights, and is computationally efficient (1). It also
performed well in a pilot study we conducted using 6 variables from girls in NHANES. We
compared the models produced by LASSO (1), adaptive LASSO (2), Smoothly Clipped Absolute
Deviation (SCAD) (3), and all possible subsets regression. For the all possible subsets
regression, we compared results using the selection criteria of adjusted R2, Akaike information
criterion (AIC), and root mean square error (RMSE). All approaches produced equations with
similar performance. Given these results, LASSO was chosen as the analytic tool for term
selection because it allowed consideration of a very large number of candidate terms.
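As an illustration only, the minimal Python sketch below fits a LASSO model with 10-fold cross-validation using scikit-learn's LassoCV. The data, number of candidate terms, and penalty grid are placeholders, and the sketch omits the multiple imputation and NHANES sampling weights used in the actual analysis.

```python
# Minimal sketch (not the authors' code): LASSO with cross-validation.
# Placeholder data stand in for the candidate anthropometric terms.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))                                   # placeholder candidate terms
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)  # placeholder outcome

X_std = StandardScaler().fit_transform(X)        # LASSO penalizes standardized coefficients
lasso = LassoCV(cv=10, n_alphas=100, random_state=0).fit(X_std, y)

n_selected = int(np.sum(lasso.coef_ != 0))
print(f"penalty chosen by CV: {lasso.alpha_:.4f}, nonzero terms: {n_selected}")
```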
Step 2 of the Methods section in the text describes how we used LASSO to create
models to predict percent body fat and presents the decision rules used to reduce or avoid
overfitting when choosing the final model. Here, we illustrate this process by describing in detail
how final model A was selected for males and females using the fitting data.
We compared the model selected by LASSO as the one with minimal cross-validation
error (CVmin) to the model with cross-validation error that was 1 standard error larger than
CVmin (CVmin+1SE). The addition of 1 SE was suggested by Hastie et al. (4) to control for
overfitting when using LASSO. Note that if no model had cross-validation error that was exactly
1 SE larger than CVmin, we chose the closest model that did not exceed the bound. In males, the
CVmin model had an adjusted R2 of 0.854, whereas the CVmin+1SE model had an adjusted R2
of 0.847; the adjusted R2 was therefore reduced by 0.007 (0.854-0.847=0.007). If the reduction in
the adjusted R2 had been 0.01 or greater, the CVmin+1SE model would have been chosen.
However, since the difference was less than 0.01, we proceeded to a second step in which we
examined models at larger SE values in increments of 0.25 and looked for the model with the
largest SE that reduced the adjusted R2 by no more than 0.01 compared with the CVmin model.
The models at 1.25, 1.5, and 1.75 SE had adjusted R2 values that were lower than that of the
CVmin model by 0.009, 0.009, and 0.011, respectively. Since the reduction for the 1.75 SE
model exceeded our bound of 0.01, we chose the 1.5 SE model.
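To make the decision rule concrete, the sketch below is one possible reading of the steps just described; it is not the authors' code. The arrays cv_error, cv_se, and adj_r2 are hypothetical per-candidate-model inputs (one entry per penalty value, ordered from least to most penalized), with adjusted R2 assumed to have been computed for each candidate model.

```python
# Minimal sketch (our reading of the decision rule, not the authors' code):
# compare the CVmin and CVmin+1SE models, then enlarge the SE bound in 0.25
# steps while the loss in adjusted R2 relative to CVmin stays under 0.01.
def select_model(cv_error, cv_se, adj_r2, max_drop=0.01, step=0.25, max_k=10.0):
    i_min = min(range(len(cv_error)), key=lambda i: cv_error[i])
    r2_min = adj_r2[i_min]

    def model_at(k):
        # sparsest (most penalized) model whose CV error is within k SE of the minimum
        bound = cv_error[i_min] + k * cv_se[i_min]
        return max(i for i in range(len(cv_error)) if cv_error[i] <= bound)

    i_1se = model_at(1.0)
    if r2_min - adj_r2[i_1se] >= max_drop:
        return i_1se                    # the 1 SE model already costs >= 0.01: stop here
    best, k = i_1se, 1.0
    while k < max_k:                    # safety cap on the SE multiplier
        k += step
        i_k = model_at(k)
        if r2_min - adj_r2[i_k] > max_drop:
            break                       # this step loses more than 0.01; keep the previous one
        best = i_k                      # ties in adjusted R2 resolve to the larger SE
    return best
```

Under this reading, the male fitting data described above would stop at the 1.5 SE model, since the next 0.25 increment loses more than 0.01 in adjusted R2.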
This example also illustrates the rare circumstance in which two models with different SE
added to the minimum (here 1.25 and 1.5 SE) had the same adjusted R2 when assessed to three
decimal places (both were 0.009 lower than the adjusted R2 of the CVmin model). In such cases,
we chose the model with the larger SE. The final selected model for males had 43 terms, while
the CVmin model had 165 terms.
In females, for model A the CVmin+1SE model had an adjusted R2 that was 0.012 lower than
that of the CVmin model. Since this decline was greater than or equal to 0.01, we chose the
CVmin+1SE model. The selected model for females had 37 terms, whereas the CVmin model
had 134 terms.
It is expected that model fit will be better in the fitting sample than in the validation
sample, but large discrepancies can indicate overfitting. For models A to N, the mean difference
in R2 between the fitting and validation samples was 0.001 for males and 0.009 for females. The
largest differences were 0.005 for males (models C, H, and M) and 0.014 for females (model N).
We consider these differences acceptable.
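For illustration, a small hedged sketch of this check follows; the function name and arguments are hypothetical, and it simply compares R2 in the two samples for a fitted model.

```python
# Minimal sketch (illustrative only): screen for overfitting by comparing R2
# in the fitting and validation samples for a fitted model.
from sklearn.metrics import r2_score

def r2_gap(model, X_fit, y_fit, X_val, y_val):
    """Difference in R2 between the fitting and validation samples."""
    r2_fit = r2_score(y_fit, model.predict(X_fit))
    r2_val = r2_score(y_val, model.predict(X_val))
    return r2_fit - r2_val              # large positive gaps suggest overfitting
```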
References
1. Tibshirani RJ. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267-88.
2. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418-29.
3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348-60.
4. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer New York Inc.; 2001. p. 244.