Supplemental Information 5. Analyses using the Least Absolute Shrinkage and
Selection Operator (LASSO)
The LASSO technique met our analytic requirements because it can incorporate multiple
imputation, accommodates sampling weights, and is computationally efficient (1). It also
performed well in a pilot study we conducted using 6 variables from girls in NHANES. We
compared the models produced by LASSO (1), adaptive LASSO (2), Smoothly Clipped Absolute
Deviation (SCAD) (3), and all possible subsets regression. For the all possible subsets
regression, we compared results using the selection criteria of adjusted R2, Akaike information
criterion (AIC), and root mean square error (RMSE). All approaches produced equations with
similar performance. Given these results, LASSO was chosen as the analytic tool for term
selection because it allowed consideration of a very large number of candidate terms.
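As an illustration only, the minimal Python sketch below fits a LASSO model with 10-fold cross-validation using scikit-learn's LassoCV. The data, number of candidate terms, and penalty grid are placeholders, and the sketch omits the multiple imputation and NHANES sampling weights used in the actual analysis.

```python
# Minimal sketch (not the authors' code): LASSO with cross-validation.
# Placeholder data stand in for the candidate anthropometric terms.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))                                   # placeholder candidate terms
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)  # placeholder outcome

X_std = StandardScaler().fit_transform(X)        # LASSO penalizes standardized coefficients
lasso = LassoCV(cv=10, n_alphas=100, random_state=0).fit(X_std, y)

n_selected = int(np.sum(lasso.coef_ != 0))
print(f"penalty chosen by CV: {lasso.alpha_:.4f}, nonzero terms: {n_selected}")
```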
Step 2 of the Methods section in the text describes how we used LASSO to create
models to predict percent body fat and presents the decision rules used to reduce or avoid
overfitting when choosing the final model. Here, we illustrate this process by describing in detail
how final model A was selected for males and females using the fitting data.
We compared the model selected by LASSO as the one with minimal cross-validation
error (CVmin) to the model with cross-validation error that was 1 standard error larger than
CVmin (CVmin+1SE). The addition of 1 SE was suggested by Hastie et al. (4) to control for
overfitting when using LASSO. Note that if no model had cross-validation error that was exactly
1 SE larger than CVmin, we chose the closest model that did not exceed the bound. In males, the
CVmin model had an adjusted R2 of 0.854, whereas the CVmin+1SE model had an adjusted R2
of 0.847; the adjusted R2 was therefore reduced by 0.007 (0.854-0.847=0.007). If the reduction in
the adjusted R2 had been 0.01 or greater, the CVmin+1SE model would have been chosen.
However, since the difference was less than 0.01, we proceeded to a second step in which we
examined models at larger SE values in increments of 0.25 and looked for the model with the
largest SE that reduced the adjusted R2 by no more than 0.01 compared with the CVmin model.
The models at 1.25, 1.5, and 1.75 SE had adjusted R2 values that were lower than that of the
CVmin model by 0.009, 0.009, and 0.011, respectively. Since the reduction for the 1.75 SE
model exceeded our bound of 0.01, we chose the 1.5 SE model.
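To make the decision rule concrete, the sketch below is one possible reading of the steps just described; it is not the authors' code. The arrays cv_error, cv_se, and adj_r2 are hypothetical per-candidate-model inputs (one entry per penalty value, ordered from least to most penalized), with adjusted R2 assumed to have been computed for each candidate model.

```python
# Minimal sketch (our reading of the decision rule, not the authors' code):
# compare the CVmin and CVmin+1SE models, then enlarge the SE bound in 0.25
# steps while the loss in adjusted R2 relative to CVmin stays under 0.01.
def select_model(cv_error, cv_se, adj_r2, max_drop=0.01, step=0.25, max_k=10.0):
    i_min = min(range(len(cv_error)), key=lambda i: cv_error[i])
    r2_min = adj_r2[i_min]

    def model_at(k):
        # sparsest (most penalized) model whose CV error is within k SE of the minimum
        bound = cv_error[i_min] + k * cv_se[i_min]
        return max(i for i in range(len(cv_error)) if cv_error[i] <= bound)

    i_1se = model_at(1.0)
    if r2_min - adj_r2[i_1se] >= max_drop:
        return i_1se                    # the 1 SE model already costs >= 0.01: stop here
    best, k = i_1se, 1.0
    while k < max_k:                    # safety cap on the SE multiplier
        k += step
        i_k = model_at(k)
        if r2_min - adj_r2[i_k] > max_drop:
            break                       # this step loses more than 0.01; keep the previous one
        best = i_k                      # ties in adjusted R2 resolve to the larger SE
    return best
```

Under this reading, the male fitting data described above would stop at the 1.5 SE model, since the next 0.25 increment loses more than 0.01 in adjusted R2.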
This example also illustrates the rare circumstance in which two models with different SE
added to the minimum (here 1.25 and 1.5 SE) had the same adjusted R2 when assessed to three
decimal places (both were 0.009 lower than the adjusted R2 of the CVmin model). In such cases,
we chose the model with the larger SE. The final selected model for males had 43 terms, while
the CVmin model had 165 terms.
In females, for model A the CVmin+1SE model had an adjusted R2 that was 0.012 lower than
that of the CVmin model. Since this decline was greater than or equal to 0.01, we chose the
CVmin+1SE model. The selected model for females had 37 terms, whereas the CVmin model
had 134 terms.
It is expected that model fit will be better in the fitting sample than in the validation
sample, but large discrepancies can indicate overfitting. For models A to N, the mean difference
in R2 between the fitting and validation samples was 0.001 for males and 0.009 for females. The
largest differences were 0.005 for males (models C, H, and M) and 0.014 for females (model N).
We consider these differences acceptable.
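For illustration, a small hedged sketch of this check follows; the function name and arguments are hypothetical, and it simply compares R2 in the two samples for a fitted model.

```python
# Minimal sketch (illustrative only): screen for overfitting by comparing R2
# in the fitting and validation samples for a fitted model.
from sklearn.metrics import r2_score

def r2_gap(model, X_fit, y_fit, X_val, y_val):
    """Difference in R2 between the fitting and validation samples."""
    r2_fit = r2_score(y_fit, model.predict(X_fit))
    r2_val = r2_score(y_val, model.predict(X_val))
    return r2_fit - r2_val              # large positive gaps suggest overfitting
```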
References
1. Tibshirani RJ. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267-88.
2. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418-29.
3. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348-60.
4. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York, NY: Springer New York Inc.; 2001. p. 244.