SAS Workshop
Introduction to SAS Programming
DAY 3 SESSION I
Iowa State University
May 10, 2016

Sample Data: Prostate Data Set

Example C9 further illustrates the use of all-subset selection options in proc reg. In this example, adjrsq is used instead of rsquare as the model selection criterion.

The data used here came from a study that examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy.

The goal was to predict log prostate specific antigen (lpsa) from a number of measurements: log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45).

SAS Example C9

    /* Read the prostate data: one case per line, ten numeric columns */
    data prostate;
      infile "U:\Documents\SAS_Workshop_Spring2016\Data\prostate.txt";
      input case lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
    run;

    ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c9out.pdf";
    title 'Variable Subset Selection: Prostate Data';

    /* All-subset selection ranked by adjusted R-square, for models with 2 to 5 predictors */
    proc reg data=prostate plots(only)=(cp(label) aic(label));
      model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
            / selection=adjrsq start=2 stop=5 best=12 sse mse aic cp b;
    run;
    ods pdf close;

Validation and Cross-Validation

The average squared error (ASE) of prediction is used to estimate the prediction error of a model fitted using the training data:

$$\mathrm{ASE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Here $y_i$ is the $i$-th observation in the validation data set, $\hat{y}_i$ is its predicted value using the fitted model, and $N$ is the validation sample size.

An alternative approach is to use K-fold cross-validation. Here the original data are first randomly divided into K equal-sized partitions. One of these partitions (say, the k-th one) is considered the hold-out data. The remaining K − 1 partitions put together are considered the training data.
The model selected is fitted to the training data. This fitted model is used on the hold-out data to obtain the prediction error (ASE) of the model. The whole procedure is repeated using the k-th fold, for k = 1, ..., K, as the hold-out data set and the remaining data as the training data, and the ASEs resulting from each partition are then combined.

For equal fold sample sizes $n_k = N/K$, where $N$ is the total sample size, the ASE of the whole cross-validation procedure is

$$\mathrm{ASE}_{CV} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{n_k} (y_{ki} - \hat{y}_{ki})^2$$

noting that the $\hat{y}_{ki}$ for each $k$ are calculated using the prediction equation fit to the training data for the k-th fold.

SAS Example C10: Validation

    /* Read the prostate data: one case per line, ten numeric columns */
    data prostate;
      infile "U:\Documents\SAS_Workshop_Spring2016\Data\prostate.txt";
      input case lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
    run;

    ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c10out.pdf";

    /* Stepwise selection by significance level; final model chosen by validation ASE */
    proc glmselect data=prostate seed=12345
                   plots(stepAxis=number)=(criteria ASE);
      partition fraction(validate=.35);
      model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
            / selection=stepwise(choose=validate select=sl) stb;
    run;
    ods pdf close;

Notes: Options Used in Example C10

The SAS procedure glmselect is used to illustrate how to perform validation.

Stepwise selection with the p-value (significance level) criterion is used to select the model at each step (selection=stepwise with the select=sl suboption). However, the choose=validate suboption specifies that, among the models in the stepwise sequence, the one with the smallest validation ASE is chosen as the final model. When this option is used, glmselect produces tables and graphics that contain the validation ASE for the model at each step.

By default, the stepwise method uses a significance level of .15 for both entry and deletion of variables.

The best model is chosen using a validation data set obtained by randomly splitting the original prostate data set of 97 cases into a training set with 69 cases and a validation set of 28 cases. The split is random, so the realized validation fraction (28 of 97 cases, about 29%) differs from the nominal .35 requested in the partition statement.
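The validation ASE can also be checked by hand for a single candidate model. The following is a minimal sketch under stated assumptions, not part of the workshop examples: the data set names split, pred, and sqerr, the variables role, y_fit, yhat, and sqerr, and the choice of predictors are all hypothetical, and the random split drawn here will not reproduce the partition that glmselect draws. It relies on the fact that proc reg leaves cases with a missing response out of the fit but still computes predicted values for them.

```sas
/* Hypothetical sketch: validation ASE by hand for one candidate model. */
/* Assumes the prostate data set from Example C10 already exists.       */
data split;
  set prostate;
  length role $ 8;
  if _n_ = 1 then call streaminit(12345);   /* seed the random stream once */
  /* Assign each case to the validation role with probability .35 */
  if rand('uniform') < 0.35 then role = 'VALIDATE';
  else role = 'TRAIN';
  /* Hide the response in validation rows so they are not used in the fit */
  if role = 'TRAIN' then y_fit = lpsa;
  else y_fit = .;
run;

/* Fit on the training rows only; predictions are produced for every row */
proc reg data=split noprint;
  model y_fit = lcavol lweight svi;          /* illustrative predictor set */
  output out=pred p=yhat;
run;
quit;

/* Squared prediction errors on the validation rows */
data sqerr;
  set pred;
  where role = 'VALIDATE';
  sqerr = (lpsa - yhat)**2;
run;

/* The mean of sqerr is the validation ASE; n is the validation sample size */
proc means data=sqerr n mean;
  var sqerr;
run;
```

Repeating the last three steps for each model in a candidate sequence and keeping the model with the smallest mean is, in outline, the comparison that choose=validate automates.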
SAS Example C11: K-fold Cross-Validation

    /* Read the prostate data: one case per line, ten numeric columns */
    data prostate;
      infile "U:\Documents\SAS_Workshop_Spring2016\Data\prostate.txt";
      input case lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
    run;

    ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c11out.pdf";

    /* Stepwise selection by adjusted R-square; final model chosen by 5-fold CV */
    proc glmselect data=prostate seed=12345
                   plots(stepAxis=number)=(criteria coefficients);
      model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
            / cvmethod=random(5)
              selection=stepwise(select=adjrsq choose=cv)
              stats=(cp aic sbc) stb;
    run;
    ods pdf close;

Comments on Example C11

The SAS procedure glmselect is used to illustrate how to perform K-fold cross-validation.

The model option cvmethod=random(5) specifies 5-fold cross-validation with the folds selected randomly (each of size approximately N/K).

Stepwise selection here uses adjusted R² to select the model at each step (the select=adjrsq suboption). However, the choose=cv suboption specifies that, among the models selected at each step, the one with the smallest cross-validation prediction error is chosen. SAS reports this statistic as CV PRESS, the sum of squared prediction errors over the held-out folds (that is, N times the cross-validation ASE).

The option stats=(cp aic sbc) specifies that glmselect produces tables and graphics that contain these statistics for the selected models.

In the output, however, the CV PRESS values for models 2 through 7 are similar. Thus a good candidate may be one of the other models rather than the one with the smallest CV PRESS, say Model 4, which has the smallest Cp value (5.6264) and does well on several other fit criteria. This model, with an intercept and the predictors lcavol, lweight, svi, and lbph, thus appears to have the smallest bias and a comparably small prediction error.
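The K-fold calculation can be sketched by hand for one fixed model, here the Model 4 predictors named above. This is an illustrative sketch under stated assumptions rather than a reproduction of what glmselect does internally: the data set names folds, stacked, cvpred, and cvsq and the variables fold, k, y_fit, yhat, and sqerr are hypothetical, the folds are drawn with a different random stream than the one glmselect uses, and fold sizes are only approximately equal.

```sas
/* Hypothetical sketch: 5-fold cross-validation by hand for one fixed model. */
/* Assumes the prostate data set from Example C11 already exists.            */
data folds;
  set prostate;
  if _n_ = 1 then call streaminit(12345);   /* seed the random stream once */
  fold = ceil(5 * rand('uniform'));         /* random fold label, 1 to 5 */
run;

/* Stack 5 copies of the data; in copy k, hide the response for fold k */
data stacked;
  set folds;
  do k = 1 to 5;
    if fold = k then y_fit = .;             /* hold-out rows for this copy */
    else y_fit = lpsa;                      /* training rows for this copy */
    output;
  end;
run;

proc sort data=stacked;
  by k;
run;

/* One fit per copy: each uses the 4 training folds and predicts every row */
proc reg data=stacked noprint;
  by k;
  model y_fit = lcavol lweight svi lbph;
  output out=cvpred p=yhat;
run;
quit;

/* Keep each case once, with the prediction from the fit that excluded it */
data cvsq;
  set cvpred;
  if fold = k;
  sqerr = (lpsa - yhat)**2;
run;

/* sum = total squared prediction error over held-out folds;
   mean = cross-validation ASE                                */
proc means data=cvsq n sum mean;
  var sqerr;
run;
```

Because every case appears in exactly one hold-out fold, the n reported by proc means should equal the number of cases in prostate, which is a quick check that the stacking step worked as intended.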