SAS Workshop Introduction to SAS Programming Iowa State University

DAY 3 SESSION I
May 10, 2016
Sample Data: Prostate Data Set
• Example C9 further illustrates the use of the all-subset selection
  options in proc reg.
• In this example, adjrsq is used instead of rsquare as the model
  selection criterion.
• The data used here came from a study that examined the
  correlation between the level of prostate specific antigen and a
  number of clinical measures in men who were about to receive a
  radical prostatectomy.
• The goal was to predict log prostate specific antigen (lpsa) from
  a number of measurements: log cancer volume (lcavol),
  log prostate weight (lweight), age, log benign prostatic
  hyperplasia amount (lbph), seminal vesicle invasion (svi), log
  capsular penetration (lcp), Gleason score (gleason), and
  percentage of Gleason scores 4 or 5 (pgg45).
SAS Example C9
/* Read the prostate data with list input: ten numeric fields per case */
data prostate;
   infile "U:\Documents\SAS_Workshop_Spring2016\Data\prostate.txt";
   input case lcavol lweight age lbph svi lcp
         gleason pgg45 lpsa;
run;

ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c9out.pdf";
title 'Variable Subset Selection: Prostate Data';

/* All-subset selection ranked by adjusted R-square: subsets of 2 to 5
   predictors, at most 12 models reported */
proc reg data=prostate plots(only)=(cp(label) aic(label));
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=adjrsq start=2 stop=5 best=12
           sse mse aic cp b;
run;
quit;
ods pdf close;
Validation and Cross-Validation
• The average squared error (ASE) of prediction is used to estimate the
  prediction error of a model fitted using the training data:

  $$\mathrm{ASE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

  Here $y_i$ is the $i$th observation in the validation data set, $\hat{y}_i$
  is its predicted value using the fitted model, and $N$ is the validation
  sample size.
• An alternative approach is to use K-fold cross-validation. Here the original
  data are first randomly divided into K equal-sized partitions (folds). One of
  these partitions (say, the kth one) is treated as the hold-out data. The
  remaining K − 1 partitions together form the training data.
• The selected model is fitted to the training data. This fitted model is then
  applied to the hold-out data to obtain the prediction error (ASE) of the model.
• The whole procedure is repeated using the kth fold, for k = 1, ..., K, as the
  hold-out data set and the remaining data as the training data. The ASEs
  resulting from the K folds are then combined.
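• For equal fold sample sizes, the ASE of the whole cross-validation procedure is

  $$\mathrm{ASE} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i=1}^{N/K} (y_{ki} - \hat{y}_{ki})^2$$

  where $y_{ki}$ is the $i$th observation in the $k$th fold and $N$ is the total
  sample size, noting that the $\hat{y}_{ki}$ for each $k$ are calculated using
  prediction equations fit to the training data for the $k$th fold.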
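The K-fold procedure described above can also be carried out by hand, which may help show what the cross-validation ASE measures. The following is a minimal sketch, not part of the workshop examples. It assumes the prostate data set has already been read as in the other examples, and it uses a fixed model (lcavol, lweight, svi, lbph) chosen only for illustration. The data set names folds, work_k, pred_k, sq_k, and cv_sq and the macro name kfold are hypothetical. Because 97 cases do not divide evenly by 5, the folds are only approximately equal in size.

```sas
%let K = 5;                            /* number of folds (assumed value) */

/* Step 1: attach a uniform random number to each case */
data folds;
   set prostate;
   if _n_ = 1 then call streaminit(12345);   /* fixed seed, reproducible */
   u = rand('uniform');
run;

/* Step 2: cut the random numbers into K near-equal groups; fold = 0,...,K-1 */
proc rank data=folds out=folds groups=&K;
   var u;
   ranks fold;
run;

/* Remove results left over from an earlier run before appending */
proc datasets lib=work nolist;
   delete cv_sq;
quit;

%macro kfold;
   %local k;
   %do k = 0 %to %eval(&K - 1);
      /* Set the response to missing in the hold-out fold so that those
         cases are not used in the fit but still receive predictions */
      data work_k;
         set folds;
         holdout = (fold = &k);
         if holdout then y_fit = .;
         else y_fit = lpsa;
      run;

      proc reg data=work_k noprint;
         model y_fit = lcavol lweight svi lbph;
         output out=pred_k p=yhat;
      run;
      quit;

      /* Squared prediction errors for the hold-out cases only */
      data sq_k;
         set pred_k;
         if holdout;
         sq_err = (lpsa - yhat)**2;
         keep case fold sq_err;
      run;

      proc append base=cv_sq data=sq_k;
      run;
   %end;
%mend kfold;
%kfold

/* The mean of sq_err over all hold-out cases is the cross-validation ASE */
proc means data=cv_sq mean n;
   var sq_err;
run;
```

Each case appears exactly once in cv_sq, so the reported mean pools the squared errors over all K folds, as in the formula above.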
SAS Example C10: Validation
/* Read the prostate data with list input */
data prostate;
   infile "U:\Documents\SAS_Workshop_Spring2016\Data\prostate.txt";
   input case lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
run;

ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c10out.pdf";

/* Stepwise selection by significance level; the final model is the
   step with the smallest ASE on a random 35% validation partition */
proc glmselect data=prostate
      seed=12345 plots(stepAxis=number)=(criteria ASE);
   partition fraction(validate=.35);
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=stepwise(choose=validate select=sl) stb;
run;
ods pdf close;
Notes: Options Used in Example C10
• SAS procedure glmselect is used to illustrate how to perform validation.
• The stepwise selection method with the p-value criterion is used to select
  the model at each step (selection=stepwise with the select=sl suboption).
• However, the choose=validate suboption specifies that the model
  with the smallest validation ASE be chosen.
• When this option is used, glmselect produces tables and graphics that
  contain the validation ASE for the model at each step.
• By default, the stepwise option uses a significance level of .15 for both
  entry and deletion of variables.
• The best model is chosen using a validation data set obtained by
  randomly splitting the original prostate data set of 97 cases into a
  training data set of 69 cases and a validation set of 28 cases.
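The default significance levels noted above can also be written out explicitly, which makes them easy to change. The following is a minimal sketch, assuming the prostate data set from Example C10. It uses the sle= and sls= suboptions of selection=stepwise to set the entry and stay levels; the 0.15 values simply restate the defaults, and the plot and ODS statements of the original example are omitted for brevity.

```sas
/* Same selection as Example C10, with the entry (sle) and stay (sls)
   significance levels stated explicitly instead of left at defaults */
proc glmselect data=prostate seed=12345;
   partition fraction(validate=.35);
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=stepwise(select=sl sle=0.15 sls=0.15
                              choose=validate) stb;
run;
```

Lowering sle= and sls= (for example, to 0.05) makes entry harder and removal easier, which would typically produce a shorter sequence of candidate models for choose=validate to pick from.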
SAS Example C11: K-fold Cross-Validation
/* Read the prostate data with list input */
data prostate;
   infile "U:\Documents\SAS_Workshop_Spring2016\Data\prostate.txt";
   input case lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
run;

ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c11out.pdf";

/* Stepwise selection by adjusted R-square; the final model is chosen
   by 5-fold cross-validation with randomly assigned folds */
proc glmselect data=prostate
      seed=12345 plots(stepAxis=number)=(criteria coefficients);
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / cvmethod=random(5)
           selection=stepwise(select=adjrsq choose=cv)
           stats=(cp aic sbc) stb;
run;
ods pdf close;
Comments on Example C11
• SAS procedure glmselect is used to illustrate how to perform K-fold
  cross-validation.
• The model option cvmethod=random(5) specifies 5-fold cross-validation
  with the folds selected randomly (each of size approximately N/K, rounded
  down to an integer).
• The stepwise selection method used here takes adjusted R² as the criterion
  for selecting the model at each step (select=adjrsq suboption).
• However, the choose=cv suboption specifies that, among the models selected
  at each step, the one with the smallest cross-validation prediction error be
  chosen. This statistic is called CV PRESS in SAS.
• The option stats=(cp aic sbc) specifies that glmselect produce tables and
  graphics that contain these statistics for the selected models.
• However, the CV PRESS values for Models 2 through 7 are similar. Thus a
  possible good model may be one of the other models, say Model 4, which has
  the smallest Cp value (5.6264) as well as good values of several other fit
  criteria.
• This model, with an intercept and the predictors lcavol, lweight, svi, and
  lbph, thus appears to have the smallest bias and a comparably small error in
  prediction.
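Once a candidate such as Model 4 has been identified, it can be refit on its own to inspect its estimates. The following is a minimal sketch, assuming the prostate data set from Example C11 is available. The clb option, which requests confidence limits for the coefficients, is added here for illustration and is not part of the workshop example.

```sas
/* Refit the candidate model (intercept plus four predictors) to all
   97 cases and report coefficient estimates with confidence limits */
proc reg data=prostate;
   model lpsa = lcavol lweight svi lbph / clb;
run;
quit;
```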