SAS Workshop Introduction to SAS Programming Iowa State University

SAS Workshop
Introduction to SAS
Programming
DAY 2 SESSION IV
Iowa State University
May 10, 2016
Controlling ODS graphical output from a procedure
• Many SAS procedures produce default plots in ODS Graphics format.
• One way to select the graphs (and tables) to be output is the ODS SELECT statement, as we have seen in previous examples.
• ODS SELECT only controls which objects are sent to the ODS destination; it may not stop the unselected plots from being produced.
• To generate only the required plots, use the PLOTS= option, available in statistical procedures that support ODS Graphics.
• The simplest PLOTS= specification has the form PLOTS=plot-request or PLOTS=(plot-requests). By itself, this does not stop the default plots of the procedure from being produced.
• To suppress the default plots, add the ONLY global option, e.g., plots(only)=residuals.
• Using plots=none disables all ODS Graphics for the current proc step.
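The difference between these forms can be seen in a minimal sketch. It assumes the SASHELP.CLASS sample data set (not part of the workshop data) and uses Weight and Height only for illustration.

```sas
ods graphics on;

/* Default behaviour: all default plots of proc reg are produced */
proc reg data=sashelp.class;
   model Weight = Height;
   title "Default ODS Graphics output";
run;
quit;

/* PLOTS(ONLY)=: produce only the requested residual plots */
proc reg data=sashelp.class plots(only)=residuals;
   model Weight = Height;
   title "Residual plots only";
run;
quit;

/* PLOTS=NONE: tables only, no graphs for this proc step */
proc reg data=sashelp.class plots=none;
   model Weight = Height;
   title "No ODS Graphics output";
run;
quit;
```

To find the object names that ODS SELECT expects, one option is to submit ods trace on; before the proc step and read the names from the log.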
Sample SAS Program C4
data muscle;
input x y;
label x='Age' y='Muscle Mass';
datalines;
71 82
64 91
…… …
…… …
53 100
49 105
78 77
;
ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c4_out.pdf";
proc reg data=muscle plots(only)=(diagnostics
qq residualbypredicted fit residuals);
model y = x/r;
title "Simple Linear Regression Analysis of Muscle Mass Data";
run;
ods pdf close;
Sample SAS Program C5
data lead;
input Sample $ x y;
label x='Traffic Flow' y='Lead Content';
datalines;
A 8.3 227
B 8.3 312
C 12.1 362
D 12.1 521
E 17.0 640
F 17.0 539
G 17.0 728
H 24.3 945
I 24.3 738
J 24.3 759
K 33.6 1263
L 10.0 .
M 15.0 .
;
proc reg data=lead plots(only label)=(fit rstudentbyleverage cooksd);
model y=x/clm cli;
id Sample;
title "Prediction Intervals: Lead Content Data";
run;
Sample SAS Program C6
In the following example we use the PLOTS= option to request the
diagnostics panel and the regression fit plot from a regression of
Weight on Height in the biology data. The clb option computes
confidence intervals for the regression coefficients, and the r and
influence options print the residual and influence statistics. The
ODS SELECT statement restricts the displayed output to the listed
tables and the fit plot.
libname libc "U:\Documents\SAS_Workshop_Spring2016\Data\";
ods select ANOVA ParameterEstimates OutputStatistics FitPlot;
proc reg data=libc.biology plots(only)=(fit diagnostics);
model Weight=Height/clb r influence;
title "Regression of Weight on Height: Biology Class";
run;
Small SAS Project
• Import an Excel data set as a SAS data set using proc import.
• The data consist of air pollution and related values for 41 U.S. cities. SO2 in the air, in mcg/m3, is the response variable.
• Use proc sgscatter to obtain a scatterplot matrix and proc reg to do a preliminary multiple regression analysis.
• Looking at the plots alone, Obs #31 looks like an influential y-outlier. The diagnostic statistics must be examined to confirm this.
• Read a file containing city names indexed by the same City # used in the above Excel file, combine it with the first SAS data set using a merge, and save the resulting SAS data set.
• In a second SAS program, access this SAS data set and perform a variable subset selection procedure using proc reg.
Sample SAS Program C7
libname mylib "U:\Documents\SAS_Workshop_Spring2016\Data\";
proc import out= work.air
datafile= "U:\Documents\SAS_Workshop_Spring2016\Data\air_pollution.xls"
dbms=xls replace;
getnames=yes;
run;
proc print data=air;
run;
ods rtf file="U:\Documents\SAS_Workshop_Spring2016\c7_out.rtf" style=statistical;
proc sgscatter data=air;
title "Scatterplot Matrix for Air Pollution Data";
matrix SO2--PrecipDays;
run;
proc reg corr data=air plots(only label)=diagnostics;
model SO2 = AvTemp--PrecipDays/r influence clb vif;
id City;
title 'Model fitted with all explanatory variables';
run;
ods rtf close;
Sample SAS Program C7 (Continued)
data names;
infile "U:\Documents\SAS_Workshop_Spring2016\Data\city_names.txt" truncover;
input City CityName $14.;
run;
proc print data=names;
title "List of City Names";
run;
data mylib.pollution;
merge air names;
by City;
run;
proc print data=mylib.pollution;
title "Listing of Air Pollution Data Set Merged with City Names";
run;
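A merge with a by statement requires both input data sets to be sorted by the by variable. The program above relies on air and names already being in City order. If that is not certain, a minimal sketch of the extra step, to be run before the data mylib.pollution step, would be:

```sas
/* Sort both inputs by the merge key before the merge step */
proc sort data=air;
   by City;
run;

proc sort data=names;
   by City;
run;
```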
Sample SAS Program C8
libname mylib "U:\Documents\SAS_Workshop_Spring2016\Data\";
data cleaned;
set mylib.pollution;
if _N_= 31 then delete;
run;
ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c8_out.pdf";
proc reg data=cleaned plots(only)=(criteria cp(label));
model SO2 = AvTemp--PrecipDays/selection=rsquare start=2 stop=4 best=4
cp sse mse;
title "Models fitted with all explanatory variables (with Obs#31 deleted)";
run;
ods pdf close;
Model Building: Variable Selection in Regression
• The aim of variable selection methods is to identify a subset of k predictors (i.e., x-variables) that has good predictive power.
• Classical methods enter (forward selection) or delete (backward elimination) a single variable at a time from the current model, based on p-values.
• The significance of a variable to be entered into a model is calculated using an F-statistic that compares the current model with the model that includes the new variable.
• If the p-value of this variable is below the significance level for entry (called sle in SAS), the variable is entered.
• The same process is used for deleting variables. A variable is deleted if it is not significant at the significance level for stay (sls in SAS), i.e., if its p-value exceeds sls.
• The stepwise selection method combines these two methods. In each iteration of the method a forward selection step is followed by a backward elimination step.
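As a rough illustration of how sle and sls are supplied in proc reg, the sketch below runs a stepwise selection. The SASHELP.CARS sample data set, the chosen variables, and the 0.10 thresholds are assumptions made for this example and are not part of the workshop material.

```sas
/* Stepwise selection: a variable enters if its p-value is below SLENTRY */
/* and is removed again if its p-value rises above SLSTAY                */
proc reg data=sashelp.cars plots=none;
   model MPG_City = EngineSize Cylinders Horsepower Weight Wheelbase Length
         / selection=stepwise slentry=0.10 slstay=0.10;
   title "Stepwise selection sketch";
run;
quit;
```

Replacing selection=stepwise with selection=forward (which uses only slentry=) or selection=backward (which uses only slstay=) gives the two classical methods on their own.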
All Subset Selection Method
• Suppose we start with k predictor variables: x1, x2, …, xk
• Fit all models of size p where p = 1, …, k, i.e., 1-variable models, 2-variable models, etc.
• Pick the best among the models of each size. Here best is defined as having the largest R².
• Select a single best model from among these models using a criterion such as Cp (or AIC), BIC, or adjusted R².
• For the selected model to be unbiased, we would like Cp to be close to the number of estimated parameters (p + 1 when the intercept is included) or smaller.
• AIC and Cp are essentially equivalent for models with normal errors.
• Generally we select the model that has the lowest BIC value.
• There is no guarantee that the selected models will perform well when accuracy of predicting new observations is of interest.
• A standard approach for assessing the predictive ability of different regression models is to evaluate their performance on a hold-out data set (often called the validation data set).
• When a sufficiently large data set is available, this is usually achieved by randomly splitting the data into a training data set and a validation data set.
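The random split described above could be sketched as follows. Everything specific here is an assumption made for illustration: the SASHELP.CARS sample data, the 70/30 split, the seed, and the two predictors. The response is set to missing for the validation rows so that proc reg predicts them without using them in the fit, the same device used for Samples L and M in Program C5.

```sas
/* Randomly flag about 70% of the rows as training data (Selected=1) */
proc surveyselect data=sashelp.cars out=cars_split
                  method=srs samprate=0.7 seed=2016 outall;
run;

/* Hide the response for validation rows (Selected=0) */
data cars_fit;
   set cars_split;
   if Selected = 1 then y_train = MPG_City;
   else y_train = .;
run;

/* Fit on the training rows; predictions are written for all rows */
proc reg data=cars_fit plots=none;
   model y_train = Horsepower Weight;
   output out=cars_pred p=pred;
run;
quit;

/* Squared prediction error on the validation rows only */
data valid_err;
   set cars_pred;
   where Selected = 0;
   sq_err = (MPG_City - pred)**2;
run;

proc means data=valid_err n mean;
   var sq_err;
   title "Validation mean squared prediction error";
run;
```

Repeating the last three steps for each candidate model and comparing the validation means gives a direct, if simple, check of predictive ability.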