SAS Workshop: Introduction to SAS Programming
Day 2, Session IV
Iowa State University, May 10, 2016

Controlling ODS Graphical Output from a Procedure

- Many SAS procedures produce default plots in ODS Graphics format.
- One way to select graphs (and tables) for output is the ODS SELECT statement, as we have seen in previous examples. ODS SELECT only controls which pieces of output are sent to the ODS destination; it may not stop the plots from being produced.
- To selectively generate only the required plots, use the PLOTS= option available in statistical procedures that support ODS Graphics.
- The simplest PLOTS= specification has the form PLOTS=plot-request or PLOTS=(plot-requests). This does not stop the procedure's default plots from being produced; to do that, add the ONLY global option, e.g., plots(only)=residuals.
- Using plots=none disables all ODS Graphics for the current PROC step.

Sample SAS Program C4

data muscle;
   input x y;
   label x='Age' y='Muscle Mass';
   datalines;
71 82
64 91
…… …
…… …
53 100
49 105
78 77
;

ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c4_out.pdf";
proc reg data=muscle plots(only)=(diagnostics qq residualbypredicted fit residuals);
   model y = x / r;
   title "Simple Linear Regression Analysis of Muscle Mass Data";
run;
ods pdf close;

Sample SAS Program C5

data lead;
   input Sample $ x y;
   label x='Traffic Flow' y='Lead Content';
   datalines;
A 8.3 227
B 8.3 312
C 12.1 362
D 12.1 521
E 17.0 640
F 17.0 539
G 17.0 728
H 24.3 945
I 24.3 738
J 24.3 759
K 33.6 1263
L 10.0 .
M 15.0 .
;

proc reg data=lead plots(only label)=(fit rstudentbyleverage cooksd);
   model y = x / clm cli;
   id Sample;
   title "Prediction Intervals: Lead Content Data";
run;

Sample SAS Program C6

In the following example we use the plots= option to request the diagnostic panel and the regression fit plot from a regression of Weight on Height in the biology data.
We use the clb option to compute confidence intervals for the regression coefficients, and we print all residual and influence statistics using the r and influence options.

libname libc "U:\Documents\SAS_Workshop_Spring2016\Data\";

ods select ANOVA ParameterEstimates OutputStatistics FitPlot;
proc reg data=libc.biology plots(only)=(fit diagnostics);
   model Weight = Height / clb r influence;
   title "Regression of Weight on Height: Biology Class";
run;

Small SAS Project

- Import an Excel data set as a SAS data set using proc import. The data consist of air pollution and related values for 41 U.S. cities; SO2 in the air (in mcg/m3) is the response variable.
- Use proc sgscatter to obtain a scatterplot matrix and proc reg to do a preliminary multiple regression analysis. Looking at the plots alone, Obs #31 looks like an influential y-outlier; we must look at the diagnostic statistics to confirm this.
- Read a file containing city names indexed by the same City # used in the Excel file above, combine it with the first SAS data set using a merge, and save the resulting SAS data set.
- In a second SAS program, access this SAS data set and perform a variable subset selection procedure using proc reg.

Sample SAS Program C7

libname mylib "U:\Documents\SAS_Workshop_Spring2016\Data\";

proc import out=work.air
   datafile="U:\Documents\SAS_Workshop_Spring2016\Data\air_pollution.xls"
   dbms=xls replace;
   getnames=yes;
run;

proc print data=air;
run;

ods rtf file="U:\Documents\SAS_Workshop_Spring2016\c7_out.rtf" style=statistical;

proc sgscatter data=air;
   title "Scatterplot Matrix for Air Pollution Data";
   matrix SO2--PrecipDays;
run;

proc reg corr data=air plots(only label)=diagnostics;
   model SO2 = AvTemp--PrecipDays / r influence clb vif;
   id City;
   title "Model fitted with all explanatory variables";
run;
ods rtf close;

Sample SAS Program C7 (Continued)

data names;
   infile "U:\Documents\SAS_Workshop_Spring2016\Data\city_names.txt" truncover;
   input City CityName $14.;
run;

proc print data=names;
   title "List of City Names";
run;

data mylib.pollution;
   merge air names;
   by City;
run;

proc print data=mylib.pollution;
   title "Listing of Air Pollution Data Set Merged with City Names";
run;

Sample SAS Program C8

libname mylib "U:\Documents\SAS_Workshop_Spring2016\Data\";

data cleaned;
   set mylib.pollution;
   if _N_ = 31 then delete;
run;

ods pdf file="U:\Documents\SAS_Workshop_Spring2016\c8_out.pdf";
proc reg data=cleaned plots(only)=(criteria cp(label));
   model SO2 = AvTemp--PrecipDays / selection=rsquare start=2 stop=4 best=4 cp sse mse;
   title "Models fitted with all explanatory variables (with Obs #31 deleted)";
run;
ods pdf close;

Model Building: Variable Selection in Regression

- The aim of variable selection methods is to identify a subset of the k predictors (i.e., x-variables) that has good predictive power.
- Classical methods are based on entering (forward selection) or deleting (backward elimination) a single variable at a time from the current model, based on p-values.
- The significance of a candidate variable is calculated using an F-statistic that compares the current model with the model including the new variable. If the variable is significant at the significance level for entry (called sle in SAS), i.e., if its p-value is smaller than sle, the variable is entered.
- The same process is used for deleting variables: a variable is deleted if it is not significant at the significance level for stay (called sls in SAS).
- The stepwise selection method combines these two methods: in each iteration, a forward selection step is followed by a backward elimination step.

All Subset Selection Method

- Suppose we start with k predictor variables: x1, x2, ..., xk.
- Fit all models of size p, for p = 1, ..., k, i.e., all 1-variable models, all 2-variable models, etc.
- Pick the best among the models of each size, where "best" is defined as having the largest R2.
- Select a single best model from among these using a criterion such as Cp (or AIC), BIC, or adjusted R2.
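The all-subset procedure above is what proc reg's selection=rsquare option carries out. To make the mechanics concrete, here is a minimal sketch in Python/numpy (a hypothetical illustration on simulated data, not part of the workshop's SAS programs): it fits every subset of each size by least squares and keeps the subset with the largest R2 per size.

```python
import itertools
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - sse / sst

def best_subsets(X, y):
    """For each model size p, return (best column subset, its R^2)."""
    k = X.shape[1]
    best = {}
    for p in range(1, k + 1):
        for cols in itertools.combinations(range(k), p):
            r2 = r_squared(X[:, list(cols)], y)
            if p not in best or r2 > best[p][1]:
                best[p] = (cols, r2)
    return best

# toy data: y depends only on columns 0 and 2
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=50)

for p, (cols, r2) in sorted(best_subsets(X, y).items()):
    print(p, cols, round(r2, 3))
```

Note that the best R2 can only increase with model size, which is exactly why the final choice among the per-size winners must use a penalized criterion such as Cp, AIC, BIC, or adjusted R2 rather than R2 itself.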
- For the selected model to be unbiased, we would like Cp to be close to p or smaller.
- AIC and Cp are equivalent for models with normal errors.
- Generally we select the model that has the lowest BIC value.
- There is no guarantee that the selected model will perform well when accuracy of predicting new observations is of interest.
- A standard approach for assessing the predictive ability of different regression models is to evaluate their performance on a hold-out data set (often called the validation data set). When a sufficiently large data set is available, this is usually achieved by randomly splitting the data into a training data set and a validation data set.
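The two ideas above, the Cp criterion and hold-out validation, can both be illustrated with a short Python/numpy sketch (again a hypothetical example on simulated data; the function names are our own). It computes Mallows' Cp = SSE_p / MSE_full - (n - 2p) for a candidate submodel, where p counts parameters including the intercept, and then measures that submodel's mean squared prediction error on a random validation split.

```python
import numpy as np

def ols_fit(X, y):
    """OLS fit of y on X with intercept; returns (SSE, coefficient vector)."""
    Xi = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xi, y, rcond=None)
    resid = y - Xi @ beta
    return float(resid @ resid), beta

def mallows_cp(X_sub, X_full, y):
    """Mallows' Cp of the submodel relative to the full model."""
    n = len(y)
    sse_full, _ = ols_fit(X_full, y)
    mse_full = sse_full / (n - X_full.shape[1] - 1)   # full-model error variance estimate
    sse_p, _ = ols_fit(X_sub, y)
    p = X_sub.shape[1] + 1                            # parameters incl. intercept
    return sse_p / mse_full - (n - 2 * p)

# toy data: only column 0 matters; random 60/20 train/validation split
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
y = 1.0 + 3.0 * X[:, 0] + rng.normal(scale=0.7, size=80)
idx = rng.permutation(80)
train, valid = idx[:60], idx[60:]

# Cp for the one-variable model, computed on the training data (expect Cp near p = 2)
cp = mallows_cp(X[train][:, [0]], X[train], y[train])

# hold-out MSE of the same one-variable model on the validation data
_, beta = ols_fit(X[train][:, [0]], y[train])
pred = beta[0] + X[valid][:, 0] * beta[1]
mse_valid = float(((y[valid] - pred) ** 2).mean())
print(round(cp, 2), round(mse_valid, 3))
```

Because the submodel here is correct, its Cp comes out close to p = 2 and its hold-out MSE close to the true error variance; a submodel that dropped the important predictor would show a much larger value of both.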