Scatterplot Smoothing Using PROC LOESS and Restricted Cubic Splines
Jonas V. Bilenas
Barclays Global Retail Bank/UK
Adjunct Faculty, Saint Joseph University, School of Business
June 23, 2011

Introduction
• In this tutorial we will look at two scatterplot smoothing techniques:
  – The LOESS Procedure:
    • Non-parametric regression smoothing (local regression, also called DWLS: Distance Weighted Least Squares).
  – Restricted Cubic Splines:
    • Parametric smoothing that can be used in regression procedures to fit functional models.

SUG, RUG, & LUG Pictures

LOESS documentation from SAS
• The LOESS procedure implements a nonparametric method for estimating regression surfaces pioneered by Cleveland, Devlin, and Grosse (1988), Cleveland and Grosse (1991), and Cleveland, Grosse, and Shyu (1992). The LOESS procedure allows great flexibility because no assumptions about the parametric form of the regression surface are needed.
• The main features of the LOESS procedure are as follows:
  – fits nonparametric models
  – supports the use of multidimensional data
  – supports multiple dependent variables
  – supports both direct and interpolated fitting that uses kd trees
  – performs statistical inference
  – performs automatic smoothing parameter selection
  – performs iterative reweighting to provide robust fitting when there are outliers in the data
  – supports graphical displays produced through ODS Graphics

LOESS Procedure Details
• LOESS fits a local regression function to the data within a chosen neighborhood of points.
• The radius of each neighborhood is chosen so that the neighborhood contains a specified fraction of the data points. This fraction is specified by a smoothing parameter (0 < smooth <= 1). The larger the smoothing parameter, the smoother the fitted function.
  – The default smoothing parameter is 0.5.
  – The smoothing parameter can also be optimized with the SELECT= option:
    • AICC specifies the AICC criterion.
    • AICC1 specifies the AICC1 criterion.
    • GCV specifies the generalized cross validation criterion.
• The regression procedure performs a fit weighted by the distance of points from the center of the neighborhood.
• Missing values are deleted.

Example of some LOESS
    proc loess data=sashelp.cars;
      ods output outputstatistics=outstay;
      model MPG_Highway=MSRP / smooth=0.8 alpha=.05 all;
    run;

Fit Summary
    Fit Method                          kd Tree
    Blending                            Linear
    Number of Observations              428
    Number of Fitting Points            9
    kd Tree Bucket Size                 68
    Degree of Local Polynomials         1
    Smoothing Parameter                 0.80000
    Points in Local Neighborhood        342
    Residual Sum of Squares             8913.89292
    Trace[L]                            3.77247
    GCV                                 0.04953
    AICC                                4.05885
    AICC1                               1737.19028
    Delta1                              424.12399
    Delta2                              424.20690
    Equivalent Number of Parameters     3.66893
    Lookup Degrees of Freedom           424.04109
    Residual Standard Error             4.58445

Example of some LOESS
    proc sort data=outstay;
      by pred;
    run;

    axis1 label = (angle=90 "MPG HIGHWAY");
    axis2 label = (h=1.5 "MSRP");
    symbol1 i=none c=black v=dot h=0.5;
    symbol2 i=j value=none color=red l=1 width=30;

    proc gplot data=outstay;
      plot (depvar pred)*MSRP / overlay haxis=axis2 vaxis=axis1 grid;
      title "LOESS Smooth=0.8";
    run; quit;

LOESS with ODS GRAPHICS
    ods html;
    ods graphics on;
    proc loess data=sashelp.cars;
      model MPG_Highway=MSRP / smooth=(0.5 0.6 0.7 0.8) alpha=.05 all;
    run;
    ods graphics off;
    ods html close;

Optimized LOESS
    ods html;
    ods graphics on;
    proc loess data=sashelp.cars;
      model MPG_Highway=MSRP / SELECT=AICC;
    run;
    ods graphics off;
    ods html close;

LOESS in SGPLOT
    ods html;
    ods graphics on;
    title 'LOESS/SMOOTH=0.60';
    proc sgplot data=sashelp.cars;
      loess x=MSRP y=MPG_Highway / smooth=0.60;
    run; quit;
    ods graphics off;
    ods html close;

Optimized LOESS
    ods html;
    ods graphics on;
    proc loess data=sashelp.cars;
      model MPG_Highway=MSRP Horsepower / SELECT=AICC;
    run;
    ods graphics off;
    ods html close;

LOESS for Time Series Plots
    ods html;
    ods graphics on;
    title 'Time series plot';
    proc loess data=ENSO;
      model Pressure = Month /
            SMOOTH=0.1 0.2 0.3 0.4;
    run; quit;
    ods graphics off;
    ods html close;
• Data from Cohen (SUGI 24).
• Data also online: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_loess_sect033.htm

LOESS for Time Series Plots (AICC optimized)

Large Number of Observations
• Peter Flom blog: http://www.statisticalanalysisconsulting.com/scatterplots-dealing-with-overplotting/
• Set PLOTS(MAXPOINTS= ) in PROC LOESS. The default limit is 5000.
• Run PROC LOESS on all the data, but plot after binning the independent variable and computing means within each bin.

    proc loess data=test; /* output 300 for each record */
      ods output outputstatistics=outstay;
      model MPG_Highway=horsepower / smooth=0.4;
    run;

    proc rank data=outstay groups=100 ties=low out=ranked;
      var horsepower;
      ranks r_horsepower;
    run;

    proc means data=ranked noprint nway;
      class r_horsepower;
      var depvar pred Horsepower;
      output out=means mean=;
    run;

    axis1 label = (angle=90 "MPG HIGHWAY");
    axis2 label = (h=1.5 "Horsepower");
    symbol1 i=none c=black v=dot h=0.5;
    symbol2 i=j value=none color=red l=1 width=10;

    proc gplot data=means;
      plot (depvar pred)*Horsepower / overlay haxis=axis2 vaxis=axis1 grid;
      title "LOESS Smooth=0.4";
    run; quit;

Restricted Cubic Splines
• Recommended by Frank Harrell.
• Knots are specified in advance.
• Exact placement of knots is not critical. Knots are usually placed at predetermined percentiles, with the number of knots (k) based on sample size:

    k    Quantiles
    3    .10    .5     .90
    4    .05    .35    .65    .95
    5    .05    .275   .5     .725   .95
    6    .05    .23    .41    .59    .77    .95
    7    .025   .1833  .3417  .5     .6583  .8167  .975

Restricted Cubic Splines
• Percentile values can be derived using PROC UNIVARIATE.
• The number of knots can be optimized by choosing the number that minimizes AICC.
• Provides a parametric regression function.
• Knot transformations can make interpretation difficult.
• Interaction terms may be difficult to incorporate.
• Much more efficient than categorizing continuous variables into dummy terms.
• Macro available:
  – http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt

Restricted Cubic Splines
    proc univariate data=sashelp.cars noprint;
      var horsepower;
      output out=knots pctlpre=P_ pctlpts=5 27.5 50 72.5 95;
    run;

    proc print data=knots;
    run;

    Obs    P_5    P_27_5    P_50    P_72_5    P_95
      1    115       170     210       245     340

Restricted Cubic Splines
    options nocenter mprint;
    data test;
      set sashelp.cars;
      %rcspline (horsepower,115, 170, 210, 245, 340);
    run;

LOG:
    MPRINT(RCSPLINE): DROP _kd_;
    MPRINT(RCSPLINE): _kd_= (340 - 115)**.666666666666 ;
    MPRINT(RCSPLINE): horsepower1=max((horsepower-115)/_kd_,0)**3+((245-115)*max((horsepower-340)/_kd_,0)**3 -(340-115)*max((horsepower-245)/_kd_,0)**3)/(340-245);
    MPRINT(RCSPLINE): ;
    MPRINT(RCSPLINE): horsepower2=max((horsepower-170)/_kd_,0)**3+((245-170)*max((horsepower-340)/_kd_,0)**3 -(340-170)*max((horsepower-245)/_kd_,0)**3)/(340-245);
    MPRINT(RCSPLINE): ;
    MPRINT(RCSPLINE): horsepower3=max((horsepower-210)/_kd_,0)**3+((245-210)*max((horsepower-340)/_kd_,0)**3 -(340-210)*max((horsepower-245)/_kd_,0)**3)/(340-245);
    MPRINT(RCSPLINE): ;
    43 run;

Restricted Cubic Splines
    proc reg data=test;
      model MPG_Highway = horsepower horsepower1 horsepower2 horsepower3;
      LINEAR: TEST horsepower1, horsepower2, horsepower3;
    run; quit;

Analysis of Variance
    Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model               4        8147.64458     2036.91115     145.37    <.0001
    Error             423        5926.86710       14.01151
    Corrected Total   427             14075

    Root MSE           3.74319    R-Square    0.5789
    Dependent Mean    26.84346    Adj R-Sq    0.5749
    Coeff Var         13.94453

Parameter Estimates
    Variable       Label        DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
    Intercept      Intercept     1              63.32145           2.50445      25.28     <.0001
    Horsepower                   1              -0.22900           0.01837     -12.46     <.0001
    horsepower1                  1               0.83439           0.12653       6.59     <.0001
    horsepower2                  1              -2.53834           0.49019      -5.18     <.0001
    horsepower3                  1               2.55417           0.66356       3.85     0.0001

Test LINEAR Results for Dependent Variable MPG_Highway
    Source         DF    Mean Square    F Value    Pr > F
    Numerator       3      750.78949      53.58    <.0001
    Denominator   423       14.01151
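
The spline terms in the MPRINT log above follow one pattern, which can be written as a formula. This is a sketch read off the generated code for the five knots shown. The symbols $t_1 < \dots < t_k$ (knots), $x$ (the predictor, here horsepower), $X_j$ (the constructed terms horsepower1 to horsepower3), and $\beta_j$ are notation introduced here for illustration. They do not come from the macro.

```latex
% Notation (assumed for illustration): (u)_+ = max(u, 0); t_1 < ... < t_k are the knots.
% Scaling constant in the log: _kd_ = (t_k - t_1)^{2/3}, so _kd_^3 = (t_k - t_1)^2.
\[
X_j(x) \;=\; \frac{1}{(t_k - t_1)^{2}}
\left[
(x - t_j)_+^{3}
\;-\; \frac{t_k - t_j}{t_k - t_{k-1}}\,(x - t_{k-1})_+^{3}
\;+\; \frac{t_{k-1} - t_j}{t_k - t_{k-1}}\,(x - t_k)_+^{3}
\right],
\qquad j = 1, \dots, k-2 .
\]
% With k = 5 knots the regression fitted by PROC REG above has the form
\[
E[\,y \mid x\,] \;=\; \beta_0 + \beta_1 x + \beta_2 X_1(x) + \beta_3 X_2(x) + \beta_4 X_3(x),
\]
% and the LINEAR test statement corresponds to H_0: \beta_2 = \beta_3 = \beta_4 = 0.
```

For example, with $t_1 = 115$, $t_4 = 245$, $t_5 = 340$ and $j = 1$, the bracketed expression reproduces the horsepower1 assignment in the log. Under this construction, $k$ knots contribute $k-2$ nonlinear terms in addition to the linear term, and rejecting $H_0$ indicates that the relationship is not adequately described by a straight line.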
Restricted Cubic Splines (5 Knots)

Restricted Cubic Splines (7 Knots): Time Series Data
• Regression terms not significant

References
• Akaike, H. (1973), “Information Theory and an Extension of the Maximum Likelihood Principle,” in Petrov and Csaki, eds., Proceedings of the Second International Symposium on Information Theory, 267–281.
• Cleveland, W. S., Devlin, S. J., and Grosse, E. (1988), “Regression by Local Fitting,” Journal of Econometrics, 37, 87–114.
• Cleveland, W. S. and Grosse, E. (1991), “Computational Methods for Local Regression,” Statistics and Computing, 1, 47–62.
• Cohen, R. A. (SUGI 24), “An Introduction to PROC LOESS for Local Regression,” Paper 273-24.
• Harrell, F. (2010), “Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis,” Springer Series in Statistics, Springer.
• Harrell RCSPLINE MACRO:
  – http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt
• Stone, C. J. and Koo, C. Y. (1985), “Additive Splines in Statistics,” in Proceedings of the Statistical Computing Section ASA, 45–48, Washington, DC.