Restricted Cubic Splines

Scatterplot Smoothing Using PROC LOESS and Restricted Cubic Splines
Jonas V. Bilenas
Barclays Global Retail Bank/UK
Adjunct Faculty, Saint Joseph University,
School of Business
June 23, 2011
Introduction
• In this tutorial we will look at 2 scatterplot
smoothing techniques:
– The LOESS Procedure:
• Non-parametric regression smoothing (local regression or DWLS; Distance
Weighted Least Squares).
– Restricted Cubic Splines:
• Parametric smoothing that can be used in regression procedures to fit
functional models.
SUG, RUG, & LUG Pictures
LOESS documentation from SAS
• The LOESS procedure implements a nonparametric method for estimating
regression surfaces pioneered by Cleveland, Devlin, and Grosse (1988),
Cleveland and Grosse (1991), and Cleveland, Grosse, and Shyu (1992). The
LOESS procedure allows great flexibility because no assumptions about the
parametric form of the regression surface are needed.
• The main features of the LOESS procedure are as follows:
– fits nonparametric models
– supports the use of multidimensional data
– supports multiple dependent variables
– supports both direct and interpolated fitting that uses kd trees
– performs statistical inference
– performs automatic smoothing parameter selection
– performs iterative reweighting to provide robust fitting when there are outliers in the data
– supports graphical displays produced through ODS Graphics
LOESS Procedure Details
• LOESS fits a local regression function to the data within a
chosen neighborhood of points.
• The radius of each neighborhood is chosen so that the
neighborhood contains a specified fraction of the data
points. This fraction is set by the smoothing parameter
(0 < smooth <= 1). The larger the smoothing parameter,
the smoother the fitted function.
– The default smoothing parameter is 0.5.
– The smoothing parameter can also be selected automatically with SELECT=:
• AICC specifies the AICC criterion.
• AICC1 specifies the AICC1 criterion.
• GCV specifies the generalized cross validation criterion.
• The regression procedure performs a fit weighted by the
distance of points from the center of the neighborhood. Missing
values are deleted.
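The distance-weighted local fit described above can be written as a weighted least squares problem. The sketch below assumes a local linear fit (degree 1) in a single regressor and the tricube weight function commonly used for LOESS. The symbols x_0 (the point being fitted), N(x_0) (its neighborhood), q (the neighborhood radius), and w_i (the weights) are notation introduced here for illustration and do not come from the slides.

```latex
% Local linear fit at x_0; the fitted value at x_0 is \hat\beta_0.
\[
(\hat\beta_0, \hat\beta_1)
  = \arg\min_{\beta_0,\,\beta_1}
    \sum_{i \in N(x_0)} w_i \,\bigl( y_i - \beta_0 - \beta_1 (x_i - x_0) \bigr)^2 ,
\qquad
w_i = \Bigl( 1 - \bigl( \tfrac{|x_i - x_0|}{q} \bigr)^{3} \Bigr)^{3} .
\]
```

Points near x_0 receive weights close to 1 and points near the edge of the neighborhood receive weights close to 0, which is why a larger smoothing parameter (a wider neighborhood) gives a smoother curve.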
Example of some LOESS
proc loess data=sashelp.cars;
ods output outputstatistics=outstay;
model MPG_Highway=MSRP
/smooth=0.8 alpha=.05 all;
run;
Fit Summary
Fit Method                          kd Tree
Blending                            Linear
Number of Observations              428
Number of Fitting Points            9
kd Tree Bucket Size                 68
Degree of Local Polynomials         1
Smoothing Parameter                 0.80000
Points in Local Neighborhood        342
Residual Sum of Squares             8913.89292
Trace[L]                            3.77247
GCV                                 0.04953
AICC                                4.05885
AICC1                               1737.19028
Delta1                              424.12399
Delta2                              424.20690
Equivalent Number of Parameters     3.66893
Lookup Degrees of Freedom           424.04109
Residual Standard Error             4.58445
Example of some LOESS
proc sort data=outstay;
by MSRP; /* i=j joins points in data order, so sort by the x variable */
run;
axis1 label = (angle=90 "MPG HIGHWAY");
axis2 label = (h=1.5 "MSRP");
symbol1 i=none c=black v=dot h=0.5;
symbol2 i=j value=none color=red l=1 width=30;
proc gplot data=outstay;
plot (depvar pred)*MSRP
/ overlay
haxis=axis2
vaxis=axis1
grid;
title "LOESS Smooth=0.8";
run;quit;
LOESS with ODS GRAPHICS
ods html;
ods graphics on;
proc loess data=sashelp.cars;
model MPG_Highway=MSRP
/smooth=(0.5 0.6 0.7 0.8) alpha=.05 all;
run;
ods graphics off;
ods html close;
Optimized LOESS
ods html;
ods graphics on;
proc loess data=sashelp.cars;
model MPG_Highway=MSRP
/ SELECT=AICC;
run;
ods graphics off;
ods html close;
LOESS in SGPLOT
ods html;
ods graphics on;
title 'LOESS/SMOOTH=0.60';
proc sgplot data=sashelp.cars;
loess x=MSRP y=MPG_Highway / smooth=0.60;
run; quit;
ods graphics off;
ods html close;
Optimized LOESS
ods html;
ods graphics on;
proc loess data=sashelp.cars;
model MPG_Highway=MSRP
Horsepower
/ SELECT=AICC;
run;
ods graphics off;
ods html close;
LOESS for Time Series Plots
ods html;
ods graphics on;
title 'Time series plot';
proc loess data=ENSO;
model Pressure = Month
/ SMOOTH=0.1 0.2 0.3 0.4;
run; quit;
ods graphics off;
ods html close;
Data from Cohen (SUGI 24).
Data also online:
http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_loess_sect033.htm
LOESS for Time Series Plots (AICC optimized)
Large Number of Observations
• See the Peter Flom blog on overplotting:
http://www.statisticalanalysisconsulting.com/scatterplots-dealing-with-overplotting/
• Set PLOTS(MAXPOINTS= ) in PROC LOESS. The default limit is 5000.
• Alternatively, run PROC LOESS on all of the data, then bin the independent variable
and plot the means of the binned data.
proc loess data=test; /* output 300 for each record */
ods output outputstatistics=outstay;
model MPG_Highway=horsepower
/smooth=0.4 ;
run;
proc rank data=outstay groups=100 ties=low out=ranked;
var horsepower;
ranks r_horsepower;
run;
proc means data=ranked noprint nway;
class r_horsepower;
var depvar pred Horsepower;
output out=means mean=;
run;
axis1 label = (angle=90 "MPG HIGHWAY");
axis2 label = (h=1.5 "Horsepower");
symbol1 i=none c=black v=dot h=0.5;
symbol2 i=j value=none color=red l=1 width=10;
proc gplot data=means;
plot (depvar pred)*Horsepower
/ overlay
haxis=axis2
vaxis=axis1
grid;
title "LOESS Smooth=0.4";
run;quit;
Large Number of Observations
Restricted Cubic Splines
• Recommended by Frank Harrell.
• Knots are specified in advance.
• The placement of knots is not critical. Knots are usually placed at predetermined
percentiles, with the number of knots (k) based on sample size:

k    Quantiles
3    .10   .5    .90
4    .05   .35   .65   .95
5    .05   .275  .5    .725  .95
6    .05   .23   .41   .59   .77   .95
7    .025  .1833 .3417 .5    .6583 .8167 .975
Restricted Cubic Splines
• Percentile values can be derived using PROC UNIVARIATE.
• The number of knots can be optimized by choosing the number that minimizes
AICC.
• Provides a parametric regression function.
• Sometimes knot transformations make for difficult interpretation.
• May be difficult to incorporate interaction terms.
• Much more efficient than categorizing continuous variables into dummy
terms.
• Macro available:
• http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt
Restricted Cubic Splines
proc univariate data=sashelp.cars noprint;
var horsepower;
output out=knots pctlpre=P_ pctlpts=5 27.5 50 72.5 95;
run;
proc print data=knots; run;
Obs    P_5    P_27_5    P_50    P_72_5    P_95
  1    115       170     210       245     340
Restricted Cubic Splines
options nocenter mprint;
data test;
set sashelp.cars;
%rcspline (horsepower,115, 170, 210, 245, 340);
run;
LOG:
MPRINT(RCSPLINE): DROP _kd_;
MPRINT(RCSPLINE): _kd_= (340 - 115)**.666666666666 ;
MPRINT(RCSPLINE): horsepower1=max((horsepower-115)/_kd_,0)**3+((245-115)*max((horsepower-340)/_kd_,0)**3
                  -(340-115)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE): ;
MPRINT(RCSPLINE): horsepower2=max((horsepower-170)/_kd_,0)**3+((245-170)*max((horsepower-340)/_kd_,0)**3
                  -(340-170)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE): ;
MPRINT(RCSPLINE): horsepower3=max((horsepower-210)/_kd_,0)**3+((245-210)*max((horsepower-340)/_kd_,0)**3
                  -(340-210)*max((horsepower-245)/_kd_,0)**3)/(340-245);
MPRINT(RCSPLINE): ;
43   run;
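The logged assignments follow the usual restricted cubic spline basis. As a sketch in notation introduced here (knots t_1 < ... < t_k, with (u)_+ = max(u, 0)), each generated variable X_j for j = 1, ..., k-2 appears to be:

```latex
% Restricted cubic spline basis term j, read off from the MPRINT output.
\[
X_j =
\frac{ (x - t_j)_+^{3}
       \;-\; \dfrac{t_k - t_j}{t_k - t_{k-1}} \,(x - t_{k-1})_+^{3}
       \;+\; \dfrac{t_{k-1} - t_j}{t_k - t_{k-1}} \,(x - t_k)_+^{3} }
     { (t_k - t_1)^{2} } ,
\qquad j = 1, \dots, k-2 .
\]
```

With the five knots 115, 170, 210, 245, and 340, t_{k-1} = 245 and t_k = 340. Dividing each term by _kd_ = (340 - 115)^(2/3) before cubing is the same as dividing by (t_k - t_1)^2. Together with the original variable, this gives k - 1 = 4 regression terms. The function is linear below the first knot because every truncated term is zero there.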
Restricted Cubic Splines
proc reg data=test;
model MPG_Highway = horsepower horsepower1 horsepower2 horsepower3;
LINEAR: TEST horsepower1, horsepower2, horsepower3;
run; quit;
Analysis of Variance
                            Sum of          Mean
Source              DF     Squares        Square    F Value    Pr > F
Model                4  8147.64458    2036.91115     145.37    <.0001
Error              423  5926.86710      14.01151
Corrected Total    427       14075

Root MSE           3.74319    R-Square    0.5789
Dependent Mean    26.84346    Adj R-Sq    0.5749
Coeff Var         13.94453

Parameter Estimates
                                   Parameter     Standard
Variable       Label        DF      Estimate        Error    t Value    Pr > |t|
Intercept      Intercept     1      63.32145      2.50445      25.28      <.0001
Horsepower                   1      -0.22900      0.01837     -12.46      <.0001
horsepower1                  1       0.83439      0.12653       6.59      <.0001
horsepower2                  1      -2.53834      0.49019      -5.18      <.0001
horsepower3                  1       2.55417      0.66356       3.85      0.0001

Test LINEAR Results for Dependent Variable MPG_Highway
                          Mean
Source          DF      Square    F Value    Pr > F
Numerator        3   750.78949      53.58    <.0001
Denominator    423    14.01151
Restricted Cubic Splines (5 Knots)
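A plot of the 5-knot fit could be produced along the following lines. This is a minimal sketch, not necessarily how the original figure was made. It assumes the TEST data set and spline terms created earlier; the data set name rcs_fit and the variable name pred are illustrative.

```sas
/* Sketch: save fitted values from the spline regression and overlay them on the data. */
proc reg data=test noprint;
   model MPG_Highway = horsepower horsepower1 horsepower2 horsepower3;
   output out=rcs_fit p=pred;   /* rcs_fit and pred are illustrative names */
run; quit;

proc sort data=rcs_fit;
   by horsepower;               /* SERIES connects points in data order */
run;

title "Restricted Cubic Spline Fit (5 Knots)";
proc sgplot data=rcs_fit;
   scatter x=horsepower y=MPG_Highway;
   series  x=horsepower y=pred / lineattrs=(color=red);
run;
title;
```

Sorting by the x variable before drawing the series keeps the fitted line from zigzagging across the plot.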
Restricted Cubic Splines (7 Knots): Time Series Data
Regression terms not significant
References
• Akaike, H. (1973), “Information Theory and an Extension of the Maximum
Likelihood Principle,” in Petrov and Csaki, eds., Proceedings of the Second
International Symposium on Information Theory, 267–281.
• Cleveland, W. S., Devlin, S. J., and Grosse, E. (1988), “Regression by Local
Fitting,” Journal of Econometrics, 37, 87–114.
• Cleveland, W. S. and Grosse, E. (1991), “Computational Methods for Local
Regression,” Statistics and Computing, 1, 47–62.
• Cohen, R. A. (SUGI 24), “An Introduction to PROC LOESS for Local
Regression,” Paper 273-24.
• Harrell, F. (2010), “Regression Modeling Strategies: With Applications to
Linear Models, Logistic Regression, and Survival Analysis (Springer Series in
Statistics),” Springer.
• Harrell RCSPLINE MACRO:
– http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SasMacros/survrisk.txt
• Stone, C. J. and Koo, C. Y. (1985), “Additive Splines in Statistics,” in
Proceedings of the Statistical Computing Section ASA, pages 45–48,
Washington, DC.