Smoothing Scatterplots Using Penalized Splines

What do we mean by smoothing?
Fitting a "smooth" curve to the data in a
scatterplot
Why would we want to fit a smooth curve
to the data in a scatterplot?
Imagine the model
$y_i = f(x_i) + e_i \quad (i = 1, \dots, n)$,
where $e_1, \dots, e_n$ are independent with mean 0, and
$f$ is some unknown smooth function.
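To make the model concrete, here is a minimal R sketch that simulates data from it. The particular choices (f(x) = sin(2*pi*x), n = 100, normal errors with standard deviation 0.3) are assumptions made only for illustration; in practice f is unknown.

```r
# Simulate n observations from y_i = f(x_i) + e_i.
# f, n, and the error distribution below are illustrative assumptions.
set.seed(511)
n <- 100
x <- sort(runif(n))
f <- function(x) sin(2 * pi * x)
e <- rnorm(n, mean = 0, sd = 0.3)   # independent errors with mean 0
y <- f(x) + e

plot(x, y, pch = 19, col = 4)
curve(f, add = TRUE, lwd = 2)       # the true f, normally unknown
```

The goal of smoothing is to recover something close to the plotted curve using only the points.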
If the subject matter underlying the data
set tells us nothing about a parametric
form for f, we may prefer to let the data
suggest a curve rather than concocting
some parametric function that we hope
will fit the data well.
The estimated curve might help us see
features of the data that are obscured by
variation or simply provide a nice
summary of the relationship between y
and x.
d=read.delim(
  "http://www.public.iastate.edu/~dnett/S511/Diabetes.txt")
head(d)
  subject  age acidity   y
1       1  5.2    -8.1 4.8
2       2  8.8   -16.1 4.1
3       3 10.5    -0.9 5.2
4       4 10.6    -7.8 5.5
5       5 10.4   -29.0 5.0
6       6  1.8   -19.2 3.4
#Variables are
#subject: subject ID number
#age: age diagnosed with diabetes
#acidity: a measure of acidity called base deficit
#y: natural log of serum C-peptide concentration
#Original source is Sockett et al. (1987)
#mentioned in Hastie and Tibshirani's book
#"Generalized Additive Models".
Again consider the model
$y_i = f(x_i) + e_i \quad (i = 1, \dots, n)$,
where $e_1, \dots, e_n$ are independent with mean 0, and
$f$ is some unknown smooth function.
Some Strategies for Choosing the
Smoothing Parameter
1. Cross-Validation (CV)
2. Generalized Cross-Validation (GCV)
3. Linear Mixed Effects Model Approach
There are other approaches, but we will
restrict our discussion to the methods
above.
1. Cross-Validation (CV):
CV is a general strategy for choosing
"tuning" parameters like our smoothing
parameter $\lambda^2$.
These are parameters whose values are
not of interest except for the fact that
they affect estimates of the model
parameters that are of interest.
We will talk specifically about leave-one-out
cross-validation, which is a special
case of cross-validation.
This approach is known as PRESS
(PRediction Error Sum of Squares) when
it is used to select variables in multiple
regression.
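The leave-one-out idea can be sketched in R for a penalized linear spline. This is an illustration under stated assumptions, not the slides' own code: the data are simulated, the knots are placed at sample quantiles, the fit is assumed to take the form X(X'X + lambda^2 D)^{-1}X'y with a truncated-line basis, and the names smoother, cv, and cv_brute are hypothetical helpers. The sketch compares refitting n times with the shortcut that needs only the diagonal of the smoother matrix.

```r
# Leave-one-out CV for a penalized linear spline (truncated-line basis).
# Simulated data, knot placement, and the lambda^2 grid are assumptions.
set.seed(511)
n <- 100
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# 10 interior knots at sample quantiles of x
knots <- quantile(unique(x), seq(0, 1, length.out = 12)[-c(1, 12)])
X <- cbind(1, x, outer(x, knots, function(x, k) pmax(x - k, 0)))
D <- diag(c(0, 0, rep(1, length(knots))))  # penalize only knot coefficients

# Smoother matrix S such that fitted values are S %*% y
smoother <- function(lambda2) X %*% solve(crossprod(X) + lambda2 * D, t(X))

# Shortcut: sum of squared deleted residuals from a single fit
cv <- function(lambda2) {
  S <- smoother(lambda2)
  sum(((y - S %*% y) / (1 - diag(S)))^2)
}

# Brute force: refit n times, leaving out one observation each time
cv_brute <- function(lambda2) {
  press <- 0
  for (i in seq_len(n)) {
    Xi <- X[-i, , drop = FALSE]
    beta <- solve(crossprod(Xi) + lambda2 * D, crossprod(Xi, y[-i]))
    press <- press + (y[i] - sum(X[i, ] * beta))^2
  }
  press
}

# Choose the smoothing parameter that minimizes CV over a grid
grid <- 10^seq(-4, 2, length.out = 50)
cv_values <- sapply(grid, cv)
best <- grid[which.min(cv_values)]
```

For this kind of penalized least squares fit the two functions agree up to rounding error, so the shortcut avoids the n refits.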
[Figures: three plots labeled DF=10, DF=2, and DF=3.59]
The source for the information in these
slides is
Ruppert, D., Wand, M.P., Carroll, R.J.
(2003). Semiparametric Regression.
Cambridge University Press, New York.
d=read.delim(
  "http://www.public.iastate.edu/~dnett/S511/Diabetes.txt")
head(d)
  subject  age acidity   y
1       1  5.2    -8.1 4.8
2       2  8.8   -16.1 4.1
3       3 10.5    -0.9 5.2
4       4 10.6    -7.8 5.5
5       5 10.4   -29.0 5.0
6       6  1.8   -19.2 3.4
#Variables are
#subject: subject ID number
#age: age diagnosed with diabetes
#acidity: a measure of acidity called base deficit
#y: natural log of serum C-peptide concentration
#Original source is Sockett et al. (1987)
#mentioned in Hastie and Tibshirani's book
#"Generalized Additive Models".
#First install the package SemiPar.
#Then issue the following commands.
#Load the package SemiPar.
library(SemiPar)
#spm does not allow a data argument.
o=spm(d$y~f(d$age,basis="trunc.poly",degree=1))
summary(o)
Summary for non-linear components:
           df  spar knots
f(d$age) 3.59 5.705     8
Note this includes 1 df for the intercept.
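One common way to read a df value like this is as the trace of the smoother matrix, the definition used in Ruppert, Wand, and Carroll (2003). The sketch below assumes that definition and a truncated-line basis with 8 knots; the x values are simulated, the knot placement is an assumption, and df_fit is a hypothetical helper, so it illustrates the idea rather than reproducing the spm output.

```r
# Degrees of freedom of a penalized linear spline as tr(S).
# Simulated x values and 8 knots at sample quantiles are assumptions.
set.seed(511)
x <- sort(runif(50, min = 1, max = 16))
knots <- quantile(unique(x), seq(0, 1, length.out = 10)[-c(1, 10)])
X <- cbind(1, x, outer(x, knots, function(x, k) pmax(x - k, 0)))
D <- diag(c(0, 0, rep(1, length(knots))))  # penalize only knot coefficients

df_fit <- function(lambda2) {
  A <- crossprod(X) + lambda2 * D
  # tr{(X'X + lambda^2 D)^{-1} X'X} equals the trace of the smoother matrix
  sum(diag(solve(A, crossprod(X))))
}

df_fit(0)                            # no penalty: 2 + 8 = 10 basis functions
df_fit(1e8)                          # heavy penalty: close to 2, a straight line
sapply(c(0.1, 1, 10, 100), df_fit)   # df decreases as lambda^2 grows
```

Under these assumptions the smoothing parameter moves the fit between 10 df (an unpenalized linear spline with 8 knots) and 2 df (ordinary simple linear regression).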
plot(d$age,d$y,pch=19,col=4,
     xlab="Age at Diagnosis",
     ylab="Log C-Peptide Concentration",
     main=expression(paste(
       "Linear Spline Fit with ",lambda^2,"=5.7")))
lines(o,shade=FALSE,se=FALSE)
plot(o)
#Load the data set fossil that comes
#with the SemiPar package.
data(fossil)
head(fossil)
        age strontium.ratio
1  91.78525        0.707343
2  92.39579        0.707359
3  93.97061        0.707410
4  95.57577        0.707438
5  95.60286        0.707463
6 112.33691        0.707320
dim(fossil)
[1] 106   2
#Shows relationship between strontium
#ratios of ocean fossils and their age
#in millions of years. The dip just less
#than 115 million years ago coincides
#with the mid-plate volcanic activity.
#See Bralower et al. (1997).
#Mid-Cretaceous strontium isotope
#stratigraphy of deep-sea sections.
#Geological Society of America Bulletin
#109, 1421-1442.
plot(fossil)
y=fossil$strontium.ratio
x=fossil$age
o=spm(y~f(x,basis="trunc.poly",degree=1))
summary(o)
Summary for non-linear components:
        df  spar knots
f(x) 12.76 1.324    25
Note this includes 1 df for the intercept.
plot(fossil)
lines(o,se=F)
#Try penalized quadratic splines
#rather than linear splines.
o=spm(y~f(x,basis="trunc.poly",degree=2))
summary(o)
Summary for non-linear components:
        df  spar knots
f(x) 10.06 2.243    25
Note this includes 1 df for the intercept.
plot(fossil)
lines(o,se=F)
#The next set of notes covers the lowess
#(or loess) smoother.
o=lowess(y~x,f=.2)
plot(fossil)
lines(o,lwd=2)
#See also the function 'loess', which has more
#capabilities than 'lowess'.
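As a rough illustration of what loess adds, the sketch below fits a local linear smoother and draws approximate pointwise bands from its standard errors, which lowess does not return. The data are simulated as a stand-in (an assumption, so that the example runs without SemiPar), and span=0.2 is chosen only to mirror f=.2 above. Note that the two functions have different defaults (for example, lowess performs robustness iterations by default), so their fits need not coincide.

```r
# Sketch of loess with standard errors; simulated data are an assumption.
set.seed(511)
x <- sort(runif(100, 90, 125))
y <- sin(x / 5) + rnorm(100, sd = 0.2)

fit <- loess(y ~ x, span = 0.2, degree = 1)   # local linear fit

# Predict on a grid inside the range of x, with standard errors
grid <- data.frame(x = seq(min(x), max(x), length.out = 200))
pred <- predict(fit, newdata = grid, se = TRUE)

plot(x, y)
lines(grid$x, pred$fit, lwd = 2)
lines(grid$x, pred$fit + 2 * pred$se.fit, lty = 2)   # rough upper band
lines(grid$x, pred$fit - 2 * pred$se.fit, lty = 2)   # rough lower band
```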