11-2 • Why the penalty?

[Figure: the 11-point (x, y) example data with the two polynomial fits described below]
• Solid = quadratic polynomial: nonzero residuals, but “mild” 2nd derivatives
• Dashed = 10th-degree polynomial: zero residuals, but “wild” 2nd derivatives
• Adopt the convention $x_1 < x_2 < x_3 < \cdots < x_N$.
• “Natural” basis functions for cubic splines (slide 9-7):
  – $h_1(x) = 1$
  – $h_2(x) = x$
  – $h_{j+2}(x) = (x - x_j)^3_+ \;-\; \dfrac{x_N - x_j}{x_N - x_{N-1}}\,(x - x_{N-1})^3_+ \;+\; \dfrac{x_{N-1} - x_j}{x_N - x_{N-1}}\,(x - x_N)^3_+$, $\quad j = 1, 2, \ldots, N-2$
• Think about the form of “X-matrix” this makes:

$$
H_{N \times N} =
\begin{pmatrix}
\bullet & \bullet &         &         &        &         \\
\bullet & \bullet & \bullet &         &        &         \\
\bullet & \bullet & \bullet & \bullet &        &         \\
\vdots  & \vdots  & \vdots  & \vdots  & \ddots &         \\
\bullet & \bullet & \bullet & \bullet & \cdots & \bullet
\end{pmatrix}
$$

(each $\bullet$ marks a nonzero entry: the constant and linear columns are filled, while each truncated-cubic column is zero until $x$ passes its knot)
For the example:

$$
H =
\begin{pmatrix}
1 & 0.0 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
1 & 0.1 & 0.001 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
1 & 0.2 & 0.008 & 0.001 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
1 & 0.3 & 0.027 & 0.008 & 0.001 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
1 & 0.4 & 0.064 & 0.027 & 0.008 & 0.001 & 0.000 & 0.000 & 0.000 & 0.000 & 0.000 \\
1 & 0.5 & 0.125 & 0.064 & 0.027 & 0.008 & 0.001 & 0.000 & 0.000 & 0.000 & 0.000 \\
1 & 0.6 & 0.216 & 0.125 & 0.064 & 0.027 & 0.008 & 0.001 & 0.000 & 0.000 & 0.000 \\
1 & 0.7 & 0.343 & 0.216 & 0.125 & 0.064 & 0.027 & 0.008 & 0.001 & 0.000 & 0.000 \\
1 & 0.8 & 0.512 & 0.343 & 0.216 & 0.125 & 0.064 & 0.027 & 0.008 & 0.001 & 0.000 \\
1 & 0.9 & 0.729 & 0.512 & 0.343 & 0.216 & 0.125 & 0.064 & 0.027 & 0.008 & 0.001 \\
1 & 1.0 & 0.990 & 0.720 & 0.504 & 0.336 & 0.210 & 0.120 & 0.060 & 0.024 & 0.006
\end{pmatrix}
$$
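A minimal R sketch that rebuilds this matrix from the basis formula above (the variable names are illustrative, not from the notes):

# Build the N x N natural cubic spline basis matrix H for the example x's.
x <- seq(0, 1, by = 0.1)
N <- length(x)
pos3 <- function(u) pmax(u, 0)^3   # truncated cube (u)_+^3
H <- cbind(1, x)
for (j in 1:(N - 2)) {
  c1 <- (x[N] - x[j]) / (x[N] - x[N - 1])
  c2 <- (x[N - 1] - x[j]) / (x[N] - x[N - 1])
  H <- cbind(H, pos3(x - x[j]) - c1 * pos3(x - x[N - 1]) + c2 * pos3(x - x[N]))
}
round(H, 3)   # reproduces the matrix above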
OLS fit to the columns of $H$ (i.e. $\lambda = 0$):

[Figure: the resulting (interpolating) fit through all 11 data points]
Connection between “2 derivatives” and the functional form:
• $\frac{\partial^p}{\partial x^p}(x - \xi)^q_+$ is defined on each side of $\xi$ ... always 0 for $x \le \xi$.
• For $x > \xi$ it is:
  – a function that approaches 0 as $x \downarrow \xi$ if $p < q$
  – a nonzero constant if $p = q$
• So cubic splines preserve 2 derivatives ... 9th-order splines preserve 8 derivatives ...
11-3
• Let the $N$-dimensional row vector $h''(x)$ have $j$th entry $h''_j(x)$.
• Think of an “nrow=huge” matrix:
$$
H'' =
\begin{pmatrix}
h''(x_1) \\
h''(x_1 + \delta) \\
h''(x_1 + 2\delta) \\
\vdots \\
h''(x_N)
\end{pmatrix}
$$
• Then the limit of $\delta \,(H'')'\,H''$ as $\delta \to 0$ is $\Omega$, with entries $\Omega_{jk} = \int_a^b h''_j(x)\,h''_k(x)\,dx$.
• A sort of “$X'X$-matrix,” but based on derivatives rather than the original basis functions, and covering all of $[a, b]$ rather than just the $N$ data values.
• $\Rightarrow$ positive semi-definite: $\theta'\Omega\theta \ge 0$
• Note that for the basis functions we are using:
  – $h_1(x) = 1$, so $h''_1(x) = 0$
  – $h_2(x) = x$, so $h''_2(x) = 0$
  – $h_j(x) = (x - x_{j-2})^3_+ - c_1 (x - x_{N-1})^3_+ + c_2 (x - x_N)^3_+$, so
    $h''_j(x) = 6(x - x_{j-2})_+ - 6 c_1 (x - x_{N-1})_+ + 6 c_2 (x - x_N)_+$, $\quad j = 3, 4, \ldots, N$
• This means that the first two rows and columns of $\Omega$ contain only zeroes!
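A sketch of the “nrow=huge” construction of $\Omega$ for the example, continuing the objects from the earlier sketch ($\delta$ and the grid are my choices, so the limit is approximated rather than taken exactly):

# h''(t) as a length-N vector: zeros for the constant and linear basis
# functions, then second derivatives of the truncated-cubic functions.
hpp <- function(t) {
  out <- c(0, 0)
  for (j in 1:(N - 2)) {
    c1 <- (x[N] - x[j]) / (x[N] - x[N - 1])
    c2 <- (x[N - 1] - x[j]) / (x[N] - x[N - 1])
    out <- c(out, 6 * pmax(t - x[j], 0) - 6 * c1 * pmax(t - x[N - 1], 0) +
                  6 * c2 * pmax(t - x[N], 0))
  }
  out
}
delta <- 1e-4
grid <- seq(x[1], x[N], by = delta)
Hpp <- t(sapply(grid, hpp))       # the "nrow = huge" matrix
Omega <- delta * t(Hpp) %*% Hpp   # first two rows/columns are (near) zero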
11-4
• For $\lambda = 0$: OLS on the $N$ spline functions
• For very large $\lambda$: OLS on $(1, x)$
• For the example, with $\lambda = 0, 10^{-5}, 10^{-4}, \ldots, 1$:

[Figure: the family of fitted curves for these $\lambda$ values over the 11 data points]
11-6
• $S_\lambda = H(H'H + \lambda\Omega)^{-1}H'$
• For $\lambda = 0$: $S_0 = H(H'H)^{-1}H'$
• For $\lambda \to \infty$: with $X = (1, x)$, $S_\infty = X(X'X)^{-1}X'$
• ALL $S_\lambda$ map $Y$ into the column space of $H$; only $S_0$ yields the orthogonal projection.
11-7
For the example:
λ        trace(S_λ)
0        11
.00001   9.444893
.0001    7.107705
.001     4.731553
.01      3.154565
.1       2.272793
1        2.032943
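These traces can be checked numerically with the H and Omega objects from the sketches above (smoother is an illustrative name; agreement is only approximate because Omega was built by Riemann sum):

# Smoother matrix S_lambda = H (H'H + lambda * Omega)^{-1} H'.
smoother <- function(lambda) {
  H %*% solve(t(H) %*% H + lambda * Omega, t(H))
}
sum(diag(smoother(.001)))   # effective degrees of freedom at lambda = .001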
11-8
• Expression (5) is the same form as is used on slide 4, focusing on $\hat{Y}$ (a data smooth) rather than $\theta$ (a model) ...
• Original form: $(Y - H\theta)'(Y - H\theta) + \lambda\theta'\Omega\theta$
• $\hat{Y} = H\theta$, so $\theta = H^{-1}\hat{Y}$ (here $H$ is square and invertible)
• $(Y - HH^{-1}\hat{Y})'(Y - HH^{-1}\hat{Y}) + \lambda\hat{Y}'(H')^{-1}\Omega H^{-1}\hat{Y}$
• With $K = (H')^{-1}\Omega H^{-1}$:
• $(Y - \hat{Y})'(Y - \hat{Y}) + \lambda\hat{Y}'K\hat{Y}$
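A quick numerical check of this identity, continuing the sketch objects (the random $\theta$ is illustrative):

# K carries the curvature penalty from theta-space to Yhat-space.
K <- solve(t(H)) %*% Omega %*% solve(H)   # (H')^{-1} Omega H^{-1}
theta <- rnorm(ncol(H))
Yhat <- H %*% theta
all.equal(c(t(theta) %*% Omega %*% theta), c(t(Yhat) %*% K %*% Yhat))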
11-9
Eigenvalues of $S_\lambda$
For the example:
• $\lambda = .0001$: 1, 1, 0.9965, ..., 0.023628
• $\lambda = .1$: 1, 1, 0.22179, ..., 0.000024199
Fun facts about idempotent matrices ($A = AA$):
• trace = rank
• all eigenvalues are 0 or 1
For $\lambda \to \infty$, with $X = (1, x)$, $S_\infty = X(X'X)^{-1}X'$ is idempotent: rank = 2, eigenvalues 1, 1, 0, ...
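The quoted eigenvalues can be reproduced from the smoother sketch above (symmetric = TRUE is appropriate because $S_\lambda$ is symmetric; values are approximate for the same reason as the traces):

# Eigenvalues of S_lambda for the two penalty values quoted above.
round(eigen(smoother(.0001), symmetric = TRUE)$values, 6)
round(eigen(smoother(.1), symmetric = TRUE)$values, 9)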
11-14
$S_{.001}$ (rounded) for the example:
$$
S_{.001} =
\begin{pmatrix}
0.76 & 0.30 & 0.04 & -0.04 & -0.04 & -0.02 & -0.01 & 0.00 & 0.00 & 0.00 & 0.00 \\
0.30 & 0.38 & 0.25 & 0.10 & 0.01 & -0.02 & -0.01 & -0.01 & 0.00 & 0.00 & 0.00 \\
0.04 & 0.25 & 0.37 & 0.26 & 0.10 & 0.01 & -0.01 & -0.01 & -0.01 & 0.00 & 0.00 \\
-0.04 & 0.10 & 0.26 & 0.36 & 0.25 & 0.10 & 0.01 & -0.01 & -0.01 & -0.01 & 0.00 \\
-0.04 & 0.01 & 0.10 & 0.25 & 0.35 & 0.25 & 0.10 & 0.01 & -0.01 & -0.01 & -0.01 \\
-0.02 & -0.02 & 0.01 & 0.10 & 0.25 & 0.35 & 0.25 & 0.10 & 0.01 & -0.02 & -0.02 \\
-0.01 & -0.01 & -0.01 & 0.01 & 0.10 & 0.25 & 0.35 & 0.25 & 0.10 & 0.01 & -0.04 \\
0.00 & -0.01 & -0.01 & -0.01 & 0.01 & 0.10 & 0.25 & 0.36 & 0.26 & 0.10 & -0.04 \\
0.00 & 0.00 & -0.01 & -0.01 & -0.01 & 0.01 & 0.10 & 0.26 & 0.37 & 0.25 & 0.04 \\
0.00 & 0.00 & 0.00 & -0.01 & -0.01 & -0.02 & 0.01 & 0.10 & 0.25 & 0.38 & 0.30 \\
0.00 & 0.00 & 0.00 & 0.00 & -0.01 & -0.02 & -0.04 & -0.04 & 0.04 & 0.30 & 0.76
\end{pmatrix}
$$
13-2
One way to do this is with the R routine Tps in the “fields” package.
• A simple script:

fit <- Tps(X, Y, lambda= ...)
summary(fit)
out.p <- predict.surface(fit)
image(out.p)

• X is an $N \times 2$ matrix of inputs ... there can be more than 2 predictors
• Y is an $N$-element vector of corresponding responses
• If a value is given for $\lambda$, it is used; otherwise it is determined by GCV
Example: 27 points in 2-D

[Figure: scatterplot of the 27 design points in the (X[,1], X[,2]) plane, both coordinates roughly between 2 and 8]
$Y = x_1(10 - x_1)\,x_2 + \text{noise}$. The signal has a range of about 200; the noise is normal with s.d. = 30.
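A hedged sketch of how such an example could be set up and fit with Tps (the actual 27 design points and the random seed are not given in the notes, so the design below is an assumption):

library(fields)
set.seed(1)                                  # illustrative seed
N <- 27
X <- cbind(runif(N, 2, 8), runif(N, 2, 8))   # assumed design region
Y <- X[, 1] * (10 - X[, 1]) * X[, 2] + rnorm(N, sd = 30)
fit <- Tps(X, Y)              # no lambda supplied, so GCV chooses it
summary(fit)
out.p <- predict.surface(fit)
image(out.p)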
Predictions for:
1. $\lambda = 0$ (no smoothing)
2. GCV $\lambda = 0.010$, 9.189 effective df
3. $\lambda = 1$

[Figure: image plots of the three predicted surfaces]
14-3
Tri-cube, Epanechnikov, and normal kernels (top to bottom):

[Figure: the three kernel functions $D(t)$ plotted for $t \in (-1.5, 1.5)$]
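For reference, the standard formulas behind the figure (the tri-cube form matches the weight used in the script on slide 15-4; reading the “normal” kernel as the standard normal density is an assumption consistent with the plotted heights):

$$
D_{\text{tri-cube}}(t) = (1 - |t|^3)^3\,\mathbf{1}\{|t| < 1\}, \qquad
D_{\text{Epan}}(t) = \tfrac{3}{4}(1 - t^2)\,\mathbf{1}\{|t| < 1\}, \qquad
D_{\text{normal}}(t) = \tfrac{1}{\sqrt{2\pi}}\,e^{-t^2/2}
$$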
14-4
Example:
• locally weighted averaging (a minimal sketch of the computation follows below)
• same 11-point 1-d data set used before
• tri-cube kernel, $\lambda = .1, .2, .3, .4, .5$

[Figure: the five locally-weighted-average fits to the 11 data points]
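A minimal sketch of the locally weighted average at a point $x_0$ (the function name lwa and the evaluation grid are illustrative, not from the notes; x, y are assumed to hold the 11-point data):

# Locally weighted (kernel) average with tri-cube weights;
# lambda is the bandwidth.
lwa <- function(x0, x, y, lambda) {
  t <- (x0 - x) / lambda
  D <- ifelse(abs(t) < 1, (1 - abs(t)^3)^3, 0)
  sum(D * y) / sum(D)
}
xgrid <- seq(0, 1, by = 0.01)
yhat <- sapply(xgrid, lwa, x = x, y = y, lambda = 0.3)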
14-5
Example, continued:
• locally weighted linear regression (sketch below)
• tri-cube kernel, $\lambda = .11, .2, .3, .4, .5$ ... why?

[Figure: the five locally-weighted-linear-regression fits to the 11 data points]
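A matching sketch for locally weighted linear regression (lwlr is an illustrative name; the notes do not give this code):

# Weighted least-squares line at x0, tri-cube weights;
# points outside the window get weight 0.
lwlr <- function(x0, x, y, lambda) {
  t <- (x0 - x) / lambda
  w <- ifelse(abs(t) < 1, (1 - abs(t)^3)^3, 0)
  fit <- lm(y ~ x, weights = w)
  predict(fit, newdata = data.frame(x = x0))
}

(Plausibly the reason for $\lambda = .11$ rather than $.10$: with the data spaced 0.1 apart, at $\lambda = .10$ an $x_0$ equal to a data point gives positive weight to that single point only, so the local line is undetermined.)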
15-4
Example:
• Same set of N = 27 2-d x's used in the Tps example.
• Same function to generate Y, but more noise added here.

library(fields)   # for quilt.plot
# Initializations (not shown on the slide): start the additive pieces at 0.
g1 <- rep(0, 27)
g2 <- rep(0, 27)
s <- numeric(27)
lambda <- 2       # bandwidth; the slide does not specify the value used

# Use simple average as the intercept ...
alphahat <- mean(Y)
# ... and iterate for everything else.
for(iter in 1:10) {
  # Subtract everything but g1 from data ...
  temp <- Y - alphahat - g2
  # ... and locally smooth along x1.
  for(i in 1:27) {
    num <- 0
    den <- 0
    for(j in 1:27) {
      # Tri-cube kernel
      t <- (X[i,1] - X[j,1]) / lambda
      Dt <- 0
      if(abs(t) < 1) {
        Dt <- (1 - abs(t)^3)^3
      }
      num <- num + temp[j]*Dt
      den <- den + Dt
    }
    s[i] <- num/den
  }
  # Center the smooth (ave(s) is mean(s) repeated for every entry).
  g1 <- s - ave(s)

  # Subtract everything but g2 from data ...
  temp <- Y - alphahat - g1
  # ... and locally smooth along x2.
  for(i in 1:27) {
    num <- 0
    den <- 0
    for(j in 1:27) {
      # Tri-cube kernel
      t <- (X[i,2] - X[j,2]) / lambda
      Dt <- 0
      if(abs(t) < 1) {
        Dt <- (1 - abs(t)^3)^3
      }
      num <- num + temp[j]*Dt
      den <- den + Dt
    }
    s[i] <- num/den
  }
  g2 <- s - ave(s)
}

quilt.plot(X[,1], X[,2], Y, nx=10, ny=10)
quilt.plot(X[,1], X[,2], alphahat + g1 + g2, nx=10, ny=10)
[Figure: quilt.plot images of the raw Y values (left) and the fitted additive surface alphahat + g1 + g2 (right)]