STAT 502X: Smoothing 1 (11-2)

• Why the penalty?

[Figure: the 11-point (x, y) example data with two fitted curves overlaid.]

• Solid = quadratic polynomial: residuals, but "mild" 2nd derivatives
• Dashed = 10th-degree polynomial: no residuals, but "wild" 2nd derivatives

STAT 502X: Smoothing 2

• Adopt the convention $x_1 < x_2 < x_3 < \cdots < x_N$.
• "Natural" basis functions for cubic splines (slide 9-7):
  – $1$
  – $x$
  – $(x - x_j)_+^3 - \frac{x_N - x_j}{x_N - x_{N-1}}(x - x_{N-1})_+^3 + \frac{x_{N-1} - x_j}{x_N - x_{N-1}}(x - x_N)_+^3$, for $j = 1, 2, \ldots, N-2$
• Think about the form of "X-matrix" this makes: $H_{N \times N}$ has a first column of 1's, a second column holding the $x_i$'s, and $N - 2$ further columns that are zero until $x$ passes the corresponding knot, so the block of cubic terms has a lower-triangular pattern.

STAT 502X: Smoothing 3

For the example ($x = 0.0, 0.1, \ldots, 1.0$, so $N = 11$):

  1  0.0  0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
  1  0.1  0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
  1  0.2  0.008 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000
  1  0.3  0.027 0.008 0.001 0.000 0.000 0.000 0.000 0.000 0.000
  1  0.4  0.064 0.027 0.008 0.001 0.000 0.000 0.000 0.000 0.000
  1  0.5  0.125 0.064 0.027 0.008 0.001 0.000 0.000 0.000 0.000
  1  0.6  0.216 0.125 0.064 0.027 0.008 0.001 0.000 0.000 0.000
  1  0.7  0.343 0.216 0.125 0.064 0.027 0.008 0.001 0.000 0.000
  1  0.8  0.512 0.343 0.216 0.125 0.064 0.027 0.008 0.001 0.000
  1  0.9  0.729 0.512 0.343 0.216 0.125 0.064 0.027 0.008 0.001
  1  1.0  0.990 0.720 0.504 0.336 0.210 0.120 0.060 0.024 0.006

STAT 502X: Smoothing 4

OLS fit to the columns of H (i.e. λ = 0):

[Figure: the interpolating fit through all 11 data points.]

STAT 502X: Smoothing 5

Connection between "2 derivatives" and the functional form:

• $\frac{\partial^p}{\partial x^p}(x - \xi)_+^q$ is defined on each side of $\xi$ ... always 0 for $x \le \xi$.
• For $x > \xi$, it is
  – a function that approaches 0 as $x \downarrow \xi$ if $p < q$
  – a nonzero constant if $p = q$
• So cubic splines preserve 2 derivatives ... 9th-order splines preserve 8 derivatives ...

STAT 502X: Smoothing 6 (11-3)

• Let the $N$-dimensional row vector $h''(x)$ have $j$th entry $h''_j(x)$.
• Think of an "nrow = huge" matrix, stacking evaluations on a grid of spacing $\delta$:

  $H'' = \begin{pmatrix} h''(x_1) \\ h''(x_1 + \delta) \\ h''(x_1 + 2\delta) \\ \vdots \\ h''(x_N) \end{pmatrix}$

• Then the limit of $\delta \times (H'')'H''$ is $\Omega$ (a Riemann sum converging to $\int h''(x)'\,h''(x)\,dx$).
• A sort of "X'X-matrix", but based on derivatives rather than the original basis functions, and covering all of $[a, b]$ rather than just the $N$ data values.
• → positive semi-definite, $\theta'\Omega\theta \ge 0$

STAT 502X: Smoothing 7

• Note that for the basis functions we are using,
  – $h_1(x) = 1$, so $h''_1(x) = 0$
  – $h_2(x) = x$, so $h''_2(x) = 0$
  – $h_j(x) = (x - x_{j-2})_+^3 - c_1(x - x_{N-1})_+^3 + c_2(x - x_N)_+^3$, so
    $h''_j(x) = 6(x - x_{j-2})_+ - 6c_1(x - x_{N-1})_+ + 6c_2(x - x_N)_+$, for $j = 3, 4, \ldots, N$
• This means that the first two rows and columns of $\Omega$ contain only zeroes!

STAT 502X: Smoothing 8 (11-4)

• For λ = 0: OLS on the N spline functions
• For large λ: OLS on (1, x)
• For the example, with λ = 0, 10⁻⁵, 10⁻⁴, ..., 1:

[Figure: the fitted smooths for this sequence of λ values on the 11-point example.]

STAT 502X: Smoothing 9 (11-6)

• $S_\lambda = H(H'H + \lambda\Omega)^{-1}H'$
• For λ = 0: $S_0 = H(H'H)^{-1}H'$
• For λ → ∞, with X = (1, x): $S_\infty = X(X'X)^{-1}X'$
• ALL $S_\lambda$ map Y into the column space of H; only $S_0$ yields the orthogonal projection.

STAT 502X: Smoothing 10 (11-7)

For the example:

  λ         trace(S_λ)
  0         11
  .00001    9.444893
  .0001     7.107705
  .001      4.731553
  .01       3.154565
  .1        2.272793
  1         2.032943
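Aside (not from the notes): a minimal R sketch that builds H for the 11-point example, approximates Ω by the Riemann sum δ × (H'')'H'' from slide 11-3, and computes trace(S_λ); the output can be checked against the H matrix on slide 3 and the trace table above.

  x <- seq(0, 1, by = 0.1)   # the N = 11 knots of the example
  N <- length(x)

  # natural cubic spline basis function j = 1, ..., N-2 (slide 9-7 form)
  h <- function(z, j) {
    c1 <- (x[N] - x[j]) / (x[N] - x[N - 1])
    c2 <- (x[N - 1] - x[j]) / (x[N] - x[N - 1])
    pmax(z - x[j], 0)^3 - c1 * pmax(z - x[N - 1], 0)^3 +
      c2 * pmax(z - x[N], 0)^3
  }
  H <- cbind(1, x, sapply(1:(N - 2), function(j) h(x, j)))
  round(H, 3)                # reproduces the matrix on slide 3

  # second derivatives of the basis functions, evaluated on a fine grid
  hpp <- function(z, j) {
    c1 <- (x[N] - x[j]) / (x[N] - x[N - 1])
    c2 <- (x[N - 1] - x[j]) / (x[N] - x[N - 1])
    6 * pmax(z - x[j], 0) - 6 * c1 * pmax(z - x[N - 1], 0) +
      6 * c2 * pmax(z - x[N], 0)
  }
  delta <- 1e-4
  grid  <- seq(0, 1, by = delta)
  Hpp   <- cbind(0, 0, sapply(1:(N - 2), function(j) hpp(grid, j)))

  # Omega ~ delta * (Hpp)'(Hpp); its first two rows/columns are zero
  Omega <- delta * t(Hpp) %*% Hpp

  # smoother matrix and its trace (effective degrees of freedom)
  S <- function(lambda) H %*% solve(t(H) %*% H + lambda * Omega, t(H))
  sum(diag(S(.001)))         # compare with the trace(S_lambda) table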
STAT 502X: Smoothing 11 (11-8)

• Expression (5) is of the same form as is used on slide 4, focusing on Ŷ (a data smooth) rather than θ (a model) ...
• Original form: $(Y - H\theta)'(Y - H\theta) + \lambda\theta'\Omega\theta$
• $\hat{Y} = H\theta$, so $\theta = H^{-1}\hat{Y}$
• $(Y - HH^{-1}\hat{Y})'(Y - HH^{-1}\hat{Y}) + \lambda\hat{Y}'(H')^{-1}\Omega H^{-1}\hat{Y}$
• With $K = (H')^{-1}\Omega H^{-1}$, this is
• $(Y - \hat{Y})'(Y - \hat{Y}) + \lambda\hat{Y}'K\hat{Y}$

STAT 502X: Smoothing 12 (11-9)

Eigenvalues of $S_\lambda$ for the example:

• λ = .0001: 1, 1, 0.9965, ..., 0.023628
• λ = .1: 1, 1, 0.22179, ..., 0.000024199

Fun facts about idempotent matrices (A = AA):

• trace = rank
• all eigenvalues are 0 or 1

For λ → ∞, with X = (1, x), $S_\infty = X(X'X)^{-1}X'$ is idempotent, with rank = 2 and eigenvalues 1, 1, 0, ...

STAT 502X: Smoothing 13 (11-14)

Example: $S_{.001}$ (rounded) for the example:

   0.76  0.30  0.04 −0.04 −0.04 −0.02 −0.01  0.00  0.00  0.00  0.00
   0.30  0.38  0.25  0.10  0.01 −0.02 −0.01 −0.01  0.00  0.00  0.00
   0.04  0.25  0.37  0.26  0.10  0.01 −0.01 −0.01 −0.01  0.00  0.00
  −0.04  0.10  0.26  0.36  0.25  0.10  0.01 −0.01 −0.01 −0.01  0.00
  −0.04  0.01  0.10  0.25  0.35  0.25  0.10  0.01 −0.01 −0.01 −0.01
  −0.02 −0.02  0.01  0.10  0.25  0.35  0.25  0.10  0.01 −0.02 −0.02
  −0.01 −0.01 −0.01  0.01  0.10  0.25  0.35  0.25  0.10  0.01 −0.04
   0.00 −0.01 −0.01 −0.01  0.01  0.10  0.25  0.36  0.26  0.10 −0.04
   0.00  0.00 −0.01 −0.01 −0.01  0.01  0.10  0.26  0.37  0.25  0.04
   0.00  0.00  0.00 −0.01 −0.01 −0.02  0.01  0.10  0.25  0.38  0.30
   0.00  0.00  0.00  0.00 −0.01 −0.02 −0.04 −0.04  0.04  0.30  0.76

STAT 502X: Smoothing 14 (13-2)

One way to do this is with the R routine Tps in the "fields" package.

• A simple script:

  fit <- Tps(X, Y, lambda = ...)
  summary(fit)
  out.p <- predict.surface(fit)
  image(out.p)

• X is an N × 2 matrix of inputs ... there can be more than 2 predictors
• Y is an N-element vector of corresponding responses
• If a value is given for lambda it is used; otherwise it is determined by GCV
• (Note: newer versions of fields rename predict.surface as predictSurface.)

STAT 502X: Smoothing 15

Example: 27 points in 2-D

[Figure: scatterplot of the 27 design points, X[,1] versus X[,2].]

$Y = x_1(10 - x_1)x_2 + \epsilon$: the "signal" has a range of about 200, and $\epsilon$ is normal noise with s.d. = 30.

STAT 502X: Smoothing 16

Predictions for:

1. λ = 0 (no smoothing)
2. GCV choice λ = 0.010, 9.189 effective df
3. λ = 1

[Figure: three image plots of the corresponding predicted surfaces.]

STAT 502X: Smoothing 17 (14-3)

Tri-cube, Epanechnikov, and normal kernels (top to bottom):

[Figure: the three kernel functions D(t) plotted for t in (−1.5, 1.5).]

STAT 502X: Smoothing 18 (14-4)

Example:

• locally weighted averaging
• same 11-point 1-d data set used before
• tri-cubic kernel, λ = .1, .2, .3, .4, .5

[Figure: the five locally weighted averaging smooths overlaid on the data.]

STAT 502X: Smoothing 19 (14-5)

Example, continued:

• locally weighted linear regression
• tri-cubic kernel, λ = .11, .2, .3, .4, .5 ... why?

[Figure: the five locally weighted linear regression smooths overlaid on the data.]
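Aside (not from the notes): a short R sketch of locally weighted linear regression with the tri-cube kernel, assuming x and y hold the 11-point example data. It also hints at the "why?" above: with λ = .1, the boundary point x0 = 0 gives positive weight to only one observation (x = .1 falls exactly on the window edge), so no local line is determined there; λ = .11 just brings a second point into the window. That reading is my inference, not the notes'.

  tricube <- function(t) ifelse(abs(t) < 1, (1 - abs(t)^3)^3, 0)

  # fit a weighted least squares line around x0 and evaluate it at x0
  loclin <- function(x0, x, y, lambda) {
    w   <- tricube((x - x0) / lambda)
    fit <- lm(y ~ x, weights = w)
    unname(predict(fit, newdata = data.frame(x = x0)))
  }

  # fitted smooth on a grid, e.g. lambda = .2
  grid <- seq(0, 1, by = .01)
  yhat <- sapply(grid, loclin, x = x, y = y, lambda = .2)
  # plot(x, y); lines(grid, yhat)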
STAT 502X: Smoothing 20 (15-4)

Example:

• Same set of N = 27 2-d x's used in the Tps example.
• Same function to generate Y, but with more noise added here.

  # (initializations implied but not shown on the slide:
  #  lambda must be set, and backfitting starts from g2 = 0)
  g2 <- rep(0, 27)
  s  <- numeric(27)

  # Use the simple average as the intercept ...
  alphahat <- mean(Y)
  # ... and iterate for everything else.
  for(iter in 1:10){
    # Subtract everything but g1 from the data ...
    temp <- Y - alphahat - g2
    # ... and locally smooth along x1.
    for(i in 1:27){
      num <- 0
      den <- 0
      for(j in 1:27){
        # Tri-cubic kernel
        t <- (X[i,1] - X[j,1]) / lambda
        Dt <- 0
        if(abs(t) < 1){ Dt <- (1 - abs(t)^3)^3 }
        num <- num + temp[j] * Dt
        den <- den + Dt
      }
      s[i] <- num / den
    }
    g1 <- s - ave(s)   # center; ave(s) here is just the overall mean

    # Subtract everything but g2 from the data ...
    temp <- Y - alphahat - g1
    # ... and locally smooth along x2.
    for(i in 1:27){
      num <- 0
      den <- 0
      for(j in 1:27){
        # Tri-cubic kernel
        t <- (X[i,2] - X[j,2]) / lambda
        Dt <- 0
        if(abs(t) < 1){ Dt <- (1 - abs(t)^3)^3 }
        num <- num + temp[j] * Dt   # smooth the partial residuals temp,
        den <- den + Dt             # not Y (the slide's Y[j] looks like a typo)
      }
      s[i] <- num / den
    }
    g2 <- s - ave(s)
  }

  quilt.plot(X[,1], X[,2], Y, nx = 10, ny = 10)
  quilt.plot(X[,1], X[,2], alphahat + g1 + g2, nx = 10, ny = 10)

STAT 502X: Smoothing 22

[Figure: quilt.plot images of the raw Y values and of the additive fit alphahat + g1 + g2 over the design region.]
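Aside (not from the notes): the inner double loop above is a one-dimensional tri-cube kernel smoother, which can be restated compactly in vectorized R; xx is the coordinate being smoothed along and yy the working residuals.

  kernel.smooth <- function(xx, yy, lambda) {
    tricube <- function(t) ifelse(abs(t) < 1, (1 - abs(t)^3)^3, 0)
    # W[i, j] = D((xx_i - xx_j) / lambda), then a weighted average per row
    W <- outer(xx, xx, function(a, b) tricube((a - b) / lambda))
    as.vector((W %*% yy) / rowSums(W))
  }

  # e.g. the first smoothing step in the loop becomes:
  #   s  <- kernel.smooth(X[,1], temp, lambda)
  #   g1 <- s - mean(s)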