Reproducing Kernel Hilbert Spaces and Polynomial
Smoothing Splines
Sam Watson
November 21, 2008
Introduction
Suppose we have some data (x_i, y_i) ∈ T × R for i = 1, 2, . . . , N, where T is an arbitrary set, and we want to find a function f : T → R that minimizes the penalized residual sum-of-squares

    PRSS(f) = ∑_{i=1}^N L(y_i, f(x_i)) + λ J(f),    (1)
where L is a loss function, λ ≥ 0 is a parameter of the model to be fixed later by cross-validation,
and J is a functional that imposes a smoothness criterion on f. We will use reproducing kernel Hilbert spaces to develop a unifying framework for this problem and for many other estimation and minimization problems in statistics. (The material presented here is standard; see the References.)
Reproducing Kernel Hilbert Spaces
Recall that a Hilbert space H is an inner-product space which is complete in the metric induced
by its norm. For any Hilbert space of functions on a set T , we may define for each t ∈ T the
evaluation functional f ↦ f(t). If every such evaluation functional is bounded, then we say that the Hilbert space is a reproducing kernel Hilbert space (RKHS). Observe that L² is not an RKHS: its elements are equivalence classes of functions, so point evaluation is not even well defined, let alone bounded (informally, the Dirac delta functions are not elements of L²). For an RKHS, the Riesz representation theorem gives us, for each t ∈ T, an element R_t ∈ H such that

    f(t) = ⟨f, R_t⟩    (2)

for all f ∈ H.
Then we can define a function R : T × T → R (the kernel) by R(s,t) = R_s(t). This function is symmetric, because R_t(s) = ⟨R_t, R_s⟩ = ⟨R_s, R_t⟩ = R_s(t), and by direct application of (2) it has the reproducing property ⟨R(·, s), R(·, t)⟩ = R(s,t). It is also nonnegative definite, meaning that for all
t_1, t_2, . . . , t_n ∈ T and real numbers a_1, a_2, . . . , a_n,

    ∑_{i,j=1}^n a_i a_j R(t_i, t_j) ≥ 0,

because the quantity on the left-hand side equals ⟨∑_{i=1}^n a_i R(·, t_i), ∑_{j=1}^n a_j R(·, t_j)⟩ = ‖∑_{i=1}^n a_i R(·, t_i)‖² ≥ 0.
Moreover, each RKHS H has a unique kernel, because if R and R′ are both kernels with representers R_y and R′_y, then ‖R(·, y) − R′(·, y)‖² = ⟨R_y − R′_y, R_y − R′_y⟩ = ⟨R_y − R′_y, R_y⟩ − ⟨R_y − R′_y, R′_y⟩ = (R_y − R′_y)(y) − (R_y − R′_y)(y) = 0, using the reproducing property once for R and once for R′.
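Nonnegative definiteness is easy to check numerically for a candidate kernel on a finite set of points: the Gram matrix (R(t_i, t_j))_{ij} should have no negative eigenvalues. A minimal sketch in Python, using R(s,t) = min(s,t) on [0, 1] as an illustrative kernel (this particular choice is an assumption for the example, not something prescribed by the text):

    import numpy as np

    # Gram matrix of the example kernel R(s, t) = min(s, t) at random points in [0, 1].
    rng = np.random.default_rng(0)
    t = rng.uniform(0.0, 1.0, 8)
    gram = np.minimum(t[:, None], t[None, :])      # (R(t_i, t_j))_{ij}

    # Nonnegative definiteness: no negative eigenvalues (up to rounding error),
    # and a^T gram a = ||sum_i a_i R(., t_i)||^2 >= 0 for any coefficient vector a.
    print(np.linalg.eigvalsh(gram))
    a = rng.standard_normal(8)
    print(a @ gram @ a)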
We now state an important theorem regarding the correspondence between reproducing kernel
Hilbert spaces and their kernel functions [1].
Theorem 1. Every RKHS has a unique nonnegative definite kernel with the reproducing property, and conversely, for any symmetric, nonnegative definite R : T × T → R, there is a unique RKHS H_R of functions on T whose kernel is R.
To obtain the RKHS for a kernel R, we first consider all finite linear combinations of the functions {R(·, s) | s ∈ T}, and define the inner product by ⟨R_s, R_t⟩ = R(s,t), extended by linearity. We confirm that R is a kernel for this space: let f = ∑ a_i R(·, t_i), where i ranges over a finite index set, and calculate ⟨∑ a_i R(·, t_i), R(·, s)⟩ = ∑ a_i ⟨R(·, t_i), R(·, s)⟩ = ∑_i a_i R(t_i, s) = f(s). A Cauchy sequence of such functions is guaranteed to converge pointwise to a limiting function, because |f_n(t) − f_m(t)| = |⟨f_n − f_m, R_t⟩| ≤ ‖f_n − f_m‖ ‖R_t‖ by Cauchy-Schwarz. Therefore, we obtain a complete inner product space H_R if we include all finite linear combinations of the R(·, s) as well as the pointwise limits of Cauchy sequences of these functions.
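To make the construction concrete, here is a small numerical sketch (again with the illustrative kernel R(s,t) = min(s,t), an assumption rather than anything from the text): a finite combination f = ∑ a_i R(·, t_i) has squared norm aᵀKa, where K is the Gram matrix, and the Cauchy-Schwarz bound used above gives |f(t)| ≤ ‖f‖ √R(t,t).

    import numpy as np

    # A finite element f = sum_i a_i R(., t_i) of H_R, its norm, and the
    # pointwise bound |f(t)| <= ||f|| * sqrt(R(t, t)) from Cauchy-Schwarz.
    def R(s, t):
        return np.minimum(s, t)                    # example kernel on [0, 1]

    rng = np.random.default_rng(1)
    nodes = rng.uniform(0.0, 1.0, 6)               # the points t_i
    a = rng.standard_normal(6)                     # the coefficients a_i

    K = R(nodes[:, None], nodes[None, :])          # Gram matrix K_ij = R(t_i, t_j)
    norm_f = np.sqrt(a @ K @ a)                    # ||f||^2 = <f, f> = a^T K a

    def f(t):
        return R(t, nodes) @ a                     # f(t) = sum_i a_i R(t, t_i)

    for t in [0.1, 0.5, 0.9]:
        assert abs(f(t)) <= norm_f * np.sqrt(R(t, t)) + 1e-12
    print("Cauchy-Schwarz bound holds at the test points")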
Let’s see generally how the RKHS approach can help with a problem like (1). Suppose we can find a kernel R whose RKHS H_R has squared norm corresponding to the penalty functional J. Then by Hilbert space theory, for any f ∈ H_R we can write f = f̃ + f⊥, where f̃ is a linear combination of the N functions {R(·, x_i)}_{i=1}^N and ⟨f⊥, R(·, x_i)⟩ = 0 for all i = 1, 2, . . . , N. The following theorem demonstrates the usefulness of this representation.
Theorem 2. For f ∈ H_R with f = f̃ + f⊥ and f̃ ∈ span({R(·, x_i)}_{i=1}^N),

    ∑_{i=1}^N L(y_i, f(x_i)) + λ ‖f‖² ≥ ∑_{i=1}^N L(y_i, f̃(x_i)) + λ ‖f̃‖²,    (3)

with equality if and only if f⊥ = 0.
Proof. The summations on the two sides of (3) are equal, since f is evaluated only at the N points x_i, and f⊥(x_i) = ⟨f⊥, R(·, x_i)⟩ = 0 implies f(x_i) = f̃(x_i). Then observe that ‖f‖² = ‖f̃‖² + ‖f⊥‖² ≥ ‖f̃‖², with equality if and only if f⊥ = 0.
This theorem shows that the minimizer of (1) over H_R, with J(f) = ‖f‖², has the form

    f_λ = ∑_{i=1}^N c_i R(·, x_i),    (4)

and therefore reduces the infinite-dimensional minimization problem (1) to a finite-dimensional minimization problem over the coefficients c_1, . . . , c_N. The sacrifice required was narrowing the space of candidate functions f from the space of all functions on T to the space H_R. It is therefore important to choose R so that H_R contains a broad class of functions. For squared-error loss the reduced problem even has a closed-form solution, sketched below; the next section then illustrates these ideas with a concrete choice of R.
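Here is a minimal sketch of that closed form (assuming squared-error loss and an invertible Gram matrix; the kernel min(s,t) is again only an illustrative choice): substituting (4) into the criterion gives ‖y − Kc‖² + λ cᵀKc with (K)_{ij} = R(x_i, x_j), whose minimizer is c = (K + λI)⁻¹ y.

    import numpy as np

    # Finite-dimensional solution of (1) under squared-error loss:
    #   minimize ||y - K c||^2 + lambda * c^T K c   over c,
    # which gives c = (K + lambda I)^{-1} y when K is invertible.
    def R(s, t):
        return np.minimum(s, t)                    # illustrative kernel on [0, 1]

    rng = np.random.default_rng(2)
    N, lam = 30, 1e-2
    x = np.sort(rng.uniform(0.0, 1.0, N))
    y = np.cos(3 * x) + 0.1 * rng.standard_normal(N)

    K = R(x[:, None], x[None, :])                  # Gram matrix
    c = np.linalg.solve(K + lam * np.eye(N), y)    # coefficients in (4)

    f_hat = lambda t: R(np.atleast_1d(t)[:, None], x[None, :]) @ c
    print(f_hat(np.array([0.25, 0.5, 0.75])))      # fitted values f_lambda(t)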
Application: Polynomial Smoothing Splines
If the input data {x_i}_{i=1}^N are one-dimensional, then without loss of generality we may assume T = [0, 1]. A common choice for smoothing a one-dimensional interpolating function is to penalize the integral of the square of the mth derivative of f: J(f) = ∫_0^1 (f^(m)(x))² dx. Our goal is to construct an RKHS whose norm corresponds to this J. Recall that Taylor's Theorem with integral remainder states that if f has m − 1 absolutely continuous derivatives on [0, 1] and f^(m) ∈ L²[0, 1], then

    f(t) = ∑_{i=0}^{m−1} (t^i / i!) f^(i)(0) + ∫_0^t ((t − x)^{m−1} / (m − 1)!) f^(m)(x) dx.    (5)
It will be helpful to rewrite the integral as ∫_0^1 ((t − x)_+^{m−1} / (m − 1)!) f^(m)(x) dx, where (x)_+ = x for x ≥ 0 and 0 for x < 0. If we define W_m^0 to be the set of functions satisfying the hypotheses of (5) whose value and first m − 1 derivatives vanish at t = 0, then for f ∈ W_m^0 we have

    f(t) = ∫_0^1 G_m(t, x) f^(m)(x) dx,    (6)

where we have written G_m(t, x) for (t − x)_+^{m−1} / (m − 1)!.
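As a quick sanity check of (6), take m = 2 and the illustrative test function f(t) = t³ (so f(0) = f′(0) = 0 and f″(x) = 6x); numerical integration then recovers f(t). This is only a sketch to make the representation concrete:

    import numpy as np

    # Check of (6) for m = 2 with the test function f(t) = t^3:
    # f(t) should equal int_0^1 G_2(t, x) f''(x) dx with G_2(t, x) = (t - x)_+.
    def G2(t, x):
        return np.maximum(t - x, 0.0)

    x = np.linspace(0.0, 1.0, 20001)               # quadrature grid on [0, 1]
    for t in [0.2, 0.5, 0.9]:
        g = G2(t, x) * 6.0 * x                     # G_2(t, x) * f''(x)
        integral = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(x))   # trapezoid rule
        print(t**3, integral)                      # the two values agree closely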
With the inner product

    ⟨f, g⟩ = ∫_0^1 f^(m)(x) g^(m)(x) dx

and kernel R¹(s,t) = ∫_0^1 G_m(s, x) G_m(t, x) dx, the space W_m^0 forms an RKHS. To see that R¹ is a kernel for this space, first notice that the mth derivative of R¹(·, t) is G_m(t, ·), which can be seen by differentiating the defining integral m times in its first argument. Then ⟨f, R¹(·, t)⟩ = ∫_0^1 f^(m)(x) G_m(t, x) dx = f(t) by (6). Now define the nullspace of the penalty functional: H_0 = span({φ_i}_{i=1}^m), where φ_i(t) = t^{i−1}/(i − 1)!. The kernel for H_0 is R⁰(s,t) = ∑_{i=1}^m φ_i(s) φ_i(t). It may easily be verified [1] that the space W_m of functions with m − 1 absolutely continuous derivatives and square-integrable mth derivative can be written as a direct sum W_m = H_0 ⊕ W_m^0 with kernel R = R⁰ + R¹. Also, J(f) is the squared norm of the projection Pf of f onto W_m^0, so (1) with J(f) = ∫_0^1 (f^(m)(x))² dx becomes

    ∑_{i=1}^N L(y_i, f(x_i)) + λ ‖Pf‖²    (7)
for f ∈ W_m. This problem differs from (3) only by the appearance of the projection operator, and in fact, by following the method of our proof of Theorem 2, it is easy to show that the solution to (7) is the natural generalization of (4):
    f_λ = ∑_{i=1}^N c_i R¹(·, x_i) + ∑_{j=1}^m d_j φ_j.    (8)
In other words, f_λ consists of an unpenalized component from H_0 together with a linear combination of the projections onto W_m^0 of the representers of evaluation at the N input points. For the squared-error loss L(y_i, f(x_i)) = (y_i − f(x_i))², this solution is in fact the natural polynomial spline. We will prove this in the case m = 2; the general case is very similar.
Theorem 3. The function f_λ ∈ W_2 that minimizes the penalized residual sum of squares

    PRSS(f) = ∑_{i=1}^N (y_i − f(x_i))² + λ ∫_0^1 (f″(x))² dx    (9)

is the natural cubic spline with knots at x_1, . . . , x_N.
Proof. Calculate

    R¹(s, t) = ∫_0^1 G_2(s, x) G_2(t, x) dx = ∫_0^1 (s − x)_+ (t − x)_+ dx
             = s²t/2 − s³/6  if s ≤ t,   and   t²s/2 − t³/6  if t ≤ s,

which, as a function of s with t fixed, is cubic on [0, t] and linear on [t, 1], with continuous first and second derivatives. Since each R¹(·, x_i) is linear on [x_i, 1], any function of the form (8) is automatically linear on [x_N, 1]; it differs from the natural cubic spline only in that the latter is also required to be linear on the interval [0, x_1]. We will show that this is the case for f_λ as well (in what follows we keep m arbitrary). Define the N × m matrix (T)_{ij} = φ_j(x_i) and the N × N matrix (S)_{ij} = R¹(x_i, x_j). Then, in matrix notation, (9) becomes (after dividing by N and absorbing the factor into λ)

    (1/N) ‖y − (Sc + Td)‖² + λ cᵀSc.    (10)
Taking derivatives to minimize (10), we find that

    c = M⁻¹(I − T(TᵀM⁻¹T)⁻¹TᵀM⁻¹) y,

where M = S + NλI. Multiplying on the left by Tᵀ gives Tᵀc = TᵀM⁻¹y − (TᵀM⁻¹T)(TᵀM⁻¹T)⁻¹TᵀM⁻¹y = 0, from which we obtain ∑_{i=1}^N c_i x_i^k = 0 for all k = 0, 1, . . . , m − 1. Therefore, for t ∈ [0, x_1], we have
    Pf_λ(t) = ∑_{i=1}^N c_i R¹(t, x_i) = ∫_0^t ((t − x)^{m−1} / (m − 1)!) ∑_{i=1}^N c_i ((x_i − x)^{m−1} / (m − 1)!) dx = 0,

since expanding (x_i − x)^{m−1} produces no powers of x_i greater than m − 1, and every power x_i^k with k ≤ m − 1 is annihilated by ∑_{i=1}^N c_i x_i^k = 0. Therefore, on [0, x_1] the only nonzero components of f_λ are the constant and linear terms coming from its H_0 part, so f_λ is linear there, as required.
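A minimal numerical sketch of this construction for m = 2 (the closed form for R¹ and the formula for c come from the proof above; the companion formula d = (TᵀM⁻¹T)⁻¹TᵀM⁻¹y is assumed to follow from the same stationarity conditions, and the data are made up for illustration):

    import numpy as np

    # Fit the m = 2 smoothing spline of (8)-(10), then check that T^T c = 0
    # and that f_lambda is linear (affine) on [0, x_1].
    def R1(s, t):
        """Closed form of R^1(s, t) for m = 2, from the proof of Theorem 3."""
        a, b = np.minimum(s, t), np.maximum(s, t)
        return a**2 * b / 2 - a**3 / 6

    rng = np.random.default_rng(0)
    N, lam = 20, 1e-3
    x = np.sort(rng.uniform(0.2, 0.9, N))              # inputs in [0, 1]
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

    S = R1(x[:, None], x[None, :])                     # (S)_ij = R^1(x_i, x_j)
    T = np.column_stack([np.ones(N), x])               # (T)_ij = phi_j(x_i): 1 and t
    M = S + N * lam * np.eye(N)

    Minv = np.linalg.inv(M)
    B = np.linalg.inv(T.T @ Minv @ T)
    c = Minv @ (np.eye(N) - T @ B @ T.T @ Minv) @ y    # formula from the proof
    d = B @ T.T @ Minv @ y                             # assumed companion formula

    print("T^T c =", T.T @ c)                          # ~ 0, as the proof requires

    def f_lambda(t):
        t = np.atleast_1d(t)
        return R1(t[:, None], x[None, :]) @ c + np.column_stack([np.ones_like(t), t]) @ d

    # On [0, x_1] the penalized part vanishes, so f_lambda is affine there:
    grid = np.linspace(0.0, x[0], 5)
    print("second differences on [0, x_1]:", np.diff(f_lambda(grid), 2))   # ~ 0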
References
[1] Aronszajn, N. (1950). Theory of Reproducing Kernels. Transactions of the American Mathematical Society 68:337–404.
[2] Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The Elements of Statistical Learning. Springer, New York.
[3] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.