Reproducing Kernel Hilbert Spaces and Polynomial Smoothing Splines

Sam Watson

November 21, 2008

Introduction

Suppose we have data $\{(x_i, y_i)\} \subset T \times \mathbb{R}$ for $i = 1, 2, \ldots, N$, where $T$ is an arbitrary set, and we want to find a function $f : T \to \mathbb{R}$ that minimizes the penalized residual sum of squares
$$\mathrm{PRSS}(f) = \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f), \qquad (1)$$
where $L$ is a loss function, $\lambda \ge 0$ is a parameter of the model to be fixed later by cross-validation, and $J$ is a functional that imposes a smoothness criterion on $f$. We will use reproducing kernel Hilbert spaces to develop a unifying framework for this problem and many other estimation and minimization problems in statistics. (The material presented here is standard; see References.)

Reproducing Kernel Hilbert Spaces

Recall that a Hilbert space $\mathcal{H}$ is an inner-product space which is complete in the metric induced by its norm. For any Hilbert space of functions on a set $T$, we may define for each $t \in T$ the evaluation functional $f \mapsto f(t)$. If every such evaluation functional is bounded, then we say that the Hilbert space is a reproducing kernel Hilbert space (RKHS). Observe that $L^2$ is not an RKHS, because its evaluation functionals are not bounded (informally, the Dirac delta functions are not in $L^2$).

For an RKHS, the Riesz representation theorem gives for each $t$ an element $R_t \in \mathcal{H}$ such that
$$f(t) = \langle f, R_t \rangle. \qquad (2)$$
We can then define a function $R : T \times T \to \mathbb{R}$ (the kernel) by $R(s, t) = R_s(t)$. This function is symmetric, because
$$R_t(s) = \langle R_s, R_t \rangle = \langle R_t, R_s \rangle = R_s(t),$$
and by direct application of (2) it has the reproducing property
$$\langle R(\cdot, s), R(\cdot, t) \rangle = R(s, t).$$
It is also nonnegative definite, meaning that for all $t_1, t_2, \ldots, t_n \in T$ and real numbers $a_1, a_2, \ldots, a_n$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j R(t_i, t_j) \ge 0,$$
because the quantity on the left-hand side equals $\bigl\langle \sum_{i=1}^{n} a_i R(\cdot, t_i), \sum_{i=1}^{n} a_i R(\cdot, t_i) \bigr\rangle = \bigl\| \sum_{i=1}^{n} a_i R(\cdot, t_i) \bigr\|^2$. Moreover, each RKHS $\mathcal{H}$ has a unique kernel, because if $R$ and $R'$ are both kernels, then
$$\| R(\cdot, t) - R'(\cdot, t) \|^2 = \langle R_t - R'_t, R_t - R'_t \rangle = \langle R_t - R'_t, R_t \rangle - \langle R_t - R'_t, R'_t \rangle = 0$$
by the reproducing property.

We now state an important theorem regarding the correspondence between reproducing kernel Hilbert spaces and their kernel functions [1].

Theorem 1. Every RKHS has a unique nonnegative definite kernel with the reproducing property, and conversely, for any symmetric, nonnegative definite $R : T \times T \to \mathbb{R}$ there is a unique RKHS $\mathcal{H}_R$ of functions on $T$ whose kernel is $R$.

To obtain the RKHS for a kernel $R$, we first consider all finite linear combinations of the functions $\{R(\cdot, s) \mid s \in T\}$ and define the inner product by $\langle R_s, R_t \rangle = R(s, t)$, extended by linearity. We confirm that $R$ is a kernel for this space: letting $f = \sum_i a_i R(\cdot, t_i)$, where $i$ ranges over a finite index set, we calculate
$$\Bigl\langle \sum_i a_i R(\cdot, t_i), R(\cdot, s) \Bigr\rangle = \sum_i a_i \langle R(\cdot, t_i), R(\cdot, s) \rangle = \sum_i a_i R(t_i, s) = f(s).$$
A Cauchy sequence of such functions is guaranteed to converge pointwise to a limiting function, because
$$|f_n(t) - f_m(t)| = |\langle f_n - f_m, R_t \rangle| \le \| f_n - f_m \| \, \| R_t \|$$
by Cauchy–Schwarz. Therefore, we obtain a complete inner-product space $\mathcal{H}_R$ by including all finite linear combinations of the $R(\cdot, s)$ together with limits of Cauchy sequences of such functions.

Let us see, in general terms, how the RKHS approach can help with a problem like (1). Suppose we can find a kernel $R$ whose RKHS $\mathcal{H}_R$ has squared norm corresponding to the penalty functional $J$. Then by Hilbert space theory, any $f \in \mathcal{H}_R$ can be written $f = \tilde{f} + f^{\perp}$, where $\tilde{f}$ is a linear combination of the $N$ functions $\{R(\cdot, x_i)\}_{i=1}^{N}$ and $\langle f^{\perp}, R(\cdot, x_i) \rangle = 0$ for all $i = 1, 2, \ldots, N$.
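Before stating the main consequence of this decomposition, here is a quick numerical illustration of the nonnegative definiteness discussed above. This sketch is not part of the derivation: the kernel $R(s,t) = \min(s,t)$, the sample points, and the coefficients are all arbitrary illustrative choices ($\min(s,t)$ happens to be the $m = 1$ case of the spline kernel constructed in the next section).

```python
import numpy as np

# Illustrative kernel on T = [0, 1]: R(s, t) = min(s, t).
# (An arbitrary choice of a symmetric, nonnegative definite kernel.)
def R(s, t):
    return np.minimum(s, t)

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 1.0, size=8)      # points t_1, ..., t_n in T
a = rng.normal(size=8)                 # arbitrary real coefficients a_1, ..., a_n

K = R(t[:, None], t[None, :])          # Gram matrix K_ij = R(t_i, t_j)

print("symmetric:", np.allclose(K, K.T))
print("quadratic form sum_ij a_i a_j R(t_i, t_j):", a @ K @ a)
print("smallest eigenvalue of K:", np.linalg.eigvalsh(K).min())
```

The quadratic form and the smallest eigenvalue should both be nonnegative (up to floating-point error), exactly as the definition requires.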
The following theorem demonstrates the usefulness of this representation.

Theorem 2. For $f \in \mathcal{H}_R$ with $f = \tilde{f} + f^{\perp}$ and $\tilde{f} \in \mathrm{span}(\{R(\cdot, x_i)\}_{i=1}^{N})$,
$$\sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \| f \|^2 \;\ge\; \sum_{i=1}^{N} L(y_i, \tilde{f}(x_i)) + \lambda \| \tilde{f} \|^2, \qquad (3)$$
with equality iff $f^{\perp} = 0$.

Proof. The summations on each side of (3) are equal, since $f$ is only evaluated at the $N$ points $x_i$, and $f^{\perp}(x_i) = \langle f^{\perp}, R(\cdot, x_i) \rangle = 0$ implies $f(x_i) = \tilde{f}(x_i)$. Then observe that $\| f \|^2 = \| \tilde{f} \|^2 + \| f^{\perp} \|^2 \ge \| \tilde{f} \|^2$, with equality iff $f^{\perp} = 0$.

This theorem shows that the minimizer of (1) over $\mathcal{H}_R$, with $J(f) = \| f \|^2$, has the form
$$f_{\lambda} = \sum_{i=1}^{N} c_i R(\cdot, x_i), \qquad (4)$$
and therefore reduces the infinite-dimensional minimization problem (1) to a finite-dimensional one. The sacrifice required was narrowing the space of candidate functions from the space of all functions on $T$ to the space $\mathcal{H}_R$. It is therefore important to choose $R$ so that $\mathcal{H}_R$ contains a broad class of functions. Let us illustrate these ideas with an example.

Application: Polynomial Smoothing Splines

If the input data $\{x_i\}_{i=1}^{N}$ are one-dimensional, then without loss of generality we may assume $T = [0, 1]$. A common choice for smoothing a one-dimensional interpolating function is to penalize the integral of the square of the $m$th derivative of $f$: $J(f) = \int_0^1 (f^{(m)}(x))^2 \, dx$. Our goal is to construct an RKHS whose norm corresponds to this $J$.

Recall that Taylor's theorem with integral remainder states that if $f$ has $m - 1$ absolutely continuous derivatives on $[0, 1]$ and $f^{(m)} \in L^2[0, 1]$, then
$$f(t) = \sum_{i=0}^{m-1} \frac{t^i}{i!} f^{(i)}(0) + \int_0^t \frac{(t - x)^{m-1}}{(m - 1)!} f^{(m)}(x) \, dx. \qquad (5)$$
It will be helpful to rewrite the remainder integral as $\int_0^1 \frac{(t - x)_+^{m-1}}{(m - 1)!} f^{(m)}(x) \, dx$, where $(x)_+ = x$ for $x \ge 0$ and $0$ for negative $x$. If we define $W_m^0$ to be the set of functions satisfying the hypotheses of (5) whose first $m - 1$ derivatives vanish at $t = 0$, then for $f \in W_m^0$ we have
$$f(t) = \int_0^1 G_m(t, x) f^{(m)}(x) \, dx, \qquad (6)$$
where we have written $G_m(t, x)$ for $(t - x)_+^{m-1} / (m - 1)!$. With the inner product
$$\langle f, g \rangle = \int_0^1 f^{(m)}(x) g^{(m)}(x) \, dx$$
and kernel $R^1(s, t) = \int_0^1 G_m(s, x) G_m(t, x) \, dx$, the space $W_m^0$ forms an RKHS. To see that $R^1$ is a kernel, first notice that $R^1(\cdot, t)$ is the $m$-fold integral of $G_m(t, \cdot)$, so it lies in $W_m^0$ with $(R^1_t)^{(m)}(x) = G_m(t, x)$. Then, by (6),
$$\langle f, R^1(\cdot, t) \rangle = \int_0^1 f^{(m)}(x) \, G_m(t, x) \, dx = f(t).$$

Now define the null space of the penalty functional: $\mathcal{H}_0 = \mathrm{span}(\{\phi_i\}_{i=1}^{m})$, where $\phi_i(t) = t^{i-1} / (i - 1)!$. The kernel for $\mathcal{H}_0$ is $R^0(s, t) = \sum_{i=1}^{m} \phi_i(s) \phi_i(t)$. It may easily be verified [1] that the space $W_m$ of functions with $m - 1$ absolutely continuous derivatives and $f^{(m)} \in L^2$ can be written as a direct sum $\mathcal{H} = \mathcal{H}_0 \oplus W_m^0$ with kernel $R = R^0 + R^1$. Also, $J(f)$ is the squared norm of the projection $P f$ of $f$ onto $W_m^0$, so (1) with $J(f) = \int_0^1 (f^{(m)}(x))^2 \, dx$ becomes
$$\sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \| P f \|^2 \qquad (7)$$
for $f \in \mathcal{H}$. This problem differs from the one in Theorem 2 only by the appearance of the projection operator, and in fact, by following the method of the proof of Theorem 2, it is easy to show that the solution of (7) is the natural generalization of (4):
$$f_{\lambda} = \sum_{i=1}^{N} c_i R^1(\cdot, x_i) + \sum_{j=1}^{m} d_j \phi_j. \qquad (8)$$
In other words, $f_{\lambda}$ consists of an unpenalized component from $\mathcal{H}_0$ together with a linear combination of the projections onto $W_m^0$ of the representers of evaluation at the $N$ input data points. For the squared-error loss $L(y_i, f(x_i)) = (y_i - f(x_i))^2$, this solution is in fact a natural polynomial spline. We will prove this in the case $m = 2$; the general case is very similar.
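Before the proof, the following sketch gives a small numerical check of the construction for $m = 2$. It is illustrative only: the quadrature grid, the test function $f(t) = t^3$ (which lies in $W_2^0$ since $f(0) = f'(0) = 0$), and the evaluation points are arbitrary choices, and the closed form of $R^1$ it compares against is the one computed in the proof of Theorem 3 below.

```python
import numpy as np
from math import factorial

m = 2
n = 200_000
dx = 1.0 / n
xs = (np.arange(n) + 0.5) * dx                     # midpoint quadrature grid on [0, 1]

def G(t, x):
    # G_m(t, x) = (t - x)_+^{m-1} / (m - 1)!  (for m = 2 this is just (t - x)_+)
    return np.maximum(t - x, 0.0) ** (m - 1) / factorial(m - 1)

# Check (6) on a test function in W_2^0: f(t) = t^3 has f(0) = f'(0) = 0 and f''(x) = 6x.
t0 = 0.7
print("representation (6):", t0 ** 3, "vs", np.sum(G(t0, xs) * 6.0 * xs) * dx)

# Compare R^1(s, t) = int_0^1 G_m(s, x) G_m(t, x) dx with the closed form
# (1/2) s^2 t - (1/6) s^3 for s <= t, derived in the proof of Theorem 3.
s, t = 0.3, 0.8
numeric = np.sum(G(s, xs) * G(t, xs)) * dx
closed = 0.5 * s ** 2 * t - s ** 3 / 6.0
print("kernel R^1(s, t):", numeric, "vs", closed)
```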
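The finite-dimensional form (8) is also easy to compute with. The sketch below is a minimal illustration for $m = 2$ with squared-error loss; the synthetic data and the value of $\lambda$ are arbitrary, and the matrices $S$, $T$, $M$ are those introduced in the proof of Theorem 3. Assuming $S$ is invertible and $\lambda > 0$, the stationarity conditions of the matrix form of the problem can be rearranged as the block linear system $Mc + Td = y$, $T^{\mathsf T} c = 0$, which the code solves before checking the identity $T^{\mathsf T} c = 0$ used below to show that the fit is linear to the left of $x_1$.

```python
import numpy as np

# A minimal m = 2 (cubic) smoothing-spline fit via the representation (8).
# Synthetic data and lambda are arbitrary illustrative choices.
rng = np.random.default_rng(1)
N = 30
x = np.sort(rng.uniform(0.05, 0.95, size=N))       # inputs in (0, 1)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=N)
lam = 1e-3

def R1(s, t):
    # Closed form of the W_2^0 kernel (see the proof of Theorem 3).
    a, b = np.minimum(s, t), np.maximum(s, t)
    return 0.5 * a ** 2 * b - a ** 3 / 6.0

S = R1(x[:, None], x[None, :])                     # (S)_ij = R^1(x_i, x_j)
T = np.column_stack([np.ones(N), x])               # (T)_ij = phi_j(x_i); phi_1 = 1, phi_2(t) = t
M = S + N * lam * np.eye(N)

# Stationarity conditions, rearranged as  M c + T d = y,  T^T c = 0.
A = np.block([[M, T], [T.T, np.zeros((2, 2))]])
sol = np.linalg.solve(A, np.concatenate([y, np.zeros(2)]))
c, d = sol[:N], sol[N:]

print("T^T c =", T.T @ c)                          # should vanish up to round-off

def f_hat(t):
    # Evaluate the fitted function (8) at the points in t.
    t = np.atleast_1d(t)
    return R1(t[:, None], x[None, :]) @ c + d[0] + d[1] * t

ts = np.array([0.0, 0.02, 0.04])                   # points to the left of x_1
v = f_hat(ts)
print("second difference on [0, x_1]:", v[0] - 2 * v[1] + v[2])   # ~ 0, i.e. linear there
```

The printed values of $T^{\mathsf T} c$ and of the second difference should both be zero up to floating-point error, matching the argument in the proof that follows.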
Theorem 3. The function $f_{\lambda} \in W_2$ that minimizes the penalized residual sum of squares
$$\mathrm{PRSS}(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2 + \lambda \int_0^1 (f''(x))^2 \, dx \qquad (9)$$
is the natural cubic spline with knots at $x_1, \ldots, x_N$.

Proof. For $m = 2$ we calculate
$$R^1(s, t) = \int_0^1 G_2(s, x) G_2(t, x) \, dx = \int_0^1 (s - x)_+ (t - x)_+ \, dx =
\begin{cases} \tfrac{1}{2} s^2 t - \tfrac{1}{6} s^3 & \text{if } s \le t, \\[2pt] \tfrac{1}{2} t^2 s - \tfrac{1}{6} t^3 & \text{if } t \le s, \end{cases}$$
which, as a function of $t$ for fixed $s$, is a cubic on $[0, s]$ and linear on $[s, 1]$, with continuous first and second derivatives. In particular, every $R^1(\cdot, x_i)$ is linear on $[x_N, 1]$, so (8) is automatically linear there; hence (8) differs from the natural cubic spline only in that the latter is also required to be linear on the interval $[0, x_1]$. We will show that this holds for $f_{\lambda}$ as well (in what follows we keep $m$ arbitrary).

Define the $N \times m$ matrix $(T)_{ij} = \phi_j(x_i)$ and the $N \times N$ matrix $(S)_{ij} = R^1(x_i, x_j)$. Dividing by $N$ (which merely rescales $\lambda$), (9) becomes, in matrix notation,
$$\frac{1}{N} \| y - (Sc + Td) \|^2 + \lambda \, c^{\mathsf T} S c. \qquad (10)$$
Setting the derivatives with respect to $c$ and $d$ to zero to minimize (10), we find that
$$c = M^{-1} \bigl( I - T (T^{\mathsf T} M^{-1} T)^{-1} T^{\mathsf T} M^{-1} \bigr) y,$$
where $M = S + N \lambda I$, and that $T^{\mathsf T} c = 0$, from which we obtain $\sum_{i=1}^{N} c_i x_i^k = 0$ for all $k = 0, 1, \ldots, m - 1$. Therefore, for $t \in [0, x_1]$, we have
$$P f_{\lambda}(t) = \sum_{i=1}^{N} c_i R^1(t, x_i) = \int_0^t \frac{(t - x)^{m-1}}{(m - 1)!} \sum_{i=1}^{N} c_i \frac{(x_i - x)^{m-1}}{(m - 1)!} \, dx = 0,$$
since only powers of the $x_i$ no greater than $m - 1$ appear in the expansion of $(x_i - x)^{m-1}$, and each such power is annihilated by the $c_i$. Therefore, on $[0, x_1]$ the only nonzero contribution to $f_{\lambda}$ comes from its $\mathcal{H}_0$ component, which for $m = 2$ consists of a constant and a linear term. Hence $f_{\lambda}$ is linear on $[0, x_1]$, and $f_{\lambda}$ is indeed the natural cubic spline.

References

[1] Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68:337–404.

[2] Hastie, T., R. Tibshirani, and J. H. Friedman (2001). The Elements of Statistical Learning. Springer, New York.

[3] Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.