Lecture 30. Generalization bounds for kernel methods. 18.465

Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact subset. Assume $x_1, \ldots, x_n$ are i.i.d., and $y_1, \ldots, y_n \in \{\pm 1\}$ for classification or $y_1, \ldots, y_n \in [-1, 1]$ for regression. Assume we have a kernel
$$K(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y), \qquad \lambda_i > 0.$$
Consider the map
$$x \in \mathcal{X} \;\mapsto\; \phi(x) = \bigl( \sqrt{\lambda_1}\,\phi_1(x), \ldots, \sqrt{\lambda_k}\,\phi_k(x), \ldots \bigr) = \bigl( \sqrt{\lambda_k}\,\phi_k(x) \bigr)_{k \ge 1} \in \mathcal{H},$$
where $\mathcal{H}$ is a Hilbert space. Consider the scalar product in $\mathcal{H}$: $(u, v)_{\mathcal{H}} = \sum_{i=1}^{\infty} u_i v_i$, with norm $\|u\|_{\mathcal{H}} = \sqrt{(u, u)_{\mathcal{H}}}$. For $x, y \in \mathcal{X}$,
$$(\phi(x), \phi(y))_{\mathcal{H}} = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y) = K(x, y).$$
The function $\phi$ is called the feature map.

Family of classifiers:
$$\mathcal{F}_{\mathcal{H}} = \bigl\{ (w, z)_{\mathcal{H}} : \|w\|_{\mathcal{H}} \le 1 \bigr\}, \qquad \mathcal{F} = \bigl\{ (w, \phi(x))_{\mathcal{H}} : \|w\|_{\mathcal{H}} \le 1 \bigr\} \ni f : \mathcal{X} \to \mathbb{R}.$$

Algorithms:

(1) SVMs:
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) = \Bigl( \underbrace{\sum_{i=1}^{n} \alpha_i \phi(x_i)}_{w},\; \phi(x) \Bigr)_{\mathcal{H}}.$$
Here, instead of taking an arbitrary $w$, we take $w$ only as a linear combination of the images of the data points.

We have a choice of loss function $L$:
- $L(y, f(x)) = I(yf(x) \le 0)$ — classification
- $L(y, f(x)) = (y - f(x))^2$ — regression

(2) Square-loss regularization.

Assume an algorithm outputs a classifier from $\mathcal{F}$ (or $\mathcal{F}_{\mathcal{H}}$), $f(x) = (w, \phi(x))_{\mathcal{H}}$. Then, as in Lecture 20,
$$P(yf(x) \le 0) \le \mathbb{E}\varphi_\delta(yf(x)) = \frac{1}{n}\sum_{i=1}^{n} \varphi_\delta(y_i f(x_i)) + \Bigl( \mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n} \varphi_\delta(y_i f(x_i)) \Bigr)$$
$$\le \frac{1}{n}\sum_{i=1}^{n} I(y_i f(x_i) \le \delta) + \sup_{f \in \mathcal{F}} \Bigl( \mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n} \varphi_\delta(y_i f(x_i)) \Bigr).$$
By McDiarmid's inequality, with probability at least $1 - e^{-t}$,
$$\sup_{f \in \mathcal{F}} \Bigl( \mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n} \varphi_\delta(y_i f(x_i)) \Bigr) \le \mathbb{E}\sup_{f \in \mathcal{F}} \Bigl( \mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n} \varphi_\delta(y_i f(x_i)) \Bigr) + \sqrt{\frac{2t}{n}}.$$
Using the symmetrization technique,
$$\mathbb{E}\sup_{f \in \mathcal{F}} \Bigl( \mathbb{E}\bigl(\varphi_\delta(yf(x)) - 1\bigr) - \frac{1}{n}\sum_{i=1}^{n} \bigl(\varphi_\delta(y_i f(x_i)) - 1\bigr) \Bigr) \le 2\,\mathbb{E}\sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \bigl(\varphi_\delta(y_i f(x_i)) - 1\bigr) \Bigr|.$$
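The identity $(\phi(x), \phi(y))_{\mathcal{H}} = K(x, y)$ can be checked numerically for a kernel whose feature map is finite and explicit. A minimal sketch (not from the lecture; the choice of kernel is an illustrative assumption) using the degree-2 homogeneous polynomial kernel $K(x, y) = (x \cdot y)^2$ on $\mathbb{R}^2$, whose feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

# Degree-2 polynomial kernel on R^2 (illustrative example, not the
# lecture's kernel): K(x, y) = (x . y)^2.
def K(x, y):
    return np.dot(x, y) ** 2

# Its explicit finite-dimensional feature map:
#   phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

# Check (phi(x), phi(y))_H = K(x, y) on random points.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose(np.dot(phi(x), phi(y)), K(x, y))
```

The same check works for any Mercer kernel with a known (truncated) expansion; here the sum in $K(x, y) = \sum_i \lambda_i \phi_i(x)\phi_i(y)$ is simply finite.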
Since $\delta \cdot (\varphi_\delta - 1)$ is a contraction (it is $1$-Lipschitz and vanishes at the origin, since $\varphi_\delta(0) = 1$), the comparison inequality gives
$$2\,\mathbb{E}\sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i \bigl(\varphi_\delta(y_i f(x_i)) - 1\bigr) \Bigr| \le \frac{2}{\delta} \cdot 2\,\mathbb{E}\sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i y_i f(x_i) \Bigr|$$
$$= \frac{4}{\delta}\,\mathbb{E}\sup_{f \in \mathcal{F}} \Bigl| \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i f(x_i) \Bigr| = \frac{4}{\delta}\,\mathbb{E}\sup_{\|w\| \le 1} \Bigl| \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i (w, \phi(x_i))_{\mathcal{H}} \Bigr|$$
(since $\varepsilon_i y_i$ has the same distribution as $\varepsilon_i$)
$$= \frac{4}{\delta n}\,\mathbb{E}\sup_{\|w\| \le 1} \Bigl| \Bigl( w, \sum_{i=1}^{n} \varepsilon_i \phi(x_i) \Bigr)_{\mathcal{H}} \Bigr| = \frac{4}{\delta n}\,\mathbb{E}\Bigl\| \sum_{i=1}^{n} \varepsilon_i \phi(x_i) \Bigr\|_{\mathcal{H}}$$
$$= \frac{4}{\delta n}\,\mathbb{E}\sqrt{\Bigl( \sum_{i=1}^{n} \varepsilon_i \phi(x_i), \sum_{i=1}^{n} \varepsilon_i \phi(x_i) \Bigr)_{\mathcal{H}}} = \frac{4}{\delta n}\,\mathbb{E}\sqrt{\sum_{i,j} \varepsilon_i \varepsilon_j \bigl(\phi(x_i), \phi(x_j)\bigr)_{\mathcal{H}}}$$
$$= \frac{4}{\delta n}\,\mathbb{E}\sqrt{\sum_{i,j} \varepsilon_i \varepsilon_j K(x_i, x_j)} \le \frac{4}{\delta n}\sqrt{\mathbb{E}\sum_{i,j} \varepsilon_i \varepsilon_j K(x_i, x_j)} \quad \text{(Jensen's inequality)}$$
$$= \frac{4}{\delta n}\sqrt{\sum_{i=1}^{n} \mathbb{E}K(x_i, x_i)} = \frac{4}{\delta}\sqrt{\frac{\mathbb{E}K(x_1, x_1)}{n}},$$
where the cross terms vanish because $\mathbb{E}\varepsilon_i \varepsilon_j = 0$ for $i \ne j$.

Putting everything together, with probability at least $1 - e^{-t}$,
$$P(yf(x) \le 0) \le \frac{1}{n}\sum_{i=1}^{n} I(y_i f(x_i) \le \delta) + \frac{4}{\delta}\sqrt{\frac{\mathbb{E}K(x_1, x_1)}{n}} + \sqrt{\frac{2t}{n}}.$$

Before the contraction step, we could have used the martingale (McDiarmid) method again to condition on the data and keep only the expectation $\mathbb{E}_\varepsilon$ over the Rademacher variables. Then $\mathbb{E}K(x_1, x_1)$ in the above bound becomes $\frac{1}{n}\sum_{i=1}^{n} K(x_i, x_i)$.
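The chain of equalities above reduces the Rademacher average of the kernel class to $\frac{1}{n}\mathbb{E}\sqrt{\sum_{i,j} \varepsilon_i \varepsilon_j K(x_i, x_j)}$, which the Gram matrix lets us compute without ever forming $\phi$. A numerical sketch (the kernel, dimensions, and sample sizes are illustrative assumptions, not from the lecture) of the Jensen step
$$\mathbb{E}_\varepsilon\, \frac{1}{n}\Bigl\| \sum_{i=1}^{n} \varepsilon_i \phi(x_i) \Bigr\|_{\mathcal{H}} \le \frac{1}{n}\sqrt{\sum_{i=1}^{n} K(x_i, x_i)}:$$

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))

# Gram matrix G_ij = K(x_i, x_j) for the degree-2 polynomial kernel
# (an illustrative choice): K(x, y) = (x . y)^2.
G = (X @ X.T) ** 2

# Monte Carlo estimate of E_eps (1/n) |sum_i eps_i phi(x_i)|_H, using
# |sum_i eps_i phi(x_i)|_H^2 = eps^T G eps (the kernel trick).
eps = rng.choice([-1.0, 1.0], size=(10_000, n))
quad = np.einsum("ki,ij,kj->k", eps, G, eps)   # eps_k^T G eps_k per draw
lhs = np.mean(np.sqrt(np.maximum(quad, 0.0))) / n

# The bound after Jensen: (1/n) sqrt(sum_i K(x_i, x_i)) = sqrt(tr G)/n.
rhs = np.sqrt(np.trace(G)) / n
assert lhs <= rhs
```

The gap between the two sides is exactly the slack in Jensen's inequality $\mathbb{E}\sqrt{Q} \le \sqrt{\mathbb{E}Q}$; the right-hand side is the data-dependent quantity $\frac{1}{n}\sqrt{\sum_i K(x_i, x_i)}$ mentioned in the final remark.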