Lecture 30 Generalization bounds for kernel methods. 18.465

Let $\mathcal{X} \subset \mathbb{R}^d$ be a compact subset. Assume $x_1, \ldots, x_n$ are i.i.d. and $y_1, \ldots, y_n \in \{\pm 1\}$ for classification and $y_1, \ldots, y_n \in [-1, 1]$ for regression. Assume we have a kernel
$$K(x, y) = \sum_{i=1}^{\infty} \lambda_i \,\phi_i(x)\phi_i(y), \qquad \lambda_i > 0.$$
Consider the map
$$x \in \mathcal{X} \mapsto \phi(x) = \big(\sqrt{\lambda_1}\,\phi_1(x), \ldots, \sqrt{\lambda_k}\,\phi_k(x), \ldots\big) = \big(\sqrt{\lambda_k}\,\phi_k(x)\big)_{k \ge 1} \in \mathcal{H},$$
where $\mathcal{H}$ is a Hilbert space.
Consider the scalar product in $\mathcal{H}$: $(u, v)_{\mathcal{H}} = \sum_{i=1}^{\infty} u_i v_i$, with the norm $\|u\|_{\mathcal{H}} = \sqrt{(u, u)_{\mathcal{H}}}$.
For $x, y \in \mathcal{X}$,
$$(\phi(x), \phi(y))_{\mathcal{H}} = \sum_{i=1}^{\infty} \lambda_i \phi_i(x)\phi_i(y) = K(x, y).$$
The function $\phi$ is called the feature map.
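As a sanity check of the identity $(\phi(x), \phi(y))_{\mathcal{H}} = K(x, y)$, here is a minimal Python sketch for a kernel with a finite-dimensional feature map: the quadratic kernel $K(x, y) = (x \cdot y)^2$ on $\mathbb{R}^2$. The kernel choice and helper names are illustrative, not part of the lecture:

```python
import math

def K(x, y):
    # Quadratic kernel on R^2: K(x, y) = (x . y)^2
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit feature map for this K: here H = R^3 with coordinates
    # (x1^2, sqrt(2) x1 x2, x2^2)
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def inner(u, v):
    # Scalar product (u, v)_H = sum_i u_i v_i
    return sum(ui * vi for ui, vi in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(K(x, y))                # 1.0
print(inner(phi(x), phi(y)))  # agrees with K(x, y) up to float error
```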
The family of classifiers:
$$\mathcal{F}_{\mathcal{H}} = \{z \mapsto (w, z)_{\mathcal{H}} : \|w\|_{\mathcal{H}} \le 1\}, \qquad \mathcal{F} = \{x \mapsto (w, \phi(x))_{\mathcal{H}} : \|w\|_{\mathcal{H}} \le 1\} \ni f : \mathcal{X} \to \mathbb{R}.$$
Algorithms:
(1) SVMs:
$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) = \Big(\underbrace{\sum_{i=1}^{n} \alpha_i \phi(x_i)}_{w},\ \phi(x)\Big)_{\mathcal{H}}.$$
Here, instead of taking an arbitrary $w$, we only take $w$ that is a linear combination of the images of the data points. We have a choice of loss function $L$:
• $L(y, f(x)) = I(yf(x) \le 0)$ – classification
• $L(y, f(x)) = (y - f(x))^2$ – regression
(2) Square-loss regularization
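Item (1) can be sketched in a few lines of Python: $f$ is evaluated through the kernel expansion $f(x) = \sum_i \alpha_i K(x_i, x)$, never through $w$ explicitly. The linear kernel, the toy data, and the coefficients $\alpha_i$ below are made-up illustrations, not the output of an actual SVM solver:

```python
def K(x, y):
    # Linear kernel for illustration; any valid kernel works here
    return sum(xi * yi for xi, yi in zip(x, y))

def f(x, alphas, xs):
    # f(x) = sum_i alpha_i K(x_i, x); implicitly w = sum_i alpha_i phi(x_i)
    return sum(a * K(xi, x) for a, xi in zip(alphas, xs))

def zero_one_loss(y, fx):
    # classification loss L(y, f(x)) = I(y f(x) <= 0)
    return 1.0 if y * fx <= 0 else 0.0

def square_loss(y, fx):
    # regression loss L(y, f(x)) = (y - f(x))^2
    return (y - fx) ** 2

# hypothetical data points and coefficients
xs = [(1.0, 0.0), (0.0, 1.0)]
alphas = [1.0, -1.0]
x_new = (2.0, 1.0)
fx = f(x_new, alphas, xs)
print(fx)                                           # 1.0
print(zero_one_loss(+1, fx), square_loss(+1, fx))   # 0.0 0.0
```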
Assume an algorithm outputs a classifier from $\mathcal{F}$ (or $\mathcal{F}_{\mathcal{H}}$), $f(x) = (w, \phi(x))_{\mathcal{H}}$. Then, as in Lecture 20,
$$P(yf(x) \le 0) \le \mathbb{E}\varphi_\delta(yf(x)) = \frac{1}{n}\sum_{i=1}^{n}\varphi_\delta(y_i f(x_i)) + \Big(\mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n}\varphi_\delta(y_i f(x_i))\Big)$$
$$\le \frac{1}{n}\sum_{i=1}^{n} I(y_i f(x_i) \le \delta) + \sup_{f \in \mathcal{F}}\Big(\mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n}\varphi_\delta(y_i f(x_i))\Big).$$
By McDiarmid's inequality, with probability at least $1 - e^{-t}$,
$$\sup_{f \in \mathcal{F}}\Big(\mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n}\varphi_\delta(y_i f(x_i))\Big) \le \mathbb{E}\sup_{f \in \mathcal{F}}\Big(\mathbb{E}\varphi_\delta(yf(x)) - \frac{1}{n}\sum_{i=1}^{n}\varphi_\delta(y_i f(x_i))\Big) + \sqrt{\frac{2t}{n}}.$$
Using the symmetrization technique,
$$\mathbb{E}\sup_{f \in \mathcal{F}}\Big(\mathbb{E}\big(\varphi_\delta(yf(x)) - 1\big) - \frac{1}{n}\sum_{i=1}^{n}\big(\varphi_\delta(y_i f(x_i)) - 1\big)\Big) \le 2\,\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(\varphi_\delta(y_i f(x_i)) - 1\big)\Big|.$$
(The constant $1$ is subtracted so that $\varphi_\delta - 1$ vanishes at the origin, as required for the contraction step; it does not change the left-hand side.)
Since $\delta \cdot (\varphi_\delta - 1)$ is a contraction,
$$2\,\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i\big(\varphi_\delta(y_i f(x_i)) - 1\big)\Big| \le \frac{2}{\delta}\cdot 2\,\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i y_i f(x_i)\Big|$$
$$= \frac{4}{\delta}\,\mathbb{E}\sup_{f \in \mathcal{F}}\Big|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i f(x_i)\Big| = \frac{4}{\delta}\,\mathbb{E}\sup_{\|w\|_{\mathcal{H}} \le 1}\Big|\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i (w, \phi(x_i))_{\mathcal{H}}\Big|,$$
since $\varepsilon_i y_i$ has the same distribution as $\varepsilon_i$.
$$= \frac{4}{\delta n}\,\mathbb{E}\sup_{\|w\|_{\mathcal{H}} \le 1}\Big|\Big(w,\ \sum_{i=1}^{n}\varepsilon_i\phi(x_i)\Big)_{\mathcal{H}}\Big| = \frac{4}{\delta n}\,\mathbb{E}\Big\|\sum_{i=1}^{n}\varepsilon_i\phi(x_i)\Big\|_{\mathcal{H}}$$
(the supremum is attained by taking $w$ in the direction of $\sum_{i=1}^{n}\varepsilon_i\phi(x_i)$)
$$= \frac{4}{\delta n}\,\mathbb{E}\sqrt{\Big(\sum_{i=1}^{n}\varepsilon_i\phi(x_i),\ \sum_{j=1}^{n}\varepsilon_j\phi(x_j)\Big)_{\mathcal{H}}} = \frac{4}{\delta n}\,\mathbb{E}\sqrt{\sum_{i,j}\varepsilon_i\varepsilon_j\,(\phi(x_i), \phi(x_j))_{\mathcal{H}}}$$
$$= \frac{4}{\delta n}\,\mathbb{E}\sqrt{\sum_{i,j}\varepsilon_i\varepsilon_j K(x_i, x_j)} \le \frac{4}{\delta n}\sqrt{\mathbb{E}\sum_{i,j}\varepsilon_i\varepsilon_j K(x_i, x_j)} = \frac{4}{\delta n}\sqrt{\sum_{i=1}^{n}\mathbb{E}K(x_i, x_i)} = \frac{4}{\delta}\sqrt{\frac{\mathbb{E}K(x_1, x_1)}{n}},$$
where the inequality is Jensen's, and the middle equality uses $\mathbb{E}\,\varepsilon_i\varepsilon_j = 0$ for $i \ne j$.
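The Jensen step in the chain above, $\mathbb{E}\sqrt{S} \le \sqrt{\mathbb{E}S}$ with $S = \sum_{i,j}\varepsilon_i\varepsilon_j K(x_i, x_j)$, can be checked numerically. A minimal Monte Carlo sketch, where the Gaussian kernel and the sample sizes are arbitrary choices of this illustration:

```python
import math
import random

def K(x, y):
    # Gaussian (RBF) kernel on R; note K(x, x) = 1 for every x
    return math.exp(-(x - y) ** 2)

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(20)]
n = len(xs)
G = [[K(xi, xj) for xj in xs] for xi in xs]  # Gram matrix

def rademacher_norm():
    # one draw of sqrt( sum_{i,j} eps_i eps_j K(x_i, x_j) ),
    # i.e. the norm || sum_i eps_i phi(x_i) ||_H for random signs eps_i
    eps = [random.choice([-1.0, 1.0]) for _ in range(n)]
    s = sum(eps[i] * eps[j] * G[i][j] for i in range(n) for j in range(n))
    return math.sqrt(max(s, 0.0))  # guard against tiny negative round-off

# Monte Carlo estimate of E sqrt(S) versus the Jensen upper bound sqrt(E S)
mc = sum(rademacher_norm() for _ in range(2000)) / 2000
jensen = math.sqrt(sum(G[i][i] for i in range(n)))  # sqrt(sum_i K(x_i, x_i))
print(mc <= jensen)  # True
```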
Putting everything together, with probability at least $1 - e^{-t}$,
$$P(yf(x) \le 0) \le \frac{1}{n}\sum_{i=1}^{n} I(y_i f(x_i) \le \delta) + \frac{4}{\delta}\sqrt{\frac{\mathbb{E}K(x_1, x_1)}{n}} + \sqrt{\frac{2t}{n}}.$$
Before the contraction step, we could have used the martingale method again to keep only $\mathbb{E}_\varepsilon$, the expectation over the random signs. Then $\mathbb{E}K(x_1, x_1)$ in the above bound becomes $\frac{1}{n}\sum_{i=1}^{n} K(x_i, x_i)$.
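A sketch of the resulting data-dependent bound, with $\mathbb{E}K(x_1, x_1)$ replaced by $\frac{1}{n}\sum_i K(x_i, x_i)$ as in the remark. The margins $y_i f(x_i)$ and the kernel diagonal fed in below are hypothetical inputs, not from a trained classifier:

```python
import math
import random

def margin_bound(margins, K_diag, delta, t):
    # Data-dependent version of the final bound:
    # (1/n) sum_i I(y_i f(x_i) <= delta)
    #   + (4/delta) * sqrt( (1/n) sum_i K(x_i, x_i) / n )
    #   + sqrt(2t/n)
    n = len(margins)
    margin_err = sum(1.0 for m in margins if m <= delta) / n
    avg_diag = sum(K_diag) / n
    return margin_err + (4.0 / delta) * math.sqrt(avg_diag / n) + math.sqrt(2.0 * t / n)

# hypothetical margins and kernel diagonal (e.g. RBF kernel, K(x, x) = 1)
random.seed(1)
margins = [random.uniform(-0.2, 1.0) for _ in range(1000)]
K_diag = [1.0] * 1000
print(margin_bound(margins, K_diag, delta=0.5, t=3.0))
```

Note the trade-off in $\delta$: a larger margin threshold shrinks the complexity term $\frac{4}{\delta}\sqrt{\cdot}$ but counts more points as margin violations.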