Lecture 24
Bounds in terms of sparsity (example).
18.465
In this lecture, we give another example of a margin-sparsity bound, involving mixture-of-experts type models. Let $H$ be a set of functions $h_i : X \to [-1,+1]$ with finite VC dimension. Let $C_1, \dots, C_m$ be a partition of $H$ into $m$ clusters, $H = \bigcup_{i=1}^m C_i$. The elements of the convex hull $\operatorname{conv} H$ take the form
$$ f = \sum_{i=1}^T \lambda_i h_i = \sum_{c \in \{C_1,\dots,C_m\}} \alpha_c \sum_{h \in c} \lambda_h^c \cdot h, $$
where $T \ge m$, $\sum_i \lambda_i = 1$, $\alpha_c = \sum_{h \in c} \lambda_h$, and $\lambda_h^c = \lambda_h / \alpha_c$ for $h \in c$. We can approximate $f$ by $g$ as follows. For each cluster $c$, let $\{Y_k^c\}_{k=1,\dots,N}$ be random variables such that for all $h \in c \subset H$ we have $P(Y_k^c = h) = \lambda_h^c$. Then $E Y_k^c = \sum_{h \in c} \lambda_h^c \cdot h$. Let $Z_k = \sum_c \alpha_c Y_k^c$ and
$$ g = \sum_c \alpha_c \frac{1}{N} \sum_{k=1}^N Y_k^c = \frac{1}{N} \sum_{k=1}^N Z_k . $$
Then $E Z_k = E g = f$. We define
$$ \sigma_c^2 = \operatorname{var}(Z_k) = \sum_c \alpha_c^2 \operatorname{var}(Y_k^c), \qquad \operatorname{var}(Y_k^c) = \big\| Y_k^c - E Y_k^c \big\|^2 = \sum_{h \in c} \lambda_h^c \big( h - E Y_k^c \big)^2 . $$
(If we instead define $\{Y_k\}_{k=1,\dots,N}$ to be random variables such that for all $h \in H$, $P(Y_k = h) = \lambda_h$, and define $g = \frac{1}{N} \sum_{k=1}^N Y_k$, we might get a much larger $\operatorname{var}(Y_k)$.)
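To make the construction concrete, here is a minimal numerical sketch (not part of the original notes): the values $h_i(x)$, the weights $\lambda_i$, and the two clusters below are made-up toy data, and each function is represented by its value at a single fixed point $x$.

```python
import numpy as np

# Minimal sketch of the cluster-based randomized approximation g of f.
# Everything below (the values h_i(x), the weights lambda_i, the clusters)
# is made-up toy data, not from the lecture.

rng = np.random.default_rng(0)

h = np.array([0.9, 0.8, 0.85, -0.7, -0.6, -0.65])      # values h_i(x), each in [-1, 1]
lam = np.array([0.3, 0.1, 0.1, 0.2, 0.2, 0.1])          # lambda_i, summing to 1
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5])]   # partition C_1, C_2 (m = 2)

f_x = np.dot(lam, h)                                    # f(x) = sum_i lambda_i h_i(x)

alpha = np.array([lam[c].sum() for c in clusters])      # alpha_c = sum_{h in c} lambda_h
lam_c = [lam[c] / lam[c].sum() for c in clusters]       # lambda_h^c = lambda_h / alpha_c

# Z_k = sum_c alpha_c Y_k^c, where P(Y_k^c = h) = lambda_h^c within cluster c,
# and g = (1/N) sum_k Z_k.
N = 50
Z = np.zeros(N)
for k in range(N):
    for idx, a, p in zip(clusters, alpha, lam_c):
        Z[k] += a * h[rng.choice(idx, p=p)]
g_x = Z.mean()

# sigma_c^2 = var(Z_k) = sum_c alpha_c^2 var(Y_k^c)
var_Ykc = [np.dot(p, (h[idx] - np.dot(p, h[idx])) ** 2) for idx, p in zip(clusters, lam_c)]
sigma_c2 = np.dot(alpha ** 2, np.array(var_Ykc))

print(f"f(x) = {f_x:.3f}, g(x) = {g_x:.3f} (E g = f), sigma_c^2 = {sigma_c2:.4f}")
```

Because the values $h_i(x)$ within each cluster are close to each other here, $\sigma_c^2$ comes out much smaller than the variance obtained by sampling directly from all of $H$ with probabilities $\lambda_h$, which is the point of the clustered construction.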
[Figure: the set $H$ partitioned into clusters $C_1, C_2, \dots, C_m$, with a function $h$ shown inside one of the clusters.]
Recall that a classifier takes the form y = sign(f (x)) and a classification error corresponds to yf (x) < 0. We
can bound the error by
$$ P(y f(x) < 0) \le P(y g \le \delta) + P(\sigma_c^2 > r) + P\big( y g > \delta \,\big|\, y f(x) \le 0,\ \sigma_c^2 < r \big). \tag{24.1} $$
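One way to read (24.1) (a sketch of the decomposition, not spelled out in the notes; strict and non-strict inequalities are treated loosely):
$$ \{ y f(x) < 0 \} \subseteq \{ y g \le \delta \} \cup \{ \sigma_c^2 > r \} \cup \big( \{ y g > \delta \} \cap \{ y f(x) \le 0 \} \cap \{ \sigma_c^2 \le r \} \big), $$
and the probability of the last event is at most the conditional probability $P(y g > \delta \mid y f(x) \le 0, \sigma_c^2 < r)$, since $P(A \cap B) = P(A \mid B) P(B) \le P(A \mid B)$.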
The third term on the right-hand side of inequality (24.1) can be bounded in the following way:
$$
\begin{aligned}
P\big( y g > \delta \,\big|\, y f(x) \le 0,\ \sigma_c^2 < r \big)
&= P\Big( \frac{1}{N} \sum_{k=1}^N (y Z_k - E\, y Z_k) > \delta - y f(x) \,\Big|\, y f(x) \le 0,\ \sigma_c^2 < r \Big) \\
&\le P\Big( \frac{1}{N} \sum_{k=1}^N (y Z_k - E\, y Z_k) > \delta \,\Big|\, y f(x) \le 0,\ \sigma_c^2 < r \Big) \\
&\le \exp\Big( - \frac{N^2 \delta^2}{2 N \sigma_c^2 + \frac{2}{3} N \delta \cdot 2} \Big) \qquad \text{(Bernstein's inequality)} \\
&\le \exp\Big( - \min\Big( \frac{N^2 \delta^2}{4 N \sigma_c^2},\ \frac{N^2 \delta^2}{\frac{8}{3} N \delta} \Big) \Big) \\
&\le \exp\Big( - \frac{N \delta^2}{4 r} \Big), \qquad \text{for } r \text{ small enough} \\
&\overset{\text{set}}{\le} \frac{1}{n} .
\end{aligned}
\tag{24.2}
$$
As a result, for all $N \ge \frac{4 r}{\delta^2} \log n$, inequality (24.2) is satisfied.
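For a sense of scale (with made-up numbers, not from the lecture): taking $\delta = 0.1$, $r = 0.01$ and $n = 1000$, the condition reads
$$ N \ge \frac{4 r}{\delta^2} \log n = \frac{4 \cdot 0.01}{0.01} \log 1000 = 4 \log 1000 \approx 27.6 , $$
so roughly $28$ draws per cluster already make (24.2) hold.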
To bound the first term on the right-hand side of inequality (24.1), we note that $E_{Y_1,\dots,Y_N} P(y g \le \delta) \le E_{Y_1,\dots,Y_N} E\, \varphi_\delta(y g)$ and $E_n \varphi_\delta(y g) \le P_n(y g \le 2\delta)$ for some $\varphi_\delta$:
[Figure: graph of $\varphi_\delta(s)$, equal to $1$ for $s \le \delta$ and decreasing to $0$ at $s = 2\delta$.]

Any realization of $g = \frac{1}{N} \sum_{k=1}^N Z_k$ is a linear combination of functions $h \in H$, with a number $N_m$ of terms that depends on the number of clusters $(C_1, \dots, C_m)$, so that $g \in \operatorname{conv}_{N_m} H$. According to lemma 20.2,
$$ \big( E \varphi_\delta(y g) - E_n \varphi_\delta(y g) \big) \Big/ \sqrt{ E \varphi_\delta(y g) } \;\le\; K \sqrt{ \frac{V N_m \log\frac{n}{\delta}}{n} + \frac{u}{n} } $$
with probability at least $1 - e^{-u}$. Using a technique developed earlier in this course, and taking the union bound over all $m, \delta$, we get, with probability at least $1 - K e^{-u}$,
$$ P(y g \le \delta) \;\le\; K \inf_{m,\delta} \Big[ P_n(y g \le 2\delta) + \frac{V \cdot N_m}{n} \log\frac{n}{\delta} + \frac{u}{n} \Big] . $$
(Since $E P_n(y g \le 2\delta) \le E P_n(y f(x) \le 3\delta) + E P_n(\sigma_c^2 \ge r) + \frac{1}{n}$ with an appropriate choice of $N$, based on the same reasoning as inequality (24.1), we can also control $P_n(y g \le 2\delta)$ by $P_n(y f \le 3\delta)$ and $P_n(\sigma_c^2 \ge r)$ probabilistically.)
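The parenthetical claim can be sketched with the same decomposition as (24.1), shifted by one margin level (a sketch, not given explicitly in the notes):
$$ E\, P_n(y g \le 2\delta) \;\le\; E\, P_n(y f(x) \le 3\delta) + E\, P_n(\sigma_c^2 \ge r) + E\, P_n\big( y g \le 2\delta,\ y f(x) > 3\delta,\ \sigma_c^2 < r \big), $$
and for a sample point with $y f(x) > 3\delta$ and $\sigma_c^2 < r$, the event $y g \le 2\delta$ forces $y g - y f(x) < -\delta$, whose probability over the draws $Y_k^c$ is at most $\frac{1}{n}$ by the same Bernstein argument as in (24.2), once $N \ge \frac{4 r}{\delta^2} \log n$.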
To bound the second term on the right-hand side of inequality (24.1), we approximate $\sigma_c^2$ by
$$ \sigma_N^2 = \frac{1}{N} \sum_{k=1}^N \frac{1}{2} \big( Z_k^{(1)} - Z_k^{(2)} \big)^2 , $$
where $Z_k^{(1)}$ and $Z_k^{(2)}$ are independent copies of $Z_k$. We have
$$ E_{Y^{(1,2)}_{1,\dots,N}}\, \sigma_N^2 \;=\; E\, \frac{1}{2} \big( Z_k^{(1)} - Z_k^{(2)} \big)^2 \;=\; \sigma_c^2 $$
and
$$
\begin{aligned}
\operatorname{var}_{Y^{(1,2)}_{1,\dots,N}} \sigma_N^2
&= \frac{1}{N} \cdot \frac{1}{4} \operatorname{var}\big( Z_k^{(1)} - Z_k^{(2)} \big)^2
\;\le\; \frac{1}{N} \cdot \frac{1}{4}\, E \big( Z_k^{(1)} - Z_k^{(2)} \big)^4 \\
&\le \frac{1}{N}\, E \big( Z_k^{(1)} - Z_k^{(2)} \big)^2
\qquad \Big( -1 \le Z_k^{(1)}, Z_k^{(2)} \le 1 , \text{ so } \big( Z_k^{(1)} - Z_k^{(2)} \big)^2 \le 4 \Big) \\
&= \frac{2 \sigma_c^2}{N} \;\le\; 2 \cdot \sigma_c^2 .
\end{aligned}
$$
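The identity $E\, \frac{1}{2} \big( Z_k^{(1)} - Z_k^{(2)} \big)^2 = \sigma_c^2$ used above is the standard fact for independent copies (spelled out here for completeness):
$$ E \big( Z_k^{(1)} - Z_k^{(2)} \big)^2 = \operatorname{var}\big( Z_k^{(1)} \big) + \operatorname{var}\big( Z_k^{(2)} \big) + \big( E Z_k^{(1)} - E Z_k^{(2)} \big)^2 = 2 \operatorname{var}(Z_k) = 2 \sigma_c^2 , $$
so each term $\frac{1}{2} \big( Z_k^{(1)} - Z_k^{(2)} \big)^2$ has mean exactly $\sigma_c^2$.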
We start with
$$ P_{Y_{1,\dots,N}}(\sigma_c^2 \ge 4r) \;\le\; P_{Y^{(1,2)}_{1,\dots,N}}(\sigma_N^2 \ge 3r) + P_{Y^{(1,2)}_{1,\dots,N}}\big( \sigma_c^2 \ge 4r \,\big|\, \sigma_N^2 \le 3r \big) \;\le\; E_{Y^{(1,2)}_{1,\dots,N}}\, \varphi_r(\sigma_N^2) + \frac{1}{n} $$
with an appropriate choice of $N$, following the same line of reasoning as in inequality (24.1).
We note that $P_{Y_1,\dots,Y_N}(\sigma_N^2 \ge 3r) \le E_{Y_1,\dots,Y_N}\, \varphi_r(\sigma_N^2)$, and $E_n \varphi_r(\sigma_N^2) \le P_n(\sigma_N^2 \ge 2r)$ for some $\varphi_r$:
[Figure: graph of $\varphi_r(s)$, equal to $0$ for $s \le 2r$ and to $1$ for $s \ge 3r$.]

Since
$$ \sigma_N^2 \;\in\; \Big\{ \frac{1}{2N} \sum_{k=1}^N \Big( \sum_c \alpha_c \big( h_{k,c}^{(1)} - h_{k,c}^{(2)} \big) \Big)^2 \,:\, h_{k,c}^{(1)}, h_{k,c}^{(2)} \in H \Big\} \;\subset\; \operatorname{conv}_{N_m} \{ h_i \cdot h_j : h_i, h_j \in H \} , $$
and $\log D\big( \{ h_i \cdot h_j : h_i, h_j \in H \}, \varepsilon \big) \le K V \log\frac{2}{\varepsilon}$ by the assumption of our problem, we have $\log D\big( \operatorname{conv}_{N_m} \{ h_i \cdot h_j : h_i, h_j \in H \}, \varepsilon \big) \le K V \cdot N_m \cdot \log\frac{2}{\varepsilon}$ by the VC inequality, and
$$ \big( E \varphi_r(\sigma_N^2) - E_n \varphi_r(\sigma_N^2) \big) \Big/ \sqrt{ E \varphi_r(\sigma_N^2) } \;\le\; K \sqrt{ \frac{V \cdot N_m \log\frac{n}{r}}{n} + \frac{u}{n} } $$
with probability at least $1 - e^{-u}$. Using a technique developed earlier in this course, and taking the union bound over all $m, \delta, r$, with probability at least $1 - K e^{-u}$,
$$ P(\sigma_c^2 \ge 4r) \;\le\; K \inf_{m,\delta,r} \Big[ P_n(\sigma_N^2 \ge 2r) + \frac{1}{n} + \frac{V \cdot N_m}{n} \log\frac{n}{\delta} + \frac{u}{n} \Big] . $$
As a result, with probability at least $1 - K e^{-u}$, we have
$$ P(y f(x) \le 0) \;\le\; K \cdot \inf_{r,\delta,m} \Big[ P_n(y g \le 2\delta) + P_n(\sigma_N^2 \ge r) + \frac{V \cdot \min( r m / \delta^2 ,\ N_m )}{n} \log\frac{n}{\delta} \log n + \frac{u}{n} \Big] $$
for all $f \in \operatorname{conv} H$.
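To see the sparsity effect in the final bound, one can plug in the toy numbers used after (24.2) (made-up, for illustration only): with $r = 0.01$, $m = 2$ and $\delta = 0.1$,
$$ \frac{r m}{\delta^2} = \frac{0.01 \cdot 2}{0.01} = 2 , $$
so the complexity term is of order $\frac{2 V}{n} \log\frac{n}{\delta} \log n$; it is governed by the number of clusters $m$ and the within-cluster variance level $r$, not by the (possibly much larger) number $T$ of terms in the representation of $f$.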