Lecture 24 Bounds in terms of sparsity (example). 18.465 In this lecture, we give another example of margin-sparsity bound involved with mixture-of-experts type of models. Let H be a set of functions hi : X → [−1, +1] with finite VC dimension. Let C1 , · · · , Cm m be partitions of H into m clusters H = i=1 Ci . The elements in the convex hull convH takes the form �T � � � � f = i=1 λi hi = c∈{C1 ,··· ,Cm } αc h∈c λch · h, where T � m, i λi = 1, αc = h∈c λh , and λch = λh /αc for h ∈ c. We can approximate f by g as follows. For each cluster c, let {Ykc }k=1,··· ,N be random variables � � c c such that ∀h ∈ c ⊂ H, we have P(Ykc = h) = λch . Then EYkc = h∈c λh · h. Let Zk = c αc Yk and � � � � � N N g = c αc N1 k=1 Ykc = N1 k=1 Zk . Then EZk = Eg = f . We define σc2 = var(Zk ) = c αc2 var(Ykc ), where � 2 var(Ykc ) = �Ykc − EYkc �2 = h∈c λch (h − EYhc ) . (If we define {Yk }k=1,··· ,N be random variables such that �N ∀h ∈ H, P(Yk = h) = λh , and define g = N1 k=1 Yk , we might get much larger var(Yk )). . C2 . . ... . C1 h .. C m Recall that a classifier takes the form y = sign(f (x)) and a classification error corresponds to yf (x) < 0. We can bound the error by P(yf (x) < 0) ≤ P(yg ≤ δ) + P(σc2 > r) + P(yg > δ|yf (x) ≤ 0, σc2 < r). (24.1) The third term on the right side of inequality 24.1 can be bounded in the following way, � P(yg > δ|yf (x) ≤ 0, σc2 < r) = ≤ ≤ ≤ ≤ set ≤ (24.2) As a result, ∀N ≥ 4·r δ2 N 1 � P (yZk − EyZk ) > δ − yf (x)|yf (x) ≤ 0, σc2 < r N k=1 � � N 1 � 2 P (yZk − EyZk ) > δ|yf (x) ≤ 0, σc < r N k=1 � � N 2 δ2 exp − ,Bernstein’s inequality 2N σc2 + 23 N δ · 2 � � 2 2 �� N δ N 2 δ2 exp − min , 4N σc2 83 N δ � � N δ2 exp − , for r small enough 4r � 1 . n log n, inequality 24.2 is satisfied. To bound the first term on the right side of inequality 24.1, we note that EY1 ,··· ,YN P(yg ≤ δ) ≤ EY1 ,··· ,YN Eφδ (yg) and En φδ (yg) ≤ Pn (yg ≤ 2δ) for some φδ : 61 Lecture 24 Bounds in terms of sparsity (example). 18.465 ϕδ (s) 1 δ 2δ Any realization of g = �Nm k=1 s Zk , where Nm depends on the number of clusters (C1 , · · · , Cm ), is a linear combination of h ∈ H, and g ∈ convNm H. According to lemma 20.2, (Eφδ (yg) − En φδ (yg)) / �� � Eφδ (yg) ≤ K � n V Nm log /n + u/n δ � with probability at least 1 − e−u . Using a technique developed earlier in this course, and taking the union bound over all m, δ, we get, with probability at least 1 − Ke−u , � P(yg ≤ δ) ≤ K inf m,δ Pn (yg ≤ 2δ) + (Since EPn (yg ≤ 2δ) ≤ EPn (yf (x) ≤ 3δ) + EPn (σc2 ≥ r) + V · Nm n u log + n δ n 1 n � . with appropriate choice of N , based on the same reasoning as inequality 24.1, we can also control Pn (yg ≤ 2δ) by Pn (yf ≤ 3δ) and Pn (σc2 ≥ r) probabilistically). To bound the second term on the right side of inequality 24.1, we approximate σc2 by � �2 �N (1) (2) (1) (2) 2 σN = N1 k=1 12 Zk − Zk where Zk and Zk are independent copies of Zk . We have 2 EY (1,2) σN = 1,··· ,N �2 1 � (1) (2) Zk − Zk 1,··· ,N 2 varY (1,2) σc2 � �2 1 (1) (2) var Zk − Zk 4 �4 1 � (1) (2) E Zk − Zk ≤ 4 � � � �2 (1) (2) (1) (2) −1 ≤ Zk , Zk ≤ 1 ,and Zk − Zk ≤4 = � �2 (1) (2) ≤ E Zk − Zk = 2 varY (1,2) σN 1,··· ,N 2σc2 ≤ 2 · σc2 . 62 Lecture 24 Bounds in terms of sparsity (example). 18.465 We start with 2 2 PY1,··· ,N (σc2 ≥ 4r) ≤ PY (1,2) (σN ≥ 3r) + PY (1,2) (σc2 ≥ 4r|σN ≤ 3r) 1,··· ,N 1,··· ,N 2 ≤ EY (1,2) φr σN ≥ 3r + � � 1,··· ,N 1 n with appropriate choice of N , following the same line of reasoning as in inequality 24.1. We note that 2 2 2 2 PY1 ,··· ,YN (σN ≥ 3r) ≤ EY1 ,··· ,YN φr (σN ), and En φδ (σN ) ≤ Pn (σN ≥ 2r) for some φδ . ϕδ (s) 1 δ 2δ 3δ 4δ Since 2 σN s � � N � 2 1 � � � (1) (2) (1) (2) ∈{ αc hk,c − hk,c : hk,c , hk,c ∈ H} ⊂ convNm {hi · hj : hi , hj ∈ H}, 2N c k=1 and log D({hi · hj : hi , hj ∈ H}, �) ≤ KV log hj : hi , hj ∈ H}, �) ≤ KV · Nm · log 2 � by the assumption of our problem, we have log D(convNm {hi · 2 � by the VC inequality, and �� � � � � � n 2 2 2 ) ≤ K Eφr (σN ) − En φr (σN ) / Eφr (σN V · Nm log /n + u/n r with probability at least 1 − e−u . Using a technique developed earlier in this course, and taking the union bound over all m, δ, r, with probability at least 1 − Ke−u , � � 1 V · Nm n u 2 P(σc2 ≥ 4r) ≤ K inf Pn (σN ≥ 2r) + + log + . m,δ,r n n δ n As a result, with probability at least 1 − Ke−u , we have � � V · min(rm /δ 2 , Nm ) n u 2 P(yf (x) ≤ 0) ≤ K · inf Pn (yg ≤ 2 · δ) + Pn (σN ≥ r) + log log n + r,δ,m n δ n for all f ∈ convH. 63