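Lecture 23. Bounds in terms of sparsity. 18.465

Let $f = \sum_{i=1}^{T} \lambda_i h_i$, where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_T \ge 0$. Rewrite $f$ as
\[
f = \sum_{i=1}^{d} \lambda_i h_i + \sum_{i=d+1}^{T} \lambda_i h_i = \sum_{i=1}^{d} \lambda_i h_i + \gamma(d) \sum_{i=d+1}^{T} \lambda'_i h_i ,
\]
where $\gamma(d) = \sum_{i=d+1}^{T} \lambda_i$ and $\lambda'_i = \lambda_i / \gamma(d)$.

Consider the following random approximation of $f$:
\[
g = \sum_{i=1}^{d} \lambda_i h_i + \gamma(d)\, \frac{1}{k} \sum_{j=1}^{k} Y_j ,
\]
where, as in the previous lectures, $P(Y_j = h_i) = \lambda'_i$ for $i = d+1, \dots, T$ and any $j = 1, \dots, k$. Recall that $E Y_j = \sum_{i=d+1}^{T} \lambda'_i h_i$.

Then
\[
P(yf(x) \le 0) = P(yf(x) \le 0,\ yg(x) \le \delta) + P(yf(x) \le 0,\ yg(x) > \delta)
\le P(yg(x) \le \delta) + E\, P_Y\big( yf(x) \le 0,\ yg(x) \ge \delta \,\big|\, (x,y) \big).
\]
Furthermore,
\[
P_Y\big( yf(x) \le 0,\ yg(x) \ge \delta \,\big|\, (x,y) \big)
\le P_Y\big( yg(x) - yf(x) \ge \delta \,\big|\, (x,y) \big)
= P_Y\Big( \gamma(d)\, y \Big( \frac{1}{k} \sum_{j=1}^{k} Y_j(x) - E Y_1(x) \Big) \ge \delta \,\Big|\, (x,y) \Big).
\]
By renaming $Y'_j = \frac{y Y_j + 1}{2} \in [0,1]$ and applying Hoeffding's inequality, we get
\[
P_Y\Big( \gamma(d)\, y \Big( \frac{1}{k} \sum_{j=1}^{k} Y_j(x) - E Y_1(x) \Big) \ge \delta \,\Big|\, (x,y) \Big)
= P_Y\Big( \frac{1}{k} \sum_{j=1}^{k} Y'_j(x) - E Y'_1(x) \ge \frac{\delta}{2\gamma(d)} \,\Big|\, (x,y) \Big)
\le e^{-\frac{k\delta^2}{2\gamma(d)^2}} .
\]
Hence,
\[
P(yf(x) \le 0) \le P(yg(x) \le \delta) + e^{-\frac{k\delta^2}{2\gamma(d)^2}} .
\]
If we set $e^{-\frac{k\delta^2}{2\gamma^2(d)}} = \frac{1}{n}$, then $k = \frac{2\gamma^2(d)}{\delta^2} \log n$. We have
\[
g = \sum_{i=1}^{d} \lambda_i h_i + \gamma(d)\, \frac{1}{k} \sum_{j=1}^{k} Y_j \in \mathrm{conv}_{d+k}\, H,
\qquad
d + k = d + \frac{2\gamma^2(d)}{\delta^2} \log n .
\]
Define the effective dimension of $f$ as
\[
e(f, \delta) = \min_{0 \le d \le T} \Big( d + \frac{2\gamma^2(d)}{\delta^2} \log n \Big).
\]
Recall from the previous lectures that
\[
P_n(yg(x) \le 2\delta) \le P_n(yf(x) \le 3\delta) + \frac{1}{n} .
\]
Hence, we have the following margin-sparsity bound.

Theorem 23.1. For $\lambda_1 \ge \dots \ge \lambda_T \ge 0$, we define $\gamma(d, f) = \sum_{i=d+1}^{T} \lambda_i$. Then with probability at least $1 - e^{-t}$,
\[
P(yf(x) \le 0) \le \inf_{\delta \in (0,1)} \Big( \varepsilon + \sqrt{P_n(yf(x) \le \delta) + \varepsilon^2} \Big)^2 ,
\]
where
\[
\varepsilon = K \Big( \sqrt{\frac{V \cdot e(f, \delta)}{n} \log \frac{n}{\delta}} + \sqrt{\frac{t}{n}} \Big).
\]

Example 23.1. Consider the zero-error case. Define $\delta^* = \sup\{\delta > 0 : P_n(yf(x) \le \delta) = 0\}$. Hence, $P_n(yf(x) \le \delta^*) = 0$ for confidence $\delta^*$. Then, with $\varepsilon$ evaluated at $\delta = \delta^*$ and $K$ denoting an absolute constant that may change from line to line,
\[
P(yf(x) \le 0) \le 4\varepsilon^2
\le K \Big( \frac{V \cdot e(f, \delta^*)}{n} \log \frac{n}{\delta^*} + \frac{t}{n} \Big)
\le K \Big( \frac{V \log n}{(\delta^*)^2\, n} \log \frac{n}{\delta^*} + \frac{t}{n} \Big),
\]
because $e(f, \delta) \le \frac{2}{\delta^2} \log n$ always (take $d = 0$ and use $\gamma(0) = \sum_{i=1}^{T} \lambda_i \le 1$ for $f$ in the convex hull of $H$).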
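The effective dimension is a one-dimensional minimization over the cut point d, so it is easy to evaluate for a concrete weight vector. The following is a minimal sketch, assuming the weights are given as a plain list of non-negative numbers. The function names `tail_weights`, `num_samples`, and `effective_dimension` are hypothetical and chosen only for this illustration.

```python
import math


def tail_weights(lams):
    """Return [gamma(0), ..., gamma(T)] with gamma(d) = sum of lams[d:].

    Assumes lams is already sorted in non-increasing order.
    """
    T = len(lams)
    gamma = [0.0] * (T + 1)
    for d in range(T - 1, -1, -1):
        gamma[d] = gamma[d + 1] + lams[d]
    return gamma


def num_samples(gamma_d, delta, n):
    """Real-valued k solving exp(-k * delta^2 / (2 * gamma_d^2)) = 1/n."""
    return 2.0 * gamma_d ** 2 / delta ** 2 * math.log(n)


def effective_dimension(lams, delta, n):
    """Return (e(f, delta), argmin d) for e = min_d (d + k(d))."""
    lams = sorted(lams, reverse=True)  # enforce lambda_1 >= ... >= lambda_T
    best_val, best_d = float("inf"), 0
    for d, gamma_d in enumerate(tail_weights(lams)):
        val = d + num_samples(gamma_d, delta, n)
        if val < best_val:
            best_val, best_d = val, d
    return best_val, best_d


if __name__ == "__main__":
    # Illustrative weights summing to 1; not taken from the lecture.
    weights = [0.5, 0.25, 0.125, 0.125]
    print(effective_dimension(weights, delta=0.5, n=100))
```

Taking d = T makes the tail weight zero, so the returned value never exceeds T. Taking d = 0 recovers the margin-only quantity 2 log(n) / delta^2 when the weights sum to 1.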
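Consider the polynomial weight decay: $\lambda_i \le K i^{-\alpha}$, for some $\alpha > 1$. Then
\[
\gamma(d) = \sum_{i=d+1}^{T} \lambda_i \le K \sum_{i=d+1}^{T} i^{-\alpha} \le K \int_{d}^{\infty} x^{-\alpha}\, dx = K \frac{1}{(\alpha - 1)\, d^{\alpha - 1}} = \frac{K_\alpha}{d^{\alpha - 1}} .
\]
Then
\[
e(f, \delta) = \min_{d} \Big( d + \frac{2\gamma^2(d)}{\delta^2} \log n \Big)
\le \min_{d} \Big( d + \frac{K'_\alpha}{\delta^2 d^{2(\alpha - 1)}} \log n \Big).
\]
Taking the derivative with respect to $d$ and setting it to zero,
\[
1 - \frac{K_\alpha \log n}{\delta^2 d^{2\alpha - 1}} = 0 ,
\]
we get
\[
d = K_\alpha \cdot \frac{\log^{1/(2\alpha - 1)} n}{\delta^{2/(2\alpha - 1)}} \le K \frac{\log n}{\delta^{2/(2\alpha - 1)}} .
\]
Hence,
\[
e(f, \delta) \le K \frac{\log n}{\delta^{2/(2\alpha - 1)}} .
\]
Plugging in,
\[
P(yf(x) \le 0) \le K \Big( \frac{V \log n}{n\, (\delta^*)^{2/(2\alpha - 1)}} \log \frac{n}{\delta^*} + \frac{t}{n} \Big).
\]
As $\alpha \to \infty$, the bound behaves like
\[
\frac{V \log n}{n} \log \frac{n}{\delta^*} .
\]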
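The polynomial-decay calculation can be checked numerically. The sketch below tracks the constants explicitly instead of absorbing them into K_alpha: with c = 2K^2 / (alpha - 1)^2, the relaxed objective is d + c log(n) / (delta^2 d^(2(alpha - 1))), and its stationary point is d* = (2(alpha - 1) c log(n) / delta^2)^(1/(2 alpha - 1)). All function names and the parameter values in the usage lines are assumptions made for illustration only.

```python
import math


def tail_sum(K, alpha, d, T):
    """Exact tail sum of the worst-case weights K * i^(-alpha), i = d+1..T."""
    return sum(K * i ** (-alpha) for i in range(d + 1, T + 1))


def tail_integral_bound(K, alpha, d):
    """Integral bound K / ((alpha - 1) * d^(alpha - 1)) on the tail sum."""
    return K / ((alpha - 1) * d ** (alpha - 1))


def relaxed_objective(d, K, alpha, delta, n):
    """d + 2 * bound(d)^2 / delta^2 * log n, treating d as a real number."""
    c = 2.0 * K ** 2 / (alpha - 1) ** 2
    return d + c * math.log(n) / (delta ** 2 * d ** (2 * (alpha - 1)))


def stationary_point(K, alpha, delta, n):
    """Zero of the derivative of relaxed_objective with respect to d."""
    c = 2.0 * K ** 2 / (alpha - 1) ** 2
    base = 2.0 * (alpha - 1) * c * math.log(n) / delta ** 2
    return base ** (1.0 / (2 * alpha - 1))


def margin_exponent(alpha):
    """Exponent of 1/delta in the resulting bound; tends to 0 as alpha grows."""
    return 2.0 / (2 * alpha - 1)


if __name__ == "__main__":
    K, alpha, delta, n = 1.0, 2.0, 0.1, 1000  # illustrative values only
    d_star = stationary_point(K, alpha, delta, n)
    print(d_star, relaxed_objective(d_star, K, alpha, delta, n))
    print([margin_exponent(a) for a in (1.5, 2.0, 5.0, 50.0)])
```

At the stationary point the second term of the objective equals d* / (2(alpha - 1)), so the objective is a constant multiple of d*. This is the step behind passing from the bound on d to the bound on e(f, delta).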