Lecture 29. Generalization bounds for neural networks. 18.465

In Lecture 28 we proved
\[
E \sup_{h \in \mathcal{H}_k(A_1,\ldots,A_k)} \Bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \bigl(y_i - h(x_i)\bigr)^2 \Bigr|
\;\le\; 8 \prod_{j=1}^k (2LA_j) \cdot E \sup_{h \in \mathcal{H}} \Bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i h(x_i) \Bigr| + \frac{8}{\sqrt{n}}.
\]
Hence,
\[
Z\bigl(\mathcal{H}_k(A_1,\ldots,A_k)\bigr) := \sup_{h \in \mathcal{H}_k(A_1,\ldots,A_k)} \Bigl| E L(y, h(x)) - \frac{1}{n}\sum_{i=1}^n L(y_i, h(x_i)) \Bigr|
\;\le\; 8 \prod_{j=1}^k (2LA_j) \cdot E \sup_{h \in \mathcal{H}} \Bigl| \frac{1}{n}\sum_{i=1}^n \varepsilon_i h(x_i) \Bigr| + \frac{8}{\sqrt{n}} + 8\sqrt{\frac{t}{n}}
\]
with probability at least $1 - e^{-t}$.

Assume $\mathcal{H}$ is a VC-subgraph class with $-1 \le h \le 1$. We had the following result:
\[
P_\varepsilon\Biggl( \forall h \in \mathcal{H},\ \frac{1}{n}\sum_{i=1}^n \varepsilon_i h(x_i)
\le \frac{K}{\sqrt{n}} \int_0^{\sqrt{\frac{1}{n}\sum_{i=1}^n h^2(x_i)}} \log^{1/2} D(\mathcal{H}, \varepsilon, d_x)\, d\varepsilon
+ K\sqrt{\frac{t}{n} \cdot \frac{1}{n}\sum_{i=1}^n h^2(x_i)} \Biggr) \ge 1 - e^{-t},
\]
where
\[
d_x(f, g) = \Bigl( \frac{1}{n}\sum_{i=1}^n \bigl(f(x_i) - g(x_i)\bigr)^2 \Bigr)^{1/2}.
\]
Furthermore,
\[
P_\varepsilon\Biggl( \forall h \in \mathcal{H},\ \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\Bigr|
\le \frac{K}{\sqrt{n}} \int_0^{\sqrt{\frac{1}{n}\sum_{i=1}^n h^2(x_i)}} \log^{1/2} D(\mathcal{H}, \varepsilon, d_x)\, d\varepsilon
+ K\sqrt{\frac{t}{n} \cdot \frac{1}{n}\sum_{i=1}^n h^2(x_i)} \Biggr) \ge 1 - 2e^{-t}.
\]
Since $-1 \le h \le 1$ for all $h \in \mathcal{H}$,
\[
P_\varepsilon\Biggl( \sup_{h \in \mathcal{H}} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\Bigr|
\le \frac{K}{\sqrt{n}} \int_0^{1} \log^{1/2} D(\mathcal{H}, \varepsilon, d_x)\, d\varepsilon + K\sqrt{\frac{t}{n}} \Biggr) \ge 1 - 2e^{-t}.
\]
Since $\mathcal{H}$ is a VC-subgraph class with $VC(\mathcal{H}) = V$,
\[
\log D(\mathcal{H}, \varepsilon, d_x) \le K V \log \frac{2}{\varepsilon}.
\]
Hence,
\[
\int_0^1 \log^{1/2} D(\mathcal{H}, \varepsilon, d_x)\, d\varepsilon
\le \int_0^1 \Bigl( K V \log \frac{2}{\varepsilon} \Bigr)^{1/2} d\varepsilon
\le K\sqrt{V} \int_0^1 \sqrt{\log \frac{2}{\varepsilon}}\, d\varepsilon \le K\sqrt{V}.
\]
Let $\xi \ge 0$ be a random variable. Then
\[
E\xi = \int_0^\infty P(\xi \ge t)\, dt
= \int_0^a P(\xi \ge t)\, dt + \int_a^\infty P(\xi \ge t)\, dt
\le a + \int_a^\infty P(\xi \ge t)\, dt
= a + \int_0^\infty P(\xi \ge a + u)\, du.
\]
Let $a = K\sqrt{\frac{V}{n}}$ and $K\sqrt{\frac{t}{n}} = u$, so that $e^{-t} = e^{-nu^2/K^2}$. Hence, we have
\[
E_\varepsilon \sup_{h \in \mathcal{H}} \Bigl|\frac{1}{n}\sum_{i=1}^n \varepsilon_i h(x_i)\Bigr|
\le K\sqrt{\frac{V}{n}} + \int_0^\infty 2 e^{-nu^2/K^2}\, du
= K\sqrt{\frac{V}{n}} + \frac{K}{\sqrt{n}} \int_0^\infty 2 e^{-x^2}\, dx
\le K\sqrt{\frac{V}{n}} + \frac{K}{\sqrt{n}} \le K\sqrt{\frac{V}{n}}
\]
for $V \ge 2$. Here we made the change of variable $x^2 = nu^2/K^2$; the constants $K$ change their values from line to line.

We obtain
\[
Z\bigl(\mathcal{H}_k(A_1,\ldots,A_k)\bigr) \le K \prod_{j=1}^k (2LA_j) \cdot \sqrt{\frac{V}{n}} + \frac{8}{\sqrt{n}} + 8\sqrt{\frac{t}{n}}
\]
with probability at least $1 - e^{-t}$.

Assume that for any $j$, $A_j \in (2^{-\ell_j - 1}, 2^{-\ell_j}]$; this defines $\ell_j$. Let
\[
\mathcal{H}_k(\ell_1,\ldots,\ell_k) = \bigcup \bigl\{ \mathcal{H}_k(A_1,\ldots,A_k) : A_j \in (2^{-\ell_j - 1}, 2^{-\ell_j}] \bigr\}.
\]
Then the empirical process satisfies
\[
Z\bigl(\mathcal{H}_k(\ell_1,\ldots,\ell_k)\bigr) \le K \prod_{j=1}^k \bigl(2L \cdot 2^{-\ell_j}\bigr) \cdot \sqrt{\frac{V}{n}} + \frac{8}{\sqrt{n}} + 8\sqrt{\frac{t}{n}}
\]
with probability at least $1 - e^{-t}$.

For a given sequence $(\ell_1,\ldots,\ell_k)$, redefine $t$ as $t + 2\sum_{j=1}^k \log |w_j|$, where $w_j = \ell_j$ if $\ell_j \ne 0$ and $w_j = 1$ if $\ell_j = 0$. With this $t$,
\[
Z\bigl(\mathcal{H}_k(\ell_1,\ldots,\ell_k)\bigr) \le K \prod_{j=1}^k \bigl(2L \cdot 2^{-\ell_j}\bigr) \cdot \sqrt{\frac{V}{n}} + \frac{8}{\sqrt{n}} + 8\sqrt{\frac{t + 2\sum_{j=1}^k \log |w_j|}{n}}
\]
with probability at least
\[
1 - e^{-t - 2\sum_{j=1}^k \log |w_j|} = 1 - \prod_{j=1}^k \frac{1}{|w_j|^2}\, e^{-t}.
\]
By the union bound, the above holds for all $\ell_1,\ldots,\ell_k \in \mathbb{Z}$ with probability at least
\[
1 - \sum_{\ell_1,\ldots,\ell_k \in \mathbb{Z}} \prod_{j=1}^k \frac{1}{|w_j|^2}\, e^{-t}
= 1 - \Biggl( \sum_{\ell_1 \in \mathbb{Z}} \frac{1}{|w_1|^2} \Biggr)^k e^{-t}
= 1 - \Bigl( 1 + 2\,\frac{\pi^2}{6} \Bigr)^k e^{-t}
\ge 1 - 5^k e^{-t} = 1 - e^{-u}
\]
for $t = u + k \log 5$. Hence, with probability at least $1 - e^{-u}$, for all $(\ell_1,\ldots,\ell_k)$,
\[
Z\bigl(\mathcal{H}_k(\ell_1,\ldots,\ell_k)\bigr) \le K \prod_{j=1}^k \bigl(2L \cdot 2^{-\ell_j}\bigr) \cdot \sqrt{\frac{V}{n}} + \frac{8}{\sqrt{n}} + 8\sqrt{\frac{2\sum_{j=1}^k \log |w_j| + k\log 5 + u}{n}}.
\]
If $A_j \in (2^{-\ell_j - 1}, 2^{-\ell_j}]$, then $-\ell_j - 1 < \log_2 A_j \le -\ell_j$, so $2^{-\ell_j} < 2A_j$ and $|\ell_j| \le |\log_2 A_j| + 1$; hence $|w_j| \le |\log_2 A_j| + 1$. Therefore, with probability at least $1 - e^{-u}$, for all $(A_1,\ldots,A_k)$,
\[
Z\bigl(\mathcal{H}_k(A_1,\ldots,A_k)\bigr) \le K \prod_{j=1}^k (4L \cdot A_j) \cdot \sqrt{\frac{V}{n}} + \frac{8}{\sqrt{n}} + 8\sqrt{\frac{2\sum_{j=1}^k \log\bigl(|\log_2 A_j| + 1\bigr) + k\log 5 + u}{n}}.
\]
Notice that $\log(|\log_2 A_j| + 1)$ is large when $A_j$ is very large or very small. This is a penalty term, and we want the product term to dominate; in most practical applications, however, $\log(|\log_2 A_j| + 1) \le 5$.
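
As a rough numerical illustration (a sketch that is not part of the lecture), the Python snippet below plugs hypothetical values of $n$, $V$, $L$, $u$ and the layer norms $A_j$ into the final bound, taking the unspecified universal constant $K$ to be $1$ and reading the logarithms of $A_j$ in base $2$; the function name and all numeric inputs are assumptions made for illustration only.

\begin{verbatim}
import math

# Final bound derived above (with illustrative K = 1):
#   Z <= K * prod_j (4 L A_j) * sqrt(V/n)
#        + 8/sqrt(n)
#        + 8 * sqrt((2 sum_j log(|log2 A_j| + 1) + k log 5 + u) / n)
def bound(A, n, V, L, u, K=1.0):
    k = len(A)
    product_term = K * math.prod(4.0 * L * a for a in A) * math.sqrt(V / n)
    penalty = 2.0 * sum(math.log(abs(math.log2(a)) + 1.0) for a in A)
    tail_term = 8.0 / math.sqrt(n) + 8.0 * math.sqrt((penalty + k * math.log(5.0) + u) / n)
    return product_term + tail_term

# Hypothetical example: k = 3 layers with moderate norms, n = 100000 samples,
# base class with VC dimension V = 10, L = 1, confidence level e^{-u} = 1/100.
print(bound(A=[2.0, 1.5, 3.0], n=100_000, V=10, L=1.0, u=math.log(100)))
\end{verbatim}

For these assumed values the product term is roughly $5.8$ while the penalty-driven tail contributes only about $0.12$, consistent with the remark that the $\log(|\log_2 A_j| + 1)$ penalty is negligible unless some $A_j$ is extremely large or small.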