Foundations of Machine Learning
Learning with Infinite Hypothesis Sets

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Motivation

With an infinite hypothesis set H, the error bounds of the previous lecture are not informative.
• Is efficient learning from a finite sample possible when H is infinite? Our example of axis-aligned rectangles shows that it is.
• Can we reduce the infinite case to a finite one, for instance by projecting H over finite samples?
• Are there useful measures of complexity for infinite hypothesis sets?

This lecture

• Rademacher complexity
• Growth function
• VC dimension
• Lower bound

Empirical Rademacher Complexity

Definition: let G be a family of functions mapping from a set Z to [a, b], and let S = (z_1, ..., z_m) be a fixed sample. The \sigma_i's (Rademacher variables) are independent uniform random variables taking values in \{-1, +1\}. The empirical Rademacher complexity of G with respect to S is
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big],$$
that is, the expected correlation of G with random noise over the sample.

Rademacher Complexity

Definitions: let G be a family of functions mapping from Z to [a, b].
• Empirical Rademacher complexity of G:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big],$$
where the \sigma_i's are independent uniform random variables taking values in \{-1, +1\} and S = (z_1, ..., z_m).
• Rademacher complexity of G:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big].$$

Rademacher Complexity Bound
(Koltchinskii and Panchenko, 2002)

Theorem: let G be a family of functions mapping from Z to [0, 1]. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all g \in G:
$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\,\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Proof: apply McDiarmid's inequality to \Phi(S) = \sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big).
• Changing one point of S changes \Phi(S) by at most 1/m: if S and S' differ only in their last point,
$$\Phi(S) - \Phi(S') = \sup_{g \in G}\big\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S}[g]\big\} - \sup_{g \in G}\big\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\big\} \leq \sup_{g \in G}\big\{\widehat{\mathbb{E}}_{S'}[g] - \widehat{\mathbb{E}}_{S}[g]\big\} = \sup_{g \in G} \frac{1}{m}\big(g(z_m') - g(z_m)\big) \leq \frac{1}{m}.$$
• Thus, by McDiarmid's inequality, with probability at least 1 - \delta/2,
$$\Phi(S) \leq \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$
(the first statement of the theorem follows in the same way with \delta in place of \delta/2).
• We are left with bounding the expectation. Series of observations:
$$\mathbb{E}_S[\Phi(S)] = \mathbb{E}_S\Big[\sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S(g)\big)\Big] = \mathbb{E}_S\Big[\sup_{g \in G} \mathbb{E}_{S'}\big[\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big]\Big]$$
$$\leq \mathbb{E}_{S, S'}\Big[\sup_{g \in G}\big(\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big)\Big] \qquad \text{(Jensen's ineq.)}$$
$$= \mathbb{E}_{S, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \big(g(z_i') - g(z_i)\big)\Big]$$
$$= \mathbb{E}_{\sigma, S, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z_i') - g(z_i)\big)\Big] \qquad \text{(swap } z_i \text{ and } z_i')$$
$$\leq \mathbb{E}_{\sigma, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\, g(z_i')\Big] + \mathbb{E}_{\sigma, S}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m -\sigma_i\, g(z_i)\Big] \qquad \text{(sub-additivity of sup)}$$
$$= 2\,\mathfrak{R}_m(G).$$
• Now, changing one point of S makes \widehat{\mathfrak{R}}_S(G) vary by at most 1/m. Thus, again by McDiarmid's inequality, with probability at least 1 - \delta/2,
$$\mathfrak{R}_m(G) \leq \widehat{\mathfrak{R}}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
• Thus, by the union bound, with probability at least 1 - \delta,
$$\Phi(S) \leq 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
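The definition above has a direct computational reading: draw random sign vectors \sigma and measure how well hypotheses can correlate with them. The sketch below estimates \widehat{\mathfrak{R}}_S(H) by Monte Carlo for a small finite class of one-dimensional threshold classifiers; the class, the sample, and the number of trials are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample S of m real-valued points (assumption: 1-D inputs in [-1, 1]).
m = 50
S = np.sort(rng.uniform(-1.0, 1.0, size=m))

# Finite hypothesis class H: threshold functions x -> sign(x - t), one per candidate t.
# Row j of H_vals is the label vector (h_j(z_1), ..., h_j(z_m)) in {-1, +1}^m.
thresholds = np.linspace(-1.1, 1.1, 23)
H_vals = np.where(S[None, :] > thresholds[:, None], 1.0, -1.0)

def empirical_rademacher(H_vals, n_trials=2000, rng=rng):
    """Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (1/m) sum_i sigma_i h(z_i) ]."""
    m = H_vals.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(H_vals @ sigma) / m       # sup over h of the correlation with sigma
    return total / n_trials

print(f"estimated empirical Rademacher complexity: {empirical_rademacher(H_vals):.3f}")
```

For a class this small the supremum is an exact maximum over rows; for richer classes that inner maximization is itself an ERM-type problem, which is the computational difficulty noted later in the lecture.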
Loss Functions - Hypothesis Set

Proposition: let H be a family of functions taking values in \{-1, +1\} and let G be the family of zero-one loss functions associated to H:
$$G = \big\{(x, y) \mapsto 1_{h(x) \neq y} \colon h \in H\big\}.$$
Then,
$$\mathfrak{R}_m(G) = \frac{1}{2}\,\mathfrak{R}_m(H).$$

Proof: using 1_{h(x) \neq y} = \frac{1 - y\,h(x)}{2} and the fact that -\sigma_i y_i has the same distribution as \sigma_i,
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, 1_{h(x_i) \neq y_i}\Big] = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, \frac{1 - y_i h(x_i)}{2}\Big]$$
$$= \frac{1}{2}\,\mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m -\sigma_i\, y_i\, h(x_i)\Big] = \frac{1}{2}\,\mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, h(x_i)\Big] = \frac{1}{2}\,\widehat{\mathfrak{R}}_S(H).$$

Generalization Bounds - Rademacher

Corollary: let H be a family of functions taking values in \{-1, +1\}. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$R(h) \leq \widehat{R}(h) + \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Remarks

• The first bound is distribution-dependent, the second data-dependent, which makes them attractive.
• But how do we compute the empirical Rademacher complexity? Computing \mathbb{E}_{\sigma}\big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\big] requires solving ERM problems, typically computationally hard.
• Is there a relation with combinatorial measures that are easier to compute?

Growth Function

Definition: the growth function \Pi_H \colon \mathbb{N} \to \mathbb{N} for a hypothesis set H is defined by
$$\forall m \in \mathbb{N}, \quad \Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\big\{\big(h(x_1), \ldots, h(x_m)\big) \colon h \in H\big\}\big|.$$
Thus, \Pi_H(m) is the maximum number of ways m points can be classified using H.

Massart's Lemma
(Massart, 2000)

Theorem: let A \subseteq \mathbb{R}^m be a finite set, with R = \max_{x \in A} \|x\|_2. Then, the following holds:
$$\mathbb{E}_{\sigma}\Big[\frac{1}{m}\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \leq \frac{R\sqrt{2\log|A|}}{m}.$$

Proof: for any t > 0,
$$\exp\Big(t\,\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big]\Big) \leq \mathbb{E}_{\sigma}\Big[\exp\Big(t \sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big)\Big] \qquad \text{(Jensen's ineq.)}$$
$$= \mathbb{E}_{\sigma}\Big[\sup_{x \in A} \exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] \leq \sum_{x \in A} \mathbb{E}_{\sigma}\Big[\exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] = \sum_{x \in A} \prod_{i=1}^m \mathbb{E}_{\sigma_i}\big[\exp(t\,\sigma_i x_i)\big]$$
$$\leq \sum_{x \in A} \prod_{i=1}^m \exp\Big(\frac{t^2 (2|x_i|)^2}{8}\Big) \qquad \text{(Hoeffding's ineq.)}$$
$$= \sum_{x \in A} \exp\Big(\frac{t^2 \|x\|_2^2}{2}\Big) \leq |A|\, e^{\frac{t^2 R^2}{2}}.$$
• Taking the log yields:
$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \leq \frac{\log|A|}{t} + \frac{t R^2}{2}.$$
• Minimizing the bound by choosing t = \frac{\sqrt{2\log|A|}}{R} gives
$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \leq R\sqrt{2\log|A|}.$$

Growth Function Bound on Rademacher Complexity

Corollary: let G be a family of functions taking values in \{-1, +1\}. Then, the following holds:
$$\mathfrak{R}_m(G) \leq \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$

Proof: since each vector \big(g(z_1), \ldots, g(z_m)\big) has norm \sqrt{m},
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] \leq \frac{\sqrt{m}\,\sqrt{2\log\big|\{(g(z_1), \ldots, g(z_m)) \colon g \in G\}\big|}}{m} \qquad \text{(Massart's lemma)}$$
$$\leq \frac{\sqrt{m}\,\sqrt{2\log\Pi_G(m)}}{m} = \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$

Generalization Bound - Growth Function

Corollary: let H be a family of functions taking values in \{-1, +1\}. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2\log\Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
But how do we compute the growth function? Its relationship with the VC dimension (Vapnik-Chervonenkis dimension) provides an answer.
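As a companion to the growth function bound above, the sketch below counts the dichotomies that one-dimensional threshold classifiers realize on a sample (for this class the count is m + 1) and evaluates the resulting bound \sqrt{2\log\Pi_H(m)/m}. The hypothesis class and the sample are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dichotomies_on_sample(points, thresholds):
    """Distinct labelings that threshold classifiers x -> sign(x - t) realize on `points`;
    with thresholds placed between consecutive points this equals Pi_H(m) = m + 1."""
    return {tuple(np.where(points > t, 1, -1)) for t in thresholds}

m = 40
points = np.sort(rng.uniform(-1.0, 1.0, size=m))
# One candidate threshold below all points, one between each pair of consecutive
# points, and one above all points: this realizes every achievable dichotomy.
thresholds = np.concatenate((
    [points[0] - 1.0],
    (points[:-1] + points[1:]) / 2,
    [points[-1] + 1.0],
))

Pi = len(dichotomies_on_sample(points, thresholds))
bound = np.sqrt(2 * np.log(Pi) / m)
print(f"dichotomies realized: {Pi} (expected m + 1 = {m + 1})")
print(f"growth-function bound on the Rademacher complexity: {bound:.3f}")
```

Because \Pi_H(m) grows only polynomially here, the bound decays roughly like \sqrt{\log m / m}; the VC dimension, introduced next, captures exactly this polynomial-versus-exponential distinction combinatorially.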
VC Dimension
(Vapnik & Chervonenkis, 1968-1971; Vapnik, 1982, 1995, 1998)

Definition: the VC dimension of a hypothesis set H is defined by
$$\mathrm{VCdim}(H) = \max\{m \colon \Pi_H(m) = 2^m\}.$$
Thus, the VC dimension is the size of the largest set that can be fully shattered by H. It is a purely combinatorial notion.

Examples

In the following, we determine the VC dimension of several hypothesis sets.
• To give a lower bound d on VCdim(H), it suffices to exhibit a set S of cardinality d that can be shattered by H.
• To give an upper bound, we need to prove that no set S of cardinality d + 1 can be shattered by H, which is typically more difficult.

Intervals of the Real Line

Observations:
• Any set of two points can be shattered by four intervals, realizing the dichotomies "- -", "- +", "+ -", and "+ +".
• No set of three points can be shattered, since the dichotomy "+ - +" is not realizable (by definition of intervals).
Thus, VCdim(intervals in \mathbb{R}) = 2.

Hyperplanes

Observations:
• Any three non-collinear points in the plane can be shattered.
• No set of four points in the plane can be shattered: if one point lies in the convex hull of the other three, label it opposite to them; otherwise the four points form a convex quadrilateral, and the dichotomy labeling diagonally opposite points alike is not realizable.
Thus, VCdim(hyperplanes in \mathbb{R}^2) = 3, and in general VCdim(hyperplanes in \mathbb{R}^d) = d + 1.

Axis-Aligned Rectangles in the Plane

Observations:
• There is a set of four points (e.g., in a diamond configuration) that can be shattered.
• No set of five points can be shattered: label positively the leftmost, rightmost, topmost, and bottommost points and negatively a remaining point; any axis-aligned rectangle containing the four positive points must also contain the negative one.
Thus, VCdim(axis-aligned rectangles) = 4.

Convex Polygons in the Plane

Observations:
• Any 2d + 1 points on a circle can be shattered by a convex d-gon: if the positive points are fewer than the negative ones, take them as the vertices of the polygon; if they are more numerous, take the edges of the polygon tangent to the circle at the negative points.
• It can be shown that choosing the points on the circle maximizes the number of possible dichotomies.
Thus, VCdim(convex d-gons) = 2d + 1. Also, VCdim(convex polygons) = +\infty.

Sine Functions

Observations: for any m, there is a set of m points on the real line (for instance, points of the form 2^{-i}, i = 1, \ldots, m) that can be shattered by the sine-based classifiers \{t \mapsto \mathrm{sgn}(\sin(\omega t)) \colon \omega \in \mathbb{R}\}.
Thus, VCdim(sine functions) = +\infty.

Sauer's Lemma
(Vapnik & Chervonenkis, 1968-1971; Sauer, 1972)

Theorem: let H be a hypothesis set with VCdim(H) = d. Then, for all m \in \mathbb{N},
$$\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}.$$

Proof: the proof is by induction on m + d. The statement clearly holds for m = 1 and d = 0 or d = 1. Assume that it holds for (m - 1, d - 1) and (m - 1, d).
• Fix a set S = \{x_1, \ldots, x_m\} with \Pi_H(m) dichotomies and let G = H_{|S} be the set of concepts H induces by restriction to S.
• Consider the following families over S' = \{x_1, \ldots, x_{m-1}\}:
$$G_1 = G_{|S'} \qquad \text{and} \qquad G_2 = \{g' \subseteq S' \colon (g' \in G) \wedge (g' \cup \{x_m\} \in G)\}.$$
Viewing each concept as a row of labels over x_1, \ldots, x_m, G_1 is obtained by dropping the last column and removing duplicate rows, while G_2 keeps exactly the rows that appear with both labels of x_m.
• Observe that |G_1| + |G_2| = |G|.
• Since VCdim(G_1) \leq d, by the induction hypothesis,
$$|G_1| \leq \Pi_{G_1}(m - 1) \leq \sum_{i=0}^{d} \binom{m - 1}{i}.$$
• By definition of G_2, if a set Z \subseteq S' is shattered by G_2, then the set Z \cup \{x_m\} is shattered by G. Thus, VCdim(G_2) \leq VCdim(G) - 1 = d - 1, and by the induction hypothesis,
$$|G_2| \leq \Pi_{G_2}(m - 1) \leq \sum_{i=0}^{d-1} \binom{m - 1}{i}.$$
• Thus,
$$|G| = |G_1| + |G_2| \leq \sum_{i=0}^{d} \binom{m - 1}{i} + \sum_{i=0}^{d-1} \binom{m - 1}{i} = \sum_{i=0}^{d} \Big[\binom{m - 1}{i} + \binom{m - 1}{i - 1}\Big] = \sum_{i=0}^{d} \binom{m}{i}.$$

Sauer's Lemma - Consequence

Corollary: let H be a hypothesis set with VCdim(H) = d. Then, for all m \geq d,
$$\Pi_H(m) \leq \Big(\frac{em}{d}\Big)^d = O(m^d).$$

Proof:
$$\sum_{i=0}^{d} \binom{m}{i} \leq \sum_{i=0}^{d} \binom{m}{i} \Big(\frac{m}{d}\Big)^{d - i} \leq \sum_{i=0}^{m} \binom{m}{i} \Big(\frac{m}{d}\Big)^{d - i} = \Big(\frac{m}{d}\Big)^{d} \sum_{i=0}^{m} \binom{m}{i} \Big(\frac{d}{m}\Big)^{i} = \Big(\frac{m}{d}\Big)^{d} \Big(1 + \frac{d}{m}\Big)^{m} \leq \Big(\frac{m}{d}\Big)^{d} e^{d}.$$

Remarks

• Remarkable property of the growth function: either VCdim(H) = d < +\infty and \Pi_H(m) = O(m^d), or VCdim(H) = +\infty and \Pi_H(m) = 2^m for all m.
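The interval example can be checked mechanically: enumerate every dichotomy of a candidate set and test whether some interval realizes it. The brute-force helper below is a sketch under the assumption that hypotheses are closed intervals [a, b] labeling the points they contain as +1.

```python
from itertools import product

def interval_realizes(points, labels):
    """True if some closed interval [a, b] labels exactly the +1 points positively."""
    positives = [x for x, y in zip(points, labels) if y == +1]
    if not positives:
        return True  # an "empty" interval realizes the all-negative labeling
    a, b = min(positives), max(positives)
    # [a, b] is the smallest interval covering the positives; it must avoid every negative.
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == -1)

def shattered_by_intervals(points):
    """Check whether every dichotomy of `points` is realized by some interval."""
    return all(interval_realizes(points, labels)
               for labels in product([-1, +1], repeat=len(points)))

print(shattered_by_intervals([0.0, 1.0]))       # True: any two points are shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False: the "+ - +" dichotomy fails
```

Consistent with the slide, two points are shattered while three are not, so VCdim(intervals in R) = 2; the same enumeration pattern, with a larger search, applies to the axis-aligned rectangle example.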
Generalization Bound - VC Dimension

Corollary: let H be a family of functions taking values in \{-1, +1\} with VC dimension d. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2 d \log\frac{em}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Proof: the growth function corollary combined with Sauer's lemma.
Note: the general form of the result is
$$R(h) \leq \widehat{R}(h) + O\Bigg(\sqrt{\frac{\log(m/d)}{m/d}}\Bigg).$$

Comparison - Standard VC Bound
(Vapnik & Chervonenkis, 1971; Vapnik, 1982)

Theorem: let H be a family of functions taking values in \{-1, +1\} with VC dimension d. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{8 d \log\frac{2em}{d} + 8\log\frac{4}{\delta}}{m}}.$$

Proof: derived from the growth function bound
$$\Pr\big[|R(h) - \widehat{R}(h)| > \epsilon\big] \leq 4\,\Pi_H(2m) \exp\Big(-\frac{m\epsilon^2}{8}\Big).$$

VCDim Lower Bound - Realizable Case
(Ehrenfeucht et al., 1988)

Theorem: let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L, there exist a distribution D and a target function f \in H such that
$$\Pr_{S \sim D^m}\Big[R_D(h_S, f) > \frac{d - 1}{32m}\Big] \geq \frac{1}{100}.$$

Proof: choose D such that L can do no better than tossing a coin for some points.
• Let X = \{x_0, x_1, \ldots, x_{d-1}\} be a set fully shattered by H. For any \epsilon > 0, define D with support X by
$$\Pr_D[x_0] = 1 - 8\epsilon \qquad \text{and} \qquad \forall i \in [1, d - 1], \ \Pr_D[x_i] = \frac{8\epsilon}{d - 1}.$$
• We can assume without loss of generality that L makes no error on x_0.
• For a sample S, let \bar{S} denote the set of its elements falling in X_1 = \{x_1, \ldots, x_{d-1}\}, and let \mathcal{S} be the set of samples of size m with at most (d - 1)/2 points in X_1. Fix a sample S \in \mathcal{S}. Averaging over target labelings f drawn uniformly at random over the shattered set, and using |X_1 - \bar{S}| \geq (d - 1)/2,
$$\mathbb{E}_{f \sim U}\big[R_D(h_S, f)\big] = \sum_{x \in X} \sum_{f} 1_{h_S(x) \neq f(x)} \Pr[x] \Pr[f] \geq \sum_{x \in X_1 - \bar{S}} \sum_{f} 1_{h_S(x) \neq f(x)} \Pr[f] \Pr[x] = \frac{1}{2} \sum_{x \in X_1 - \bar{S}} \Pr[x] \geq \frac{1}{2} \cdot \frac{d - 1}{2} \cdot \frac{8\epsilon}{d - 1} = 2\epsilon.$$
• Since the inequality holds for all S \in \mathcal{S}, it also holds in expectation: \mathbb{E}_{S \in \mathcal{S}, f \sim U}[R_D(h_S, f)] \geq 2\epsilon. This implies that there exists a labeling f_0 such that \mathbb{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \geq 2\epsilon. Since \Pr_D[X - \{x_0\}] \leq 8\epsilon, we also have R_D(h_S, f_0) \leq 8\epsilon. Thus,
$$2\epsilon \leq \mathbb{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \leq 8\epsilon \Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big] + \epsilon\Big(1 - \Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big]\Big).$$
• Collecting terms in \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \geq \epsilon], we obtain:
$$\Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big] \geq \frac{1}{7}(2 - 1) = \frac{1}{7}.$$
• Thus, the probability over all samples S (not necessarily in \mathcal{S}) can be lower bounded as
$$\Pr_{S}\big[R_D(h_S, f_0) \geq \epsilon\big] \geq \Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big] \Pr[\mathcal{S}] \geq \frac{1}{7} \Pr[\mathcal{S}].$$
• This leads us to seeking a lower bound for \Pr[\mathcal{S}]. Let S_m denote the number of points of the sample falling in X_1; its expectation is 8\epsilon m, and the probability that more than (d - 1)/2 points be drawn in X_1 verifies the multiplicative Chernoff bound: for any \gamma > 0,
$$1 - \Pr[\mathcal{S}] = \Pr\big[S_m \geq 8\epsilon m (1 + \gamma)\big] \leq e^{-\frac{8\epsilon m \gamma^2}{3}}.$$
• Thus, for \epsilon = (d - 1)/(32m) and \gamma = 1,
$$\Pr\Big[S_m \geq \frac{d - 1}{2}\Big] \leq e^{-(d - 1)/12} \leq e^{-1/12} \leq 1 - 7\delta,$$
for \delta = .01. Thus, \Pr[\mathcal{S}] \geq 7\delta and
$$\Pr_S\big[R_D(h_S, f_0) \geq \epsilon\big] \geq \frac{1}{7}\Pr[\mathcal{S}] \geq \delta = \frac{1}{100}.$$

Agnostic PAC Model

Definition: a hypothesis set H is PAC-learnable in the agnostic setting if there exists a learning algorithm L such that:
• for all \epsilon > 0, \delta > 0, and all distributions D over X \times Y,
$$\Pr_{S \sim D^m}\Big[R(h_S) - \inf_{h \in H} R(h) \leq \epsilon\Big] \geq 1 - \delta,$$
• for samples S of size m = \mathrm{poly}(1/\epsilon, 1/\delta) for a fixed polynomial.
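Before the agnostic-case lower bound on the next slide, it is instructive to see what sample size the VC-dimension upper bound above suggests. The sketch below searches for the smallest m at which \sqrt{2d\log(em/d)/m} + \sqrt{\log(1/\delta)/(2m)} drops below a target \epsilon, and compares it with the d/(320\epsilon^2) lower bound stated next; the particular d, \epsilon, and \delta are illustrative assumptions.

```python
import math

def vc_bound_gap(m, d, delta):
    """Complexity term of the VC-dimension bound: R(h) <= R_hat(h) + gap(m, d, delta)."""
    return (math.sqrt(2 * d * math.log(math.e * m / d) / m)
            + math.sqrt(math.log(1 / delta) / (2 * m)))

def sample_size_for(epsilon, d, delta):
    """Smallest m with vc_bound_gap(m, d, delta) <= epsilon (the gap decreases for m >= d)."""
    hi = d
    while vc_bound_gap(hi, d, delta) > epsilon:
        hi *= 2                      # doubling phase: find an m that already satisfies the target
    lo = max(d, hi // 2)
    while lo < hi:                   # bisection phase: pin down the smallest such m
        mid = (lo + hi) // 2
        if vc_bound_gap(mid, d, delta) <= epsilon:
            hi = mid
        else:
            lo = mid + 1
    return lo

d, delta, epsilon = 10, 0.05, 0.1
m_upper = sample_size_for(epsilon, d, delta)
m_lower = d / (320 * epsilon ** 2)   # agnostic-case lower bound from the next slide
print(f"upper bound suggests m ~ {m_upper}, lower bound requires m >= {m_lower:.0f}")
```

Both quantities scale roughly linearly in d; the remaining gap reflects the logarithmic factor and the constants in the upper bound.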
VCDim Lower Bound - Non-Realizable Case
(Anthony and Bartlett, 1999)

Theorem: let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L, there exists a distribution D over X \times \{0, 1\} such that
$$\Pr_{S \sim D^m}\Big[R_D(h_S) - \inf_{h \in H} R_D(h) > \sqrt{\frac{d}{320m}}\Big] \geq \frac{1}{64}.$$
Equivalently, for any learning algorithm, the sample complexity verifies
$$m \geq \frac{d}{320\,\epsilon^2}.$$

References

• Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
• Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4), 1989.
• A. Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Proceedings of the 1st COLT, pp. 139-154, 1988.
• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
• Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245-303, 2000.
• N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
• Vladimir N. Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.
• Vladimir N. Vapnik and Alexey Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.