Foundations of Machine Learning
Learning with Infinite Hypothesis Sets

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu

Motivation

With an infinite hypothesis set H, the error bounds of the previous lecture are not informative.
• Is efficient learning from a finite sample possible when H is infinite? Our example of axis-aligned rectangles shows that it is.
• Can we reduce the infinite case to a finite one, for instance by projecting H over finite samples?
• Are there useful measures of complexity for infinite hypothesis sets?

This lecture

• Rademacher complexity
• Growth function
• VC dimension
• Lower bound

Empirical Rademacher Complexity

Definition: let G be a family of functions mapping from a set Z to [a, b], and let S = (z_1, ..., z_m) be a fixed sample. The \sigma_i's (Rademacher variables) are independent uniform random variables taking values in \{-1, +1\}. The empirical Rademacher complexity of G with respect to S is
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big],$$
that is, the expected correlation of G with random noise over the sample.

Rademacher Complexity

Definitions: let G be a family of functions mapping from Z to [a, b].
• Empirical Rademacher complexity of G:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m} \sum_{i=1}^m \sigma_i\, g(z_i)\Big],$$
where the \sigma_i's are independent uniform random variables taking values in \{-1, +1\} and S = (z_1, ..., z_m).
• Rademacher complexity of G:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big].$$

Rademacher Complexity Bound
(Koltchinskii and Panchenko, 2002)

Theorem: let G be a family of functions mapping from Z to [0, 1]. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all g \in G:
$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\,\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$\mathbb{E}[g(z)] \leq \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Proof: apply McDiarmid's inequality to \Phi(S) = \sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big).
• Changing one point of S changes \Phi(S) by at most 1/m: if S and S' differ only in their last point,
$$\Phi(S) - \Phi(S') = \sup_{g \in G}\big\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S}[g]\big\} - \sup_{g \in G}\big\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\big\} \leq \sup_{g \in G}\big\{\widehat{\mathbb{E}}_{S'}[g] - \widehat{\mathbb{E}}_{S}[g]\big\} = \sup_{g \in G} \frac{1}{m}\big(g(z_m') - g(z_m)\big) \leq \frac{1}{m}.$$
• Thus, by McDiarmid's inequality, with probability at least 1 - \delta/2,
$$\Phi(S) \leq \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$$
(the first statement of the theorem follows in the same way with \delta in place of \delta/2).
• We are left with bounding the expectation. Series of observations:
$$\mathbb{E}_S[\Phi(S)] = \mathbb{E}_S\Big[\sup_{g \in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S(g)\big)\Big] = \mathbb{E}_S\Big[\sup_{g \in G} \mathbb{E}_{S'}\big[\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big]\Big]$$
$$\leq \mathbb{E}_{S, S'}\Big[\sup_{g \in G}\big(\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big)\Big] \qquad \text{(Jensen's ineq.)}$$
$$= \mathbb{E}_{S, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \big(g(z_i') - g(z_i)\big)\Big]$$
$$= \mathbb{E}_{\sigma, S, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\big(g(z_i') - g(z_i)\big)\Big] \qquad \text{(swap } z_i \text{ and } z_i')$$
$$\leq \mathbb{E}_{\sigma, S'}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i\, g(z_i')\Big] + \mathbb{E}_{\sigma, S}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m -\sigma_i\, g(z_i)\Big] \qquad \text{(sub-additivity of sup)}$$
$$= 2\,\mathfrak{R}_m(G).$$
• Now, changing one point of S makes \widehat{\mathfrak{R}}_S(G) vary by at most 1/m. Thus, again by McDiarmid's inequality, with probability at least 1 - \delta/2,
$$\mathfrak{R}_m(G) \leq \widehat{\mathfrak{R}}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
• Thus, by the union bound, with probability at least 1 - \delta,
$$\Phi(S) \leq 2\,\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
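The definition above has a direct computational reading: draw random sign vectors \sigma and measure how well hypotheses can correlate with them. The sketch below estimates \widehat{\mathfrak{R}}_S(H) by Monte Carlo for a small finite class of one-dimensional threshold classifiers; the class, the sample, and the number of trials are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sample S of m real-valued points (assumption: 1-D inputs in [-1, 1]).
m = 50
S = np.sort(rng.uniform(-1.0, 1.0, size=m))

# Finite hypothesis class H: threshold functions x -> sign(x - t), one per candidate t.
# Row j of H_vals is the label vector (h_j(z_1), ..., h_j(z_m)) in {-1, +1}^m.
thresholds = np.linspace(-1.1, 1.1, 23)
H_vals = np.where(S[None, :] > thresholds[:, None], 1.0, -1.0)

def empirical_rademacher(H_vals, n_trials=2000, rng=rng):
    """Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (1/m) sum_i sigma_i h(z_i) ]."""
    m = H_vals.shape[1]
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(H_vals @ sigma) / m       # sup over h of the correlation with sigma
    return total / n_trials

print(f"estimated empirical Rademacher complexity: {empirical_rademacher(H_vals):.3f}")
```

For a class this small the supremum is an exact maximum over rows; for richer classes that inner maximization is itself an ERM-type problem, which is the computational difficulty noted later in the lecture.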
Loss Functions - Hypothesis Set

Proposition: let H be a family of functions taking values in \{-1, +1\} and let G be the family of zero-one loss functions associated to H:
$$G = \big\{(x, y) \mapsto 1_{h(x) \neq y} \colon h \in H\big\}.$$
Then,
$$\mathfrak{R}_m(G) = \frac{1}{2}\,\mathfrak{R}_m(H).$$

Proof: using 1_{h(x) \neq y} = \frac{1 - y\,h(x)}{2} and the fact that -\sigma_i y_i has the same distribution as \sigma_i,
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, 1_{h(x_i) \neq y_i}\Big] = \mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, \frac{1 - y_i h(x_i)}{2}\Big]$$
$$= \frac{1}{2}\,\mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m -\sigma_i\, y_i\, h(x_i)\Big] = \frac{1}{2}\,\mathbb{E}_{\sigma}\Big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i\, h(x_i)\Big] = \frac{1}{2}\,\widehat{\mathfrak{R}}_S(H).$$

Generalization Bounds - Rademacher

Corollary: let H be a family of functions taking values in \{-1, +1\}. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$$
$$R(h) \leq \widehat{R}(h) + \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Remarks

• The first bound is distribution-dependent, the second data-dependent, which makes them attractive.
• But how do we compute the empirical Rademacher complexity? Computing \mathbb{E}_{\sigma}\big[\sup_{h \in H} \frac{1}{m}\sum_{i=1}^m \sigma_i h(x_i)\big] requires solving ERM problems, typically computationally hard.
• Is there a relation with combinatorial measures that are easier to compute?

Growth Function

Definition: the growth function \Pi_H \colon \mathbb{N} \to \mathbb{N} for a hypothesis set H is defined by
$$\forall m \in \mathbb{N}, \quad \Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\big\{\big(h(x_1), \ldots, h(x_m)\big) \colon h \in H\big\}\big|.$$
Thus, \Pi_H(m) is the maximum number of ways m points can be classified using H.

Massart's Lemma
(Massart, 2000)

Theorem: let A \subseteq \mathbb{R}^m be a finite set, with R = \max_{x \in A} \|x\|_2. Then, the following holds:
$$\mathbb{E}_{\sigma}\Big[\frac{1}{m}\sup_{x \in A} \sum_{i=1}^m \sigma_i x_i\Big] \leq \frac{R\sqrt{2\log|A|}}{m}.$$

Proof: for any t > 0,
$$\exp\Big(t\,\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big]\Big) \leq \mathbb{E}_{\sigma}\Big[\exp\Big(t \sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big)\Big] \qquad \text{(Jensen's ineq.)}$$
$$= \mathbb{E}_{\sigma}\Big[\sup_{x \in A} \exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] \leq \sum_{x \in A} \mathbb{E}_{\sigma}\Big[\exp\Big(t \sum_{i=1}^m \sigma_i x_i\Big)\Big] = \sum_{x \in A} \prod_{i=1}^m \mathbb{E}_{\sigma_i}\big[\exp(t\,\sigma_i x_i)\big]$$
$$\leq \sum_{x \in A} \prod_{i=1}^m \exp\Big(\frac{t^2 (2|x_i|)^2}{8}\Big) \qquad \text{(Hoeffding's ineq.)}$$
$$= \sum_{x \in A} \exp\Big(\frac{t^2 \|x\|_2^2}{2}\Big) \leq |A|\, e^{\frac{t^2 R^2}{2}}.$$
• Taking the log yields:
$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \leq \frac{\log|A|}{t} + \frac{t R^2}{2}.$$
• Minimizing the bound by choosing t = \frac{\sqrt{2\log|A|}}{R} gives
$$\mathbb{E}_{\sigma}\Big[\sup_{x \in A}\sum_{i=1}^m \sigma_i x_i\Big] \leq R\sqrt{2\log|A|}.$$

Growth Function Bound on Rademacher Complexity

Corollary: let G be a family of functions taking values in \{-1, +1\}. Then, the following holds:
$$\mathfrak{R}_m(G) \leq \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$

Proof: since each vector \big(g(z_1), \ldots, g(z_m)\big) has norm \sqrt{m},
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_{\sigma}\Big[\sup_{g \in G} \frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\Big] \leq \frac{\sqrt{m}\,\sqrt{2\log\big|\{(g(z_1), \ldots, g(z_m)) \colon g \in G\}\big|}}{m} \qquad \text{(Massart's lemma)}$$
$$\leq \frac{\sqrt{m}\,\sqrt{2\log\Pi_G(m)}}{m} = \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$

Generalization Bound - Growth Function

Corollary: let H be a family of functions taking values in \{-1, +1\}. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2\log\Pi_H(m)}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
But how do we compute the growth function? Its relationship with the VC dimension (Vapnik-Chervonenkis dimension) provides an answer.
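As a companion to the growth function bound above, the sketch below counts the dichotomies that one-dimensional threshold classifiers realize on a sample (for this class the count is m + 1) and evaluates the resulting bound \sqrt{2\log\Pi_H(m)/m}. The hypothesis class and the sample are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def dichotomies_on_sample(points, thresholds):
    """Distinct labelings that threshold classifiers x -> sign(x - t) realize on `points`;
    with thresholds placed between consecutive points this equals Pi_H(m) = m + 1."""
    return {tuple(np.where(points > t, 1, -1)) for t in thresholds}

m = 40
points = np.sort(rng.uniform(-1.0, 1.0, size=m))
# One candidate threshold below all points, one between each pair of consecutive
# points, and one above all points: this realizes every achievable dichotomy.
thresholds = np.concatenate((
    [points[0] - 1.0],
    (points[:-1] + points[1:]) / 2,
    [points[-1] + 1.0],
))

Pi = len(dichotomies_on_sample(points, thresholds))
bound = np.sqrt(2 * np.log(Pi) / m)
print(f"dichotomies realized: {Pi} (expected m + 1 = {m + 1})")
print(f"growth-function bound on the Rademacher complexity: {bound:.3f}")
```

Because \Pi_H(m) grows only polynomially here, the bound decays roughly like \sqrt{\log m / m}; the VC dimension, introduced next, captures exactly this polynomial-versus-exponential distinction combinatorially.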
VC Dimension
(Vapnik & Chervonenkis, 1968-1971; Vapnik, 1982, 1995, 1998)

Definition: the VC dimension of a hypothesis set H is defined by
$$\mathrm{VCdim}(H) = \max\{m \colon \Pi_H(m) = 2^m\}.$$
Thus, the VC dimension is the size of the largest set that can be fully shattered by H. It is a purely combinatorial notion.

Examples

In the following, we determine the VC dimension of several hypothesis sets.
• To give a lower bound d on VCdim(H), it suffices to exhibit a set S of cardinality d that can be shattered by H.
• To give an upper bound, we need to prove that no set S of cardinality d + 1 can be shattered by H, which is typically more difficult.

Intervals of the Real Line

Observations:
• Any set of two points can be shattered by four intervals, realizing the dichotomies "- -", "- +", "+ -", and "+ +".
• No set of three points can be shattered, since the dichotomy "+ - +" is not realizable (by definition of intervals).
Thus, VCdim(intervals in \mathbb{R}) = 2.

Hyperplanes

Observations:
• Any three non-collinear points in the plane can be shattered.
• No set of four points in the plane can be shattered: if one point lies in the convex hull of the other three, label it opposite to them; otherwise the four points form a convex quadrilateral, and the dichotomy labeling diagonally opposite points alike is not realizable.
Thus, VCdim(hyperplanes in \mathbb{R}^2) = 3, and in general VCdim(hyperplanes in \mathbb{R}^d) = d + 1.

Axis-Aligned Rectangles in the Plane

Observations:
• There is a set of four points (e.g., in a diamond configuration) that can be shattered.
• No set of five points can be shattered: label positively the leftmost, rightmost, topmost, and bottommost points and negatively a remaining point; any axis-aligned rectangle containing the four positive points must also contain the negative one.
Thus, VCdim(axis-aligned rectangles) = 4.

Convex Polygons in the Plane

Observations:
• Any 2d + 1 points on a circle can be shattered by a convex d-gon: if the positive points are fewer than the negative ones, take them as the vertices of the polygon; if they are more numerous, take the edges of the polygon tangent to the circle at the negative points.
• It can be shown that choosing the points on the circle maximizes the number of possible dichotomies.
Thus, VCdim(convex d-gons) = 2d + 1. Also, VCdim(convex polygons) = +\infty.

Sine Functions

Observations: for any m, there is a set of m points on the real line (for instance, points of the form 2^{-i}, i = 1, \ldots, m) that can be shattered by the sine-based classifiers \{t \mapsto \mathrm{sgn}(\sin(\omega t)) \colon \omega \in \mathbb{R}\}.
Thus, VCdim(sine functions) = +\infty.

Sauer's Lemma
(Vapnik & Chervonenkis, 1968-1971; Sauer, 1972)

Theorem: let H be a hypothesis set with VCdim(H) = d. Then, for all m \in \mathbb{N},
$$\Pi_H(m) \leq \sum_{i=0}^{d} \binom{m}{i}.$$

Proof: the proof is by induction on m + d. The statement clearly holds for m = 1 and d = 0 or d = 1. Assume that it holds for (m - 1, d - 1) and (m - 1, d).
• Fix a set S = \{x_1, \ldots, x_m\} with \Pi_H(m) dichotomies and let G = H_{|S} be the set of concepts H induces by restriction to S.
• Consider the following families over S' = \{x_1, \ldots, x_{m-1}\}:
$$G_1 = G_{|S'} \qquad \text{and} \qquad G_2 = \{g' \subseteq S' \colon (g' \in G) \wedge (g' \cup \{x_m\} \in G)\}.$$
Viewing each concept as a row of labels over x_1, \ldots, x_m, G_1 is obtained by dropping the last column and removing duplicate rows, while G_2 keeps exactly the rows that appear with both labels of x_m.
• Observe that |G_1| + |G_2| = |G|.
• Since VCdim(G_1) \leq d, by the induction hypothesis,
$$|G_1| \leq \Pi_{G_1}(m - 1) \leq \sum_{i=0}^{d} \binom{m - 1}{i}.$$
• By definition of G_2, if a set Z \subseteq S' is shattered by G_2, then the set Z \cup \{x_m\} is shattered by G. Thus, VCdim(G_2) \leq VCdim(G) - 1 = d - 1, and by the induction hypothesis,
$$|G_2| \leq \Pi_{G_2}(m - 1) \leq \sum_{i=0}^{d-1} \binom{m - 1}{i}.$$
• Thus,
$$|G| = |G_1| + |G_2| \leq \sum_{i=0}^{d} \binom{m - 1}{i} + \sum_{i=0}^{d-1} \binom{m - 1}{i} = \sum_{i=0}^{d} \Big[\binom{m - 1}{i} + \binom{m - 1}{i - 1}\Big] = \sum_{i=0}^{d} \binom{m}{i}.$$

Sauer's Lemma - Consequence

Corollary: let H be a hypothesis set with VCdim(H) = d. Then, for all m \geq d,
$$\Pi_H(m) \leq \Big(\frac{em}{d}\Big)^d = O(m^d).$$

Proof:
$$\sum_{i=0}^{d} \binom{m}{i} \leq \sum_{i=0}^{d} \binom{m}{i} \Big(\frac{m}{d}\Big)^{d - i} \leq \sum_{i=0}^{m} \binom{m}{i} \Big(\frac{m}{d}\Big)^{d - i} = \Big(\frac{m}{d}\Big)^{d} \sum_{i=0}^{m} \binom{m}{i} \Big(\frac{d}{m}\Big)^{i} = \Big(\frac{m}{d}\Big)^{d} \Big(1 + \frac{d}{m}\Big)^{m} \leq \Big(\frac{m}{d}\Big)^{d} e^{d}.$$

Remarks

• Remarkable property of the growth function: either VCdim(H) = d < +\infty and \Pi_H(m) = O(m^d), or VCdim(H) = +\infty and \Pi_H(m) = 2^m for all m.
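The interval example can be checked mechanically: enumerate every dichotomy of a candidate set and test whether some interval realizes it. The brute-force helper below is a sketch under the assumption that hypotheses are closed intervals [a, b] labeling the points they contain as +1.

```python
from itertools import product

def interval_realizes(points, labels):
    """True if some closed interval [a, b] labels exactly the +1 points positively."""
    positives = [x for x, y in zip(points, labels) if y == +1]
    if not positives:
        return True  # an "empty" interval realizes the all-negative labeling
    a, b = min(positives), max(positives)
    # [a, b] is the smallest interval covering the positives; it must avoid every negative.
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == -1)

def shattered_by_intervals(points):
    """Check whether every dichotomy of `points` is realized by some interval."""
    return all(interval_realizes(points, labels)
               for labels in product([-1, +1], repeat=len(points)))

print(shattered_by_intervals([0.0, 1.0]))       # True: any two points are shattered
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False: the "+ - +" dichotomy fails
```

Consistent with the slide, two points are shattered while three are not, so VCdim(intervals in R) = 2; the same enumeration pattern, with a larger search, applies to the axis-aligned rectangle example.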
Generalization Bound - VC Dimension

Corollary: let H be a family of functions taking values in \{-1, +1\} with VC dimension d. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{2 d \log\frac{em}{d}}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

Proof: the growth function corollary combined with Sauer's lemma.
Note: the general form of the result is
$$R(h) \leq \widehat{R}(h) + O\Bigg(\sqrt{\frac{\log(m/d)}{m/d}}\Bigg).$$

Comparison - Standard VC Bound
(Vapnik & Chervonenkis, 1971; Vapnik, 1982)

Theorem: let H be a family of functions taking values in \{-1, +1\} with VC dimension d. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,
$$R(h) \leq \widehat{R}(h) + \sqrt{\frac{8 d \log\frac{2em}{d} + 8\log\frac{4}{\delta}}{m}}.$$

Proof: derived from the growth function bound
$$\Pr\big[|R(h) - \widehat{R}(h)| > \epsilon\big] \leq 4\,\Pi_H(2m) \exp\Big(-\frac{m\epsilon^2}{8}\Big).$$

VCDim Lower Bound - Realizable Case
(Ehrenfeucht et al., 1988)

Theorem: let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L, there exist a distribution D and a target function f \in H such that
$$\Pr_{S \sim D^m}\Big[R_D(h_S, f) > \frac{d - 1}{32m}\Big] \geq \frac{1}{100}.$$

Proof: choose D such that L can do no better than tossing a coin for some points.
• Let X = \{x_0, x_1, \ldots, x_{d-1}\} be a set fully shattered by H. For any \epsilon > 0, define D with support X by
$$\Pr_D[x_0] = 1 - 8\epsilon \qquad \text{and} \qquad \forall i \in [1, d - 1], \ \Pr_D[x_i] = \frac{8\epsilon}{d - 1}.$$
• We can assume without loss of generality that L makes no error on x_0.
• For a sample S, let \bar{S} denote the set of its elements falling in X_1 = \{x_1, \ldots, x_{d-1}\}, and let \mathcal{S} be the set of samples of size m with at most (d - 1)/2 points in X_1. Fix a sample S \in \mathcal{S}. Averaging over target labelings f drawn uniformly at random over the shattered set, and using |X_1 - \bar{S}| \geq (d - 1)/2,
$$\mathbb{E}_{f \sim U}\big[R_D(h_S, f)\big] = \sum_{x \in X} \sum_{f} 1_{h_S(x) \neq f(x)} \Pr[x] \Pr[f] \geq \sum_{x \in X_1 - \bar{S}} \sum_{f} 1_{h_S(x) \neq f(x)} \Pr[f] \Pr[x] = \frac{1}{2} \sum_{x \in X_1 - \bar{S}} \Pr[x] \geq \frac{1}{2} \cdot \frac{d - 1}{2} \cdot \frac{8\epsilon}{d - 1} = 2\epsilon.$$
• Since the inequality holds for all S \in \mathcal{S}, it also holds in expectation: \mathbb{E}_{S \in \mathcal{S}, f \sim U}[R_D(h_S, f)] \geq 2\epsilon. This implies that there exists a labeling f_0 such that \mathbb{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \geq 2\epsilon. Since \Pr_D[X - \{x_0\}] \leq 8\epsilon, we also have R_D(h_S, f_0) \leq 8\epsilon. Thus,
$$2\epsilon \leq \mathbb{E}_{S \in \mathcal{S}}[R_D(h_S, f_0)] \leq 8\epsilon \Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big] + \epsilon\Big(1 - \Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big]\Big).$$
• Collecting terms in \Pr_{S \in \mathcal{S}}[R_D(h_S, f_0) \geq \epsilon], we obtain:
$$\Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big] \geq \frac{1}{7}(2 - 1) = \frac{1}{7}.$$
• Thus, the probability over all samples S (not necessarily in \mathcal{S}) can be lower bounded as
$$\Pr_{S}\big[R_D(h_S, f_0) \geq \epsilon\big] \geq \Pr_{S \in \mathcal{S}}\big[R_D(h_S, f_0) \geq \epsilon\big] \Pr[\mathcal{S}] \geq \frac{1}{7} \Pr[\mathcal{S}].$$
• This leads us to seeking a lower bound for \Pr[\mathcal{S}]. Let S_m denote the number of points of the sample falling in X_1; its expectation is 8\epsilon m, and the probability that more than (d - 1)/2 points be drawn in X_1 verifies the multiplicative Chernoff bound: for any \gamma > 0,
$$1 - \Pr[\mathcal{S}] = \Pr\big[S_m \geq 8\epsilon m (1 + \gamma)\big] \leq e^{-\frac{8\epsilon m \gamma^2}{3}}.$$
• Thus, for \epsilon = (d - 1)/(32m) and \gamma = 1,
$$\Pr\Big[S_m \geq \frac{d - 1}{2}\Big] \leq e^{-(d - 1)/12} \leq e^{-1/12} \leq 1 - 7\delta,$$
for \delta = .01. Thus, \Pr[\mathcal{S}] \geq 7\delta and
$$\Pr_S\big[R_D(h_S, f_0) \geq \epsilon\big] \geq \frac{1}{7}\Pr[\mathcal{S}] \geq \delta = \frac{1}{100}.$$

Agnostic PAC Model

Definition: a hypothesis set H is PAC-learnable in the agnostic setting if there exists a learning algorithm L such that:
• for all \epsilon > 0, \delta > 0, and all distributions D over X \times Y,
$$\Pr_{S \sim D^m}\Big[R(h_S) - \inf_{h \in H} R(h) \leq \epsilon\Big] \geq 1 - \delta,$$
• for samples S of size m = \mathrm{poly}(1/\epsilon, 1/\delta) for a fixed polynomial.
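Before the agnostic-case lower bound on the next slide, it is instructive to see what sample size the VC-dimension upper bound above suggests. The sketch below searches for the smallest m at which \sqrt{2d\log(em/d)/m} + \sqrt{\log(1/\delta)/(2m)} drops below a target \epsilon, and compares it with the d/(320\epsilon^2) lower bound stated next; the particular d, \epsilon, and \delta are illustrative assumptions.

```python
import math

def vc_bound_gap(m, d, delta):
    """Complexity term of the VC-dimension bound: R(h) <= R_hat(h) + gap(m, d, delta)."""
    return (math.sqrt(2 * d * math.log(math.e * m / d) / m)
            + math.sqrt(math.log(1 / delta) / (2 * m)))

def sample_size_for(epsilon, d, delta):
    """Smallest m with vc_bound_gap(m, d, delta) <= epsilon (the gap decreases for m >= d)."""
    hi = d
    while vc_bound_gap(hi, d, delta) > epsilon:
        hi *= 2                      # doubling phase: find an m that already satisfies the target
    lo = max(d, hi // 2)
    while lo < hi:                   # bisection phase: pin down the smallest such m
        mid = (lo + hi) // 2
        if vc_bound_gap(mid, d, delta) <= epsilon:
            hi = mid
        else:
            lo = mid + 1
    return lo

d, delta, epsilon = 10, 0.05, 0.1
m_upper = sample_size_for(epsilon, d, delta)
m_lower = d / (320 * epsilon ** 2)   # agnostic-case lower bound from the next slide
print(f"upper bound suggests m ~ {m_upper}, lower bound requires m >= {m_lower:.0f}")
```

Both quantities scale roughly linearly in d; the remaining gap reflects the logarithmic factor and the constants in the upper bound.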
VCDim Lower Bound - Non-Realizable Case
(Anthony and Bartlett, 1999)

Theorem: let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L, there exists a distribution D over X \times \{0, 1\} such that
$$\Pr_{S \sim D^m}\Big[R_D(h_S) - \inf_{h \in H} R_D(h) > \sqrt{\frac{d}{320m}}\Big] \geq \frac{1}{64}.$$
Equivalently, for any learning algorithm, the sample complexity verifies
$$m \geq \frac{d}{320\,\epsilon^2}.$$

References

• Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
• Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4), 1989.
• A. Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Proceedings of the 1st COLT, pp. 139-154, 1988.
• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
• Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245-303, 2000.
• N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
• Vladimir N. Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971.
• Vladimir N. Vapnik and Alexey Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.