Learning Relatively Small Classes

Shahar Mendelson

Computer Sciences Laboratory, RSISE, The Australian National University, Canberra 0200, Australia
shahar@csl.anu.edu.au

Abstract. We study the sample complexity of proper and improper learning problems with respect to different $L_q$ loss functions. We improve the known estimates for classes which have relatively small covering numbers (log-covering numbers which are polynomial with exponent $p<2$) with respect to the $L_q$ norm for $q\ge2$.

1 Introduction

In this article we present sample complexity estimates for proper and improper learning problems of classes with respect to different loss functions, under the assumption that the classes are not "too large"; namely, we assume that their log-covering numbers are polynomial in $\varepsilon^{-1}$ with exponent $p<2$.

The question we explore is the following: let $G$ be a class of functions defined on a probability space $(\Omega,\mu)$ such that each $g\in G$ maps $\Omega$ into $[0,1]$, and let $T$ be an unknown function which is bounded by 1 (not necessarily a member of $G$). Recall that a learning rule $L$ is a map which assigns to each sample $S_n=(\omega_1,\ldots,\omega_n)$ a function $L_{S_n}\in G$. The learning sample complexity associated with the $q$-loss, accuracy $\varepsilon$ and confidence $\delta$ is the first integer $n_0$ such that for every $n\ge n_0$, every probability measure $\mu$ on $(\Omega,\Sigma)$ and every measurable target $T$ bounded by 1,
$$\mu^\infty\Bigl\{E_\mu|L_{S_n}-T|^q\ge\inf_{g\in G}E_\mu|g-T|^q+\varepsilon\Bigr\}\le\delta,$$
where $L$ is any learning rule and $\mu^\infty$ is the infinite product measure.

We were motivated by two methods previously used in the investigation of sample complexity [1]. Firstly, there is the standard approach which uses the Glivenko-Cantelli (GC) condition to estimate the sample complexity. By this we mean the following: let $G$ be a class of functions on $\Omega$, let $T$ be the target concept (which, for the sake of simplicity, is assumed to be deterministic), set $1\le q<\infty$ and let $F=\{|g-T|^q\mid g\in G\}$. The Glivenko-Cantelli sample complexity of the class $F$ with respect to accuracy $\varepsilon$ and confidence $\delta$ is the smallest integer $n_0$ such that for every $n\ge n_0$,
$$\sup_\mu\mu^\infty\Bigl\{\sup_{g\in G}\bigl|E_\mu|g-T|^q-E_{\mu_n}|g-T|^q\bigr|\ge\varepsilon\Bigr\}\le\delta,$$
where $\mu_n$ is the empirical measure supported on the sample $(\omega_1,\ldots,\omega_n)$. Therefore, if $g$ is an "almost" minimizer of the empirical loss $E_{\mu_n}|g-T|^q$ and if the sample is "large enough", then $g$ is an "almost" minimizer with respect to the average loss. One can show that the learning sample complexity is bounded by the supremum of the GC sample complexities, where the supremum is taken over all possible targets $T$ bounded by 1. This is true even in the agnostic case, in which $T$ may be random (for further details, see [1]). Lately, it was shown [7] that if the log-covering numbers (resp. the fat-shattering dimension) of $G$ are of the order of $\varepsilon^{-p}$ for $p\ge2$, then the GC sample complexity of $F$ is $\Theta(\varepsilon^{-p})$ up to logarithmic factors in $\varepsilon^{-1},\delta^{-1}$.

It is important to emphasize that the learning sample complexity may be established by means other than the GC condition. Hence, it comes as no surprise that there are certain cases in which it is possible to improve this bound on the learning sample complexity.
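As a concrete illustration of the GC-based route described above, the following sketch (an entirely hypothetical setup, not taken from the paper: a finite class of threshold functions on $[0,1]$, a threshold target and the uniform measure) estimates the uniform deviation appearing in the GC condition by Monte Carlo and reports the empirical loss minimizer.

```python
import numpy as np

# Hypothetical illustration of the GC-based route: G is a class of threshold
# functions g_t(x) = 1[x >= t] on Omega = [0, 1] with mu uniform, the target is
# T(x) = 1[x >= 0.3], and we estimate the uniform deviation
# sup_g |E_mu |g - T|^q - E_{mu_n} |g - T|^q| together with the ERM output.
rng = np.random.default_rng(0)
q = 2
thresholds = np.linspace(0.0, 1.0, 21)      # parametrises the class G
target_t = 0.3

def empirical_loss(t, x):
    """Empirical q-loss of the threshold g_t on the sample x."""
    g = (x >= t).astype(float)
    T = (x >= target_t).astype(float)
    return np.mean(np.abs(g - T) ** q)

def true_loss(t):
    # E_mu |g_t - T|^q is the mass of the interval between t and 0.3.
    return abs(t - target_t)

for n in (50, 200, 800):
    x = rng.uniform(size=n)                 # the sample S_n
    emp = np.array([empirical_loss(t, x) for t in thresholds])
    deviation = np.max(np.abs(emp - np.array([true_loss(t) for t in thresholds])))
    t_erm = thresholds[np.argmin(emp)]      # empirical loss minimiser
    print(f"n={n:4d}  sup deviation={deviation:.3f}  ERM threshold={t_erm:.2f}")
```

As $n$ grows the uniform deviation shrinks and the empirical minimizer settles near the target threshold, which is precisely the mechanism the GC argument exploits.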
In [5,6] the following case was examined: let $F$ be the loss class given by $\{|g-T|^2-|T-P_GT|^2\mid g\in G\}$, where $P_GT$ is a nearest point to $T$ in $G$ with respect to the $L_2(\mu)$ norm. Assume that there is an absolute constant $C$ such that for every $f\in F$, $E_\mu f^2\le CE_\mu f$, i.e., that it is possible to control the variance of each loss function using its expectation. In this case, the learning sample complexity with accuracy $\varepsilon$ and confidence $\delta$ is
$$O\Bigl(\frac1\varepsilon\Bigl(\mathrm{fat}_\varepsilon(F)\log^2\frac{\mathrm{fat}_\varepsilon(F)}\varepsilon+\log\frac1\delta\Bigr)\Bigr).$$
Therefore, if $\mathrm{fat}_\varepsilon(F)=O(\varepsilon^{-p})$ the learning sample complexity is bounded (up to logarithmic factors) by $O(\varepsilon^{-(1+p)})$. If $p<1$, this estimate is better than the one obtained using the GC sample complexities.

As it turns out, the assumption above is not so far-fetched; it is possible to show [5,6] that there are two generic cases in which $E_\mu f^2\le CE_\mu f$. The first case is when $T\in G$, because it implies that each $f\in F$ is nonnegative. The other case is when $G$ is convex and $q=2$; then every loss function is given by $|g-T|^2-|T-P_GT|^2$, where $P_GT$ is the nearest point to $T$ in $G$ with respect to the $L_2(\mu)$ norm.

Here, we combine the ideas used in [6] and in [7] to improve the learning complexity estimates. We show that if
$$\sup_n\sup_{\mu_n}\log N\bigl(\varepsilon,G,L_2(\mu_n)\bigr)=O(\varepsilon^{-p})$$
for $p<2$, and if either $T\in G$ or $G$ is convex, then the learning sample complexity is $O(\varepsilon^{-(1+p/2)})$ up to logarithmic factors in $\varepsilon^{-1},\delta^{-1}$. Recently it was shown in [7,8] that for every $1\le q<\infty$ there is a constant $c_q$ such that
$$\sup_n\sup_{\mu_n}\log N\bigl(\varepsilon,G,L_q(\mu_n)\bigr)\le c_q\,\mathrm{fat}_{\frac\varepsilon8}(G)\log^2\frac{2\,\mathrm{fat}_{\frac\varepsilon8}(G)}\varepsilon,\qquad(1)$$
therefore, the estimates we obtain improve the $O\bigl(\varepsilon^{-(1+p)}\bigr)$ established in [6].

Our discussion is divided into three sections. In the second section we investigate the GC condition for classes which have "small" log-covering numbers. We focus on the case where the deviation in the GC condition is of the same order of magnitude as $\sup_{f\in F}E_\mu f^2$. The proof is based on estimates on the Rademacher averages (defined below) associated with the class. Next, we explore sufficient conditions which imply that if $F$ is the $q$-loss class associated with a convex class $G$, then $E_\mu f^2$ may be controlled by $E_\mu f$. We use a geometric approach to prove that if $q\ge2$ there is some constant $B$ such that for every $q$-loss function $f$, $E_\mu f^2\le B(E_\mu f)^{2/q}$. It turns out that those estimates are the key behind the proof of the learning sample complexity, which is investigated in the fourth and final section.

Next, we turn to some definitions and notation we shall use throughout this article. If $G$ is a class of functions, $T$ is the target concept and $1\le q<\infty$, then for every $g\in G$ let $\ell_q(g)$ be its $q$-loss function. Thus,
$$\ell_q(g)=|g-T|^q-|T-P_GT|^q,$$
where $P_GT$ is a nearest element to $T$ in $G$ with respect to the $L_q$ norm. We denote by $F$ the set of loss functions $\ell_q(g)$.

Let $G$ be a GC class. For every $0<\varepsilon,\delta<1$, denote by $S_G(\varepsilon,\delta)$ the GC sample complexity of the class $G$ associated with accuracy $\varepsilon$ and confidence $\delta$. Let $C^q_{G,T}(\varepsilon,\delta)$ be the learning sample complexity of the class $G$ with respect to the target $T$ and the $q$-loss, for accuracy $\varepsilon$ and confidence $\delta$.

Given a Banach space $X$, let $B(X)$ be the unit ball of $X$. If $B\subset X$ is a ball, set $\mathrm{int}(B)$ to be the interior of $B$ and $\partial B$ to be the boundary of $B$. The dual of $X$, denoted by $X^*$, consists of all the bounded linear functionals on $X$, endowed with the norm $\|x^*\|=\sup_{\|x\|=1}|x^*(x)|$. If $x,y\in X$, the interval $[x,y]$ is defined by $[x,y]=\{tx+(1-t)y\mid0\le t\le1\}$. For any probability measure $\mu$ on a measurable space $(\Omega,\Sigma)$, let $E_\mu$ be the expectation with respect to $\mu$.
$L_q(\mu)$ is the set of functions which satisfy $E_\mu|f|^q<\infty$, and we set $\|f\|_q=(E_\mu|f|^q)^{1/q}$. $L_\infty(\Omega)$ is the space of bounded functions on $\Omega$, endowed with the norm $\|f\|_\infty=\sup_{\omega\in\Omega}|f(\omega)|$. We denote by $\mu_n$ an empirical measure supported on a set of $n$ points; hence $\mu_n=\frac1n\sum_{i=1}^n\delta_{\omega_i}$, where $\delta_{\omega_i}$ is the point evaluation functional at $\omega_i$. If $(X,d)$ is a metric space, set $B(x,r)$ to be the closed ball centered at $x$ with radius $r$. Recall that if $F\subset X$, the $\varepsilon$-covering number of $F$, denoted by $N(\varepsilon,F,d)$, is the minimal number of open balls with radius $\varepsilon>0$ (with respect to the metric $d$) needed to cover $F$. A set $A\subset X$ is said to be an $\varepsilon$-cover of $F$ if the union of open balls $\bigcup_{a\in A}B(a,\varepsilon)$ contains $F$. In cases where the metric $d$ is clear, we shall denote the covering numbers of $F$ by $N(\varepsilon,F)$.

It is possible to show that the covering numbers of the $q$-loss class $F$ are essentially the same as those of $G$.

Lemma 1. Let $G\subset B\bigl(L_\infty(\Omega)\bigr)$ and let $F$ be the $q$-loss class associated with $G$. Then, for any probability measure $\mu$,
$$\log N\bigl(\varepsilon,F,L_2(\mu)\bigr)\le\log N\bigl(\varepsilon/q,G,L_2(\mu)\bigr).$$

The proof of this lemma is standard and is omitted.

Next, we define the Rademacher averages of a given class of functions, which are the main tool we use in the analysis of GC classes.

Definition 1. Let $F$ be a class of functions and let $\mu$ be a probability measure on $\Omega$. Set
$$\bar R_{n,\mu}=E_\mu E_\varepsilon\sup_{f\in F}\frac1{\sqrt n}\Bigl|\sum_{i=1}^n\varepsilon_if(X_i)\Bigr|,$$
where the $\varepsilon_i$ are independent Rademacher random variables (i.e., symmetric, $\{-1,1\}$-valued) and the $(X_i)$ are independent, distributed according to $\mu$.

It is possible to estimate $\bar R_{n,\mu}$ of a given class using its $L_2(\mu_n)$ covering numbers. The key result behind this estimate is the following theorem, which is originally due to Dudley for Gaussian processes and was extended to the more general setting of subgaussian processes [10]. We formulate it only in the case of Rademacher processes.

Theorem 1. There is an absolute constant $C$ such that for every probability measure $\mu$, any integer $n$ and every sample $(\omega_i)_{i=1}^n$,
$$E_\varepsilon\sup_{f\in F}\frac1{\sqrt n}\Bigl|\sum_{i=1}^n\varepsilon_if(\omega_i)\Bigr|\le C\int_0^\delta\log^{\frac12}N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\,d\varepsilon,$$
where
$$\delta=\sup_{f\in F}\Bigl(\frac1n\sum_{i=1}^nf^2(\omega_i)\Bigr)^{\frac12}$$
and $\mu_n$ is the empirical measure supported on the given sample $(\omega_1,\ldots,\omega_n)$.

Finally, throughout this paper all absolute constants are assumed to be positive and are denoted by $C$ or $c$. Their values may change from line to line, or even within the same line.

We end the introduction with a statement of our main result.

Theorem 2. Let $G\subset B\bigl(L_\infty(\Omega)\bigr)$ be such that
$$\sup_n\sup_{\mu_n}\log N\bigl(\varepsilon,G,L_2(\mu_n)\bigr)\le\gamma\varepsilon^{-p}$$
for some $p<2$, and let $T\in B\bigl(L_\infty(\Omega)\bigr)$ be the target concept. If $1\le q<\infty$ and either
(1) $T\in G$, or
(2) $G$ is convex,
then there is a constant $C_{q,p,\gamma}$, which depends only on $q$, $p$ and $\gamma$, such that
$$C^q_{G,T}(\varepsilon,\delta)\le C_{q,p,\gamma}\Bigl(\frac1\varepsilon\Bigr)^{1+\frac p2}\Bigl(1+\log\frac2\delta\Bigr).$$
Also, if
$$\sup_n\sup_{\mu_n}N\bigl(\varepsilon,G,L_2(\mu_n)\bigr)\le\Bigl(\frac\gamma\varepsilon\Bigr)^d$$
and either (1) or (2) holds, then
$$C^q_{G,T}(\varepsilon,\delta)\le C_{q,p,\gamma}\,\frac d\varepsilon\Bigl(\log^2\frac2\varepsilon+\log\frac2\delta\Bigr).$$

There are numerous cases in which Theorem 2 applies; we shall name a few. In the proper case, Theorem 2 may be used, for example, if $G$ is a VC class or a VC subgraph class [10]. Another case is when $G$ satisfies $\mathrm{fat}_\varepsilon(G)=O(\varepsilon^{-p})$ [7]. In both cases the covering numbers of $G$ satisfy the conditions of the theorem. In the improper case, one may consider, for example, convex hulls of VC classes [10] and many kernel machines [11].
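As a small numerical companion to Theorem 1 (a hypothetical setup, not taken from the paper: a finite class of threshold functions, with a greedy procedure standing in for the true covering numbers), the following sketch compares a Monte Carlo estimate of the empirical Rademacher average with the corresponding entropy integral. Since greedy covers only approximate $N(\varepsilon,F,L_2(\mu_n))$ and the constant $C$ is unspecified, the two numbers are meant to be compared in order of magnitude only.

```python
import numpy as np

# Hypothetical comparison for Theorem 1: a Monte Carlo estimate of the
# empirical Rademacher average of a finite class of threshold functions versus
# the entropy integral computed from greedy L2(mu_n) covers.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(size=n)
F = np.stack([(x >= t).astype(float) for t in np.linspace(0, 1, 40)])  # |F| x n

def rademacher_average(F, trials=2000):
    n = F.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(trials, n))
    # estimates E_eps sup_f |(1/sqrt(n)) sum_i eps_i f(omega_i)|
    return np.abs(eps @ F.T).max(axis=1).mean() / np.sqrt(n)

def greedy_covering_number(F, radius):
    """Size of a greedy cover of F in the empirical L2(mu_n) metric."""
    remaining = list(range(F.shape[0]))
    centers = 0
    while remaining:
        c = remaining[0]
        d = np.sqrt(((F[remaining] - F[c]) ** 2).mean(axis=1))
        remaining = [remaining[i] for i in np.nonzero(d > radius)[0]]
        centers += 1
    return centers

diam = np.sqrt((F ** 2).mean(axis=1)).max()        # the delta of Theorem 1
grid = np.linspace(1e-3, diam, 100)
vals = np.array([np.sqrt(np.log(greedy_covering_number(F, r))) for r in grid])
entropy_integral = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))
print("Rademacher average ~", round(float(rademacher_average(F)), 3))
print("entropy integral   ~", round(float(entropy_integral), 3))
```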
2 GC for Classes with "Small" Variance

Although the "normal" GC sample complexity of a class is $\Omega(\varepsilon^{-2})$ when one allows the deviation to be arbitrarily small, it is possible to obtain better estimates when the deviation we are interested in is roughly the largest variance of a member of the class. As we demonstrate in the final section, this fact is very important in the analysis of learning sample complexities. The first step in that direction is the next result, which is due to Talagrand [9].

Theorem 3. Assume that $F$ is a class of functions into $[0,1]$. Let
$$\sigma^2=\sup_{f\in F}E_\mu(f-E_\mu f)^2,\qquad S_n=n\sigma^2+\sqrt n\,\bar R_{n,\mu}.$$
For every $L,S>0$ and $t>0$ define
$$\varphi_{L,S}(t)=\begin{cases}\dfrac{t^2}{L^2S}&\text{if }t\le LS,\\[4pt]\dfrac t{2L}\log\dfrac{et}{LS}&\text{if }t\ge LS.\end{cases}$$
There is an absolute constant $K$ such that if $t\ge K\sqrt n\,\bar R_{n,\mu}$, then
$$\mu\Bigl\{\Bigl\|\sum_{i=1}^nf(X_i)-nE_\mu f\Bigr\|_F\ge t\Bigr\}\le\exp\bigl(-\varphi_{K,S_n}(t)\bigr).$$

In the next two results we present an estimate on the Rademacher averages $\bar R_{n,\mu}$ using data on the "largest" variance of a member of the class and on the covering numbers of $F$ in empirical $L_2$ spaces. We divide the results into the case where the covering numbers are polynomial in $\varepsilon^{-1}$ and the case where the log-covering numbers are polynomial in $\varepsilon^{-1}$ with exponent $p<2$.

Lemma 2. Let $F$ be a class of functions into $[0,1]$ and set $\tau^2=\sup_{f\in F}E_\mu f^2$. Assume that there are $\gamma\ge1$ and $d\ge1$ such that for every empirical measure $\mu_n$, $N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\le(\gamma/\varepsilon)^d$. Then, there is a constant $C_\gamma$, which depends only on $\gamma$, such that
$$\bar R_{n,\mu}\le C_\gamma\max\Bigl\{\frac d{\sqrt n}\log\frac2\tau,\ \sqrt d\,\tau\log^{\frac12}\frac2\tau\Bigr\}.$$

An important part of the proof is the following result, again due to Talagrand [9], on the expectation of the diameter of $F$ when considered as a subset of $L_2(\mu_n)$.

Lemma 3. Let $F\subset B\bigl(L_\infty(\Omega)\bigr)$ and set $\tau^2=\sup_{f\in F}E_\mu f^2$. Then, for every probability measure $\mu$, every integer $n$, and every $(X_i)_{i=1}^n$ which are independent and distributed according to $\mu$,
$$E_\mu\sup_{f\in F}\sum_{i=1}^nf^2(X_i)\le n\tau^2+8\sqrt n\,\bar R_{n,\mu}.$$

Proof (of Lemma 2). Set $Y=n^{-1}\sup_{f\in F}\sum_{i=1}^nf^2(X_i)$. By Theorem 1 there is an absolute constant $C$ such that for every sample $(X_1,\ldots,X_n)$,
$$E_\varepsilon\sup_{f\in F}\frac1{\sqrt n}\Bigl|\sum_{i=1}^n\varepsilon_if(X_i)\Bigr|\le C\int_0^{\sqrt Y}\log^{\frac12}N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\,d\varepsilon=C\sqrt d\int_0^{\sqrt Y}\log^{\frac12}\frac\gamma\varepsilon\,d\varepsilon.$$
It is easy to see that there is some constant $C_\gamma$ such that for every $0<x\le1$,
$$\int_0^x\log^{\frac12}\frac\gamma\varepsilon\,d\varepsilon\le2x\log^{\frac12}\frac{C_\gamma}x,$$
and that $v(x)=x^{\frac12}\log^{\frac12}(C_\gamma/x)$ is increasing and concave on $(0,10)$. Since $Y\le1$,
$$E_\varepsilon\sup_{f\in F}\frac1{\sqrt n}\Bigl|\sum_{i=1}^n\varepsilon_if(X_i)\Bigr|\le C\sqrt d\,Y^{\frac12}\log^{\frac12}\frac{C_\gamma}{Y^{\frac12}}.$$
Also, since $v$ is increasing and concave on $(0,10)$ and $E_\mu Y\le9$, then by Jensen's inequality, Lemma 3 and the fact that $v$ is increasing, there is a constant $C_\gamma$ which depends only on $\gamma$ such that
$$E_\mu\Bigl[Y^{\frac12}\log^{\frac12}\frac{C_\gamma}{Y^{\frac12}}\Bigr]\le C_\gamma(E_\mu Y)^{\frac12}\log^{\frac12}\frac2{(E_\mu Y)^{\frac12}}\le C_\gamma\Bigl(\tau^2+\frac{8\bar R_{n,\mu}}{\sqrt n}\Bigr)^{\frac12}\log^{\frac12}\frac2\tau.$$
Therefore,
$$\bar R_{n,\mu}\le C_\gamma\sqrt d\Bigl(\tau^2+\frac{8\bar R_{n,\mu}}{\sqrt n}\Bigr)^{\frac12}\log^{\frac12}\frac2\tau,$$
and our claim follows from a straightforward computation: if $\tau^2\ge8\bar R_{n,\mu}/\sqrt n$, the right-hand side is at most $C_\gamma\sqrt{2d}\,\tau\log^{\frac12}\frac2\tau$; otherwise $\bar R_{n,\mu}\le C_\gamma\sqrt d\,(16\bar R_{n,\mu}/\sqrt n)^{\frac12}\log^{\frac12}\frac2\tau$, which yields $\bar R_{n,\mu}\le16C_\gamma^2\,\frac d{\sqrt n}\log\frac2\tau$.
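The "straightforward computation" closing the proof of Lemma 2 amounts to solving a self-bounding inequality for $\bar R_{n,\mu}$, and the same pattern reappears in Lemma 4 below. The following sketch (with arbitrary, purely illustrative values of $c$, $d$, $\tau$ and $n$; none of the constants come from the paper) iterates such an inequality numerically and compares its fixed point with the two-regime closed form.

```python
import numpy as np

# Solve R <= c * sqrt(d) * sqrt(tau^2 + 8 R / sqrt(n)) * sqrt(log(2 / tau))
# by fixed-point iteration and compare with the closed-form two-regime bound
# appearing in Lemma 2 (all constants here are arbitrary illustrations).
def fixed_point(c, d, tau, n, iters=200):
    R = 1.0
    for _ in range(iters):
        R = c * np.sqrt(d) * np.sqrt(tau ** 2 + 8 * R / np.sqrt(n)) \
              * np.sqrt(np.log(2 / tau))
    return R

def two_regime_bound(c, d, tau, n):
    return max(16 * c ** 2 * d * np.log(2 / tau) / np.sqrt(n),
               np.sqrt(2 * d) * c * tau * np.sqrt(np.log(2 / tau)))

c, d = 1.0, 5.0
for tau, n in [(0.5, 10 ** 3), (0.05, 10 ** 3), (0.05, 10 ** 6)]:
    print(f"tau={tau:5.2f} n={n:7d}  fixed point={fixed_point(c, d, tau, n):.4f}"
          f"  two-regime bound={two_regime_bound(c, d, tau, n):.4f}")
```

In both regimes the fixed point should stay below the closed-form bound, matching the computation above.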
Next, we explore the case in which the log-covering numbers are polynomial with exponent $p<2$.

Lemma 4. Let $F$ be a class of functions into $[0,1]$ and set $\tau^2=\sup_{f\in F}E_\mu f^2$. Assume that there are $\gamma\ge2$ and $p<2$ such that for every empirical measure $\mu_n$, $\log N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\le\gamma\varepsilon^{-p}$. Then, there is a constant $C_{p,\gamma}$, which depends only on $p$ and $\gamma$, such that
$$\bar R_{n,\mu}\le C_{p,\gamma}\max\Bigl\{n^{-\frac{2-p}{2(2+p)}},\ \tau^{1-\frac p2}\Bigr\}.$$

Proof. Again, let $Y=n^{-1}\sup_{f\in F}\sum_{i=1}^nf^2(X_i)$. For every realization of $(X_i)_{i=1}^n$, set $\mu_n$ to be the empirical measure supported on $X_1,\ldots,X_n$. By Theorem 1, for every fixed sample there is an absolute constant $C$ such that
$$E_\varepsilon\sup_{f\in F}\frac1{\sqrt n}\Bigl|\sum_{i=1}^n\varepsilon_if(X_i)\Bigr|\le C\int_0^{\sqrt Y}\log^{\frac12}N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\,d\varepsilon\le\frac{C\gamma^{\frac12}}{1-p/2}\Bigl(\frac1n\sup_{f\in F}\sum_{i=1}^nf^2(X_i)\Bigr)^{\frac12(1-\frac p2)}.$$
Taking the expectation with respect to $\mu$, by Jensen's inequality and Lemma 3,
$$\bar R_{n,\mu}\le C_{p,\gamma}\Bigl(E_\mu\sup_{f\in F}\frac1n\sum_{i=1}^nf^2(X_i)\Bigr)^{\frac12(1-\frac p2)}\le C_{p,\gamma}\Bigl(\tau^2+\frac{8\bar R_{n,\mu}}{\sqrt n}\Bigr)^{\frac12(1-\frac p2)},$$
and the claim follows: if $\tau^2$ dominates $8\bar R_{n,\mu}/\sqrt n$ the right-hand side is at most $C_{p,\gamma}\tau^{1-\frac p2}$, and otherwise solving for $\bar R_{n,\mu}$ gives $\bar R_{n,\mu}\le C_{p,\gamma}n^{-\frac{2-p}{2(2+p)}}$.

Combining Theorem 3, Lemma 2 and Lemma 4, we obtain the following deviation estimate.

Theorem 4. Let $F$ be a class of functions whose range is contained in $[0,1]$ and set $\tau^2=\sup_{f\in F}E_\mu f^2$. If there are $\gamma>0$ and $d\ge1$ such that $\log N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\le d\log(\gamma/\varepsilon)$ for every empirical measure $\mu_n$ and every $\varepsilon>0$, then there is a constant $C_\gamma$, which depends only on $\gamma$, such that for every $k>0$ and every $0<\delta<1$,
$$S_F(k\tau^2,\delta)\le C_\gamma\,d\,\max\{k^{-1},k^{-2}\}\,\frac1{\tau^2}\Bigl(\log\frac2\tau+\log\frac2\delta\Bigr).$$
If there are $\gamma\ge1$ and $p<2$ such that $\log N\bigl(\varepsilon,F,L_2(\mu_n)\bigr)\le\gamma\varepsilon^{-p}$ for every empirical measure $\mu_n$ and every $\varepsilon>0$, then there is a constant $C_{p,\gamma}$, which depends only on $p$ and $\gamma$, such that for every $k>0$ and every $0<\delta<1$,
$$S_F(k\tau^2,\delta)\le C_{p,\gamma}\max\{k^{-1},k^{-2}\}\Bigl(\frac1{\tau^2}\Bigr)^{1+\frac p2}\Bigl(1+\log\frac2\delta\Bigr).$$
Since the proof follows from a straightforward (yet tedious) calculation, we omit its details.

3 Dominating the Variance

The key assumption used in the proof of the learning sample complexity in [6] was that there is some $B>0$ such that for every loss function $f$, $E_\mu f^2\le BE_\mu f$. Though this is easily satisfied in proper learning (because each $f$ is nonnegative), it is far from obvious whether the same holds in the improper setting. In [6] it was observed that if $G$ is convex and $F$ is the squared-loss class then $E_\mu f^2\le BE_\mu f$, where $B$ depends on the $L_\infty$ bound on the members of $G$ and on the target. The question we study is whether the same kind of bound can be established in other $L_q$ norms. We will show that if $q\ge2$ and if $F$ is the $q$-loss class associated with $G$, there is some $B$ such that for every $f\in F$, $E_\mu f^2\le B(E_\mu f)^{\frac2q}$.

Our proof is based on a geometric characterization of the nearest point map onto a convex subset of $L_q$ spaces. This fact was used in [6] for $q=2$, but no emphasis was put on the geometric idea behind it. Our methods enable us to obtain the bound in $L_q$ for $q\ge2$.

Formally, let $1\le q<\infty$ and set $G$ to be a compact convex subset of $L_q(\mu)$ such that each $g\in G$ maps $\Omega$ into $[0,1]$. Let $F$ be the $q$-loss class associated with $G$ and $T$, which is also assumed to have a range contained in $[0,1]$. Hence, each $f\in F$ is given by $f=|T-g|^q-|P_GT-T|^q$, where $P_GT$ is the nearest point to $T$ in $G$ with respect to the $L_q(\mu)$ norm. Since we are interested in GC sets with "small" covering numbers in $L_q$ for every $1\le q<\infty$ (e.g., $\mathrm{fat}_\varepsilon(G)=O(\varepsilon^{-p})$ for $p<2$), the assumption that $G$ is compact in $L_q$ follows if $G$ is closed in $L_q$ [7]. It is possible to show (see Appendix A) that if $1<q<\infty$ and if $G\subset L_q$ is convex and compact, the nearest point map onto $G$ is well defined, i.e., each $T\in L_q$ has a unique best approximation in $G$.

We start our analysis by proving an upper bound on $E_\mu f^2$.

Lemma 5. Let $g\in G$, assume that $1<q<\infty$ and set $f=\ell_q(g)$. Then,
$$E_\mu f^2\le q^2E_\mu|g-P_GT|^2.$$

Proof. Given any $\omega\in\Omega$, apply Lagrange's mean value theorem to the function $y=|x|^q$ at the points $x_1=g(\omega)-T(\omega)$ and $x_2=P_GT(\omega)-T(\omega)$; since $|x_1|,|x_2|\le1$, this yields $|f(\omega)|\le q\,|g(\omega)-P_GT(\omega)|$. The result follows by squaring and taking the expectation.
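As a quick numerical sanity check of the pointwise estimate behind Lemma 5, namely $\bigl||x_1|^q-|x_2|^q\bigr|\le q|x_1-x_2|$ for $|x_1|,|x_2|\le1$, the following sketch (hypothetical data on a finite $\Omega$ with the uniform measure; an arbitrary reference function $h$ stands in for $P_GT$, whose special role is not used in this particular inequality) verifies $E_\mu f^2\le q^2E_\mu|g-h|^2$.

```python
import numpy as np

# Check E f^2 <= q^2 E |g - h|^2 with f = |g - T|^q - |h - T|^q, for functions
# into [0, 1] on a finite sample space with the uniform measure.  Here h is an
# arbitrary reference function; in Lemma 5 it is the metric projection P_G T.
rng = np.random.default_rng(4)
m = 10_000                                   # |Omega|
worst_ratio = 0.0
for q in (1.5, 2.0, 3.0, 5.0):
    for _ in range(50):
        g, h, T = rng.uniform(size=(3, m))   # random functions Omega -> [0, 1]
        f = np.abs(g - T) ** q - np.abs(h - T) ** q
        lhs = np.mean(f ** 2)
        rhs = q ** 2 * np.mean((g - h) ** 2)
        worst_ratio = max(worst_ratio, lhs / rhs)
print("largest observed E f^2 / (q^2 E|g-h|^2):", round(worst_ratio, 3), "<= 1")
```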
The next step, which is to bound $E_\mu f$ from below, is considerably more difficult. We shall require the following definitions, which are standard in Banach space theory [2,4].

Definition 2. A Banach space $X$ is called strictly convex if for every $x,y\in X$ with $\|x\|=\|y\|=1$ and $x\ne y$, $\|x+y\|<2$. $X$ is called uniformly convex if there is a positive function $\delta(\varepsilon)$ such that for every $0<\varepsilon<2$ and every $x,y\in X$ for which $\|x\|,\|y\|\le1$ and $\|x-y\|\ge\varepsilon$, $\|x+y\|\le2-2\delta(\varepsilon)$. Thus,
$$\delta(\varepsilon)=\inf\Bigl\{1-\frac12\|x+y\|\ \Big|\ \|x\|,\|y\|\le1,\ \|x-y\|\ge\varepsilon\Bigr\}.$$
The function $\delta(\varepsilon)$ is called the modulus of convexity of $X$.

It is easy to see that $X$ is strictly convex if and only if its unit sphere does not contain intervals. Also, if $X$ is uniformly convex then it is strictly convex. Using the modulus of convexity one can provide a lower bound on the distance between an average of elements of the unit sphere of $X$ and the sphere. It was shown in [3] that for every $1<q\le2$ the modulus of convexity of $L_q$ is given by $\delta_q(\varepsilon)=(q-1)\varepsilon^2/8+o(\varepsilon^2)$, and if $q\ge2$ it is given by
$$\delta_q(\varepsilon)=1-\bigl(1-(\varepsilon/2)^q\bigr)^{1/q}.$$

The next lemma enables one to prove the lower estimate on $E_\mu f$. The proof of the lemma is based on several ideas commonly used in convex geometry, and is presented in Appendix A.

Lemma 6. Let $X$ be a uniformly convex, smooth Banach space with modulus of convexity $\delta_X$. Let $G\subset X$ be compact and convex, let $T\notin G$ and set $d=\|T-P_GT\|$. Then, for every $g\in G$,
$$\delta_X\bigl(d_g^{-1}\|g-P_GT\|\bigr)\le1-\frac d{d_g},$$
where $d_g=\|T-g\|$.

Corollary 1. Let $q\ge2$ and assume that $G$ is a compact convex subset of the unit ball of $L_q(\mu)$ which consists of functions whose range is contained in $[0,1]$. Let $T$ be the target function, which also maps $\Omega$ into $[0,1]$. If $F$ is the $q$-loss class associated with $G$ and $T$, there is a constant $C_q$, which depends only on $q$, such that for every $g\in G$,
$$E_\mu f^2\le C_q(E_\mu f)^{2/q},$$
where $f$ is the $q$-loss function associated with $g$.

Proof. Recall that the modulus of uniform convexity of $L_q$ for $q\ge2$ is $\delta_q(\varepsilon)=1-\bigl(1-(\varepsilon/2)^q\bigr)^{1/q}$. By Lemma 6 it follows that
$$1-\Bigl(\frac{\|g-P_GT\|}{2d_g}\Bigr)^q\ge\Bigl(\frac d{d_g}\Bigr)^q.$$
Note that $E_\mu\ell_q(g)=d_g^q-d^q$, hence
$$E_\mu f=E_\mu\ell_q(g)=d_g^q-d^q\ge2^{-q}E_\mu|g-P_GT|^q.$$
By Lemma 5 and since $\|\cdot\|_2\le\|\cdot\|_q$,
$$E_\mu f^2\le q^2E_\mu|g-P_GT|^2\le q^2\bigl(E_\mu|g-P_GT|^q\bigr)^{2/q}\le4q^2(E_\mu f)^{2/q}.$$

4 Learning Sample Complexity

Unlike the GC sample complexity, the behaviour of the learning sample complexity is not monotone, in the sense that even if $H\subset G$, the learning sample complexity associated with $G$ may be smaller than that associated with $H$. This is due to the fact that a well-behaved geometric structure of the class (e.g. convexity) enables one to derive additional data regarding the loss functions associated with the class. We will show that the learning sample complexity is determined by the GC sample complexity of a class of functions $H$ such that $\sup_{h\in H}E_\mu h^2$ is roughly the same as the desired accuracy.

We shall formulate results in two cases. The first theorem deals with proper learning (i.e. $T\in G$). In the second, we discuss improper learning. We present a proof only for the second theorem.

Let us introduce the following notation: for a fixed $\varepsilon>0$ and given any empirical measure $\mu_n$, let $f^*_{\mu_n}$ be some $f\in F$ such that $E_{\mu_n}f^*_{\mu_n}\le\varepsilon/2$. Thus, if $g\in G$ is such that $\ell_q(g)=f^*_{\mu_n}$, then $g$ is an "almost minimizer" of the empirical loss.
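Before stating the theorems, here is a small numerical check of Corollary 1 for $q=2$ (a hypothetical setup, not from the paper): $G$ is the convex hull of two fixed functions on a finite $\Omega$ with the uniform measure, $P_GT$ is approximated by a grid search over the segment, and loss functions whose excess risk falls below the grid resolution are skipped so that artifacts of the approximate projection do not distort the ratio.

```python
import numpy as np

# Numerical check of Corollary 1 for q = 2 (hypothetical setup): Omega is a
# finite set with uniform mu, G = conv{g0, g1} is a segment (convex, compact),
# P_G T is approximated by a grid search over the segment, and we examine
# E f^2 <= 4 q^2 (E f)^{2/q} = 16 E f for the induced loss functions.
rng = np.random.default_rng(2)
q = 2.0
m = 1000                                   # |Omega|
g0, g1, T = rng.uniform(size=(3, m))       # functions Omega -> [0, 1]

def point(t):                              # a point of G = conv{g0, g1}
    return t * g1 + (1 - t) * g0

ts = np.linspace(0, 1, 4001)
risks = np.array([np.mean(np.abs(point(t) - T) ** q) for t in ts])
proj = point(ts[np.argmin(risks)])         # approximate nearest point P_G T

worst = 0.0
for t in rng.uniform(size=200):            # random elements g of G
    f = np.abs(point(t) - T) ** q - np.abs(proj - T) ** q
    Ef, Ef2 = np.mean(f), np.mean(f ** 2)
    if Ef > 1e-3:                          # skip g's below the grid resolution
        worst = max(worst, Ef2 / Ef ** (2 / q))
print("max observed E f^2 / (E f)^{2/q}:", round(worst, 3),
      " versus the Corollary 1 bound 4 q^2 =", 4 * q * q)
```

The observed ratios should stay well below $4q^2$, in line with Corollary 1.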
Theorem 5. Assume that $G$ is a class whose elements map $\Omega$ into $[0,1]$ and fix $T\in G$. Assume that $1\le q<\infty$, $\gamma>1$, and let $F$ be the $q$-loss class associated with $G$ and $T$. Assume further that $p<2$ and that for every integer $n$ and any empirical measure $\mu_n$, $\log N\bigl(\varepsilon,G,L_2(\mu_n)\bigr)\le\gamma\varepsilon^{-p}$ for every $\varepsilon>0$. Then, there is a constant $C_{p,q,\gamma}$, which depends only on $p$, $q$ and $\gamma$, such that if
$$n\ge C_{p,q,\gamma}\Bigl(\frac1\varepsilon\Bigr)^{1+\frac p2}\Bigl(1+\log\frac1\delta\Bigr),$$
then $Pr\bigl\{E_\mu f^*_{\mu_n}\ge\varepsilon\bigr\}\le\delta$.

Now, we turn to the improper case.

Theorem 6. Let $G$, $F$ and $q$ be as in Theorem 5 and assume that $T\notin G$. Assume further that $G$ is convex and that there is some $B$ such that for every $f\in F$, $E_\mu f^2\le B(E_\mu f)^{2/q}$. Then, the assertion of Theorem 5 remains true.

Remark 1. The key assumption here is that $E_\mu f^2\le B(E_\mu f)^{2/q}$ for every loss function $f$; recall that in Section 3 we explored this issue. Combining Theorem 5 and Theorem 6 with the results of Section 3 gives us the promised estimate on the learning sample complexity.

Before proving the theorem we will need the following observations. Assume that $q\ge2$ and let $F$ be the $q$-loss class. First, note that
$$Pr\bigl\{E_\mu f^*_{\mu_n}\ge\varepsilon\bigr\}\le Pr\bigl\{\exists f\in F,\ E_\mu f\ge\varepsilon,\ E_\mu f^2<\varepsilon,\ E_{\mu_n}f\le\varepsilon/2\bigr\}+Pr\bigl\{\exists f\in F,\ E_\mu f\ge\varepsilon,\ E_\mu f^2\ge\varepsilon,\ E_{\mu_n}f\le\varepsilon/2\bigr\}=(i)+(ii).$$
If $E_\mu f\ge\varepsilon$ and $E_{\mu_n}f\le\varepsilon/2$, then
$$E_\mu f\ge\frac12(E_\mu f+\varepsilon)\ge\frac12E_\mu f+E_{\mu_n}f.$$
Therefore, $|E_\mu f-E_{\mu_n}f|\ge\frac12E_\mu f\ge\varepsilon/2$, and thus
$$(i)+(ii)\le Pr\Bigl\{\exists f\in F,\ E_\mu f^2<\varepsilon,\ |E_\mu f-E_{\mu_n}f|\ge\frac\varepsilon2\Bigr\}+Pr\Bigl\{\exists f\in F,\ E_\mu f\ge\varepsilon,\ E_\mu f^2\ge\varepsilon,\ |E_\mu f-E_{\mu_n}f|\ge\frac12E_\mu f\Bigr\}.$$
Let $\alpha=2-2/q$ and set
$$H=\Bigl\{\frac{\varepsilon^\alpha f}{E_\mu f}\ \Big|\ f\in F,\ E_\mu f\ge\varepsilon,\ E_\mu f^2\ge\varepsilon\Bigr\}.\qquad(2)$$
Since $q\ge2$, then $\alpha\ge1$, and since $\varepsilon<1$, each $h\in H$ maps $\Omega$ into $[0,1]$. Also, if $E_\mu f^2\le B(E_\mu f)^{2/q}$ then
$$E_\mu h^2\le B\frac{\varepsilon^{2\alpha}}{(E_\mu f)^{2-2/q}}\le B\varepsilon^\alpha.$$
Therefore,
$$Pr\bigl\{E_\mu f^*_{\mu_n}\ge\varepsilon\bigr\}\le Pr\Bigl\{\exists f\in F,\ E_\mu f^2<\varepsilon,\ |E_\mu f-E_{\mu_n}f|\ge\frac\varepsilon2\Bigr\}+Pr\Bigl\{\exists h\in H,\ E_\mu h^2\le B\varepsilon^\alpha,\ |E_\mu h-E_{\mu_n}h|\ge\frac{\varepsilon^\alpha}2\Bigr\}.\qquad(3)$$

In both cases we are interested in a GC condition at a scale which is roughly the "largest" variance of a member of a class. This is exactly the question which was investigated in Section 2. The only problem in applying Theorem 4 directly to the class $H$ is the fact that we do not have an a priori bound on the covering numbers of that class. The question we need to tackle before proceeding is how to estimate the covering numbers of $H$, given that the covering numbers of $F$ are well behaved. To that end we need to use the specific structure of $F$, namely, that it is a $q$-loss class associated with the class $G$.

We divide our discussion into two parts. First we deal with proper learning, in which each loss function is given by $f=|g-T|^q$ and no specific assumptions are needed on the structure of $G$. Then we turn to the improper case, and assume that $G$ is convex and $F$ is the $q$-loss class for $1\le q<\infty$. To handle both cases, we require the following simple definition.

Definition 3. Let $X$ be a normed space and let $A\subset X$. We say that $A$ is star-shaped with center $x$ if for every $a\in A$ the interval $[a,x]\subset A$. Given $A$ and $x$, denote by $\mathrm{star}(A,x)$ the union of all the intervals $[a,x]$, where $a\in A$.

The next lemma shows that the covering numbers of $\mathrm{star}(A,x)$ are almost the same as those of $A$.

Lemma 7. Let $X$ be a normed space and let $A\subset B(X)$. Then, for any $\|x\|\le1$ and every $\varepsilon>0$,
$$N\bigl(2\varepsilon,\mathrm{star}(A,x)\bigr)\le\frac{2N(\varepsilon,A)}\varepsilon.$$

Proof. Fix $\varepsilon>0$ and let $y_1,\ldots,y_k$ be an $\varepsilon$-cover of $A$. Note that for any $a\in A$ and any $z\in[a,x]$ there is some $z'\in[y_i,x]$ such that $\|z-z'\|<\varepsilon$. Hence, an $\varepsilon$-cover of the union $\bigcup_{i=1}^k[y_i,x]$ is a $2\varepsilon$-cover for $\mathrm{star}(A,x)$. Since for every $i$, $\|x-y_i\|\le2$, it follows that each interval may be covered by $2\varepsilon^{-1}$ balls with radius $\varepsilon$, and our claim follows.
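The following rough numerical illustration of Lemma 7 uses greedy covers in place of the minimal covering numbers on both sides (so it over-counts both $N(2\varepsilon,\mathrm{star}(A,0))$ and $N(\varepsilon,A)$ and is only indicative of the scaling); the finite set $A$ and the discretisation of the star are hypothetical choices, not part of the paper.

```python
import numpy as np

# Rough illustration of Lemma 7: A is a finite subset of the unit ball of
# L2(mu_n) (mu_n uniform on n points) and star(A, 0) is discretised along its
# rays.  Greedy covers stand in for the minimal covering numbers.
rng = np.random.default_rng(3)
n, k = 50, 30
A = rng.uniform(-1, 1, size=(k, n))
A /= np.maximum(1.0, np.sqrt((A ** 2).mean(axis=1)))[:, None]   # force ||a|| <= 1

def greedy_cover(points, radius):
    remaining = list(range(points.shape[0]))
    count = 0
    while remaining:
        c = remaining[0]
        d = np.sqrt(((points[remaining] - points[c]) ** 2).mean(axis=1))
        remaining = [remaining[i] for i in np.nonzero(d > radius)[0]]
        count += 1
    return count

eps = 0.2
star = np.concatenate([kappa * A for kappa in np.linspace(0, 1, 101)])
lhs = greedy_cover(star, 2 * eps)
rhs = 2 * greedy_cover(A, eps) / eps
print(f"greedy N(2*eps, star(A, 0)) = {lhs},  Lemma 7 bound 2 N(eps, A)/eps ~ {rhs:.0f}")
```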
Lemma 8. Let $G$ be a class of functions into $[0,1]$, set $1\le q<\infty$ and denote $F=\{\ell_q(g)\mid g\in G\}$, where $T\in G$. Let $\alpha=2-2/q$ and put $H=\{\frac{\varepsilon^\alpha f}{E_\mu f}\mid f\in F\}$. Then, for every $\varepsilon>0$ and every empirical measure $\mu_n$,
$$\log N\bigl(2\varepsilon,H,L_2(\mu_n)\bigr)\le\log\frac2\varepsilon+\log N\Bigl(\frac\varepsilon q,G,L_2(\mu_n)\Bigr).$$

Proof. Since every $h\in H$ is of the form $h=\kappa_ff$ where $0<\kappa_f\le1$, we have $H\subset\mathrm{star}(F,0)$. Thus, by Lemma 7,
$$\log N\bigl(2\varepsilon,H,L_2(\mu_n)\bigr)\le\log\frac2\varepsilon+\log N\bigl(\varepsilon,F,L_2(\mu_n)\bigr).$$
Our assertion follows from Lemma 1.

Now, we turn to the improper case.

Lemma 9. Let $G$ be a convex class of functions and let $F$ be its $q$-loss class for some $1\le q<\infty$. Assume that $\alpha$ and $H$ are as in Lemma 8. If there are $\gamma\ge1$ and $p<2$ such that for any empirical measure $\mu_n$ and every $\varepsilon>0$, $\log N\bigl(\varepsilon,G,L_2(\mu_n)\bigr)\le\gamma\varepsilon^{-p}$, then there are constants $c_{p,q}$, which depend only on $p$ and $q$, such that for every $\varepsilon>0$,
$$\log N\bigl(\varepsilon,H,L_2(\mu_n)\bigr)\le\log N\Bigl(\frac\varepsilon{4q},G,L_2(\mu_n)\Bigr)+2\log\frac4\varepsilon.$$

Proof. Again, every member of $H$ is given by $\kappa_ff$, where $0<\kappa_f<1$. Hence,
$$H\subset\{\kappa\,\ell_q(g)\mid g\in G,\ \kappa\in[0,1]\}\equiv Q.$$
By the definition of the $q$-loss function, each element of $Q$ decomposes as a sum of an element of $Q_1$ and an element of $Q_2$, i.e. $Q\subset Q_1+Q_2$, where
$$Q_1=\bigl\{\kappa|g-T|^q\ \big|\ \kappa\in[0,1],\ g\in G\bigr\}\quad\text{and}\quad Q_2=\bigl\{-\kappa|T-P_GT|^q\ \big|\ \kappa\in[0,1]\bigr\}.$$
Note that $T$ and $P_GT$ map $\Omega$ into $[0,1]$, hence $|T-P_GT|^q$ is bounded by 1 pointwise. Therefore, $Q_2$ is contained in an interval whose length is at most 1, implying that for any probability measure $\mu$,
$$N\bigl(\varepsilon,Q_2,L_2(\mu)\bigr)\le\frac2\varepsilon.$$
Let $V=\{|g-T|^q\mid g\in G\}$. Since every $g\in G$ and $T$ map $\Omega$ into $[0,1]$, then $V\subset B\bigl(L_\infty(\Omega)\bigr)$. Therefore, by Lemma 1, $N\bigl(\varepsilon,V,L_2(\mu)\bigr)\le N\bigl(\varepsilon/q,G,L_2(\mu)\bigr)$ for every probability measure $\mu$ and every $\varepsilon>0$. Also, $Q_1\subset\mathrm{star}(V,0)$, thus for any $\varepsilon>0$,
$$N\bigl(\varepsilon,Q_1,L_2(\mu)\bigr)\le\frac{2N\bigl(\frac\varepsilon2,V,L_2(\mu)\bigr)}\varepsilon\le\frac{2N\bigl(\frac\varepsilon{2q},G,L_2(\mu)\bigr)}\varepsilon,$$
which suffices, since one can combine the separate covers for $Q_1$ and $Q_2$ to form a cover for $H$.

Proof (of Theorem 6). The proof follows immediately from (3), Theorem 4 and Lemma 8 for the proper case, and Lemma 9 in the improper case.

Remark 2. It is possible to prove an analogous result to Theorem 6 when the covering numbers of $G$ are polynomial; indeed, if
$$\sup_n\sup_{\mu_n}\log N\bigl(\varepsilon,G,L_2(\mu_n)\bigr)\le Cd\log\frac\gamma\varepsilon,$$
then the sample complexity required is
$$C^q_{G,T}(\varepsilon,\delta)\le C_{q,\gamma}\,\frac d\varepsilon\Bigl(\log^2\frac2\varepsilon+\log\frac2\delta\Bigr),$$
where $C_{q,\gamma}$ depends only on $\gamma$ and $q$.
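To put the rates of this section side by side, the following trivial sketch tabulates, for a class with log-covering exponent $p<2$, the sample-complexity scalings $\varepsilon^{-(1+p)}$ of [6] and $\varepsilon^{-(1+p/2)}$ of Theorem 2 (constants and logarithmic factors are ignored; the values of $p$ and $\varepsilon$ are arbitrary).

```python
# Order-of-magnitude comparison (constants and log factors ignored) between
# the eps^-(1+p) rate of [6] and the eps^-(1+p/2) rate of Theorem 2.
for p in (0.5, 1.0, 1.5):
    for eps in (0.1, 0.01):
        old, new = eps ** -(1 + p), eps ** -(1 + p / 2)
        print(f"p={p:3.1f} eps={eps:5.2f}  "
              f"eps^-(1+p)={old:12.0f}  eps^-(1+p/2)={new:12.0f}  ratio={old / new:8.1f}")
```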
A Convexity

In this appendix we present the definitions and preliminary results needed for the proof of Lemma 6. All the definitions are standard and may be found in any basic textbook on functional analysis, e.g., [4].

Definition 4. Given $A,B\subset X$, we say that a nonzero functional $x^*\in X^*$ separates $A$ and $B$ if $\inf_{a\in A}x^*(a)\ge\sup_{b\in B}x^*(b)$. It is easy to see that $x^*$ separates $A$ and $B$ if and only if there is some $\alpha\in\mathbb R$ such that for every $a\in A$ and $b\in B$, $x^*(b)\le\alpha\le x^*(a)$. In that case, the hyperplane $H=\{x\mid x^*(x)=\alpha\}$ separates $A$ and $B$. We denote the closed "positive" halfspace $\{x\mid x^*(x)\ge\alpha\}$ by $H^+$ and the "negative" one by $H^-$. By the Hahn-Banach Theorem, if $A$ and $B$ are closed, convex and disjoint (and at least one of them is compact), there is a hyperplane (equivalently, a functional) which separates $A$ and $B$.

Definition 5. Let $A\subset X$. We say that the hyperplane $H$ supports $A$ at $a\in A$ if $a\in H$ and either $A\subset H^+$ or $A\subset H^-$. By the Hahn-Banach Theorem, if $B\subset X$ is a ball then for every $x\in\partial B$ there is a hyperplane which supports $B$ at $x$; equivalently, there is some $x^*$ with $\|x^*\|=1$ and $\alpha\in\mathbb R$ such that $x^*(x)=\alpha$ and for every $y\in B$, $x^*(y)\ge\alpha$. Given a line $V=\{tx+(1-t)y\mid t\in\mathbb R\}$, we say that it supports a ball $B\subset X$ at $z$ if $z\in V\cap B$ and $V\cap\mathrm{int}(B)=\emptyset$. By the Hahn-Banach Theorem, if $V$ supports $B$ at $z$, there is a hyperplane which contains $V$ and supports $B$ at $z$.

Definition 6. A Banach space $X$ is called smooth if for any $x\in X$, $x\ne0$, there is a unique functional $x^*\in X^*$ such that $\|x^*\|=1$ and $x^*(x)=\|x\|$.

Thus, a Banach space is smooth if and only if for every $x$ with $\|x\|=1$ there is a unique hyperplane which supports the unit ball at $x$. It is possible to show [4] that for every $1<q<\infty$, $L_q$ is smooth.

We shall be interested in the properties of the nearest point map onto a compact convex set in "nice" Banach spaces, which is the subject of the following lemma.

Lemma 10. Let $X$ be a strictly convex space and let $G\subset X$ be convex and compact. Then every $x\in X$ has a unique nearest point in $G$.

Proof. Fix some $x\in X$ and set $R=\inf_{g\in G}\|g-x\|$. By the compactness of $G$ and the continuity of the norm, there is some $g_0\in G$ for which the infimum is attained, i.e., $R=\|g_0-x\|$. To show uniqueness, assume that there is some other $g\in G$ for which $\|g-x\|=R$. Since $G$ is convex, $g_1=(g+g_0)/2\in G$, and by the strict convexity of the norm, $\|g_1-x\|<R$, which is impossible.

Lemma 11. Let $X$ be a strictly convex, smooth Banach space and let $G\subset X$ be compact and convex. Assume that $x\notin G$ and set $y=P_Gx$ to be the nearest point to $x$ in $G$. If $R=\|x-y\|$, then the hyperplane supporting the ball $B=B(x,R)$ at $y$ separates $B$ and $G$.

Proof. Clearly, we may assume that $x=0$ and that $R=1$. Therefore, if $x^*$ is the normalized functional which supports $B$ at $y$, then for every $z\in B$, $x^*(z)\le1$. Let $H=\{z\mid x^*(z)=1\}$, set $H^-$ to be the open halfspace $\{z\mid x^*(z)<1\}$, and assume that there is some $g\in G$ such that $x^*(g)<1$. Since $G$ is convex, for every $0\le t<1$, $ty+(1-t)g\in G\cap H^-$. Moreover, since $y$ is the unique nearest point to $0$ in $G$ and since $X$ is strictly convex, $[g,y]\cap B=\{y\}$; otherwise there would be some $g_1\in G$ such that $\|g_1-x\|<1$. Hence, the line $V=\{ty+(1-t)g\mid t\in\mathbb R\}$ supports $B$ at $y$. By the Hahn-Banach Theorem there is a hyperplane which contains $V$ and supports $B$ at $y$. However, this hyperplane cannot be $H$, because it contains $g$ and $x^*(g)<1$. Thus, $B$ has two different supporting hyperplanes at $y$, contrary to the assumption that $X$ is smooth.

In the following lemma, our goal is to be able to "guess" the location of some $g\in G$ based on its distance from the target $T$. The idea is that since $G$ is convex, and since the norm of $X$ is both strictly convex and smooth, the intersection of a ball centered at the target with $G$ is contained within a "slice" of a ball, i.e., the intersection of a ball and a certain halfspace. Formally, we claim the following:

Lemma 12. Let $X$ be a strictly convex, smooth Banach space and let $G\subset X$ be compact and convex. For any $T\notin G$, let $P_GT$ be the nearest point to $T$ in $G$ and set $d=\|T-P_GT\|$. Let $x^*$ be the functional supporting $B(T,d)$ at $P_GT$ and put $H^+=\{x\mid x^*(x)\ge d+x^*(T)\}$. Then every $g\in G$ satisfies $g\in B(T,d_g)\cap H^+$, where $d_g=\|g-T\|$.

The proof of this lemma is straightforward and is omitted. Finally, we arrive at the proof of the main claim. We estimate the diameter of the "slice" of $G$ using the modulus of uniform convexity of $X$. This was formulated as Lemma 6 in the main text.

Lemma 13. Let $X$ be a uniformly convex, smooth Banach space with modulus of convexity $\delta_X$, and let $G\subset X$ be compact and convex. If $T\notin G$ and $d=\|T-P_GT\|$, then for every $g\in G$,
$$\delta_X\Bigl(\frac{\|g-P_GT\|}{d_g}\Bigr)\le1-\frac d{d_g},$$
where $d_g=\|T-g\|$.

Proof. Clearly, we may assume that $T=0$.
Using the notation of Lemma 12, $\|g-P_GT\|\le\mathrm{diam}\bigl(B(T,d_g)\cap H^+\bigr)$. Let $\tilde z_1,\tilde z_2\in B(T,d_g)\cap H^+$, put $\varepsilon=\|\tilde z_1-\tilde z_2\|$ and set $z_i=\tilde z_i/d_g$. Hence, $\|z_i\|\le1$, $\|z_1-z_2\|=\varepsilon/d_g$ and $x^*(z_i)\ge d/d_g$. Thus,
$$\frac12\|z_1+z_2\|\ge\frac12x^*(z_1+z_2)\ge\frac d{d_g}.$$
Hence,
$$\delta_X\Bigl(\frac\varepsilon{d_g}\Bigr)\le1-\frac12\|z_1+z_2\|\le1-\frac d{d_g},$$
and our claim follows.

References

1. M. Anthony, P. L. Bartlett: Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999.
2. B. Beauzamy: Introduction to Banach Spaces and their Geometry, Math. Studies, vol. 86, North-Holland, 1982.
3. O. Hanner: On the uniform convexity of L^p and l^p, Ark. Mat. 3, 239-244, 1956.
4. P. Habala, P. Hájek, V. Zizler: Introduction to Banach Spaces, vols. I and II, Matfyzpress, Univ. Karlovy, Prague, 1996.
5. W. S. Lee: Agnostic learning and single hidden layer neural network, Ph.D. thesis, The Australian National University, 1996.
6. W. S. Lee, P. L. Bartlett, R. C. Williamson: The importance of convexity in learning with squared loss, IEEE Transactions on Information Theory 44(5), 1974-1980, 1998.
7. S. Mendelson: Rademacher averages and phase transitions in Glivenko-Cantelli classes, preprint.
8. S. Mendelson: Geometric methods in the analysis of Glivenko-Cantelli classes, this volume.
9. M. Talagrand: Sharper bounds for Gaussian and empirical processes, Annals of Probability 22(1), 28-76, 1994.
10. A. W. van der Vaart, J. A. Wellner: Weak Convergence and Empirical Processes, Springer-Verlag, 1996.
11. R. C. Williamson, A. J. Smola, B. Schölkopf: Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, to appear in IEEE Transactions on Information Theory.