Learning Relatively Small Classes

Shahar Mendelson
Computer Sciences Laboratory, RSISE, The Australian National University,
Canberra 0200, Australia
shahar@csl.anu.edu.au
Abstract. We study the sample complexity of proper and improper learning problems with respect to different $L_q$ loss functions. We improve the known estimates for classes which have relatively small covering numbers (log-covering numbers which are polynomial with exponent $p < 2$) with respect to the $L_q$ norm for $q \ge 2$.
1 Introduction
In this article we present sample complexity estimates for proper and improper
learning problems of classes with respect to different loss functions, under the
assumption that the classes are not “too large”. Namely, we assume that their
log-covering numbers are polynomial in ε−1 with exponent p < 2.
The question we explore is the following: let G be a class of functions defined
on a probability space (Ω, µ) such that each g ∈ G maps Ω into [0, 1] and let
T be an unknown function which is bounded by 1 (not necessarily a member
of G). Recall that a learning rule L is a map which assigns to each sample
Sn = (ω1 , ..., ωn ) a function LSn ∈ G. The learning sample complexity associated
with the q-loss, accuracy ε and confidence δ is the first integer n0 such that for
every n ≥ n0 , every probability measure µ on (Ω, Σ) and every measurable
function bounded by 1,
$$\mu^\infty\Big\{ E_\mu|L_{S_n} - T|^q \ \ge\ \inf_{g\in G} E_\mu|g - T|^q + \varepsilon \Big\} \le \delta,$$
where L is any learning rule and µ∞ is the infinite product measure.
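As a concrete illustration of such a learning rule, here is a minimal sketch in Python of empirical $q$-loss minimization over a finite class; the particular class, target and sample below are arbitrary toy choices and are not part of the paper.

    import numpy as np

    def erm_learning_rule(G, sample_x, sample_t, q=2):
        """Return a g in the finite class G minimizing the empirical q-loss.

        G        : list of callables mapping points of Omega into [0, 1]
        sample_x : array with the sample points (omega_1, ..., omega_n)
        sample_t : array with the target values T(omega_i), bounded by 1
        q        : loss exponent, 1 <= q < infinity
        """
        losses = [np.mean(np.abs(g(sample_x) - sample_t) ** q) for g in G]
        return G[int(np.argmin(losses))]

    # Toy usage: a small class of threshold functions on [0, 1].
    rng = np.random.default_rng(0)
    G = [lambda x, s=s: (x >= s).astype(float) for s in np.linspace(0, 1, 21)]
    T = lambda x: (x >= 0.37).astype(float)   # unknown target, bounded by 1
    x = rng.uniform(0, 1, size=200)           # sample drawn according to mu
    g_hat = erm_learning_rule(G, x, T(x), q=2)

The Glivenko-Cantelli approach described next shows why, for a large enough sample, such an empirical minimizer is also an almost-minimizer of the average loss.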
We were motivated by two methods previously used in the investigation of
sample complexity [1]. Firstly, there is the standard approach which uses the
Glivenko-Cantelli (GC) condition to estimate the sample complexity. By this we
mean the following: let G be a class of functions on Ω, let T be the target concept
(which, for the sake of simplicity, is assumed to be deterministic), set 1 ≤ q < ∞
and let $F = \{\, |g - T|^q : g \in G \,\}$. The Glivenko-Cantelli sample complexity of the class $F$ with respect to accuracy $\varepsilon$ and confidence $\delta$ is the smallest integer $n_0$ such that for every $n \ge n_0$,
$$\sup_\mu \mu^\infty\Big\{ \sup_{g\in G} \big|\, E_\mu|g - T|^q - E_{\mu_n}|g - T|^q \,\big| \ge \varepsilon \Big\} \le \delta,$$
where µn is the empirical measure supported on the sample (ω1 , ..., ωn ).
Therefore, if $g$ is an "almost" minimizer of the empirical loss $E_{\mu_n}|g - T|^q$ and
if the sample is “large enough” then g is an “almost” minimizer with respect to
the average loss. One can show that the learning sample complexity is bounded
by the supremum of the GC sample complexities, where the supremum is taken
over all possible targets T , bounded by 1. This is true even in the agnostic case,
in which T may be random (for further details, see [1]).
Recently it was shown [7] that if the log-covering numbers (resp. the fat-shattering dimension) of $G$ are of the order of $\varepsilon^{-p}$ for $p \ge 2$, then the GC sample complexity of $F$ is $\Theta(\varepsilon^{-p})$ up to logarithmic factors in $\varepsilon^{-1}$, $\delta^{-1}$.
It is important to emphasize that the learning sample complexity may be established by means other than the GC condition. Hence, it comes as no surprise that there are certain cases in which it is possible to improve this bound on the learning sample complexity. In [5,6] the following case was examined: let $F$ be the loss class given by $\{\, |g - T|^2 - |T - P_G T|^2 \,:\, g \in G \,\}$, where $P_G T$ is a nearest point to $T$ in $G$ with respect to the $L_2(\mu)$ norm. Assume that there is an absolute constant $C$ such that for every $f \in F$, $E_\mu f^2 \le C E_\mu f$, i.e., that it is possible to control the variance of each loss function using its expectation. In this case, the learning sample complexity with accuracy $\varepsilon$ and confidence $\delta$ is
$$O\left( \frac{1}{\varepsilon}\left( \mathrm{fat}_\varepsilon(F)\,\log^2\frac{\mathrm{fat}_\varepsilon(F)}{\varepsilon} + \log\frac{1}{\delta} \right) \right).$$
Therefore, if $\mathrm{fat}_\varepsilon(F) = O(\varepsilon^{-p})$ the learning sample complexity is bounded (up to logarithmic factors) by $O(\varepsilon^{-(1+p)})$. If $p < 1$, this estimate is better than the one obtained using the GC sample complexities.
As it turns out, the assumption above is not so far-fetched; it is possible to show [5,6] that there are two generic cases in which $E_\mu f^2 \le C E_\mu f$. The first case is when $T \in G$, because this implies that each $f \in F$ is nonnegative. The other case is when $G$ is convex and $q = 2$; there, every loss function is given by $|g - T|^2 - |T - P_G T|^2$, where $P_G T$ is the nearest point to $T$ in $G$ with respect to the $L_2(\mu)$ norm.
Here, we combine the ideas used in [6] and in [7] to improve the learning complexity estimates. We show that if
$$\sup_n \sup_{\mu_n} \log N\big(\varepsilon, G, L_2(\mu_n)\big) = O(\varepsilon^{-p})$$
for $p < 2$, and if either $T \in G$ or $G$ is convex, then the learning sample complexity is $O\big(\varepsilon^{-(1+p/2)}\big)$ up to logarithmic factors in $\varepsilon^{-1}$, $\delta^{-1}$. Recently it was shown in [7,8] that for every $1 \le q < \infty$ there is a constant $c_q$ such that
$$\sup_n \sup_{\mu_n} \log N\big(\varepsilon, G, L_2(\mu_n)\big) \le c_q\, \mathrm{fat}_{\varepsilon/8}(G)\, \log^2\frac{2\,\mathrm{fat}_{\varepsilon/8}(G)}{\varepsilon}; \tag{1}$$
therefore, the estimates we obtain improve the $O\big(\varepsilon^{-(1+p)}\big)$ bound established in [6].
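As a brief illustration (a sketch only; cf. the examples following Theorem 2): if $\mathrm{fat}_\varepsilon(G) = O(\varepsilon^{-p})$ for some $p < 2$, then (1) yields
$$\sup_n \sup_{\mu_n} \log N\big(\varepsilon, G, L_2(\mu_n)\big) = O\big(\varepsilon^{-p}\log^2(1/\varepsilon)\big) = O\big(\varepsilon^{-p'}\big) \quad \text{for every } p' \in (p, 2),$$
so classes with polynomially bounded fat-shattering dimension of exponent $p < 2$ fall within the scope of the covering-number assumption above.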
Our discussion is divided into three sections. In the second section we investigate the GC condition for classes which have "small" log-covering numbers. We focus the discussion on the case where the deviation in the GC condition is of the same order of magnitude as $\sup_{f\in F} E_\mu f^2$. The proof is based on estimates on the Rademacher averages (defined below) associated with the class. Next, we explore sufficient conditions which imply that if $F$ is the $q$-loss class associated with a convex class $G$, then $E_\mu f^2$ may be controlled by $E_\mu f$. We use a geometric approach to prove that if $q \ge 2$ there is some constant $B$ such that for every $q$-loss function $f$, $E_\mu f^2 \le B(E_\mu f)^{2/q}$. It turns out that these estimates are the key behind the proof of the learning sample complexity, which is investigated in the fourth and final section.
Next, we turn to some definitions and notation we shall use throughout this
article.
If $G$ is a class of functions, $T$ is the target concept and $1 \le q < \infty$, then for every $g \in G$ let $\ell_q(g)$ be its $q$-loss function. Thus,
$$\ell_q(g) = |g - T|^q - |T - P_G T|^q,$$
where $P_G T$ is a nearest element to $T$ in $G$ with respect to the $L_q$ norm. We denote by $F$ the set of loss functions $\ell_q(g)$.
Let G be a GC class. For every 0 < ε, δ < 1, denote by SG (ε, δ) the GC
sample complexity of the class G associated with accuracy ε and confidence δ.
Let $C^q_{G,T}(\varepsilon, \delta)$ be the learning sample complexity of the class $G$ with respect to the target $T$ and the $q$-loss, for accuracy $\varepsilon$ and confidence $\delta$.
Given a Banach space X, let B(X) be the unit ball of X. If B ⊂ X is a ball,
set int(B) to be the interior of B and ∂B is the boundary of B. The dual of
$X$, denoted by $X^*$, consists of all the bounded linear functionals on $X$, endowed with the norm $\|x^*\| = \sup_{\|x\|=1}|x^*(x)|$. If $x, y \in X$, the interval $[x, y]$ is defined by $[x, y] = \{\, tx + (1-t)y \,:\, 0 \le t \le 1 \,\}$.
For any probability measure $\mu$ on a measurable space $(\Omega, \Sigma)$, let $E_\mu$ be the expectation with respect to $\mu$. $L_q(\mu)$ is the set of functions which satisfy $E_\mu|f|^q < \infty$, and we set $\|f\|_q = (E_\mu|f|^q)^{1/q}$. $L_\infty(\Omega)$ is the space of bounded functions on $\Omega$, with respect to the norm $\|f\|_\infty = \sup_{\omega\in\Omega}|f(\omega)|$. We denote by $\mu_n$ an empirical measure supported on a set of $n$ points; hence $\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{\omega_i}$, where $\delta_{\omega_i}$ is the point evaluation functional at $\omega_i$.
If (X, d) is a metric space, set B(x, r) to be the closed ball centered at x
with radius r. Recall that if F ⊂ X, the ε-covering number of F , denoted by
N (ε, F, d), is the minimal number of open balls with radius ε > 0 (with respect
to the metric d) needed to cover F . A set A ⊂ X is said to be an ε-cover of F if
the union of open balls $\bigcup_{a\in A} B(a, \varepsilon)$ contains $F$. In cases where the metric $d$ is
clear, we shall denote the covering numbers of F by N (ε, F ).
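The following Python sketch (our own illustration, under the assumption that the class is finite and represented by its values on a sample) computes a greedy $\varepsilon$-cover in the empirical $L_2(\mu_n)$ metric; the number of centers it returns upper-bounds $N(\varepsilon, F, L_2(\mu_n))$ for that finite class.

    import numpy as np

    def greedy_cover(values, eps):
        """Greedy eps-cover of a finite function class in L2(mu_n).

        values : array of shape (m, n); row j holds (f_j(omega_1), ..., f_j(omega_n))
        eps    : radius of the covering balls
        Returns the indices of the chosen centers.
        """
        m, _ = values.shape
        uncovered = np.ones(m, dtype=bool)
        centers = []
        while uncovered.any():
            j = int(np.argmax(uncovered))      # first function not yet covered
            centers.append(j)
            # empirical L2 distances from every function to the new center
            dists = np.sqrt(np.mean((values - values[j]) ** 2, axis=1))
            uncovered &= dists >= eps
        return centers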
It is possible to show that the covering numbers of the q-loss class F are
essentially the same as those of G.
Lemma 1. Let $G \subset B(L_\infty(\Omega))$ and let $F$ be the $q$-loss class associated with $G$. Then, for any probability measure $\mu$,
$$\log N\big(\varepsilon, F, L_2(\mu)\big) \le \log N\big(\varepsilon/q, G, L_2(\mu)\big).$$
The proof of this lemma is standard and is omitted.
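For completeness, here is a one-line sketch of the standard argument (a reconstruction, assuming as in the rest of the paper that the functions take values in $[0,1]$): the map $t \mapsto t^q$ is $q$-Lipschitz on $[0,1]$, so for every $g_1, g_2 \in G$ and every $\omega$, $\big|\,|g_1(\omega) - T(\omega)|^q - |g_2(\omega) - T(\omega)|^q\,\big| \le q\,|g_1(\omega) - g_2(\omega)|$. Hence
$$\big\| \ell_q(g_1) - \ell_q(g_2) \big\|_{L_2(\mu)} \le q\,\|g_1 - g_2\|_{L_2(\mu)}$$
(the term $|T - P_G T|^q$, common to all loss functions, cancels), so the image of an $\varepsilon/q$-cover of $G$ is an $\varepsilon$-cover of $F$.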
Next, we define the Rademacher averages of a given class of functions, which
is the main tool we use in the analysis of GC classes.
Definition 1. Let $F$ be a class of functions and let $\mu$ be a probability measure on $\Omega$. Set
$$\bar R_{n,\mu} = E_\mu E_\varepsilon \sup_{f\in F} \frac{1}{\sqrt n}\Big| \sum_{i=1}^n \varepsilon_i f(X_i) \Big|,$$
where the $\varepsilon_i$ are independent Rademacher random variables (i.e. symmetric, $\{-1,1\}$-valued) and the $(X_i)$ are independent, distributed according to $\mu$.
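A brief numerical aside (ours, not the paper's): for a finite class given by its values on the sample, the inner expectation $E_\varepsilon$ can be approximated by Monte Carlo over random sign vectors; averaging over fresh samples drawn from $\mu$ then approximates $\bar R_{n,\mu}$ itself.

    import numpy as np

    def empirical_rademacher(values, n_rounds=1000, rng=None):
        """Monte Carlo estimate of E_eps sup_f | n^{-1/2} sum_i eps_i f(X_i) |.

        values : array of shape (m, n); row j holds (f_j(X_1), ..., f_j(X_n))
        """
        rng = np.random.default_rng() if rng is None else rng
        _, n = values.shape
        total = 0.0
        for _ in range(n_rounds):
            eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
            total += np.max(np.abs(values @ eps)) / np.sqrt(n)
        return total / n_rounds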
It is possible to estimate R̄n,µ of a given class using its L2 (µn ) covering numbers.
The key result behind this estimate is the following theorem which is originally
due to Dudley for Gaussian processes, and was extended to the more general
setting of subgaussian processes [10]. We shall formulate it only in the case of
Rademacher processes.
Theorem 1. There is an absolute constant $C$ such that for every probability measure $\mu$, any integer $n$ and every sample $(\omega_i)_{i=1}^n$,
$$E_\varepsilon \sup_{f\in F} \frac{1}{\sqrt n}\Big| \sum_{i=1}^n \varepsilon_i f(\omega_i) \Big| \le C \int_0^{\delta} \log^{\frac12} N\big(\varepsilon, F, L_2(\mu_n)\big)\, d\varepsilon,$$
where
$$\delta = \sup_{f\in F} \Big( \frac{1}{n} \sum_{i=1}^n f^2(\omega_i) \Big)^{\frac12}$$
and $\mu_n$ is the empirical measure supported on the given sample $(\omega_1, \ldots, \omega_n)$.
Finally, throughout this paper, all absolute constants are assumed to be positive
and are denoted by C or c. Their values may change from line to line or even
within the same line.
We end the introduction with a statement of our main result.
Theorem 2. Let $G \subset B(L_\infty(\Omega))$ be such that
$$\sup_n \sup_{\mu_n} \log N\big(\varepsilon, G, L_2(\mu_n)\big) \le \gamma\varepsilon^{-p}$$
for some $p < 2$ and let $T \in B(L_\infty(\Omega))$ be the target concept. If $1 \le q < \infty$ and either
(1) $T \in G$, or
(2) $G$ is convex,
there are constants $C_{q,p,\gamma}$ which depend only on $q$, $p$, $\gamma$ such that
$$C^q_{G,T}(\varepsilon, \delta) \le C_{q,p,\gamma} \left(\frac{1}{\varepsilon}\right)^{1+\frac{p}{2}} \left(1 + \log\frac{2}{\delta}\right).$$
Also, if
$$\sup_n \sup_{\mu_n} N\big(\varepsilon, G, L_2(\mu_n)\big) \le \left(\frac{\gamma}{\varepsilon}\right)^d$$
and either (1) or (2) hold, then
$$C^q_{G,T}(\varepsilon, \delta) \le C_{q,p,\gamma}\, d\, \frac{1}{\varepsilon}\left( \log\frac{2}{\varepsilon} + \log\frac{2}{\delta} \right).$$
There are numerous cases in which Theorem 2 applies. We shall name a few of
them. In the proper case, Theorem 2 may be used, for example, if G is a VC class
or a VC subgraph class [10]. Another case is if G satisfies that fatε (G) = O(ε−p )
[7]. In both cases the covering numbers of G satisfy the conditions of the theorem.
In the improper case, one may consider, for example, convex hulls of VC classes
[10] and many Kernel Machines [11].
2 GC for Classes with “Small” Variance
Although the “normal” GC sample complexity of a class is $\Omega(\varepsilon^{-2})$ when one allows the deviation to be arbitrarily small, it is possible to obtain better estimates when the deviation we are interested in is roughly the largest variance of a
member of the class. As we demonstrate in the final section, this fact is very
important in the analysis of learning sample complexities. The first step in that
direction is the next result, which is due to Talagrand [9].
Theorem 3. Assume that $F$ is a class of functions into $[0,1]$. Let
$$\sigma^2 = \sup_{f\in F} E_\mu (f - E_\mu f)^2, \qquad S_n = n\sigma^2 + \sqrt{n}\,\bar R_{n,\mu}.$$
For every $L, S > 0$ and $t > 0$ define
$$\varphi_{L,S}(t) = \begin{cases} \dfrac{t^2}{L^2 S} & \text{if } t \le LS,\\[1mm] \dfrac{t}{L}\,\log\dfrac{et}{LS} & \text{if } t \ge LS. \end{cases}$$
There is an absolute constant $K$ such that if $t \ge K\sqrt{n}\,\bar R_{n,\mu}$, then
$$\mu^n\Big\{ \sup_{f\in F}\Big| \sum_{i=1}^n f(X_i) - nE_\mu f \Big| \ge t \Big\} \le \exp\big(-\varphi_{K,S_n}(t)\big).$$
In the next two results, we present an estimate on the Rademacher averages R̄n,µ
using data on the “largest” variance of a member of the class and on the covering
numbers of $F$ in empirical $L_2$ spaces. We divide the results into the case where the covering numbers are polynomial in $\varepsilon^{-1}$ and the case where the log-covering numbers are polynomial in $\varepsilon^{-1}$ with exponent $p < 2$.
Lemma 2. Let $F$ be a class of functions into $[0,1]$ and set $\tau^2 = \sup_{f\in F} E_\mu f^2$. Assume that there are $\gamma \ge 1$ and $d \ge 1$ such that for every empirical measure $\mu_n$,
$$N\big(\varepsilon, F, L_2(\mu_n)\big) \le (\gamma/\varepsilon)^d.$$
Then, there is a constant $C_\gamma$, which depends only on $\gamma$, such that
$$\bar R_{n,\mu} \le C_\gamma \max\left\{ \frac{d}{\sqrt n}\log\frac{1}{\tau},\ \sqrt{d}\,\tau\,\log^{\frac12}\frac{1}{\tau} \right\}.$$
An important part of the proof is the following result, again, due to Talagrand
[9], on the expectation of the diameter of F when considered as a subset of
L2 (µn ).
Lemma 3. Let $F \subset B(L_\infty(\Omega))$ and set $\tau^2 = \sup_{f\in F} E_\mu f^2$. Then, for every probability measure $\mu$, every integer $n$, and every $(X_i)_{i=1}^n$ which are independent, distributed according to $\mu$,
$$E_\mu \sup_{f\in F} \sum_{i=1}^n f^2(X_i) \le n\tau^2 + 8\sqrt{n}\,\bar R_{n,\mu}.$$
n
Proof (of Lemma 2). Set Y = n−1 supf ∈F i=1 f 2 (Xi ). By Theorem 1 there is
an absolute constant C such that for every sample (X1 , ..., Xn )
√Y
n
1
1
√ Eε sup εi f (Xi ) ≤ C
log 2 N ε, F, L2 (µn ) dε
n f ∈F i=1
0
√
√ Y
1 γ
log 2 dε .
=C d
ε
0
It is easy to see that there is some absolute constant Cγ such that for every
0 < x ≤ 1,
0
1
x
1
log 2
1 Cγ
γ
dε ≤ 2x log 2
,
ε
x
1
and v(x) = x 2 log 2 (Cγ /x) is increasing and concave in (0, 10). Since Y ≤ 1,
n
√
1 Eε sup √ εi f (Xi ) ≤ C dY
n i=1
f ∈F
1
2
1
log 2
Cγ
Y
1
2
.
Also, since v(x) is increasing and concave in (0, 10), and since EY ≤ 9, then
by Jensen’s inequality, Lemma 3 and the fact that v is increasing, there is a
Learning Relatively Small Classes
279
constant Cγ which depends only on γ, such that
1
1
1 Cγ
1 2
1
1
2
≤ Cγ (Eµ Y ) 2 log 2
Eµ Y 2 log 2 1 = Cγ Eµ Y 2 log 2
Y
E
2
Y
µY
1
1
2
R̄n,µ 2
log 2
≤ Cγ τ 2 + 8 √
8R̄
2
n
τ + √n,µ
8R̄n,µ
≤ Cγ τ 2 + √
n
12
n
log
1
2
2
.
τ
Therefore,
√ 1 2
8R̄n,µ 12
,
R̄n,µ ≤ Cγ d τ 2 + √
log 2
τ
n
and our claim follows from a straightforward computation.
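A hedged sketch of that computation (a reconstruction, up to the value of the constants): write $R = \bar R_{n,\mu}$. If $\tau^2 \ge 8R/\sqrt n$, the last display gives $R \le C_\gamma\sqrt d\,(2\tau^2)^{1/2}\log^{1/2}(2/\tau) \le C'_\gamma\,\sqrt d\,\tau\,\log^{1/2}(1/\tau)$. Otherwise $\tau^2 < 8R/\sqrt n$ and
$$R \le C_\gamma\sqrt d\,\Big(\frac{16R}{\sqrt n}\Big)^{\frac12}\log^{\frac12}\frac{2}{\tau} \quad\Longrightarrow\quad R \le C'_\gamma\,\frac{d}{\sqrt n}\log\frac{2}{\tau}.$$
Together these two regimes give the maximum stated in Lemma 2.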
Next, we explore the case in which the log-covering numbers are polynomial with
exponent p < 2.
Lemma 4. Let $F$ be a class of functions into $[0,1]$ and set $\tau^2 = \sup_{f\in F} E_\mu f^2$. Assume that there are $\gamma \ge 2$ and $p < 2$ such that for every empirical measure $\mu_n$,
$$\log N\big(\varepsilon, F, L_2(\mu_n)\big) \le \gamma\varepsilon^{-p}.$$
Then, there is a constant $C_{p,\gamma}$ which depends only on $p$ and on $\gamma$ such that
$$\bar R_{n,\mu} \le C_{p,\gamma} \max\Big\{ n^{-\frac12\cdot\frac{2-p}{2+p}},\ \tau^{1-\frac p2} \Big\}.$$
n
Proof. Again, let Y = n−1 supf ∈F i=1 f 2 (Xi ). For every realization of (Xi )ni=1 ,
set µn to be the empirical measure supported on X1 , ..., Xn . By Theorem 1 for
every fixed sample, there is an absolute constant C, such that
√Y
n
1
1
√ Eε sup εi f (Xi ) ≤C
log 2 N ε, F, L2 (µn ) dε
n f ∈F 0
i=1
≤
1
n
12 (1− p2 )
Cγ 2 1
f 2 (Xi )
.
sup
1 − p/2 n f ∈F i=1
Taking the expectation with respect to µ and by Jensen’s inequality and Lemma
3,
R̄n,µ ≤ Cp,γ
p
12 (1− p2 ) 8R̄n,µ 12 (1− 2 )
2
√
f (Xi )
≤ τ +
.
Eµ sup
n f ∈F i=1
n
1
n
2
Therefore,
p
8R̄n,µ 12 (1− 2 )
R̄n,µ ≤ Cp,γ τ 2 + √
n
and the claim follows.
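A hedged sketch of this last step (a reconstruction): set $R = \bar R_{n,\mu}$ and $\beta = \frac12(1-\frac p2)$, so the display reads $R \le C_{p,\gamma}(\tau^2 + 8R/\sqrt n)^\beta$. If $\tau^2 \ge 8R/\sqrt n$, then $R \le C_{p,\gamma}(2\tau^2)^\beta \le C'_{p,\gamma}\,\tau^{1-\frac p2}$. Otherwise $R \le C_{p,\gamma}(16R/\sqrt n)^\beta$, and solving for $R$ gives
$$R \le C'_{p,\gamma}\, n^{-\frac{\beta}{2(1-\beta)}} = C'_{p,\gamma}\, n^{-\frac12\cdot\frac{2-p}{2+p}},$$
which is exactly the stated maximum of the two terms.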
Combining Theorem 3, Lemma 2 and Lemma 4, we obtain the following deviation
estimate:
Theorem 4. Let $F$ be a class of functions whose range is contained in $[0,1]$ and set $\tau^2 = \sup_{f\in F} E_\mu f^2$.
If there are $\gamma > 0$ and $d \ge 1$ such that
$$\log N\big(\varepsilon, F, L_2(\mu_n)\big) \le d\,\log\frac{\gamma}{\varepsilon}$$
for every empirical measure $\mu_n$ and every $\varepsilon > 0$, then there is a constant $C_\gamma$, which depends only on $\gamma$, such that for every $k > 0$ and every $0 < \delta < 1$,
$$S_F(k\tau^2, \delta) \le C_\gamma\, d\, \max\{k^{-1}, k^{-2}\}\, \frac{1}{\tau^2}\Big( \log\frac{2}{\tau} + \log\frac{2}{\delta} \Big).$$
If there are $\gamma \ge 1$ and $p < 2$ such that
$$\log N\big(\varepsilon, F, L_2(\mu_n)\big) \le \gamma\varepsilon^{-p}$$
for every empirical measure $\mu_n$ and every $\varepsilon > 0$, then there are constants $C_{p,\gamma}$ which depend only on $p$ and $\gamma$, such that for every $k > 0$ and every $0 < \delta < 1$,
$$S_F(k\tau^2, \delta) \le C_{p,\gamma}\, \max\{k^{-1}, k^{-2}\}\, \Big(\frac{1}{\tau^2}\Big)^{1+\frac p2}\Big(1 + \log\frac{2}{\delta}\Big).$$
Since the proof follows from a straightforward (yet tedious) calculation, we omit
its details.
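In rough outline, for the polynomial log-covering case (a sketch only, with all constants suppressed): apply Theorem 3 to $F$ with $t = nk\tau^2$ and $\sigma^2 \le \tau^2$. The restriction $t \ge K\sqrt n\,\bar R_{n,\mu}$, combined with Lemma 4, is satisfied once $n \ge C k^{-2}(1/\tau^2)^{1+p/2}$; in that range $S_n \le n\tau^2 + \sqrt n\,\bar R_{n,\mu} \le Cn\tau^2$, so the subgaussian branch of $\varphi_{K,S_n}$ gives $\varphi_{K,S_n}(t) \ge c\,nk^2\tau^2$ and $\exp(-\varphi_{K,S_n}(t)) \le \delta$ once $n \ge Ck^{-2}\tau^{-2}\log(1/\delta)$ (the branch $t \ge KS_n$ is what produces the $k^{-1}$ term in the maximum). Both requirements are absorbed by the stated bound.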
3 Dominating the Variance
The key assumption used in the proof of the learning sample complexity in [6] was that there is some $B > 0$ such that for every loss function $f$, $E_\mu f^2 \le B E_\mu f$. Though this is easily satisfied in proper learning (because each $f$ is nonnegative), it is far from obvious whether the same holds in the improper setting. In [6] it was observed that if $G$ is convex and $F$ is the squared-loss class then $E_\mu f^2 \le B E_\mu f$, and $B$ depends on the $L_\infty$ bound on the members of $G$ and the target. The question we study is whether the same kind of bound can be established in other $L_q$ norms. We will show that if $q \ge 2$ and if $F$ is the $q$-loss class associated with $G$, there is some $B$ such that for every $f \in F$, $E_\mu f^2 \le B(E_\mu f)^{2/q}$.
Our proof is based on a geometric characterization of the nearest point map onto a convex subset of $L_q$ spaces. This fact was used in [6] for $q = 2$, but no emphasis was put on the geometric idea behind it. Our methods enable us to obtain the bound in $L_q$ for $q \ge 2$.
Formally, let $1 \le q < \infty$ and set $G$ to be a compact convex subset of $L_q(\mu)$ such that each $g \in G$ maps $\Omega$ into $[0,1]$. Let $F$ be the $q$-loss class associated with $G$ and $T$, which is also assumed to have a range contained in $[0,1]$. Hence, each $f \in F$ is given by $f = |T - g|^q - |P_G T - T|^q$, where $P_G T$ is the nearest point to $T$ in $G$ with respect to the $L_q(\mu)$ norm.
Since we are interested in GC sets with “small” covering numbers in Lq for
every 1 ≤ q < ∞ (e.g. fatε (G) = O(ε−p ) for p < 2), the assumption that G is
compact in Lq follows if G is closed in Lq [7].
It is possible to show (see appendix A) that if 1 < q < ∞ and if G ⊂ Lq is
convex and compact, the nearest point map onto G is a well defined map, i.e.,
each T ∈ Lq has a unique best approximation in G.
We start our analysis by proving an upper bound on $E_\mu f^2$:
Lemma 5. Let $g \in G$, assume that $1 < q < \infty$ and set $f = \ell_q(g)$. Then,
$$E_\mu f^2 \le q^2\, E_\mu|g - P_G T|^2.$$
Proof. Given any $\omega \in \Omega$, apply Lagrange's Theorem to the function $y = |x|^q$ with $x_1 = g(\omega) - T(\omega)$ and $x_2 = P_G T(\omega) - T(\omega)$. The result follows by taking the expectation, since $|x_1|, |x_2| \le 1$.
The next step, which is to bound $E_\mu f$ from below, is considerably more difficult. We shall require the following definitions, which are standard in Banach space theory [2,4]:
Definition 2. A Banach space $X$ is called strictly convex if for every $x, y \in X$ such that $x \ne y$ and $\|x\| = \|y\| = 1$, $\|x + y\| < 2$. $X$ is called uniformly convex if there is a positive function $\delta(\varepsilon)$ such that for every $0 < \varepsilon < 2$ and every $x, y \in X$ for which $\|x\|, \|y\| \le 1$ and $\|x - y\| \ge \varepsilon$, $\|x + y\| \le 2 - 2\delta(\varepsilon)$. Thus,
$$\delta(\varepsilon) = \inf\Big\{ 1 - \frac12\|x + y\| \ \Big|\ \|x\|, \|y\| \le 1,\ \|x - y\| \ge \varepsilon \Big\}.$$
The function $\delta(\varepsilon)$ is called the modulus of convexity of $X$.
It is easy to see that X is strictly convex if and only if its unit sphere does not
contain intervals. Also, if X is uniformly convex then it is strictly convex. Using
the modulus of convexity one can provide a lower bound on the distance between the average of two elements of the unit sphere of $X$ and the sphere.
It was shown in [3] that for every $1 < q \le 2$ the modulus of convexity of $L_q$ is given by $\delta_q(\varepsilon) = (q-1)\varepsilon^2/8 + o(\varepsilon^2)$, and if $q \ge 2$ it is given by $\delta_q(\varepsilon) = 1 - \big(1 - (\varepsilon/2)^q\big)^{1/q}$.
The next lemma enables one to prove the lower estimate on $E_\mu f$. The proof of the lemma is based on several ideas commonly used in convex geometry, and is presented in Appendix A.
Lemma 6. Let $X$ be a uniformly convex, smooth Banach space with a modulus of convexity $\delta_X$. Let $G \subset X$ be compact and convex, set $T \notin G$ and $d = \|T - P_G T\|$. Then, for every $g \in G$, $\delta_X\big(d_g^{-1}\|g - P_G T\|\big) \le 1 - d/d_g$, where $d_g = \|T - g\|$.
Corollary 1. Let $q \ge 2$ and assume that $G$ is a compact convex subset of the unit ball of $L_q(\mu)$ which consists of functions whose range is contained in $[0,1]$. Let $T$ be the target function, which also maps $\Omega$ into $[0,1]$. If $F$ is the $q$-loss class associated with $G$ and $T$, there are constants $C_q$ which depend only on $q$ such that for every $g \in G$, $E_\mu f^2 \le C_q (E_\mu f)^{2/q}$, where $f$ is the $q$-loss function associated with $g$.
Proof. Recall that the modulus of uniform convexity of $L_q$ for $q \ge 2$ is $\delta_q(\varepsilon) = 1 - \big(1 - (\varepsilon/2)^q\big)^{1/q}$. By Lemma 6 it follows that
$$1 - \Big( \frac{\|g - P_G T\|_q}{2 d_g} \Big)^q \ \ge\ \Big( \frac{d}{d_g} \Big)^q.$$
Note that $E_\mu \ell_q(g) = d_g^q - d^q$, hence
$$E_\mu f = d_g^q - d^q \ge 2^{-q}\, E_\mu|g - P_G T|^q.$$
By Lemma 5 and since $\|\cdot\|_2 \le \|\cdot\|_q$,
$$E_\mu f^2 \le q^2\, E_\mu|g - P_G T|^2 \le q^2 \big( E_\mu|g - P_G T|^q \big)^{2/q} \le 4q^2\, (E_\mu f)^{2/q}.$$
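A small numerical sanity check of Corollary 1 in Python (an illustration only; the two generators, the target and the uniform measure below are arbitrary toy choices): on a finite probability space, take $G$ to be the convex hull of two $[0,1]$-valued functions, approximate $P_G T$ by a grid search over the mixing parameter, and verify that $E_\mu f^2 \le 4q^2 (E_\mu f)^{2/q}$, which is the constant produced by the proof above.

    import numpy as np

    rng = np.random.default_rng(1)
    q = 4                                  # any q >= 2
    n_points = 50                          # finite Omega with the uniform measure mu
    g0 = rng.uniform(0, 1, n_points)       # G = {lam*g0 + (1-lam)*g1 : lam in [0,1]}
    g1 = rng.uniform(0, 1, n_points)
    T = rng.uniform(0, 1, n_points)        # target, also [0,1]-valued
    E = lambda h: np.mean(h)               # expectation under the uniform measure

    # nearest point to T in G with respect to the L_q(mu) norm, by grid search
    lams = np.linspace(0, 1, 2001)
    dists = [E(np.abs(lam * g0 + (1 - lam) * g1 - T) ** q) for lam in lams]
    best = lams[int(np.argmin(dists))]
    PGT = best * g0 + (1 - best) * g1

    # largest ratio E f^2 / (E f)^(2/q) over (a grid of) the class
    worst = 0.0
    for lam in np.linspace(0, 1, 201):
        g = lam * g0 + (1 - lam) * g1
        f = np.abs(g - T) ** q - np.abs(PGT - T) ** q   # q-loss function of g
        if E(f) > 1e-12:                  # E f is (essentially) nonnegative, PGT being optimal
            worst = max(worst, E(f ** 2) / E(f) ** (2.0 / q))

    print("largest observed ratio:", worst, "  bound 4*q^2 =", 4 * q * q)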
4 Learning Sample Complexity
Unlike the GC sample complexity, the behaviour of the learning sample complexity is not monotone, in the sense that even if $H \subset G$, it is possible that the learning sample complexity associated with $G$ may be smaller than that associated with $H$. This is due to the fact that a well-behaved geometric structure of the class (e.g. convexity) enables one to derive additional data regarding the loss functions associated with the class. We will show that the learning sample complexity is determined by the GC sample complexity for a class of functions $H$ such that $\sup_{h\in H} E_\mu h^2$ is roughly the same as the desired accuracy.
We shall formulate results in two cases. The first theorem deals with proper
learning (i.e. T ∈ G). In the second, we discuss improper learning. We present a
proof only for the second theorem.
Let us introduce the following notation: for a fixed $\varepsilon > 0$ and given any empirical measure $\mu_n$, let $f^*_{\mu_n}$ be some $f \in F$ such that $E_{\mu_n} f^*_{\mu_n} \le \varepsilon/2$. Thus, if $g \in G$ satisfies $\ell_q(g) = f^*_{\mu_n}$, then $g$ is an "almost minimizer" of the empirical loss.
Theorem 5. Assume that $G$ is a class whose elements map $\Omega$ into $[0,1]$ and fix $T \in G$. Assume that $1 \le q < \infty$, $\gamma > 1$, and let $F$ be the $q$-loss class associated with $G$ and $T$. Assume further that $p < 2$ and that for every integer $n$ and any empirical measure $\mu_n$, $\log N\big(\varepsilon, G, L_2(\mu_n)\big) \le \gamma\varepsilon^{-p}$ for every $\varepsilon > 0$. Then, there is a constant $C_{p,q,\gamma}$ which depends only on $p$, $q$ and $\gamma$ such that if
$$n \ge C_{p,q,\gamma} \Big(\frac{1}{\varepsilon}\Big)^{1+\frac p2}\Big(1 + \log\frac{1}{\delta}\Big),$$
then $Pr\big\{ E_\mu f^*_{\mu_n} \ge \varepsilon \big\} \le \delta$.
Now, we turn to the improper case.
Theorem 6. Let $G$, $F$ and $q$ be as in Theorem 5 and assume that $T \notin G$. Assume further that $G$ is convex and that there is some $B$ such that for every $f \in F$, $E_\mu f^2 \le B(E_\mu f)^{2/q}$. Then, the assertion of Theorem 5 remains true.
Remark 1. The key assumption here is that $E_\mu f^2 \le B(E_\mu f)^{2/q}$ for every loss function $f$. Recall that in Section 3 we explored this issue. Combining Theorem 5 and Theorem 6 with the results of Section 3 gives us the promised estimate on the learning sample complexity.
Before proving the theorem we will need the following observations. Assume that $q \ge 2$ and let $F$ be the $q$-loss class. First, note that
$$Pr\big\{ E_\mu f^*_{\mu_n} \ge \varepsilon \big\} \le Pr\big\{ \exists f \in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 < \varepsilon,\ E_{\mu_n} f \le \varepsilon/2 \big\} + Pr\big\{ \exists f \in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 \ge \varepsilon,\ E_{\mu_n} f \le \varepsilon/2 \big\} = (1) + (2).$$
If $E_\mu f \ge \varepsilon$ and $E_{\mu_n} f \le \varepsilon/2$ then
$$E_\mu f \ge \frac12(E_\mu f + \varepsilon) \ge \frac12 E_\mu f + E_{\mu_n} f.$$
Therefore, $|E_\mu f - E_{\mu_n} f| \ge \frac12 E_\mu f \ge \varepsilon/2$, and thus
$$(1) + (2) \le Pr\Big\{ \exists f \in F,\ E_\mu f^2 < \varepsilon,\ |E_\mu f - E_{\mu_n} f| \ge \frac\varepsilon2 \Big\} + Pr\Big\{ \exists f \in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 \ge \varepsilon,\ |E_\mu f - E_{\mu_n} f| \ge \frac12 E_\mu f \Big\}.$$
Let $\alpha = 2 - 2/q$ and set
$$H = \Big\{ \frac{\varepsilon^\alpha f}{E_\mu f} \ \Big|\ f \in F,\ E_\mu f \ge \varepsilon,\ E_\mu f^2 \ge \varepsilon \Big\}. \tag{2}$$
Since $q \ge 2$ then $\alpha \ge 1$, and since $\varepsilon < 1$, each $h \in H$ maps $\Omega$ into $[0,1]$. Also, if $E_\mu f^2 \le B(E_\mu f)^{2/q}$ then
$$E_\mu h^2 \le B\,\frac{\varepsilon^{2\alpha}}{(E_\mu f)^{2-2/q}} \le B\varepsilon^\alpha.$$
Therefore,
$$Pr\big\{ E_\mu f^*_{\mu_n} \ge \varepsilon \big\} \le Pr\Big\{ \exists f \in F,\ E_\mu f^2 < \varepsilon,\ |E_\mu f - E_{\mu_n} f| \ge \frac\varepsilon2 \Big\} + Pr\Big\{ \exists h \in H,\ E_\mu h^2 \le B\varepsilon^\alpha,\ |E_\mu h - E_{\mu_n} h| \ge \frac{\varepsilon^\alpha}{2} \Big\}. \tag{3}$$
In both cases we are interested in a GC condition at a scale which is roughly the "largest" variance of a member of the class. This is the exact question which was investigated in Section 2.
The only problem in applying Theorem 4 directly to the class H is the fact
that we do not have an a-priori bound on the covering numbers of that class.
The question we need to tackle before proceeding is how to estimate the covering
numbers of H, given that the covering numbers of F are well behaved. To that
end we need to use the specific structure of F , namely, that it is a q-loss class
associated with the class G.
We divide our discussion into two parts. First we deal with proper learning, in which each loss function is given by $f = |g - T|^q$ and no specific assumptions are needed on the structure of $G$. Then, we turn to the improper case, and assume that $G$ is convex and $F$ is the $q$-loss class for $1 \le q < \infty$.
To handle both cases, we require the following simple definition:
Definition 3. Let X be a normed space and let A ⊂ X. We say that A is star-shaped with center x if for every a ∈ A the interval [a, x] ⊂ A. Given A and x,
denote by star(A, x) the union of all the intervals [a, x], where a ∈ A.
The next lemma shows that the covering numbers of star(A, x) are almost the
same as those of A.
Lemma 7. Let $X$ be a normed space and let $A \subset B(X)$. Then, for any $\|x\| \le 1$ and every $\varepsilon > 0$, $N\big(2\varepsilon, \mathrm{star}(A, x)\big) \le 2N(\varepsilon, A)/\varepsilon$.
Proof. Fix $\varepsilon > 0$ and let $y_1, \ldots, y_k$ be an $\varepsilon$-cover of $A$. Note that for any $a \in A$ and any $z \in [a, x]$ there is some $z' \in [y_i, x]$ such that $\|z - z'\| < \varepsilon$. Hence, an $\varepsilon$-cover of the union $\bigcup_{i=1}^k [y_i, x]$ is a $2\varepsilon$-cover for $\mathrm{star}(A, x)$. Since for every $i$, $\|x - y_i\| \le 2$, it follows that each interval may be covered by $2\varepsilon^{-1}$ balls of radius $\varepsilon$, and our claim follows.
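The argument is constructive; the Python sketch below (an illustration only, assuming points are represented by their values on the support of an empirical measure) turns an $\varepsilon$-cover of $A$ into a $2\varepsilon$-cover of $\mathrm{star}(A, 0)$ by discretizing each segment $[y_i, 0]$ with mesh at most $\varepsilon$.

    import numpy as np

    def star_cover(cover_points, eps):
        """Build a 2*eps-cover of star(A, 0) from an eps-cover of A (cf. Lemma 7).

        cover_points : array of shape (k, n); the eps-cover y_1, ..., y_k of A,
                       each row the values of a point of the unit ball on n atoms.
        Every t*a with a in A and t in [0, 1] lies within 2*eps (empirical L2
        distance) of one of the returned points.
        """
        centers = []
        for y in cover_points:
            norm_y = np.sqrt(np.mean(y ** 2))            # empirical L2 norm of y
            n_steps = int(np.ceil(max(norm_y, eps) / eps)) + 1
            for t in np.linspace(0.0, 1.0, n_steps):     # mesh of [0, y] of width <= eps
                centers.append(t * y)
        return centers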
Lemma 8. Let $G$ be a class of functions into $[0,1]$, set $1 \le q < \infty$ and denote $F = \{\ell_q(g) \,|\, g \in G\}$, where $T \in G$. Let $\alpha = 2 - 2/q$ and put $H = \{ \varepsilon^\alpha f / E_\mu f \,|\, f \in F \}$. Then, for every $\varepsilon > 0$ and every empirical measure $\mu_n$,
$$\log N\big(2\varepsilon, H, L_2(\mu_n)\big) \le \log\frac2\varepsilon + \log N\Big(\frac{\varepsilon}{q}, G, L_2(\mu_n)\Big).$$
Proof. Since every $h \in H$ is of the form $h = \alpha_f f$ where $0 < \alpha_f \le 1$, we have $H \subset \mathrm{star}(F, 0)$. Thus, by Lemma 7,
$$\log N\big(2\varepsilon, H, L_2(\mu_n)\big) \le \log\frac2\varepsilon + \log N\big(\varepsilon, F, L_2(\mu_n)\big).$$
Our assertion follows from Lemma 1.
Now, we turn to the improper case.
Lemma 9. Let $G$ be a convex class of functions and let $F$ be its $q$-loss class for some $1 \le q < \infty$. Assume that $\alpha$ and $H$ are as in Lemma 8. If there are $\gamma \ge 1$ and $p < 2$ such that for any empirical measure $\mu_n$ and every $\varepsilon > 0$, $\log N\big(\varepsilon, G, L_2(\mu_n)\big) \le \gamma\varepsilon^{-p}$, then there are constants $c_{p,q}$ which depend only on $p$ and $q$, such that for every $\varepsilon > 0$,
$$\log N\big(\varepsilon, H, L_2(\mu_n)\big) \le \log N\Big(\frac{\varepsilon}{4q}, G, L_2(\mu_n)\Big) + 2\log\frac4\varepsilon.$$
Proof. Again, every member of $H$ is given by $\kappa_f f$, where $0 < \kappa_f < 1$. Hence,
$$H \subset \{ \kappa\,\ell_q(g) \,|\, g \in G,\ \kappa \in [0,1] \} \equiv Q.$$
By the definition of the $q$-loss function, it is possible to decompose $Q = Q_1 + Q_2$, where
$$Q_1 = \big\{ \kappa|g - T|^q \ \big|\ \kappa \in [0,1],\ g \in G \big\} \quad\text{and}\quad Q_2 = \big\{ -\kappa|T - P_G T|^q \ \big|\ \kappa \in [0,1] \big\}.$$
Note that $T$ and $P_G T$ map $\Omega$ into $[0,1]$, hence $|T - P_G T|^q$ is bounded by 1 pointwise. Therefore, $Q_2$ is contained in an interval whose length is at most 1, implying that for any probability measure $\mu$,
$$N\big(\varepsilon, Q_2, L_2(\mu)\big) \le \frac2\varepsilon.$$
Let $V = \{ |g - T|^q \,|\, g \in G \}$. Since every $g \in G$ and $T$ map $\Omega$ into $[0,1]$, $V \subset B(L_\infty(\Omega))$. Therefore, by Lemma 1,
$$N\big(\varepsilon, V, L_2(\mu)\big) \le N\big(\varepsilon/q, G, L_2(\mu)\big)$$
for every probability measure $\mu$ and every $\varepsilon > 0$. Also, $Q_1 \subset \mathrm{star}(V, 0)$, thus for any $\varepsilon > 0$,
$$N\big(\varepsilon, Q_1, L_2(\mu)\big) \le \frac{2N\big(\frac\varepsilon2, V, L_2(\mu)\big)}{\varepsilon} \le \frac{2N\big(\frac{\varepsilon}{2q}, G, L_2(\mu)\big)}{\varepsilon},$$
which suffices, since one can combine the separate covers for $Q_1$ and $Q_2$ to form a cover for $H$.
Proof (of Theorem 6). The proof follows immediately from (3), Theorem 4 and
Lemma 8 for the proper case and Lemma 9 in the improper case.
Remark 2. It is possible to prove an analogous result to Theorem 6 when the covering numbers of $G$ are polynomial; indeed, if
$$\sup_n \sup_{\mu_n} \log N\big(\varepsilon, G, L_2(\mu_n)\big) \le C d\,\log\frac\gamma\varepsilon,$$
then the sample complexity required is
$$C^q_{G,T}(\varepsilon, \delta) = C_{q,\gamma}\, d\, \frac1\varepsilon\Big( \log\frac2\varepsilon + \log\frac2\delta \Big),$$
where $C_{q,\gamma}$ depends only on $\gamma$ and $q$.
A Convexity
In this appendix we present the definitions and preliminary results needed for
the proof of Lemma 6. All the definitions are standard and may be found in any basic textbook on functional analysis, e.g., [4].
Definition 4. Given $A, B \subset X$ we say that a nonzero functional $x^* \in X^*$ separates $A$ and $B$ if $\inf_{a\in A} x^*(a) \ge \sup_{b\in B} x^*(b)$.
It is easy to see that $x^*$ separates $A$ and $B$ if and only if there is some $\alpha \in \mathbb{R}$ such that for every $a \in A$ and $b \in B$, $x^*(b) \le \alpha \le x^*(a)$. In that case, the hyperplane $H = \{ x \,|\, x^*(x) = \alpha \}$ separates $A$ and $B$. We denote the closed "positive" halfspace $\{ x \,|\, x^*(x) \ge \alpha \}$ by $H^+$ and the "negative" one by $H^-$. By the Hahn-Banach Theorem, if $A$ and $B$ are closed, convex and disjoint there is a hyperplane (equivalently, a functional) which separates $A$ and $B$.
Definition 5. Let A ⊂ X, we say that the hyperplane H supports A in a ∈ A
if a ∈ H and either A ⊂ H + or A ⊂ H − .
By the Hahn-Banach Theorem, if B ⊂ X is a ball then for every x ∈ ∂B there
is a hyperplane which supports $B$ in $x$. Equivalently, there is some $x^*$ with $\|x^*\| = 1$ and $\alpha \in \mathbb{R}$ such that $x^*(x) = \alpha$ and for every $y \in B$, $x^*(y) \ge \alpha$.
Given a line $V = \{ tx + (1-t)y \,|\, t \in \mathbb{R} \}$, we say it supports a ball $B \subset X$ in $z$
if z ∈ V ∩ B and V ∩ int(B) = ∅. By the Hahn-Banach Theorem, if V supports
B in z, there is a hyperplane which contains V and supports B in z.
Definition 6. A Banach space $X$ is called smooth if for any $x \in X$ there is a unique functional $x^* \in X^*$ such that $\|x^*\| = 1$ and $x^*(x) = \|x\|$.
Thus, a Banach space is smooth if and only if for every $x$ such that $\|x\| = 1$,
there is a unique hyperplane which supports the unit ball in x. It is possible to
show [4] that for every 1 < q < ∞, Lq is smooth.
We shall be interested in the properties of the nearest point map onto a
compact convex set in “nice” Banach spaces, which is the subject of the following
lemma.
Lemma 10. Let X be a strictly convex space and let G ⊂ X be convex and
compact. Then every x ∈ X has a unique nearest point in G.
Proof. Fix some $x \in X$ and set $R = \inf_{g\in G}\|g - x\|$. By the compactness of $G$ and the fact that the norm is continuous, there is some $g_0 \in G$ for which the infimum is attained, i.e., $R = \|g_0 - x\|$.
To show uniqueness, assume that there is some other $g \in G$ for which $\|g - x\| = R$. Since $G$ is convex, $g_1 = (g + g_0)/2 \in G$, and by the strict convexity of the norm, $\|g_1 - x\| < R$, which is impossible.
Lemma 11. Let $X$ be a strictly convex, smooth Banach space and let $G \subset X$ be compact and convex. Assume that $x \notin G$ and set $y = P_G x$ to be the nearest point to $x$ in $G$. If $R = \|x - y\|$, then the hyperplane supporting the ball $B = B(x, R)$ at $y$ separates $B$ and $G$.
Proof. Clearly, we may assume that $x = 0$ and that $R = 1$. Therefore, if $x^*$ is the normalized functional which supports $B$ at $y$, then for every $x \in B$, $x^*(x) \le 1$.
Let $H = \{ x \,|\, x^*(x) = 1 \}$, set $H^-$ to be the open halfspace $\{ x \,|\, x^*(x) < 1 \}$ and assume that there is some $g \in G$ such that $x^*(g) < 1$. Since $G$ is convex, for every $0 \le t < 1$, $ty + (1-t)g \in G \cap H^-$. Moreover, since $y$ is the unique nearest point to $0$ in $G$ and since $X$ is strictly convex, $[g, y] \cap B = \{y\}$, otherwise there would have been some $g_1 \in G$ such that $\|g_1 - x\| < 1$. Hence, the line $V = \{ ty + (1-t)g \,|\, t \in \mathbb{R} \}$ supports $B$ in $y$. By the Hahn-Banach Theorem there is a hyperplane which contains $V$ and supports $B$ in $y$. However, this hyperplane cannot be $H$ because it contains $g$. Thus, $B$ has two different supporting hyperplanes at $y$, contrary to the assumption that $X$ is smooth.
In the following lemma, our goal is to be able to "guess" the location of some $g \in G$ based on its distance from $T \notin G$. The idea is that since $G$ is convex and since the norm of $X$ is both strictly convex and smooth, the intersection of a ball centered at the target and $G$ is contained within a "slice" of a ball, i.e., the intersection of a ball and a certain halfspace. Formally, we claim the following:
Lemma 12. Let $X$ be a strictly convex, smooth Banach space and let $G \subset X$ be compact and convex. For any $T \notin G$ let $P_G T$ be the nearest point to $T$ in $G$ and set $d = \|T - P_G T\|$. Let $x^*$ be the functional supporting $B(T, d)$ in $P_G T$ and put $H^+ = \{ x \,|\, x^*(x) \ge d + x^*(T) \}$. Then, every $g \in G$ satisfies $g \in B(T, d_g) \cap H^+$, where $d_g = \|g - T\|$.
The proof of this lemma is straightforward and is omitted.
Finally, we arrive at the proof of the main claim. We estimate the diameter
of the “slice” of G using the modulus of uniform convexity of X. This was
formulated as Lemma 6 in the main text.
Lemma 13. Let $X$ be a uniformly convex, smooth Banach space with a modulus of convexity $\delta_X$ and let $G \subset X$ be compact and convex. If $T \notin G$ and $d = \|T - P_G T\|$, then for every $g \in G$,
$$\delta_X\Big( \frac{\|g - P_G T\|}{d_g} \Big) \le 1 - \frac{d}{d_g},$$
where $d_g = \|T - g\|$.
Proof. Clearly, we may assume that $T = 0$. Using the notation of Lemma 12,
$$\|g - P_G T\| \le \mathrm{diam}\big( B(T, d_g) \cap H^+ \big).$$
Let $\tilde z_1, \tilde z_2 \in B(T, d_g) \cap H^+$, put $\varepsilon = \|\tilde z_1 - \tilde z_2\|$ and set $z_i = \tilde z_i / d_g$. Hence, $\|z_i\| \le 1$, $\|z_1 - z_2\| = \varepsilon/d_g$ and $x^*(z_i) \ge d/d_g$. Thus,
$$\frac12\|z_1 + z_2\| \ge \frac12 x^*(z_1 + z_2) \ge \frac{d}{d_g}.$$
Hence,
$$\frac{d}{d_g} \le \frac12\|z_1 + z_2\| \le 1 - \delta_X\Big( \frac{\varepsilon}{d_g} \Big),$$
and our claim follows.
References
1. M. Anthony, P.L. Bartlett: Neural Network Learning: Theoretical Foundations, Cambridge University Press, 1999.
2. B. Beauzamy: Introduction to Banach Spaces and their Geometry, Math. Studies, vol. 86, North-Holland, 1982.
3. O. Hanner: On the uniform convexity of L_p and l_p, Ark. Math. 3, 239-244, 1956.
4. P. Habala, P. Hájek, V. Zizler: Introduction to Banach Spaces, vol. I and II, Matfyzpress, Univ. Karlovy, Prague, 1996.
5. W.S. Lee: Agnostic learning and single hidden layer neural network, Ph.D. thesis, The Australian National University, 1996.
6. W.S. Lee, P.L. Bartlett, R.C. Williamson: The importance of convexity in learning with squared loss, IEEE Transactions on Information Theory 44(5), 1974-1980, 1998.
7. S. Mendelson: Rademacher averages and phase transitions in Glivenko-Cantelli classes, preprint.
8. S. Mendelson: Geometric methods in the analysis of Glivenko-Cantelli classes, this volume.
9. M. Talagrand: Sharper bounds for Gaussian and empirical processes, Annals of Probability 22(1), 28-76, 1994.
10. A.W. van der Vaart, J.A. Wellner: Weak Convergence and Empirical Processes, Springer-Verlag, 1996.
11. R.C. Williamson, A.J. Smola, B. Schölkopf: Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators, to appear in IEEE Transactions on Information Theory.