Foundations of Machine Learning
Learning with Infinite Hypothesis Sets
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Motivation
With an infinite hypothesis set H, the error bounds of the previous lecture are not informative.
Is efficient learning from a finite sample possible when H is infinite?
Our example of axis-aligned rectangles shows that it is possible.
Can we reduce the infinite case to a finite set? Project over finite samples?
Are there useful measures of complexity for infinite hypothesis sets?
This lecture
Rademacher complexity
Growth Function
VC dimension
Lower bound
Empirical Rademacher Complexity
Definition:
• G: family of functions mapping from a set Z to [a, b].
• sample S = (z_1, ..., z_m).
• σ_i's (Rademacher variables): independent uniform random variables taking values in {-1, +1}.

$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_\sigma\left[\sup_{g\in G}\frac{1}{m}\begin{pmatrix}\sigma_1\\ \vdots\\ \sigma_m\end{pmatrix}\cdot\begin{pmatrix}g(z_1)\\ \vdots\\ g(z_m)\end{pmatrix}\right] = \mathbb{E}_\sigma\left[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right]$$

(correlation with random noise)
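Since the definition is just an expectation over the σ_i's of a supremum of correlations, it can be estimated numerically for a finite class by sampling. The sketch below is a minimal illustration, not part of the lecture; the threshold class and all names are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(values, n_trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a *finite* class G.

    `values` is an (n_hypotheses, m) array whose row k is
    (g_k(z_1), ..., g_k(z_m)): the hypotheses evaluated on the sample S.
    """
    rng = np.random.default_rng(seed)
    n_h, m = values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        total += np.max(values @ sigma) / m       # sup_g (1/m) sum_i sigma_i g(z_i)
    return total / n_trials

# Illustration: 21 threshold classifiers sign(z - t) evaluated on 10 sample points.
rng = np.random.default_rng(1)
z = np.sort(rng.random(10))
H = np.array([np.where(z > t, 1.0, -1.0) for t in np.linspace(0, 1, 21)])
print(empirical_rademacher(H))
```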
Rademacher Complexity
Definitions: let G be a family of functions mapping from Z to [a, b].
• Empirical Rademacher complexity of G:
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_\sigma\left[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m \sigma_i g(z_i)\right],$$
where the σ_i's are independent uniform random variables taking values in {-1, +1} and S = (z_1, ..., z_m).
• Rademacher complexity of G:
$$\mathfrak{R}_m(G) = \mathbb{E}_{S\sim D^m}\big[\widehat{\mathfrak{R}}_S(G)\big].$$
Rademacher Complexity Bound
(Koltchinskii and Panchenko, 2002)
Theorem: Let G be a family of functions mapping from Z to [0, 1]. Then, for any δ > 0, with probability at least 1 - δ, the following holds for all g ∈ G:
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\mathfrak{R}_m(G) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$
$$\mathbb{E}[g(z)] \le \frac{1}{m}\sum_{i=1}^m g(z_i) + 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Proof: Apply McDiarmid's inequality to
$$\Phi(S) = \sup_{g\in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big).$$
• Changing one point of S changes Φ(S) by at most 1/m: for S and S' differing only in z_m,
$$\Phi(S') - \Phi(S) = \sup_{g\in G}\big\{\mathbb{E}[g] - \widehat{\mathbb{E}}_{S'}[g]\big\} - \sup_{g\in G}\big\{\mathbb{E}[g] - \widehat{\mathbb{E}}_S[g]\big\} \le \sup_{g\in G}\big\{\widehat{\mathbb{E}}_S[g] - \widehat{\mathbb{E}}_{S'}[g]\big\} = \sup_{g\in G}\frac{1}{m}\big(g(z_m) - g(z'_m)\big) \le \frac{1}{m}.$$
• Thus, by McDiarmid's inequality, with probability at least 1 - δ/2,
$$\Phi(S) \le \mathbb{E}_S[\Phi(S)] + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
• We are left with bounding the expectation.
• Series of observations:
$$\begin{aligned}
\mathbb{E}_S[\Phi(S)] &= \mathbb{E}_S\Big[\sup_{g\in G}\big(\mathbb{E}[g] - \widehat{\mathbb{E}}_S(g)\big)\Big]\\
&= \mathbb{E}_S\Big[\sup_{g\in G}\,\mathbb{E}_{S'}\big[\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big]\Big]\\
&\le \mathbb{E}_{S,S'}\Big[\sup_{g\in G}\big(\widehat{\mathbb{E}}_{S'}(g) - \widehat{\mathbb{E}}_S(g)\big)\Big] &&\text{(Jensen's ineq.)}\\
&= \mathbb{E}_{S,S'}\Big[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m\big(g(z'_i) - g(z_i)\big)\Big]\\
&= \mathbb{E}_{\sigma,S,S'}\Big[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m\sigma_i\big(g(z'_i) - g(z_i)\big)\Big] &&\text{(swap $z_i$ and $z'_i$)}\\
&\le \mathbb{E}_{\sigma,S'}\Big[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m\sigma_i g(z'_i)\Big] + \mathbb{E}_{\sigma,S}\Big[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m -\sigma_i g(z_i)\Big] &&\text{(sub-additivity of sup)}\\
&= 2\,\mathbb{E}_{\sigma,S}\Big[\sup_{g\in G}\frac{1}{m}\sum_{i=1}^m\sigma_i g(z_i)\Big] = 2\mathfrak{R}_m(G).
\end{aligned}$$
• Now, changing one point of S makes $\widehat{\mathfrak{R}}_S(G)$ vary by at most 1/m. Thus, again by McDiarmid's inequality, with probability at least 1 - δ/2,
$$\mathfrak{R}_m(G) \le \widehat{\mathfrak{R}}_S(G) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
• Thus, by the union bound, with probability at least 1 - δ,
$$\Phi(S) \le 2\widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Loss Functions - Hypothesis Set
Proposition: Let H be a family of functions taking values in {-1, +1}, and let G be the family of zero-one loss functions of H: $G = \{(x, y)\mapsto 1_{h(x)\neq y} : h\in H\}$. Then,
$$\mathfrak{R}_m(G) = \frac{1}{2}\mathfrak{R}_m(H).$$
Proof:
$$\begin{aligned}
\mathfrak{R}_m(G) &= \mathbb{E}_{S,\sigma}\Big[\sup_{h\in H}\frac{1}{m}\sum_{i=1}^m\sigma_i\,1_{h(x_i)\neq y_i}\Big]
= \mathbb{E}_{S,\sigma}\Big[\sup_{h\in H}\frac{1}{m}\sum_{i=1}^m\sigma_i\,\frac{1 - y_i h(x_i)}{2}\Big]\\
&= \frac{1}{2}\,\mathbb{E}_{S,\sigma}\Big[\sup_{h\in H}\frac{1}{m}\sum_{i=1}^m -\sigma_i y_i h(x_i)\Big]
= \frac{1}{2}\,\mathbb{E}_{S,\sigma}\Big[\sup_{h\in H}\frac{1}{m}\sum_{i=1}^m\sigma_i h(x_i)\Big] = \frac{1}{2}\mathfrak{R}_m(H).
\end{aligned}$$
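The same argument applies conditionally on the sample, so the identity also relates the empirical quantities, $\widehat{\mathfrak{R}}_S(G) = \frac{1}{2}\widehat{\mathfrak{R}}_S(H)$, and can be sanity-checked numerically. A minimal sketch, assuming a small random {-1, +1}-valued class; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_h, n_trials = 20, 15, 20000

H_vals = rng.choice([-1.0, 1.0], size=(n_h, m))   # rows: (h(x_1), ..., h(x_m))
y = rng.choice([-1.0, 1.0], size=m)               # labels of the fixed sample
G_vals = (H_vals != y).astype(float)              # zero-one losses 1_{h(x_i) != y_i}

def emp_rad(values):
    """Monte Carlo estimate of E_sigma[ sup (1/m) sum_i sigma_i v_i ]."""
    sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
    return np.mean(np.max(sigma @ values.T, axis=1)) / m

# The two estimates should agree up to Monte Carlo error.
print(emp_rad(G_vals), 0.5 * emp_rad(H_vals))
```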
Generalization Bounds - Rademacher
Corollary: Let H be a family of functions taking values in {-1, +1}. Then, for any δ > 0, with probability at least 1 - δ, for any h ∈ H,
$$R(h) \le \widehat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$$
$$R(h) \le \widehat{R}(h) + \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$
Remarks
The first bound is distribution-dependent and the second is data-dependent, which makes them attractive.
But how do we compute the empirical Rademacher complexity?
Computing $\mathbb{E}_\sigma\big[\sup_{h\in H}\frac{1}{m}\sum_{i=1}^m\sigma_i h(x_i)\big]$ requires solving ERM problems, which is typically computationally hard.
Is there a relation with combinatorial measures that are easier to compute?
This lecture
Rademacher complexity
Growth Function
VC dimension
Lower bound
Growth Function
Definition: the growth function $\Pi_H:\mathbb{N}\to\mathbb{N}$ for a hypothesis set H is defined by
$$\forall m\in\mathbb{N},\quad \Pi_H(m) = \max_{\{x_1,\ldots,x_m\}\subseteq X}\Big|\big\{\big(h(x_1),\ldots,h(x_m)\big) : h\in H\big\}\Big|.$$
Thus, $\Pi_H(m)$ is the maximum number of ways m points can be classified using H.
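For simple classes the growth function can be computed by directly enumerating dichotomies. A small sketch for intervals on the real line (not from the lecture; names are illustrative): it returns $1 + m + \binom{m}{2}$ distinct labelings on any m distinct points.

```python
import numpy as np
from itertools import combinations

def interval_dichotomies(points):
    """All labelings of 1-D points realizable by an interval [a, b]
    (+1 inside, -1 outside): the positive points form a contiguous block."""
    m = len(points)
    order = np.argsort(points)
    labelings = {tuple([-1] * m)}                      # empty interval
    for i, j in combinations(range(m + 1), 2):         # block = sorted positions [i, j)
        lab = [-1] * m
        for k in range(i, j):
            lab[order[k]] = 1
        labelings.add(tuple(lab))
    return labelings

for m in (2, 3, 5, 10):
    pts = np.random.default_rng(0).random(m)
    print(m, len(interval_dichotomies(pts)))           # 1 + m + m(m-1)/2
```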
Massart’s Lemma
(Massart, 2000)
Theorem: Let $A\subseteq\mathbb{R}^m$ be a finite set, with $R = \max_{x\in A}\|x\|_2$. Then, the following holds:
$$\mathbb{E}_\sigma\Big[\frac{1}{m}\sup_{x\in A}\sum_{i=1}^m\sigma_i x_i\Big] \le \frac{R\sqrt{2\log|A|}}{m}.$$
Proof: for any t > 0,
$$\begin{aligned}
\exp\Big(t\,\mathbb{E}_\sigma\Big[\sup_{x\in A}\sum_{i=1}^m\sigma_i x_i\Big]\Big)
&\le \mathbb{E}_\sigma\Big[\exp\Big(t\sup_{x\in A}\sum_{i=1}^m\sigma_i x_i\Big)\Big] &&\text{(Jensen's ineq.)}\\
&= \mathbb{E}_\sigma\Big[\sup_{x\in A}\exp\Big(t\sum_{i=1}^m\sigma_i x_i\Big)\Big]
\le \sum_{x\in A}\mathbb{E}_\sigma\Big[\exp\Big(t\sum_{i=1}^m\sigma_i x_i\Big)\Big]\\
&= \sum_{x\in A}\prod_{i=1}^m\mathbb{E}_{\sigma_i}\big[\exp(t\,\sigma_i x_i)\big]
\le \sum_{x\in A}\prod_{i=1}^m\exp\Big(\frac{t^2(2|x_i|)^2}{8}\Big) &&\text{(Hoeffding's ineq.)}\\
&\le |A|\,e^{\frac{t^2R^2}{2}}.
\end{aligned}$$
• Taking the log yields:
$$\mathbb{E}_\sigma\Big[\sup_{x\in A}\sum_{i=1}^m\sigma_i x_i\Big]\le\frac{\log|A|}{t}+\frac{tR^2}{2}.$$
• Minimizing the bound by choosing $t=\frac{\sqrt{2\log|A|}}{R}$ gives
$$\mathbb{E}_\sigma\Big[\sup_{x\in A}\sum_{i=1}^m\sigma_i x_i\Big]\le R\sqrt{2\log|A|}.$$
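Massart's bound is easy to probe numerically: draw a finite set A of vectors, estimate the left-hand side by sampling σ, and compare with $R\sqrt{2\log|A|}$. A rough sketch under arbitrary, illustrative choices of A (the unnormalized form of the inequality, i.e., both sides multiplied by m):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_vectors, n_trials = 50, 30, 20000

A = rng.normal(size=(n_vectors, m))                  # a finite subset of R^m
R = np.max(np.linalg.norm(A, axis=1))                # R = max_{x in A} ||x||_2

sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))  # Rademacher vectors
lhs = np.mean(np.max(sigma @ A.T, axis=1))           # ~ E_sigma[ sup_{x in A} sum_i sigma_i x_i ]
rhs = R * np.sqrt(2 * np.log(n_vectors))             # Massart's bound
print(lhs, "<=", rhs)
```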
Growth Function Bound on Rad. Complexity
Corollary: Let G be a family of functions taking values in {-1, +1}. Then, the following holds:
$$\mathfrak{R}_m(G)\le\sqrt{\frac{2\log\Pi_G(m)}{m}}.$$
Proof: since each vector $(g(z_1),\ldots,g(z_m))$ has norm at most $\sqrt{m}$, Massart's lemma gives
$$\widehat{\mathfrak{R}}_S(G) = \mathbb{E}_\sigma\left[\sup_{g\in G}\frac{1}{m}\begin{pmatrix}\sigma_1\\ \vdots\\ \sigma_m\end{pmatrix}\cdot\begin{pmatrix}g(z_1)\\ \vdots\\ g(z_m)\end{pmatrix}\right]
\le \frac{\sqrt{m}\sqrt{2\log\big|\{(g(z_1),\ldots,g(z_m)) : g\in G\}\big|}}{m}
\le \frac{\sqrt{m}\sqrt{2\log\Pi_G(m)}}{m} = \sqrt{\frac{2\log\Pi_G(m)}{m}}.$$
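To see the corollary in action, one can compare a Monte Carlo estimate of the empirical Rademacher complexity of a small class with $\sqrt{2\log\Pi_G(m)/m}$. A sketch using threshold classifiers (illustrative, not from the lecture), which realize at most m + 1 distinct dichotomies on m points:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_trials = 40, 20000

# Threshold classifiers sign(z - t): at most m + 1 distinct dichotomies on m points.
z = np.sort(rng.random(m))
cuts = np.concatenate(([-np.inf], (z[:-1] + z[1:]) / 2))
G = np.array([np.where(z > t, 1.0, -1.0) for t in cuts] + [-np.ones(m)])

sigma = rng.choice([-1.0, 1.0], size=(n_trials, m))
emp_rad = np.mean(np.max(sigma @ G.T, axis=1)) / m   # Monte Carlo estimate of R_S(G)
bound = np.sqrt(2 * np.log(len(G)) / m)              # sqrt(2 log Pi_G(m) / m)
print(emp_rad, "<=", bound)
```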
Generalization Bound - Growth Function
Corollary: Let H be a family of functions taking values in {-1, +1}. Then, for any δ > 0, with probability at least 1 - δ, for any h ∈ H,
$$R(h)\le\widehat{R}(h)+\sqrt{\frac{2\log\Pi_H(m)}{m}}+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
But how do we compute the growth function?
Relationship with the VC-dimension (Vapnik-Chervonenkis dimension).
This lecture
Rademacher complexity
Growth Function
VC dimension
Lower bound
VC Dimension
(Vapnik & Chervonenkis, 1968-1971; Vapnik, 1982, 1995, 1998)
Definition: the VC-dimension of a hypothesis set H is defined by
$$\mathrm{VCdim}(H)=\max\{m : \Pi_H(m)=2^m\}.$$
Thus, the VC-dimension is the size of the largest set that can be fully shattered by H.
It is a purely combinatorial notion.
Examples
In the following, we determine the VC dimension of several hypothesis sets.
To give a lower bound d on VCdim(H), it suffices to show that a set S of cardinality d can be shattered by H.
To give an upper bound, we need to prove that no set S of cardinality d + 1 can be shattered by H, which is typically more difficult.
Intervals of The Real Line
Observations:
• Any set of two points can be shattered by intervals: the four dichotomies (-, -), (-, +), (+, -), (+, +) are all realizable.
• No set of three points can be shattered, since the dichotomy "+ - +" is not realizable (by definition of intervals).
• Thus, VCdim(intervals in R) = 2.
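Both observations can be verified by brute force: enumerate candidate intervals with endpoints on a small grid around the points and check which labelings appear. A sketch under illustrative choices of points and grid:

```python
from itertools import product

def shattered_by_intervals(points, grid):
    """Check whether every labeling of `points` is realized by some interval [a, b]
    with endpoints taken from `grid` (a coarse search, sufficient here)."""
    realizable = {tuple(1 if a <= x <= b else -1 for x in points)
                  for a in grid for b in grid if a <= b}
    realizable.add(tuple([-1] * len(points)))          # the empty interval
    return all(lab in realizable for lab in product([-1, 1], repeat=len(points)))

grid = [x + d for x in (0.1, 0.4, 0.7) for d in (-0.05, 0.05)]
print(shattered_by_intervals([0.1, 0.4], grid))        # True:  two points are shattered
print(shattered_by_intervals([0.1, 0.4, 0.7], grid))   # False: "+ - +" is not realizable
```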
Hyperplanes
Observations:
• Any three non-collinear points in the plane can be shattered by hyperplanes (lines).
• No set of four points can be shattered: if one point lies in the convex hull of the other three, label it negatively and the rest positively; otherwise the four points form a convex quadrilateral, and labeling opposite vertices with the same sign gives a dichotomy no hyperplane can realize.
• Thus, VCdim(hyperplanes in $\mathbb{R}^d$) = d + 1.
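A quick check that three non-collinear points are shattered by affine hyperplanes: for each of the 8 labelings, solve a 3x3 linear system so that w·x + b equals the label exactly at each point; the signs then match. A minimal sketch with an illustrative point set:

```python
import numpy as np
from itertools import product

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points
A = np.hstack([X, np.ones((3, 1))])                  # rows (x1, x2, 1); invertible here

shattered = True
for labels in product([-1.0, 1.0], repeat=3):
    y = np.array(labels)
    w = np.linalg.solve(A, y)                        # w.x + b = y_i exactly at each point
    shattered &= bool(np.all(np.sign(A @ w) == y))
print(shattered)                                     # True: all 8 dichotomies realizable
```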
Axis-Aligned Rectangles in the Plane
Observations:
• Four points, one extremal in each direction (e.g., a diamond configuration), can be shattered: all 16 dichotomies are realizable by axis-aligned rectangles.
• No set of five points can be shattered: label negatively the point that is not extremal along any side (leftmost, rightmost, topmost, bottommost) and the other four positively; any rectangle containing the four positive points must also contain the fifth.
• Thus, VCdim(axis-aligned rectangles) = 4.
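The same brute-force idea applies to axis-aligned rectangles: the diamond configuration is shattered, and adding a fifth point inside defeats shattering, as in the argument above. A sketch with illustrative coordinates and grid:

```python
import numpy as np
from itertools import product

def shattered_by_rectangles(points, grid):
    """Check whether every labeling of `points` is realized by some axis-aligned
    rectangle [x1, x2] x [y1, y2] with corners taken from `grid`."""
    realizable = {tuple(1 if (x1 <= px <= x2 and y1 <= py <= y2) else -1
                        for px, py in points)
                  for x1, x2 in product(grid, repeat=2) if x1 <= x2
                  for y1, y2 in product(grid, repeat=2) if y1 <= y2}
    return all(lab in realizable for lab in product([-1, 1], repeat=len(points)))

grid = np.linspace(-1.5, 1.5, 7)
diamond = [(1, 0), (-1, 0), (0, 1), (0, -1)]
print(shattered_by_rectangles(diamond, grid))                # True
print(shattered_by_rectangles(diamond + [(0, 0)], grid))     # False: the center is trapped
```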
Convex Polygons in the Plane
Observations:
• 2d + 1 points on a circle can be shattered by a convex d-gon: if |positive points| < |negative points|, take the polygon whose vertices are the positive points; if |positive points| > |negative points|, take the polygon whose edges are tangent to the circle at the negative points.
• It can be shown that choosing the points on the circle maximizes the number of possible dichotomies. Thus, VCdim(convex d-gons) = 2d + 1. Also, VCdim(convex polygons) = +∞.
Sine Functions
Observations:
• Any finite set of points on the real line can be shattered by $\{t\mapsto\sin(\omega t) : \omega\in\mathbb{R}\}$.
• Thus, VCdim(sine functions) = +∞.
Sauer’s Lemma
(Vapnik & Chervonenkis, 1968-1971; Sauer, 1972)
Theorem: let H be a hypothesis set with VCdim(H) = d. Then, for all m ∈ ℕ,
$$\Pi_H(m)\le\sum_{i=0}^d\binom{m}{i}.$$
Proof: the proof is by induction on m + d. The statement clearly holds for m = 1 and d = 0 or d = 1. Assume that it holds for (m - 1, d - 1) and (m - 1, d).
• Fix a set $S = \{x_1,\ldots,x_m\}$ with $\Pi_H(m)$ dichotomies and let $G = H_{|S}$ be the set of concepts H induces by restriction to S.
• Consider the following families over $S' = \{x_1,\ldots,x_{m-1}\}$ (identifying each concept with the set of points it labels 1):
$$G_1 = G_{|S'}\qquad\qquad G_2 = \{g'\subseteq S' : (g'\in G)\wedge(g'\cup\{x_m\}\in G)\}.$$
(The slide illustrates $G_1$ and $G_2$ with a table of dichotomies over $x_1,\ldots,x_{m-1},x_m$: $G_2$ collects the restrictions to $S'$ that appear both with $x_m$ labeled 0 and with $x_m$ labeled 1.)
• Observe that $|G_1| + |G_2| = |G|$.
• Since $\mathrm{VCdim}(G_1)\le d$, by the induction hypothesis,
$$|G_1|\le\Pi_{G_1}(m-1)\le\sum_{i=0}^d\binom{m-1}{i}.$$
• By definition of $G_2$, if a set $Z\subseteq S'$ is shattered by $G_2$, then the set $Z\cup\{x_m\}$ is shattered by $G$. Thus, $\mathrm{VCdim}(G_2)\le\mathrm{VCdim}(G)-1=d-1$, and by the induction hypothesis,
$$|G_2|\le\Pi_{G_2}(m-1)\le\sum_{i=0}^{d-1}\binom{m-1}{i}.$$
• Thus,
$$|G| = |G_1| + |G_2| \le \sum_{i=0}^d\binom{m-1}{i} + \sum_{i=0}^{d-1}\binom{m-1}{i} = \sum_{i=0}^d\left[\binom{m-1}{i}+\binom{m-1}{i-1}\right] = \sum_{i=0}^d\binom{m}{i}.$$
Sauer’s Lemma - Consequence
Corollary: let H be a hypothesis set with VCdim(H) = d. Then, for all m ≥ d,
$$\Pi_H(m)\le\Big(\frac{em}{d}\Big)^d = O(m^d).$$
Proof:
$$\begin{aligned}
\sum_{i=0}^d\binom{m}{i}&\le\sum_{i=0}^d\binom{m}{i}\Big(\frac{m}{d}\Big)^{d-i}
\le\sum_{i=0}^m\binom{m}{i}\Big(\frac{m}{d}\Big)^{d-i}\\
&=\Big(\frac{m}{d}\Big)^d\sum_{i=0}^m\binom{m}{i}\Big(\frac{d}{m}\Big)^i
=\Big(\frac{m}{d}\Big)^d\Big(1+\frac{d}{m}\Big)^m
\le\Big(\frac{m}{d}\Big)^d e^d.
\end{aligned}$$
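A numeric check of both steps: the binomial sum from Sauer's lemma and its $(em/d)^d$ relaxation. For intervals (d = 2), the enumeration sketched earlier returns exactly the binomial sum, so the lemma is tight there. Sketch (illustrative values of d and m):

```python
from math import comb, e

def sauer_sum(m, d):
    """Sum_{i=0}^{d} C(m, i): the bound on Pi_H(m) from Sauer's lemma."""
    return sum(comb(m, i) for i in range(d + 1))

for d in (2, 4, 10):
    for m in (d, 2 * d, 10 * d, 100 * d):
        print(f"d={d:2d} m={m:5d}  sum={sauer_sum(m, d):>16d}  (em/d)^d={(e * m / d) ** d:14.4g}")
```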
Remarks
Remarkable property of the growth function:
• either VCdim(H) = d < +∞ and $\Pi_H(m) = O(m^d)$,
• or VCdim(H) = +∞ and $\Pi_H(m) = 2^m$.
Generalization Bound - VC Dimension
Corollary: Let H be a family of functions taking values in {-1, +1} with VC dimension d. Then, for any δ > 0, with probability at least 1 - δ, for any h ∈ H,
$$R(h)\le\widehat{R}(h)+\sqrt{\frac{2d\log\frac{em}{d}}{m}}+\sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
Proof: the previous corollary combined with Sauer's lemma.
Note: the general form of the result is
$$R(h)\le\widehat{R}(h)+O\left(\sqrt{\frac{\log(m/d)}{m/d}}\right).$$
Comparison - Standard VC Bound
(Vapnik & Chervonenkis, 1971; Vapnik, 1982)
Theorem: Let H be a family of functions taking values in {-1, +1} with VC dimension d. Then, for any δ > 0, with probability at least 1 - δ, for any h ∈ H,
$$R(h)\le\widehat{R}(h)+\sqrt{\frac{8d\log\frac{2em}{d}+8\log\frac{4}{\delta}}{m}}.$$
Proof: derived from the growth function bound
$$\Pr\Big[\big|R(h)-\widehat{R}(h)\big|>\epsilon\Big]\le 4\,\Pi_H(2m)\exp\Big(-\frac{m\epsilon^2}{8}\Big).$$
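The two complexity terms can be compared numerically as functions of m for fixed d and δ, taking the bounds exactly as stated on the two preceding slides (empirical error term omitted). A sketch with illustrative values of d and δ:

```python
import numpy as np

def rademacher_vc_term(m, d, delta):
    # sqrt(2 d log(em/d) / m) + sqrt(log(1/delta) / (2m))   (Rademacher/Sauer-based bound)
    return np.sqrt(2 * d * np.log(np.e * m / d) / m) + np.sqrt(np.log(1 / delta) / (2 * m))

def standard_vc_term(m, d, delta):
    # sqrt((8 d log(2em/d) + 8 log(4/delta)) / m)           (standard VC bound)
    return np.sqrt((8 * d * np.log(2 * np.e * m / d) + 8 * np.log(4 / delta)) / m)

d, delta = 10, 0.05
for m in (10**3, 10**4, 10**5, 10**6):
    print(m, rademacher_vc_term(m, d, delta), standard_vc_term(m, d, delta))
```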
This lecture
Rademacher complexity
Growth Function
VC dimension
Lower bound
VCDim Lower Bound - Realizable Case
(Ehrenfeucht et al., 1988)
Theorem: let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L, there exist a distribution D and a target function f ∈ H such that
$$\Pr_{S\sim D^m}\Big[R_D(h_S, f)>\frac{d-1}{32m}\Big]\ge\frac{1}{100}.$$
Proof: choose D such that L can do no better than tossing a coin for some points.
• Let $X = \{x_0, x_1,\ldots,x_{d-1}\}$ be a set fully shattered. For any ε > 0, define D with support X by
$$\Pr_D[x_0]=1-8\epsilon\quad\text{and}\quad\forall i\in[1,d-1],\ \Pr_D[x_i]=\frac{8\epsilon}{d-1}.$$
• We can assume without loss of generality that L makes no error on $x_0$.
• For a sample S, let $\bar{S}$ denote the set of its elements falling in $X_1 = \{x_1,\ldots,x_{d-1}\}$, and let $\mathcal{S}$ be the set of samples of size m with at most (d - 1)/2 points in $X_1$.
• Fix a sample $S\in\mathcal{S}$. Using $|X\setminus\bar{S}|\ge(d-1)/2$,
$$\mathbb{E}_{f\sim U}\big[R_D(h_S,f)\big]=\sum_{x\in X}\sum_f 1_{h_S(x)\neq f(x)}\Pr[x]\Pr[f]\ge\sum_{x\notin\bar S}\sum_f 1_{h_S(x)\neq f(x)}\Pr[f]\Pr[x]=\frac{1}{2}\sum_{x\notin\bar S}\Pr[x]\ge\frac{1}{2}\cdot\frac{d-1}{2}\cdot\frac{8\epsilon}{d-1}=2\epsilon.$$
• Since the inequality holds for all $S\in\mathcal{S}$, it also holds in expectation: $\mathbb{E}_{S,f\sim U}[R_D(h_S,f)]\ge 2\epsilon$. This implies that there exists a labeling $f_0$ such that $\mathbb{E}_S[R_D(h_S,f_0)]\ge 2\epsilon$.
• Since $\Pr_D[X\setminus\{x_0\}]\le 8\epsilon$, we also have $R_D(h_S,f_0)\le 8\epsilon$. Thus,
$$2\epsilon\le\mathbb{E}_S[R_D(h_S,f_0)]\le 8\epsilon\Pr_{S\in\mathcal S}\big[R_D(h_S,f_0)\ge\epsilon\big]+\epsilon\Big(1-\Pr_{S\in\mathcal S}\big[R_D(h_S,f_0)\ge\epsilon\big]\Big).$$
• Collecting terms in $\Pr_{S\in\mathcal S}[R_D(h_S,f_0)\ge\epsilon]$, we obtain:
$$\Pr_{S\in\mathcal S}\big[R_D(h_S,f_0)\ge\epsilon\big]\ge\frac{1}{7\epsilon}(2\epsilon-\epsilon)=\frac{1}{7}.$$
• Thus, the probability over all samples S (not necessarily in $\mathcal{S}$) can be lower bounded as
$$\Pr_S\big[R_D(h_S,f_0)\ge\epsilon\big]\ge\Pr_{S\in\mathcal S}\big[R_D(h_S,f_0)\ge\epsilon\big]\Pr[\mathcal S]\ge\frac{1}{7}\Pr[\mathcal S].$$
• This leads us to seeking a lower bound on $\Pr[\mathcal S]$. The probability that more than (d - 1)/2 points are drawn in $X_1$ in a sample of size m verifies the (multiplicative) Chernoff bound: for any γ > 0,
$$1-\Pr[\mathcal S]=\Pr\big[S_m\ge 8\epsilon m(1+\gamma)\big]\le e^{-8\epsilon m\gamma^2/3}.$$
• Thus, for $\epsilon=(d-1)/(32m)$ and $\gamma=1$,
$$\Pr\Big[S_m\ge\frac{d-1}{2}\Big]\le e^{-(d-1)/12}\le e^{-1/12}\le 1-7\delta,$$
for $\delta\le .01$. Thus, $\Pr[\mathcal S]\ge 7\delta$ and
$$\Pr_S\big[R_D(h_S,f_0)\ge\epsilon\big]\ge\delta.$$
Agnostic PAC Model
Definition: a concept class C is PAC-learnable if there exists a learning algorithm L such that:
• for all c ∈ C, ε > 0, δ > 0, and all distributions D,
$$\Pr_{S\sim D}\Big[R(h_S)-\inf_{h\in H}R(h)\le\epsilon\Big]\ge 1-\delta,$$
• for samples S of size m = poly(1/ε, 1/δ) for a fixed polynomial.
VCDim Lower Bound - Non-Realizable Case
(Anthony and Bartlett, 1999)
Theorem: let H be a hypothesis set with VC dimension d > 1. Then, for any learning algorithm L, there exists a distribution D over X × {0, 1} such that
$$\Pr_{S\sim D^m}\Big[R_D(h_S)-\inf_{h\in H}R_D(h)>\sqrt{\frac{d}{320m}}\Big]\ge\frac{1}{64}.$$
Equivalently, for any learning algorithm, the sample complexity verifies
$$m\ge\frac{d}{320\,\epsilon^2}.$$
References
• Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
• Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), Volume 36, Issue 4, 1989.
• A. Ehrenfeucht, David Haussler, Michael Kearns, and Leslie Valiant. A general lower bound on the number of examples needed for learning. Proceedings of the 1st COLT, pp. 139-154, 1988.
• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.
• Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245-303, 2000.
• N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145-147, 1972.
References
• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, 1982.
• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
• Vladimir N. Vapnik and Alexey Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, vol. 16, no. 2, pp. 264-280, 1971.
• Vladimir N. Vapnik and Alexey Chervonenkis. Theory of Pattern Recognition. Nauka, Moscow (in Russian), 1974.