Computable Shell Decomposition Bounds
John Langford
School of Computer Science
Carnegie Mellon University
jcl@cs.cmu.edu
www.cs.cmu.edu/~jcl

David McAllester
AT&T Labs-Research
dmac@research.att.com
www.research.att.com/~dmac
Abstract
Haussler, Kearns, Seung and Tishby introduced
the notion of a shell decomposition of the union
bound as a means of understanding certain empirical phenomena in learning curves such as
phase transitions. Here we use a variant of
their ideas to derive an upper bound on the
generalization error of a hypothesis computable
from its training error and the histogram of
training errors for the hypotheses in the class.
In most cases this new bound is significantly
tighter than traditional bounds computed from
the training error and the cardinality or VC
dimension of the class. Our results can also
be viewed as providing PAC theoretical foundations for a model selection algorithm proposed by Scheffer and Joachims.
1 Introduction
For an arbitrary finite hypothesis class we consider the
hypothesis of minimal training error. We give a new
upper bound on the generalization error of this hypothesis computable from the training error of the hypothesis and the histogram of the training errors of the other
hypotheses in the class. This new bound is typically
much tighter than more traditional upper bounds computed from the training error and cardinality or VC dimension of the class.
As a simple example, suppose that we observe that all but one of the training errors in a hypothesis class are near 1/2 and that the remaining training error is much smaller. Furthermore, suppose that the sample size is large enough (relative to the size of the hypothesis class) that with high confidence we have that, for all hypotheses in the class, the true (generalization) error of a hypothesis is within $\Delta$ of its training error. This implies that, with high confidence, hypotheses with training error near 1/2 have true error in $[1/2 - \Delta,\, 1/2 + \Delta]$.
Intuitively, we would expect the true error of the hypothesis with minimum training error to be very near its training error, rather than simply less than its training error plus $\Delta$, because none of the hypotheses that produced a training error near 1/2 could have a true error close enough to the minimum for there to be a significant probability of producing such a small training error.
The bound presented here validates this intuition. We show that hypotheses with training error near 1/2 can be ignored when calculating an "effective size" of the class for hypotheses with small training error. This smaller effective class size allows us to calculate a tighter bound on the difference between training error and true error for hypotheses with small training error. The new bound is
proved using a distribution-dependent application of the
union bound similar in spirit to the shell decomposition
introduced by Haussler, Kearns, Seung and Tishby [1].
We actually give two upper bounds on generalization error — an uncomputable bound and a computable
bound. The uncomputable bound is a function of the unknown distribution of true error rates of the hypotheses
in the class. The computable bound is, essentially, the
uncomputable bound with the unknown distribution of
true errors replaced by the known histogram of training
errors. Our main contribution is that this replacement is
sound, i.e., the computable version remains, with high
confidence, an upper bound on generalization error.
When considering asymptotic properties of learning theory bounds it is important to take limits in which the cardinality (or VC dimension) of the hypothesis class is allowed to grow with the size of the sample. In practice more data typically justifies a larger hypothesis class. For example, the size of a decision tree is generally proportional to the amount of training data available. Here we analyze the asymptotic properties of our bounds by considering an infinite sequence of hypothesis classes $\mathcal{H}_m$, one for each sample size $m$, such that $\frac{\ln|\mathcal{H}_m|}{m}$ approaches a limit larger than zero. This kind of asymptotic analysis provides a clear account of the improvement achieved by bounds that are functions of error rate distributions rather than simply the size (or VC dimension) of the class.
We give a lower bound on generalization error showing that the uncomputable upper bound is asymptotically
as tight as possible — any upper bound on generalization error given as a function of the unknown distribution of true error rates must asymptotically be greater
than or equal to our uncomputable upper bound. Our
lower bound on generalization error also shows that there
is essentially no loss in working with an upper bound
computed from the true error distribution rather than expectations computed from this distribution as used by
Scheffer and Joachims [4].
Asymptotically, the computable bound is simply the uncomputable bound with the unknown distribution of true errors replaced with the observed histogram of training errors. Unfortunately, we can show that in limits where $\frac{\ln|\mathcal{H}_m|}{m}$ converges to a value greater than zero, the histogram of training errors need not converge to the distribution of true errors — the histogram of training errors is a "smeared out" version of the distribution
of true errors. This smearing loosens the bound even
in the large-sample asymptotic limit. We give a precise asymptotic characterization of this smearing effect
for the case where distinct hypotheses have independent
training errors. In spite of the divergence between the
uncomputable and computable bounds, the computable
bound is still significantly tighter than classical bounds
not involving error distributions.
The computable bound can be used for model selection. In the case of model selection we assume an infinite sequence $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots$ of finite model classes where each $\mathcal{H}_k$ is a finite class with $\ln|\mathcal{H}_k|$ growing linearly in $k$. To perform model selection we find the hypothesis of minimal training error in each class and use the
computable bound to bound its generalization error. We
can then select, among these, the model with the smallest upper bound on generalization error. Scheffer and
Joachims propose (without formal justification) replacing the distribution of true errors with the histogram of
training errors. Under this replacement, the model selection algorithm based on our computable upper bound
is asymptotically identical to the algorithm proposed by
Scheffer and Joachims.
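To make the procedure concrete, the following is a minimal sketch of this model-selection loop. It assumes a list of finite classes, a labeled sample, and some bound-computing routine such as the computable bound of this paper; the names `classes`, `generalization_bound`, and `training_error` are illustrative placeholders, not notation from the paper.

```python
def training_error(h, sample):
    """Fraction of labeled pairs (x, y) in the sample that h misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def select_model(classes, sample, generalization_bound, delta):
    """For each finite class, take its minimal-training-error hypothesis,
    bound its generalization error, and return the hypothesis (and bound)
    with the smallest upper bound."""
    best_h, best_bound = None, float("inf")
    for cls in classes:
        h = min(cls, key=lambda hyp: training_error(hyp, sample))
        ub = generalization_bound(h, cls, sample, delta)
        if ub < best_bound:
            best_h, best_bound = h, ub
    return best_h, best_bound
```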
The shell decomposition is a distribution-dependent use of the union bound. Distribution-dependent uses of the union bound have been previously exploited in so-called self-bounding algorithms. Freund [5] defines, for a given learning algorithm and data distribution, a set of hypotheses such that, with high probability over the sample, the algorithm will always return a hypothesis in that set. Although this set is defined in terms of the unknown data distribution, Freund gives a way of computing, from the given algorithm and the sample, a second set that, with high confidence, contains the first; hence the "effective size" of the hypothesis class is bounded by the size of this computable set.
Langford and Blum [7] give a more practical version of this algorithm. Given an algorithm and data distribution they conceptually define a weighting over the possible executions of the algorithm. Although the data distribution is unknown, they give a way of computing a lower bound on the weight of the particular execution of the algorithm generated by the sample at hand. In this paper we consider distribution-dependent union bounds defined independently of any particular learning algorithm.
2 Mathematical Preliminaries
For an arbitrary measure on an arbitrary sample space we use the notation $\forall^{\delta} S\; \Phi(S)$ to mean that with probability at least $1-\delta$ over the choice of the sample $S$ we have that $\Phi(S)$ holds. In practice $S$ is the training sample of a learning algorithm. Note that $\forall\delta\; \forall^{\delta} S\; \Phi(S,\delta)$ does not imply $\forall^{\delta} S\; \forall\delta\; \Phi(S,\delta)$. If $U$ is a finite set, and for all $x \in U$ we have the assertion $\forall\delta\; \forall^{\delta} S\; \Phi(S,\delta,x)$, then by a standard application of the union bound we have the assertion $\forall\delta\; \forall^{\delta} S\; \forall x \in U\; \Phi(S,\frac{\delta}{|U|},x)$. We will call this the quantification rule. If $\forall\delta\; \forall^{\delta} S\; \Phi_1(S,\delta)$ and $\forall\delta\; \forall^{\delta} S\; \Phi_2(S,\delta)$ then by a standard application of the union bound we have $\forall\delta\; \forall^{\delta} S\; \Phi_1(S,\frac{\delta}{2}) \wedge \Phi_2(S,\frac{\delta}{2})$. We will call this the conjunction rule.
The KL-divergence of $p$ from $q$, denoted $D(p\,\|\,q)$, is $p\ln\frac{p}{q} + (1-p)\ln\frac{1-p}{1-q}$, with $0\ln\frac{0}{q} = 0$ and $p\ln\frac{p}{0} = \infty$. Let $\hat{p}$ be the fraction of heads in a sequence of $m$ tosses of a biased coin where the probability of heads is $p$. For $q \ge p$ we have the following inequality, given by Chernoff in 1952 [3].
$$\Pr(\hat{p} \ge q) \;\le\; e^{-m\,D(q\,\|\,p)} \qquad (1)$$
This bound can be rewritten as follows.
$$\forall\delta\;\forall^{\delta} S:\;\; D(\max(\hat{p},\,p)\,\|\,p) \;\le\; \frac{\ln\frac{1}{\delta}}{m} \qquad (2)$$
To derive (2) from (1) note that $\Pr\!\left(D(\max(\hat{p},p)\,\|\,p) \ge \frac{\ln\frac{1}{\delta}}{m}\right)$ equals $\Pr(\hat{p} \ge q)$ where $q \ge p$ and $D(q\,\|\,p) = \frac{\ln\frac{1}{\delta}}{m}$. By (1) we then have that this probability is no larger than $e^{-m\,D(q\|p)} = \delta$. It is just as easy to derive (1) from (2), so the two statements are equivalent. By duality, i.e., by considering the problem defined by replacing $p$ by $1-p$, we get the following.
$$\forall\delta\;\forall^{\delta} S:\;\; D(\min(\hat{p},\,p)\,\|\,p) \;\le\; \frac{\ln\frac{1}{\delta}}{m} \qquad (3)$$
Conjoining (2) and (3) yields the following corollary of (1).
$$\forall\delta\;\forall^{\delta} S:\;\; D(\hat{p}\,\|\,p) \;\le\; \frac{\ln\frac{2}{\delta}}{m} \qquad (4)$$
Using the inequality $D(q\,\|\,p) \ge 2(q-p)^2$ one can show that (4) implies the following better known form of the Chernoff bound.
$$\forall\delta\;\forall^{\delta} S:\;\; |p - \hat{p}| \;\le\; \sqrt{\frac{\ln\frac{2}{\delta}}{2m}} \qquad (5)$$
Using the inequality $D(q\,\|\,p) \ge \frac{(p-q)^2}{2p}$, which holds for $q \le p$, we can show that (3) implies the following.
$$\forall\delta\;\forall^{\delta} S:\;\; p \;\le\; \hat{p} + \sqrt{\frac{2\hat{p}\ln\frac{1}{\delta}}{m}} + \frac{2\ln\frac{1}{\delta}}{m} \qquad (6)$$
(A derivation of this bound can be found in [8] or [9]; to see the need for the last term, consider the case $\hat{p} = 0$.) Note that for small values of $\hat{p}$ formula (6) gives a tighter upper bound on $p$ than does (5). The upper bound on $p$ implicit in (4) is somewhat tighter than the minimum of the bounds given by (5) and (6).
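As a numerical illustration of how a bound of the form (4) is turned into an explicit upper bound on $p$: since $D(\hat{p}\,\|\,q)$ is increasing in $q$ for $q \ge \hat{p}$, the largest $q$ satisfying $D(\hat{p}\,\|\,q) \le \frac{\ln\frac{2}{\delta}}{m}$ can be found by bisection. The sketch below is only an illustration of this inversion, not code from the paper.

```python
import math

def kl(p, q):
    """Bernoulli KL divergence D(p||q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_upper_inverse(p_hat, bound):
    """Largest q >= p_hat with D(p_hat||q) <= bound, found by bisection."""
    lo, hi = p_hat, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl(p_hat, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Upper bound on p implicit in (4): D(p_hat || p) <= ln(2/delta) / m.
m, delta, p_hat = 1000, 0.05, 0.10
print(kl_upper_inverse(p_hat, math.log(2 / delta) / m))
```

The same inversion, with $\ln\frac{2}{\delta}$ replaced by $\ln|\mathcal{H}| + \ln\frac{2}{\delta}$, gives the bound implicit in (7) below.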
We now consider a formal setting for hypothesis learning. We assume a finite set $\mathcal{H}$ of hypotheses and a space $X$ of instances. We assume that each hypothesis $h$ represents a function from $X$ to $\{0,1\}$, where we write $h(x)$ for the value of the function represented by hypothesis $h$ when applied to instance $x$. We also assume a distribution $D$ on pairs $\langle x, y\rangle$ with $x \in X$ and $y \in \{0,1\}$. For any hypothesis $h$ we define the error rate of $h$, denoted $\epsilon(h)$, to be $\Pr_{\langle x,y\rangle \sim D}(h(x) \neq y)$. For a given sample $S$ of $m$ pairs drawn from $D$ we write $\hat{\epsilon}(h)$ to denote the fraction of the pairs $\langle x, y\rangle$ in $S$ such that $h(x) \neq y$. Quantifying over $h \in \mathcal{H}$ in (4) yields the following second corollary of (1).
$$\forall\delta\;\forall^{\delta} S\;\forall h \in \mathcal{H}:\;\; D(\hat{\epsilon}(h)\,\|\,\epsilon(h)) \;\le\; \frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{m} \qquad (7)$$
By considering bounds on $D(q\,\|\,p)$ we can derive the following more well known corollary of (7).
$$\forall\delta\;\forall^{\delta} S\;\forall h \in \mathcal{H}:\;\; |\epsilon(h) - \hat{\epsilon}(h)| \;\le\; \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2m}} \qquad (8)$$
These two formulas both limit the distance between $\hat{\epsilon}(h)$ and $\epsilon(h)$. In this paper we will work with (7) rather than (8) because it yields an (uncomputable) upper bound on generalization error that is optimal up to asymptotic equality.
3 The Upper Bound
Our goal now is to improve on (7). Our first step is to divide the hypotheses in $\mathcal{H}$ into disjoint sets based on their true error rates. More specifically, for $p \in [0,1]$ define $\lceil\lceil p\rceil\rceil$ to be $\frac{\lceil mp\rceil}{m}$. Note that $\lceil\lceil p\rceil\rceil$ is of the form $\frac{i}{m}$ where either $p = 0$ and $i = 0$, or $p > 0$ and $p \in \left(\frac{i-1}{m},\,\frac{i}{m}\right]$. In either case we have $\lceil\lceil p\rceil\rceil \in \{0, \frac{1}{m}, \frac{2}{m}, \ldots, 1\}$, and if $\lceil\lceil p\rceil\rceil = \frac{i}{m}$ then $p \in \left[\frac{i}{m} - \frac{1}{m},\, \frac{i}{m}\right]$.

Now we define $\mathcal{H}\!\left(\frac{i}{m}\right)$ to be the set of $h \in \mathcal{H}$ such that $\lceil\lceil\epsilon(h)\rceil\rceil = \frac{i}{m}$. We define $\mathcal{E}\!\left(\frac{i}{m}\right)$ to be $\ln\!\left(\max\!\left(e,\, \left|\mathcal{H}\!\left(\frac{i}{m}\right)\right|\right)\right)$. We now have the following lemma.

Lemma 3.1 $\forall\delta\;\forall^{\delta} S\;\forall h \in \mathcal{H}$,
$$D(\hat{\epsilon}(h)\,\|\,\epsilon(h)) \;\le\; \frac{\mathcal{E}(\lceil\lceil\epsilon(h)\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m}$$
Proof: Quantifying over $p \in \{0, \frac{1}{m}, \ldots, 1\}$ and $h \in \mathcal{H}(p)$ in (4) gives $\forall\delta$, $\forall^{\delta} S$, $\forall p \in \{0, \frac{1}{m}, \ldots, 1\}$, $\forall h \in \mathcal{H}(p)$,
$$D(\hat{\epsilon}(h)\,\|\,\epsilon(h)) \;\le\; \frac{\ln|\mathcal{H}(p)| + \ln\frac{2(m+1)}{\delta}}{m}$$
But this implies the lemma. $\Box$
Lemma 3.1 imposes a constraint, and hence a bound, on $\epsilon(h)$. More specifically, we have the following, where $\operatorname{lub}\{q : \Phi(q)\}$ denotes the least upper bound (the maximum) of the set $\{q : \Phi(q)\}$.
$$\epsilon(h) \;\le\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}(h)\,\|\,q) \;\le\; \frac{\mathcal{E}(\lceil\lceil q\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m}\right\} \qquad (9)$$
This is our uncomputable bound. It is uncomputable because the numbers $\mathcal{E}(0), \mathcal{E}(\frac{1}{m}), \ldots, \mathcal{E}(1)$ are unknown. Ignoring this problem, however, we can see that this bound is typically significantly tighter than (7). More specifically, we can rewrite (7) as follows.
$$\epsilon(h) \;\le\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}(h)\,\|\,q) \;\le\; \frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{m}\right\} \qquad (10)$$
Since $\mathcal{E}(\frac{i}{m}) \le \ln|\mathcal{H}|$, and since $\frac{\ln(m+1)}{m}$ is small for large $m$, we have that (9) is never significantly looser than (10). Now consider a hypothesis $h$ such that the bound on $\epsilon(h)$ given by (7), or equivalently, (10), is significantly less than 1/2. Assuming $m$ is large, the bound given by (9) must also be significantly less than 1/2. But for $q$ significantly less than 1/2 we will typically have that $\mathcal{E}(\lceil\lceil q\rceil\rceil)$ is significantly smaller than $\ln|\mathcal{H}|$. For example, suppose $\mathcal{H}$ is the set of all decision trees of a given size. Typically a random decision tree of this size will have error rate near 1/2. The set of decision trees with error rate significantly smaller than 1/2 will be an exponentially small fraction of the set of all possible trees. So for $q$ small compared to 1/2 we get that $\mathcal{E}(\lceil\lceil q\rceil\rceil)$ is significantly smaller than $\ln|\mathcal{H}|$. This will make the bound given by (9) significantly tighter than the bound given by (10).
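To make the comparison between (9) and (10) concrete, here is a small numerical sketch. It scans candidate true error rates $q$ on the grid $\{0, \frac{1}{m}, \ldots, 1\}$ and returns the largest feasible one, once with per-shell log-sizes $\mathcal{E}(\frac{i}{m})$ and once with the single term $\ln|\mathcal{H}|$. The additive slack $\ln\frac{2(m+1)}{\delta}$ follows the form of Lemma 3.1 as reconstructed above, and the shell profile used in the demo is invented purely for illustration.

```python
import math

def kl(p, q):
    """Bernoulli KL divergence D(p||q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def shell_bound(eps_hat, E, m, delta):
    """Bound (9), up to the 1/m grid resolution: largest q = i/m with
    D(eps_hat||q) <= (E(q) + ln(2(m+1)/delta)) / m."""
    slack = math.log(2 * (m + 1) / delta)
    feasible = (i / m for i in range(m + 1)
                if kl(eps_hat, i / m) <= (E(i / m) + slack) / m)
    return max(feasible, default=eps_hat)

def class_bound(eps_hat, log_class_size, m, delta):
    """Bound (10): the same scan with the single term ln|H| + ln(2/delta)."""
    slack = log_class_size + math.log(2 / delta)
    feasible = (i / m for i in range(m + 1) if kl(eps_hat, i / m) <= slack / m)
    return max(feasible, default=eps_hat)

# Illustrative shell profile: of a class with ln|H| = 200, only about e^5
# hypotheses have true error below 0.45; the rest sit near 1/2.
m, delta, eps_hat = 1000, 0.05, 0.10
E = lambda q: 5.0 if q < 0.45 else 200.0
print(shell_bound(eps_hat, E, m, delta))      # about 0.16
print(class_bound(eps_hat, 200.0, m, delta))  # about 0.38
```

Here the shell bound ignores the huge mass of near-1/2 hypotheses and comes out at less than half the classical bound.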
We now show that the distribution of true errors can
be replaced, essentially, by the histogram of training errors. We first introduce the following definitions.
For each error value $\frac{i}{m}$ and confidence parameter $\delta$, define $\hat{\mathcal{H}}\!\left(\frac{i}{m},\,\delta\right)$ to be the set of hypotheses $h \in \mathcal{H}$ whose training error $\hat{\epsilon}(h)$ falls in a computable interval around $\frac{i}{m}$ of the relative-Chernoff form used in (6): its half-width consists of a $\sqrt{\frac{i}{m}\ln(\cdot/\delta)/m}$ term plus an additive $\ln(\cdot/\delta)/m$ term, and so shrinks with $m$. Define
$$\hat{\mathcal{E}}\!\left(\frac{i}{m},\,\delta\right) \;=\; \ln\max\!\left(e,\; 2\left|\hat{\mathcal{H}}\!\left(\frac{i}{m},\,\delta\right)\right|\right).$$
The definition of $\hat{\mathcal{E}}\!\left(\frac{i}{m},\,\delta\right)$ is motivated by the following lemma.

Lemma 3.2 $\forall\delta$, $\forall^{\delta} S$, $\forall q \in \{0, \frac{1}{m}, \ldots, 1\}$,
$$\mathcal{E}(q) \;\le\; \hat{\mathcal{E}}(q,\,\delta).$$
Before proving lemma 3.2 we note that by conjoining (9) and lemma 3.2 we get the following. This is our main result.

Theorem 3.3 $\forall\delta$, $\forall^{\delta} S$, $\forall h \in \mathcal{H}$,
$$\epsilon(h) \;\le\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}(h)\,\|\,q) \;\le\; \frac{\hat{\mathcal{E}}\!\left(\lceil\lceil q\rceil\rceil,\,\frac{\delta}{2}\right) + \ln\frac{4(m+1)}{\delta}}{m}\right\}$$

As for lemma 3.1, the bound implicit in theorem 3.3 is typically significantly tighter than the bound in (7) or its equivalent form (10). The argument for the improved tightness of theorem 3.3 over (10) is similar to the argument for (9). More specifically, consider a hypothesis $h$ for which the bound in (10) is significantly less than 1/2. Since $\hat{\mathcal{E}}(\lceil\lceil q\rceil\rceil,\,\delta) \le \ln(2|\mathcal{H}|)$, the set of values of $q$ satisfying the condition in theorem 3.3 must all be significantly less than 1/2. But for large $m$ the width of the interval defining $\hat{\mathcal{H}}(\lceil\lceil q\rceil\rceil,\,\delta)$ is small. So if $q$ is significantly less than 1/2 then all hypotheses in $\hat{\mathcal{H}}(\lceil\lceil q\rceil\rceil,\,\delta)$ have empirical error rates significantly less than 1/2. But for most hypothesis classes, e.g., decision trees, the set of hypotheses with empirical error rates far from 1/2 should be an exponentially small fraction of the class. Hence we get that $\hat{\mathcal{E}}(\lceil\lceil q\rceil\rceil,\,\delta)$ is significantly less than $\ln|\mathcal{H}|$ and theorem 3.3 is tighter than (10).

The remainder of this section is a proof of lemma 3.2. Our departure point for the proof is the following lemma from [6].

Lemma 3.4 (McAllester 99) For any measure on any hypothesis class we have the following, where $\mathrm{E}_{h}\,f(h)$ denotes the expectation of $f(h)$ under the given measure on $h$: $\forall\delta$, $\forall^{\delta} S$, the expectation $\mathrm{E}_{h}\!\left[D(\hat{\epsilon}(h)\,\|\,\epsilon(h))\right]$ is bounded by a quantity of order $\frac{\ln\frac{m}{\delta}}{m}$.

Intuitively, this lemma states that with high confidence over the choice of the sample most hypotheses have empirical error near their true error. This will allow us to prove that $\hat{\mathcal{E}}(\lceil\lceil q\rceil\rceil,\,\delta)$ bounds $\mathcal{E}(\lceil\lceil q\rceil\rceil)$. More specifically, by considering the uniform distribution on $\mathcal{H}\!\left(\frac{i}{m}\right)$, lemma 3.4 implies that, with probability at least $1-\delta$ over the sample, the average over $h \in \mathcal{H}\!\left(\frac{i}{m}\right)$ of $D(\hat{\epsilon}(h)\,\|\,\epsilon(h))$ is small. By Markov's inequality, at least half of the hypotheses in $\mathcal{H}\!\left(\frac{i}{m}\right)$ then have $D(\hat{\epsilon}(h)\,\|\,\epsilon(h))$ at most twice this average, and every such hypothesis has $\hat{\epsilon}(h)$ inside the interval used in the definition of $\hat{\mathcal{H}}\!\left(\frac{i}{m},\,\delta\right)$. This gives
$$\left|\mathcal{H}\!\left(\tfrac{i}{m}\right)\right| \;\le\; 2\left|\hat{\mathcal{H}}\!\left(\tfrac{i}{m},\,\delta\right)\right|,$$
and hence $\mathcal{E}\!\left(\frac{i}{m}\right) \le \hat{\mathcal{E}}\!\left(\frac{i}{m},\,\delta\right)$ for this particular value of $\frac{i}{m}$. Lemma 3.2 now follows by quantification over $q \in \{0, \frac{1}{m}, \ldots, 1\}$. $\Box$
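As noted in the introduction, asymptotically the computable bound is simply the uncomputable bound with the unknown shell sizes replaced by the observed histogram of training errors. The sketch below builds that histogram-based stand-in from a list of training errors; it ignores the finite-sample widening in the definition of $\hat{\mathcal{E}}$ above and is an illustration of the asymptotic substitution only.

```python
import math
from collections import Counter

def empirical_log_shell_sizes(training_errors, m):
    """Histogram of training errors over the grid {0, 1/m, ..., 1}: returns a
    function mapping q to ln max(e, #{h : round(m * eps_hat(h)) = round(m q)}).
    This is only the asymptotic stand-in for the shell sizes; the widened
    definition of E-hat used in Theorem 3.3 is not reproduced here."""
    counts = Counter(round(e * m) for e in training_errors)
    return lambda q: math.log(max(math.e, counts.get(round(q * m), 0)))

# Example: a class whose observed training errors cluster near 1/2,
# with only a few hypotheses at low error.
m = 1000
errors = [0.5] * 100000 + [0.48] * 50000 + [0.10, 0.11, 0.12]
E_hat = empirical_log_shell_sizes(errors, m)
print(E_hat(0.5), E_hat(0.10), E_hat(0.30))   # roughly 11.5, 1.0, 1.0
```

A function built this way can be plugged into the grid scan sketched after (10) in place of $\mathcal{E}$.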
4 Asymptotic Analysis and Phase
Transitions
The bounds given in (9) and theorem 3.3 exhibit phase transitions. More specifically, the bounding expression can be discontinuous in $\delta$ and $m$, e.g., arbitrarily small changes in $\delta$ can cause large changes in the bound. To see how this happens consider the following constraint on the quantity $q$.
$$D(\hat{\epsilon}(h)\,\|\,q) \;\le\; \frac{\mathcal{E}(\lceil\lceil q\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m} \qquad (11)$$
The bound given by (9) is the least upper bound of the values of $q$ satisfying (11). Assume $m$ is sufficiently large that we can think of $\frac{\mathcal{E}(\lceil\lceil q\rceil\rceil)}{m}$ as a continuous function of $q$, which we will write as $\mathcal{E}^*(q)$. We can then rewrite (11) as follows, where $\gamma$ is a quantity not depending on $q$ and $\mathcal{E}^*(q)$ does not depend on $\delta$.
$$D(\hat{\epsilon}(h)\,\|\,q) \;\le\; \mathcal{E}^*(q) + \gamma \qquad (12)$$
For $q \ge \hat{\epsilon}(h)$ we have that $D(\hat{\epsilon}(h)\,\|\,q)$ is a monotonically increasing function of $q$. It is reasonable to assume that for $q \le 1/2$ we also have that $\mathcal{E}^*(q)$ is a monotonically increasing function of $q$. But even under these conditions it is possible that the feasible values of $q$, i.e., those satisfying (12), can be divided into separated regions. Furthermore, increasing $\gamma$ can cause a new feasible region to come into existence. When this happens the bound, which is the least upper bound of the feasible values, can increase discontinuously. At a more intuitive level, consider a large number of high error concepts and a smaller number of lower error concepts. At a certain confidence level the high error concepts can be ruled out. But as the confidence requirement becomes more stringent, suddenly (and discontinuously) the high error concepts must be considered. A similar discontinuity can occur in sample size. Phase transitions in shell decomposition bounds are discussed in more detail by Haussler et al. [1].
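The discontinuity described above can be reproduced numerically. The sketch below evaluates the (9)-style bound, with the slack term $\ln\frac{2(m+1)}{\delta}$ as reconstructed in Lemma 3.1, for a class whose shell profile has an enormous shell near error 1/2 and almost nothing elsewhere; the numbers are invented solely to exhibit the jump.

```python
import math

def kl(p, q):
    """Bernoulli KL divergence D(p||q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def shell_bound(eps_hat, E, m, delta):
    """Largest grid value q = i/m satisfying the constraint in (9)."""
    slack = math.log(2 * (m + 1) / delta)
    feasible = (i / m for i in range(m + 1)
                if kl(eps_hat, i / m) <= (E(i / m) + slack) / m)
    return max(feasible, default=eps_hat)

# A huge shell (log-size 341) concentrated on true errors in [0.49, 0.51],
# a tiny shell everywhere else, and an observed training error of 0.10.
m, eps_hat = 1000, 0.10
E = lambda q: 341.0 if 0.49 <= q <= 0.51 else 1.0

for delta in (0.1, 0.01):
    print(delta, shell_bound(eps_hat, E, m, delta))
# Tightening the confidence from delta = 0.1 to delta = 0.01 lets the huge
# near-1/2 shell back into the feasible set: the bound jumps from about
# 0.15 to about 0.49.
```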
Phase transitions complicate asymptotic analysis. But asymptotic analysis illuminates the nature of phase transitions. As mentioned in the introduction, in the asymptotic analysis of learning theory bounds it is important that one not hold $\mathcal{H}$ fixed as the sample size increases. If we hold $\mathcal{H}$ fixed then $\frac{\ln|\mathcal{H}|}{m} \to 0$. But this is not what one expects for large samples in practice. As the sample size increases one typically uses larger hypothesis classes. Intuitively, we expect that even for very large $m$ we have that $\frac{\ln|\mathcal{H}_m|}{m}$ is far from zero.
For the asymptotic analysis of the bound in (9) we assume an infinite sequence of hypothesis classes $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots$ and an infinite sequence of data distributions $D_1, D_2, D_3, \ldots$. Let $\mathcal{E}_m\!\left(\frac{i}{m}\right)$ be $\mathcal{E}\!\left(\frac{i}{m}\right)$ defined relative to $\mathcal{H}_m$ and $D_m$. In the asymptotic analysis we assume that the sequence of functions $\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil)}{m}$, viewed as functions of $q \in [0,1]$, converge uniformly to a continuous function $\mathcal{E}^*(q)$. This means that for any $\gamma > 0$ there exists a $k$ such that for all $m > k$ we have the following.
$$\forall q \in [0,1]:\;\; \left|\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil)}{m} - \mathcal{E}^*(q)\right| \;\le\; \gamma$$
Given the functions $\mathcal{E}_m$ and their limit function $\mathcal{E}^*(q)$, we define the following functions of an empirical error rate $\hat{\epsilon}$.
$$B_m(\hat{\epsilon}) \;=\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}\,\|\,q) \le \frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m}\right\}$$
$$B(\hat{\epsilon}) \;=\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}\,\|\,q) \le \mathcal{E}^*(q)\right\}$$
The function $B_m(\hat{\epsilon})$ corresponds directly to the upper bound in (9). The function $B(\hat{\epsilon})$ is intended to be the large $m$ asymptotic limit of $B_m(\hat{\epsilon})$. However, phase transitions complicate asymptotic analysis. The bound $B(\hat{\epsilon})$ need not be a continuous function of $\hat{\epsilon}$. A value of $\hat{\epsilon}$ where the bound $B(\hat{\epsilon})$ is discontinuous corresponds to a phase transition in the bound. At a phase transition the sequence $B_m(\hat{\epsilon})$ need not converge. Away from phase transitions, however, we have the following theorem.

Theorem 4.1 If the bound $B(\hat{\epsilon})$ is continuous at the point $\hat{\epsilon}$ (so we are not at a phase transition), and the functions $\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil)}{m}$, viewed as functions of $q \in [0,1]$, converge uniformly to a continuous function $\mathcal{E}^*(q)$, then we have the following.
$$\lim_{m\to\infty} B_m(\hat{\epsilon}) \;=\; B(\hat{\epsilon})$$
Proof: Define the set $F_m(\hat{\epsilon})$ as follows.
$$F_m(\hat{\epsilon}) \;=\; \left\{q :\; D(\hat{\epsilon}\,\|\,q) \le \frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m}\right\}$$
This gives the following.
$$B_m(\hat{\epsilon}) \;=\; \operatorname{lub} F_m(\hat{\epsilon})$$
Similarly, define $F(\hat{\epsilon},\,\gamma)$ and $B(\hat{\epsilon},\,\gamma)$ as follows.
$$F(\hat{\epsilon},\,\gamma) \;=\; \left\{q \in [0,1] :\; D(\hat{\epsilon}\,\|\,q) \le \mathcal{E}^*(q) + \gamma\right\}$$
$$B(\hat{\epsilon},\,\gamma) \;=\; \operatorname{lub} F(\hat{\epsilon},\,\gamma)$$
We first show that the continuity of $B(\hat{\epsilon})$ at the point $\hat{\epsilon}$ implies the continuity of $B(\hat{\epsilon},\,\gamma)$ at the point $\langle\hat{\epsilon},\, 0\rangle$. We note that there exists a continuous function $g(\hat{\epsilon},\,\gamma)$ with $g(\hat{\epsilon},\,0) = \hat{\epsilon}$ and such that for any $\gamma$ sufficiently near 0 we have the following.
$$D(g(\hat{\epsilon},\,\gamma)\,\|\,q) \;=\; D(\hat{\epsilon}\,\|\,q) - \gamma$$
We then get the following equation.
$$B(\hat{\epsilon},\,\gamma) \;=\; B(g(\hat{\epsilon},\,\gamma))$$
Since $g$ is continuous, and $B(\hat{\epsilon})$ is continuous at the point $\hat{\epsilon}$, we get that $B(\hat{\epsilon},\,\gamma)$ is continuous at the point $\langle\hat{\epsilon},\, 0\rangle$. We now prove the theorem. The functions of the form $\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m}$ converge uniformly to the function $\mathcal{E}^*(q)$. This implies that for any $\gamma > 0$ there exists a $k$ such that for all $m > k$ we have the following.
$$F(\hat{\epsilon},\,-\gamma) \;\subseteq\; F_m(\hat{\epsilon}) \;\subseteq\; F(\hat{\epsilon},\,\gamma)$$
But this in turn implies the following.
$$B(\hat{\epsilon},\,-\gamma) \;\le\; B_m(\hat{\epsilon}) \;\le\; B(\hat{\epsilon},\,\gamma) \qquad (13)$$
The theorem now follows from the continuity of the function $B(\hat{\epsilon},\,\gamma)$ at the point $\langle\hat{\epsilon},\, 0\rangle$. $\Box$
Theorem 4.1 can be interpreted as saying that for large sample sizes, and for values of $\hat{\epsilon}$ other than the special phase transition values, the bound has a well defined value independent of the confidence parameter $\delta$ and determined only by a smooth function $\mathcal{E}^*(q)$. A similar statement can be made for the bound in theorem 3.3 — for large $m$, and at points other than phase transitions, the bound is independent of $\delta$ and is determined by a smooth limit curve.
For the asymptotic analysis of theorem 3.3 we assume an infinite sequence $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots$ of hypothesis classes and an infinite sequence $S_1, S_2, S_3, \ldots$ of samples such that sample $S_m$ has size $m$. Let $\hat{\mathcal{H}}_m\!\left(\frac{i}{m},\,\delta\right)$ and $\hat{\mathcal{E}}_m\!\left(\frac{i}{m},\,\delta\right)$ be $\hat{\mathcal{H}}\!\left(\frac{i}{m},\,\delta\right)$ and $\hat{\mathcal{E}}\!\left(\frac{i}{m},\,\delta\right)$ respectively defined relative to hypothesis class $\mathcal{H}_m$ and sample $S_m$. Let $\hat{\mathcal{H}}_m\!\left(\frac{i}{m}\right)$ (with no $\delta$ argument) be the set of hypotheses in $\mathcal{H}_m$ having an empirical error of exactly $\frac{i}{m}$ on the sample $S_m$, and let $\hat{\mathcal{E}}_m\!\left(\frac{i}{m}\right)$ be $\ln\!\left(\max\!\left(e,\, \left|\hat{\mathcal{H}}_m\!\left(\frac{i}{m}\right)\right|\right)\right)$. In the analysis of theorem 3.3 we allow that the functions $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil)}{m}$ are only locally uniformly convergent to a continuous function $\hat{\mathcal{E}}^*(q)$, i.e., for any $q \in [0,1]$ and any $\gamma > 0$ there exists an integer $k$ and a real number $\sigma > 0$ satisfying the following.
$$\forall m > k\;\;\forall p \in (q - \sigma,\, q + \sigma):\;\; \left|\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m} - \hat{\mathcal{E}}^*(p)\right| \;\le\; \gamma$$
Locally uniform convergence plays a role in the analysis in section 6.

Theorem 4.2 If the functions $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil)}{m}$ converge locally uniformly to a continuous function $\hat{\mathcal{E}}^*(q)$ then, for any fixed value of $\delta$, the functions $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil,\,\delta)}{m}$ also converge locally uniformly to $\hat{\mathcal{E}}^*(q)$. If the convergence of $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil)}{m}$ is uniform, then so is the convergence of $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil,\,\delta)}{m}$.
Proof: Consider an arbitrary value $q \in [0,1]$ and $\gamma > 0$. We will construct the desired $k$ and $\sigma$. More specifically, select $k$ sufficiently large and $\sigma$ sufficiently small that, for all $m > k$ and all $p \in (q - \sigma,\, q + \sigma)$, we have $\left|\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m} - \hat{\mathcal{E}}^*(p)\right| \le \frac{\gamma}{2}$, the interval defining $\hat{\mathcal{H}}_m(\lceil\lceil p\rceil\rceil,\,\delta)$ is contained in $(q - 2\sigma,\, q + 2\sigma)$, $\hat{\mathcal{E}}^*$ varies by less than $\frac{\gamma}{2}$ over $(q - 2\sigma,\, q + 2\sigma)$, and $\frac{\ln(m+1)}{m} \le \frac{\gamma}{2}$. Consider an $m > k$ and $p \in (q - \sigma,\, q + \sigma)$. It now suffices to show the following.
$$\left|\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil,\,\delta)}{m} - \hat{\mathcal{E}}^*(p)\right| \;\le\; \gamma$$
Because $\hat{\mathcal{H}}_m(\lceil\lceil p\rceil\rceil)$ is a subset of $\hat{\mathcal{H}}_m(\lceil\lceil p\rceil\rceil,\,\delta)$ we have the following.
$$\hat{\mathcal{E}}^*(p) - \gamma \;\le\; \frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m} \;\le\; \frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil,\,\delta)}{m}$$
For the upper bound, note that $\hat{\mathcal{H}}_m(\lceil\lceil p\rceil\rceil,\,\delta)$ is contained in the union of the at most $m+1$ exact-error sets $\hat{\mathcal{H}}_m\!\left(\frac{i}{m}\right)$ with $\frac{i}{m}$ in the defining interval, and each such $\frac{i}{m}$ lies within $2\sigma$ of $q$. Hence
$$\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil,\,\delta)}{m} \;\le\; \max_{\frac{i}{m} \in (q - 2\sigma,\, q + 2\sigma)} \frac{\hat{\mathcal{E}}_m\!\left(\frac{i}{m}\right)}{m} \;+\; \frac{\ln(m+1)}{m} \;\le\; \hat{\mathcal{E}}^*(p) + \gamma,$$
where the last step uses the locally uniform convergence and the continuity of $\hat{\mathcal{E}}^*$. A similar argument shows that if $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil)}{m}$ converges uniformly to $\hat{\mathcal{E}}^*(q)$ then so does $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil,\,\delta)}{m}$. $\Box$
Given quantities $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil,\,\delta)}{m}$ that converge uniformly to $\hat{\mathcal{E}}^*(q)$, the remainder of the analysis is identical to that for the asymptotic analysis of (9). We define the following upper bounds.
$$\hat{B}_m(\hat{\epsilon}) \;=\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}\,\|\,q) \le \frac{\hat{\mathcal{E}}_m\!\left(\lceil\lceil q\rceil\rceil,\,\frac{\delta}{2}\right) + \ln\frac{4(m+1)}{\delta}}{m}\right\}$$
$$\hat{B}(\hat{\epsilon}) \;=\; \operatorname{lub}\left\{q :\; D(\hat{\epsilon}\,\|\,q) \le \hat{\mathcal{E}}^*(q)\right\}$$
Again we say that $\hat{\epsilon}$ is at a phase transition if the function $\hat{B}(\hat{\epsilon})$ is discontinuous at the value $\hat{\epsilon}$. We then get the following, whose proof is identical to that of theorem 4.1.

Theorem 4.3 If the bound $\hat{B}(\hat{\epsilon})$ is continuous at the point $\hat{\epsilon}$ (so we are not at a phase transition), and the functions $\frac{\hat{\mathcal{E}}_m(\lceil\lceil q\rceil\rceil)}{m}$ converge uniformly to $\hat{\mathcal{E}}^*(q)$, then we have the following.
$$\lim_{m\to\infty} \hat{B}_m(\hat{\epsilon}) \;=\; \hat{B}(\hat{\epsilon})$$
5 Asymptotic Optimality of (9)
Formula (9) can be viewed as providing an upper bound on $\epsilon(h)$ as a function of $\hat{\epsilon}(h)$ and the function $\mathcal{E}$. In this section we show that for any entropy curve $\mathcal{E}$ and value $\hat{\epsilon}$ there exists a hypothesis class and data distribution such that the upper bound in (9) is realized up to asymptotic equality. Up to asymptotic equality, (9) is the tightest possible bound computable from $\hat{\epsilon}(h)$ and the numbers $\mathcal{E}(0), \ldots, \mathcal{E}(1)$.

The classical VC dimension bounds are nearly optimal over bounds computable from $\hat{\epsilon}(h)$ and the class $\mathcal{H}$. The numbers $\mathcal{E}(0), \ldots, \mathcal{E}(1)$ depend on both $\mathcal{H}$ and the data distribution. Hence the bound in (9) uses information about the distribution and hence can be tighter than classical VC bounds. A similar statement applies to the bound in theorem 3.3 computed from the empirically observable numbers $\hat{\mathcal{E}}(0), \ldots, \hat{\mathcal{E}}(1)$. In this case, the bound uses more information from the sample than just $\hat{\epsilon}(h)$. The optimality theorem given here also differs from the traditional lower bound results for VC dimension in that here the lower bounds match the upper bounds up to asymptotic equality.
The departure point for our optimality analysis is the following lemma from [2].

Lemma 5.1 (Cover and Thomas) If $\hat{p}$ is the fraction of heads out of $m$ tosses of a coin where the true probability of heads is $p$, then for $q \ge p$ we have the following.
$$\Pr(\hat{p} \ge q) \;\ge\; \frac{e^{-m\,D(q\,\|\,p)}}{m+1}$$
This lower bound on $\Pr(\hat{p} \ge q)$ is very close to Chernoff's 1952 upper bound (1). The tightness of (9) is a direct reflection of the tightness of (1). To exploit Lemma 5.1 we need to construct hypothesis classes and data distributions where distinct hypotheses have independent training errors.
More specifically, we say that a set of hypotheses $\{h_1, \ldots, h_N\}$ has independent training errors if the random variables $\hat{\epsilon}(h_1), \ldots, \hat{\epsilon}(h_N)$ are independent. By an argument similar to the derivation of (3) from (1) we can prove the following from Lemma 5.1.
$$\Pr\!\left(D(\min(\hat{p},\,p)\,\|\,p) \;\ge\; \frac{\ln\frac{1}{\delta} - \ln(m+1)}{m}\right) \;\ge\; \delta \qquad (14)$$

Lemma 5.2 Let $U$ be any finite set, $S$ a random variable, and $\Phi(S,\,x,\,\delta)$ a formula such that for every $x \in U$ and $\delta > 0$ we have $\Pr(\Phi(S,\,x,\,\delta)) \ge \delta$, and such that $\Pr\!\left(\forall x \in U\; \neg\Phi(S,\,x,\,\delta)\right) = \prod_{x \in U} \Pr(\neg\Phi(S,\,x,\,\delta))$. We then have
$$\forall\delta\;\forall^{\delta} S\;\;\exists x \in U:\;\; \Phi\!\left(S,\; x,\; \frac{\ln\frac{1}{\delta}}{|U|}\right).$$
Proof:
$$\Pr\!\left(\forall x \in U\;\neg\Phi\!\left(S,\,x,\,\tfrac{\ln\frac{1}{\delta}}{|U|}\right)\right) \;=\; \prod_{x \in U} \Pr\!\left(\neg\Phi\!\left(S,\,x,\,\tfrac{\ln\frac{1}{\delta}}{|U|}\right)\right) \;\le\; \left(1 - \tfrac{\ln\frac{1}{\delta}}{|U|}\right)^{|U|} \;\le\; e^{-\ln\frac{1}{\delta}} \;=\; \delta. \;\;\Box$$
Now define $h^*\!\left(\frac{i}{m}\right)$ to be the hypothesis of minimal training error in the set $\mathcal{H}\!\left(\frac{i}{m}\right)$. Let $\operatorname{glb}\{x : \Phi(x)\}$ denote the greatest lower bound (the minimum) of the set $\{x : \Phi(x)\}$. We now have the following lemma.
Lemma 5.3 If the hypotheses in each class $\mathcal{H}(q)$ are independent then $\forall\delta$, $\forall^{\delta} S$, $\forall q \in \{0, \frac{1}{m}, \ldots, 1\}$,
$$\hat{\epsilon}(h^*(q)) \;\le\; \operatorname{glb}\left\{\hat{\epsilon} :\; D\!\left(\min\!\left(\hat{\epsilon},\, q - \tfrac{1}{m}\right)\,\middle\|\,q\right) \;\ge\; \frac{\mathcal{E}(q) - \ln\ln\frac{1}{\delta} - \ln(m+1)}{m}\right\}$$
Proof: Fix a rational number $q$ of the form $\frac{i}{m}$. Assuming independent hypotheses we can apply Lemma 5.2 to (14) to get $\forall\delta$, $\forall^{\delta} S$, $\exists h \in \mathcal{H}(q)$,
$$D(\min(\hat{\epsilon}(h),\,\epsilon(h))\,\|\,\epsilon(h)) \;\ge\; \frac{\ln|\mathcal{H}(q)| - \ln\ln\frac{1}{\delta} - \ln(m+1)}{m}.$$
Let $h'$ be a hypothesis in $\mathcal{H}(q)$ satisfying this formula. We now have $\hat{\epsilon}(h^*(q)) \le \hat{\epsilon}(h')$ and $q - \frac{1}{m} \le \epsilon(h') \le q$. These two conditions imply $\forall\delta$, $\forall^{\delta} S$,
$$D\!\left(\min\!\left(\hat{\epsilon}(h^*(q)),\, q - \tfrac{1}{m}\right)\,\middle\|\,q\right) \;\ge\; \frac{\mathcal{E}(q) - \ln\ln\frac{1}{\delta} - \ln(m+1)}{m}.$$
This implies the bound stated in the lemma for the fixed value $q$. Lemma 5.3 now follows by quantification over $q \in \{0, \frac{1}{m}, \ldots, 1\}$. $\Box$
For $q \in [0,1]$ we have that lemma 3.1 implies the following.
$$\hat{\epsilon}(h^*(\lceil\lceil q\rceil\rceil)) \;\ge\; \operatorname{glb}\left\{\hat{\epsilon} :\; D(\hat{\epsilon}\,\|\,\lceil\lceil q\rceil\rceil) \;\le\; \frac{\mathcal{E}(\lceil\lceil q\rceil\rceil) + \ln\frac{2(m+1)}{\delta}}{m}\right\}$$
We now have upper and lower bounds on the quantity $\hat{\epsilon}(h^*(\lceil\lceil q\rceil\rceil))$ which agree up to asymptotic equality — in a large $m$ limit where $\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil)}{m}$ converges (pointwise) to a continuous function $\mathcal{E}^*(q)$, the upper and lower bounds on $\hat{\epsilon}(h^*(\lceil\lceil q\rceil\rceil))$ both converge (pointwise) to the following.
$$\hat{\epsilon}(h^*(q)) \;=\; \operatorname{glb}\left\{\hat{\epsilon} :\; D(\hat{\epsilon}\,\|\,q) \le \mathcal{E}^*(q)\right\}$$
This asymptotic value of $\hat{\epsilon}(h^*(q))$ is a continuous function of $q$. Since $q$ is held fixed in calculating the bounds on $\hat{\epsilon}(h^*(\lceil\lceil q\rceil\rceil))$, phase transitions are not an issue and uniform convergence of the functions $\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil)}{m}$ is not required. Note that for large $m$ and independent hypotheses we get that $\hat{\epsilon}(h^*(q))$ is determined as a function of the true error rate $q$ and $\frac{\mathcal{E}_m(\lceil\lceil q\rceil\rceil)}{m}$.
The following theorem states that any limit function $\mathcal{E}^*(p)$ is consistent with the possibility that hypotheses are independent. This, together with lemma 5.3, implies that no uniform bound on $\epsilon(h)$ as a function of $\hat{\epsilon}(h)$ and $|\mathcal{H}(0)|, \ldots, |\mathcal{H}(1)|$ can be asymptotically tighter than (9).

Theorem 5.4 Let $\mathcal{E}^*(p)$ be any continuous function of $p \in [0,1]$. There exists an infinite sequence of hypothesis spaces $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots$, and a sequence of data distributions $D_1, D_2, D_3, \ldots$, such that each class $\mathcal{H}_m$ has independent hypotheses for data distribution $D_m$ and such that $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m}$ converges (pointwise) to $\mathcal{E}^*(p)$.
Proof: First we show that if $\left|\mathcal{H}_m\!\left(\frac{i}{m}\right)\right| = e^{m\,\mathcal{E}^*\left(\frac{i}{m}\right)}$ then the functions $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m}$ converge (pointwise) to $\mathcal{E}^*(p)$. Assume $\left|\mathcal{H}_m\!\left(\frac{i}{m}\right)\right| = e^{m\,\mathcal{E}^*\left(\frac{i}{m}\right)}$. In this case we have the following.
$$\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m} \;=\; \mathcal{E}^*(\lceil\lceil p\rceil\rceil)$$
Since $\mathcal{E}^*(p)$ is continuous, for any fixed value of $p$ we get that $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m}$ converges to $\mathcal{E}^*(p)$.

Recall that $D_m$ is a probability distribution on pairs $\langle x, y\rangle$ with $y \in \{0,1\}$ and $x \in X_m$ for some set $X_m$. We take $\mathcal{H}_m$ to be a disjoint union of sets $\mathcal{H}_m\!\left(\frac{i}{m}\right)$ where $\left|\mathcal{H}_m\!\left(\frac{i}{m}\right)\right|$ is selected as above. Let $h_1, \ldots, h_N$ be the elements of $\mathcal{H}_m$ with $N = |\mathcal{H}_m|$. Let $X_m$ be the set of all $N$-bit strings and define $h_j(x)$ to be the value of the $j$th bit of the bit vector $x$. Now define the distribution $D_m$ on pairs $\langle x, y\rangle$ by selecting $y$ to be 1 with probability 1/2 and then selecting each bit of $x$ independently, where the $j$th bit is selected to disagree with $y$ with probability $\frac{i}{m}$, where $i$ is such that $h_j \in \mathcal{H}_m\!\left(\frac{i}{m}\right)$. $\Box$
6 Relating $\hat{\mathcal{E}}$ and $\mathcal{E}$
In this section we show that in large $m$ limits of the type discussed in section 4 the histogram of empirical errors need not converge to the histogram of true errors. So even in the large $m$ asymptotic limit, the bound given by theorem 3.3 is significantly weaker than the bound given by (9).

To show that $\hat{\mathcal{E}}(\lceil\lceil q\rceil\rceil,\,\delta)$ can be asymptotically different from $\mathcal{E}(\lceil\lceil q\rceil\rceil)$ we consider the case of independent hypotheses. More specifically, given a continuous function $\mathcal{E}^*(p)$ we construct an infinite sequence of hypothesis spaces $\mathcal{H}_1, \mathcal{H}_2, \mathcal{H}_3, \ldots$ and an infinite sequence of data distributions $D_1, D_2, D_3, \ldots$ using the construction in the proof of theorem 5.4. We note that if $\mathcal{E}^*(p)$ is differentiable with bounded derivative then the functions $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m}$ converge uniformly to $\mathcal{E}^*(p)$.
functions For a given infinite sequence data. ! distributions
we
.
. W , , #}#$# , by
generate an
infinite
sample
sequence
,
. selecting
to consists
of pairs —€;=Ÿ˜t™ drawn IID
Y
.
For
a given sample sequence and
from distribution
–ŒA
we define jw Z¢–£^ and j Z_ ² m6^ in a manner
. 6^ but for sample
similar to jw Z\–_^ and Âj Z ²
. The
main result of this section is the following.
Statement 6.1 If each $\mathcal{H}_m$ has independent hypotheses under data distribution $D_m$, and the functions $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m}$ converge uniformly to a continuous function $\mathcal{E}^*(p)$, then for any $\delta > 0$ and $p \in [0,1]$ we have the following with probability 1 over the generation of the sample sequence.
$$\lim_{m\to\infty} \frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil,\,\delta)}{m} \;=\; \max_{q \in [0,1]}\;\left(\mathcal{E}^*(q) - D(p\,\|\,q)\right)$$
We call this a statement rather than a theorem because the proof has not been worked out to a high level of rigor. Nonetheless, we believe the proof sketch given below can be expanded to a fully rigorous argument.

Before giving the proof sketch we note that the limiting value of $\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil,\,\delta)}{m}$ is independent of $\delta$. This is consistent with theorem 4.2. Define $\hat{\mathcal{E}}^*(p)$ as follows.
$$\hat{\mathcal{E}}^*(p) \;=\; \max_{q \in [0,1]}\;\left(\mathcal{E}^*(q) - D(p\,\|\,q)\right)$$
Note that $\hat{\mathcal{E}}^*(p) \ge \mathcal{E}^*(p)$. This gives an asymptotic version of lemma 3.2. But since $D(p\,\|\,q)$ can be locally approximated as a constant times $(p - q)^2$ (up to its second order Taylor expansion), if $\mathcal{E}^*(p)$ is increasing at the point $p$ then we also get that $\hat{\mathcal{E}}^*(p)$ is strictly larger than $\mathcal{E}^*(p)$.
Proof Outline: To prove statement 6.1 we first define $\mathcal{H}_m(p,\, q)$, for $p, q \in \{0, \frac{1}{m}, \ldots, 1\}$, to be the set of all $h \in \mathcal{H}_m(q)$ such that $\hat{\epsilon}(h) = p$. Intuitively, $\mathcal{H}_m(p,\, q)$ is the set of concepts with true error rate near $q$ that have empirical error rate $p$. Ignoring factors that are only polynomial in $m$, the probability of a hypothesis with true error rate $q$ having empirical error rate $p$ can be written (approximately) as $e^{-m\,D(p\|q)}$. So the expected size of $\mathcal{H}_m(p,\, q)$ can be written (approximately) as $|\mathcal{H}_m(q)|\,e^{-m\,D(p\|q)}$, or alternatively, as $e^{\mathcal{E}_m(q) - m\,D(p\|q)}$. More formally, we have the following for any fixed value of $p$ and $q$.
$$\lim_{m\to\infty} \frac{\ln\!\left(\max\!\left(e,\; \mathrm{E}\!\left[\left|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)\right|\right]\right)\right)}{m} \;=\; \mathcal{E}^*(q) - D(p\,\|\,q)$$
We now show that the expectation can be eliminated from the above limit. First, consider distinct values of $p$ and $q$ such that $\mathcal{E}^*(q) - D(p\,\|\,q) > 0$. Since $p$ and $q$ are distinct, the probability that a fixed hypothesis in $\mathcal{H}_m(\lceil\lceil q\rceil\rceil)$ is in $\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)$ declines exponentially in $m$. Since $\mathcal{E}^*(q) - D(p\,\|\,q) > 0$, the expected size of $\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)$ grows exponentially in $m$. Since the hypotheses are independent, the distribution of possible values of $|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)|$ becomes essentially a Poisson mass distribution with an expected number of arrivals growing exponentially in $m$. The probability that $|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)|$ deviates from its expectation by as much as a factor of 2 declines exponentially in $m$. We say that a sample sequence is safe after $k$ if for all $m > k$ we have that $|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)|$ is within a factor of 2 of its expectation. Since the probability of being unsafe at $m$ declines exponentially in $m$, for any $\delta$ there exists a $k$ such that with probability at least $1-\delta$ the sample sequence is safe after $k$. So for any $\delta > 0$ we have that with probability at least $1-\delta$ the sequence is safe after some $k$. But since this holds for all $\delta > 0$, with probability 1 such a $k$ must exist. Hence, with probability 1,
$$\lim_{m\to\infty} \frac{\ln\!\left(\max\!\left(e,\; \left|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)\right|\right)\right)}{m} \;=\; \mathcal{E}^*(q) - D(p\,\|\,q).$$
We now define $\mathcal{E}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)$ as follows.
$$\mathcal{E}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil) \;=\; \ln\!\left(\max\!\left(e,\; \left|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)\right|\right)\right)$$
It is also possible to show that for $p = q$ we have, with probability 1, that $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil p\rceil\rceil)}{m}$ approaches $\mathcal{E}^*(p)$, and that for distinct $p$ and $q$ with $\mathcal{E}^*(q) - D(p\,\|\,q) \le 0$, $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)}{m}$ approaches 0. Putting these together yields that, with probability 1, we have the following.
$$\lim_{m\to\infty} \frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q\rceil\rceil)}{m} \;=\; \max\!\left(0,\;\mathcal{E}^*(q) - D(p\,\|\,q)\right) \qquad (15)$$
Define $\hat{\mathcal{H}}_m\!\left(\frac{i}{m}\right)$ and $\hat{\mathcal{E}}_m\!\left(\frac{i}{m}\right)$ as in section 4. We now have the following equality.
$$\hat{\mathcal{H}}_m(\lceil\lceil p\rceil\rceil) \;=\; \bigcup_{q \in \{0,\,\frac{1}{m},\,\ldots,\,1\}} \mathcal{H}_m(\lceil\lceil p\rceil\rceil,\, q)$$
We will now show that with probability 1 we have that $\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m}$ approaches $\hat{\mathcal{E}}^*(p)$. First, consider a $p \in [0,1]$ such that $\hat{\mathcal{E}}^*(p) > 0$. Since $\mathcal{E}^*(q) - D(p\,\|\,q)$ is a continuous function of $q$, and $[0,1]$ is a compact set, $\max_{q \in [0,1]}\,\mathcal{E}^*(q) - D(p\,\|\,q)$ must be realized at some value $q^* \in [0,1]$; let $q^*$ be such that $\mathcal{E}^*(q^*) - D(p\,\|\,q^*)$ equals $\hat{\mathcal{E}}^*(p)$. We have that $\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil) \ge \mathcal{E}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q^*\rceil\rceil)$. This, together with (15), implies the following.
$$\liminf_{m\to\infty} \frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m} \;\ge\; \hat{\mathcal{E}}^*(p)$$
We will now say that the sample sequence is safe at $m$ and $\frac{i}{m}$ if $|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\frac{i}{m})|$ does not exceed twice the expectation of $|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q^*\rceil\rceil)|$. Assuming uniform convergence of $\frac{\mathcal{E}_m(\lceil\lceil p\rceil\rceil)}{m}$, the probability of not being safe at $m$ and $\frac{i}{m}$ declines exponentially in $m$, at a rate at least as fast as the rate of decline of the probability of not being safe at $m$ and $\lceil\lceil q^*\rceil\rceil$. By the union bound this implies that for a given $m$ the probability that there exists an unsafe $\frac{i}{m}$ also declines exponentially. We say that the sequence is safe after $N$ if it is safe for all $m$ and $\frac{i}{m}$ with $m > N$. The probability of not being safe after $N$ also declines exponentially with $N$. By an argument similar to that given above, this implies that with probability 1 over the choice of the sequence there exists an $N$ such that the sequence is safe after $N$. But if we are safe at $m$ then $|\hat{\mathcal{H}}_m(\lceil\lceil p\rceil\rceil)| \le 2(m+1)\,\mathrm{E}\!\left[\left|\mathcal{H}_m(\lceil\lceil p\rceil\rceil,\,\lceil\lceil q^*\rceil\rceil)\right|\right]$. This implies the following.
$$\limsup_{m\to\infty} \frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m} \;\le\; \hat{\mathcal{E}}^*(p)$$
Putting the two bounds together we get the following.
$$\lim_{m\to\infty} \frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m} \;=\; \hat{\mathcal{E}}^*(p)$$
The above argument establishes (to some level of rigor) pointwise convergence of $\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m}$ to $\hat{\mathcal{E}}^*(p)$. It is also possible to establish a convergence rate that is a continuous function of $p$. This implies that the convergence of $\frac{\hat{\mathcal{E}}_m(\lceil\lceil p\rceil\rceil)}{m}$ can be made locally uniform. Theorem 4.2 then implies the desired result. $\Box$
7 Future Work
A practical difficulty with the bound implicit in theorem 3.3 is that it is usually impossible to enumerate the elements of an exponentially large hypothesis class and hence impractical to compute the histogram of training errors for the hypotheses in the class. In practice the values of $\hat{\mathcal{E}}\!\left(\frac{i}{m}\right)$ might be estimated using some form of Monte-Carlo Markov chain sampling over the hypotheses. For certain hypothesis spaces it might also be possible to directly calculate the empirical error distribution without evaluating every hypothesis.
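The Monte-Carlo estimation suggested above would, for a structured class, require a Markov chain over hypotheses. The sketch below is a much cruder stand-in that simply draws hypotheses from a user-supplied sampler and tallies their training errors; `draw_hypothesis` and the assumption that it samples the class appropriately are hypothetical, and the sketch is meant only to illustrate the kind of estimate involved.

```python
import random
from collections import Counter

def estimate_error_histogram(draw_hypothesis, sample, n_draws, seed=0):
    """Estimate the fraction of hypotheses in each training-error shell i/m by
    evaluating randomly drawn hypotheses on the sample.  draw_hypothesis(rng)
    must return one hypothesis, i.e. a callable mapping an instance to a label."""
    rng = random.Random(seed)
    m = len(sample)
    counts = Counter()
    for _ in range(n_draws):
        h = draw_hypothesis(rng)
        errors = sum(1 for x, y in sample if h(x) != y)
        counts[errors] += 1
    return {i / m: c / n_draws for i, c in counts.items()}
```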
Here we have emphasized asymptotic properties of our bound but we have not addressed rates of convergence. For finite sample sizes the rate at which the bound converges to its asymptotic behavior can be important. Before mentioning some ways that the convergence rate might be improved, however, we note that near phase transitions standard notions of convergence rate are inappropriate. Near a phase transition the bound is "unstable" — a small change in $\delta$ can alter the bound significantly. In fact, near a phase transition it is likely that $\epsilon(h^*)$ is significantly different for different samples even though $\hat{\epsilon}(h^*)$ is highly predictable. Intuitively, we would like a notion of convergence rate that measures the size of the "region of instability" around a phase transition. As the sample size increases the phase transition becomes sharper and the region of instability smaller. It would be nice to have a formal definition for the region of instability and the rate at which the size of this region goes to zero, i.e., the rate at which phase transitions in the bound become sharp.
The rate of convergence of our bound might be improved in various ways:
- Removing the discretization of true errors.
- Using one-sided bounds.
- Using nonuniform union bounds over discrete values of the form $\frac{i}{m}$.
- Tightening the Chernoff bound using direct calculation of binomial coefficients.
- Improving Lemma 3.4.

The above ideas may allow one to remove all $\ln(m)$ terms from the statement of the bound.
8 Conclusion
Traditional PAC bounds are stated in terms of the training error and class size or VC dimension. The computable bound given here is typically much tighter because it exploits the additional information in the histogram of training errors. The uncomputable bound uses
the additional (unavailable) information in the distribution of true errors. Any distribution of true errors can
be realized in a case with independent hypotheses. We
have shown that in such cases this uncomputable bound is asymptotically equal to the actual generalization error. Hence this is the tightest possible bound, up to asymptotic equality, over all bounds expressed as functions of $\hat{\epsilon}(h^*)$ and the distribution of true errors. We have also
shown that the use of the histogram of empirical errors
results in a bound that, while still tighter than traditional
bounds, is looser than the uncomputable bound even in
the large sample asymptotic limit.
Acknowledgments: Yoav Freund, Avrim Blum, and
Tobias Scheffer all provided useful discussion in forming this paper.
References
[1] David Haussler, Michael Kearns, H. Sebastian Seung, and Naftali Tishby, "Rigorous learning curve bounds from statistical mechanics", Machine Learning 25, 195-236, 1996.
[2] Thomas M. Cover and Joy A. Thomas, "Elements of Information Theory", Wiley, 1991.
[3] H. Chernoff, "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations", Annals of Mathematical Statistics, 23:493-507, 1952.
[4] Tobias Scheffer and Thorsten Joachims, “Expected
error analysis for model selection”, International
Conference on Machine Learning (ICML), 1999.
[5] Yoav Freund, “Self bounding algorithms”, COLT,
1998.
[6] David McAllester, “Pac-Bayesian model averaging”, COLT, 1999.
[7] John Langford and Avrim Blum, “Microchoice and
self-bounding algorithms”, COLT, 1999.
[8] Yishay Mansour and David McAllester, “Generalization Bounds for Decision Trees”, COLT-2000.
[9] David McAllester and Robert Schapire, “On the
Convergence Rate of Good-Turing Estimators”,
COLT-2000.