Information-Theoretic Determination of Minimax Rates of Convergence
Author(s): Yuhong Yang and Andrew Barron
Source: The Annals of Statistics, Vol. 27, No. 5 (Oct., 1999), pp. 1564-1599
Published by: Institute of Mathematical Statistics
Stable URL: http://www.jstor.org/stable/2674082
The Annals of Statistics
1999, Vol. 27, No. 5, 1564-1599

INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE^1

BY YUHONG YANG AND ANDREW BARRON
Iowa State University and Yale University
We present some general results determining minimax bounds on statistical risk for density estimation based on certain information-theoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence.
1. Introduction. The metric entropy structure of a density class determines the minimax rate of convergence of density estimators. Here we prove such results using new direct metric entropy bounds on the mutual information that arises by application of Fano's information inequality in the development of lower bounds characterizing the optimal rate. No special construction is required for each density class.

We here study global measures of loss, such as integrated squared error, squared Hellinger distance or Kullback-Leibler (K-L) divergence, in nonparametric curve estimation problems.

The minimax rates of convergence are often determined in two steps. A good lower bound is obtained for the target family of densities, and a specific estimator is constructed so that the maximum risk is within a constant factor of the derived lower bound. For global minimax risk as we are considering here, inequalities for hypothesis tests are often used to derive minimax lower bounds, including versions of Fano's inequality. These methods are used in Ibragimov and Hasminskii (1977, 1978), Hasminskii (1978), Bretagnolle and Huber (1979), Efroimovich and Pinsker (1982), Stone (1982), Birgé (1983, 1986), Nemirovskii (1985), Devroye (1987), Le Cam (1986), Yatracos (1988), Hasminskii and Ibragimov (1990), Yu (1996) and others. Upper bounds on minimax risk under metric entropy conditions are in Birgé (1983, 1986), Yatracos (1985), Barron and Cover (1991), van de Geer (1990), Wong and Shen (1995) and Birgé and Massart (1993, 1994). The focus of the present paper is on lower bounds determining the minimax rate, though some novel upper bound results are given as well. Parallel to the development of risk bounds for global measures of loss are results for point estimation of a density or functionals of the density; see, for example, Farrell (1972), Donoho and Liu (1991) and Birgé and Massart (1995).
Received November 1995; revised July 1999.
^1 Supported in part by NSF Grants ECS-94-10760 and DMS-95-05168.
AMS 1991 subject classifications. Primary 62G07; secondary 62B10, 62C20, 94A29.
Key words and phrases. Minimax risk, density estimation, metric entropy, Kullback-Leibler distance.
In its original form, Fano's inequality relates the average probability of error in a multiple hypothesis test to the Shannon mutual information for a joint distribution of parameter and random sample [Fano (1961); see also Cover and Thomas (1991), pages 39 and 205]. Beginning with Fano, this inequality has been used in information theory to determine the capacity of communication channels. Ibragimov and Hasminskii (1977, 1978) initiated the use of Fano's inequality in statistics for the determination of minimax rates of estimation for certain classes of functions. To apply this technique, suitable control of the Shannon information is required. Following Birgé (1983), statisticians have stated and used versions of Fano's inequality where the mutual information is replaced by a bound involving the diameter of classes of densities using the Kullback-Leibler divergence. For success of this tactic to obtain lower bounds on minimax risk, one must make a restriction to small subsets of the function space and then establish the existence of a packing set of suitably large cardinality. As Birgé (1986) points out, in that manner the use of Fano's inequality is similar to the use of Assouad's lemma. As it is not immediately apparent that such local packing sets exist, these techniques have been applied on a case by case basis to establish minimax rates for various function classes in the above cited literature.
In this paper we provide two means by which to reveal minimax rates from global metric entropies. The first is by use of Fano's inequality in its original form together with suitable information inequalities. This is the approach we take in Sections 2 through 6 to get minimax bounds for various measures of loss. The second approach, which we take in Section 7, shows that global metric entropy behavior implies the existence of a local packing set satisfying conditions for the theory of Birgé (1983) to be applicable.

Thus we demonstrate for nonparametric function classes (and certain measures of loss) that the minimax convergence rate is determined by the global metric entropy over the whole function class (or over large subsets of it). The advantage is that the metric entropies are available in approximation theory for many function classes [see, e.g., Lorentz, Golitschek and Makovoz (1996)]. It is no longer necessary to uncover additional local packing properties on a case by case basis.
The following proposition is representative of the results obtained here. Let $\mathcal{F}$ be a class of functions and let $d(f,g)$ be a distance between functions [we require $d(f,g)$ nonnegative and equal to zero for $f = g$, but we do not require it to be a metric]. Let $N(\varepsilon; \mathcal{F})$ be the size of the largest packing set of functions separated by at least $\varepsilon$ in $\mathcal{F}$, and let $\varepsilon_n$ satisfy $\varepsilon_n^2 = M(\varepsilon_n)/n$, where $M(\varepsilon) = \log N(\varepsilon; \mathcal{F})$ is the Kolmogorov $\varepsilon$-entropy and $n$ is the sample size. Assume the target class $\mathcal{F}$ is rich enough to satisfy $\liminf_{\varepsilon \to 0} M(\varepsilon/2)/M(\varepsilon) > 1$ [which is true, e.g., if $M(\varepsilon) = (1/\varepsilon)^r K(\varepsilon)$ with $r > 0$ and $K(\varepsilon/2)/K(\varepsilon) \to 1$ as $\varepsilon \to 0$]. This condition is satisfied in typical nonparametric classes.

For convenience, we will use the symbols $\succeq$ and $\asymp$, where $a_n \succeq b_n$ means $b_n = O(a_n)$, and $a_n \asymp b_n$ means both $a_n \succeq b_n$ and $b_n \succeq a_n$. Probability densities are taken with respect to a finite measure $\mu$ in the following proposition.
PROPOSITION 1. In the following cases, the minimax convergence rate is characterized by metric entropy in terms of the critical separation $\varepsilon_n$ as follows:

$$\min_{\hat f} \max_{f \in \mathcal{F}} E\, d^2(f, \hat f) \asymp \varepsilon_n^2.$$

(i) $\mathcal{F}$ is any class of density functions bounded above and below, $0 < \underline{C} \le f \le \bar{C}$ for $f \in \mathcal{F}$. Here $d^2(f,g)$ is either integrated squared error $\int (f(x) - g(x))^2\, d\mu$, squared Hellinger distance or Kullback-Leibler divergence.

(ii) $\mathcal{F}$ is a convex class of densities with $f \le \bar{C}$ for $f \in \mathcal{F}$, there exists at least one density in $\mathcal{F}$ bounded away from zero, and $d$ is the $L_2$ distance.

(iii) $\mathcal{F}$ is any class of functions $f$ with $|f| \le \bar{C}$ for $f \in \mathcal{F}$ for the regression model $Y = f(X) + \epsilon$, where $X$ and $\epsilon$ are independent, $X \sim P_X$, $\epsilon \sim \mathrm{Normal}(0, \sigma^2)$ with $\sigma > 0$, and $d$ is the $L_2(P_X)$ norm.
From the above proposition, the minimax $L_2$ risk rate is determined by the metric entropy alone, whether the densities can be zero or not. For Hellinger and Kullback-Leibler (K-L) risk, we show that by modifying a nonparametric class of densities with uniformly bounded logarithms to allow the densities to approach zero or even vanish on some unknown subsets, the minimax rate may remain unchanged compared to that of the original class.
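As a computational aside (our own illustration, not part of the paper), the critical separation defined by $\varepsilon_n^2 = M(\varepsilon_n)/n$ can be found numerically for any presumed entropy bound; the entropy function $M(\varepsilon) = (1/\varepsilon)^{1/\alpha}$ used below is a hypothetical choice for which the closed-form answer is known.

```python
def critical_separation(M, n, lo=1e-9, hi=1.0, iters=200):
    """Solve eps^2 = M(eps)/n for eps by bisection.

    Since M is nonincreasing, g(eps) = eps^2 - M(eps)/n is increasing,
    so it has a single sign change on (lo, hi).
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid * mid - M(mid) / n < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical entropy M(eps) = (1/eps)^(1/alpha); then eps_n = n^(-alpha/(2*alpha+1)).
alpha = 1.0
eps_n = critical_separation(lambda e: (1.0 / e) ** (1.0 / alpha), n=10_000)
print(eps_n)  # ~ 10000 ** (-1/3) ~ 0.046
```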
Now let us outline roughly the method of lower bounding the minimax risk using Fano's inequality. The first step is to restrict attention to a subset $S_0$ of the parameter space where minimax estimation is nearly as difficult as for the whole space and, moreover, where the loss function of interest is related locally to the K-L divergence that arises in Fano's inequality. (For example, the subset can in some cases be the set of densities with a bound on their logarithms.) As we shall reveal, the lower bound on the minimax rate is determined by the metric entropy of the subset.

The proof technique involving Fano's inequality first lower bounds the minimax risk by restricting to as large as possible a finite set of parameter values $\{\theta_1, \ldots, \theta_m\}$ in $S_0$ separated from each other by an amount $\varepsilon_n$ in the distance of interest. The critical separation $\varepsilon_n$ is the largest separation such that the hypotheses $\{\theta_1, \ldots, \theta_m\}$ are nearly indistinguishable on the average by tests, as we shall see. Indeed, Fano's inequality reveals this indistinguishability in terms of the K-L divergence between the densities $p_{\theta_j}(x_1, \ldots, x_n) = \prod_{i=1}^n p_{\theta_j}(x_i)$ and the centroid of such densities $q(x_1, \ldots, x_n) = (1/m) \sum_{j=1}^m p_{\theta_j}(x_1, \ldots, x_n)$ [which yields the Shannon mutual information $I(\Theta; X_1, \ldots, X_n)$ between $\theta$ and $X_1, \ldots, X_n$ with a uniform distribution on $\theta$ in $\{\theta_1, \ldots, \theta_m\}$]. Here the key question is to determine the separation such that the average of this K-L divergence is small compared to the distance $\log m$ that would correspond to maximally distinguishable densities (for which $\theta$ is determined by $X^n$). It is critical here that the K-L divergence does not have a triangle inequality between the joint densities. Indeed, covering entropy properties of the K-L divergence (under the conditions of the proposition) show that the K-L divergence from every $p_{\theta_j}(x_1, \ldots, x_n)$ to the centroid is bounded by the right order $2n\varepsilon_n^2$, even though the distance between two such $p_{\theta_j}(x_1, \ldots, x_n)$ is as large as $n\bar{D}$, where $\bar{D}$ is the K-L diameter of the whole set $\{p_{\theta_1}, \ldots, p_{\theta_m}\}$. The proper convergence rate is thus identified provided the cardinality $m$ of the subset is chosen such that $n\varepsilon_n^2/\log m$ is bounded by a suitable constant less than 1. Thus $\varepsilon_n$ is determined by solving for $n\varepsilon_n^2/M(\varepsilon_n)$ equal to such a constant, where $M(\varepsilon)$ is the metric entropy (the logarithm of the largest cardinality of an $\varepsilon$-packing set). In this way, the metric entropy provides a lower bound on the minimax convergence rate.
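Schematically, the two bounds that drive this argument (developed precisely in Section 2) are Fano's inequality and the mutual information bound via the covering net:

$$P_w(\tilde\theta \ne \theta) \ge 1 - \frac{I(\Theta; X^n) + \log 2}{\log m}, \qquad I(\Theta; X^n) \le \max_j D(P_{X^n|\theta_j} \| Q_{X^n}) \le V(\varepsilon_n) + n\varepsilon_n^2.$$

With $V(\varepsilon_n) = n\varepsilon_n^2$ and $\log m \ge 4n\varepsilon_n^2 + 2\log 2$, the error probability stays at least $1/2$.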
Applications of Fano's inequality to density estimation have used the K-L diameter $n\bar{D}$ of the set $\{p^n_{\theta_1}, \ldots, p^n_{\theta_m}\}$ [see, e.g., Birgé (1983)] or similar rough bounds [such as $nI(\Theta; X_1)$ as in Hasminskii (1978)] in place of the average distance of $p^n_{\theta_1}, \ldots, p^n_{\theta_m}$ from their centroid. In that theory, to obtain a suitable bound, a statistician needs to find, if possible, a sufficiently large subset $\{\theta_1, \ldots, \theta_m\}$ for which the diameter of this subset (in the K-L sense) is of the same order as the separation between closest points in this subset (in the chosen distance). Apparently, such a bound is possible only for subsets of small diameter. Thus by that technique, knowledge is needed not only of the metric entropy but also of special localized subsets. Typical tools for smoothness classes involve perturbations of densities parametrized by vertices of a hypercube. While interesting, such involved calculations are not needed to obtain the correct order bounds. It suffices to know or bound the metric entropy of the chosen set $S_0$. Our results are especially useful for function classes whose global metric entropies have been determined but for which local packing sets have not been identified for applying Birgé's theory (e.g., general linear and sparse approximation sets of functions in Sections 4 and 5, and neural network classes in Section 6).
It is not our purpose to criticize the use of hypercube-type arguments in general. In fact, besides the success of such methods mentioned above, they are also useful in other applications such as determining the minimax rates of estimating functionals of densities [see, e.g., Bickel and Ritov (1988), Birgé and Massart (1995) and Pollard (1993)] and minimax rates in nonparametric classification [Yang (1999b)].
The density estimation problem we consider is closely related to a data compression problem in information theory (see Section 3). The relationship allows us to obtain both upper and lower bounds on the minimax risk from the minimax redundancy of data compression, the upper bounding of which is related to the global metric entropy.
Le Cam (1973) pioneered the use of local entropy conditions, in which convergence rates are characterized in terms of the covering or packing of balls of radius $\varepsilon$ by balls of radius $\varepsilon/2$, with subsequent developments by Birgé and others, as mentioned above. Such local entropy conditions provide optimal convergence rates in finite-dimensional as well as infinite-dimensional settings. In Section 7, we show that knowledge of global metric entropy provides the existence of a set with suitable local entropy properties in infinite-dimensional settings. In such cases, there is no need to explicitly require or construct such a set.
The paper is divided into 7 sections. In Section 2, the main results are presented. Applications in data compression and regression are given in Section 3. In Sections 4 and 5, results connecting linear approximation and minimax rates, and sparse approximation and minimax rates, respectively, are given. In Section 6, we illustrate the determination of minimax rates of convergence for several function classes. In Section 7, we discuss the relationship between the global entropy and local entropy. The proofs of some lemmas are given in the Appendix.
2. Main results. Suppose we have a collection of densities $\{p_\theta: \theta \in \Theta\}$ defined on a measurable space $\mathcal{X}$ with respect to a $\sigma$-finite measure $\mu$. The parameter space $\Theta$ could be a finite-dimensional space or a nonparametric space (e.g., the class of all densities). Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from $p_\theta$, $\theta \in \Theta$. We want to estimate the true density $p_\theta$ or $\theta$ based on the sample. The K-L loss, the squared Hellinger loss, the integrated squared error and some other losses will be considered in this paper. We determine minimax bounds for subclasses $\{p_\theta: \theta \in S\}$, $S \subset \Theta$. When the parameter is the density itself, we may use $f$ and $\mathcal{F}$ in place of $\theta$ and $S$, respectively. Our technique is most appropriate for nonparametric classes (e.g., monotone, Lipschitz, or neural net; see Section 6).
Let $\bar S$ be an action space for the parameter estimates with $S \subset \bar S \subset \Theta$. An estimator $\hat\theta$ is then a measurable mapping from the sample space of $X_1, X_2, \ldots, X_n$ to $\bar S$. Let $\mathcal{S}_n$ be the collection of all such estimators. For nonparametric density estimation, $\bar S = \Theta$ is often chosen to be the set of all densities or some transform of the densities (e.g., square root of density). We consider general loss functions $d$, which are mappings from $\bar S \times \bar S$ to $\mathbb{R}^+$ with $d(\theta, \theta) = 0$ and $d(\theta, \theta') > 0$ for $\theta \ne \theta'$. We call such a loss function a distance whether or not it satisfies the properties of a metric.
The minimax risk of estimating $\theta \in S$ with action space $\bar S$ is defined as

$$r_n = \min_{\hat\theta \in \mathcal{S}_n} \max_{\theta \in S} E_\theta\, d^2(\hat\theta, \theta).$$

Here "min" and "max" are understood to be "inf" and "sup," respectively, if the minimizer or maximizer does not exist.
DEFINITION 1. A finite set $N_\varepsilon \subset S$ is said to be an $\varepsilon$-packing set in $S$ with separation $\varepsilon > 0$ if for any $\theta, \theta' \in N_\varepsilon$, $\theta \ne \theta'$, we have $d(\theta, \theta') > \varepsilon$. The logarithm of the maximum cardinality of $\varepsilon$-packing sets is called the packing $\varepsilon$-entropy or Kolmogorov capacity of $S$ with distance function $d$ and is denoted $M_d(\varepsilon)$.
DEFINITION 2. A set $G_\varepsilon \subset \bar S$ is said to be an $\varepsilon$-net for $S$ if for any $\theta \in S$ there exists a $\theta_0 \in G_\varepsilon$ such that $d(\theta, \theta_0) \le \varepsilon$. The logarithm of the minimum cardinality of $\varepsilon$-nets is called the covering $\varepsilon$-entropy of $S$ and is denoted $V_d(\varepsilon)$.
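For intuition about these two definitions, when $d$ is a metric on a finite candidate family, a greedy construction (our own sketch, not from the paper) produces a set that is simultaneously an $\varepsilon$-packing set and an $\varepsilon$-net: keep each scanned point whose distance to all kept points exceeds $\varepsilon$; kept points are $\varepsilon$-separated, and every discarded point lies within $\varepsilon$ of a kept one.

```python
def greedy_pack(points, dist, eps):
    """Greedy eps-packing of a finite set; the result is also an eps-net.

    points: list of candidates; dist: symmetric distance function.
    """
    kept = []
    for p in points:
        if all(dist(p, q) > eps for q in kept):
            kept.append(p)
    return kept
```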
From the definitions, it is straightforward to see that $M_d(\varepsilon)$ and $V_d(\varepsilon)$ are nonincreasing in $\varepsilon$ and $M_d(\varepsilon)$ is right continuous. These definitions are slight generalizations of the metric entropy notions introduced by Kolmogorov and Tihomirov (1959). In accordance with common terminology, we informally call these $\varepsilon$-entropies "metric" entropies even when the distance is not a metric. One choice is the square root of the Kullback-Leibler (K-L) divergence, which we denote by $d_K$, where $d_K^2(\theta, \theta') = D(p_\theta \| p_{\theta'}) = \int p_\theta \log(p_\theta/p_{\theta'})\, d\mu$. Clearly $d_K(\theta, \theta')$ is asymmetric in its two arguments, so it is not a metric. Other distances we consider include the Hellinger metric $d_H(\theta, \theta') = (\int (\sqrt{p_\theta} - \sqrt{p_{\theta'}})^2\, d\mu)^{1/2}$ and the $L_q(\mu)$ metric $d_q(\theta, \theta') = (\int |p_\theta - p_{\theta'}|^q\, d\mu)^{1/q}$ for $q \ge 1$. The Hellinger distance is upper bounded by the square root of the K-L divergence, that is, $d_H(\theta, \theta') \le d_K(\theta, \theta')$. We assume the distance $d$ satisfies the following condition.

CONDITION 0 (Local triangle inequality). There exist positive constants $A \le 1$ and $\varepsilon_0$ such that for any $\theta, \theta' \in S$, $\bar\theta \in \bar S$, if $\max(d(\bar\theta, \theta), d(\bar\theta, \theta')) \le \varepsilon_0$, then $d(\bar\theta, \theta) + d(\bar\theta, \theta') \ge A\, d(\theta, \theta')$.

REMARKS. (i) If $d$ is a metric on $\Theta$, then the condition is always satisfied with $A = 1$ for any $S \subset \Theta$.

(ii) For $d_K$, the condition is not automatically satisfied. It holds when $S$ is a class of densities with bounded logarithms and $\bar S \subset \Theta$ is any action space, including the class of all densities. It is also satisfied by regression families as considered in Section 3. See Section 2.3 for more discussion.

When Condition 0 is satisfied, the packing entropy and covering entropy have the simple relationship, for $\varepsilon \le \varepsilon_0$: $M_d(2\varepsilon/A) \le V_d(\varepsilon) \le M_d(A\varepsilon)$.

We will obtain minimax results for such general $d$, and then special results will be given with several choices of $d$: the square root K-L divergence, Hellinger distance and $L_q$ distance. We assume $M_d(\varepsilon) < \infty$ for all $\varepsilon > 0$. The square root K-L, Hellinger and $L_q$ packing entropies are denoted $M_K(\varepsilon)$, $M_H(\varepsilon)$ and $M_q(\varepsilon)$, respectively.
2.1. Minimax risk under a global entropy condition. Suppose a good upper bound on the covering $\varepsilon$-entropy under the square root K-L divergence is available. That is, assume $V_K(\varepsilon) \le V(\varepsilon)$. Ideally, $V_K(\varepsilon)$ and $V(\varepsilon)$ are of the same order. Similarly, let $M(\varepsilon) \le M_d(\varepsilon)$ be an available lower bound on the $\varepsilon$-packing entropy with distance $d$, and ideally $M(\varepsilon)$ is of the same order as $M_d(\varepsilon)$. Suppose $V(\varepsilon)$ and $M(\varepsilon)$ are nonincreasing and right-continuous functions. To avoid a trivial case (in which $S$ is a small finite set), we assume $M(\varepsilon) \ge 2\log 2$ for $\varepsilon$ small enough. Let $\varepsilon_n$ (called the critical covering radius) be determined by

$$\varepsilon_n^2 = V(\varepsilon_n)/n.$$

The trade-off here between $V(\varepsilon)/n$ and $\varepsilon^2$ is analogous to that between the squared bias and variance of an estimator. As will be shown later, $2\varepsilon_n^2$ is an upper bound on the minimax K-L risk. Let $\varepsilon_{n,d}$ be a separation such that

$$M(\varepsilon_{n,d}) = 4n\varepsilon_n^2 + 2\log 2.$$

We call $\varepsilon_{n,d}$ the packing separation commensurate with the critical covering radius $\varepsilon_n$. This $\varepsilon_{n,d}^2$ determines a lower bound on the minimax risk.

In general, the upper and lower rates $\varepsilon_n^2$ and $\varepsilon_{n,d}^2$ need not match. Conditions under which they are of the same order are explored at the end of this subsection and are used in Section 2.2 to identify the minimax rate for $L_2$ distance and in Sections 2.3 and 2.4 to relate the minimax K-L risk to the minimax Hellinger risk and the Hellinger metric entropy.
THEOREM 1 (Minimax lower bound). Suppose Condition 0 is satisfied for the distance $d$. Then when the sample size $n$ is large enough that $\varepsilon_{n,d} \le 2\varepsilon_0$, the minimax risk for estimating $\theta \in S$ satisfies

$$\min_{\hat\theta} \max_{\theta \in S} P_\theta\{d(\hat\theta, \theta) \ge (A/2)\varepsilon_{n,d}\} \ge 1/2$$

and consequently,

$$\min_{\hat\theta} \max_{\theta \in S} E_\theta\, d^2(\hat\theta, \theta) \ge (A^2/8)\,\varepsilon_{n,d}^2,$$

where the minimum is over all estimators mapping from $\mathcal{X}^n$ to $\bar S$.
PROOF. Let $N_{\varepsilon_{n,d}}$ be an $\varepsilon_{n,d}$-packing set with the maximum cardinality in $S$ under the given distance $d$ and let $G_{\varepsilon_n}$ be an $\varepsilon_n$-net for $S$ under $d_K$. For any estimator $\hat\theta$ taking values in $\bar S$, define $\tilde\theta = \arg\min_{\theta' \in N_{\varepsilon_{n,d}}} d(\theta', \hat\theta)$ (if there is more than one minimizer, choose any one), so that $\tilde\theta$ takes values in the finite packing set $N_{\varepsilon_{n,d}}$. Let $\theta$ be any point in $N_{\varepsilon_{n,d}}$. If $d(\hat\theta, \theta) < A\varepsilon_{n,d}/2$, then $\max(d(\hat\theta, \theta), d(\hat\theta, \tilde\theta)) < A\varepsilon_{n,d}/2 \le \varepsilon_0$ and hence by Condition 0, $d(\hat\theta, \tilde\theta) + d(\hat\theta, \theta) \ge A\, d(\tilde\theta, \theta)$, which is at least $A\varepsilon_{n,d}$ if $\tilde\theta \ne \theta$. Thus if $\tilde\theta \ne \theta$, we must have $d(\hat\theta, \theta) \ge A\varepsilon_{n,d}/2$, and

$$\min_{\hat\theta} \max_{\theta \in S} P_\theta\{d(\hat\theta, \theta) \ge (A/2)\varepsilon_{n,d}\} \ge \min_{\hat\theta} \max_{\theta \in N_{\varepsilon_{n,d}}} P_\theta\{d(\hat\theta, \theta) \ge (A/2)\varepsilon_{n,d}\} = \min_{\tilde\theta} \max_{\theta \in N_{\varepsilon_{n,d}}} P_\theta(\tilde\theta \ne \theta) \ge \min_{\tilde\theta} \sum_\theta w(\theta) P_\theta(\tilde\theta \ne \theta) = \min_{\tilde\theta} P_w(\tilde\theta \ne \theta),$$

where in the last line $\theta$ is randomly drawn according to a discrete prior probability $w$ restricted to $N_{\varepsilon_{n,d}}$ and $P_w$ denotes the Bayes average probability with respect to the prior $w$. Moreover, since $(d(\hat\theta, \theta))^l$ is not less than $((A/2)\varepsilon_{n,d})^l\, 1\{\tilde\theta \ne \theta\}$, taking the expected value it follows that for all $l > 0$,

$$\min_{\hat\theta} \max_{\theta \in S} E_\theta\, d^l(\hat\theta, \theta) \ge ((A/2)\varepsilon_{n,d})^l \min_{\tilde\theta} P_w(\tilde\theta \ne \theta).$$

By Fano's inequality [see, e.g., Fano (1961) or Cover and Thomas (1991), pages 39 and 205], with $w$ being the discrete uniform prior on $\theta$ in the packing set $N_{\varepsilon_{n,d}}$, we have

(1) $$P_w(\tilde\theta \ne \theta) \ge 1 - \frac{I(\Theta; X^n) + \log 2}{\log |N_{\varepsilon_{n,d}}|},$$

where $I(\Theta; X^n)$ is Shannon's mutual information between the random parameter and the random sample, when $\theta$ is distributed according to $w$. This mutual information is equal to the average (with respect to the prior) of the K-L divergence between $p(x^n|\theta)$ and $p_w(x^n) = \sum_\theta w(\theta) p(x^n|\theta)$, where $p(x^n|\theta) = p_\theta(x^n) = \prod_{i=1}^n p_\theta(x_i)$ and $x^n = (x_1, \ldots, x_n)$. It is upper bounded by the maximum K-L divergence between the product densities $p(x^n|\theta)$ and any joint density $q(x^n)$ on the sample space $\mathcal{X}^n$. Indeed,

$$I(\Theta; X^n) = \sum_\theta w(\theta) \int p(x^n|\theta) \log\frac{p(x^n|\theta)}{p_w(x^n)}\, \mu(dx^n) \le \sum_\theta w(\theta) \int p(x^n|\theta) \log\frac{p(x^n|\theta)}{q(x^n)}\, \mu(dx^n) \le \max_{\theta \in N_{\varepsilon_{n,d}}} D(P_{X^n|\theta} \| Q_{X^n}).$$

The first inequality above follows from the fact that the Bayes mixture density $p_w(x^n)$ minimizes the average K-L divergence over choices of densities $q(x^n)$ [any other choice yields a larger value by the amount $\int p_w(x^n) \log(p_w(x^n)/q(x^n))\, \mu(dx^n) \ge 0$]. We have $w$ uniform on $N_{\varepsilon_{n,d}}$. Now choose $w_1$ to be the uniform prior on $G_{\varepsilon_n}$, let $q(x^n) = p_{w_1}(x^n) = \sum_\theta w_1(\theta) p(x^n|\theta)$ and let $Q_{X^n}$ be the corresponding Bayes mixture density and distribution, respectively. Because $G_{\varepsilon_n}$ is an $\varepsilon_n$-net in $S$ under $d_K$, for each $\theta \in S$ there exists $\theta^* \in G_{\varepsilon_n}$ such that $D(p_\theta \| p_{\theta^*}) = d_K^2(\theta, \theta^*) \le \varepsilon_n^2$. Also by definition, $\log|G_{\varepsilon_n}| \le V_K(\varepsilon_n)$. It follows that

(2) $$D(P_{X^n|\theta} \| Q_{X^n}) = E \log \frac{p(X^n|\theta)}{(1/|G_{\varepsilon_n}|) \sum_{\theta' \in G_{\varepsilon_n}} p(X^n|\theta')} \le E \log \frac{p(X^n|\theta)}{(1/|G_{\varepsilon_n}|)\, p(X^n|\theta^*)} = \log|G_{\varepsilon_n}| + D(P_{X^n|\theta} \| P_{X^n|\theta^*}) \le V(\varepsilon_n) + n\varepsilon_n^2.$$

Thus, by our choice of $\varepsilon_{n,d}$, $(I(\Theta; X^n) + \log 2)/\log|N_{\varepsilon_{n,d}}| \le 1/2$. The conclusion follows. This completes the proof of Theorem 1. □
REMARKS. (i) Up to inequality (1), the development here is standard. Previous use of Fano's inequality for minimax lower bounds takes one of the following weak bounds on mutual information: $I(\Theta; X^n) \le nI(\Theta; X_1)$ or $I(\Theta; X^n) \le n \max_{\theta, \theta' \in S} D(P_{X_1|\theta} \| P_{X_1|\theta'})$ [see, e.g., Hasminskii (1978) and Birgé (1983), respectively]. An exception is work of Ibragimov and Hasminskii (1978), where a more direct evaluation of the mutual information for Gaussian stochastic process models is used.

(ii) Our use of the improved bound is borrowed from ideas in universal data compression, for which $I(\Theta; X^n)$ represents the Bayes average redundancy and $\max_{\theta \in S} D(P_{X^n|\theta} \| P_{X^n})$ represents an upper bound on the minimax redundancy $C_n = \min_{Q_{X^n}} \max_{\theta \in S} D(P_{X^n|\theta} \| Q_{X^n}) = \max_w I_w(\Theta; X^n)$, where the maximum is over priors supported on $S$. The universal data compression interpretations of these quantities can be found in Davisson (1973) and Davisson and Leon-Garcia (1980) [see Clarke and Barron (1994), Yu (1996), Haussler (1997) and Haussler and Opper (1997) for some of the recent work in that area]. The bound $D(P_{X^n|\theta} \| P_{X^n}) \le V(\varepsilon_n) + n\varepsilon_n^2$ has roots in Barron [(1987), page 89], where it is given in a more general form for arbitrary priors, that is, $D(P_{X^n|\theta} \| P_{X^n}) \le \log(1/w(\Theta_{\theta,\varepsilon})) + n\varepsilon^2$, where $\Theta_{\theta,\varepsilon} = \{\theta': D(p_\theta \| p_{\theta'}) \le \varepsilon^2\}$ and $P_{X^n}$ has density $p_w(x^n) = \int p(x^n|\theta)\, w(d\theta)$. The redundancy bound $V(\varepsilon_n) + n\varepsilon_n^2$ can also be obtained from use of a two-stage code of length $\log|G_{\varepsilon_n}| + \min_{\theta' \in G_{\varepsilon_n}} \log(1/p(x^n|\theta'))$ [see Barron and Cover (1991), Section V].

(iii) From inequality (1), the minimax risk is bounded below by a constant times $\varepsilon_{n,d}^2 (1 - (C_n + \log 2)/K_n)$, where $C_n = \max_w I_w(\Theta; X^n)$ is the Shannon capacity of the channel $\{p(x^n|\theta): \theta \in S\}$ and $K_n = \log|N_{\varepsilon_{n,d}}|$ is the Kolmogorov $\varepsilon_{n,d}$-capacity of $S$. Thus $\varepsilon_{n,d}^2$ lower bounds the minimax rate provided the Shannon capacity is less than the Kolmogorov capacity by a factor less than 1. This Shannon-Kolmogorov characterization is emphasized by Ibragimov and Hasminskii (1977, 1978).

(iv) For applications, the lower bounds may be applied to a subclass of densities $\{p_\theta: \theta \in S_0\}$ ($S_0 \subset S$) which may be rich enough to characterize the difficulty of the estimation of the densities in the whole class yet is easy enough to check the conditions. For instance, if $\{p_\theta: \theta \in S_0\}$ is a subclass of densities that have support on a compact space and $\|\log p_\theta\|_\infty \le T$ for all $\theta \in S_0$, then the square root K-L divergence, Hellinger distance and $L_2$ distance are all equivalent in the sense that each of them is both upper bounded and lower bounded by multiples of each other.
We now turn our attention to obtaining a minimax upper bound. We use a uniform prior $w_1$ on the $\varepsilon_n$-net $G_{\varepsilon_n}$ for $S$ under $d_K$. For $n = 1, 2, \ldots,$ let

$$p(x^n) = \sum_{\theta \in G_{\varepsilon_n}} w_1(\theta)\, p(x^n|\theta) = \frac{1}{|G_{\varepsilon_n}|} \sum_{\theta \in G_{\varepsilon_n}} p(x^n|\theta)$$

be the corresponding mixture density. Let

(3) $$\hat p(x) = n^{-1} \sum_{i=0}^{n-1} \hat p_i(x)$$

be the density estimator constructed as a Cesàro average of the Bayes predictive density estimators $\hat p_i(x) = p(X_{i+1}|X^i)$ evaluated at $X_{i+1} = x$, which equal $p(X^i, x)/p(X^i)$ for $i \ge 1$, and $\hat p_0(x) = p(x) = (1/|G_{\varepsilon_n}|) \sum_{\theta \in G_{\varepsilon_n}} p(x|\theta)$ for $i = 0$. Note that $\hat p_i(x)$ is the average of $p(x|\theta)$ with respect to the posterior distribution $p(\theta|X^i)$ on $\theta \in G_{\varepsilon_n}$.
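A toy numerical version of (3) may help (our own sketch; the discretized sample space and the finite set of candidate densities standing in for the net are illustrative assumptions):

```python
import numpy as np

def cesaro_predictive_density(candidates, data_idx):
    """Sketch of estimator (3): Cesaro average of Bayes predictive densities.

    candidates: array (m, K), each row a density on K discrete cells,
        playing the role of the uniform prior on the eps-net.
    data_idx: observed cell indices X_1..X_n.
    """
    m, K = candidates.shape
    log_w = np.zeros(m)            # log posterior weights, uniform prior
    avg = np.zeros(K)
    n = len(data_idx)
    for i in range(n):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        avg += w @ candidates      # predictive density p(.|X^i)
        log_w += np.log(candidates[:, data_idx[i]] + 1e-300)
    return avg / n
```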
THEOREM 2 (Upper bound). Let $V(\varepsilon)$ be an upper bound on the covering entropy of $S$ under $d_K$ and let $\varepsilon_n$ satisfy $V(\varepsilon_n) = n\varepsilon_n^2$. Then

$$\min_{\hat p} \max_{\theta \in S} E\, D(p_\theta \| \hat p) \le 2\varepsilon_n^2,$$

where the minimization is over all density estimators. Moreover, if Condition 0 is satisfied for a distance $d$ with $A_0 d^2(\theta, \theta') \le d_K^2(\theta, \theta')$ for all $\theta, \theta' \in S$, and if the set $\mathcal{S}_n$ of allowed estimators (mappings from $\mathcal{X}^n$ to $\bar S$) contains $\hat p$ constructed in (3) above, then when $\varepsilon_{n,d} \le 2\varepsilon_0$,

$$(A_0 A^2/8)\,\varepsilon_{n,d}^2 \le A_0 \min_{\hat\theta \in \mathcal{S}_n} \max_{\theta \in S} E_\theta\, d^2(\hat\theta, \theta) \le \min_{\hat\theta \in \mathcal{S}_n} \max_{\theta \in S} E_\theta\, d_K^2(\hat\theta, \theta) \le 2\varepsilon_n^2.$$
The condition that $\mathcal{S}_n$ contains $\hat p$ in Theorem 2 is satisfied if $\{p_\theta: \theta \in \bar S\}$ is convex (because $\hat p$ is a convex combination of densities $p_\theta$). In particular, this holds if the action space $\bar S$ is the set of all densities on $\mathcal{X}$. In that case, when $d$ is a metric, the only remaining condition needed for the second set of inequalities is $A_0 d^2(\theta, \theta') \le d_K^2(\theta, \theta')$ for all $\theta, \theta'$. This is satisfied by the Hellinger distance and the $L_1$ distance (with $A_0 = 1$ and $A_0 = 1/2$, respectively). When $d$ is the square root K-L divergence, Condition 0 restricts the family $\{p_\theta: \theta \in S\}$ (e.g., to consist of densities with uniformly bounded logarithms), though it remains acceptable to let the action space $\bar S$ consist of all densities.
PROOF. By convexity and the chain rule [as in Barron (1987)],

(4) $$E_\theta D(p_\theta \| \hat p) \le E_\theta \Bigl( n^{-1} \sum_{i=0}^{n-1} \log \frac{p(X_{i+1}|\theta)}{p(X_{i+1}|X^i)} \Bigr) = n^{-1} E_\theta \log \frac{p(X^n|\theta)}{p(X^n)} = n^{-1} D(P_{X^n|\theta} \| P_{X^n}) \le n^{-1}\bigl(V(\varepsilon_n) + n\varepsilon_n^2\bigr) = 2\varepsilon_n^2,$$

where the last inequality is derived as in (2). Combining this upper bound with the lower bound from Theorem 1 completes the proof of Theorem 2. □
If $\varepsilon_{n,d}^2$ and $2\varepsilon_n^2$ converge to 0 at the same rate, then the minimax rate of convergence is identified by Theorem 2. For $\varepsilon_{n,d}$ and $\varepsilon_n$ to be of the same order, it is sufficient that the following two conditions hold (for a proof of this simple fact, see Lemma 4 in the Appendix).
CONDITION 1 (Metric entropy equivalence). There exist positive constants $a$, $b$ and $c$ such that when $\varepsilon$ is small enough, $M(\varepsilon) \le V(b\varepsilon) \le cM(a\varepsilon)$.

CONDITION 2 (Richness of the function class). For some $0 < a < 1$,

$$\liminf_{\varepsilon \to 0} M(a\varepsilon)/M(\varepsilon) > 1.$$
Condition 1 is the equivalence of the entropy structure under the square root K-L divergence and under the $d$ distance when $\varepsilon$ is small (as is satisfied, for instance, when all the densities in the target class are uniformly bounded above and away from 0 and $d$ is taken to be the Hellinger distance or $L_2$ distance). Condition 2 requires the density class to be large enough, namely, $M(\varepsilon)$ approaches $\infty$ at least polynomially fast in $1/\varepsilon$ as $\varepsilon \to 0$; that is, there exists a constant $\delta > 0$ such that $M(\varepsilon) \succeq \varepsilon^{-\delta}$. This condition is typical of nonparametric function classes. It is satisfied in particular if $M(\varepsilon)$ can be expressed as $M(\varepsilon) = \varepsilon^{-r} K(\varepsilon)$, where $r > 0$ and $K(a\varepsilon)/K(\varepsilon) \to 1$ as $\varepsilon \to 0$. In most situations, the metric entropies are known only up to order, which makes it generally not tractable to check $\liminf_{\varepsilon \to 0} M_d(a\varepsilon)/M_d(\varepsilon) > 1$ for the exact packing entropy function. That is why Condition 2 is stated in terms of a presumed known bound $M(\varepsilon) \le M_d(\varepsilon)$. Both Conditions 1 and 2 are satisfied if, for instance, $M(\varepsilon) \asymp V(\varepsilon) \asymp \varepsilon^{-r} K(\varepsilon)$ with $r$ and $K(\varepsilon)$ as mentioned above.
COROLLARY 1. Assume Condition 0 is satisfied for a distance $d$ satisfying $A_0 d^2(\theta, \theta') \le d_K^2(\theta, \theta')$ on $S$ and assume $\{p_\theta: \theta \in \bar S\}$ is convex. Under Conditions 1 and 2, we have

$$\min_{\hat\theta} \max_{\theta \in S} E_\theta\, d^2(\hat\theta, \theta) \asymp \min_{\hat\theta} \max_{\theta \in S} E_\theta\, d_K^2(\hat\theta, \theta) \asymp \varepsilon_n^2,$$

where $\varepsilon_n$ is determined by the equation $M_d(\varepsilon_n) = n\varepsilon_n^2$. In particular, if $\mu$ is finite, $\sup_{\theta \in S} \|\log p_\theta\|_\infty < \infty$ and Condition 2 is satisfied, then

$$\min_{\hat\theta} \max_{\theta \in S} E_\theta \|p_\theta - p_{\hat\theta}\|_2^2 \asymp \min_{\hat\theta} \max_{\theta \in S} E_\theta\, d_H^2(\hat\theta, \theta) \asymp \min_{\hat\theta} \max_{\theta \in S} E_\theta\, d_K^2(\hat\theta, \theta) \asymp \varepsilon_n^2,$$

where $\varepsilon_n$ satisfies $M_2(\varepsilon_n) = n\varepsilon_n^2$ or $M_H(\varepsilon_n) = n\varepsilon_n^2$.
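For example, if the class is rich with $M_2(\varepsilon) \asymp (1/\varepsilon)^{1/\alpha}$ for some $\alpha > 0$ (a typical order for smoothness classes; the specific form here is our illustrative assumption), then solving $M_2(\varepsilon_n) = n\varepsilon_n^2$ gives

$$\varepsilon_n^{-1/\alpha} = n\varepsilon_n^2 \quad\Longrightarrow\quad \varepsilon_n = n^{-\alpha/(2\alpha+1)},$$

so the minimax risk is of order $n^{-2\alpha/(2\alpha+1)}$.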
Corollary 1 is applicable for many smooth nonparametric classes as we shall see. However, for not very rich classes of densities (e.g., finite-dimensional families or analytic densities), the lower bound and the upper bound derived in the above way do not converge at the same rate. For instance, for a finite-dimensional class, both $M_K(\varepsilon)$ and $M_H(\varepsilon)$ may be of order $m\log(1/\varepsilon)$ for some constant $m \ge 1$, and then $\varepsilon_n$ and $\varepsilon_{n,H}$ are not of the same order, with $\varepsilon_n \asymp \sqrt{(\log n)/n}$ and $\varepsilon_{n,H} = o(1/\sqrt{n})$. For smooth finite-dimensional models, the minimax risk can be solved using some traditional statistical methods (such as Bayes procedures, the Cramér-Rao inequality, van Trees' inequality, etc.), but these techniques require more than the entropy condition. If local entropy conditions are used instead of those on global entropy, results can be obtained suitable for both parametric and nonparametric families of densities (see Section 7).
2.2. Minimax rates under $L_2$ loss. In this subsection, we derive minimax bounds for $L_2$ risk without requiring K-L covering entropy.

Let $\mathcal{F}$ be a class of density functions $f$ with respect to a probability measure $\mu$ on a measurable set $\mathcal{X}$ such as $[0,1]$. (Typically $\mathcal{F}$ will be taken to be a compact set, though it need not be; we assume only that the dominating measure $\mu$ is finite, then normalized to be a probability measure.) Let the packing entropy of $\mathcal{F}$ be $M_q(\varepsilon)$ under the $L_q(\mu)$ metric.

To derive minimax upper bounds, we derive a lemma that relates $L_2$ risk for densities that may be zero to the corresponding risk for densities bounded away from zero.

In addition to the observed i.i.d. sample $X_1, X_2, \ldots, X_n$ from $f$, let $Y_1, Y_2, \ldots, Y_n$ be a sample generated i.i.d. from the uniform distribution on $\mathcal{X}$ with respect to $\mu$ (generated independently of $X_1, \ldots, X_n$). Let $Z_i$ be $X_i$ or $Y_i$ with probability $(1/2, 1/2)$ according to the outcome of Bernoulli$(1/2)$ random variables $V_i$ generated independently for $i = 1, \ldots, n$. Then $Z_i$ has density $g(x) = (f(x)+1)/2$. Clearly the new density $g$ is bounded below (away from 0), whereas the original densities need not be. Let $\mathcal{G} = \{g: g = (f+1)/2,\, f \in \mathcal{F}\}$ be the new density class.
LEMMA 1. The minimax $L_2$ risks of the two classes $\mathcal{F}$ and $\mathcal{G}$ have the following relationship:

$$\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_2^2 \le 4 \min_{\hat g} \max_{g \in \mathcal{G}} E_{Z^n} \|g - \hat g\|_2^2,$$

where the minimization on the left-hand side is over all estimators based on $X_1, \ldots, X_n$ and the minimization on the right-hand side is over all estimators based on $n$ independent observations from $g$. Generally, for $q \ge 1$ and $q \ne 2$, we have

$$\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_q^q \le 4^q \min_{\hat g} \max_{g \in \mathcal{G}} E_{Z^n} \|g - \hat g\|_q^q.$$
PROOF. We change the estimation of $f$ to another estimation problem and show that the minimax risk of the original problem is upper bounded by the minimax risk of the new class. From any estimator in the new class (e.g., a minimax estimator), an estimator in the original problem is determined for which the risk is not greater than a multiple of the risk in the new class.

Let $\hat g$ be any density estimator of $g$ based on $Z_i$, $i = 1, \ldots, n$. Fix $q \ge 1$. Let $\tilde g$ be the density that minimizes $\|h - \hat g\|_q$ over functions in the set $\mathcal{A} = \{h: h(x) \ge 1/2, \int h(x)\, d\mu = 1\}$. (If the minimizer does not exist, then one can choose $\tilde g$ within $\epsilon$ of the infimum, proceed similarly as follows, and finally obtain the same general upper bound in Lemma 1 by letting $\epsilon \to 0$.) Then by the triangle inequality and because $g \in \mathcal{A}$, $\|g - \tilde g\|_q^q \le 2^{q-1}(\|g - \hat g\|_q^q + \|\hat g - \tilde g\|_q^q) \le 2^q \|g - \hat g\|_q^q$. For $q = 2$, this can be improved to $\|g - \tilde g\|_2 \le \|g - \hat g\|_2$ by Hilbert space convex analysis [see, e.g., Lemma 9 in Yang and Barron (1997)]. We now focus on the proof of the assertion for the $L_2$ case. The proof for general $L_q$ is similar. We construct a density estimator for $f$. Noting that $f(x) = 2g(x) - 1$, let $\hat f_{\mathrm{rand}}(x) = 2\tilde g(x) - 1$. Then $\hat f_{\mathrm{rand}}(x)$ is a nonnegative and normalized probability density estimator and depends on $X_1, \ldots, X_n$, $Y_1, \ldots, Y_n$ and the outcomes of the coin flips $V_1, \ldots, V_n$. So it is a randomized estimator. The squared $L_2$ loss of $\hat f_{\mathrm{rand}}$ is bounded as follows:

$$\int (f(x) - \hat f_{\mathrm{rand}}(x))^2\, d\mu = \int (2g(x) - 2\tilde g(x))^2\, d\mu \le 4\|g - \hat g\|_2^2.$$

To avoid randomization, we may replace $\hat f_{\mathrm{rand}}(x)$ with its expected value over $Y_1, \ldots, Y_n$ and coin flips $V_1, \ldots, V_n$ to get $\hat f(x)$ with

$$E_{X^n}\|f - \hat f\|_2^2 = E_{X^n}\|f - E_{Y^n, V^n} \hat f_{\mathrm{rand}}\|_2^2 \le E_{X^n} E_{Y^n, V^n} \|f - \hat f_{\mathrm{rand}}\|_2^2 = E_{Z^n}\|f - \hat f_{\mathrm{rand}}\|_2^2 \le 4 E_{Z^n}\|g - \hat g\|_2^2,$$

where the first inequality is by convexity and the second identity is because $\hat f_{\mathrm{rand}}$ depends on $X^n$, $Y^n$, $V^n$ only through $Z^n$. Thus $\max_{f \in \mathcal{F}} E_{X^n}\|f - \hat f\|_2^2 \le 4 \max_{g \in \mathcal{G}} E_{Z^n}\|g - \hat g\|_2^2$. Taking the minimum over estimators $\hat g$ completes the proof of Lemma 1. □
Thus the minimax risk of the original problem is upper bounded by the minimax risk on $\mathcal{G}$. Moreover, the $\varepsilon$-entropies are related. Indeed, since $\|(f_1+1)/2 - (f_2+1)/2\|_2 = (1/2)\|f_1 - f_2\|_2$, for the new class $\mathcal{G}$ the $\varepsilon$-packing entropy under $L_2$ is $M_2^{\mathcal{G}}(\varepsilon) = M_2(2\varepsilon)$.
Now we give upper and lower bounds on the minimax $L_2$ risk. Let us first get an upper bound. For the new class, the square root K-L divergence is upper bounded by multiples of the $L_2$ distance. Indeed, for densities $g_1, g_2 \in \mathcal{G}$,

$$D(g_1 \| g_2) \le \int \frac{(g_1 - g_2)^2}{g_2}\, d\mu \le 2 \int (g_1 - g_2)^2\, d\mu,$$

where the first inequality is the familiar relationship between K-L divergence and chi-square distance, and the second inequality follows because $g_2$ is lower bounded by $1/2$. Let $V_K(\varepsilon)$ denote the $d_K$ covering entropy of $\mathcal{G}$. Then $V_K(\varepsilon) \le M_2^{\mathcal{G}}(\varepsilon/\sqrt{2}) = M_2(\sqrt{2}\,\varepsilon)$. Let $\varepsilon_n$ be chosen such that $M_2(\sqrt{2}\,\varepsilon_n) = n\varepsilon_n^2$. From Theorem 2, there exists a density estimator $\hat g_0$ such that $\max_{g \in \mathcal{G}} E_{Z^n} D(g \| \hat g_0) \le 2\varepsilon_n^2$. It follows that $\max_{g \in \mathcal{G}} E_{Z^n} d_H^2(g, \hat g_0) \le 2\varepsilon_n^2$ and $\max_{g \in \mathcal{G}} E_{Z^n} \|g - \hat g_0\|_1^2 \le 8\varepsilon_n^2$. Consequently, by Lemma 1, $\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_1 \le 8\sqrt{2}\,\varepsilon_n$. To get a good estimator in terms of $L_2$ risk, we assume $\sup_{f \in \mathcal{F}} \|f\|_\infty \le L < \infty$. Let $\tilde g$ be the density in $\mathcal{G}$ that is closest to $\hat g_0$ in Hellinger distance. Then by the triangle inequality,

$$\max_{g \in \mathcal{G}} E_{Z^n} d_H^2(g, \tilde g) \le 2 \max_{g \in \mathcal{G}} E_{Z^n} d_H^2(g, \hat g_0) + 2 \max_{g \in \mathcal{G}} E_{Z^n} d_H^2(\tilde g, \hat g_0) \le 4 \max_{g \in \mathcal{G}} E_{Z^n} d_H^2(g, \hat g_0) \le 8\varepsilon_n^2.$$

Now because both $\|g\|_\infty$ and $\|\tilde g\|_\infty$ are bounded by $(L+1)/2$,

$$\int (g - \tilde g)^2\, d\mu = \int (\sqrt{g} - \sqrt{\tilde g})^2 (\sqrt{g} + \sqrt{\tilde g})^2\, d\mu \le 2(L+1)\, d_H^2(g, \tilde g).$$

Thus $\max_{g \in \mathcal{G}} E_{Z^n} \|g - \tilde g\|_2^2 \le 16(L+1)\varepsilon_n^2$. Using Lemma 1 again, we have an upper bound on the minimax squared $L_2$ risk. The action space $\bar S$ consists of all probability densities.
THEOREM 3. Let $M_2(\varepsilon)$ be the $L_2$ packing entropy of a density class $\mathcal{F}$ with respect to a probability measure. Let $\varepsilon_n$ satisfy $M_2(\sqrt{2}\,\varepsilon_n) = n\varepsilon_n^2$. Then

$$\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_1 \le 8\sqrt{2}\,\varepsilon_n.$$

If in addition $\sup_{f \in \mathcal{F}} \|f\|_\infty \le L < \infty$, then

$$\min_{\hat f} \max_{f \in \mathcal{F}} E \|f - \hat f\|_2^2 \le 256(L+1)\varepsilon_n^2.$$

The above result upper bounds the minimax $L_1$ risk and $L_2$ risk (under sup-norm boundedness for $L_2$) using only the $L_2$ metric entropy. Using the relationship between $L_q$ norms, namely $\|f_1 - f_2\|_q \le \|f_1 - f_2\|_2$ for $1 \le q \le 2$, under $\sup_{f \in \mathcal{F}} \|f\|_\infty \le L < \infty$ we have

$$\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_q^2 \preceq \varepsilon_n^2 \quad \text{for } 1 \le q \le 2.$$
To get a minimax lower bound, we use the following assumption, which is satisfied by many classical classes such as Besov, Lipschitz, the class of monotone densities and more.

CONDITION 3. There exists at least one density $f^* \in \mathcal{F}$ with $\min_{x \in \mathcal{X}} f^*(x) = \bar C > 0$ and a positive constant $\alpha \in (0,1)$ such that $\mathcal{F}_0 = \{(1-\alpha)f^* + \alpha g: g \in \mathcal{F}\} \subset \mathcal{F}$.

For a convex class of densities, Condition 3 is satisfied if there is at least one density bounded away from zero. Under Condition 3, the subclass $\mathcal{F}_0$ has $L_2$ packing entropy $M_0(\varepsilon) = M_2(\varepsilon/\alpha)$ and for two densities $f_1$ and $f_2$ in $\mathcal{F}_0$,

$$D(f_1 \| f_2) \le \int \frac{(f_1 - f_2)^2}{f_2}\, d\mu \le \frac{1}{(1-\alpha)\bar C} \int (f_1 - f_2)^2\, d\mu.$$

Thus applying Theorem 1 on $\mathcal{F}_0$, and then applying Theorem 2, we have the following conclusion.
THEOREM 4. Suppose Condition 3 is satisfied. Let $M_2(\varepsilon)$ be the $L_2$ packing entropy of the density class $\mathcal{F}$ with respect to a probability measure, let $\varepsilon_n$ satisfy $M_2\bigl(\sqrt{2(1-\alpha)\bar C}\,\varepsilon_n/\alpha\bigr) = n\varepsilon_n^2$ and let $\tilde\varepsilon_n$ be chosen such that $M_2(\tilde\varepsilon_n/\alpha) = 4n\varepsilon_n^2 + 2\log 2$. Then

$$\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_2^2 \ge \tilde\varepsilon_n^2/8.$$

Moreover, if the class $\mathcal{F}$ is rich using the $L_2$ distance (Condition 2), then with $\varepsilon_n$ determined by $M_2(\varepsilon_n) = n\varepsilon_n^2$:

(i) If $M_2(\varepsilon) \asymp M_1(\varepsilon)$ holds, then $\min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_1 \asymp \varepsilon_n$.

(ii) If $\sup_{f \in \mathcal{F}} \|f\|_\infty < \infty$, then $\min_{\hat f} \max_{f \in \mathcal{F}} E \|f - \hat f\|_2^2 \asymp \varepsilon_n^2$.
Using the relationship between the $L_2$ and $L_q$ ($1 \le q < 2$) distances and applying Theorem 1, we have the following corollary.

COROLLARY 2. Suppose $\mathcal{F}$ is rich using both the $L_2$ distance and the $L_q$ distance for some $q \in [1,2)$. Assume Condition 3 is satisfied. Let $\varepsilon_{n,q}$ satisfy $M_q(\varepsilon_{n,q}) = n\varepsilon_n^2$. Then

$$\varepsilon_{n,q}^2 \preceq \min_{\hat f} \max_{f \in \mathcal{F}} E_{X^n} \|f - \hat f\|_q^2 \preceq \varepsilon_n^2.$$
If the packing entropies under $L_2$ and $L_q$ are equivalent (which is the case for many familiar nonparametric classes; see Section 6 for examples), then the above upper and lower bounds converge at the same rate. Generally, for a uniformly upper bounded density class $\mathcal{F}$ on a compact set, because $\int (f-g)^2\, d\mu \le \|f+g\|_\infty \int |f-g|\, d\mu$, we know $M_1(\varepsilon) \le M_2(\varepsilon) \le M_1\bigl(\varepsilon^2/(2\sup_{f \in \mathcal{F}} \|f\|_\infty)\bigr)$. Then the corresponding lower bound for $L_1$ risk may vary from $\varepsilon_n$ to $\varepsilon_n^2$ depending on how different the two entropies are [see also Birgé (1986)].
2.3. Minimax rates under K-L and Hellinger loss. For the square root K-L divergence, Condition 0 is not necessarily satisfied for general classes of densities. When it is not satisfied (for instance, if the densities in the class have different supports), the following lemma is helpful to lower bound as well as upper bound the K-L risk involving only the Hellinger metric entropy, by relating it to the covering entropy under $d_K$.
We consider estimating a density defined on $\mathcal{X}$ with respect to a measure $\mu$ with $\mu(\mathcal{X}) = 1$ for the rest of this section and Section 2.4.

LEMMA 2. For each probability density $g$, positive $T$ and $0 < \varepsilon \le \sqrt{2}$, there exists a probability density $\tilde g$ such that for every density $f$ with $\|f\|_\infty \le T$ and $d_H(f,g) \le \varepsilon$, the K-L divergence $D(f \| \tilde g)$ satisfies

$$D(f \| \tilde g) \le 2\bigl(2 + \log(9T/4\varepsilon^2)\bigr)\bigl(9 + 8(8T-1)^2\bigr)\varepsilon^2.$$
Bounds analogous to Lemma 2 are in Barron, Birgé and Massart [(1999), Proposition 1] and Wong and Shen [(1995), Theorem 5]. A proof of this lemma is given in the Appendix.

We now show how Lemma 2 can be used to give rates of convergence under Hellinger and K-L loss without forcing the density to be bounded away from zero. Let $\mathcal{F}$ be a density class with $\|f\|_\infty \le T$ for each $f \in \mathcal{F}$. Assume, for simplicity, that the metric entropy $M_H(\varepsilon)$ under $d_H$ is of order $\varepsilon^{-1/\alpha}$ for some $\alpha > 0$. Then by replacing each $g$ in a Hellinger covering with the associated $\tilde g$, Lemma 2 implies that we obtain a $d_K$ covering with $V_K(\varepsilon) \preceq \bigl(\sqrt{\log(1/\varepsilon)}/\varepsilon\bigr)^{1/\alpha}$, and we obtain the associated upper bound on the K-L risk from Theorem 2. For the lower bound, use the fact that $d_K$ is no smaller than the Hellinger distance, for which we have the packing entropy. (Note that the $d_K$ packing entropy, which may be infinite, may not be used here since Condition 0 is not satisfied.) Consequently, from Theorem 1 and Theorem 2, we have

$$(n \log n)^{-2\alpha/(2\alpha+1)} \preceq \min_{\hat f} \max_{f \in \mathcal{F}} E\, d_H^2(f, \hat f) \le \min_{\hat f} \max_{f \in \mathcal{F}} E\, D(f \| \hat f) \preceq n^{-2\alpha/(2\alpha+1)} (\log n)^{1/(2\alpha+1)}.$$

Here, the minimization is over all estimators $\hat f$; that is, $\bar S$ is the class of all densities.

Thus even if the densities in $\mathcal{F}$ may be 0 or near 0, the minimax K-L risk is within a logarithmic factor of the squared Hellinger risk. See Barron, Birgé and Massart (1999) for related conclusions.
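To see where these rates come from, solve for the critical radii under the standing assumption $M_H(\varepsilon) \asymp \varepsilon^{-1/\alpha}$ of this paragraph:

$$n\varepsilon_n^2 = V_K(\varepsilon_n) \asymp \Bigl(\frac{\sqrt{\log(1/\varepsilon_n)}}{\varepsilon_n}\Bigr)^{1/\alpha} \;\Longrightarrow\; \varepsilon_n^2 \asymp n^{-2\alpha/(2\alpha+1)} (\log n)^{1/(2\alpha+1)},$$

$$M_H(\varepsilon_{n,H}) = 4n\varepsilon_n^2 + 2\log 2 \;\Longrightarrow\; \varepsilon_{n,H}^2 \asymp (n \log n)^{-2\alpha/(2\alpha+1)},$$

matching the two sides of the display above.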
2.4. Some more results for risks when densities may be 0. In this section, we show that by modifying a nonparametric class of densities with uniformly bounded logarithms to allow the densities to approach zero at some points or even vanish on some subsets, the minimax rates of convergence under K-L and Hellinger (and $L_2$) loss may remain unchanged compared to that of the original class. Densities in the new class are obtained from the original nonparametric class by multiplying by members of a smaller class (often a parametric class) of functions that may be near zero. The result is applicable to the following example.
EXAMPLE (Densities with support on an unknown set of $k$ intervals). Let

$$\mathcal{F} = \Bigl\{ h(x) \sum_{i=1}^{k} b_i\, 1\{a_{i-1} \le x < a_i\} / c :\; h \in \mathcal{H},\; 0 = a_0 < a_1 < \cdots < a_k = 1,\; \gamma_1 \le \sum_{i=1}^{k} b_i (a_i - a_{i-1}),\; 0 \le b_i \le \gamma_2,\; 1 \le i \le k \Bigr\}$$

($c$ is the normalizing constant). Here $\mathcal{H}$ is a class of functions with uniformly bounded logarithms, $k$ is a positive integer, and $\gamma_1$ and $\gamma_2$ are positive constants. The constants $\gamma_1$ and $\gamma_2$ force the densities in $\mathcal{F}$ to be uniformly upper bounded. These densities may be 0 on some intervals and arbitrarily close to 0 on others.

Note that if the densities in $\mathcal{H}$ are continuous, then the densities in $\mathcal{F}$ have at most $k-1$ discontinuity points. For instance, if $\mathcal{H}$ is a Lipschitz class, then the functions in $\mathcal{F}$ are piecewise Lipschitz.
Somewhat more generally, let $\mathcal{G}$ be a class of nonnegative functions satisfying $\|g\|_\infty \le C$ and $\int g\, d\mu \ge c$ for all $g \in \mathcal{G}$. Though the $g$'s are nonnegative, they may equal zero on some subsets. Suppose $0 < c \le h \le C < \infty$ for all $h \in \mathcal{H}$. Consider a class of densities with respect to $\mu$:

$$\mathcal{F} = \Bigl\{ f(x) = \frac{h(x) g(x)}{\int h g\, d\mu} :\; h \in \mathcal{H},\; g \in \mathcal{G} \Bigr\}.$$

Let $\bar{\mathcal{G}} = \{g / \int g\, d\mu : g \in \mathcal{G}\}$ be the density class corresponding to $\mathcal{G}$ and similarly define $\bar{\mathcal{H}}$. Let $M_2(\varepsilon; \bar{\mathcal{H}})$ be the packing entropy of $\bar{\mathcal{H}}$ under the $L_2$ distance and let $V_K(\varepsilon; \mathcal{F})$ and $V_K(\varepsilon; \bar{\mathcal{G}})$ be covering entropies, as defined in Section 2, of $\mathcal{F}$ and $\bar{\mathcal{G}}$, respectively, under $d_K$. The action space $\bar S$ consists of all probability densities.
THEOREM 5. Suppose $V_K(\varepsilon; \bar{\mathcal{G}}) \le A_1 M_2(A_2 \varepsilon; \bar{\mathcal{H}})$ for some positive constants $A_1, A_2$ and suppose the class $\bar{\mathcal{H}}$ is rich in the sense that $\liminf_{\varepsilon \to 0} M_2(a\varepsilon; \bar{\mathcal{H}})/M_2(\varepsilon; \bar{\mathcal{H}}) > 1$ for some $0 < a < 1$. If $\mathcal{G}$ contains a constant function, then

$$\min_{\hat f} \max_{f \in \mathcal{F}} E_f D(f \| \hat f) \asymp \min_{\hat f} \max_{f \in \mathcal{F}} E_f\, d_H^2(f, \hat f) \asymp \min_{\hat f} \max_{f \in \mathcal{F}} E_f \|f - \hat f\|_2^2 \asymp \varepsilon_n^2,$$

where $\varepsilon_n$ is determined by $M_2(\varepsilon_n; \bar{\mathcal{H}}) = n\varepsilon_n^2$.
The result basically says that if $\mathcal{G}$ is smaller than the class $\mathcal{H}$ in an $\varepsilon$-entropy sense, then for the new class, being 0 or close to 0 due to $\mathcal{G}$ does not hurt the K-L risk rate. To apply the result, we still need to bound the covering entropy of $\bar{\mathcal{G}}$ under $d_K$. One approach is as follows. Suppose the Hellinger metric entropy of $\bar{\mathcal{G}}$ satisfies $M_H(\varepsilon/\sqrt{\log(1/\varepsilon)}; \bar{\mathcal{G}}) \le M_2(A\varepsilon; \bar{\mathcal{H}})$ for some constant $A > 0$. From Lemma 2, $V_K(\varepsilon; \bar{\mathcal{G}}) \le M_H(A'\varepsilon/\sqrt{\log(1/\varepsilon)}; \bar{\mathcal{G}})$ for some constant $A' > 0$ when $\varepsilon$ is sufficiently small (note that the elements $\tilde g$ of the cover from Lemma 2 are not necessarily in $\bar{\mathcal{G}}$, though they are in the set $\bar S$ of all probability densities). Then we have $V_K(\varepsilon; \bar{\mathcal{G}}) \le M_2(AA'\varepsilon; \bar{\mathcal{H}})$. This covers the example above (for which case $\mathcal{G}$ is a parametric family) and it even allows $\mathcal{G}$ to be almost as large as $\mathcal{H}$. An example is $\mathcal{H} = \{h: \log h \in B^{\alpha}_{\sigma,q}(C)\}$ and $\mathcal{G} = \{g: g \in B^{\alpha'}_{\sigma,q}(C),\; g \ge 0 \text{ and } \int g\, d\mu = 1\}$ with $\alpha'$ bigger than but arbitrarily close to $\alpha$ (for a definition of Besov classes, see Section 6). Note that here the density can be 0 on infinitely many subintervals. Other examples are given in Yang and Barron (1997).
PROOF. By Theorem 2, we have $\min_{\hat f} \max_{f \in \mathcal{F}} E_f D(f \| \hat f) \preceq \bar\varepsilon_n^2$, where $\bar\varepsilon_n$ satisfies $V_K(\bar\varepsilon_n; \mathcal{F}) = n\bar\varepsilon_n^2$. Under the entropy condition that $\mathcal{G}$ is smaller than $\mathcal{H}$, it can be shown that $V_K(\varepsilon; \mathcal{F}) \le A_1 M_2(A_2\varepsilon; \bar{\mathcal{H}})$ for some constants $A_1$ and $A_2$ [see Yang and Barron (1997) for details]. Then, under the assumption of the richness of $\bar{\mathcal{H}}$, we have $\bar\varepsilon_n = O(\varepsilon_n)$. Thus $\varepsilon_n^2$ upper bounds the minimax risk rates under both $d_K^2$ and $d_H^2$. Because $\mathcal{G}$ contains a constant function, $\bar{\mathcal{H}}$ is a subset of $\mathcal{F}$. Since the log-densities in $\bar{\mathcal{H}}$ are uniformly bounded, the $L_2$ metric entropies of $\bar{\mathcal{H}}$ and $\mathcal{H}$ are of the same order. Thus, taking $S_0 = \bar{\mathcal{H}}$, the lower bound rate under K-L or squared Hellinger or squared $L_2$ distance is of order $\varepsilon_n^2$ by Corollary 1. Because the densities in $\mathcal{F}$ are uniformly upper bounded, the $L_2$ distance between two densities in $\mathcal{F}$ is upper bounded by a multiple of the Hellinger distance and of $d_K$. Thus, under the assumptions in the theorem, the $L_2$ metric entropy of $\mathcal{F}$ satisfies $M_2(\varepsilon; \mathcal{F}) \le V_K(A\varepsilon; \mathcal{F}) \preceq M_2(A'\varepsilon; \bar{\mathcal{H}})$ for some positive constants $A$ and $A'$. Consequently, the minimax $L_2$ risk is upper bounded by order $\varepsilon_n^2$ by Theorem 3. This completes the proof of Theorem 5. □
3. Applications in data compression and regression. Two cases when the K-L divergence plays a direct role include data compression, where total K-L divergence provides the redundancy of universal codes, and regression with Gaussian errors, where the individual K-L divergence between two Gaussian models is the customary squared $L_2$ distance. A small amount of additional work is required in the regression case to convert a predictive density estimator into a regression estimator via a minimum Hellinger distance argument.

3.1. Data compression. Let $X_1, \ldots, X_n$ be an i.i.d. sample of discrete random variables from $p_\theta(x)$, $\theta \in S$. Let $q_n(x_1, \ldots, x_n)$ be a density (probability mass) function. The redundancy of the Shannon code using density $q_n$ is the difference of its expected codelength and the expected codelength of the Shannon code using the true density $p_\theta(x_1, \ldots, x_n)$, that is, $D(p_\theta^n \| q_n)$. Formally, we examine the minimax properties of the game with loss $D(p_\theta^n \| q_n)$ for continuous random variables also. In that case, $D(p_\theta^n \| q_n)$ corresponds to the redundancy in the limit of fine quantization of the random variables [see, e.g., Clarke and Barron (1990), pages 459 and 460].
The asymptotics of redundancy lower bounds have been considered by Rissanen (1984), Clarke and Barron (1990, 1994), Rissanen, Speed and Yu (1992) and others. These results were derived for smooth parametric families or a specific smooth nonparametric class. We here give general redundancy lower bounds for nonparametric classes. The key property revealed by the chain rule is the relationship between the minimax value of the game with loss $D(p_\theta^n \| q_n)$ and the minimax cumulative K-L risk,

$$\min_{q_n} \max_{\theta \in S} D(p_\theta^n \| q_n) = \min_{(\hat p_i)} \max_{\theta \in S} \sum_{i=0}^{n-1} E_\theta D(p_\theta \| \hat p_i),$$

where the minimization on the left is over all joint densities $q_n(x_1, \ldots, x_n)$ and the minimization on the right is over all sequences of estimators $\hat p_i$ based on samples of size $i = 0, 1, \ldots, n-1$ (for $i = 0$, it is any fixed density). Indeed, one has $D(p_\theta^n \| q_n) = \sum_{i=0}^{n-1} E_\theta D(p_\theta \| \hat p_i)$ when $q_n(x_1, \ldots, x_n) = \prod_{i=0}^{n-1} \hat p_{X^i}(x_{i+1})$ [cf. Barron and Hengartner (1998)]. Let $R_n = \min_{q_n} \max_{\theta \in S} D(p_\theta^n \| q_n)$ be the minimax total K-L divergence and let $r_n = \min_{\hat p} \max_{\theta \in S} E_\theta D(p_\theta \| \hat p)$
be the individual minimax K-L risk. Then

(5) $$n r_{n-1} \le R_n \le \sum_{i=0}^{n-1} r_i.$$

Here the first inequality $r_{n-1} \le n^{-1} R_n$ follows from noting that, as in (4), for any joint density $q_n$ on the sample space of $X_1, \ldots, X_n$, there exists an estimator $\hat p$ such that $E_\theta D(p_\theta \| \hat p) \le n^{-1} D(p_\theta^n \| q_n)$ for all $\theta \in S$. The second inequality $R_n \le \sum_{i=0}^{n-1} r_i$ is from the fact that the maximum (over $\theta$ in $S$) of the sum of risks is not greater than the sum of the maxima.

In particular, we see that $r_n \asymp R_n/n$ when $r_n \asymp n^{-1} \sum_{i=0}^{n-1} r_i$. Effectively, this means that the minimax individual K-L risk and $1/n$ times the minimax total K-L divergence match asymptotically when $r_n$ converges at a rate sufficiently slower than $1/n$ (e.g., $n^{-p}$ with $0 < p < 1$).
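For instance, if $r_n \asymp n^{-p}$ with $0 < p < 1$, then

$$\sum_{i=0}^{n-1} r_i \asymp \sum_{i=1}^{n} i^{-p} \asymp n^{1-p} \asymp n\, r_n,$$

so both sides of (5) are of order $n^{1-p}$ and indeed $r_n \asymp R_n/n$.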
Now let $d(\theta, \theta')$ be a metric on $S$ and assume that $\{p_\theta: \theta \in \bar S\}$ contains all probability densities on $\mathcal{X}$. Let $M_d(\varepsilon)$ be the packing entropy of $S$ under $d$ and let $V(\varepsilon)$ be an upper bound on the covering entropy $V_K(\varepsilon)$ of $S$ under $d_K$. Choose $\varepsilon_n$ such that $\varepsilon_n^2 = V(\varepsilon_n)/n$ and choose $\varepsilon_{n,d}$ such that $M_d(\varepsilon_{n,d}) = 4n\varepsilon_n^2 + 2\log 2$. Based on Theorem 1, (2) and inequality (5), we have the following result.

COROLLARY 3. Assume that $D(p_\theta \| p_{\theta'}) \ge A_0 d^2(\theta, \theta')$ for all $\theta, \theta' \in S$. Then we have

$$(A_0/8)\, n \varepsilon_{n,d}^2 \le \min_{q_n} \max_{\theta \in S} D(p_\theta^n \| q_n) \le 2 n \varepsilon_n^2,$$

where the minimization is over all densities on $\mathcal{X}^n$.

Two choices that satisfy the requirements are the Hellinger distance and the $L_1$ distance.
When interest is focused on the cumulative K-L risk (or on the individual risk $r_n$ in the case that $r_n \asymp n^{-1} \sum_{i=0}^{n-1} r_i$), direct proofs of suitable bounds are possible without the use of Fano's inequality. See Haussler and Opper (1997) for new results in that direction. Another proof using an idea of Rissanen (1984) is in Yang and Barron (1997).
3.2. Application in nonparametric regression. Consider the regression model

$$Y_i = u(x_i) + \epsilon_i, \qquad i = 1, \ldots, n.$$

Suppose the errors $\epsilon_i$, $1 \le i \le n$, are i.i.d. with the Normal$(0, \sigma^2)$ distribution. The explanatory variables $x_i$, $1 \le i \le n$, are i.i.d. with a fixed distribution $P$. The regression function $u$ is assumed to be in a function class $\mathcal{U}$. For this case, the square root K-L divergence between the joint densities of $(X, Y)$ in the family is a metric. Let $\|u - v\|_2 = (\int (u(x) - v(x))^2\, dP)^{1/2}$ be the $L_2(P)$ distance with respect to the measure induced by $X$. Let $M_2(\varepsilon)$ and, similarly for $q \ge 1$, $M_q(\varepsilon)$ be the $\varepsilon$-packing entropies under $L_2(P)$ and $L_q(P)$, respectively. Assume $M_2(\varepsilon)$ and $M_q(\varepsilon)$ are both rich (in accordance with Condition 2). Choose $\varepsilon_n$ such that $M_2(\varepsilon_n) = n\varepsilon_n^2$. Similarly let $\varepsilon_{n,q}$ be determined by $M_q(\varepsilon_{n,q}) = n\varepsilon_n^2$.
THEOREM 6. Assume $\sup_{u \in \mathcal{U}} \|u\|_\infty \le L$. Then

$$\min_{\hat u} \max_{u \in \mathcal{U}} E \|u - \hat u\|_2^2 \asymp \varepsilon_n^2.$$

For the minimax $L_q(P)$ risk, we have

$$\min_{\hat u} \max_{u \in \mathcal{U}} E \|u - \hat u\|_q^2 \succeq \varepsilon_{n,q}^2.$$

If, further, $M_2(\varepsilon) \asymp M_q(\varepsilon)$ for $1 \le q < 2$, then for the $L_q(P)$ risk we have

$$\min_{\hat u} \max_{u \in \mathcal{U}} E \|u - \hat u\|_q^2 \asymp \varepsilon_n^2.$$
PROOF. With no loss of generality, suppose that $P_X$ has a density $h(x)$ with respect to a measure $\lambda$. Let $p_u(x,y) = (2\pi\sigma^2)^{-1/2} \exp(-(y - u(x))^2/2\sigma^2)\, h(x)$ denote the joint density of $(X, Y)$ with regression function $u$. Then $D(p_u \| p_v)$ reduces, as is well known, to $(1/2\sigma^2) \int (u(x) - v(x))^2 h(x)\, d\lambda$, so that $d_K$ is equivalent to the $L_2(P)$ distance. The lower rates then follow from Theorem 1 together with the richness assumption on the entropies.

We next determine upper bounds for regression by specializing the bound from Theorem 2. We assume $\|u\|_\infty \le L$ uniformly for $u \in \mathcal{U}$. Theorem 2 provides a density estimator $\hat p_n$ such that $\max_{u \in \mathcal{U}} E\, D(p_u \| \hat p_n) \le 2\varepsilon_n^2$. It follows that $\max_{u \in \mathcal{U}} E\, d_H^2(p_u, \hat p_n) \le 2\varepsilon_n^2$. [A similar conclusion is available in Birgé (1986), Theorem 3.1.] Here we take advantage of the fact that when the density $h(x)$ is fixed, $\hat p_n(x,y)$ takes the form $h(x)\hat g_n(y|x)$, where $\hat g_n(y|x)$ is an estimate of the conditional density of $y$ given $x$ (it happens to be a mixture of Gaussians using a posterior based on a uniform prior on $\varepsilon$-nets). For given $x$ and $(X_i, Y_i)_{i=1}^n$, let $\hat u_n(x)$ be the minimizer of the Hellinger distance $d_H(\hat g_n(\cdot|x), \phi_z)$ between $\hat g_n(y|x)$ and the normal density $\phi_z(y)$ with mean $z$ and the given variance, over choices of $z$ with $|z| \le L$. Then $\hat u_n(x)$ is an estimator of $u(x)$ based on $(X_i, Y_i)_{i=1}^n$. By the triangle inequality, given $x$ and $(X_i, Y_i)_{i=1}^n$,

$$d_H(\phi_{u(x)}, \phi_{\hat u_n(x)}) \le d_H(\phi_{u(x)}, \hat g_n(\cdot|x)) + d_H(\phi_{\hat u_n(x)}, \hat g_n(\cdot|x)) \le 2\, d_H(\phi_{u(x)}, \hat g_n(\cdot|x)).$$

It follows that $\max_{u \in \mathcal{U}} E\, d_H^2(p_u, p_{\hat u_n}) \le 4 \max_{u \in \mathcal{U}} E\, d_H^2(p_u, \hat p_n) \le 8\varepsilon_n^2$. Now $E\, d_H^2(p_u, p_{\hat u_n}) = 2 E \int h(x) \bigl(1 - \exp(-(u(x) - \hat u_n(x))^2/8\sigma^2)\bigr)\, d\lambda$. The concave function $1 - e^{-v}$ is above the chord $(v/B)(1 - e^{-B})$ for $0 \le v \le B$. Thus, using $v = (u(x) - \hat u_n(x))^2/8\sigma^2$ and $B = L^2/2\sigma^2$, we obtain

$$\max_{u \in \mathcal{U}} E \int (u(x) - \hat u_n(x))^2 h(x)\, d\lambda \le 2L^2 \bigl(1 - \exp(-L^2/2\sigma^2)\bigr)^{-1} \max_{u \in \mathcal{U}} E\, d_H^2(p_u, p_{\hat u_n}) \preceq \varepsilon_n^2.$$

This completes the proof of Theorem 6. □
4. Linear approximation and minimax rates. In this section, we apply the main results and known metric entropy bounds to give a general conclusion on the relationship between linear approximation and minimax rates of convergence. A more general and detailed treatment is in Yang and Barron (1997).

Let $\Phi = \{\varphi_1 \equiv 1, \varphi_2, \ldots, \varphi_k, \ldots\}$ be a fundamental sequence in $L_2[0,1]^d$ (that is, linear combinations are dense in $L_2[0,1]^d$). Let $\Gamma = \{\gamma_0, \ldots, \gamma_k, \ldots\}$ be a decreasing sequence of positive numbers for which there exist $0 < c' \le c < 1$ such that

(6) $$c'\gamma_k \le \gamma_{2k} \le c\gamma_k,$$

as is true for $\gamma_k \asymp k^{-\alpha}$ and also for $\gamma_k \asymp k^{-\alpha}(\log k)^{\beta}$, $\alpha > 0$, $\beta \in \mathbb{R}$. Let $\eta_0(f) = \|f\|_2$ and $\eta_k(f) = \min_{\{a_i\}} \|f - \sum_{i=1}^{k} a_i \varphi_i\|_2$ for $k \ge 1$ be the $k$th degree of approximation of $f \in L_2[0,1]^d$ by the system $\Phi$. Let $\mathcal{S}(\Gamma, \Phi)$ be all functions in $L_2[0,1]^d$ with the approximation errors bounded by $\Gamma$; that is,

$$\mathcal{S}(\Gamma, \Phi) = \{f \in L_2[0,1]^d: \eta_k(f) \le \gamma_k,\; k = 0, 1, \ldots\}.$$

These are called the full approximation sets. Lorentz (1966) gives metric entropy bounds on these classes (actually in more generality than stated here) and the bounds are used to derive metric entropy orders for a variety of function classes including Sobolev classes.

Suppose the functions in $\mathcal{S}(\Gamma, \Phi)$ are uniformly bounded, that is, $\sup_{g \in \mathcal{S}(\Gamma, \Phi)} \|g\|_\infty \le \rho$ for some positive constant $\rho$. Let $\mathcal{F}(\Gamma, \Phi)$ be all the probability density functions in $\mathcal{S}(\Gamma, \Phi)$. When $\gamma_0$ is large enough, the $L_2$ metric entropies of $\mathcal{S}(\Gamma, \Phi)$ and $\mathcal{F}(\Gamma, \Phi)$ are of the same order. Let $k_n$ be chosen such that $\gamma_{k_n}^2 \asymp k_n/n$.
THEOREM 7. The minimax rate of convergence for a full approximation set of functions is determined simply as follows:

$$\min_{\hat f} \max_{f \in \mathcal{F}(\Gamma, \Phi)} E \|f - \hat f\|_2^2 \asymp k_n/n.$$

A similar result holds for regression. Note that there is no special requirement on the basis $\Phi$ (the functions may even not be continuous). For such a case, it seems hard to find a local packing set for the purpose of directly applying the lower bounding results of Birgé (1983).

The conclusion of the theorem follows from Theorem 4. Basically, from Lorentz, the metric entropy of $\mathcal{F}(\Gamma, \Phi)$ is of order $k_\varepsilon = \inf\{k: \gamma_k \le \varepsilon\}$ and $\mathcal{F}(\Gamma, \Phi)$ is rich in $L_2$ distance. As a consequence, $\varepsilon_n$ determined by $M(\varepsilon_n) = n\varepsilon_n^2$ also balances $\gamma_k^2$ and $k/n$.

As an illustration, for a system $\Phi$, consider the functions that can be approximated by the linear system with polynomially decreasing approximation error $\gamma_k \asymp k^{-\alpha}$, $\alpha > 0$. Then $\min_{\hat f} \max_{f \in \mathcal{F}(\Gamma, \Phi)} E \|f - \hat f\|_2^2 \asymp n^{-2\alpha/(1+2\alpha)}$. Similarly we have rate $n^{-2\alpha/(1+2\alpha)} (\log n)^{2\beta/(1+2\alpha)}$ if $\gamma_k \asymp k^{-\alpha} (\log k)^{\beta}$.
As we have seen, the optimal convergence rate in this full approximation setting is of the same order as $\min_k (\gamma_k^2 + k/n)$, which we recognize as the familiar bias-squared plus variance trade-off for mean squared error. Indeed, for regression as in Section 3, with $y_i = u(x_i) + \epsilon_i$, $u \in \mathcal{S}(\Gamma, \Phi)$, approximation systems yield natural and well-known estimates that achieve this rate. This trade-off is familiar in the literature [see, e.g., Cox (1988) for least squares regression estimates, Čencov (1982) or Barron and Sheu (1991) for maximum likelihood log-density estimates, and Birgé and Massart (1996) for projective density estimators and other contrasts].

The best rate $k_n$ is, of course, unknown in applications, suggesting the need for a good model selection criterion to choose a suitable size model to balance the two kinds of errors automatically based on data. For recent results on model selection, see, for instance, Barron, Birgé and Massart (1999) and Yang and Barron (1998). There, knowledge of the optimal convergence rates for various situations is still of interest, because it permits one to gauge the extent to which an automatic procedure adapts to multiple function classes.

To sum up this section, if a function class $\mathcal{F}$ is contained in $\mathcal{S}(\Gamma, \Phi)$ and contains $\mathcal{S}(\Gamma', \Phi')$ for some pair of fundamental sequences $\Phi$ and $\Phi'$ for which the $\gamma_k$, $\gamma_k'$ sequences yield $\varepsilon_n \asymp \varepsilon_n'$, then $\varepsilon_n$ provides the minimax rate for $\mathcal{F}$ and, moreover (under the conditions discussed above), minimax optimal estimates are available from suitable linear estimates. However, some interesting function classes do not permit linear estimators to be minimax rate optimal [see Nemirovskii (1985), Nemirovskii, Polyak and Tsybakov (1985), Donoho, Johnstone, Kerkyacharian and Picard (1996)]. Lack of a full approximation set characterization does not preclude determination of the metric entropy by other approximation-theoretic means in specific cases, as will be seen in the next two sections.
5. Sparse approximations and minimax rates. In the previous section, full approximation sets of functions are defined through linear approximation with respect to a given system $\Phi$. There, to get a given accuracy of approximation $\varepsilon$, one uses the first $k_\varepsilon$ basis functions with $k_\varepsilon = \min\{i: \gamma_i \le \varepsilon\}$. This choice works for all $g \in \mathcal{S}(\Gamma, \Phi)$, and these basis functions are needed to get the accuracy $\varepsilon$ for some $g \in \mathcal{S}(\Gamma, \Phi)$. For slowly converging sequences $\gamma_k$, very large $k_\varepsilon$ is needed to get accuracy $\varepsilon$. This phenomenon occurs especially in high-dimensional function approximation. For instance, if one uses full approximation with any chosen basis for a $d$-dimensional Sobolev class with all $\alpha$ ($\alpha \ge 1$) partial derivatives well behaved, the approximation error with $k$ terms cannot converge faster than $k^{-\alpha/d}$. It then becomes of interest to examine approximation using manageable size subsets of terms, sparse in comparison to the total that would be needed with full approximation. We next give minimax results for some sparse function classes.

Let $\Phi$ and $\Gamma$ be as in the previous section. Let $I_k \ge k$, $k \ge 1$, be a given nondecreasing sequence of integers satisfying $\liminf_k I_k/k = \infty$ ($I_0 = 0$) and let $\mathcal{I} = \{I_1, I_2, \ldots\}$. Let $\eta_k^{\mathcal{I}}(g) = \min_{1 \le l_1, \ldots, l_k \le I_k} \min_{\{a_i\}} \|g - \sum_{i=1}^{k} a_i \varphi_{l_i}\|_2$ be called the $k$th degree of sparse approximation of $g \in L_2[0,1]^d$ by the system $\Phi$.
Here for $k = 0$ there is no approximation and $\eta_0^{\mathcal{I}}(g) = \|g\|_2$. The $k$th term used to approximate $g$ is selected from $I_k$ basis functions. Let $\mathcal{S}^{\mathcal{I}}(\Gamma, \Phi) = \mathcal{S}(\Gamma, \Phi, \mathcal{I})$ be all functions in $L_2[0,1]^d$ with the sparse approximation errors bounded by $\Gamma$, that is,

$$\mathcal{S}^{\mathcal{I}}(\Gamma, \Phi) = \{g \in L_2[0,1]^d: \eta_k^{\mathcal{I}}(g) \le \gamma_k,\; k = 0, 1, \ldots\}.$$

We call it a sparse approximation set of functions (for a fixed choice of $\mathcal{I}$). Larger $I_k$'s provide considerably more freedom of approximation.

In terms of metric entropy, a sparse approximation set is not much larger than the corresponding full approximation set $\mathcal{S}(\Gamma, \Phi)$ it contains. Indeed, as will be shown, its metric entropy is larger by at most a logarithmic factor under the condition $I_k \le k^{\tau}$ for some possibly large $\tau > 1$. If full approximation is used instead to approximate a sparse approximation set, one is actually approximating a class with much larger metric entropy than $\mathcal{S}^{\mathcal{I}}(\Gamma, \Phi)$ if $I_k$ is much bigger than $k$ [see Yang and Barron (1997)].

For a class of examples, let $\mathcal{T}$ be a class of functions uniformly bounded by $v$ with an $L_2$ $\varepsilon$-cover of cardinality of order $(1/\varepsilon)^{d'}$ [metric entropy $d' \log(1/\varepsilon)$] for some positive constant $d'$. Let $\mathcal{C}$ be the closure of its convex hull. For $k \ge 1$, let $\varphi_{I_{k-1}+1}, \ldots, \varphi_{I_k}$ in $\Phi$ be the members of a $v/(2k^{1/2})$-cover of $\mathcal{T}$. Here $I_k - I_{k-1}$ is of order $k^{d'/2}$. Then $\mathcal{C} \subset \mathcal{S}^{\mathcal{I}}(\Gamma, \Phi)$ with $\gamma_k = 4v/k^{1/2}$. That is, the closure of the convex hull of the class can be uniformly sparsely approximated by a system consisting of suitably chosen members of the class at rate $k^{-1/2}$ with $k$ sparse terms out of about $k^{d'/2}$ many candidates. This containment result, $\mathcal{C} \subset \mathcal{S}^{\mathcal{I}}(\Gamma, \Phi)$, can be verified using greedy approximation [see Jones (1992) or Barron (1993), Section 8], which we omit here. A specific example of $\mathcal{T}$ is $\{\sigma(ax + b): a \in [-1,1]^d, b \in \mathbb{R}\}$, where $\sigma$ is a fixed sigmoidal function satisfying a Lipschitz condition, such as $\sigma(z) = (e^z - 1)/(e^z + 1)$, or a sinusoidal function $\sigma(z) = \sin(z)$. For metric entropy bounds on $\mathcal{C}$, see Dudley (1987), with refinements in Ball and Pajor (1990).
Now let us prove metric entropy bounds on $\mathcal{F}_J(\Gamma, \Phi)$. Because $\mathcal{F}(\Gamma, \Phi) \subset \mathcal{F}_J(\Gamma, \Phi)$, the previous lower bound for $\mathcal{F}(\Gamma, \Phi)$ is a lower bound for $\mathcal{F}_J(\Gamma, \Phi)$. We next derive an upper bound. Let $l_i \le I_i$, $1 \le i \le k$, be fixed for a moment. Consider the subset $\mathcal{A}_{l_1, \ldots, l_k}$ of $\mathcal{F}_J(\Gamma, \Phi)$ in the span of $\phi_{l_1}, \ldots, \phi_{l_k}$ that has approximation errors bounded by $\gamma_1, \ldots, \gamma_k$ using the basis $\phi_{l_1}, \ldots, \phi_{l_k}$ (i.e., $g \in \mathcal{A}_{l_1, \ldots, l_k}$ if and only if $g = \sum_{i=1}^{k} a_i^* \phi_{l_i}$ for some coefficients $a_i^*$ and $\min_{a_1, \ldots, a_m} \| g - \sum_{i=1}^{m} a_i \phi_{l_i} \|_2 \le \gamma_m$ for $1 \le m \le k$). From the previous section, we know the $\varepsilon$-entropy of $\mathcal{A}_{l_1, \ldots, l_k}$ is upper bounded by order $k_\varepsilon = \inf\{k: \gamma_k \le \varepsilon/2\}$. Based on the construction, it is not hard to see that an $\varepsilon/2$-net in $\bigcup_{l_i \le I_i,\ 1 \le i \le k_\varepsilon} \mathcal{A}_{l_1, \ldots, l_{k_\varepsilon}}$ is an $\varepsilon$-net for $\mathcal{F}_J(\Gamma, \Phi)$. There are fewer than $\binom{I_{k_\varepsilon}}{k_\varepsilon}$ many choices of the basis $\phi_{l_i}$, $1 \le i \le k_\varepsilon$; thus the $\varepsilon$-entropy of $\mathcal{F}_J(\Gamma, \Phi)$ is upper bounded by order $k_\varepsilon + \log \binom{I_{k_\varepsilon}}{k_\varepsilon} = O(k_\varepsilon \log(\varepsilon^{-1}))$ under the assumption $I_k \le k^\tau$ for some $\tau > 1$. As seen in Section 4, if $\gamma_k \asymp k^{-\alpha}(\log k)^{-\beta}$, we have that $k_\varepsilon$ is of order $\varepsilon^{-1/\alpha}(\log(\varepsilon^{-1}))^{-\beta/\alpha}$. Then the metric entropy of $\mathcal{F}_J(\Gamma, \Phi)$ is bounded by order $\varepsilon^{-1/\alpha}(\log(\varepsilon^{-1}))^{1-\beta/\alpha}$.
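The counting step in the bound above can be spelled out as follows (a routine computation added here for clarity):
$$\log \binom{I_{k_\varepsilon}}{k_\varepsilon} \le k_\varepsilon \log\frac{e I_{k_\varepsilon}}{k_\varepsilon} \le k_\varepsilon \log\bigl(e\, k_\varepsilon^{\tau-1}\bigr) = O(k_\varepsilon \log k_\varepsilon) = O\bigl(k_\varepsilon \log(\varepsilon^{-1})\bigr),$$
using $\binom{N}{k} \le (eN/k)^k$, the assumption $I_k \le k^\tau$, and the fact that $\log k_\varepsilon = O(\log(\varepsilon^{-1}))$ when $k_\varepsilon$ grows polynomially in $\varepsilon^{-1}$.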
Thus we pay at most a price of a $\log(\varepsilon^{-1})$ factor to cover the larger class $\mathcal{F}_J(\Gamma, \Phi)$ with a greater freedom of approximation.
For density estimation, suppose the functions in $\mathcal{F}_J(\Gamma, \Phi)$ are uniformly upper bounded. Let $\bar{\mathcal{F}}_J(\Gamma, \Phi)$ be all the probability density functions in $\mathcal{F}_J(\Gamma, \Phi)$. When $\gamma_0$ is large enough, the $L_2$ metric entropies of $\mathcal{F}_J(\Gamma, \Phi)$ and $\bar{\mathcal{F}}_J(\Gamma, \Phi)$ are of the same order. Let $\varepsilon_n$ satisfy $n\varepsilon_n^2 = k_{\varepsilon_n} \log(\varepsilon_n^{-1})$ and let $\underline{\varepsilon}_n$ satisfy $k_{\underline{\varepsilon}_n} = n\underline{\varepsilon}_n^2$. Then applying Theorem 3 together with the lower bound on $\mathcal{F}(\Gamma, \Phi)$, we have derived the following result.
THEOREM 8. The minimax rate of convergence for a sparse approximation set satisfies
$$\underline{\varepsilon}_n^2 \preceq \min_{\hat f} \max_{f \in \bar{\mathcal{F}}_J(\Gamma, \Phi)} E\|f - \hat f\|^2 \preceq \varepsilon_n^2.$$
As a special case, if $\gamma_k \asymp k^{-\alpha}$, $\alpha > 0$, then
$$n^{-2\alpha/(1+2\alpha)} \preceq \min_{\hat f} \max_{f \in \bar{\mathcal{F}}_J(\Gamma, \Phi)} E\|f - \hat f\|^2 \preceq (n/\log n)^{-2\alpha/(1+2\alpha)}.$$
Note that the upper and lower bound rates differ only in a logarithmic factor.
From the proofs of the minimax upper bounds, the estimators there are constructed based on Bayes averaging over the $\varepsilon_n$-net of $\bar{\mathcal{F}}_J(\Gamma, \Phi)$. In this context, it is also natural to consider estimators based on subset selection. Upper bound results in this direction can be found in Barron, Birgé and Massart (1999), Yang (1999a) and Yang and Barron (1998).
sets whereY(F, (F) is unchangedby the GramSchmidtprocess,sparse approximation
is notpreservedbyorthogonalization.
ofthosefunctions
Nonetheless,consideration
that are approximatedwell by
oforthonormal
sparse combinations
basis has the advantagethat conditions
can be moreeasily expresseddirectlyin termsof the coefficients.
Here we
discussconsequencesofDonoho'streatment(1993) ofsparse orthonormal
apforminimaxstatisticalrisks.
proximation
Let $\{\psi_1, \psi_2, \ldots\}$ be a given orthonormal basis in $L_2[0,1]$. For $0 < q < 2$, let
$$\mathcal{F}_q(C_1, C_2, \beta) = \Bigl\{ g = \sum_{i=1}^{\infty} \xi_i \psi_i :\ \sum_{i=1}^{\infty} |\xi_i|^q \le C_1 \ \text{and}\ \sum_{i \ge l} |\xi_i|^2 \le C_2 l^{-\beta} \ \text{for all}\ l \ge 1 \Bigr\}.$$
Here $C_1$, $C_2$ and $\beta$ are positive constants, though $\beta$ may be quite small, for example, $s/d$ with $s$ smaller than $d$. The condition $\sum_{i \ge l} |\xi_i|^2 \le C_2 l^{-\beta}$ is used to make the target class small enough to have convergent estimators in $L_2$ norm. Roughly speaking, the sparsity of the class comes from the condition $\sum_{i=1}^{\infty} |\xi_i|^q \le C_1$, which implies that the $i$th largest coefficient satisfies $|\xi_{(i)}|^q \le C_1/i$, and that selection of the $k$ largest coefficients is sufficient to achieve a small remaining sum of squares. The condition $\sum_{i \ge l} |\xi_i|^2 \le C_2 l^{-\beta}$ is used to ensure that it suffices to select the $k$ largest coefficients from the first $I_k = k^\tau$ terms with sufficiently large $\tau$.
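The selection rule just described can be sketched in a few lines (an illustrative helper of ours, not the paper's estimator): keep the $k$ largest-magnitude coefficients among the first $I_k = k^\tau$, zero out the rest, and observe the tail sum of squares decaying like $k^{1-2/q}$:

import numpy as np

def sparse_truncate(xi, k, tau=2.0):
    # Keep the k largest-magnitude coefficients among the first floor(k**tau)
    # entries of xi and zero the rest (illustrative helper, ours).
    I_k = min(len(xi), int(k ** tau))
    out = np.zeros_like(xi)
    keep = np.argsort(np.abs(xi[:I_k]))[-k:]
    out[keep] = xi[keep]
    return out

# Coefficients with weak-l_q decay |xi_(i)| <= (C1 / i)**(1/q):
q, C1 = 0.5, 1.0
i = np.arange(1, 10001)
xi = (C1 / i) ** (1.0 / q)
for k in (10, 30, 100):
    err2 = ((xi - sparse_truncate(xi, k)) ** 2).sum()
    print(k, err2, k ** (1.0 - 2.0 / q))  # tail sum of squares ~ k**(1 - 2/q)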
The class $\mathcal{F}_q(C_1, C_2, \beta)$ is a special case of function classes with an unconditional basis. Let $\mathcal{G}$ be a uniformly bounded function class and $\Phi = \{\phi_1, \phi_2, \ldots\}$ be an orthonormal basis in $L_2$. The basis $\Phi$ is said to be unconditional for $\mathcal{G}$ if for any $g = \sum_{i \ge 1} \theta_i \phi_i \in \mathcal{G}$, we have $\sum_{i \ge 1} \xi_i \phi_i \in \mathcal{G}$ for all $\xi = (\xi_1, \xi_2, \ldots)$ with $|\xi_i| \le |\theta_i|$, $i \ge 1$. Donoho (1993, 1996) gives results on the metric entropy of these classes, proves that an unconditional basis for a function class gives essentially the best sparse representation of the functions, and shows that simple thresholding estimators are nearly optimal. We here apply the main theorems on minimax rates in this paper to these sparse function classes using his metric entropy results.
Let $q^* = q^*(\mathcal{G}) = \inf\{q: \sup_{i \ge 1} i^{1/q} |\xi_{(i)}| < \infty\}$, where $\xi_{(1)}, \xi_{(2)}, \ldots$ are the coefficients ordered in decreasing magnitude $|\xi_{(1)}| \ge |\xi_{(2)}| \ge \cdots$. This $q^*$ is called the sparsity index in Donoho (1993, 1996). Let $\alpha^* = \alpha^*(\mathcal{G}) = \sup\{\alpha: M_2(\varepsilon) = O(\varepsilon^{-1/\alpha})\}$ be the optimal exponent of the $L_2$ metric entropy $M_2(\varepsilon)$ of $\mathcal{G}$. From Donoho (1996), if $\mathcal{G}$ further satisfies the condition $\sum_{i \ge l} |\xi_i|^2 \le C l^{-\beta}$ for all $l \ge 1$, then $\alpha^* = 1/q^* - 1/2$; that is, for any $0 < \alpha_1 < \alpha^* < \alpha_2$, $\varepsilon^{-1/\alpha_2} \preceq M_2(\varepsilon) \preceq \varepsilon^{-1/\alpha_1}$. Let $\mathcal{G}_{C'} = \{f: f/C' \in \mathcal{G} \text{ and } f \text{ is a density}\}$. Then when $C'$ is large enough, $\mathcal{G}_{C'}$ has the same metric entropy order as that of $\mathcal{G}$. Thus under the same conditions, we have
$$n^{-2\alpha_2/(2\alpha_2+1)} \preceq \min_{\hat f} \max_{f \in \mathcal{G}_{C'}} E\|f - \hat f\|^2 \preceq n^{-2\alpha_1/(2\alpha_1+1)}$$
for any $0 < \alpha_1 < \alpha^* < \alpha_2$. The first order in the exponent of the minimax risk is $-2\alpha^*/(2\alpha^* + 1)$.
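A quick heuristic for the identity $\alpha^* = 1/q^* - 1/2$ (our sketch of the standard calculation, not Donoho's proof): if $|\xi_{(i)}| \asymp i^{-1/q}$, then keeping the $k$ largest coefficients leaves
$$\sum_{i > k} |\xi_{(i)}|^2 \asymp \sum_{i > k} i^{-2/q} \asymp k^{1 - 2/q} = k^{-2(1/q - 1/2)},$$
so the squared $k$-term approximation error decays like $k^{-2\alpha}$ with $\alpha = 1/q - 1/2$, which corresponds to metric entropy $M_2(\varepsilon) \asymp \varepsilon^{-1/\alpha}$ by the entropy-approximation correspondence used in Section 4.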
For the special case of $\mathcal{F}_q(C_1, C_2, \beta)$, better entropy bounds are available based on Edmunds and Triebel [(1987), page 141] [see Yang and Barron (1997)], resulting in a sharper upper rate of $n^{-2\alpha/(2\alpha+1)}$ when $\beta/2 > 1/q - 1/2$ and of $(n/\log n)^{-2\alpha/(2\alpha+1)}$ when $0 < \beta/2 < 1/q - 1/2$, where $\alpha = 1/q - 1/2$.
6. Examples. In this section, we demonstrate the applications of the theorems developed in the previous sections. As will be seen from the following examples, once we know the order of metric entropy of a target class, the minimax rate can be determined right away for many nonparametric classes without additional work. For results on metric entropy orders of various function classes, see Lorentz, Golitschek and Makovoz (1996) and references cited there.
6.1. Ellipsoidal classes in $L_2$. Let $\{\phi_0 = 1, \phi_1, \phi_2, \ldots\}$ be a complete orthonormal system in $L_2[0,1]$. For an increasing sequence of constants $b_k$ with $b_1 \ge 1$ and $b_k \to \infty$, define an ellipsoidal class $\mathcal{E}(\{b_k\}, C) = \{g = \sum_k \theta_k \phi_k : \sum_k b_k^2 \theta_k^2 \le C\}$. Define $m(t) = \sup\{i: b_i \le t\}$ and let $l(t) = \int_0^t m(s)/s \, ds$. Then from Mitjagin (1961), one knows that the $L_2$ covering metric entropy of $\mathcal{E}(\{b_k\}, C)$ satisfies $l(1/2\varepsilon) \le V_2(\varepsilon) \le l(8/\varepsilon)$. For the special case with $b_k = k^a$ ($a > 0$), the metric entropy order $\varepsilon^{-1/a}$ determined above was previously obtained by Kolmogorov and Tihomirov (1959) for the trigonometric basis. [When $a > 1$ or $b_k$ increases suitably fast, the entropy rate can also be derived using the results of Lorentz (1966) on full approximation sets.]
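For instance (a routine computation included here for concreteness), with $b_k = k^a$ we have
$$m(t) = \sup\{i: i^a \le t\} = \lfloor t^{1/a} \rfloor, \qquad l(t) = \int_0^t \frac{m(s)}{s}\,ds \asymp \int_0^t s^{1/a - 1}\,ds = a\, t^{1/a},$$
so the Mitjagin bounds $l(1/2\varepsilon) \le V_2(\varepsilon) \le l(8/\varepsilon)$ give $V_2(\varepsilon) \asymp \varepsilon^{-1/a}$.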
6.2. Classes of functions with bounded mixed differences. As in Temlyakov (1989), define the function classes $H_q^r$ on $\mathbb{T}^d = [-\pi, \pi]^d$ having bounded mixed differences as follows. Let $k = (k_1, \ldots, k_d)$ be a vector of integers, $q \ge 1$, and $r = (r_1, \ldots, r_d)$ with $r_1 = \cdots = r_v < r_{v+1} \le \cdots \le r_d$. Let $L_q(\mathbb{T}^d)$ denote the periodic functions on $\mathbb{T}^d$ with finite norm $\|g\|_{L_q(\mathbb{T}^d)} = ((2\pi)^{-d} \int_{\mathbb{T}^d} |g(x)|^q \, dx)^{1/q}$. Denote by $H_q^r$ the class of functions $g \in L_q(\mathbb{T}^d)$ with $\int_{-\pi}^{\pi} g(x) \, dx_i = 0$ for $1 \le i \le d$, and $\|\Delta_t^l g(x)\|_{L_q(\mathbb{T}^d)} \le \prod_{j=1}^{d} |t_j|^{r_j}$ ($l > \max_j r_j$), where $t = (t_1, \ldots, t_d)$ and $\Delta_t^l$ is the mixed $l$th difference with step $t_j$ in the variable $x_j$, that is, $\Delta_t^l g(x) = \Delta_{t_d}^l \cdots \Delta_{t_1}^l g(x_1, \ldots, x_d)$. From Temlyakov (1989), for $r = (r_1, \ldots, r_d)$ with $r_1 > 1$, $1 < p < \infty$ and $1 < q < \infty$,
$$M_p(\varepsilon; H_q^r) \asymp (1/\varepsilon)^{1/r_1} (\log 1/\varepsilon)^{(1+1/(2r_1))(v-1)}.$$
Functions in this class are uniformly bounded.
6.3. Besov classes. Let $\Delta_h^r(g, x) = \sum_{k=0}^{r} \binom{r}{k} (-1)^{r-k} g(x + kh)$. Then the $r$th modulus of smoothness of $g \in L_q[0,1]$ ($0 < q < \infty$), or of $g \in C[0,1]$ if $q = \infty$, is defined by $\omega_r(g, t)_q = \sup_{0 < h \le t} \|\Delta_h^r(g, \cdot)\|_q$. Let $\alpha > 0$, $r = [\alpha] + 1$ and
$$|g|_{B^{\alpha}_{\sigma,q}} = \Bigl( \int_0^{\infty} \bigl( t^{-\alpha} \omega_r(g, t)_q \bigr)^{\sigma} \frac{dt}{t} \Bigr)^{1/\sigma} \ \text{for } 0 < \sigma < \infty, \qquad |g|_{B^{\alpha}_{\infty,q}} = \sup_{t > 0} t^{-\alpha} \omega_r(g, t)_q \ \text{for } \sigma = \infty.$$
Then the Besov norm is defined as $\|g\|_{B^{\alpha}_{\sigma,q}} = \|g\|_q + |g|_{B^{\alpha}_{\sigma,q}}$ [see DeVore and Lorentz (1993)]. Closely related are Triebel or $F$ classes, which can be handled similarly. For definitions and characterizations of Besov (and $F$) classes in the $d$-dimensional case, see Triebel (1975). They include many well-known function spaces such as Holder-Zygmund spaces, Sobolev spaces, fractional Sobolev spaces or Bessel potential spaces, and inhomogeneous Hardy spaces.
For $0 < \sigma, q \le \infty$ ($q < \infty$ for $F$) and $\alpha > 0$, let $B^{\alpha}_{\sigma,q}(C)$ be the collection of all functions $g \in L_q[0,1]^d$ such that $\|g\|_{B^{\alpha}_{\sigma,q}} \le C$. Building on the conclusions previously obtained for Sobolev classes by Birman and Solomjak (1974), Triebel (1975), with refinements in Carl (1981), showed that for $1 \le \sigma \le \infty$, $1 \le p, q \le \infty$ and $\alpha/d > 1/q - 1/p$,
$$M_p(\varepsilon; B^{\alpha}_{\sigma,q}(C)) \asymp (1/\varepsilon)^{d/\alpha}.$$
6.4. Bounded variation and Lipschitz classes. The function class $BV(C)$ consists of all functions $g(x)$ on $[0,1]$ satisfying $\|g\|_{\infty} \le C$ and $V(g) := \sup \sum_{i=1}^{m-1} |g(x_{i+1}) - g(x_i)| \le C$, where the supremum is taken over all finite sequences $x_1 < x_2 < \cdots < x_m$ in $[0,1]$. For $0 < \alpha \le 1$, let $\mathrm{Lip}_{\alpha,q}(C) = \{g: \|g(\cdot + h) - g(\cdot)\|_q \le C h^{\alpha} \text{ and } \|g\|_q \le C\}$ be a Lipschitz class. When $\alpha > 1/q - 1/p$, $1 \le p, q \le \infty$, the metric entropy satisfies $M_p(\varepsilon; \mathrm{Lip}_{\alpha,q}(C)) \asymp (1/\varepsilon)^{1/\alpha}$ [see Birman and Solomjak (1974)]. For the class of functions in $BV(C)$, with suitable modification of the value assigned at discontinuity points as in DeVore and Lorentz [(1993), Chapter 2], one has $\mathrm{Lip}_{1,\infty}(C) \subset BV(C) \subset \mathrm{Lip}_{1,1}(C)$, and since the $L_p$ ($1 \le p < \infty$) metric entropies of $\mathrm{Lip}_{1,1}(C)$ and $\mathrm{Lip}_{1,\infty}(C)$ are both of order $1/\varepsilon$, the $L_p$ ($1 \le p < \infty$) metric entropy of $BV(C)$ is also of order $1/\varepsilon$.
6.5. Classes of functions with moduli of continuity of derivatives bounded by fixed functions. Instead of Lipschitz requirements, one may consider more general bounds on the moduli of continuity of derivatives. Let $\Lambda_d^{r,\omega} = \Lambda_d^{r,\omega}(C_0, C_1, \ldots, C_r)$ be the collection of all functions $g$ on $[0,1]^d$ which have all partial derivatives $\|D^k g\|_2 \le C_k$, $|k| = k = 0, 1, \ldots, r$, and for which the modulus of continuity in the $L_2$ norm of each $r$th derivative is bounded by a function $\omega$. Here $\omega$ is any given modulus of continuity [for a definition, see DeVore and Lorentz (1993), page 41]. Let $\delta = \delta(\varepsilon)$ be defined by the equation $\delta^r \omega(\delta) = \varepsilon$. Then if $r \ge 1$, again from Lorentz (1966), the $L_2$ metric entropy of $\Lambda_d^{r,\omega}$ is of order $(\delta(\varepsilon))^{-d}$.
6.6. Classes of functions with different moduli of smoothness with respect to different variables. Let $k_1, \ldots, k_d$ be positive integers and $0 < \beta_i \le k_i$, $1 \le i \le d$. Let $k = (k_1, \ldots, k_d)$ and $\beta = (\beta_1, \ldots, \beta_d)$. Let $V(k, \beta, C)$ be the collection of all functions $g$ on $[0,1]^d$ with $\|g\|_{\infty} \le C$ and $\sup_{|h| \le t} \|\Delta_h^{k_i} g\|_2 \le C t^{\beta_i}$, where $\Delta_h^{k_i}$ is the $k_i$th difference with step $h$ in variable $x_i$. From Lorentz, Golitschek and Makovoz [(1996), page 921], the $L_2$ metric entropy order of $V(k, \beta, C)$ is $(1/\varepsilon)^{\sum_{i=1}^{d} 1/\beta_i}$.
6.7. Classes $E_d^{a,k}(C)$ of periodic functions. Let $E_d^{a,k}(C)$ ($a > 1/2$ and $k \ge 0$) be the collection of periodic functions on $[0, 2\pi]^d$ of the form
$$g(x_1, \ldots, x_d) = \sum_{m_1, \ldots, m_d = -\infty}^{\infty} \Bigl( a_{m_1 \cdots m_d} \cos\Bigl(\sum_{i=1}^{d} m_i x_i\Bigr) + b_{m_1 \cdots m_d} \sin\Bigl(\sum_{i=1}^{d} m_i x_i\Bigr) \Bigr)$$
with $|a_{m_1 \cdots m_d}| \vee |b_{m_1 \cdots m_d}| \le C (\bar{m}_1 \cdots \bar{m}_d)^{-a} \log^k(\bar{m}_1 \cdots \bar{m}_d + 1)$, where $\bar{m} = \max(m, 1)$. The $L_2$ metric entropy is of order $(1/\varepsilon)^{1/(a-1/2)} \log^{(2k+2a(d-1))/(2a-1)}(1/\varepsilon)$ by Smoljak (1960). Note that for these classes, the dependence of entropy orders on the input dimension $d$ is only through logarithmic factors.
6.8. Neural network classes. Let $N(C)$ be the closure in $L_2[0,1]^d$ of the set of all functions $g: \mathbb{R}^d \to \mathbb{R}$ of the form $g(x) = c_0 + \sum_i c_i \sigma(v_i \cdot x + b_i)$, with $|c_0| + \sum_i |c_i| \le C$ and $|v_i| = 1$, where $\sigma$ is a fixed sigmoidal function with $\sigma(t) \to 1$ as $t \to \infty$ and $\sigma(t) \to 0$ as $t \to -\infty$. We further require that $\sigma$ is either the step function $\sigma^*(t) = 1$ for $t \ge 0$ and $\sigma^*(t) = 0$ for $t < 0$, or satisfies the Lipschitz requirement that $|\sigma(t) - \sigma(t')| \le C_1 |t - t'|$ for some $C_1$, together with $|\sigma(t) - \sigma^*(t)| \le C_2 |t|^{-\gamma}$ for some $C_2$ and $\gamma > 0$ for all $t \ne 0$. Approximations to functions in $N(C)$ using $k$ sigmoids achieve $L_2$ error bounded by $C/k^{1/2}$ [as shown in Barron (1993)], and using this approximation bound, Barron [(1994), page 125] gives certain metric entropy bounds. The approximation error cannot be made uniformly smaller than $(1/k)^{1/2+1/d+\delta}$ for $\delta > 0$, as shown in Barron [(1991), Theorem 3]. Makovoz [(1996), pages 108 and 109] improves the approximation upper bound to a constant times $(1/k)^{1/2+1/(2d)}$ and uses these bounds to show that the $L_2$ metric entropy of the class $N(C)$ with either the step sigmoid or a Lipschitz sigmoid satisfies
$$(1/\varepsilon)^{1/(1/2+1/d)} \preceq M_2(\varepsilon) \preceq (1/\varepsilon)^{1/(1/2+1/(2d))} \log(1/\varepsilon).$$
For $d = 2$, a better lower bound matches the upper bound in the exponent:
$$(1/\varepsilon)^{4/3} \log^{-2/3}(1/\varepsilon) \preceq M_2(\varepsilon) \preceq (1/\varepsilon)^{4/3} \log(1/\varepsilon).$$
For $d = 1$, $N(C)$ is equivalent to a bounded variation class.
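For a concrete picture of the class $N(C)$, the following sketch (ours) constructs a random member using a logistic sigmoid, which satisfies the Lipschitz and tail conditions above:

import numpy as np

# Build a random element of N(C): g(x) = c0 + sum_i c_i * sigma(v_i . x + b_i),
# with |c0| + sum_i |c_i| <= C and |v_i| = 1 (illustrative sketch, ours).
rng = np.random.default_rng(1)
d, m, C = 3, 50, 2.0
sigma = lambda t: 1.0 / (1.0 + np.exp(-t))  # sigmoidal: tends to 1 and 0 at +-infinity

V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit direction vectors v_i
b = rng.uniform(-2.0, 2.0, size=m)
c = rng.normal(size=m + 1)
c *= C / np.abs(c).sum()                       # enforce the l1 budget on (c0, ..., c_m)

def g(x):
    # Evaluate the network at points x of shape (n, d).
    return c[0] + sigma(x @ V.T + b) @ c[1:]

print(g(rng.uniform(0.0, 1.0, size=(5, d))))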
Convergence rates. We give convergence rates for density estimation. Unless stated otherwise, attention is restricted to densities in the corresponding class. Norm parameters of the function classes are assumed to be large enough so that the restriction does not change the metric entropy order. For regression, the rates of convergence are the same for each of these function classes, assuming the design density (with respect to Lebesgue measure) is bounded above and away from zero.
1. Assume the functions in $\mathcal{E}(\{b_k\}, C)$ are uniformly bounded by $C_0$ and assume that $l(t)$ satisfies $\liminf_{t \to \infty} l(\beta t)/l(t) > 1$ for some fixed constant $\beta > 1$. Let $\varepsilon_n$ be determined by $l(1/\varepsilon_n) = n\varepsilon_n^2$; we have
$$\min_{\hat f} \max_{f \in \mathcal{E}(\{b_k\}, C)} E\|f - \hat f\|^2 \asymp \varepsilon_n^2.$$
Specially, if $b_k = k^a$, $k \ge 1$, and the basis functions satisfy $\sup_k \|\phi_k\|_{\infty} < \infty$, then when $a > 1/2$, functions in $\mathcal{E}(\{b_k\}, C)$ are uniformly bounded. Then the rate of convergence is $n^{-2a/(2a+1)}$. See Efroimovich and Pinsker (1982) for more detailed asymptotics with the trigonometric basis and $b_k = k^a$; see Birgé (1983) for similar conclusions using a hypercube construction for the trigonometric basis and Barron, Birgé and Massart [(1999), Section 3] for general ellipsoids.
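The rate determination in item 1 can be checked numerically (our sketch, using the equation $l(1/\varepsilon_n) = n\varepsilon_n^2$ with $b_k = k^a$, for which $l(t) \approx a\, t^{1/a}$ by the computation in Section 6.1):

import numpy as np
from scipy.optimize import brentq

a = 1.0                              # ellipsoid with b_k = k**a
l = lambda t: a * t ** (1.0 / a)     # l(t) from the computation in Section 6.1
for n in (10**3, 10**5, 10**7):
    eps = brentq(lambda e: l(1.0 / e) - n * e**2, 1e-12, 0.999)
    print(n, eps**2, n ** (-2.0 * a / (2.0 * a + 1.0)))  # both ~ n**(-2a/(2a+1))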
2. Let $\bar{H}_q^r = \{f = e^g / \int e^g \, d\mu : g \in H_q^r\}$ be a uniformly bounded density class. The metric entropy of $\bar{H}_q^r$ is of the same order as for $H_q^r$. So for $r_1 > 1$, $1 < q < \infty$ and $1 \le p \le 2$, by Corollary 1, we have
$$\min_{\hat f} \max_{f \in \bar{H}_q^r} E\|f - \hat f\|_p \asymp n^{-r_1/(2r_1+1)} (\log n)^{(v-1)/2}.$$
For this density class, the minimax risk under K-L divergence or squared Hellinger distance is also of the same order as that under the squared $L_2$ distance, $n^{-2r_1/(2r_1+1)} (\log n)^{v-1}$.
3. When $\alpha/d > 1/q$, the functions in $B^{\alpha}_{\sigma,q}(C)$ are uniformly bounded. For $1 \le \sigma \le \infty$, $1 \le q \le \infty$, $1 \le p \le 2$ and $\alpha/d > 1/q$,
$$\min_{\hat f} \max_{f \in B^{\alpha}_{\sigma,q}(C)} E\|f - \hat f\|_p^2 \asymp n^{-2\alpha/(2\alpha+d)}, \qquad \min_{\hat f} \max_{f \in B^{\alpha}_{\sigma,q}(C)} E\|f - \hat f\|_p \asymp n^{-\alpha/(2\alpha+d)}.$$
When $1/q - 1/2 < \alpha/d < 1/q$, some functions in $B^{\alpha}_{\sigma,q}(C)$ are not bounded; nevertheless, this class has the same order metric entropy as the subclass $B^{\alpha}_{\sigma,\tilde q}(C)$ with $\tilde q > d/\alpha$, which is uniformly bounded. Consequently, for $C$ sufficiently large, the metric entropy of the class of densities in $B^{\alpha}_{\sigma,q}(C)$ is still of the same order as for $B^{\alpha}_{\sigma,q}(C)$. From Theorem 4, for $L_1$ risk, when $\alpha/d > 1/q - 1/2$, the rate is also $n^{-\alpha/(2\alpha+d)}$. From the monotonicity property of the $L_p$ norm in $p$, we have for $p \ge 1$ and $\alpha/d > 1/q - 1/2$,
$$\min_{\hat f} \max_{f \in B^{\alpha}_{\sigma,q}(C)} E\|f - \hat f\|_p \succeq n^{-\alpha/(2\alpha+d)}.$$
In particular, when $q \ge 2$, the last two conclusions hold for all $\alpha > 0$. Donoho, Johnstone, Kerkyacharian and Picard (1996) obtained suitable minimax bounds for the case of $p \ge q$, $\alpha > 1/q$. Here we permit smaller $\alpha$. The rates for Lipschitz classes were previously obtained by Birgé (1983) and Devroye (1987) (the latter for special cases with $p = 1$). If the log-density is assumed to be in $B^{\alpha}_{\sigma,q}(C)$, then by Corollary 1, for $\alpha/d > 1/q$, the minimax risk under K-L divergence is of the same order $n^{-2\alpha/(2\alpha+d)}$, which was previously shown by Koo and Kim (1996).
4. For $1 \le p \le 2$, the minimax rate for squared $L_p$ risk over $BV(C)$ is of order $n^{-2/3}$ ($n^{-1/3}$ for the unsquared risk). Since a class of bounded monotone functions has the same order metric entropy as a bounded variation class, we conclude immediately that it has the same minimax rate of convergence. The rate is obtained in Birgé (1983) by setting up a delicate local packing set for the lower bound.
5. Let $\varepsilon_n$ be chosen such that $(\delta(\varepsilon_n))^{-d} = n\varepsilon_n^2$. Then for $r \ge 1$, the minimax squared $L_2$ risk for $\Lambda_d^{r,\omega}$ is at rate $\varepsilon_n^2$.
6. Let $\alpha^{-1} = \sum_{i=1}^{d} \beta_i^{-1}$; then the minimax squared $L_2$ risk for $V(k, \beta, C)$ is at rate $n^{-2\alpha/(2\alpha+1)}$.
7. The class $E_d^{a,k}(C)$ is uniformly bounded when $a > 1$. Then
$$\min_{\hat f} \max_{f \in E_d^{a,k}(C)} E\|f - \hat f\|^2 \asymp n^{-(a-1/2)/a} (\log n)^{(k+a(d-1))/a}.$$
8. We have upper and lower rates
$$n^{-(1+2/d)/(2+1/d)} (\log n)^{-(1+1/d)(1+2/d)/(2+1/d)} \preceq \min_{\hat f} \max_{f \in N(C)} E\|f - \hat f\|^2 \preceq (n/\log n)^{-(1+1/d)/(2+1/d)},$$
and for $d = 2$, using the better lower bound on the metric entropy, we have
$$n^{-3/5} (\log n)^{-9/10} \preceq \min_{\hat f} \max_{f \in N(C)} E\|f - \hat f\|^2 \preceq (n/\log n)^{-3/5}.$$
For this case, it seems unclear how one could set up a local packing set for applying Birgé's lower bounding result. Though the upper and lower rates do not agree (except for $d = 2$, ignoring a logarithmic factor), when $d$ is moderately large, the rates are roughly $n^{-1/2}$, which is independent of $d$.
7. Relationship between global and local metric entropies. As stated in the introduction, the previous use of Fano's inequality to derive the minimax rates of convergence involved local metric entropy calculation. To apply this technique, constructions of special local packing sets capturing the essential difficulty of estimating a density in the target class are seemingly required, and they are usually done with a hypercube argument. We have shown in Section 2 that the global metric entropy alone determines the minimax lower rates of convergence for typical nonparametric density classes. Thus there is no need to put in effort in search of special local packing sets. After the distribution of an early version of this paper, which contained the main results in Section 2, we realized a connection between the global metric entropy and local metric entropy. In fact, the global metric entropy ensures the existence of at least one local packing set which has the property required for the use of Birgé's argument. This fact also allows one to bypass the special constructions. Here we show this connection between the global metric entropy and local metric entropy and comment on the uses of these metric entropies.
For simplicity, we consider the case when $d$ is a metric. Suppose the global packing entropy of $S$ under the distance $d$ is $M(\varepsilon)$.
DEFINITION (Local metric entropy). The local $\varepsilon$-entropy at $\theta \in S$ is the logarithm of the largest $(\varepsilon/2)$-packing set in $B(\theta, \varepsilon) = \{\theta' \in S: d(\theta', \theta) \le \varepsilon\}$. The local $\varepsilon$-entropy at $\theta$ is denoted by $M(\varepsilon \mid \theta)$. The local $\varepsilon$-entropy of $S$ is defined as $M^{\mathrm{loc}}(\varepsilon) = \max_{\theta \in S} M(\varepsilon \mid \theta)$.
LEMMA 3. The global and local metric entropies have the following relationship:
$$M(\varepsilon/2) - M(\varepsilon) \le M^{\mathrm{loc}}(\varepsilon) \le M(\varepsilon/2).$$
PROOF. Let $N_\varepsilon$ and $N_{\varepsilon/2}$ be the largest $\varepsilon$-packing set and the largest $\varepsilon/2$-packing set, respectively, in $S$. Let us partition $N_{\varepsilon/2}$ into $|N_\varepsilon|$ parts according to the minimum distance rule (the Voronoi partition). For $\theta \in N_\varepsilon$, let $R_\theta = \{\theta' \in N_{\varepsilon/2}: \theta = \arg\min_{\tilde\theta \in N_\varepsilon} d(\theta', \tilde\theta)\}$ be the points in $N_{\varepsilon/2}$ that are closest to $\theta$ (if a point in $N_{\varepsilon/2}$ has the same distance to two different points in $N_\varepsilon$, any rule can be used to ensure $R_\theta \cap R_{\tilde\theta} = \emptyset$). Note that $N_{\varepsilon/2} = \bigcup_{\theta \in N_\varepsilon} R_\theta$. Since $N_\varepsilon$ is the largest packing set, for any $\theta' \in N_{\varepsilon/2}$, there exists $\theta \in N_\varepsilon$ such that $d(\theta', \theta) \le \varepsilon$. It follows that $d(\theta', \theta) \le \varepsilon$ for $\theta' \in R_\theta$. From the above, we have
$$|N_{\varepsilon/2}| = \sum_{\theta \in N_\varepsilon} |R_\theta|.$$
Roughly speaking, the ratio of the numbers of points in the two packing sets characterizes the average local packing capability.
From the above identity, there exists at least one $\theta^* \in N_\varepsilon$ with $|R_{\theta^*}| \ge |N_{\varepsilon/2}|/|N_\varepsilon|$. Thus we have $M(\varepsilon \mid \theta^*) \ge M(\varepsilon/2) - M(\varepsilon)$. On the other hand,
by concavity of the log function,
$$M(\varepsilon/2) - M(\varepsilon) = \log\Bigl( \frac{1}{|N_\varepsilon|} \sum_{\theta \in N_\varepsilon} |R_\theta| \Bigr) \ge \frac{1}{|N_\varepsilon|} \sum_{\theta \in N_\varepsilon} \log(|R_\theta|).$$
Thus an average of the local $\varepsilon$-entropies is upper bounded by the difference of the two global metric entropies $M(\varepsilon/2) - M(\varepsilon)$. An obvious upper bound on the local $\varepsilon$-entropies is $M(\varepsilon/2)$. This ends the proof of Lemma 3. $\Box$
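Lemma 3 can be illustrated on a toy example (ours): greedy packings of the unit interval under the absolute distance, comparing $M(\varepsilon/2) - M(\varepsilon)$, a local packing count, and $M(\varepsilon/2)$:

import numpy as np

def greedy_packing(points, eps):
    # Greedy eps-packing: keep a point if it is more than eps away from all
    # points kept so far. Greedy packings are maximal, so their logarithms
    # approximate the packing entropies up to constants (toy version, ours).
    kept = []
    for p in points:
        if all(abs(p - q) > eps for q in kept):
            kept.append(p)
    return kept

S = np.linspace(0.0, 1.0, 20001)  # dense stand-in for S = [0, 1] with d(x, y) = |x - y|
eps = 0.01
M = np.log(len(greedy_packing(S, eps)))
M_half = np.log(len(greedy_packing(S, eps / 2)))
ball = S[np.abs(S - 0.5) <= eps]  # B(theta, eps) around theta = 0.5
M_loc = np.log(len(greedy_packing(ball, eps / 2)))
print(M_half - M, M_loc, M_half)  # expect M_half - M <= M_loc <= M_half (approximately)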
The sets $R_\theta$ in the proof are local packing sets which have the property that the diameter of the set is of the same order as the smallest distance between any two points. Indeed, the points in $R_\theta$ are $\varepsilon/2$ apart from each other and are within $\varepsilon$ of $\theta$. This property, together with the assumption that locally $d_K$ is upper bounded by a multiple of $d$, enables the use of Birgé's result [(1983), Proposition 2.8] to get the lower bound on the minimax risk.
To identify the minimax rates of convergence, we assume there exist constants $A > 0$ and $\varepsilon_0 > 0$ such that $D(p_\theta \| p_{\theta'}) \le A d^2(\theta, \theta')$ for $\theta, \theta' \in S$ with $d(\theta, \theta') \le \varepsilon_0$. From Lemma 3, there exist $\theta^* \in S$ and a subset $S_0 = R_{\theta^*}$ with a local packing set $N$ of log-cardinality at least $M(\varepsilon/2) - M(\varepsilon)$. Using Fano's inequality together with the diameter bound on mutual information as in Birgé yields, with $\Theta$ a uniformly distributed random variable on $N$, when $\varepsilon \le \varepsilon_0$,
$$I(\Theta; X^n) \le n \max_{\theta, \theta' \in S_0} D(p_\theta \| p_{\theta'}) \le nA \max_{\theta, \theta' \in S_0} d^2(\theta, \theta') \le nA\varepsilon^2,$$
and hence, choosing $\varepsilon_n$ to satisfy $M(\varepsilon_n/2) - M(\varepsilon_n) = 2(nA\varepsilon_n^2 + \log 2)$, as in the proof of Theorem 1 but with $S$ replaced by $S_0 = R_{\theta^*}$, one gets $\min_{\hat\theta} \max_{\theta \in S_0} E_\theta d^2(\hat\theta, \theta) \ge \varepsilon_n^2/32$, which is similar to our conclusions from Section 2 (cf. Corollary 1). The difference in the bound just obtained compared with the previous work of Birgé and others is the use of Lemma 3 to avoid requiring explicit construction of the local packing set.
If one does have knowledge of the entropy of an $\varepsilon$-ball with the largest order $\varepsilon/2$-packing set, then the same argument with $S_0$ equal to this $\varepsilon$-ball yields the conclusion
$$(7) \qquad \min_{\hat\theta} \max_{\theta \in S_0} E_\theta d^2(\hat\theta, \theta) \asymp \varepsilon_n^2,$$
where $\varepsilon = \varepsilon_n$ is determined by $M^{\mathrm{loc}}(\varepsilon_n) = n\varepsilon_n^2$, provided $D(p_\theta \| p_{\theta'}) \le A d^2(\theta, \theta')$ holds for $\theta, \theta'$ in the chosen packing set.
The above lower bound is often at the optimal rate even when the target class is parametric. For instance, the $\varepsilon$-entropy of a usual parametric class is often of order $m \log(1/\varepsilon)$, where $m$ is the dimension of the model. Then the metric entropy difference $M(\varepsilon/2) - M(\varepsilon)$ or $M^{\mathrm{loc}}(\varepsilon)$ is of the order of a constant, yielding the anticipated rate $\varepsilon_n^2 \asymp 1/n$ and $\varepsilon_n \asymp 1/\sqrt{n}$.
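As a worked instance (ours): if $M(\varepsilon) = m \log(1/\varepsilon)$, then
$$M(\varepsilon/2) - M(\varepsilon) = m \log 2,$$
so the equation $M(\varepsilon_n/2) - M(\varepsilon_n) = 2(nA\varepsilon_n^2 + \log 2)$ forces $n\varepsilon_n^2 \asymp m$, that is, $\varepsilon_n \asymp \sqrt{m/n}$, the familiar parametric rate.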
The achievability of the rate in (7) is due to Birgé [(1986), Theorem 3.1] under the following condition.
CONDITION 4. There is a nonincreasing function $U: (0, \infty) \to [1, \infty)$. For any $\varepsilon > 0$ with $n\varepsilon^2 \ge U(\varepsilon)$, there exists an $\varepsilon$-net $S_\varepsilon$ for $S$ under $d$ such that for all $A \ge 2$ and $\theta \in S$, we have $\mathrm{card}(S_\varepsilon \cap B_d(\theta, A\varepsilon)) \le A^{U(\varepsilon)}$. Moreover, there are positive constants $A_1 \ge 1$, $A_2 > 0$, a $\theta_0 \in S$, and for $A \ge 2$ there exists, for all sufficiently small $\varepsilon$, an $\varepsilon$-packing set $N_\varepsilon$ for $B_d(\theta_0, A\varepsilon)$ satisfying $\mathrm{card}(N_\varepsilon \cap B_d(\theta_0, A\varepsilon)) \ge (A/A_1)^{U(\varepsilon)}$ and $d_K^2(\theta, \theta') \le A_2 d^2(\theta, \theta')$ for all $\theta, \theta'$ in $N_\varepsilon$.
PROPOSITION 2 (Birgé). Suppose Condition 4 is satisfied. If the distance $d$ is a metric bounded above by a multiple of the Hellinger distance, then
$$\min_{\hat\theta} \max_{\theta \in S} E d^2(\theta, \hat\theta) \asymp \varepsilon_n^2,$$
where $\varepsilon_n$ satisfies $n\varepsilon_n^2 = U(\varepsilon_n)$.
The upper bound is from Birgé [(1986), Theorem 3.1] using the first part of Condition 4. From the second part of Condition 4 with $A = 2$, we have $M^{\mathrm{loc}}(2\varepsilon) \succeq U(\varepsilon)$ and a suitable relationship between $d_K$ and $d$ in a maximal order local $\varepsilon$-net, so the lower bound follows from (7) by the Fano inequality bound as discussed above, in accordance with Birgé [(1983), Proposition 2.8].
When $\liminf_{\varepsilon \to 0} M(\varepsilon/2)/M(\varepsilon) > 1$ (as in Condition 2), the upper bounds given in Section 2 are optimal in terms of rates when $d$ and $d_K$ are locally equivalent. For such cases, there is no difference in considering global or local metric entropy. The condition $\liminf_{\varepsilon \to 0} M(\varepsilon/2)/M(\varepsilon) > 1$ is characteristic of large function classes, as we have seen. In that case, the three entropy quantities in Lemma 3 are asymptotically equivalent as $\varepsilon \to 0$:
$$M(\varepsilon/2) - M(\varepsilon) \asymp M^{\mathrm{loc}}(\varepsilon) \asymp M(\varepsilon/2).$$
In contrast, when $\lim_{\varepsilon \to 0} M(\varepsilon/2)/M(\varepsilon) = 1$, as is typical of finite-dimensional parametric cases, $M(\varepsilon/2) - M(\varepsilon)$ is of smaller order than $M(\varepsilon/2)$. In this case, one may use (7) to determine a satisfactory lower bound on the convergence rate.
From the above results, it seems that for lower bounds, it is enough to know the global metric entropy; but for upper bounds, when $\liminf_{\varepsilon \to 0} M(\varepsilon/2)/M(\varepsilon) = 1$, under a stronger homogeneous entropy assumption (Condition 4), the local entropy condition gives the right order upper bounds, while using the global entropy results in suboptimal upper bounds (within a logarithmic factor). However, for inhomogeneous finite-dimensional spaces, no general results are available to identify the minimax rates of convergence. Besides being used for the determination of minimax rates of convergence, local entropy conditions are used to provide adaptive estimators using different kinds of approximating models [see Birgé and Massart (1993, 1995), Barron, Birgé and Massart (1999) and Yang and Barron (1998)].
APPENDIX
Proofs of some lemmas.
PROOF OF LEMMA 2. The proof is by a truncation of $g$ from above and below. Let $G = \{x: g(x) < 4T\}$ and let $\bar g = g I_G + 4T I_{G^c}$. Then because $d_H(f, g) \le \varepsilon$, we have $\int_{G^c} (\sqrt{f} - \sqrt{g})^2 \, d\mu \le \varepsilon^2$. Since $f(x) \le T \le g(x)/4$ for $x \in G^c$, it follows that $\int_{G^c} (\sqrt{g}/2)^2 \, d\mu \le \varepsilon^2$. Thus $\int_{G^c} g \, d\mu \le 4\varepsilon^2$, which implies $1 - 4\varepsilon^2 \le \int \bar g \, d\mu \le 1$ and $\int (\sqrt{\bar g} - \sqrt{g})^2 \, d\mu \le \int_{G^c} g \, d\mu \le 4\varepsilon^2$. Let $\tilde g = (\bar g + 4\varepsilon^2)/(\int \bar g \, d\mu + 4\varepsilon^2)$. Clearly $\tilde g$ is a probability density function with respect to $\mu$. For $0 \le z \le 4T$, by a simple calculation using $1 - 4\varepsilon^2 \le \int \bar g \, d\mu \le 1$, we have $\bigl| \sqrt{z} - \sqrt{(z + 4\varepsilon^2)/(\int \bar g \, d\mu + 4\varepsilon^2)} \bigr| \le 2(8T - 1)\varepsilon$. Thus $\int (\sqrt{\bar g} - \sqrt{\tilde g})^2 \, d\mu \le 4(8T - 1)^2 \varepsilon^2$. Therefore, by the triangle inequality,
$$d_H^2(f, \tilde g) \le 2\int (\sqrt{f} - \sqrt{g})^2 \, d\mu + 4\int (\sqrt{g} - \sqrt{\bar g})^2 \, d\mu + 4\int (\sqrt{\bar g} - \sqrt{\tilde g})^2 \, d\mu \le 2\varepsilon^2 + 16\varepsilon^2 + 16(8T - 1)^2 \varepsilon^2.$$
That is, $d_H^2(f, \tilde g) \le 2(9 + 8(8T - 1)^2)\varepsilon^2$. Because $f/\tilde g$ is upper bounded by $T/(4\varepsilon^2/(\int \bar g \, d\mu + 4\varepsilon^2)) \le 9T/(4\varepsilon^2)$, the K-L divergence is upper bounded by a multiple of the squared Hellinger distance [see, e.g., Birgé and Massart (1994), Lemma 4, or Yang and Barron (1998), Lemma 4] as given in the lemma. This completes the proof of Lemma 2. $\Box$
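The truncation construction above can be checked numerically (our sketch on a finite space with $\mu$ uniform; the densities and constants are invented for illustration):

import numpy as np

# Numerical check of the truncation in the proof of Lemma 2 on a finite space
# with mu uniform. f, g and T are invented; the printed quantities should
# respect the stated bounds (left value at most the right value).
rng = np.random.default_rng(2)
N, T = 1000, 2.0
mu = 1.0 / N
f = rng.uniform(0.5, 1.5, N); f /= f.sum() * mu       # density with f <= T
g = f * rng.uniform(0.9, 1.1, N); g[:3] += 50.0       # g has a few large spikes
g /= g.sum() * mu

hell2 = lambda p, q: ((np.sqrt(p) - np.sqrt(q)) ** 2).sum() * mu
eps2 = hell2(f, g)                                    # eps**2, with d_H(f, g) = eps

g_bar = np.where(g < 4.0 * T, g, 4.0 * T)             # truncate g above at 4T
g_til = (g_bar + 4.0 * eps2) / (g_bar.sum() * mu + 4.0 * eps2)

print(hell2(f, g_til), 2.0 * (9.0 + 8.0 * (8.0 * T - 1.0) ** 2) * eps2)
print((f / g_til).max(), 9.0 * T / (4.0 * eps2))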
LEMMA 4. Assume Conditions 1 and 2 are satisfied. Let $\varepsilon_n$ satisfy $\varepsilon_n^2 = V(\varepsilon_n)/n$ and let $\varepsilon_{n,d}$ be chosen such that $M(\varepsilon_{n,d}) = 4n\varepsilon_n^2 + 2\log 2$. Then $\varepsilon_{n,d} \asymp \varepsilon_n$.
PROOF. Let $\bar\alpha = \liminf_{\varepsilon \to 0} M(a\varepsilon)/M(\varepsilon) > 1$. Under the assumption that $M(\varepsilon) \ge 2\log 2$ when $\varepsilon$ is small, we have $4n\varepsilon_n^2 + 2\log 2 \le 6n\varepsilon_n^2$ when $n$ is large enough. Then under Condition 1, $M((a/b)\varepsilon_n) \ge (1/c)V(\varepsilon_n) = (1/c)n\varepsilon_n^2 \ge (6c)^{-1} M(\varepsilon_{n,d})$. Take $k$ large enough such that $\bar\alpha^k \ge 6c$. Then $M((a/b)a^k\varepsilon_n) \ge \bar\alpha^k M((a/b)\varepsilon_n) \ge M(\varepsilon_{n,d})$. Thus $(a/b)a^k\varepsilon_n \le \varepsilon_{n,d}$, that is, $\varepsilon_n = O(\varepsilon_{n,d})$. Similarly, $M(\varepsilon_n/b) \le V(\varepsilon_n) = n\varepsilon_n^2 \le (1/4) M(\varepsilon_{n,d}) \le M(\varepsilon_{n,d})$, so $\varepsilon_n/b \ge \varepsilon_{n,d}$, that is, $\varepsilon_{n,d} = O(\varepsilon_n)$. This completes the proof. $\Box$
Acknowledgments. The referees and an Associate Editor are thanked for encouraging improved focus and clarity of presentation in the paper.
REFERENCES
BALL, K. and PAJOR, A. (1990). The entropy of convex bodies with "few" extreme points. In Geometry of Banach Spaces 26-32. Cambridge Univ. Press.
BARRON, A. R. (1987). Are Bayes rules consistent in information? In Open Problems in Communication and Computation (T. M. Cover and B. Gopinath, eds.) 85-91. Springer, New York.
BARRON, A. R. (1991). Neural net approximation. In Proceedings of the Yale Workshop on Adaptive Learning Systems (K. Narendra, ed.). Yale Univ.
BARRON, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. Inform. Theory 39 930-945.
BARRON, A. R. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning 14 115-133.
BARRON, A. R., BIRGÉ, L. and MASSART, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413.
BARRON, A. R. and COVER, T. M. (1991). Minimum complexity density estimation. IEEE Trans. Inform. Theory 37 1034-1054.
BARRON, A. R. and HENGARTNER, N. (1998). Information theory and superefficiency. Ann. Statist. 26 1800-1825.
BARRON, A. R. and SHEU, C.-H. (1991). Approximation of density functions by sequences of exponential families. Ann. Statist. 19 1347-1369.
BICKEL, P. J. and RITOV, Y. (1988). Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhya Ser. A 50 381-393.
BIRGÉ, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Z. Wahrsch. Verw. Gebiete 65 181-237.
BIRGÉ, L. (1986). On estimating a density using Hellinger distance and some other strange facts. Probab. Theory Related Fields 71 271-291.
BIRGÉ, L. and MASSART, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 113-150.
BIRGÉ, L. and MASSART, P. (1994). Minimum contrast estimators on sieves. Technical report, Univ. Paris-Sud.
BIRGÉ, L. and MASSART, P. (1995). Estimation of integral functionals of a density. Ann. Statist. 23 11-29.
BIRGÉ, L. and MASSART, P. (1996). From model selection to adaptive estimation. In Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam (D. Pollard, E. Torgersen and G. Yang, eds.) 55-87. Springer, New York.
BIRMAN, M. S. and SOLOMJAK, M. (1974). Quantitative analysis in Sobolev embedding theorems and application to spectral theory. Tenth Math. School Kiev 5-189.
BRETAGNOLLE, J. and HUBER, C. (1979). Estimation des densités: risque minimax. Z. Wahrsch. Verw. Gebiete 47 119-137.
CARL, B. (1981). Entropy numbers of embedding maps between Besov spaces with an application to eigenvalue problems. Proc. Roy. Soc. Edinburgh 90A 63-70.
CENCOV, N. N. (1972). Statistical Decision Rules and Optimal Inference. Nauka, Moscow; English translation in Amer. Math. Soc. Transl. 53 (1982).
CLARKE, B. and BARRON, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36 453-471.
CLARKE, B. and BARRON, A. R. (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statist. Plann. Inference 41 37-60.
COVER, T. M. and THOMAS, J. A. (1991). Elements of Information Theory. Wiley, New York.
COX, D. D. (1988). Approximation of least squares regression on nested subspaces. Ann. Statist. 16 713-732.
DAVISSON, L. (1973). Universal noiseless coding. IEEE Trans. Inform. Theory 19 783-795.
DAVISSON, L. and LEON-GARCIA, A. (1980). A source matching approach to finding minimax codes. IEEE Trans. Inform. Theory 26 166-174.
DEVORE, R. A. and LORENTZ, G. G. (1993). Constructive Approximation. Springer, New York.
DEVROYE, L. (1987). A Course in Density Estimation. Birkhauser, Boston.
DONOHO, D. L. (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Appl. Comput. Harmon. Anal. 1 100-115.
DONOHO, D. L. (1996). Unconditional bases and bit-level compression. Technical Report 498, Dept. Statistics, Stanford Univ.
DONOHO, D. L., JOHNSTONE, I. M., KERKYACHARIAN, G. and PICARD, D. (1996). Density estimation by wavelet thresholding. Ann. Statist. 24 508-539.
DONOHO, D. L. and LIU, R. C. (1991). Geometrizing rates of convergence II. Ann. Statist. 19 633-667.
DUDLEY, R. M. (1987). Universal Donsker classes and metric entropy. Ann. Probab. 15 1306-1326.
EDMUNDS, D. E. and TRIEBEL, H. (1987). Entropy numbers and approximation numbers in function spaces. Proc. London Math. Soc. 58 137-152.
EFROIMOVICH, S. YU. and PINSKER, M. S. (1982). Estimation of square-integrable probability density of a random variable. Problemy Peredachi Informatsii 18 19-38.
FANO, R. M. (1961). Transmission of Information: A Statistical Theory of Communication. MIT Press.
FARRELL, R. (1972). On the best obtainable asymptotic rates of convergence in estimation of a density function at a point. Ann. Math. Statist. 43 170-180.
HASMINSKII, R. Z. (1978). A lower bound on the risks of nonparametric estimates of densities in the uniform metric. Theory Probab. Appl. 23 794-796.
HASMINSKII, R. Z. and IBRAGIMOV, I. A. (1990). On density estimation in the view of Kolmogorov's ideas in approximation theory. Ann. Statist. 18 999-1010.
HAUSSLER, D. (1997). A general minimax result for relative entropy. IEEE Trans. Inform. Theory 40 1276-1280.
HAUSSLER, D. and OPPER, M. (1997). Mutual information, metric entropy and cumulative relative entropy risk. Ann. Statist. 25 2451-2492.
IBRAGIMOV, I. A. and HASMINSKII, R. Z. (1977). On the estimation of an infinite-dimensional parameter in Gaussian white noise. Soviet Math. Dokl. 18 1307-1309.
IBRAGIMOV, I. A. and HASMINSKII, R. Z. (1978). On the capacity in communication by smooth signals. Soviet Math. Dokl. 19 1043-1047.
JONES, L. K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist. 20 608-613.
KOLMOGOROV, A. N. and TIHOMIROV, V. M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk 14 3-86.
KOO, J. Y. and KIM, W. C. (1996). Wavelet density estimation by approximation of log-densities. Statist. Probab. Lett. 26 271-278.
LE CAM, L. M. (1973). Convergence of estimates under dimensionality restrictions. Ann. Statist. 1 38-53.
LE CAM, L. M. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
LORENTZ, G. G. (1966). Metric entropy and approximation. Bull. Amer. Math. Soc. 72 903-937.
LORENTZ, G. G., GOLITSCHEK, M. V. and MAKOVOZ, Y. (1996). Constructive Approximation: Advanced Problems. Springer, New York.
MAKOVOZ, Y. (1996). Random approximants and neural networks. J. Approx. Theory 85 98-109.
MITJAGIN, B. S. (1961). The approximation dimension and bases in nuclear spaces. Uspekhi Mat. Nauk 16 63-132.
NEMIROVSKII, A. (1985). Nonparametric estimation of smooth regression functions. J. Comput. System Sci. 23 1-11.
NEMIROVSKII, A., POLYAK, B. T. and TSYBAKOV, A. B. (1985). Rates of convergence of nonparametric estimates of maximum-likelihood type. Probl. Peredachi Inf. 21 17-33.
POLLARD, D. (1993). Hypercubes and minimax rates of convergence. Preprint.
RISSANEN, J. (1984). Universal coding, information, prediction, and estimation. IEEE Trans. Inform. Theory 30 629-636.
RISSANEN, J., SPEED, T. and YU, B. (1992). Density estimation by stochastic complexity. IEEE Trans. Inform. Theory 38 315-323.
SMOLJAK, S. A. (1960). The ε-entropy of some classes $E_s^{\alpha k}(B)$ and $W_s^{\alpha}(B)$ in the $L_2$ metric. Dokl. Akad. Nauk SSSR 131 30-33.
STONE, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist. 10 1040-1053.
TEMLYAKOV, V. N. (1989). Estimation of the asymptotic characteristics of classes of functions with bounded mixed derivative or difference. Trudy Mat. Inst. Steklov 189 162-197.
TRIEBEL, H. (1975). Interpolation properties of ε-entropy and diameters. Geometric characteristics of embedding for function spaces of Sobolev-Besov type. Mat. Sb. 98 27-41.
VAN DE GEER, S. (1990). Hellinger consistency of certain nonparametric maximum likelihood estimates. Ann. Statist. 21 14-44.
WONG, W. H. and SHEN, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Statist. 23 339-362.
YANG, Y. (1999a). Model selection for nonparametric regression. Statist. Sinica 9 475-500.
YANG, Y. (1999b). Minimax nonparametric classification I: rates of convergence. IEEE Trans. Inform. Theory 45 2271-2284.
YANG, Y. and BARRON, A. R. (1997). Information-theoretic determination of minimax rates of convergence. Technical Report 28, Dept. Statistics, Iowa State Univ.
YANG, Y. and BARRON, A. R. (1998). An asymptotic property of model selection criteria. IEEE Trans. Inform. Theory 44 95-116.
YATRACOS, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov's entropy. Ann. Statist. 13 768-774.
YATRACOS, Y. G. (1988). A lower bound on the error in nonparametric regression type problems. Ann. Statist. 16 1180-1187.
YU, B. (1996). Assouad, Fano, and Le Cam. In Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam (D. Pollard, E. Torgersen and G. Yang, eds.) 423-435. Springer, New York.
DEPARTMENT OF STATISTICS
IOWA STATE UNIVERSITY
SNEDECOR HALL
AMES, IOWA 50011-1210
E-MAIL: yyang@iastate.edu

DEPARTMENT OF STATISTICS
YALE UNIVERSITY
YALE STATION
P.O. BOX 208290
NEW HAVEN, CONNECTICUT 06520-8290
E-MAIL: barron@stat.yale.edu