Computational Learning Theory: A Bayesian Perspective

From: AAAI Technical Report SS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Tom Chávez
Department of Engineering-Economic Systems, Stanford University /
Rockwell International Science Lab, 444 High St., Palo Alto, CA 94301
Abstract: Computational learning theory (COLT), the field of research stemming from Valiant's seminal 1986 paper [Valiant], differs from previous attempts at formalizing inference in that it allows uncertainty and requires efficiency: an algorithm counts as a learning algorithm only if it errs with probability parameters we stipulate in advance, and only if it can be arrived at in polynomial time. Yet there are many aspects of learning that the model does not accommodate. It does not allow a learning agent to bring and assimilate prior beliefs to a problem. While it allows uncertainty in the form of its error (ε) and confidence (δ) parameters, it does not allow more specific "hedging" about the elements of a hypothesized concept. It does not permit us to support positively a hypothesis on the basis of sampled data, and it does not allow us to learn facts such as whether or not a coin is fair. Most distressingly, it relies on a classical statistical approach that is incompatible with the expressive requirements of real-world inference and the more convincing, well-justified notions of subjective probability.

This paper introduces a reformulation of the model which solves the preceding difficulties. We have been investigating a model of learning similar to COLT in its demands for efficiency and distribution-independence, yet one that also possesses the attractive features of Bayesianism, such as prior beliefs, continuous, incremental updating, and subjective probability. Our approach covers the classes of learnable concepts covered by COLT, as well as other categories of facts and concepts that it does not currently support. The new methodology possesses other distinct advantages: mistaken prior beliefs wash out quickly; the exactitude of inference over individual attributes can be specified in advance; and the efficiency of learning does not depend on the size of the concept space.

1.0 Introduction

Learning is such an intriguing topic because it draws on and concerns itself with some of the most interesting elements of psychology, computer science and philosophy. It has to do with psychology because we would like to know how people actually learn things; we would like to understand better how we form, support, extend and refute concepts and ideas. It touches on many ideas from philosophy, but most directly on topics from the philosophy of science -- how scientific theory emerges from large, complex and apparently self-contradictory bodies of data. Finally, it has much to do with computer science and artificial intelligence because we would like to build agents and programs that can go out in the world and infer new knowledge in some robust, efficient way.

As good as Valiant's model has proven at capturing and formalizing the most distinctive and significant aspects of the learning process, we believe that it possesses certain shortcomings. We first present an overview of COLT, and we highlight its useful features and insights. We proceed to identify certain shortcomings with it and describe why they should concern anybody who studies inference. Many of those perceived difficulties have to do with the age-old debate between Bayesians and non-Bayesians in philosophy and mathematical statistics. We will try to focus that debate as much as possible into the current context, though some of what we argue requires -- or at least references -- arguments presented elsewhere and in other contexts. We go on to present an alternate model of inference -- or perhaps simply a more refined, Bayesian version of the COLT model -- based on the method of updating Bernoulli parameters using second-order conjugate distributions. Along the way, we observe how the approach presented here solves the deficiencies observed in the COLT model, and we proceed to identify its other strengths as well.
2.0 COLT: An Overview

We define a representation over X to be a pair (σ, C), where C ⊆ {0,1}* and σ is a mapping σ: C → 2^X. For some fixed c ∈ C, we call σ(c) the concept over X, and σ(C) the concept class. We will also use pos(c) = σ(c), the image of σ, or the set of positive examples of the concept. We assume that domain points x ∈ X and concept representations c ∈ C are efficiently encoded using any of the standard techniques (see, for example, [Garey]). Since X and σ are usually clear from context, we will often simply refer to the representation class C.¹

A labeled example from a domain X is a pair <x, b>, where x ∈ X and b ∈ {0,1}. In this paper, we will be considering only positive examples, i.e., elements of X for which b = 1. We shall refer to a sample as some finite set of positive examples over X.
2.1 An Example

We will first motivate things with a simple example, following [Kearns]. Suppose we are interested in learning about members of the animal kingdom, and to this end we create a collection of Boolean variables describing various features of animals. We assume that our variables, such as has_mane, walks_on_four_legs, does_fly, hatches_eggs, completely cover all the aspects we are interested in, and, of course, that the number of variables is finite. Now we might like to use this encoding, along with some sort of inference scheme, to learn what a particular animal is. More exactly, we want some framework for seeing examples of lions, say, and then constructing a concept that allows us to distinguish lions from non-lions on the basis of those observations.
2.2 Representations

We let X denote a set called a domain, also sometimes referred to as an instance space. We can think of X as the set of all encodings of objects, events, or phenomena of interest to us. For example, an element of X might be an object in a room possessing particular values for different properties such as color, height and shape. Or, in the case of our preceding example, an element of X is a particular animal, such as a cheetah. A learning algorithm seeks to build a concept for some element, or most typically a subset of elements, in X. In COLT, because concepts are usually considered clusterings of properties, we typically use concept representations that are expressible as simple rules over domain instances. Boolean formulae are obvious candidates for representing COLT concepts because they can capture such rule-based descriptions easily and elegantly.

In most of COLT, we find results relative to specific, parameterized classes of representations. Typically we have a stratified domain X = ∪ X_n and C = ∪ C_n. The parameter n can be regarded as an appropriate measure of complexity. Usually n is the number of attributes, or variables, in the domain X. As an example, X_n might be the set {0,1}^n and C_n the set M_n of all monomials of length ≤ n, i.e., the set of all conjunctions of literals over the Boolean variables x_1,...,x_n.

Returning to our example, we see that, given our variable definitions corresponding to animal attributes, we can represent a lion as some fixed combination of values over those variables. If all of our variables are Boolean, then each variable corresponds to the possession or non-possession of some attribute. Suppose that we have defined n variables such that each possible combination (out of the 2^n possible combinations) effectively describes some member of the animal kingdom. For example, is_bird ∧ ¬does_fly ∧ weighs_over_200_lbs ∧ runs_fast might be a concept representation for an ostrich, where the concept class is taken to be M_n as defined above. Provided that we have a clearly defined ordering on the variables, it is a simple matter to construct a labeled sample consisting of lions and non-lions.

1. The reader should not be disturbed by the sudden formalism. We are simply making explicit the notion that, in order to represent a concept in terms of a clustering of elements in the domain X, we must define a representation class C which identifies those elements for a fixed concept c.
2.3 Distributions

Now that we have a way of representing concepts and phenomena, as well as definitions for examples and samples, we need some precise method for presenting those examples to a learning algorithm. Suppose we have a fixed representation for the concept of a lion given by is_mammal ∧ is_large ∧ has_four_legs ∧ has_claws. In computational learning theory, a learning algorithm is given examples of a single concept c ∈ C according to some fixed but unknown distribution D.

The idea of learning with respect to a fixed but unknown, arbitrary distribution on examples (often referred to as distribution-free learning) is one of the great strengths of the COLT model. We can think of the target distribution as a means of capturing the structure of the real world -- not necessarily capturing that structure well, but at least as representing some slice, however inaccurate or biased, of the phenomena we are trying to learn about. Because we seek a framework for describing learning in unanticipated scenarios, we will rarely have access to the distributions on data or examples prior to constructing the learning algorithms. Therefore we seek algorithms that perform well under any target distribution.
Of course, if the algorithm sees examples according to a skewed distribution, it will infer concepts reflecting only the data it has seen, and not necessarily the objective structure of the learning domain. Thus we are forced to characterize the error of the learning algorithm in terms of the target distribution. Suppose that a learning algorithm outputs a hypothesis h from a representation class H after seeing examples from the instance space drawn according to D. (Recent work in computational learning theory distinguishes between H and C because it is possible for some learning algorithms to create hypotheses in one representation class that serve as good approximations to target concepts in another.) COLT defines the error ε as

    ε = Σ_{x ∈ X : h(x) ≠ c(x)} D(x)

That is, ε is the probability weight according to D of all the mistakes h can possibly make. Measuring ε in this manner implies that "fluke" examples occurring with very low probability in the sampling process contribute only a negligible amount of error to the hypothesis. COLT requires that a learning algorithm output a hypothesis with error no larger than ε regardless of the distribution D, where ε is specified in advance. Returning to our example, if we specify a value of ε = 0.1, then with probability at least 90% the output hypothesis will correctly classify a lion as a lion and a non-lion as a non-lion. Conversely, the probability that the hypothesis mistakes a cheetah for a lion, say, is less than 10%.
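The error measure above can be checked by brute force on a toy case; the domain, distribution, and concepts below are invented for illustration, not drawn from the paper.

```python
# Sketch: the COLT error of a hypothesis h relative to a target concept c
# and a distribution D over the domain X, computed by summing D(x) over
# the points where h and c disagree.

def error(h, c, D):
    """epsilon = sum of D(x) over all x in X with h(x) != c(x)."""
    return sum(p for x, p in D.items() if h(x) != c(x))

# Toy domain: three integers; target "x is even", hypothesis "x mod 4 == 0".
D = {0: 0.5, 2: 0.4, 3: 0.1}          # probabilities sum to 1
c = lambda x: x % 2 == 0
h = lambda x: x % 4 == 0

print(error(h, c, D))   # h errs only on x = 2, contributing weight 0.4
```

Note how the low-probability point x = 3 (a "fluke" region under D) could never contribute more than 0.1 to ε, exactly the effect described in the text.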
2.4 Learnability

In [Kearns], Kearns makes a point not usually addressed in much of the COLT literature: because we are interested in computational efficiency, we must ensure that the hypotheses output by a learning algorithm can be evaluated efficiently, not just that they be arrived at efficiently. It is of little use to us if a learning algorithm outputs a system of differential equations that can only be evaluated in an exponential number of time steps. Thus we require a preliminary definition: if C is a representation class over X, then we say that C is polynomially evaluatable if there is a polynomial algorithm A that on input a representation c ∈ C and a domain point x ∈ X outputs c(x).

We are now ready to present the central definition of learnability according to COLT. Let C and H be representation classes over X. Then C is learnable from examples by H if there exists a probabilistic algorithm A such that for any c ∈ C, if A has access to examples from pos(c) drawn according to any fixed but unknown distribution D, then for any inputs ε and δ such that 0 < ε, δ < 1, A outputs a hypothesis h ∈ H that with probability 1-δ has error less than ε. Typically we include a requirement of efficiency as well: C is polynomially learnable by examples from H if C and H are polynomially evaluatable and A runs in time polynomial in 1/ε, 1/δ, and |c|. Often we will drop the phrase "by examples from H" and simply say that C is learnable, or polynomially learnable.
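To make the definition concrete, Valiant's elimination procedure for monomials, which the next paragraphs trace on the lion example, can be sketched as runnable code; the target concept and the two positive examples below are fabricated for illustration.

```python
# Sketch of Valiant's elimination algorithm for monomials, learning from
# positive examples only. Examples are tuples of 0/1 values; the hypothesis
# is the set of surviving literals ("x2" or its negation "~x2").

def learn_monomial(n, positive_examples):
    # Start with all 2n literals; each positive example v prunes one
    # literal per variable: if v_j = 0 delete x_j, else delete ~x_j.
    h = {f"x{j}" for j in range(n)} | {f"~x{j}" for j in range(n)}
    for v in positive_examples:
        for j, bit in enumerate(v):
            h.discard(f"x{j}" if bit == 0 else f"~x{j}")
    return h

# Illustrative target concept x0 AND ~x2 over n = 3 variables.
examples = [(1, 0, 0), (1, 1, 0)]
print(sorted(learn_monomial(3, examples)))   # ['x0', '~x2']
```

The hypothesis only ever shrinks toward the target, and each surviving literal is consistent with every positive example seen, which is the invariant the error analysis below relies on.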
Suppose we are attempting to learn a monomial representation is_mammal ∧ is_large ∧ has_four_legs ∧ has_claws for the concept of a lion. A COLT learning algorithm is given a distribution on lions, where each lion is represented as a truth assignment over the variables we have defined. The algorithm is then fed positive examples of lions as it attempts to construct a monomial sufficiently close to the target monomial so as to meet error specifications given by ε and δ. We can employ the learning algorithm given by Valiant in his ground-breaking paper [Valiant]. Let x_1,...,x_n denote literals (what we have been giving names such as has_four_legs) and let POS denote an oracle which returns a positive example of a lion drawn according to a distribution D.

    h := x_1 ∧ ¬x_1 ∧ ... ∧ x_n ∧ ¬x_n    {initialize h to all 2n literals}
    for i := 1 to m do    {m satisfies the ε and δ constraints}
    begin
        v := POS
        for j := 1 to n do
            if v_j = 0 then delete x_j from h
            else delete ¬x_j from h
    end
    output h

As the learning algorithm sees positive examples of lions, variables such as can_speak and breathes_underwater will be quickly deleted from h, since it is safe to assume that no lions can speak or breathe underwater. A variable such as has_mane, meanwhile, will be eliminated from h with probability roughly 1/2, presuming that the sampling distribution D accurately reflects the fact that about half of all lions are males, and only male lions have manes. Variables such as has_claws and is_mammal are true for all positive examples of lions, and so should withstand the pruning process of the learning algorithm and be included in the output hypothesis.

The learning algorithm can err in two ways corresponding to the two error parameters ε and δ. It can make an ε-type error in the following manner: suppose that a rare breed of Tibetan lion has no claws; yet because those lions are so rare, not one of them appeared in the training set of examples encountered by the algorithm. Thus the literal has_claws should have been deleted from the monomial. The problem is not serious, since the probability of misclassifying a Tibetan lion as a non-lion is bounded above by ε, and ε is at least as great as the probability weight of all Tibetan lions with respect to the distribution D. A δ-type error can occur if a randomly drawn set of lions is highly unrepresentative of lions in general -- for example, if every lion in the training set just happens to live in the circus, then the learning algorithm might include lives_in_the_circus in its concept for lion.

Ideally, we would like to have learning algorithms that never make mistakes. Such a goal, however, is clearly impossible, given the richness and unpredictability of most learning domains. The next best thing, it would seem, is to have some grasp of the types of error learning algorithms make as well as the magnitude of those errors in probabilistic terms. COLT provides such a capability, and this has distinguished it favorably from previous attempts at formalizing inductive inference.

3.0 Where COLT falls short
3.1 Prior beliefs, prior knowledge
Essential to the process of learning is building upon what is already known. Neither we nor agents approach new phenomena as tabula rasa -- blank boards upon which we write the solution to a problem or engrave the representation of a concept. If the car is broken, we do not infer what is wrong with it just by looking under the hood. Rather we employ all of our background knowledge about how cars work and especially what causes them to fail. Similarly, in the realm of intelligent agency, we would like robots to use knowledge we have given them at design time, as well as everything they learn in the meanwhile, as they proceed to take action in the world.
COLT provides a good model for inductive inference for a fixed problem, but it does not provide machinery to describe how a particular problem might be approached; in particular, it does not give us a means for incorporating possibly uncertain prior beliefs or knowledge about the focus of inference. Arguably, a concept can be stored, and then at some other time the learning algorithm can see more examples and thereby reduce its error and increase its confidence for the concept. In this sense we might say that the learner has incorporated "prior" knowledge of a concept C in order to infer "more" about C, but clearly we are employing poorly-defined notions of "prior knowledge" and "more."
What we want, rather, is the ability to draw on knowledge in a knowledge base in order to focus and perhaps even accelerate the process of inference. Rather than throwing static concepts into a knowledge base and drawing on them only to complete contrived chains of reasoning or to answer straightforward queries, we wish to use static concepts as dynamic contributors to what we are learning. Rather than using prior knowledge as solely a constraining mechanism -- a tool for checking the consistency of what we already know against what we are seeing -- we should also use it as a means of enriching and clarifying inference.

Computational learning theory makes what we might call a "freeze frame" assumption: what we infer is based on what we see for a particular learning instance. COLT has nothing to say about what came before or after, and while this has proven useful for stripping away complexity in an attempt to get at an essential model for learning, it is too drastic; what has come before (and perhaps what comes after) would seem to matter a great deal to what we are inferring. Returning to our previous example, suppose we are trying to learn a concept for a lion based on positive examples. A COLT algorithm begins learning as if it knows nothing about lions: has_mane, walks_on_four_legs, and flies are all unknown variables that may assume the values 1 or 0 (True or False). Yet it seems quite reasonable that the learner might already know that lions do not fly and that they are mammals, and it would be good to incorporate this knowledge somehow into the learning process.

Now one might argue that we can simply set the values of those variables in advance. The first problem with this response is that those variables, since they are included in the representation scheme for the concept space, already determine the complexity of inference. We must redo our proofs of the running times of the learning algorithm or recompute lower bounds to match the representation we are using. In our example, if we are using an M_n representation with n = 20 and we already know that lions do not fly and that they are mammals, then we are really learning M_n for n = 18. Some work has been done studying some versions of this problem (e.g., [Littlestone]), but still we might like to account for it in the basic model.

The second, much more serious problem is that we might not really be able to commit with certainty to the fact that lions do not fly, or, more to the point, that all lions have claws (remember the Tibetan breed). All the lions that the learner might have seen before had claws, but it would clearly be erroneous to assume the quantified proposition that all lions have claws. In fact, the learner's previous experience about lions with claws points to the fact that it has only seen clawed lions. Yet rather than discard this bit of information, we would like to bring it to the learning instance somehow in a way that allows revision yet adds information. Thus we would like to have a means for incorporating prior beliefs of the form "Most lions have claws" or "Most lions are some shade of brown." The way to do this is with probability, and we will have more to say about this soon.

3.2 Hedging within a concept

One of COLT's strengths is that it gives explicit probability parameters describing the uncertainty of the inference procedure. Thus it is often referred to as "PAC learning," for probably approximately correct learning.

Yet suppose that as we are learning, we want to "hedge" on particular aspects of the concept we are building. Going back to our example, suppose that we see a lion with a mane, and then another with no mane. In COLT, since the literal has_mane is not satisfiable in light of this apparent indeterminacy, it and its negation are eliminated from the output hypothesis. Rather than throw out the baby with the bath water, however, we might like to express and localize our uncertainty about the variable has_mane. In fact, the variable is not indeterminate at all. The truth of the matter is that about half of all lions have manes and the other half do not; this seems like information worth including in our concept for a lion. There are many conceivable cases where outside knowledge can affect the scope and accuracy of inference, particularly with respect to individual variables. COLT does not allow hedging on particulars, and this is something a more robust model ought to allow.
3.3 Degrees of support

Suppose as a learning algorithm goes about observing lions, it sees that every lion in the sample so far walks on four legs, and not a single one flies. Most weigh more than 100 pounds, and most chase antelope, while about half of them have manes. A COLT algorithm will infer only what is consistent in the Boolean logical sense with the entire sample. Thus, even though most of them weigh over 100 pounds, it might have seen some baby lions that weigh less than 100 pounds, and thus the value of the weighs_more_than_100_pounds variable in the algorithm's output concept is False. After seeing female lions, it will make the more egregious error and conclude that has_mane = False. It will infer only something like has_four_legs ∧ ¬flies -- clearly much valuable information has been hidden or effectively lost.

What we would like to have are degrees of support over all the attributes of the learning domain, some way of capturing and expressing all the richness and uncertainty of the learning process. If we observe that all lions have four legs, we are not merely learning that this is consistent with the concept has_four_legs ∧ ¬flies, we are also learning something about the extent to which belief in the variable has_four_legs is supported. Plainly, this sort of uncertain knowledge goes to the very heart of what it is to learn something. Uncertainty is not a bad thing, so long as we have a means of capturing it in some expressive, axiomatic way.
3.4 Learning other facts

There are many other facts in the world worthy of learning that COLT does not accommodate. Although it is based on statistics, it does not give us a framework for learning various sorts of statistical facts -- for example, whether or not a coin is fair after observing a certain number of tosses. It does not contextualize what it learns to the realm of action, yet action would seem to constitute the backdrop against which anything worth learning is actually learned (for more discussion of this point, see [Good]). More significantly, it is not at all obvious how one would go about extending COLT into this decision-theoretic domain, e.g., accounting for learning in the inevitable presence of constraints on learning resources, or learning in the presence of variable costs of information (when it costs valuable time or money to see another example, say). Theoretical strides in research on decision making under bounded resources cannot be smoothly integrated into traditional COLT techniques.
3.5 The right kind of probability

The most serious problem with COLT is that it relies on a faulty interpretation of probability that in turn fosters a false sense of comfort with the probability parameters ε and δ. We can highlight this difficulty by asking simply, "What is meant by the term 'probably approximately correct'?" Nowhere in the literature (so far as we know) has there been any systematic discussion of COLT's probabilistic approach, its consequences and assumptions.
There has been ongoing debate in mathematical statistics, engineering and philosophy surrounding proper interpretations of probability. We can nevertheless isolate a few points relevant to our current context. Because the probability parameters in COLT are specified in advance, and sample sizes are computed relative to those parameters following standard methods from classical (frequentist) statistics, it seems that COLT's intended, or at least default, probabilistic interpretation is roughly as follows: given a learning domain (i.e., a concept space, a suitable representation and some distribution D on examples fixed in advance), in N identical, repeatable learning instances where a number of examples given by the bounds of COLT are observed according to D, a learning algorithm will output representations such that, as N goes to ∞, the proportion of representations which have error greater than ε is less than δ. Now this is a very rough sketch, for we have left much unspecified. Do we fix the learning algorithm, or do we allow different learning algorithms for varying values of N, as long as they all perform learning for the same concept space? Do we fix the concept being learned as well? Is it allowable to fix the concept and vary the learning algorithms? We have also ignored the question of error, though it is safe to take it as the weight relative to D of examples misclassified by the learning algorithm. But then we might ask if it is sensible to vary the distributions from learning instance to learning instance.
All of these permutations call into question the very meaning of repeatability required by a frequentist interpretation. In what sense is each learning instance a repeatable trial identical to any other learning instance? Notice how far we are from the simplistic realm of coin flipping, where, all things being equal, it appears relatively safe to assume that distinct flips of a coin are repeatable and identical.
The approach to assigning values for the probability parameters in COLT is indistinguishable from the way in which people assign thresholds for confidence intervals. We say that we want δ = .1 -- but what does this mean, really, and what makes it different from δ = .05? What constitutes "extreme" error or confidence? (See [Good], p. 49.) The confidence interval is, as most honest statisticians admit, a confidence trick, a way of making confidence pronouncements that are true in 95% or 99% of cases in the long run. Yet we have already seen how difficult it is to fix any reasonable notion of a "long run" for learning.
Even if we could fix some sensible notion of the "long run" for learning, confidence techniques from classical statistics still do not give us what we want. As Keynes succinctly put it, "In the long run we will all be dead." As anybody familiar with these kinds of statistical sampling techniques knows, the samples required to accommodate the pre-specified probability parameters are often absurdly large. Moreover, they flagrantly ignore information acquired during sampling. The criticism can be brought into focus by giving an example due to Good: if you observe 300 spins of a roulette wheel and there is no occurrence of a 7, should you regard the probability of a seven on the next spin as 0 or 1/37, its official value? A frequentist's most likely response is to tell you to spin it another 200 times.
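Good's roulette example has a direct Bayesian reading: place a prior on the unknown Bernoulli parameter and update it with the observed spins. The uniform Beta(1,1) prior below is our illustrative assumption, not something the text fixes.

```python
# Sketch: posterior mean for the probability of a seven after 300 spins
# with no sevens, under a Beta(a, b) prior on the Bernoulli parameter.
# The conjugate update is Beta(a + successes, b + failures).

def posterior_mean(a, b, successes, failures):
    return (a + successes) / (a + b + successes + failures)

p = posterior_mean(1, 1, successes=0, failures=300)   # uniform prior
print(round(p, 5))   # neither 0 nor the official 1/37 (about 0.02703)
```

The data pull the estimate below 1/37 without forcing it to the frequentist's 0, and no additional 200 spins are demanded before an answer can be given.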
Some might argue that this is irrelevant since COLT does not support that kind of inference, and though this reply is misguided, we can nevertheless accommodate the objection by means of another example. Suppose we are trying to build a concept for a lion, and our learning algorithm has seen r examples so far, where r << m, the number of examples prescribed by COLT in order to achieve confidence 1-δ and error ε. Yet the learning algorithm might have already observed that all lions are brown, walk on four legs, and have lots of teeth -- enough information to classify lions relative to the given distribution with error less than ε. Regardless of how clever this particular learning algorithm is, the number of samples required is fixed in advance, and, in many cases, much more than is needed or even possible. We take an example from [Kearns], p. 16, where we are interested in learning a monomial over n literals using positive examples only. Kearns gives the following exact bound for the number of examples required:

    m = (2n/ε) (log₂ 2n + log₂ (1/δ))

Suppose we are using 4 literals, and we set δ = 1/16 and ε = 1/10. Then the preceding equation implies that a COLT learning algorithm requires 560 examples; yet there are only 2⁴, or 16, possible monomials over 4 literals, so clearly the number of examples needed by the algorithm is overstated (with just 16 examples, we have δ = ε = 0). In short, the statistical techniques upon which COLT relies do not allow inference to proceed dynamically based on relevant knowledge acquired during the sampling process.
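The arithmetic behind the 560-example figure can be verified directly. Reading the bound's logarithms as base 2 is our assumption; it is the reading that reproduces the number quoted from [Kearns].

```python
import math

# Sketch: the sample bound (2n/eps) * (log2(2n) + log2(1/delta)) for
# learning monomials over n literals from positive examples.

def sample_bound(n, eps, delta):
    return (2 * n / eps) * (math.log2(2 * n) + math.log2(1 / delta))

m = sample_bound(n=4, eps=1 / 10, delta=1 / 16)
print(round(m))   # examples demanded in advance by the bound
print(2 ** 4)     # versus only 16 possible full monomials over 4 literals
```

With n = 4 the bound is (80)(3 + 4) = 560, which dwarfs the size of the concept space itself, illustrating the text's complaint.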
3.6 Bayesian learning as a solution

The solution is a more realistic, more honest acceptance of what we are really using all along: subjective, or personal, probability. Those who study COLT presumably have no difficulty using probability as a means of expressing uncertainty. If that is the case, then let us make ourselves clear about what we mean when we use it. We follow the conventional Bayesian view in defining a probability as a degree of belief in a proposition E given the background state of information I in the believer's mind. We shall not dwell on the distinct, yet consistent, perspectives on subjective probability (e.g., logical subjective probability versus personal subjective probability). Like probabilities interpreted as frequencies, we employ the axioms of probability to regulate and make consistent all our probability assessments, now considered as subjective measures of belief.

The approach is honest because, as has been argued forcefully elsewhere, it is most likely what frequentists and other empiricists are doing anyway; and, most attractively, it "covers" the other models and interpretations [de Finetti], so that one can continue adhering stubbornly to one's frequentist ways while still accepting everything we do here.
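As a preview of the machinery developed below -- updating Bernoulli parameters with second-order conjugate distributions -- here is a minimal sketch of per-attribute degrees of belief for the lion example. The Beta priors and the observations are invented for illustration.

```python
# Sketch: hedged, per-attribute degrees of belief for a concept, kept as
# Beta(a, b) pseudo-counts over Bernoulli parameters and updated by
# conjugate counting. Priors and data below are illustrative only.

beliefs = {                      # (a, b) pseudo-counts encoding prior belief
    "has_mane":      (1, 1),     # no idea: uniform prior
    "has_four_legs": (1, 1),
    "does_fly":      (1, 10),    # prior knowledge: lions almost surely don't fly
}

def update(beliefs, observation):
    """Conjugate update: Beta(a, b) -> Beta(a+1, b) on a True observation,
    Beta(a, b+1) on a False one."""
    for var, seen in observation.items():
        a, b = beliefs[var]
        beliefs[var] = (a + 1, b) if seen else (a, b + 1)

# Ten observed lions: all four-legged, none flying, about half maned.
for i in range(10):
    update(beliefs, {"has_mane": i % 2 == 0,
                     "has_four_legs": True, "does_fly": False})

mean = lambda ab: ab[0] / (ab[0] + ab[1])
print({v: round(mean(ab), 2) for v, ab in beliefs.items()})
```

Instead of eliminating has_mane as unsatisfiable, its belief settles near 1/2; the prior against flying is reinforced rather than restated; and a mistaken prior would be washed out by further counts, exactly the behavior claimed in the abstract.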
Theimmediateconsequenceof our approachis that rather
than employingBooleantruth values (0 or 1, True or
False) for the conceptsand facts we will be inferring, we
Bayesian
Learnablllty
employa closed unit interval for each variable to describe
belief in its truth or falsity. Ahighvaluemeansthat we
ascribea highlevel of belief to the variable’sbeingtrue,
anda lowvalueimpliesa lowlevel of belief in its truth, or,
conversely,a higherbelief in its falsity. Suchan attackwill
allowus to do all the things wewouldlike to be able to do
with COLT:hedgingon particulars, degrees of support
(wheresupport, using our newterminology,is indistinguishablefrombelief), andprior beliefs.
subtractionof degreesof belief for elementsof a concept
based on observeddata. The solution we posited wasa
Bayesian,subjective use of probability. Thepoint of defining equivalencebetweenrepresentations is that the concept beinglearnedor the fact beinginferred can still retain
all of its exactitudein the Booleanrepresentation,only
nowwe are being honest: no matter howmanytimes we
flip the coin, no matter howmanylions we see with four
legs, we will alwaysallow roomfor someuncertainty.
Thuswe will performinference over a Bernoullirepresentation relative to the exact facts describedin the Boolean
one. TheBooleanfacts are still true or false - wejust
never presume(or pretend) to knowthemwith certainty.
The goal of a learning algorithm in the COLT literature is to infer some unknown subset of X, called a concept, after seeing a certain number of its instances. We accommodate such tasks, but we also allow some flexibility for learning propositions which need not be directly stated in strict logical terms relative to the domain. For example, learning whether or not a coin is fair is a proposition not directly expressible as a combination of "Tails" or "Heads," yet it is something our model allows as a valid subject of inference. Thus our approach subsumes COLT in that it allows learning beyond simple concepts expressible as logical rules over domain instances. For the sake of brevity, we shall not consistently make explicit these different types of learning, but it is useful to keep them in mind as we proceed.

The most significant consequence of this approach is that we are always learning. Attributes of a concept are always undergoing positive or negative updating consistent with the laws of probability. The Bernoulli representation is a more epistemologically accurate frame within which to perform inference because it captures and expresses knowledge such as the fact that most birds fly, but penguins do not. Most lions have four legs, but we do not want to abandon that fact altogether if we encounter one that lost a leg in a bloody fight with a rhinoceros. One important advantage of this approach, then, is that it circumvents the traditional problems of monotonicity AI researchers have struggled with for so long, problems COLT never fully escapes.
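This graceful, nonmonotonic updating is easy to sketch numerically. The illustration below is ours, not the paper's (which fixes no particular density): we keep belief in "most lions have four legs" as a Beta density over a Bernoulli variable, so a single counterexample weakens the belief slightly instead of refuting it outright.

```python
from fractions import Fraction

def beta_mean(alpha, beta):
    # Mean of a Beta(alpha, beta) density: the current degree of belief
    # that a randomly encountered lion has four legs.
    return Fraction(alpha, alpha + beta)

# Belief after observing 98 four-legged lions (uniform Beta(1, 1) prior).
alpha, beta = 1 + 98, 1 + 0
print(float(beta_mean(alpha, beta)))   # 0.99

# One lion that lost a leg in a fight: belief degrades gracefully
# rather than flipping monotonically to False.
beta += 1
print(float(beta_mean(alpha, beta)))   # still about 0.98, not 0.0
```

The counterexample costs a fraction of a percent of belief; no single observation can drive the belief to zero, which is the nonmonotonic behavior described above.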
4.0 Bayesian Learnability
Still, we have no formalism for performing the type of inference we seek. What we want is a framework that possesses all of the strengths of COLT -- in particular, demands for efficiency and allowances for uncertainty -- yet also accommodates desiderata such as hedging, ranges of uncertainty for particular attributes, and prior beliefs. The solution, and the main result of this paper, is what we call a Bayesian learning scheme, which we will also abbreviate by BLS.
We now present a definition central to this paper: an n-Bernoulli representation class is a representation class over n Bernoulli variables, i.e., variables assuming some value in the closed interval [0,1]. Moreover, we shall say that a p-Boolean representation class and a q-Bernoulli representation class are equivalent if there exists some isomorphism between the literals of the Boolean class and the variables of the Bernoulli class. Since such an isomorphism exists if the two classes use the same number of variables, we can see that two such classes are equivalent in general if p = q. (We will sometimes drop the parameter n and simply refer to Boolean and Bernoulli representation classes; we will always assume the existence of a fixed domain X.)
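To see what the Bernoulli representation buys over the Boolean one, consider the coin-fairness question raised earlier. The sketch below is our own; the Beta form of the second-order density and the fairness interval [0.45, 0.55] are our assumptions, not the paper's. "Is the coin fair?" becomes an interval query against the current density on the Bernoulli variable p, rather than a Boolean combination of "Heads"/"Tails" outcomes.

```python
import math

def beta_pdf(p, a, b):
    # Density of Beta(a, b) at p, via log-gamma for numerical stability.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def prob_in_interval(a, b, lo, hi, steps=10000):
    # P(lo <= p <= hi) under Beta(a, b), by midpoint numeric integration.
    width = (hi - lo) / steps
    return sum(beta_pdf(lo + (k + 0.5) * width, a, b) for k in range(steps)) * width

# After 60 heads and 40 tails on a uniform Beta(1, 1) prior, belief that
# the coin is (nearly) fair is a degree of belief in (0, 1), not a truth value.
a, b = 1 + 60, 1 + 40
belief_fair = prob_in_interval(a, b, 0.45, 0.55)
```

No finite number of flips drives belief_fair to exactly 0 or 1, which is precisely the room for uncertainty the Bernoulli frame is meant to preserve.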
It is perhaps appropriate at this time to state the motivation for the preceding definition of equivalence. In Section 3, we observed some of COLT's shortcomings: in particular, its rigidity with respect to particular attributes of a concept being learned, and its inability to accommodate addition or subtraction of degrees of belief for elements of a concept based on observed data. The solution we posited was a Bayesian, subjective use of probability. The point of defining equivalence between representations is that the concept being learned or the fact being inferred can still retain all of its exactitude in the Boolean representation, only now we are being honest: no matter how many times we flip the coin, no matter how many lions we see with four legs, we will always allow room for some uncertainty. Thus we will perform inference over a Bernoulli representation relative to the exact facts described in the Boolean one. The Boolean facts are still true or false - we just never presume (or pretend) to know them with certainty.

Definition: Suppose that a Boolean representation class B is equivalent to a Bernoulli representation class C, and let bi and ci denote the ith variables of B and C, respectively. We shall denote the domain of inference by X. We shall call Φ a BLS for B relative to C iff the following all obtain:
1) Φ maintains a set of distributions βi, for i = 1..n, where n is the number of variables in B and C. Each βi encapsulates current belief about the value of ci, and therefore is a second-order density function on the Bernoulli variable bi.
2) Let c° denote some fixed concept or fact in C. For some x ∈ pos(c°), let xi ∈ {0,1} denote the observation value of the ith attribute or variable of x. We denote the conjunction of all background information and knowledge by ξ, and brackets denote probabilities. After Φ observes x, it updates the βi according to Bayes' Theorem; more exactly, we have ∀i

{bi | xi, ξ} = {xi | bi, ξ} {bi | ξ} {xi | ξ}^(-1)

4) Given n-place vectors ε and δ, and a distribution D on examples in pos(c°), where ∀i we have εi < 1 and 0 < δi < 1, Φ will output an updated version of B, denoted B*, that is probably approximately correct in nearly the same sense as Valiant: that is, Φ halts and outputs B* such that ∀i, with probability greater than 1 - δi we have |b*i - ci| < εi, where b*i denotes the ith variable of B*.

5) Let εmin and δmin be the smallest elements of ε and δ, respectively. Let μi denote the mean of the corresponding distribution βi. Then Φ computes B* in time polynomial in 1/εmin, in 1/δmin, and in 1/μmin.

One of the shortcomings of this definition of a BLS is that its running time is determined solely by the smallest values of the probability parameters for a particular attribute. The time cost of using a BLS, then, is determined exclusively by the parameters of the most precise variable(s) of interest. In practice, sampling in order to satisfy the most precise variables' probability parameters will necessarily bring all the variables to matching levels. Future work will focus on accounting more completely for observation costs, not just in terms of sample size but also particular values of a given example's variables.

Finally, we shall say that a class F is Bayesian polynomially learnable iff

1) there exists some Bernoulli representation class B such that F and B are equivalent; and

2) there exists a BLS Φ for B relative to F.

5.0 Preliminary Results and Future Work

We have just begun to study learning in this framework, though there are a number of interesting results we have already derived for the model. Proofs and further development of the Bayesian learning framework are forthcoming in [Chavez].

Theorem 1: Let C be some Boolean representation class, and let B be some Bernoulli class equivalent to it. Then B converges to C in the following sense: if Φ is a BLS for B relative to C, then as N, the number of examples that Φ observes, goes to ∞, for all i we have bi → ci.

A more significant result is Theorem 2, which essentially tells us that anything learnable in COLT is learnable in a BLS as well.

Theorem 2: Any Boolean representation class C is Bayesian polynomially learnable.
We prove Theorem 2 by using Monte-Carlo sampling and recent approaches to Bayesian analyses of simulation algorithms [Dagum]. Such analysis also shows that the details of prior distributions often "wash out" quickly in the sampling process, thus addressing the standard complaint from frequentists worried about the adverse effects of assigning possibly arbitrary or mistaken prior distributions. Because we use Monte-Carlo sampling techniques, the complexity of inference depends only on the sizes of the samples used, and not on the number of variables instantiated during learning. Thus, we have the following important result.
Corollary: The computational complexity of a BLS is independent of the complexity of the concept representation.

The preceding corollary appears to be a significant advancement over COLT methods, where the complexity of the representation used figures prominently in the running times of the algorithms. As work proceeds, we will pay special attention to the sources of cost and complexity in a BLS, and how they distinguish it, either positively or negatively, from COLT approaches.
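The pieces of the definition above can be assembled into a rough operational sketch. Everything here is our own construction under stated assumptions: the βi are taken to be Beta densities (condition 1), the Bayes update for a Bernoulli observation is then a count increment (condition 2), and a Hoeffding-style sample size stands in for the polynomial time bound of condition 5.

```python
import math
import random

def bls_update(counts, example):
    # Conditions 1-2: one Beta density per attribute, summarized by
    # pseudo-counts (alpha, beta); Bayes' rule for a 0/1 observation
    # is a count increment.
    for i, x_i in enumerate(example):
        a, b = counts[i]
        counts[i] = (a + 1, b) if x_i == 1 else (a, b + 1)

def sample_size(eps, delta):
    # Hoeffding-style stand-in for condition 5: enough examples that
    # each posterior mean lands within eps of c_i with probability
    # at least 1 - delta.
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

def learn(true_c, eps, delta, rng):
    # true_c plays the role of the Bernoulli facts c_i being learned.
    n = len(true_c)
    counts = [(1, 1)] * n  # uniform Beta(1, 1) priors
    for _ in range(sample_size(eps, delta)):
        example = [1 if rng.random() < c_i else 0 for c_i in true_c]
        bls_update(counts, example)
    return [a / (a + b) for a, b in counts]  # B*: the posterior means

rng = random.Random(0)
true_c = [0.95, 0.5, 0.1]
b_star = learn(true_c, eps=0.05, delta=0.05, rng=rng)
```

Because the sample size depends only on ε and δ and never on the number of variables, the sketch also reflects the corollary above; and since the prior pseudo-counts stay fixed while observation counts grow, mistaken priors are swamped by data, the "wash-out" effect noted in the discussion of Theorem 2.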
Acknowledgments: I would like to thank Paul Dagum and Eric Horvitz for insightful comments and discussions on this work. Eric Horvitz read and helped clarify a draft of this paper. This research was supported by IR&D Funds of the Decision-Theoretic Systems Group of the Palo Alto Laboratory, Rockwell International Science Center.

References

[Chavez] Tom Chavez. "Bayesian Learning Schemes," Technical Report, Rockwell International Science Lab, Palo Alto, CA.

[Dagum] Paul Dagum and Eric Horvitz. "A Bayesian Analysis of Randomized Approximation Procedures," KSL-91-54, Knowledge Systems Laboratory, Stanford University, Stanford, CA, September 1991.

[de Finetti] Bruno de Finetti. "Foresight: Its logical laws, its subjective sources," reprinted in H.E. Kyburg and H.E. Smokler (eds.), Studies in Subjective Probability. London: John Wiley, 1964, p. 93-158.

[Garey] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman and Company, 1979.

[Good] I.J. Good. "The Bayesian Influence, or how to sweep subjectivism under the carpet," in Harper and Hooker (eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theory of Science. Dordrecht: Reidel.

[Kearns] Michael J. Kearns. The Computational Complexity of Machine Learning. Cambridge, MA: The MIT Press, 1990.

[Littlestone] N. Littlestone. "Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm," Machine Learning, 2(4), 1988, p. 285-318.

[Valiant] L.G. Valiant. "A Theory of the Learnable," Communications of the ACM, 27(11), 1984, p. 1134-1142.