From: AAAI Technical Report SS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Computational Learning Theory: A Bayesian Perspective

Tom Chavez
Department of Engineering-Economic Systems, Stanford University / Rockwell International Science Lab, 444 High St., Palo Alto, CA 94301

Abstract: Computational learning theory (COLT), the field of research stemming from Valiant's seminal 1984 paper [Valiant], differs from previous attempts at formalizing inference in that it allows uncertainty and requires efficiency: an algorithm counts as a learning algorithm only if it errs within probability parameters we stipulate in advance, and only if its output can be arrived at in polynomial time. Yet there are many aspects of learning that the model does not accommodate. It does not allow a learning agent to bring and assimilate prior beliefs to a problem. While it allows uncertainty in the form of its error (ε) and confidence (δ) parameters, it does not allow more specific "hedging" about the elements of a hypothesized concept. It does not permit us to support positively a hypothesis on the basis of sampled data, and it does not allow us to learn facts such as whether or not a coin is fair. Most distressingly, it relies on a classical statistical approach that is incompatible with the expressive requirements of real-world inference and with the more convincing, well-justified notions of subjective probability.

This paper introduces a reformulation of the model which solves the preceding difficulties. We have been investigating a model of learning similar to COLT in its demands for efficiency and distribution-independence, yet one that also possesses the attractive features of Bayesianism, such as prior beliefs, continuous, incremental updating, and subjective probability. Our approach covers the classes of learnable concepts covered by COLT, as well as other categories of facts and concepts that it does not currently support. The new methodology possesses other distinct advantages: mistaken prior beliefs wash out quickly; the exactitude of inference over individual attributes can be specified in advance; and the efficiency of learning does not depend on the size of the concept space.

1.0 Introduction

Learning is such an intriguing topic because it draws on and concerns itself with some of the most interesting elements of psychology, computer science and philosophy. It has to do with psychology because we would like to know how people actually learn things; we would like to understand better how we form, support, extend and refute concepts and ideas. It touches on many ideas from philosophy, but most directly on topics from the philosophy of science: how scientific theory emerges from large, complex and apparently self-contradictory bodies of data. Finally, it has much to do with computer science and artificial intelligence because we would like to build agents and programs that can go out in the world and infer new knowledge in some robust, efficient way.

As good as Valiant's model has proven at capturing and formalizing the most distinctive and significant aspects of the learning process, we believe that it possesses certain shortcomings. We first present an overview of COLT, and we highlight its useful features and insights. We proceed to identify certain shortcomings with it and describe why they should concern anybody who studies inference. Many of those perceived difficulties have to do with the age-old debate between Bayesians and non-Bayesians in philosophy and mathematical statistics. We will try to focus that debate as much as possible into the current context, though some of what we argue requires, or at least references, arguments presented elsewhere and in other contexts.
We go on to present an alternate model of inference, or perhaps simply a more refined, Bayesian version of the COLT model, based on the method of updating Bernoulli parameters using second-order conjugate distributions. Along the way, we observe how the approach presented here solves the deficiencies observed in the COLT model, and we proceed to identify its other strengths as well.

2.0 COLT: An Overview

We define a representation over X to be a pair (σ, C), where C ⊆ {0,1}* and σ is a mapping σ: C → 2^X. For some fixed c ∈ C, we call σ(c) the concept over X, and σ(C) the concept class. We will also use pos(c) = σ(c), the image of σ, for the set of positive examples of the concept. We assume that domain points x ∈ X and concept representations c ∈ C are efficiently encoded using any of the standard techniques (see, for example, [Garey]). Since X and σ are usually clear from context, we will often simply refer to the representation class C.¹

¹ The reader should not be disturbed by the sudden formalism. We are simply making explicit the notion that, in order to represent a concept in terms of a clustering of elements in the domain X, we must define a representation class C which identifies those elements for a fixed concept c.

A labeled example from a domain X is a pair <x,b>, where x ∈ X and b ∈ {0,1}. In this paper, we will be considering only positive examples, i.e., elements of X for which b = 1. We shall refer to a sample as some finite set of positive examples over X.

2.1 An Example

We will first motivate things with a simple example, following [Kearns]. Suppose we are interested in learning about members of the animal kingdom, and to this end we create a collection of Boolean variables describing various features of animals. We assume that our variables, such as has_mane, walks_on_four_legs, does_fly, hatches_eggs, completely cover all the aspects we are interested in, and, of course, that the number of variables is finite. Now we might like to use this encoding, along with some sort of inference scheme, to learn what a particular animal is. More exactly, we want some framework for seeing examples of lions, say, and then constructing a concept that allows us to distinguish lions from non-lions on the basis of those observations.

2.2 Representations

We let X denote a set called a domain, also sometimes referred to as an instance space. We can think of X as the set of all encodings of objects, events, or phenomena of interest to us. For example, an element of X might be an object in a room possessing particular values for different properties such as color, height and shape. Or, in the case of our preceding example, an element of X is a particular animal, such as a cheetah. A learning algorithm seeks to build a concept for some element, or most typically a subset of elements, in X.

In COLT, because concepts are usually considered clusterings of properties, we typically use concept representations that are expressible as simple rules over domain instances. Boolean formulae are obvious candidates for representing COLT concepts because they can capture such rule-based descriptions easily and elegantly. In most of COLT, we find results relative to specific, parameterized classes of representations. Typically we have a stratified domain X = ∪ X_n and C = ∪ C_n. The parameter n can be regarded as an appropriate measure of complexity. Usually n is the number of attributes, or variables, in the domain X. As an example, X_n might be the set {0,1}^n and C_n the set M_n of all monomials of length ≤ n, i.e., the set of all conjunctions of literals over the Boolean variables x_1,...,x_n.

Returning to our example, we see that, given our variable definitions corresponding to animal attributes, we can represent a lion as some fixed combination of values over those variables. If all of our variables are Boolean, then each variable corresponds to the possession or non-possession of some attribute. Suppose that we have defined n variables such that each possible combination (out of the 2^n possible combinations) effectively describes some member of the animal kingdom. For example, is_bird ∧ ¬does_fly ∧ weighs_over_200_lbs ∧ runs_fast might be a concept representation for an ostrich, where the concept class is taken to be M_n as defined above. Provided that we have a clearly defined ordering on the variables, it is a simple matter to construct a labeled sample consisting of lions and non-lions.

2.3 Distributions

Now that we have a way of representing concepts and phenomena, as well as definitions for examples and samples, we need some precise method for presenting those examples to a learning algorithm. Suppose we have a fixed representation for the concept of a lion given by is_mammal ∧ is_large ∧ has_four_legs ∧ has_claws. In computational learning theory, a learning algorithm is given examples of a single concept c ∈ C according to some fixed but unknown distribution D. The idea of learning with respect to a fixed but unknown, arbitrary distribution on examples (often referred to as distribution-free learning) is one of the great strengths of the COLT model. We can think of the target distribution as a means of capturing the structure of the real world; not necessarily capturing that structure well, but at least as representing some slice, however inaccurate or biased, of the phenomena we are trying to learn about. Because we seek a framework for describing learning in unanticipated scenarios, we will rarely have access to the distributions on data or examples prior to constructing the learning algorithms. Therefore we seek algorithms that perform well under any target distribution. Of course, if the algorithm sees examples according to a skewed distribution, it will infer concepts reflecting only the data it has seen, and not necessarily the objective structure of the learning domain. Thus we are forced to characterize the error of the learning algorithm in terms of the target distribution.
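To make the preceding representation scheme concrete, the following Python sketch (our own illustration; the attribute names and the `evaluate` helper are not part of the paper's formalism) encodes a domain point as a truth assignment over named Boolean attributes and a monomial concept as the conjunction of literals it must satisfy:

```python
# A sketch (ours, not the paper's): domain points are truth assignments
# over named Boolean attributes; a monomial concept c is stored as the
# truth value each mentioned attribute must take (a conjunction of literals).

def evaluate(concept, x):
    """Return c(x): True iff every literal of the monomial is satisfied by x."""
    return all(x[attr] == required for attr, required in concept.items())

# The lion concept used in the text: is_mammal ∧ is_large ∧ has_four_legs ∧ has_claws.
lion = {"is_mammal": True, "is_large": True,
        "has_four_legs": True, "has_claws": True}

# Two domain points: a lion and a cheetah (smaller, so is_large is False).
leo = {"is_mammal": True, "is_large": True, "has_four_legs": True,
       "has_claws": True, "has_mane": True, "does_fly": False}
cheetah = {"is_mammal": True, "is_large": False, "has_four_legs": True,
           "has_claws": True, "has_mane": False, "does_fly": False}

print(evaluate(lion, leo))      # True: all four literals are satisfied
print(evaluate(lion, cheetah))  # False: the is_large literal fails
```

Note that evaluating a monomial is linear in the number of literals, which is what makes this class polynomially evaluatable in the sense defined below.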
Suppose that a learning algorithm outputs a hypothesis h from a representation class H after seeing examples from the instance space drawn according to D. (Recent work in computational learning theory distinguishes between H and C because it is possible for some learning algorithms to create hypotheses in one representation class that serve as good approximations to target concepts in another.) COLT defines the error ε of h as

  ε = Σ_{x ∈ X : h(x) ≠ c(x)} D(x)

That is, ε is the probability weight according to D of all the mistakes h can possibly make. Measuring ε in this manner implies that "fluke" examples occurring with very low probability in the sampling process contribute only a negligible amount of error to the hypothesis. COLT requires that a learning algorithm output a hypothesis with error no larger than ε regardless of the distribution D, where ε is specified in advance. Returning to our example, if we specify a value of ε = 0.1, then a randomly drawn animal will be correctly classified as a lion or a non-lion with probability at least 90%. Conversely, the probability that the hypothesis mistakes a cheetah for a lion, say, is less than 10%.

2.4 Learnability

In [Kearns], Kearns makes a point not usually addressed in much of the COLT literature: because we are interested in computational efficiency, we must insure that the hypotheses output by a learning algorithm be evaluated efficiently, not just that they be arrived at efficiently. It is of little use to us if a learning algorithm outputs a system of differential equations that can only be evaluated in an exponential number of time steps. Thus we require a preliminary definition: if C is a representation class over X, then we say that C is polynomially evaluatable if there is a polynomial algorithm A that on input a representation c ∈ C and a domain point x ∈ X outputs c(x). We are now ready to present the central definition of learnability according to COLT. Let C and H be representation classes over X.
Then C is learnable from examples by H if there exists a probabilistic algorithm A such that for any c ∈ C, if A has access to examples from pos(c) drawn according to any fixed but unknown distribution D, then for any inputs ε and δ such that 0 < ε, δ < 1, A outputs a hypothesis h ∈ H that with probability 1-δ has error less than ε. Typically we include a requirement of efficiency as well: C is polynomially learnable by examples from H if C and H are polynomially evaluatable and A runs in time polynomial in 1/ε, 1/δ, and |c|. Often we will drop the phrase "by examples from H" and simply say that C is learnable, or polynomially learnable.

Suppose we are attempting to learn a monomial representation is_mammal ∧ is_large ∧ has_four_legs ∧ has_claws for the concept of a lion. A COLT learning algorithm is given a distribution on lions, where each lion is represented as a truth assignment over the variables we have defined. The algorithm is then fed positive examples of lions as it attempts to construct a monomial sufficiently close to the target monomial so as to meet error specifications given by ε and δ. We can employ the learning algorithm given by Valiant in his ground-breaking paper [Valiant]. Let x_1,...,x_n denote literals (what we have been giving names such as has_four_legs), let POS denote an oracle which returns a positive example of a lion drawn according to a distribution D, and initialize the hypothesis h to the conjunction of all 2n literals x_1, ¬x_1, ..., x_n, ¬x_n:

  for i := 1 to m do     {m satisfies the ε and δ constraints}
  begin
    v := POS;
    for j := 1 to n do
      if v_j = 0 then delete x_j from h
      else delete ¬x_j from h
  end;
  output h

As the learning algorithm sees positive examples of lions, variables such as can_speak and breathes_underwater will be quickly deleted from h, since it is safe to assume that no lions can speak or breathe underwater. A variable such as has_mane, meanwhile, will be eliminated from h with probability roughly 1/2 on each example, presuming that the sampling distribution D accurately reflects the fact that about half of all lions are males, and only male lions have manes. Variables such as has_claws and is_mammal are true for all positive examples of lions, and so should withstand the pruning process of the learning algorithm and be included in the output hypothesis.

The learning algorithm can err in two ways, corresponding to the two error parameters ε and δ. It can make an ε-type error in the following manner: suppose that a rare breed of Tibetan lion has no claws; yet because those lions are so rare, not one of them appeared in the training set of examples encountered by the algorithm. Thus the literal has_claws should have been deleted from the monomial. The problem is not serious, since the probability of misclassifying a Tibetan lion as a non-lion is bounded above by ε, and ε is at least as great as the probability weight of all Tibetan lions with respect to the distribution D. A δ-type error can occur if a randomly drawn set of lions is highly unrepresentative of lions in general; for example, if every lion in the training set just happens to live in the circus, then the learning algorithm might include lives_in_the_circus in its concept for a lion.

Ideally, we would like to have learning algorithms that never make mistakes. Such a goal, however, is clearly impossible, given the richness and unpredictability of most learning domains. The next best thing, it would seem, is to have some grasp of the types of error learning algorithms make as well as the magnitude of those errors in probabilistic terms. COLT provides such a capability, and this has distinguished it favorably from previous attempts at formalizing inductive inference.

3.0 Where COLT falls short
3.1 Prior beliefs, prior knowledge

Essential to the process of learning is building upon what is already known. Neither we nor agents approach new phenomena as tabula rasa, blank boards upon which we write the solution to a problem or engrave the representation of a concept. If the car is broken, we do not infer what is wrong with it just by looking under the hood. Rather we employ all of our background knowledge about how cars work and especially what causes them to fail. Similarly, in the realm of intelligent agency, we would like robots to use knowledge we have given them at design time, as well as everything they learn in the meanwhile, as they proceed to take action in the world. COLT provides a good model for inductive inference for a fixed problem, but it does not provide machinery to describe how a particular problem might be approached; in particular, it does not give us a means for incorporating possibly uncertain prior beliefs or knowledge about the focus of inference. Arguably, a concept can be stored, and then at some other time the learning algorithm can see more examples and thereby reduce its error and increase its confidence for the concept. In this sense we might say that the learner has incorporated "prior" knowledge of a concept C in order to infer "more" about C, but clearly we are employing poorly-defined notions of "prior knowledge" and "more."

What we want, rather, is the ability to draw on knowledge in a knowledge base in order to focus and perhaps even accelerate the process of inference. Rather than throwing static concepts into a knowledge base and drawing on them only to complete contrived chains of reasoning or to answer straightforward queries, we wish to use static concepts as dynamic contributors to what we are learning. Rather than using prior knowledge as solely a constraining mechanism, a tool for checking the consistency of what we already know against what we are seeing, we should also use it as a means of enriching and clarifying inference.

Computational learning theory makes what we might call a "freeze frame" assumption: what we infer is based on what we see for a particular learning instance. COLT has nothing to say about what came before or after, and while this has proven useful for stripping away complexity in an attempt to get at an essential model for learning, it is too drastic; what has come before (and perhaps what comes after) would seem to matter a great deal to what we are inferring.

Returning to our previous example, suppose we are trying to learn a concept for a lion based on positive examples. A COLT algorithm begins learning as if it knows nothing about lions: has_mane, walks_on_four_legs, and flies are all unknown variables that may assume the values 1 or 0 (True or False). Yet it seems quite reasonable that the learner might already know that lions do not fly and that they are mammals, and it would be good to incorporate this knowledge somehow into the learning process.

Now one might argue that we can simply set the values of those variables in advance. The first problem with this response is that those variables, since they are included in the representation scheme for the concept space, already determine the complexity of inference. We must redo our proofs of the running times of the learning algorithm or recompute lower bounds to match the representation we are using. In our example, if we are using an M_n representation with n=20 and we already know that lions do not fly and that they are mammals, then we are really learning M_n for n=18. Some work has been done studying some versions of this problem (e.g., [Littlestone]), but still we might like to account for it in the basic model.

The second, much more serious problem is that we might not really be able to commit with certainty to the fact that lions do not fly, or, more to the point, that all lions have claws (remember the Tibetan breed). All the lions that the learner might have seen before had claws, but it would clearly be erroneous to assume the quantified proposition that all lions have claws. In fact, the learner's previous experience about lions with claws points to the fact that it has only seen clawed lions. Yet rather than discard this bit of information, we would like to bring it to the learning instance somehow in a way that allows revision yet adds information. Thus we would like to have a means for incorporating prior beliefs of the form "Most lions have claws" or "Most lions are some shade of brown." The way to do this is with probability, and we will have more to say about this soon.

3.2 Hedging within a concept

One of COLT's strengths is that it gives explicit probability parameters describing the uncertainty of the inference procedure. Thus it is often referred to as "PAC learning," for probably approximately correct learning.

Yet suppose that as we are learning, we want to "hedge" on particular aspects of the concept we are building. Going back to our example, suppose that we see a lion with a mane, and then another with no mane. In COLT, since the literal has_mane is not satisfiable in light of this apparent indeterminacy, it and its negation are eliminated from the output hypothesis. Rather than throw out the baby with the bath water, however, we might like to express and localize our uncertainty about the variable has_mane. In fact, the variable is not indeterminate at all. The truth of the matter is that about half of all lions have manes and the other half do not; this seems like information worth including in our concept for a lion. There are many conceivable cases where outside knowledge can affect the scope and accuracy of inference, particularly with respect to individual variables. COLT does not allow hedging on particulars, and this is something a more robust model ought to allow.

3.3 Degrees of support

Suppose as a learning algorithm goes about observing lions, it sees that every lion in the sample so far walks on four legs, and not a single one flies. Most weigh more than 100 pounds, and most chase antelope, while about half of them have manes. A COLT algorithm will infer only what is consistent in the Boolean logical sense with the entire sample. Thus, even though most of them weigh over 100 pounds, it might have seen some baby lions that weigh less than 100 pounds, and thus the value of the weighs_more_than_100_pounds variable in the algorithm's output concept is False. After seeing female lions, it will make the more egregious error and conclude that has_mane = False. It will infer only something like has_four_legs ∧ ¬flies; clearly much valuable information has been hidden or effectively lost.

What we would like to have are degrees of support over all the attributes of the learning domain, some way of capturing and expressing all the richness and uncertainty of the learning process. If we observe that all lions have four legs, we are not merely learning that this is consistent with the concept has_four_legs ∧ ¬flies, we are also learning something about the extent to which belief in the variable has_four_legs is supported. Plainly, this sort of uncertain knowledge goes to the very heart of what it is to learn something. Uncertainty is not a bad thing, so long as we have a means of capturing it in some expressive, axiomatic way.

3.4 Learning other facts

There are many other facts in the world worthy of learning that COLT does not accommodate. Although it is based on statistics, it does not give us a framework for learning various sorts of statistical facts; for example, whether or not a coin is fair after observing a certain number of tosses.
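The coin-fairness example can be made concrete with the kind of second-order updating developed later in this paper; the following sketch (our own illustration, using a conjugate Beta prior over the coin's unknown bias) shows the statistical fact being learned incrementally:

```python
# Our illustration of the coin-fairness point: maintain a Beta(a, b)
# second-order distribution over the coin's unknown probability of heads,
# and update it by Bayes' rule after each observed toss.

def update(a, b, toss):
    """Conjugate Bayes update of Beta(a, b) on one toss (1 = heads, 0 = tails)."""
    return (a + 1, b) if toss == 1 else (a, b + 1)

a, b = 1, 1                              # Beta(1, 1): uniform prior over the bias
for toss in (1, 0, 1, 1, 0, 0, 1, 0):    # eight observed tosses, four of them heads
    a, b = update(a, b, toss)

mean = a / (a + b)                       # posterior mean estimate of P(heads)
print(a, b, mean)                        # 5 5 0.5 -- belief concentrates near "fair"
```

Nothing here is a logical rule over "Heads" and "Tails"; the proposition being supported is about the coin itself.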
Nor does COLT contextualize what it learns to the realm of action, yet action would seem to constitute the backdrop against which anything worth learning is actually learned (for more discussion of this point, see [Good]). More significantly, it is not at all obvious how one would go about extending COLT into this decision-theoretic domain, e.g., accounting for learning in the inevitable presence of constraints on learning resources, or learning in the presence of variable costs of information (when it costs valuable time or money to see another example, say). Theoretical strides in research on decision making under bounded resources cannot be smoothly integrated into traditional COLT techniques.

3.5 The right kind of probability

The most serious problem with COLT is that it relies on a faulty interpretation of probability that in turn fosters a false sense of comfort with the probability parameters ε and δ. We can highlight this difficulty by asking simply, "What is meant by the term 'probably approximately correct'?" Nowhere in the literature (so far as we know) has there been any systematic discussion of COLT's probabilistic approach, its consequences and assumptions. There has been ongoing debate in mathematical statistics, engineering and philosophy surrounding proper interpretations of probability. We can nevertheless isolate a few points relevant to our current context.
Because the probability parameters in COLT are specified in advance, and sample sizes are computed relative to those parameters following standard methods from classical (frequentist) statistics, it seems that COLT's intended, or at least default, probabilistic interpretation is roughly as follows: given a learning domain (i.e., a concept space, a suitable representation and some distribution D on examples fixed in advance), in N identical, repeatable learning instances where a number of examples given by the bounds of COLT are observed according to D, a learning algorithm will output representations such that as N goes to ∞, the proportion of representations which have error greater than ε is less than δ. Now this is a very rough sketch, for we have left much unspecified. Do we fix the learning algorithm, or do we allow different learning algorithms for varying values of N, as long as they all perform learning for the same concept space? Do we fix the concept being learned as well? Is it allowable to fix the concept and vary the learning algorithms? We have also ignored the question of error, though it is safe to take it as the weight relative to D of examples misclassified by the learning algorithm. But then we might ask if it is sensible to vary the distributions from learning instance to learning instance. All of these permutations call into question the very meaning of repeatability required by a frequentist interpretation. In what sense is each learning instance a repeatable trial identical to any other learning instance? Notice how far we are from the simplistic realm of coin flipping, where,
all things being equal, it appears relatively safe to assume that distinct flips of a coin are repeatable and identical.

The approach to assigning values for the probability parameters in COLT is indistinguishable from the way in which people assign thresholds for confidence intervals. We say that we want δ = .1; but what does this mean, really, and what makes it different from δ = .05? What constitutes "extreme" error or confidence? (See [Good], p. 49.) The confidence interval is, as most honest statisticians admit, a confidence trick, a way of making confidence pronouncements that are true in 95% or 99% of cases in the long run. Yet we have already seen how difficult it is to fix any reasonable notion of a "long run" for learning. Even if we could fix some sensible notion of the "long run" for learning, confidence techniques from classical statistics still do not give us what we want. As Keynes succinctly put it, "In the long run we are all dead."

As anybody familiar with these kinds of statistical sampling techniques knows, the samples required to accommodate the pre-specified probability parameters are often absurdly large. Moreover, they flagrantly ignore information acquired during sampling. The criticism can be brought into focus by giving an example due to Good: if you observe 300 spins of a roulette wheel and there is no occurrence of a 7, should you regard the probability of a seven on the next spin as 0 or 1/37, its official value? A frequentist's most likely response is to tell you to spin it another 200 times.

Some might argue that this is irrelevant since COLT does not support that kind of inference, and though this reply is misguided, we can nevertheless accommodate the objection by means of another example. Suppose we are trying to build a concept for a lion, and our learning algorithm has seen r examples so far, where r << m, the number of examples prescribed by COLT in order to achieve confidence 1-δ and error ε. Yet the learning algorithm might have already observed that all lions are brown, walk on four legs, and have lots of teeth: enough information to classify lions relative to the given distribution with error less than ε. Regardless of how clever this particular learning algorithm is, the number of samples required is fixed in advance, and, in many cases, much more than is needed or even possible.
We take an example from [Kearns], p. 16, where we are interested in learning a monomial over n literals using positive examples only. Kearns gives the following exact bound for the number of examples required:

  (2n/ε) (log₂ 2n + log₂(1/δ))

Suppose we are using 4 literals, and we set δ = 1/16 and ε = 1/10. Then the preceding equation implies that a COLT learning algorithm requires (2·4/0.1)·(log₂ 8 + log₂ 16) = 80·(3 + 4) = 560 examples; yet there are only 2⁴, or 16, possible monomials over 4 literals, so clearly the number of examples needed by the algorithm is overstated (with just 16 examples, we have δ = ε = 0). In short, the statistical techniques upon which COLT relies do not allow inference to proceed dynamically based on relevant knowledge acquired during the sampling process.

3.6 Bayesian learning as a solution

The solution is a more realistic, more honest acceptance of what we are really using all along: subjective, or personal, probability. Those who study COLT presumably have no difficulty using probability as a means of expressing uncertainty. If that is the case, then let us make ourselves clear about what we mean when we use it. We follow the conventional Bayesian view in defining a probability as a degree of belief in a proposition E given the background state of information I in the believer's mind. We shall not dwell on the distinct, yet consistent, perspectives on subjective probability (e.g., logical subjective probability versus personal subjective probability). Like probabilities interpreted as frequencies, we employ the axioms of probability to regulate and make consistent all our probability assessments, now considered as subjective measures of belief. The approach is honest because, as has been argued forcefully elsewhere, it is most likely what frequentists and other empiricists are doing anyway; and, most attractively, it "covers" the other models and interpretations [de Finetti], so that one can continue adhering stubbornly to one's frequentist ways while still accepting everything we do here.
Theimmediateconsequenceof our approachis that rather than employingBooleantruth values (0 or 1, True or False) for the conceptsand facts we will be inferring, we Bayesian Learnablllty employa closed unit interval for each variable to describe belief in its truth or falsity. Ahighvaluemeansthat we ascribea highlevel of belief to the variable’sbeingtrue, anda lowvalueimpliesa lowlevel of belief in its truth, or, conversely,a higherbelief in its falsity. Suchan attackwill allowus to do all the things wewouldlike to be able to do with COLT:hedgingon particulars, degrees of support (wheresupport, using our newterminology,is indistinguishablefrombelief), andprior beliefs. subtractionof degreesof belief for elementsof a concept based on observeddata. The solution we posited wasa Bayesian,subjective use of probability. Thepoint of defining equivalencebetweenrepresentations is that the concept beinglearnedor the fact beinginferred can still retain all of its exactitudein the Booleanrepresentation,only nowwe are being honest: no matter howmanytimes we flip the coin, no matter howmanylions we see with four legs, we will alwaysallow roomfor someuncertainty. Thuswe will performinference over a Bernoullirepresentation relative to the exact facts describedin the Boolean one. TheBooleanfacts are still true or false - wejust never presume(or pretend) to knowthemwith certainty. Thegoal of a learningalgorithmin the COLT literature is Themost significant consequence of this approachis that to infer someunknown subset of X, called a concept, after seeing a certain numberof its instances. Weaccommodate we are alwayslearning. Attributes of a conceptare always undergoingpositive or negativeupdatingconsistent with such tasks, but wealso allowsomeflexibility for learning the laws of probability. 
TheBernoullirepresentationis a propositionswhichneednot be directly stated in strict logmoreepistemologicallyaccurate framewithin whichto ical termsrelative to the domain.For example,learning performinference becauseit captures and expresses whetheror not a coin is fair is a propositionnot directly knowledge such as the fact that mostbirds fly, but penexpressible as a combinationof"Tails" or "Heads,"yet it guins do not. Mostlions havefour legs, but wedo not is somethingour modelallowsas a valid subject of inferwantto abandonthat fact altogether if weencounterone ence. Thusour approachsubsumesCOLT in that it allows that lost a leg in a bloodyfight witha rhinoceros.One learning beyondsimple conceptsexpressibleas logical importantadvantageof this approach,then, is that it cirrules over domaininstances. For the sakeof brevity, we cumventsthe traditional problemsof monotonicityAI shall not consistentlymakeexplicit these different types of researchers havestruggled with for so long, problems learning, but it is useful to keepthemin mindas we proCOLT never fully escapes. ceed. 4.0 BayesianLearnability Still, wehaveno formalismfor performingthe type of inference we seek. Whatwe wantis a frameworkthat possesses all of the strengths of COLT -- in particular, demandsfor efficiency and allowancesfor uncertainty -yet also accommodates desiderata such as hedging, ranges of uncertaintyfor particular attributes, and prior beliefs. Thesolution, and the mainresult of this paper, is whatwe call a Bayesianlearning scheme,whichwe will also abbreviate by BLS. Wenowpresent a definition central to this paper: a nBernoullirem’esentationclass is a representationclass over n Bernoullivariables, i.e., variables assumingsome value in the closed interval [03]. Moreover,weshall say that a p-Booleanrepresentation class and a q-Bernoulli representationclass are equivalentif there exists someisomorphism betweenthe literals of the Booleanclass and the variables of the Bernoulliclass. 
Since such an isomorphism exists whenever the two classes use the same number of variables, two such classes are equivalent just in case p = q. (We will sometimes drop the parameter n and simply refer to Boolean and Bernoulli representation classes; we will always assume the existence of a fixed domain X.)

It is perhaps appropriate at this point to state the motivation for the preceding definition of equivalence. In Section 3, we observed some of COLT's shortcomings: in particular, its rigidity with respect to particular attributes of a concept being learned, and its inability to accommodate addition or subtraction of degrees of belief for elements of a concept based on observed data. The solution we posited was a Bayesian, subjective use of probability. The point of defining equivalence between representations is that the concept being learned or the fact being inferred can still retain all of its exactitude in the Boolean representation, only now we are being honest: no matter how many times we flip the coin, no matter how many lions we see with four legs, we will always allow room for some uncertainty. Thus we perform inference over a Bernoulli representation relative to the exact facts described in the Boolean one. The Boolean facts are still true or false; we just never presume (or pretend) to know them with certainty.

The most significant consequence of this approach is that we are always learning. Attributes of a concept are always undergoing positive or negative updating consistent with the laws of probability. The Bernoulli representation is a more epistemologically accurate frame within which to perform inference because it captures and expresses knowledge such as the fact that most birds fly, but penguins do not. Most lions have four legs, but we do not want to abandon that fact altogether if we encounter one that lost a leg in a bloody fight with a rhinoceros. One important advantage of this approach, then, is that it circumvents the traditional problems of monotonicity that AI researchers have struggled with for so long -- problems COLT never fully escapes.

Definition: Suppose that a Boolean representation class B is equivalent to a Bernoulli representation class C, and let bi and ci denote the ith variables of B and C, respectively. We shall denote the domain of inference by X. We shall call Φ a BLS for B relative to C iff the following all obtain:

1) Φ maintains a set of distributions βi, for i = 1..n, where n is the number of variables in B and C. Each βi encapsulates current belief about the value of ci, and is therefore a second-order density function on the Bernoulli variable bi.

2) Let c° denote some fixed concept or fact in C. For some x ∈ pos(c°), let xi ∈ {0,1} denote the observation value of the ith attribute or variable of x. We denote the conjunction of all background information and knowledge by ξ, and brackets denote probabilities. After Φ observes x, it updates each βi according to Bayes' theorem.

3) More exactly, we have, for all i,

{bi | xi, ξ} = {xi | bi, ξ} {bi | ξ} / {xi | ξ}.

4) Given n-place vectors ε and δ, and a distribution D on examples in pos(c°), where for all i we have εi < 1 and 0 < δi < 1, Φ will output an updated version of B, denoted B*, that is probably approximately correct in nearly the same sense as Valiant's: that is, Φ halts and outputs B* such that, for all i, with probability greater than 1 − δi we have |bi* − ci| < εi, where bi* denotes the ith variable of B*.

5) Let εmin and δmin be the smallest elements of ε and δ, respectively, and let μmin be the smallest of the means μi of the distributions βi. Then Φ computes B* in time polynomial in 1/εmin, in 1/δmin, and in 1/μmin.
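The updating required of a BLS can be made concrete under one additional assumption of our own: if each second-order density βi is taken to be a Beta distribution (a conjugate family the definition itself does not mandate), then Bayes' theorem reduces to incrementing a count. A minimal sketch, with all names our own:

```python
class BLS:
    """Illustrative Bayesian learning scheme over n Bernoulli variables,
    assuming a Beta second-order density Beta(a, b) for each beta_i."""

    def __init__(self, n, prior=(1.0, 1.0)):
        # beta_i encapsulates current belief about the value of c_i
        self.params = [list(prior) for _ in range(n)]

    def observe(self, x):
        """Update every beta_i on an example x with attributes x_i in {0, 1}."""
        for i, xi in enumerate(x):
            if xi == 1:
                self.params[i][0] += 1  # posterior Beta(a + 1, b)
            else:
                self.params[i][1] += 1  # posterior Beta(a, b + 1)

    def mean(self, i):
        """Posterior mean mu_i of beta_i: the current point belief in b_i."""
        a, b = self.params[i]
        return a / (a + b)

phi = BLS(n=2)
for x in [(1, 0), (1, 0), (1, 1), (1, 0)]:
    phi.observe(x)
print(round(phi.mean(0), 3), round(phi.mean(1), 3))  # → 0.833 0.333
```

The posterior mean of each βi then serves as the current degree of belief in bi; no attribute is ever driven all the way to 0 or 1 by finitely many observations.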
Finally, we shall say that a class F is Bayesian polynomially learnable iff 1) there exists some Bernoulli representation class B such that F and B are equivalent; and 2) there exists a BLS Φ for F relative to B.

5.0 Preliminary Results and Future Work

We have just begun to study learning in this framework, though there are a number of interesting results we have already derived for the model. Proofs and further development of the Bayesian learning framework are forthcoming in [Chavez].

Theorem 1: Let C be some Boolean representation class, and let B be some Bernoulli class equivalent to it. Then B converges to C in the following sense: if Φ is a BLS for B relative to C, then as N, the number of examples that Φ observes, goes to infinity, for all i we have bi → ci.

A more significant result is Theorem 2, which essentially tells us that anything learnable in COLT is learnable in a BLS as well.

Theorem 2: Any Boolean representation class C is Bayesian polynomially learnable.
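A small simulation illustrates the convergence bi → ci claimed by Theorem 1, and also how quickly a badly mistaken prior is overwhelmed by data; the Beta second-order density and all numerical constants here are our own illustrative assumptions, not part of the theorem:

```python
import random

random.seed(0)
c_true = 0.9        # the ith attribute of the target concept holds with frequency 0.9
a, b = 1.0, 20.0    # deliberately mistaken prior: initial mean of beta_i is ~0.05

means = []
for n in range(1, 2001):
    xi = 1 if random.random() < c_true else 0  # one sampled observation x_i
    if xi:
        a += 1  # Beta(a, b) -> Beta(a + 1, b)
    else:
        b += 1  # Beta(a, b) -> Beta(a, b + 1)
    if n in (10, 100, 2000):
        means.append(a / (a + b))  # posterior mean after n examples

print(means)  # posterior means drifting toward c_true
```

Even starting from a prior mean near 0.05, the posterior mean approaches 0.9 as N grows, consistent with both Theorem 1 and the "wash-out" behavior discussed for the Monte-Carlo analysis.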
We prove Theorem 2 by using Monte-Carlo sampling and recent approaches to Bayesian analyses of simulation algorithms [Dagum]. Such analysis also shows that the details of prior distributions often "wash out" quickly in the sampling process, thus addressing the standard complaint from frequentists worried about the adverse effects of assigning possibly arbitrary or mistaken prior distributions. Because we use Monte-Carlo sampling techniques, the complexity of inference depends only on the sizes of the samples used, and not on the number of variables instantiated during learning. Thus we have the following important result.

Corollary: The computational complexity of a BLS is independent of the complexity of the concept representation.

The preceding corollary appears to be a significant advance over COLT methods, where the complexity of the representation used figures prominently in the running times of the algorithms. As work proceeds, we will pay special attention to the sources of cost and complexity in a BLS, and to how they distinguish it, either positively or negatively, from COLT approaches.

One shortcoming of this definition of a BLS is that its running time is determined solely by the smallest values of the probability parameters for a particular attribute. The time cost of using a BLS, then, is determined exclusively by the parameters of the most precise variable(s) of interest. In practice, sampling in order to satisfy the most precise variables' probability parameters will necessarily bring all the variables to matching levels. Future work will focus on accounting more completely for observation costs, not just in terms of sample size but also in terms of the particular values of a given example's variables.

Acknowledgments: I would like to thank Paul Dagum and Eric Horvitz for insightful comments and discussions on this work. Eric Horvitz read and helped clarify a draft of this paper. This research was supported by IR&D funds of the Decision-Theoretic Systems Group of the Palo Alto Laboratory, Rockwell International Science Center.

References
"Bayesian Learning Schemes" TechnicalReport, RockwellInternational ScienceLab, Palo Alto, CA. [Dagum]Paul Dagumand Eric Horvitz. "A Bayesian Analysis of Randomized ApproximationProcedures," September,1991. Knowledge SystemsLaboratory, Stanford University, Stanford, CA.KSL-91-54. [de Finetti] Brunode Finetti. "Foresight:Its logical laws, in subjective sources," reprinted in H.E. Kyburgand H.E. Smokier(eds.), Studies in SubjectiveProbability. London: JohnWiley,1964, p. 93-158. [Garey] MichaelR. Gareyand David S. Johnson. Computers and Intractability: A Guideto the Theoryof NP-Completeness. 1979, W.H.Freemanand Company,NewYork. [Good]I. J. Good."TheBayesianInfluence, or howto sweepsubjectivismunderthe carpet," in Harperand Hooker(eds.), Foundations of ProbabilityTheory,Statistical Inference, andStatistical Theoryof Sc&nce.Dordrecht: Reidei. [Kearns] MichaelJ. Kearns. The ComputationalComplexity of machineLearning.1990, The MITPress. Computational LearningTheory:A BayesianPerspective 25