From: AAAI Technical Report SS-94-02. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Deciding When and How to Learn

Jonathan Gratch^a, Gerald DeJong^a and Steve A. Chien^b

^a Beckman Institute, University of Illinois, 405 N. Mathews Av., Urbana, IL 61801, {gratch, dejong}@cs.uiuc.edu
^b Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Drive, M/S 525-3660, Pasadena, CA 91109-8099

Abstract

Learning is an important aspect of intelligent behavior. Unfortunately, learning rarely comes for free. Techniques developed by machine learning can improve the abilities of an agent but they often entail considerable computational expense. Furthermore, there is an inherent tradeoff between the power and efficiency of learning. More powerful learning approaches require greater computational resources. This poses a dilemma to a learning agent that must act in the world under a variety of resource constraints. This paper investigates the issues involved in constructing a rational learning agent. Drawing on work in decision theory we describe a framework for a rational agent that embodies learning actions that can modify its own behavior. The agent must possess deliberative capabilities to assess the relative merits of these actions in the larger context of its overall behavior and resource constraints. We then sketch several algorithms that have been developed within this framework.

1. INTRODUCTION

A principal component of intelligent behavior is an ability to learn. In this article we investigate how machine learning approaches relate to the design of a rational learning agent. This is an entity that must act in the world in the service of goals and includes among its actions the ability to change its behavior by learning about the world. There is a long tradition of studying goal-directed action under variable resource constraints in economics [Good71] and the decision sciences [Howard70], and this forms the foundations for artificial intelligence approaches to resource-bounded computational agents [Doyle90, Horvitz89, Russell89]. However, these AI approaches have not addressed many of the issues involved in constructing a learning agent. When learning has been considered, it has involved simple, unobtrusive procedures like passively gathering statistics [Wefald89] or off-line analysis, where the cost of learning is safely discounted [Horvitz89]. Many machine learning approaches require extensive resources¹ and may force an agent to redirect its behavior temporarily away from satisfying its goals and towards exploratory behavior. Thus, more recent work has begun to consider the problem of developing techniques that can make a reasoned compromise between learning and acting and, more generally, to consider the relationship between learning and the overall goals and resource constraints of a rational agent [desJardins92, Gratch93b, Provost92].

Deciding when and how to learn is made difficult by the realization that there is no "universal" learning approach. Machine learning approaches strike some balance between two conflicting objectives. First is the goal of creating powerful learning approaches that facilitate large performance improvements. Second is the need to restrict approaches for pragmatic considerations such as efficiency. Learning algorithms implement some bias which embodies a particular compromise between these objectives. Unfortunately, in what is called the static bias problem, there is no single bias that applies equally well across all domains, tasks, and situations [Provost92]. As a consequence there is a bewildering array of learning approaches with differing tradeoffs between power and pragmatic considerations.
This tradeoff poses a dilemma to any agent that must learn in complex environments under a variety of resource constraints. Practical learning situations place varying demands on learning techniques. One case may allow several CPU months to devote to learning; another requires an immediate answer. For one case training examples are cheap; in another, expensive procedures are required to obtain data. An agent wishing to use a machine learning approach must decide which of many competing approaches to apply, if any, given its current circumstances. This evaluation is complicated because techniques do not provide feedback on how a given tradeoff plays out in specific domains. This same issue arises in a number of different learning problems. In statistical modeling the issue appears in how to choose a model that balances the tradeoff between the fit to the data and the number of examples required to reach a given level of predictive accuracy [AI&STAT93]. In neural networks the issue arises in the choice of network architecture (e.g. the number of hidden units). The tradeoff is at the core of the field of decision analysis, where formalisms have been developed to determine the value of experimentation under a variety of conditions [Howard70].

¹ Gerald Tesauro noted that his reinforcement learning approach to learning strategies in backgammon required a month of learning time on an IBM RS/6000 [Tesauro92].

This article outlines a decision-theoretic formalism with which to view the problem of flexibly biasing learning behavior. A learning algorithm is assumed to have some choice in how it can proceed, and behavior in a given situation is determined by a rational cost-benefit analysis of these alternative "learning actions." A chief complication in this approach is assessing the consequences of different actions in the context of the current situation. Often these consequences cannot be inferred simply on the basis of prior information. The solution we outline in this article uses the concept of adaptive bias. In this approach a learning agent uses intermediate information acquired in earlier stages of learning to guide later behavior.

We instantiate these notions in the area of speed-up learning through the implementation of a "speed-up learning agent." The approach automatically assesses the relative benefits and costs of learning to improve a problem solver and flexibly adjusts its tradeoff depending on the resource constraints placed upon it.

2. RATIONALITY IN LEARNING

Decision theory provides a formal framework for rational decision making under uncertainty (see [Berger80]). Recently, there has been a trend to provide a decision-theoretic interpretation for learning algorithms [Brand91, Gratch92, Greiner92, Subramanian92]. The "goal" of a learning system is to improve the capabilities of some computational agent (e.g. a planner or classifier). The decision-theoretic perspective views a learning system as a decision maker. The learning system must decide how to improve the computational agent with respect to a fixed, but possibly unknown, distribution of tasks. The learning system represents a space of possible transformations to the agent: a space of alternative control strategies for a speed-up learning system, or a space of possible concepts for a classification learning system. Each of the transformations imparts some behavior to the computational agent on the tasks it performs. The quality of a behavior is measured by a utility function that indicates how well the behavior performs on a given task. The expected utility of a behavior is the average utility over the fixed distribution of tasks. Ideally, a learning system should choose the transformation that maximizes the expected utility of the agent. To make this decision, the learning system may require many examples from the environment to estimate empirically the expected utility of different transformations.

2.1 Learning an Agent vs. a Learning Agent

[Figure 1a: Standard decision-theoretic view; learning is considered external to the agent. Figure 1b: A learning agent; learning is under the direct control of the agent.]

Figure 1a illustrates this view of learning systems. There is a computational agent that must interact with an environment. The agent exhibits some behavior that is characterized by the expected utility of its actions with respect to the particular environment. For example, the agent may be an automated planning system, the environment a distribution of planning problems, and the behavior characterized by the average time required to produce a plan. A learning algorithm decides, from a space of transformations, how best to change the behavior of the agent. This view clarifies the purpose of learning: the learning algorithm should adopt a transformation that maximizes the expected utility of the agent's behavior. This view does not, however, say anything about how to balance pragmatic considerations. The utility of the agent is defined without regard to the learning cost required to achieve that level of performance. Pragmatic issues are addressed by an external entity through the choice of a learning algorithm.

Figure 1b illustrates the view we explore in this article. Again there is a computational agent that must interact with an environment. Again the agent exhibits behavior that is characterized by the expected utility of its actions. The difference is that the decision of what and how to learn is under control of the agent, and consequently, pragmatic considerations must also be assessed by the agent and cannot be handed over to some external entity. In addition to normal actions, the agent can perform learning actions that change its behavior towards the environment. A learning action might involve running a machine learning program as a black box, or it might involve direct control over low-level learning operations, thereby giving the agent more flexibility over the learning process. The expense of learning may well negatively impact expected utility. Therefore, the learning agent must be reflective [Horvitz89]. This means it must possess the means to reason about both the benefits and costs of possible learning actions. As an example, such a learning agent could be an automated planning system with learning capabilities. After solving each plan this system has the opportunity to expend resources analyzing its behavior for possible improvements. The decision of what and how to learn must involve consideration of how the improvement and the computational expense of learning impact the average utility exhibited by the agent.
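To make this second view concrete, the following minimal sketch (Python; the class, its methods, and all numbers are illustrative assumptions of ours, not part of the framework described here) treats a learning step as just another action whose cost is charged against the same resource budget the agent uses for problem solving:

# Hypothetical sketch of the "learning agent" view (Figure 1b): learning is an
# ordinary action paid for out of the agent's own resource budget.
class LearningAgent:
    def __init__(self, budget_hours, avg_solve_hours):
        self.budget = budget_hours        # resources remaining
        self.avg_solve = avg_solve_hours  # current expected cost per problem
        self.solved = 0

    def solve_one(self):
        # A normal action: consumes resources and solves one problem.
        if self.budget >= self.avg_solve:
            self.budget -= self.avg_solve
            self.solved += 1

    def learn(self, learning_cost, expected_speedup):
        # A learning action: pays a cost now in exchange for a (hoped for)
        # reduction in the cost of all future problem solving.
        if self.budget >= learning_cost:
            self.budget -= learning_cost
            self.avg_solve = max(self.avg_solve - expected_speedup, 1e-6)

agent = LearningAgent(budget_hours=10.0, avg_solve_hours=0.10)
agent.learn(learning_cost=2.0, expected_speedup=0.04)  # a reflective choice
while agent.budget >= agent.avg_solve:
    agent.solve_one()
print(agent.solved)  # problems solved under the combined budget

A reflective agent in this sense is one that can predict, before invoking the learning action, whether the final count of solved problems will be larger with or without it.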
Note that the sense of utility differs in the two views. In the first case, utility is defined independent of learning cost. In the second case, utility must explicitly encode the relationship between the potential benefits and costs of learning actions. For example, the average time to produce a plan may be a reasonable way to characterize the behavior of a non-learning agent, but it does not suffice to characterize the behavior of a learning agent. There is never infinite time to amortize the cost of learning. An agent operates under finite resources, and learning actions are only reasonable if the improvement they provide outweighs the investment in learning cost, given the resource constraints of the present situation. These two senses of utility are discussed in the rational decision making literature. Following the terminology of Good, we refer to a utility function that does not reflect the cost of learning as a Type I utility function [Good71]. A utility function that does reflect learning cost is a Type II utility function. The Type I vs. Type II distinction is also referred to as substantive vs. procedural utility [Simon76] or object-level vs. comprehensive utility [Horvitz89]. We are interested in agents that assess Type II utility.

2.2 Examples of Type II Utility

Type I functions can be used in realistic learning situations only after some human expert has assessed the current resource constraints and chosen a learning approach to reflect the appropriate tradeoff. The decision-theoretic view is that the decision making of the human expert can be modeled by a Type II utility function. Thus, to develop a learning agent we must capture the relationship between benefit and cost that we believe the human expert is using. This, of course, will vary with different situations. We discuss a few likely situations and show how these can be modeled with Type II functions.

Speed-up learning is generally cast as reducing the average amount of resources needed by an agent to solve problems. The process of learning also consumes resources that might otherwise be spent on problem solving. A Type II function must relate the expense of this learning investment and its potential return. If we are restricted to a finite resource limit, a natural goal is to evaluate the behavior of the learning agent by how many problems it can solve within the resource limit. A learning action is only worthwhile if it leads to an increase in this number. In this case the expected Type II utility corresponds to the expected number of problems the agent can solve given the resource limit. If we are given a finite number of problems to solve, an alternative relationship might be to solve them with as few resources as possible. In this case the expected Type II utility would be the expected resources (including learning resources) required to solve the fixed number of problems. Learning actions should be chosen only if they will reduce this value.

Another situation arises in the case of learning a classifier where there is great expense in obtaining training examples. The presumed benefit of a learned classifier is that it enables an agent to make classifications relevant to achieving its goals. Obviously, the more accurate the classifier, the greater the benefit. It is also generally the case that a more accurate classifier will result from increasing the richness of the hypothesis space: increasing the number of attributes in a decision tree, increasing the number of hidden units in a neural network, or increasing the number of parameters in a statistical modeling procedure. While this improves the potential benefit, adding flexibility will increase the number of examples required to achieve a stable classifier. A Type II utility function should relate the expected accuracy to the cost of obtaining training examples. A learning system should then adjust the flexibility (attributes, hidden units, etc.) in such a way as to maximize this Type II function.

2.3 Adaptive Bias

In our model an agent must be able to assess the costs and benefits of alternative learning actions as they relate to the current learning task. Occasionally it may be possible to provide the learning agent with enough information in advance to make such determinations (e.g. uniformities are one source of statistical knowledge that allows an agent to infer the benefits of using a given bias [desJardins92]). Unfortunately, such information is frequently unavailable. We introduce the notion of adaptive bias to enable flexible control of behavior in the presence of limited information on the cost and benefit of learning actions. With adaptive bias a learning agent estimates the factors influencing cost and benefit as the learning process unfolds. Information gained in the early stages of learning is used to adapt subsequent learning behavior to the characteristics of the given learning context.

3. RATIONAL SPEED-UP LEARNING

The notion of a rational learning agent outlined in Section 2 is quite general. Many interesting, and more manageable, learning problems arise from imposing assumptions on the general framework. We consider one specialization of these ideas in some detail in this section, while the following section relates several alternative approaches.

One of our primary interests is in the area of speed-up learning and therefore we present an approach aimed at this class of learning problems. We adopt the following five assumptions to the general model. (1) An agent is viewed as a problem solver that must solve a sequence of tasks provided by the environment according to a fixed distribution. (2) The agent possesses learning actions that change the average resources required to solve problems. (3) Learning actions must precede any other actions. This follows the typical characterization of the speed-up learning paradigm: there is an initial learning phase where the learning system consumes training examples and produces an improved problem solver; this is followed by a utilization phase where the problem solver is used repeatedly to solve problems. Once stopped, learning may not be returned to. (4) The agent learns by choosing a sequence of learning actions, and each action must (probabilistically) improve expected utility. As a consequence, the agent cannot choose to make its behavior worse in the hope that it may produce larger subsequent improvements. (5) Learning actions have access to an unbounded number of training examples. For any given run of the system, this number will be finite because the agent must act under limited resources. This is reasonable because speed-up learning techniques are unsupervised learning techniques, so the examples do not need to be classified. Together, these assumptions define a sub-class of rational learning agents we call one-stage speed-up learning agents.

Given this class of learning agents, their behavior is determined by the choice of a Type II utility function. We have developed algorithms for two alternative characterizations of Type II utility that result from different views on how to relate benefit and cost. These characterizations essentially amount to Type II utility schemas: they combine some arbitrary Type I utility function and an arbitrary characterization of learning cost into a particular Type II function. Our first approach, RASL, implements a characterization which leads the system to maximize the number of tasks it can solve with a fixed amount of resources. A second approach, RIBS, encodes a different characterization that leads the system to achieve a fixed level of benefit with a minimum of resources. Both approaches are conceptually similar in their use of adaptive bias, so we describe RASL in some detail to give a flavor of the techniques.
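To make the difference between these two characterizations concrete, the following sketch expresses each as a score over a candidate learning action (Python; the function names, the additive cost model, and the numbers are our own illustrative assumptions, not the published definitions):

# Two hypothetical Type II utility schemas for a one-stage speed-up learner.
def enp_style_utility(total_resources, learning_cost, solve_cost_after):
    # RASL-style schema: maximize the number of problems solvable within a
    # fixed resource budget (higher is better).
    usable = total_resources - learning_cost
    return usable / solve_cost_after if usable > 0 else 0.0

def ribs_style_utility(num_problems, learning_cost, solve_cost_after):
    # RIBS-style schema: reach a fixed amount of work with minimum resources
    # (lower total expenditure is better, so return its negation).
    return -(learning_cost + num_problems * solve_cost_after)

# Example: a 100-hour budget, or alternatively 500 problems that must be solved.
print(enp_style_utility(100.0, 10.0, 0.05))  # problems expected within the budget
print(ribs_style_utility(500, 10.0, 0.05))   # negated total resource usage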
RASL (for RAtional Speed-up Learner) builds upon our previous work on the COMPOSER learning system. COMPOSER is a general statistical framework for probabilistically identifying transformations that improve the Type I utility of a problem solver [Gratch91, Gratch92]. COMPOSER has demonstrated its effectiveness in artificial planning domains [Gratch92] and in a real-world scheduling domain [Gratch93a]. The system was not, however, developed to support rational learning agents. It has a fixed tradeoff for balancing learned improvements with the cost to attain them, and it provides no feedback. Using the low-level learning operations of the COMPOSER system, RASL incorporates decision making procedures that provide flexible control of learning in response to varying resource constraints. RASL acts as a learning agent that, given a problem solver and a fixed resource constraint, balances learning and problem solving with the goal of maximizing expected Type II utility. In this context of speed-up learning, resources are defined in terms of computational time; however, the statistical model applies to any numeric measure of resource use (e.g., space, monetary cost).

3.1 Overview

RASL is designed in the service of the following Type II goal: given a problem solver and a known, fixed amount of resources, solve as many problems as possible within the resources. RASL can start solving problems immediately. Alternatively, the agent can devote some resources towards learning in the hope that this expense is more than made up by improvements to the problem solver. Type II utility corresponds to the number of problems that can be solved with the available resources. The agent should only invoke learning if it increases the expected number of problems (ENP) that can be solved. RASL embodies decision making capabilities to perform this evaluation. The expected number of problems is governed by the following relationship:

    ENP = (available resources after learning) / (average resources to solve a problem after learning)

The top term reflects the resources available for problem solving. If no learning is performed, this is simply the total available resources. The bottom term reflects the average cost to solve a problem with the learned problem solver. RASL is reflective in that it guides its behavior by explicitly estimating ENP. To evaluate this expression, RASL must develop estimators for the resources that will be spent on learning and for the performance of the learned problem solver.
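For intuition, here is a small worked instance of this relationship with numbers chosen purely for illustration (they are not drawn from the experiments reported later):

# A worked instance of the ENP relationship (all values hypothetical).
total_resources = 10.0    # hours available in total
learning_time = 2.0       # hours that would be spent learning
solve_time_before = 0.10  # hours per problem with the initial solver
solve_time_after = 0.06   # hours per problem with the improved solver

enp_no_learning = total_resources / solve_time_before                      # 100 problems
enp_with_learning = (total_resources - learning_time) / solve_time_after   # about 133 problems
print(enp_no_learning, enp_with_learning)  # learning wins under these assumptions

Under these assumptions the learning action is justified; had the speed-up been smaller or the budget tighter, the same comparison would favor solving problems immediately.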
Like COMPOSER, RASL adopts a hill-climbing approach to maximizing utility. The system incrementally learns transformations to an initial problem solver. At each step in the search, a transformation generator proposes a set of possible changes to the current problem solver.² The system then evaluates these changes with training examples and chooses one transformation that improves expected utility. At each step in the search, RASL has two options: (1) stop learning and start solving problems, or (2) investigate new transformations, choosing one with the largest expected increase in ENP. In this way, the behavior of RASL is driven by its own estimate of how learning influences the expected number of problems that may be solved. The system only continues the hill-climbing search as long as the next step is expected to improve the Type II utility. RASL proceeds through a series of decisions where each action is chosen to produce the greatest increase in ENP. However, as it is a hill-climbing system, the ENP exhibited by RASL will be locally, and not necessarily globally, maximal.

² The approach provides a framework that can incorporate varying mechanisms for proposing changes. We have developed instantiations of the framework using PRODIGY/EBL's proposed control rules [Minton88], STATIC's proposed control rules [Etzioni91], and a library of expert-proposed scheduling heuristics [Gratch93a].

3.2 Algorithm Assumptions

RASL relies on several assumptions. The most important assumption relates to its reflective abilities. The cost of learning actions consists of three components: the cost of deliberating on the effectiveness of learning, the cost of proposing transformations, and the cost of evaluating proposed transformations. When RASL decides on the utility of a learning action, it should account for all of these potential costs. Our implementation, however, relies on an assumption that the cost of evaluation dominates the other costs. This assumption is reasonable in the domains to which COMPOSER has been applied. The preponderant learning cost lies in the statistical evaluation of transformations, through the large cost of processing each example problem; this involves invoking a problem solver to solve the example problem. The decision making procedures that implement the Type II deliberations involve evaluating relatively simple mathematical functions. This assumption has the consequence that RASL does not account for some overhead in the cost of learning, and under tight resource constraints the system may perform poorly. RASL also embodies the same statistical assumptions adopted by COMPOSER. The statistical routines must estimate the means of various distributions. While no assumption is made about the form of the distributions, we make a standard assumption that the sample mean of these distributions is normally distributed.

3.3 Estimating ENP

The core of RASL's rational analysis is a statistical estimator for the ENP that results from possible learning actions. This section describes the derivation of this estimator. At a given decision point, RASL must choose between terminating learning or investigating a set of transformations to the current problem solver. RASL uses COMPOSER's statistical procedure to identify beneficial transformations. Thus, the cost of learning is based on the cost of this procedure. After reviewing some statistical notation we describe this statistical procedure. We then develop from this statistic an estimator for ENP.

Let X be a random variable. An observation of a random variable can yield one of a set of possible numeric outcomes, where the likelihood of each outcome is determined by an associated probability distribution. X_i is the ith observation of X. EX denotes the expected value of X, also called the mean of the distribution. X̄_n is the sample mean and refers to the average of n observations of X; X̄_n is a good estimator for EX. Variance is a measure of the dispersion or spread of a distribution. We use the sample variance, S²_n = (1/n) Σ_{i=1..n} (X_i − X̄_n)², as an estimator for the variance of a distribution. The function Φ(x) = ∫_{−∞}^{x} (1/√(2π)) exp{−y²/2} dy is the cumulative distribution function of the standard normal (also called standard Gaussian) distribution; Φ(x) is the probability that a point drawn randomly from a standard normal distribution will be less than or equal to x. This function plays an important role in statistical estimation and inference. The Central Limit Theorem shows that the difference between the sample mean and the true mean of an arbitrary distribution can be accurately represented by a standard normal distribution, given a sufficiently large sample. In practice we can perform accurate statistical inference using a "normal approximation."

3.3.1 COMPOSER's Statistical Evaluation

Let T be the set of hypothesized transformations. Let PE denote the current problem solver. When RASL processes a training example it determines, for each transformation, the change in resource cost that the transformation would have provided if it were incorporated into PE. This is the difference in cost between solving the problem with and without incorporating the transformation into PE. This is denoted Δr_i(t|PE) for the ith training example. Training examples are processed incrementally. After each example the system evaluates a statistic called a stopping rule. The particular rule we use was proposed by Arthur Nádas and is called the Nádas stopping rule [Nadas69]. This rule determines when enough examples have been processed to state with confidence that a transformation will help or hurt the average problem solving speed of PE.

After processing a training example, RASL evaluates the following rule for each transformation:

    n > n₀  AND  S²_n(t|PE) / [Δ̄_n(t|PE)]² < n / α²,   where Φ(α) = δ/(2|T|)        (1)

where n₀ is a small finite integer indicating an initial sample size, n is the number of examples taken so far, Δ̄_n(t|PE) is the transformation's average improvement, S²_n(t|PE) is the observed variance in the transformation's improvement, and δ is a confidence parameter. If the expression holds, the transformation will speed up (slow down) PE if Δ̄_n(t|PE) is positive (negative), with confidence 1−δ. The number of examples taken at the point when Equation 1 holds is called the stopping time associated with the transformation and is denoted ST(t|PE).

3.3.2 Estimating Stopping Times

Given that we are using the stopping rule in Equation 1, the cost of learning a transformation is the stopping time for that transformation times the average cost to process an example. Thus, one element of an estimate for learning cost is an estimate for stopping times. In Equation 1, the stopping time associated with a transformation is a function of the variance, the square of the mean, and the sample size n. We can estimate the first two parameters using a small initial sample of n₀ examples: S²_n(t|PE) ≈ S²_{n₀}(t|PE) and Δ̄_n(t|PE) ≈ Δ̄_{n₀}(t|PE). We can then estimate the stopping time by using these estimates, treating the inequality in Equation 1 as an equality, and solving for n. The stopping time cannot be less than the n₀ initial examples, so the resulting estimator is:

    ST_{n₀}(t|PE) = max{ n₀,  α² S²_{n₀}(t|PE) / [Δ̄_{n₀}(t|PE)]² }

where Δ̄_{n₀}(t|PE) is the average improvement in Type I utility that results if transformation t is adopted, S²_{n₀}(t|PE) is the variance in this random variable, and Φ(α) = δ/(2|T|).
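The following sketch shows how this rule and the derived stopping-time estimate can be computed (Python; the helper names are ours, and the formulas transcribe Equation 1 and the stopping-time estimator above):

# Sketch of the Nadas-style stopping rule (Equation 1) and the stopping-time
# estimate derived from it. Function names are illustrative.
import random
from math import sqrt, erf, ceil
from statistics import mean, pvariance

def normal_quantile(p, lo=-10.0, hi=10.0):
    # Inverse of the standard normal CDF, Phi(alpha) = p, by bisection.
    phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def stopping_rule_holds(improvements, n0, delta, num_transforms):
    # Equation 1: enough examples to be confident about the sign of the mean.
    n = len(improvements)
    if n <= n0:
        return False
    alpha = normal_quantile(delta / (2.0 * num_transforms))
    m, var = mean(improvements), pvariance(improvements)
    return m != 0.0 and var / (m * m) < n / (alpha * alpha)

def estimated_stopping_time(initial_sample, n0, delta, num_transforms):
    # Treat Equation 1 as an equality, solve for n, and never go below n0.
    alpha = normal_quantile(delta / (2.0 * num_transforms))
    m, var = mean(initial_sample), pvariance(initial_sample)
    return float("inf") if m == 0.0 else max(n0, ceil(alpha * alpha * var / (m * m)))

# Example with synthetic improvement observations.
random.seed(1)
observations = [random.gauss(0.2, 1.0) for _ in range(200)]
print(stopping_rule_holds(observations, n0=30, delta=0.05, num_transforms=10))
print(estimated_stopping_time(observations[:30], n0=30, delta=0.05, num_transforms=10))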
3.3.3 Learning Cost

Each transformation can alter the expected utility of a performance element. To accurately evaluate potential changes we must allocate some of the available resources towards learning. Under our statistical formalization of the problem, learning cost is a function of the stopping time for a transformation and the cost of processing each example problem.

In the general case, the cost of processing the jth problem depends on several factors. It can depend on the particulars of the problem. It can also depend on the currently transformed performance element, PE. For example, many learning approaches derive utility statistics by executing (or simulating the execution of) the performance element on each problem. Finally, as potential transformations must be reasoned about, learning cost can depend on the size of the current set of transformations, T. Let λ_j(T, PE) denote the learning cost associated with the jth problem under the transformation set T and the performance element PE. The total learning cost associated with a transformation, t, is the sum of the per-problem learning costs over the number of examples needed to apply the transformation:

    Σ_{j=1..ST(t|PE)} λ_j(T, PE)

3.3.4 ENP Estimator

We can now describe how to compute this estimate for individual transformations based on an initial sample of n₀ examples. Let R be the available resources. The resources expended evaluating a transformation can be estimated by multiplying the average cost to process an example by the stopping time associated with that transformation: λ̄_{n₀}(T, PE) × ST_{n₀}(t|PE). The average resource use of the learned performance element can be estimated by combining estimates of the resource use of the current performance element and the change in this use provided by the transformation: r̄_{n₀}(PE) − Δ̄_{n₀}(t|PE). Combining these estimators yields the following estimator for the expected number of problems that can be solved after learning transformation t:

    ENP_{n₀}(R, t|PE) = ( R − λ̄_{n₀}(T, PE) × ST_{n₀}(t|PE) ) / ( r̄_{n₀}(PE) − Δ̄_{n₀}(t|PE) )

Space precludes a description of how this estimator is used, but the RASL algorithm is listed in the Appendix. Given the available resources, RASL estimates, based on a small initial sample, whether a transformation should be learned. If not, it uses the remaining resources to solve problems. If learning is expected to help, RASL statistically evaluates competing transformations, choosing the one with the greatest increase in ENP. It then determines, based on a small initial sample, if a new transformation should be learned, and so on.
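A direct transcription of this estimator, together with the no-learning baseline it competes against (Python; all names and numbers are illustrative assumptions):

# Sketch of the ENP estimator and the comparison behind RASL's top-level decision.
def estimated_enp(R, avg_example_cost, stopping_time, avg_solve_cost, avg_improvement):
    # Resources left after statistically evaluating the transformation, divided
    # by the estimated per-problem cost of the improved problem solver.
    resources_after_learning = R - avg_example_cost * stopping_time
    solve_cost_after = avg_solve_cost - avg_improvement
    return resources_after_learning / solve_cost_after

def enp_without_learning(R, avg_solve_cost):
    return R / avg_solve_cost

# Hypothetical numbers: a 50-hour budget, examples costing 0.02 hours each, an
# estimated stopping time of 400 examples, a current solver needing 0.05 hours
# per problem, and a candidate transformation estimated to save 0.01 hours.
print(enp_without_learning(50.0, 0.05))            # 1000 problems
print(estimated_enp(50.0, 0.02, 400, 0.05, 0.01))  # 1050 problems: learning pays off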
4. EMPIRICAL EVALUATION

RASL's mathematical framework predicts that the system can appropriately control learning under a variety of time pressures. Given an initial problem solver and a fixed amount of resources, the system should allocate resources between learning and problem solving to solve as many problems as possible. The theory that underlies RASL makes several predictions that we can evaluate empirically. (1) Given an arbitrary level of initial resources, RASL should solve a number of problems greater than or equal to the number solved if we used all available resources on problem solving.³ As we increase the available resources, RASL should (2) spend more resources learning, (3) acquire problem solvers with lower average solution cost, and (4) exhibit a greater increase in the number of problems that can be solved.

³ RASL pays an overhead of taking n₀ initial examples to make its rational deliberations. If learning provides no benefit, this can result in RASL solving fewer problems than a non-learning agent.

We empirically test these claims in the real-world domain of spacecraft communication scheduling. The task is to allocate communication requests between earth-orbiting satellites and the three 26-meter antennas at Goldstone, Canberra, and Madrid. These antennas make up part of NASA's Deep Space Network. Scheduling is currently performed by a human expert, but the Jet Propulsion Laboratory (JPL) is developing heuristic scheduling techniques for this task. In a previous article, we describe an application of COMPOSER to improving this scheduler over a distribution of scheduling problems. COMPOSER acquired search control heuristics that improved the average time to produce a schedule [Gratch93a]. We apply RASL to this domain to evaluate our claims.

COMPOSER explored a large space of heuristic search strategies for the scheduler using its statistical hill-climbing technique. In the course of the experiments we constructed a large database of statistics on how the scheduler performed with different search strategies over many scheduling problems. We use this database to reduce the computational cost of the current set of experiments. Normally, when RASL processes a training example it must invoke the scheduler to perform an expensive computation to extract the necessary statistics. For these experiments we draw problems randomly from the database and extract the statistics already acquired from the previous experiments. RASL is "told" that the cost to acquire these statistics is the same as the cost to acquire them in the original COMPOSER experiments.

A learning trial consists of fixing an amount of resources, invoking RASL, and then solving problems with the learned problem solver until resources are exhausted. This is compared with a non-learning approach that simply uses all available resources to solve problems. Learning trials are repeated multiple times to achieve statistical significance. The behavior of RASL under different time pressures is evaluated by varying the amount of available resources. Results are averaged over fifty learning trials to show statistical significance. We measure several aspects of RASL's behavior to evaluate our claims. The principal component of behavior is the number of problems that are solved. In addition we track the time spent learning, the Type I utility of the learned scheduler, and the number of hill-climbing steps adopted by the algorithm. RASL has two parameters. The statistical error of the stopping rule is controlled by δ, which we set at 5%. RASL requires an initial sample of size n₀ to estimate ENP. For these experiments we chose a value of thirty. RASL is compared against a non-learning agent that uses the original expert control strategy to solve problems. The non-learning agent devotes all available resources to problem solving.

The results are summarized in Figure 2. RASL yields an increase in the number of problems that are solved compared to a non-learning agent, and the improvement accelerates as the resources are increased. As the resources are increased, RASL spends more time in the learning phase, taking more steps in the hill-climbing search and acquiring schedulers with better average problem solving performance. The improvement in ENP is statistically significant. The graph of problems solved shows 95% confidence intervals on the performance of RASL over the learning trials. Intervals are not shown for the non-learning system as the variance was negligible over the fifty learning trials. Looking at Type I utility, the search strategies learned by RASL were considerably worse than the average strategy learned in our COMPOSER experiments [Gratch93a]. In this domain, COMPOSER learned strategies that take about thirty seconds to solve a scheduling problem. This suggests that we must considerably increase the available resources before the cost to attain this better strategy becomes worthwhile. We are somewhat surprised by the strictly linear relationship between the available resources and the learning time. At some point learning time should level off as learning produces diminishing returns. However, as suggested by the difference in the schedulers learned by COMPOSER and RASL, we can greatly increase the available resources before RASL reaches this limit.

[Figure 2: RASL results, averaged over thirty trials. Four panels plot problems solved, learning time, Type I utility, and hill-climbing steps against available resources (hours), comparing the agent with RASL to the non-learning agent.]

It is an open issue how best to set the initial sample parameter, n₀. The size of the initial sample influences the accuracy of the estimate of ENP, which in turn influences the behavior of the system. Poor estimates can degrade RASL's decision making capabilities. Making n₀ very large will increase the accuracy of the estimates, but it increases the overhead of the technique. It appears that the best setting for n₀ depends on characteristics of the data, such as the variance of the distribution. Further experience on real domains is needed to assess the impact of this sensitivity. There are ways to address the issue in a principled way (e.g. using cross-validation to evaluate different settings) but these would increase the overhead of the technique.

5. RELATED WORK

RASL illustrates just one of many possible ways to reason about the relationship between the cost and benefit of learning actions. Alternatives arise when we consider different definitions for Type II utility or when there exists prior knowledge about the learning actions.

In [Chien94] we describe the RIBS algorithm (Rational Interval-Based Selection) for a different relationship between cost and benefit. In this situation the Type II utility is defined in terms of the ratio of Type I utility and cost. We introduce a general statistical approach for choosing, with some fixed level of confidence, the best of a set of k hypotheses, where the hypotheses are characterized in terms of some Type I utility function. The approach attempts to identify the best hypothesis by investigating alternatives over variably priced training examples, where there is differing cost in evaluating each hypothesis. This problem arises, for example, in learning heuristics to improve the quality of schedules produced by an automated scheduling system.

This issue of rational learning has frequently been cast as a problem of bias selection [desJardins92, Provost92]. A learning algorithm is provided with a set of biases, and some meta-procedure must choose one bias that appropriately addresses the pragmatic factors. Provost [Provost92] shows that under very weak prior information about the bias space, a technique called iterative weakening can achieve a fixed level of benefit with near minimum cost. The approach requires that biases be ordered in terms of increasing cost and expressiveness.

desJardins [desJardins92] introduces a class of prior knowledge that can be exploited for bias selection. She introduces statistical knowledge called uniformities, from which one can deduce the predictive accuracy that will result from learning a concept with a given inductive bias. By combining this with a simple model of inductive cost, her PAGODA algorithm can choose a bias appropriate to the given learning situation.
RASL focuses on one type of learning cost that was fairly easy to quantify: the cost of statistical inference. It remains to apply these concepts to other learning approaches. For example, STATIC is a speed-up learning approach that generates knowledge through an extensive analysis of a problem solver's domain theory [Etzioni91]. We might imagine a similar rational control for STATIC where the system performs different levels of analysis when faced with different time pressures. The challenge is to develop ways of estimating the cost and benefit of this type of computation. In classification learning, for another example, a rational classification algorithm must develop estimates of the cost, and of the resulting benefit of improved accuracy, that result from growing a decision tree.

This work focused on one learning tradeoff, but there are more issues we can consider. We focused on the computational cost to process training examples. We assume an unbounded supply of training examples to draw upon; however, in realistic situations we often have limited data and it can be expensive to obtain more. Our statistical evaluation techniques try to limit both the probability of adopting a transformation with negative utility and the probability of rejecting a transformation with positive utility (the difference between type I and type II error). Under some circumstances, one factor may be more important than the other. These issues must be faced by any rational agent that includes learning as one of its courses of action. Unfortunately, learning techniques embody arbitrary and often unarticulated commitments to balancing these characteristics. This research takes just one of many needed steps towards making learning techniques amenable to rational learning agents.

6. CONCLUSIONS

Traditionally, machine learning techniques have focused on improving the utility of a performance element without much regard to the cost of learning, except, perhaps, requiring that algorithms be polynomial. While polynomial run time is necessary, it says nothing about how to choose amongst a variety of polynomial techniques when under a variety of time pressures. Yet different polynomial techniques can result in vast practical differences. Most learning techniques are inflexible in the face of different learning situations, as they embody a fixed tradeoff between the expressiveness of the technique and its computational efficiency. Even when techniques are flexible, a rational agent is faced with a complex and often unknown relationship between the configurable parameters of a technique and its learning behavior. We have presented an initial foray into rational learning agents. We presented the RASL algorithm, an extension of the COMPOSER system, that dynamically balances the costs and benefits of learning.

Acknowledgements

Portions of this work were performed at the Beckman Institute, University of Illinois under National Science Foundation Grant IRI-92-09394, and at the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration.

References

[AI&STAT93] Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, January 1993.

[Berger80] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, Springer Verlag, 1980.

[Brand91] M. Brand, "Decision-Theoretic Learning in an Action System," Proceedings of the Eighth International Workshop on Machine Learning, Evanston, IL, June 1991, pp. 283-287.

[Chien94] S. A. Chien, J. M. Gratch and M. C. Burl, "On the Efficient Allocation of Resources for Hypothesis Evaluation in Machine Learning: A Statistical Approach," submitted to Institute of Electrical and Electronics Engineers Transactions on Pattern Analysis and Machine Intelligence, 1994.

[desJardins92] M. E. desJardins, "PAGODA: A Model for Autonomous Learning in Probabilistic Domains," Ph.D. Thesis, Computer Science Division, University of California, Berkeley, CA, April 1992. (Also appears as UCB/Computer Science Department 92/678.)

[Doyle90] J. Doyle, "Rationality and its Roles in Reasoning (extended version)," Proceedings of the National Conference on Artificial Intelligence, Boston, MA, 1990, pp. 1093-1100.

[Etzioni91] O. Etzioni, "STATIC: A Problem-Space Compiler for PRODIGY," Proceedings of the National Conference on Artificial Intelligence, Anaheim, CA, July 1991, pp. 533-540.

[Good71] I. J. Good, "The Probabilistic Explication of Information, Evidence, Surprise, Causality, Explanation, and Utility," in Foundations of Statistical Inference, V. P. Godambe and D. A. Sprott (eds.), Holt, Rinehart and Winston, Toronto, 1971, pp. 108-127.
[Gratch91] J. Gratch and G. DeJong, "A Hybrid Approach to Guaranteed Effective Control Strategies," Proceedings of the Eighth International Workshop on Machine Learning, Evanston, IL, June 1991.

[Gratch92] J. Gratch and G. DeJong, "COMPOSER: A Probabilistic Solution to the Utility Problem in Speed-up Learning," Proceedings of the National Conference on Artificial Intelligence, San Jose, CA, July 1992, pp. 235-240.

[Gratch93a] J. Gratch, S. Chien and G. DeJong, "Learning Search Control Knowledge for Deep Space Network Scheduling," Proceedings of the Ninth International Conference on Machine Learning, Amherst, MA, June 1993.

[Gratch93b] J. Gratch and G. DeJong, "Rational Learning: Finding a Balance Between Utility and Efficiency," January 1993.

[Greiner92] R. Greiner and I. Jurisica, "A Statistical Approach to Solving the EBL Utility Problem," Proceedings of the National Conference on Artificial Intelligence, San Jose, CA, July 1992, pp. 241-248.

[Horvitz89] E. J. Horvitz, G. F. Cooper and D. E. Heckerman, "Reflection and Action Under Scarce Resources: Theoretical Principles and Empirical Study," Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, Detroit, MI, August 1989, pp. 1121-1127.

[Howard70] R. A. Howard, "Decision Analysis: Perspectives on Inference, Decision, and Experimentation," Proceedings of the Institute of Electrical and Electronics Engineers 58, 5 (1970), pp. 823-834.

[Minton88] S. Minton, Learning Search Control Knowledge: An Explanation-Based Approach, Kluwer Academic Publishers, Norwell, MA, 1988.

[Nadas69] A. Nádas, "An extension of a theorem of Chow and Robbins on sequential confidence intervals for the mean," The Annals of Mathematical Statistics 40, 2 (1969), pp. 667-671.

[Provost92] F. J. Provost, "Policies for the Selection of Bias in Inductive Machine Learning," Ph.D. Thesis, Computer Science Department, University of Pittsburgh, Pittsburgh, 1992. (Also appears as Report 92-34.)

[Russell89] S. Russell and E. Wefald, "Principles of Metareasoning," Proceedings of the First International Conference on Principles of Knowledge Representation and Reasoning, Toronto, Ontario, Canada, May 1989, pp. 400-411.

[Simon76] H. A. Simon, "From Substantive to Procedural Rationality," in Method and Appraisal in Economics, S. J. Latsis (ed.), Cambridge University Press, 1976, pp. 129-148.
[Subramanian92] D. Subramanian and S. Hunter, "Measuring Utility and the Design of Provably Good EBL Algorithms," Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, Scotland, July 1992, pp. 426-435.

[Tesauro92] G. Tesauro, "Temporal Difference Learning of Backgammon Strategy," Proceedings of the Ninth International Conference on Machine Learning, Aberdeen, Scotland, July 1992, pp. 451-457.

[Wefald89] E. H. Wefald and S. J. Russell, "Adaptive Learning of Decision-Theoretic Search Control Knowledge," Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, NY, June 1989, pp. 408-411.

Appendix

Function RASL(PE_0, R_0, δ)
  i := 0; T := Transforms(PE_i); δ* := δ/(2|T|);
  WHILE T ≠ ∅ AND Learning-worthwhile(T, PE_i) DO                  /* should learning continue? */
    continue := True; j := n₀; good-rules := ∅;
    WHILE T ≠ ∅ AND continue DO                                    /* find a good transformation */
      j := j + 1; Get λ_j(T, PE_i); ∀t ∈ T: Get Δ_j(t|PE_i);
      significant := Nádas-stopping-rule(T, PE_i, δ*, j);
      T := T − significant;
      WHEN ∃t ∈ significant : Δ̄_j(t|PE_i) > 0                      /* beneficial transformation found */
        good-rules := good-rules ∪ {t ∈ significant : Δ̄_j(t|PE_i) > 0};
      WHEN max_{t ∈ T} ENP_j(R_i, t|PE_i) < max_{t ∈ good-rules} ENP_j(R_i, t|PE_i)   /* no better transformation expected */
        best := the t ∈ good-rules with the largest ENP_j(R_i, t|PE_i);
        PE_{i+1} := Apply(best, PE_i); T := Transforms(PE_{i+1}); δ* := δ/(2|T|);     /* adopt it */
        R_{i+1} := R_i − j × λ̄_j(T, PE_i); i := i + 1; continue := False;
  Solve-problems-with-remaining-resources(PE_i)

Function Learning-worthwhile(T, PE)
  FOR j := 1 TO n₀ DO
    Get λ_j(T, PE); ∀t ∈ T: Get r_j(PE), Δr_j(t|PE);
  Return max_{t ∈ T} ENP_{n₀}(R, t|PE) > R / r̄_{n₀}(PE)            /* learning beats immediate problem solving */

Function Nádas-stopping-rule(T, PE, δ*, j)
  Return { t ∈ T : [Δ̄_j(t|PE)]² ≥ α² S²_j(t|PE) / j },  where Φ(α) = δ*

Figure 3: The rational COMPOSER (RASL) algorithm.
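For readers who want to experiment, the following is a compact, self-contained simulation in the spirit of the control loop above (Python). It is our own simplification under stated assumptions: candidate transformations are simulated as fixed per-problem savings with Gaussian noise, a single most promising candidate is pursued after the initial sample, and the bookkeeping is reduced to what Equation 1 and the ENP estimator require. It is not the authors' implementation.

# Simplified simulation of a one-stage speed-up learning agent. Everything here
# (the synthetic transformations, costs, and constants) is illustrative.
import random
from math import sqrt, erf, ceil
from statistics import mean, pvariance

def normal_quantile(p, lo=-10.0, hi=10.0):
    phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

random.seed(0)
R = 2000.0          # total resource budget (arbitrary units)
solve_cost = 1.0    # current average cost per problem
example_cost = 0.5  # cost of processing one training example
delta, n0 = 0.05, 30
# Candidate transformations: (true mean saving per problem, noise std. dev.).
candidates = {"t1": (0.15, 0.5), "t2": (-0.05, 0.5), "t3": (0.30, 0.8)}
samples = {name: [] for name in candidates}
alpha = normal_quantile(delta / (2.0 * len(candidates)))

def draw(name):
    mu, sd = candidates[name]
    samples[name].append(random.gauss(mu, sd))

def est_stop_time(xs):
    m, v = mean(xs), pvariance(xs)
    return float("inf") if m == 0.0 else max(n0, ceil(alpha * alpha * v / (m * m)))

def est_enp(xs, budget):
    # ENP estimate: budget left after evaluation, over the improved solve cost.
    return (budget - example_cost * est_stop_time(xs)) / (solve_cost - mean(xs))

for _ in range(n0):                      # initial sample over all candidates
    for name in candidates:
        draw(name)
R -= n0 * example_cost
baseline = R / solve_cost                # ENP of solving problems immediately
best = max(samples, key=lambda name: est_enp(samples[name], R))
if est_enp(samples[best], R) > baseline:
    while True:                          # sample until Equation 1 holds for it
        xs = samples[best]
        m, v, n = mean(xs), pvariance(xs), len(xs)
        if n > n0 and m != 0.0 and v / (m * m) < n / (alpha * alpha):
            break
        draw(best)
        R -= example_cost
    if mean(samples[best]) > 0:
        solve_cost -= mean(samples[best])   # adopt the transformation
print("problems solved:", int(R / solve_cost))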