TDDD10 AI Programming
Lecture: Automated Planning
Cyrille Berger

Planning slides borrowed from Dana Nau: http://www.cs.umd.edu/~nau/planning/slides/

Course lectures:
1. AI Programming: Introduction
2. Introduction to …
3. Agents and Agents …
4. Multi-Agent and …
5. Multi-Agent Decision …
6. Cooperation And Coordination
7. Cooperation And Coordination
8. Machine Learning
9. Automated Planning
10. Putting It All …

Lecture content
- Automated Planning
  - What is planning?
  - Types of planners
    - Domain-dependent planners
    - Configurable planners
- Reinforcement Learning
- Individual and Group Assignment

Automated Planning

What is planning?

Dictionary definitions of "plan"
1. A scheme, program, or method worked out beforehand for the accomplishment of an objective: a plan of attack.
2. A proposed or tentative project or course of action: had no plans for the evening.
3. A systematic arrangement of elements or important parts; a configuration or outline: a seating plan; the plan of a story.
4. A drawing or diagram made to scale showing the structure or arrangement of something.
5. A program or policy stipulating a service or benefit: a pension plan.

AI definition of plan
- The real world is absurdly complex; we need to approximate.
- "[A representation] of future behavior (...) usually a set of actions, with temporal and other constraints on them, for execution by some agent or agents."
  – Austin Tate, MIT Encyclopedia of the Cognitive Sciences, 1999
- Only represent what the planner needs to reason about.

State transition system
- A state transition system is Σ = (S, A, E, γ)
- S = {abstract states}
  - e.g., states might include a robot's location, but not its exact position and orientation
- A = {abstract actions}
  - e.g., "move robot from loc2 to loc1" may need a complex lower-level implementation
- E = {abstract exogenous events}
  - not under the agent's control
- γ = state transition function
  - gives the next state, or the possible next states, after an action or event
  - γ: S × (A ∪ E) → S, or γ: S × (A ∪ E) → {s₁, …, sₙ} when an action or event can have several possible outcomes

From plan to execution (figure omitted)

Example
- States S = {s₀, …, s₅}
- A = {move1, move2, put, take, load, unload}
- E = ∅
- γ: S × A → S, defined by the transition diagram on the slide (figure omitted)

Planning problem
- Description of Σ = (S, A, E, γ)
- Initial state or set of states
- Objective: a goal state, a set of goal states, a set of tasks, a "trajectory" of states, an objective function, …
- Example: initial state = s₀, goal state = s₅

Plan
- Classical plan: a sequence of actions, e.g. 〈take, move1, load, move2〉
- Policy: a partial function from S into A, e.g.
  {(s₀, take), (s₁, move1), (s₃, load), (s₄, move2)} or
  {(s₀, move1), (s₂, take), (s₃, load), (s₄, move2)}
- Both, if executed starting at s₀, produce s₅
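To make the Σ = (S, A, E, γ) notation concrete, here is a minimal Python sketch of the container-loading example above. The γ table is an assumption: only the transitions needed by the two example plans are filled in, since the full transition diagram appears only as a figure in the original slides.

# Minimal sketch of a state-transition system Sigma = (S, A, gamma) and of
# classical plan execution. The transition table is an assumption: it is only
# constrained to be consistent with the example plans shown above.

S = {"s0", "s1", "s2", "s3", "s4", "s5"}
A = {"move1", "move2", "put", "take", "load", "unload"}

# gamma: S x A -> S (partial); only the transitions used by the example plans.
gamma = {
    ("s0", "take"): "s1",
    ("s1", "move1"): "s3",
    ("s0", "move1"): "s2",
    ("s2", "take"): "s3",
    ("s3", "load"): "s4",
    ("s4", "move2"): "s5",
}

def execute(state, plan):
    """Apply a classical plan (a sequence of actions); return the final state,
    or None if some action is not applicable in the current state."""
    for action in plan:
        if (state, action) not in gamma:
            return None
        state = gamma[(state, action)]
    return state

print(execute("s0", ["take", "move1", "load", "move2"]))   # -> s5
print(execute("s0", ["move1", "take", "load", "move2"]))   # -> s5

Both example plans indeed end in s₅, matching the two policies listed on the slide.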
Planning and scheduling
- Scheduling
  - Decide when and how to perform a given set of actions
  - Time and resource constraints and priorities
  - e.g. the scheduler of your kernel, or of a printer
  - NP-complete
- Planning
  - Decide what actions to use to achieve some set of objectives
  - Can be much worse than NP-complete; the worst case is undecidable

Applications
- Robotics: sequences of actions, path and motion planning
- Industrial applications
  - Production machines: sheet-metal bending
- Forest firefighting
- Package delivery
- Games: playing bridge, chess, ...

Types of planners
- Domain-specific: made or tuned for a specific planning domain; won't work well (if at all) in other planning domains
- Domain-independent: in principle, works in any planning domain; in practice, needs restrictions on what kind of planning domain
- Configurable: a domain-independent planning engine whose input includes information about how to solve problems in some domain

Domain-specific planners
- Most successful real-world planning systems work this way
  - Mars exploration, sheet-metal bending, playing bridge, …
- Often use problem-specific techniques that are difficult to generalize to other planning domains

Domain-specific planner: example
- Bridge: after the cards are dealt, players make bids and prepare a plan that they will follow during the game
- Remove states depending on the cards you are holding
  - for instance, North will not choose "hearts" as the trump colour

Restrictive assumptions
- A0: Finite system: finitely many states, actions, events
- A1: Fully observable: the controller always knows Σ's current state
- A2: Deterministic: each action has only one outcome
- A3: Static (no exogenous events): no changes but the controller's actions
- A4: Attainment goals: a set of goal states Sg
- A5: Sequential plans: a plan is a linearly ordered sequence of actions (a1, a2, ..., an)
- A6: Implicit time: no time durations; a linear sequence of instantaneous states
- A7: Off-line planning: the planner doesn't know the execution status

Domain-independent planners
- In principle, work in any planning domain
- No domain-specific knowledge except the description of the system Σ
- In practice, it is not feasible to make domain-independent planners work well in all possible planning domains
- Make simplifying assumptions to restrict the set of domains
- Classical planning: the historical focus of most research on automated planning

Classical planning
- Classical planning requires all eight restrictive assumptions
- Offline generation of action sequences for a deterministic, static, finite system, with complete knowledge, attainment goals, and implicit time
- Reduces to the following problem:
  - Given a planning problem P = (Σ, s0, Sg)
  - Find a sequence of actions (a1, a2, ..., an) that produces a sequence of state transitions (s1, s2, ..., sn) such that sn is in Sg
- This is just path-searching in a graph
  - Nodes = states
  - Edges = actions
- Is this trivial?
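Since classical planning under assumptions A0–A7 reduces to path search in the state-transition graph, a breadth-first forward search is already enough for tiny instances. The sketch below reuses the hypothetical γ table from the earlier example; it is only an illustration of "planning as graph search", not of the algorithms discussed next.

# Classical planning as path search: a minimal breadth-first forward search
# over the state-transition graph. The gamma table is the same assumed toy
# table as in the earlier sketch (the full diagram is a figure in the slides).

from collections import deque

gamma = {
    ("s0", "take"): "s1",  ("s1", "move1"): "s3",
    ("s0", "move1"): "s2", ("s2", "take"): "s3",
    ("s3", "load"): "s4",  ("s4", "move2"): "s5",
}

def forward_search(s0, goal_states):
    """Return a sequence of actions (a1, ..., an) whose execution from s0
    ends in a goal state, or None if no such plan exists."""
    frontier = deque([(s0, [])])
    visited = {s0}
    while frontier:
        state, plan = frontier.popleft()
        if state in goal_states:
            return plan
        for (s, action), s_next in gamma.items():
            if s == state and s_next not in visited:
                visited.add(s_next)
                frontier.append((s_next, plan + [action]))
    return None

print(forward_search("s0", {"s5"}))   # e.g. ['take', 'move1', 'load', 'move2']

The reason this is not trivial in general is exactly the state-count argument on the next slide: the graph is far too large to enumerate explicitly.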
Domain-dependent planners

Classical planning: how hard is it?
- Generalize the earlier example:
  - 5 locations, 3 robot vehicles, 100 containers, 3 pallets to stack containers on
  - This is probably just a single boat...
- Then there are 10²⁷⁷ states
  - The number of particles in the universe is only about 10⁸⁷
  - The example is more than 10¹⁹⁰ times as large
- Automated-planning research has been heavily dominated by classical planning
  - Dozens (hundreds?) of different algorithms

Plan-space planning
- Decompose the set of goals into the individual goals and plan for them
- Keep bookkeeping information to detect and resolve interactions
- Produce a partially ordered plan that retains as much flexibility as possible
- The Mars rovers used a temporal-planning extension of this

Planning graphs
- Rough idea:
  - First, solve a relaxed version of the problem: each "level" contains all effects of all applicable actions, even though the effects may contradict each other
  - Next, do a state-space search within the planning graph
- Graphplan, IPP, CGP, DGP, LGP, PGP, SGP, TGP, ...

Heuristic search
- Heuristic functions like those in A*, created using techniques similar to planning graphs
- Problem: A* quickly runs out of memory, so do a greedy search instead
- Greedy search can get trapped in local minima
  - so use greedy search plus local search at local minima
- HSP [Bonet & Geffner], Fast Forward [Hoffmann]

Configurable planners

- In any fixed planning domain, a domain-independent planner usually will not work as well as a domain-specific planner made specifically for that domain
  - A domain-specific planner may be able to go directly toward a solution in situations where a domain-independent planner would explore many alternative paths
- But we don't want to write a whole new planner for every domain
- Configurable planners:
  - A domain-independent planning engine
  - Input includes information about how to solve problems in the domain
- Generally this means one can write a planning engine with fewer restrictions than domain-independent planners
- Hierarchical Task Network (HTN) planning
- Planning with control formulas

HTN planning (1/2)
- Problem reduction: tasks (activities) rather than goals
- Methods to decompose tasks into subtasks
- Enforce constraints, backtrack if necessary
  - e.g., a taxi is not a good choice for long distances
- Real-world applications: Noah, Nonlin, O-Plan, SIPE, SIPE-2, SHOP, SHOP2

Planning with control formulas
- At each state s, we have a control formula written in temporal logic
  - e.g., "never pick up x unless x needs to go on top of something else"
- For each successor of s, derive a control formula using logical progression
- Prune any successor state in which the progressed formula is false
- TLPlan, TALplanner, ...

Forward and backward search
- In state-space planning, one must choose whether to search forward or backward

HTN planning (2/2)
- In HTN planning, there are two choices to make about direction: forward or backward, and up or down

Limitation of ordered-task planning
- Problem of total order: a totally ordered plan can be more constrained than necessary (figure omitted)
- Solved with partial-order methods

Planning in an uncertain world
- Until now, we have assumed that each action has only one possible outcome
- But often that's unrealistic: in many situations, actions may have more than one possible outcome
  - Action failures, e.g., the gripper drops its load
  - Exogenous events, e.g., a road is closed
- We would like to be able to plan in such situations
- One approach: Markov Decision Processes

Automated planning: summary
- Domain-specific: write an entire computer program, which is a lot of work, but allows lots of domain-specific performance improvements
- Domain-independent: just give it the basic actions, which is not much work, but not very efficient

Reinforcement Learning

Reinforcement learning definition
- Agents are given sensory inputs: states s ∊ S and a reward r ∊ ℝ
- At each step, the agent selects an output: an action a ∊ A

Naïve approach
- Use supervised learning to learn a function f(s, a)
- For any input state, pick the best action: a = argmaxₐ∊A f(s, a)
- Will that work?

Markov Decision Process (1/2)
- The agent needs to think ahead: it needs a good sequence of actions
- Formalized in the Markov Decision Process framework!

Markov Decision Process (2/2)
- A finite set of states S, a finite set of actions A
- At each discrete time step, the agent observes the state sₜ ∊ S, chooses an action aₜ ∊ A, and receives an immediate reward rₜ
- The state changes to sₜ₊₁ ∊ S
- Markov assumption: sₜ₊₁ = δ(sₜ, aₜ) and rₜ = r(sₜ, aₜ)
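As a concrete instance of the MDP framework just defined, the sketch below sets up a toy deterministic "corridor" world with a transition function δ(s, a) and reward function r(s, a), and runs one episode of the observe–act–reward loop. The corridor domain and the placeholder random policy are assumptions for illustration, not part of the lecture material.

# Minimal sketch of the MDP framework: finite S and A, a deterministic
# transition function delta(s, a) and a reward function r(s, a).
# The corridor domain below is an invented toy example.

import random

S = [0, 1, 2, 3, 4]          # states: positions in a corridor, 4 is the goal
A = ["left", "right"]

def delta(s, a):
    """Deterministic state transition: s_{t+1} = delta(s_t, a_t)."""
    return min(s + 1, 4) if a == "right" else max(s - 1, 0)

def r(s, a):
    """Immediate reward r_t = r(s_t, a_t): 1 when the step reaches the goal."""
    return 1.0 if delta(s, a) == 4 else 0.0

# One episode of the agent/environment loop: observe s_t, choose a_t,
# receive r_t, move to s_{t+1}.
s = 0
for t in range(10):
    a = random.choice(A)      # placeholder policy; pi(s) is defined on the next slides
    print(t, s, a, r(s, a))
    s = delta(s, a)

The random action choice is only a stand-in: the following slides introduce the policy function π(s) that should replace it.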
Policy function
- The policy function decides which action to take in each state: aₜ = π(sₜ)
- The policy function is what we want to learn

Rewards
- To think ahead, an agent looks at future rewards: r(sₜ₊₁, aₜ₊₁), r(sₜ₊₂, aₜ₊₂), ...
- Formalized as the discounted sum of rewards (also called utility or value): V = ∑ γᵗ r(sₜ, aₜ)
- γ is the discount factor, making rewards far off into the future less valuable
- If we follow a specific policy π, the value of state sₜ is: Vπ(sₜ) = r(sₜ, π(sₜ)) + γVπ(sₜ₊₁)

Value function
- Value function for random movement (figure omitted)
- If we follow a specific policy π: Vπ(sₜ) = r(sₜ, π(sₜ)) + γVπ(sₜ₊₁)

Optimal policy
- If we know Vπ(s), then the policy π is given by: π(s) = argmaxₐ(r(s, a) + γVπ(δ(s, a)))
- The optimal value function for the optimal policy satisfies: V*(s) = maxₐ(r(s, a) + γV*(δ(s, a)))
- Finding the optimal policy is about finding π(s), or Vπ(sₜ), or both

Value iteration
- Initialize the function V(s) with random values V₀(s)
- For each state sₜ and each iteration k, compute: Vₖ₊₁(sₜ) = maxₐ(r(sₜ, a) + γVₖ(sₜ₊₁)), where sₜ₊₁ = δ(sₜ, a)
- Optimal policy: π*(s) = argmaxₐ(r(s, a) + γV*(δ(s, a)))

Q-function: learning in an unknown environment
- What if we do not know δ(s, a) or r(s, a)?
- Q-function: Q(s, a) = r(s, a) + γV*(δ(s, a))
- π*(s) = argmaxₐ Q(s, a)

Updating the Q-function
- Q and V* are closely related: V*(s) = maxₐ Q(s, a)
- Q can therefore be written: Q(sₜ, aₜ) = r(sₜ, aₜ) + γV*(δ(sₜ, aₜ)) = r(sₜ, aₜ) + γ maxₐ' Q(sₜ₊₁, a')
- If Q^ denotes the current approximation of Q, then it can be updated by: Q^(s, a) := r + γ maxₐ' Q^(s', a')

Q-learning for deterministic worlds
- For each s, a, initialize the table entry Q^(s, a) ⟵ 0
- Observe the current state s
- Do forever:
  - Select an action a and execute it
  - Receive the immediate reward r
  - Observe the new state s'
  - Update the table entry: Q^(s, a) := r + γ maxₐ' Q^(s', a')
  - s ⟵ s'

Q-learning example (figure omitted)

Q-learning for nondeterministic worlds
- What if the world is nondeterministic?
- V and Q are then expected values:
  - V = E[∑ γᵗ r(sₜ, aₜ)]
  - Q(s, a) = E[r(s, a) + γV*(δ(s, a))]
- Learning Q becomes: Q^ₙ(s, a) := (1 - αₙ) Q^ₙ₋₁(s, a) + αₙ (r + γ maxₐ' Q^ₙ₋₁(s', a'))

Reinforcement learning: summary
- Advantages
  - Allows an agent to adapt to maximise rewards in a potentially unknown environment
- Disadvantages
  - Requires computation polynomial in the number of states!
  - The number of states grows exponentially with the input dimensions!
  - Reinforcement learning assumes discrete state and action spaces

Individual and Group Assignment

Project
- A group of 4 to 6 students
- Implement a RoboRescue agent team
- Work individually on a subpart of the problem

Tasks
- Foundation tasks: navigation, communication, agents (police, ambulance, fire brigade)
- Exploration, prediction, task allocation

Reports
- Individual: find around 4 related articles and write a one-page description
  - Deadline: October 28th
- Individual: implement and evaluate the technique; write a report describing the technique, the results and a discussion
  - Deadline: draft December 14th, final January 6th
- Group: one per team! A description of the algorithms and strategies used

What is a good report?
- The grade is based on the report quality, such as readability, language, pictures, structure and length, and the level of technical detail weighted by the difficulty of the chosen approaches
- The reports should be 5-6 pages, but it is more important to make it possible for the reader to understand your work than to get exactly the right number of pages
- You can get feedback: submit your draft before December 14th, and we have a seminar on December 16th for discussing the drafts

Summary
- Automated planning
  - Classical planning problem
  - HTN
- Reinforcement learning
  - Markov decision process
  - Q-Learning
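To tie the reinforcement-learning part together, here is a hedged sketch of the tabular Q-learning algorithm for deterministic worlds described above, run on the same invented corridor MDP. The ε-greedy exploration strategy and the episode restarts are assumptions; the slides only say "select an action a and execute it".

# Sketch of tabular Q-learning for deterministic worlds on the toy corridor
# MDP (delta, r and the exploration strategy are assumptions, not lecture material).

import random

S = [0, 1, 2, 3, 4]
A = ["left", "right"]
GOAL, GAMMA, EPSILON = 4, 0.9, 0.1

def delta(s, a):
    return min(s + 1, GOAL) if a == "right" else max(s - 1, 0)

def r(s, a):
    return 1.0 if delta(s, a) == GOAL else 0.0

# Initialize the table entry Q^(s, a) <- 0 for every s, a.
Q = {(s, a): 0.0 for s in S for a in A}

s = 0
for step in range(2000):
    # epsilon-greedy action selection (an assumed exploration strategy)
    if random.random() < EPSILON:
        a = random.choice(A)
    else:
        a = max(A, key=lambda act: Q[(s, act)])
    reward, s_next = r(s, a), delta(s, a)
    # Update rule from the slides: Q^(s, a) := r + gamma * max_a' Q^(s', a')
    Q[(s, a)] = reward + GAMMA * max(Q[(s_next, a2)] for a2 in A)
    s = 0 if s_next == GOAL else s_next   # restart the episode at the goal

print({s: max(A, key=lambda a: Q[(s, a)]) for s in S})   # greedy policy per state

After enough steps the greedy policy chooses "right" in every non-goal state, i.e. the learned Q^ induces π*(s) = argmaxₐ Q(s, a) without ever using δ or r explicitly in the agent, which is the point of Q-learning in an unknown environment.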