Lectures TDDD10 AI Programming Automated Planning

TDDD10 AI Programming
Automated Planning
Cyrille Berger
Planning slides borrowed from Dana Nau:
http://www.cs.umd.edu/~nau/planning/slides/
1 AI Programming: Introduction
2 Introduction to
3 Agents and Agents
4 Multi-Agent and
5 Multi-Agent Decision
6 Cooperation And Coordination
7 Cooperation And Coordination
8 Machine Learning
9 Automated Planning
10 Putting It All
Lecture content
Automated Planning
What is planning?
Types of planners
Domain-dependent planners
Configurable planners
Reinforcement Learning
Individual and Group Assignment
Automated Planning

What is planning?
Dictionary Definitions of "Plan"
1. A scheme, program, or method worked out beforehand for the accomplishment of an objective: a plan of attack.
2. A proposed or tentative project or course of action: had no plans for the evening.
3. A systematic arrangement of elements or important parts; a configuration or outline: a seating plan; the plan of a story.
4. A drawing or diagram made to scale showing the structure or arrangement of something.
5. A program or policy stipulating a service or benefit: a pension plan.
AI Definition of Plan
"[a representation] of future behavior (...) usually a set of actions, with temporal and other constraints on them, for execution by some agent or agents."
– Austin Tate, MIT Encyclopedia of the Cognitive Sciences, 1999

State transition system
Real world is absurdly complex, need to approximate
Only represent what the planner needs to reason about
State transition system Σ = (S, A, E, ɣ)
S = {abstract states}
e.g., states might include a robot's location, but not its position and orientation
A = {abstract actions}
e.g., "move robot from loc2 to loc1" may need a complex lower-level implementation
E = {abstract exogenous events}
Not under the agent's control
ɣ = state transition function
Gives the next state, or possible next states, after an action or event
ɣ : S × (A ∪ E) → S or ɣ : S × (A ∪ E) → {S₁, ..., Sₙ}
From plan to execution
Example
States S = {s₀, …, s₅}
A = {move1, move2, put, take, load, unload}
E = ∅
ɣ : S × A → S: defined by the figure on the slide
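A minimal Python sketch of such a state-transition system may make this concrete. The transition table below is an assumption reconstructed from the plan and policies discussed on the following slides; the full ɣ appears only as a figure on the original slide.

```python
# States, actions and (part of) the transition function of the example.
S = {"s0", "s1", "s2", "s3", "s4", "s5"}                 # abstract states
A = {"move1", "move2", "put", "take", "load", "unload"}  # abstract actions
E = set()                                                # no exogenous events

gamma = {                     # partial transition function S x A -> S (assumed)
    ("s0", "take"): "s1",
    ("s0", "move1"): "s2",
    ("s1", "move1"): "s3",
    ("s2", "take"): "s3",
    ("s3", "load"): "s4",
    ("s4", "move2"): "s5",
}

def execute(state, plan):
    """Apply a sequence of actions, failing if one is not applicable."""
    for action in plan:
        if (state, action) not in gamma:
            raise ValueError(f"{action} is not applicable in {state}")
        state = gamma[(state, action)]
    return state

print(execute("s0", ["take", "move1", "load", "move2"]))  # -> s5
```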
Planning problem
Description of Σ = (S, A, E, ɣ)
Initial state or set of states
Objective
Goal state, set of goal states, set of tasks, "trajectory" of states, objective function, …
Example:
Initial state = s₀
Goal state = s₅

Plan
Classical plan: a sequence of actions:
⟨take, move1, load, move2⟩
Policy: partial function from S into A
{(s₀, take), (s₁, move1), (s₃, load), (s₄, move2)}
{(s₀, move1), (s₂, take), (s₃, load), (s₄, move2)}
Both, if executed starting at s₀, produce s₅
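A small sketch contrasting the classical plan with one of the policies above; the transition table is the same assumed one as in the previous sketch.

```python
gamma = {("s0", "take"): "s1", ("s0", "move1"): "s2",
         ("s1", "move1"): "s3", ("s2", "take"): "s3",
         ("s3", "load"): "s4", ("s4", "move2"): "s5"}

plan = ["take", "move1", "load", "move2"]      # classical plan: action sequence
policy = {"s0": "take", "s1": "move1",         # policy: partial function S -> A
          "s3": "load", "s4": "move2"}

def run_plan(state, plan):
    for action in plan:                        # apply the actions in order
        state = gamma[(state, action)]
    return state

def run_policy(state, policy, goal):
    while state != goal:                       # look up the action for each state
        state = gamma[(state, policy[state])]
    return state

print(run_plan("s0", plan), run_policy("s0", policy, "s5"))   # s5 s5
```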
Planning and Scheduling
Scheduling
Decide when and how to perform a given set of actions
Time and resource constraints and priorities
e.g. the scheduler of your kernel, a printer
NP-complete
Planning
Decide what actions to use to achieve some set of objectives
Sequence of actions
Can be much worse than NP-complete; worst case is undecidable

Applications
Robotics
Path and motion planning
Industrial applications
Production machines: sheet-metal bending
Forest fire fighting
Package delivery
Games: playing bridge, chess, ...
Types of planners
Domain-specific: Made or tuned for a specific planning domain
Won't work well (if at all) in other planning domains
Domain-independent: In principle, works in any planning domain
In practice, need restrictions on what kind of planning domain
Configurable
Domain-independent planning engine
Input includes info about how to solve problems in some domain
Domain-Specific Planners
Most successful real-world planning systems work this way:
Mars exploration, sheet-metal bending, playing bridge, …
Often use problem-specific techniques that are difficult to generalize to other planning domains

Domain-Specific Planner: Example
Bridge
After card dealing, players make bids and prepare a plan that they will follow during the game
Remove states depending on the cards you are holding
For instance, North will not choose "hearts" as trump suit
Domain-Independent Planners
In principle, works in any planning domain
No domain-specific knowledge except the description of the system Σ
In practice:
Not feasible to make domain-independent planners work well in all possible planning domains
Make simplifying assumptions to restrict the set of domains
Classical planning
Historical focus of most research on automated planning

Restrictive Assumptions
A0: Finite system: finitely many states, actions, events
A1: Fully observable: the controller always knows Σ's current state
A2: Deterministic: each action has only one outcome
A3: Static (no exogenous events): no changes but the controller's actions
A4: Attainment goals: a set of goal states Sg
A5: Sequential plans: a plan is a linearly ordered sequence of actions (a1, a2, ..., an)
A6: Implicit time: no time durations; linear sequence of instantaneous states
A7: Off-line planning: planner doesn't know the execution status
Domain-dependent planners

Classical Planning
Classical planning requires all eight restrictive assumptions
Offline generation of action sequences for a deterministic, static, finite system, with complete knowledge, attainment goals, and implicit time
Reduces to the following problem:
Given a planning problem P = (Σ, s₀, Sg)
Find a sequence of actions (a1, a2, ..., an) that produces a sequence of state transitions (s1, s2, ..., sn) such that sn is in Sg
This is just path-searching in a graph
Nodes = states
Edges = actions
Is this trivial?
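To make the path-searching view concrete, here is a minimal breadth-first forward search over the same toy transition table used earlier (the table is an assumption; real classical planners search much larger, implicitly defined graphs).

```python
from collections import deque

def forward_search(gamma, s0, goal_states):
    """Return a sequence of actions from s0 to a goal state, or None."""
    frontier = deque([(s0, [])])
    visited = {s0}
    while frontier:
        state, plan = frontier.popleft()
        if state in goal_states:
            return plan
        for (s, action), next_state in gamma.items():
            if s == state and next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, plan + [action]))
    return None

gamma = {("s0", "take"): "s1", ("s0", "move1"): "s2",
         ("s1", "move1"): "s3", ("s2", "take"): "s3",
         ("s3", "load"): "s4", ("s4", "move2"): "s5"}
print(forward_search(gamma, "s0", {"s5"}))  # ['take', 'move1', 'load', 'move2']
```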
Classical Planning
Generalize the earlier example:
5 locations, 3 robot vehicles, 100 containers, 3 pallets to stack containers on
This is probably just a single boat...
Then there are 10²⁷⁷ states
The number of particles in the universe is only about 10⁸⁷
The example is more than 10¹⁹⁰ times as large
Automated-planning research has been heavily dominated by classical planning
Dozens (hundreds?) of different algorithms

Plan-Space Planning
Decompose sets of goals into the individual goals
Plan for them
Bookkeeping info to detect and resolve interactions
Produce a partially ordered plan that retains as much flexibility as possible
The Mars rovers used a temporal-planning extension of this
Planning Graphs
First, solve a relaxed problem:
Each "level" contains all effects of all applicable actions
Even though the effects may contradict each other
Next, do a state-space search within the planning graph
Graphplan, IPP, CGP, DGP, LGP, PGP, SGP, TGP, ...

Heuristic Search
Heuristic function like those in A*
Created using techniques similar to planning graphs
Problem: A* quickly runs out of memory
So do a greedy search instead
Greedy search can get trapped in local minima
Rough idea: Greedy search plus local search at local minima
HSP [Bonet & Geffner]
FastForward [Hoffmann]
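A minimal sketch of the greedy-search part of this idea. The heuristic h is passed in as a parameter; HSP and FastForward instead compute it from a relaxed version of the problem, and the toy usage at the bottom is an assumption, not taken from the slides.

```python
import heapq
from itertools import count

def greedy_search(successors, s0, is_goal, h):
    """Greedy best-first search: always expand the state with lowest h."""
    tie = count()                                   # tie-breaker for the heap
    frontier = [(h(s0), next(tie), s0, [])]
    visited = {s0}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (h(nxt), next(tie), nxt, plan + [action]))
    return None

# Toy usage with a trivial heuristic (h = 0 degenerates to uninformed search).
gamma = {("s0", "take"): "s1", ("s1", "move1"): "s3",
         ("s3", "load"): "s4", ("s4", "move2"): "s5"}
succ = lambda s: [(a, s2) for (s1, a), s2 in gamma.items() if s1 == s]
print(greedy_search(succ, "s0", lambda s: s == "s5", lambda s: 0))
```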
Configurable planners
In any fixed planning domain, a domain-independent planner usually will not work as well as a domain-specific planner made specifically for that domain
A domain-specific planner may be able to go directly toward a solution in situations where a domain-independent planner would explore many alternative paths
But we don't want to write a whole new planner for every domain
Configurable planners:
Domain-independent planning engine
Input includes info about how to solve problems in the domain
Generally this means one can write a planning engine with fewer restrictions than domain-independent planners
Hierarchical Task Network (HTN) planning
Planning with control formulas
Planning with Control Formulas
At each state s, we have a control formula written in temporal logic
e.g., "never pick up x unless x needs to go on top of something else"
For each successor of s, derive a control formula using logical progression
Prune any successor state in which the progressed formula is false
TLPlan, TALplanner, ...

HTN Planning (1/2)
Problem reduction
Tasks (activities) rather than goals
Methods to decompose tasks into subtasks
Enforce constraints, backtrack if necessary
e.g., taxi not good for long distances
Real-world applications
Noah, Nonlin, O-Plan, SIPE, SIPE-2, SHOP, SHOP2
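A minimal sketch of HTN-style problem reduction. The travel domain below (walk vs. taxi) is a made-up illustration, not the planner used in the course; real HTN planners such as SHOP2 also track state changes and variable bindings.

```python
primitive = {"walk", "call_taxi", "ride_taxi", "pay_driver"}

# methods: task -> list of alternative decompositions,
# each decomposition is (precondition(state) -> bool, [subtasks])
methods = {
    "travel": [
        (lambda s: s["distance"] <= 2, ["walk"]),
        (lambda s: s["distance"] > 2, ["call_taxi", "ride_taxi", "pay_driver"]),
    ],
}

def decompose(state, tasks):
    """Return a sequence of primitive actions, or None if no method applies."""
    if not tasks:
        return []
    first, rest = tasks[0], tasks[1:]
    if first in primitive:
        tail = decompose(state, rest)
        return None if tail is None else [first] + tail
    for precond, subtasks in methods.get(first, []):
        if precond(state):
            plan = decompose(state, subtasks + rest)
            if plan is not None:
                return plan            # otherwise backtrack to the next method
    return None

print(decompose({"distance": 5}, ["travel"]))  # ['call_taxi', 'ride_taxi', 'pay_driver']
```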
Forward and Backward Search
In state-space planning, must choose whether to search forward or backward

HTN Planning (2/2)
In HTN planning, there are two choices to make about direction:
forward or backward
up or down
Limitation of Ordered-Task Planning
Problem of total order
This could be nicer
Solved with partial-order methods

Planning in an uncertain world
Until now, we have assumed that each action has only one possible outcome
But often that's unrealistic
In many situations, actions may have more than one possible outcome:
Action failures
e.g., gripper drops its load
Exogenous events
e.g., road closed
Would like to be able to plan in such situations
One approach: Markov Decision Processes
Automated Planning - Summary
Domain-specific
Write an entire computer program - lots of work
Lots of domain-specific performance improvements
Domain-independent
Just give it the basic actions - not much work
Not very efficient

Reinforcement Learning
Reinforcement Learning Definition
Agents are given sensory inputs:
State s ∊ S
Reward R ∊ ℝ
At each step, agents select an output:
Action a ∊ A

Naïve Approach
Use supervised learning to learn:
f(s, a) = …
For any input state, pick the best action:
a = argmaxₐ∊A f(s, a)
Will that work?
Markov Decision Process (1/2)
The agent needs to think ahead
It needs a good sequence of actions.
Formalized in the Markov Decision Process framework!

Markov Decision Process (2/2)
Finite set of states S, finite set of actions A.
At each discrete time step:
the agent observes state sₜ ∊ S
chooses action aₜ ∊ A
and receives an immediate reward rₜ.
The state changes to sₜ₊₁ ∊ S
Markov assumption: sₜ₊₁ = δ(sₜ, aₜ) and rₜ = r(sₜ, aₜ).
Policy Function
The policy function decides which action to take in each state:
aₜ = π(sₜ)
The policy function is what we want to learn

Rewards
To think ahead, an agent looks at future rewards: r(sₜ₊₁, aₜ₊₁), r(sₜ₊₂, aₜ₊₂), ...
formalized as the sum of rewards (also called utility or value):
V = ∑ ɣᵗ r(sₜ, aₜ)
ɣ is the discount factor, making rewards far off into the future less valuable.
If we follow a specific policy π, the value of state sₜ is:
Vπ(sₜ) = r(sₜ, π(sₜ)) + ɣVπ(sₜ₊₁)
Value Function
If we follow a specific policy π:
Vπ(sₜ) = r(sₜ, π(sₜ)) + ɣVπ(sₜ₊₁)
Value function for random movement: (figure on the slide)

Optimal Policy
If we know Vπ(s), then the policy π is given by:
π(s) = argmaxₐ (r(s, a) + ɣVπ(δ(s, a)))
Optimal value function for the optimal policy: (figure on the slide)
Finding the optimal policy is about finding π(s) or Vπ(sₜ) or both.
Value Iteration
Initialize the function V(s) with random values V₀(s)
For each state sₜ and each iteration k, compute:
Vₖ₊₁(sₜ) = maxₐ (r(sₜ, a) + ɣVₖ(sₜ₊₁)), where sₜ₊₁ = δ(sₜ, a)
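A minimal sketch of this value-iteration loop on a tiny made-up deterministic MDP; the states, δ and r below are illustrative assumptions, not from the slides.

```python
import random

states = ["s0", "s1", "s2"]
actions = ["a", "b"]
delta = {("s0", "a"): "s1", ("s0", "b"): "s0",
         ("s1", "a"): "s2", ("s1", "b"): "s0",
         ("s2", "a"): "s2", ("s2", "b"): "s2"}
r = {("s1", "a"): 10.0}          # reward for the transition s1 -a-> s2; others 0
gamma = 0.9

V = {s: random.random() for s in states}   # V0(s): random initialisation
for k in range(100):
    # synchronous update: V_{k+1}(s) = max_a ( r(s,a) + gamma * V_k(delta(s,a)) )
    V = {s: max(r.get((s, a), 0.0) + gamma * V[delta[(s, a)]] for a in actions)
         for s in states}

print({s: round(v, 2) for s, v in V.items()})  # values converge towards V*
```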
Q-Function: learning in an unknown environment
Optimal policy:
π*(s) = argmaxₐ (r(s, a) + ɣV*(δ(s, a)))
What if we do not know δ(s, a)? or r(s, a)?
Q-Function:
Q(s, a) = r(s, a) + ɣV*(δ(s, a))
π*(s) = argmaxₐ (Q(s, a))
Update the Q-Function
Q and V* are closely related:
V*(s) = maxₐ Q(s, a)
Q can be written as:
Q(sₜ, aₜ) = r(sₜ, aₜ) + ɣV*(δ(sₜ, aₜ))
= r(sₜ, aₜ) + ɣ maxₐ' Q(sₜ₊₁, a')
If Q^ denotes the current approximation of Q, then it can be updated by:
Q^(s, a) := r + ɣ maxₐ' Q^(s', a')

Q-Learning for Deterministic Worlds
For each s, a initialize table entry Q^(s, a) ⟵ 0.
Observe current state s.
Do forever:
Select an action a and execute it
Receive immediate reward r
Observe the new state s'
Update the table entry Q^(s, a) := r + ɣ maxₐ' Q^(s', a')
s ⟵ s'
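A sketch of this tabular Q-learning loop on a small made-up deterministic environment; δ, r and the random exploration policy are assumptions for illustration, but the update rule is the one on the slide.

```python
import random

states, actions = ["s0", "s1", "s2"], ["a", "b"]
delta = {("s0", "a"): "s1", ("s0", "b"): "s0",
         ("s1", "a"): "s2", ("s1", "b"): "s0",
         ("s2", "a"): "s0", ("s2", "b"): "s2"}
r = {("s1", "a"): 10.0}                 # reward for the transition s1 -a-> s2
gamma = 0.9

Q = {(s, a): 0.0 for s in states for a in actions}   # initialize table entries to 0
s = "s0"                                             # observe current state s
for _ in range(5000):                                # "do forever", truncated here
    a = random.choice(actions)                       # select an action and execute it
    reward = r.get((s, a), 0.0)                      # receive immediate reward
    s_next = delta[(s, a)]                           # observe the new state s'
    Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in actions)
    s = s_next                                       # s <- s'

print(max(actions, key=lambda a: Q[("s1", a)]))      # greedy action in s1: 'a'
```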
Q-Learning Example

Q-Learning for Non-Deterministic Worlds
What if the world is nondeterministic?
V and Q are then expected values:
V = E[∑ ɣᵗ r(sₜ, aₜ)]
Q(s, a) = E[r(s, a) + ɣV*(δ(s, a))]
Learning Q becomes:
Q^ₙ(s, a) := (1 - αₙ) Q^ₙ₋₁(s, a) + αₙ (r + ɣ maxₐ' Q^ₙ₋₁(s', a'))
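A sketch of that update as code: the table entry is moved only part of the way towards the new sample. The slide leaves αₙ unspecified; αₙ = 1/n used below is one common choice, and the toy numbers are assumptions.

```python
def q_update(Q, counts, s, a, reward, s_next, actions, gamma=0.9):
    """Non-deterministic Q-learning update with a decaying learning rate."""
    counts[(s, a)] = counts.get((s, a), 0) + 1
    alpha = 1.0 / counts[(s, a)]                     # alpha_n = 1 / n
    sample = reward + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# Example: one update on a two-state, one-action toy table.
Q = {("s0", "go"): 0.0, ("s1", "go"): 5.0}
q_update(Q, {}, "s0", "go", reward=1.0, s_next="s1", actions=["go"])
print(Q[("s0", "go")])   # 1.0 + 0.9 * 5.0 = 5.5 (alpha_1 = 1, full sample)
```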
Summary
Advantages
Allows an agent to adapt to maximise rewards in a potentially unknown environment.
Disadvantages
Requires computation polynomial in the number of states!
The number of states grows exponentially with input dimensions!
Reinforcement learning assumes discrete state and action spaces.
Individual and Group Assignment

Project
A group of 4 to 6 students
Implement a RoboRescue team
Work individually on a subpart of the problem
Tasks
Foundation tasks
Navigation
Communication
Agents: police, ambulance, fire brigade
Exploration
Prediction
Task allocation

Reports
Individual
Find around 4 related articles
Write a one-page description
Deadline: October 28th
Individual
Implement and evaluate the technique
Write a report describing the technique, results and a discussion
Deadline: draft December 14th, final January 6th
Group
One per team!
A description of the algorithms and strategies used
What is a good Report?
Grade based on the report quality, such as readability, language, pictures, structure and length, and the level of technical detail weighted with the difficulty of the chosen approaches
The reports should be 5-6 pages, but it is more important to make it possible for the reader to understand your work than to get the exact right number of pages.
You can get feedback: submit your draft before December 14th, and there will be a seminar on December 16th for discussing the drafts

Summary
Automated planning
Classical planning problem
HTN
Reinforcement learning
Markov decision process
Q-Learning