Learning in Multi-agent System
Zhai Yuqing
yqzhai@seu.edu.cn
Outline
• Agent Learning
• Multi-agent learning
• Reinforcement Learning & Multi-agent Reinforcement Learning
Agent Learning
Why Learning?
• Learning is essential for unknown environments,
– i.e., when the designer lacks omniscience
• Learning is useful as a system construction method,
– i.e., expose the agent to reality rather than trying to write it down
• Learning modifies the agent's decision mechanisms to improve performance
Why Learning?
• It is difficult to hand-design behaviours that act optimally (or even close to it)
• Agents can optimize themselves using reinforcement learning
– Not learning new concepts (or behaviours): given a set of states and actions, they can find the best policy
– Is this just adaptive control?
• Learning can be done on-line and continuously throughout the lifetime of the agent, adapting to (slowly) changing situations
Learning
[Figure: single-agent learning loop – the world/state sends observations, sensations and rewards to a learning algorithm, which updates a policy; the policy selects actions that act back on the world]
Learning to act in the world
[Figure: multi-agent learning loop – the same loop, but the environment now also contains other agents (possibly learning) whose actions affect the world, the observations and the rewards]
Learning Agent Architecture
• A learning agent can be thought of as containing a performance element that decides what actions to take and a learning element that modifies the performance element so that it makes better decisions
Multi-agent Learning
Learning in Multiagent Systems
• Intersection of Distributed AI (DAI) and Machine Learning (ML)
• Why bring them together?
– There is a strong need to equip multiagent systems with learning abilities
– The extended view of ML as multiagent learning is qualitatively different from traditional ML and can lead to novel ML techniques and algorithms
Multi-Agent Learning Problem:
• An agent tries to solve its learning problem, while other agents in the environment are also trying to solve their own learning problems → challenging non-stationarity
• Main scenarios: (1) cooperative; (2) self-interested (many deep issues swept under the rug)
• An agent may know very little about the other agents:
– their payoffs may be unknown
– their learning algorithms may be unknown
• Traditional method of solution: game theory (uses several questionable assumptions)
Multi-agent Learning Problem
• An agent tries to solve its own learning problem, while other agents in the environment try to solve their own learning problems
• Larger state space
– Might have to include the state of other robots in one's own state
• Problems of multi-agent RL
– All of the problems from the single-agent case
– Other agents are unpredictable or non-stationary
– Should reinforcement be local or global?
– Was the robot trying to achieve its goal, or reacting to other robots, when it performed a good action?
Learning in Multi-Agent Systems
• No doubt, learning is of great importance for MAS!
• Challenge:
– The multi-agent learning problem: the optimal policy changes, because the other agents are learning too
– Can we have a unifying framework in which this learning can be understood?
• Challenging MAS domains:
– Robotic soccer
– Traffic
– Robotic rescue
– Trading agents, e-commerce
– Automated driving
General Characterization
• Principal categories of learning
• The features in which learning approaches
may differ
• The fundamental learning problem known
as the credit-assignment problem
Principal Categories
• Centralized Learning (isolated learning)
– Learning executed by a single agent, no
interaction with other agents
– Several centralized learners may try to obtain
different or identical goals at the same time
Principal Categories
• Decentralized Learning (interactive learning)
– Several agents are engaged in the same learning process
– Several groups of agents may try to obtain different or
identical learning goals at the same time
• Single agent may be involved in several
centralized/decentralized learning processes at the
same time
Learning and Activity
Coordination
• Previous research on coordination focused on offline design of behavioral rules, negotiation
protocols, etc…
• Agents operating in open, dynamic environments
must be able to adapt to changing demands and
opportunities
• How can agents learn to appropriately coordinate
their activities?
Learning about and from Other
Agents
• Agents learn to improve their individual performance
• They can better capitalize on available opportunities by predicting the behavior of other agents (preferences, strategies, intentions, etc…)
Learning Organizational Roles
• Assume agents have the capability of
playing one of several roles in a situation
• Agents need to learn role assignments to
effectively complement each other
Learning Organizational Roles
• The framework includes Utility, Probability and Cost (UPC) estimates, plus a Potential estimate, of a role adopted in a particular situation
– Utility – the worth of the desired final state if the agent adopts the given role in the current situation
– Probability – the likelihood of reaching a successful final state (given the role/situation)
– Cost – the associated computational cost incurred
– Potential – the usefulness of a role in discovering pertinent global information
Learning Organizational Roles: Theoretical Framework
• $S_k$, $R_k$ – the sets of situations and roles for agent k
• An agent maintains $S_k \times R_k$ vectors of UPC estimates (one per situation-role pair)
• During the learning phase, the probability of selecting role r in situation s is:
$\Pr(r) = \dfrac{f(U_{rs}, P_{rs}, C_{rs}, Potential_{rs})}{\sum_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Potential_{js})}$
• f rates a role by combining the component measures
Learning Organizational Roles: Theoretical Framework
• After the learning phase is over, the role to be played in situation s is:
$r = \arg\max_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Potential_{js})$
• UPC values are learned using reinforcement learning
• UPC estimates after n updates: $\hat{U}^n_{rs}$, $\hat{P}^n_{rs}$, $\widehat{Potential}^n_{rs}$
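A small sketch of how the role-selection rules above could look in code. The rating function f shown here (expected utility minus cost, plus a weighted potential bonus) is an illustrative assumption, not the combination used in the source; the UPC table layout is also hypothetical.

```python
import random

# Hypothetical UPC store: upc[(situation, role)] = (U, P, C, Potential)
def f(u, p, c, potential, w_pot=0.1):
    # Illustrative rating function (assumed form): expected utility minus cost,
    # plus a small bonus for roles that help discover global information.
    # It must stay non-negative for the selection probabilities to be valid.
    return max(u * p - c + w_pot * potential, 0.0)

def choose_role_learning(upc, situation, roles):
    """Learning phase: Pr(r) = f_r / sum_j f_j over the roles available to the agent."""
    scores = [f(*upc[(situation, r)]) for r in roles]
    return random.choices(roles, weights=scores, k=1)[0]

def choose_role_final(upc, situation, roles):
    """After learning: play the role that maximizes f in this situation."""
    return max(roles, key=lambda r: f(*upc[(situation, r)]))
```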
Learning Organizational Roles: Updating the Utility
• S – the set of situations encountered between the time of adopting role r in situation s and reaching a final state F with utility $U_F$
• The utility values for all roles chosen in each of the situations in S are updated:
$\hat{U}^{n+1}_{rs} = (1-\alpha)\,\hat{U}^n_{rs} + \alpha\, U_F$
Learning Organizational Roles: Updating the Probability
• $O : S \to [0,1]$ – returns 1 if the given final state is successful
• The update rule for the probability:
$\hat{P}^{n+1}_{rs} = (1-\alpha)\,\hat{P}^n_{rs} + \alpha\, O(F)$
Learning Organizational Roles: Updating the Potential
• $Conf(S)$ – returns 1 if, on the path to the final state, conflicts are detected and resolved by information exchange
• The update rule for the potential:
$\widehat{Potential}^{n+1}_{rs} = (1-\alpha)\,\widehat{Potential}^n_{rs} + \alpha\, Conf(S)$
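A minimal sketch of the three update rules above, applied once an episode reaches its final state F. The data layout, the learning rate α, and the episode bookkeeping are illustrative assumptions.

```python
def update_upc(upc, visited, u_final, success, conflicts_resolved, alpha=0.1):
    """Apply the UPC updates to every (situation, role) pair chosen this episode.

    visited            - list of (situation, role) pairs chosen on the way to F
    u_final            - U_F, the utility of the final state reached
    success            - O(F): 1 if the final state is successful, else 0
    conflicts_resolved - Conf(S): 1 if conflicts were detected and resolved
                         by information exchange, else 0
    """
    for key in visited:
        u, p, c, pot = upc[key]
        u   = (1 - alpha) * u   + alpha * u_final             # utility update
        p   = (1 - alpha) * p   + alpha * success             # probability update
        pot = (1 - alpha) * pot + alpha * conflicts_resolved  # potential update
        upc[key] = (u, p, c, pot)   # the cost C is not learned by these rules
```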
Learning to Exploit an Opponent: Model-Based Approach
• The most prominent approach in AI for developing playing strategies is the minimax algorithm
– It assumes that the opponent will always choose the move that is worst for us
• An accurate model of the opponent can be used to develop better strategies
Learning to Exploit an Opponent: Model-Based Approach
• The main problem of RL is its slow convergence
• The model-based approach tries to reduce the number of interaction examples needed for learning
• It performs a deeper analysis of the past interaction experience
Model-Based Approach
• The learning process is split into two separate stages:
– Infer a model of the other agent based on past experience
– Utilize the learned model to design an effective interaction strategy for the future
Reducing Communication by Learning
• Learning is a method for reducing the communication load among agents
• Consider the contract-net approach:
– Broadcasting of task announcements is assumed
– Scalability problems arise when the number of managers/tasks increases
Reducing Communication in Contract-Net
• A flexible learning-based mechanism called addressee learning
• It enables agents to acquire knowledge about the other agents' task-solving abilities
• Tasks may then be assigned more directly
Reducing Communication in
Contract-Net
• Case-based reasoning is used for knowledge
acquisition and refinement
• Humans often solve problems using
solutions that worked well for similar
problems
• Construct cases – problem-solution pairs
Case-Based Reasoning in Contract Net
• Each agent maintains its own case base
• A case consists of:
– A task specification $T_i = \langle A_{i1}V_{i1}, \ldots, A_{im_i}V_{im_i} \rangle$ (attribute-value pairs)
– Information about which agent has already solved the task and the quality of the solution
• A similarity measure for tasks is needed
Case-Based Reasoning in Contract Net
• The distance $Dist(A_{ir}, A_{js})$ between two attributes is domain-specific
• The similarity between two tasks $T_i$ and $T_j$:
$Similar(T_i, T_j) = \sum_{r}\sum_{s} Dist(A_{ir}, A_{js})$
• For a task $T_i$, the set of similar tasks is:
$S(T_i) = \{\, T_j : Similar(T_i, T_j) \geq 0.85 \,\}$
Case-Based Reasoning in Contract Net
• An agent has to assign a task $T_i$ to another agent
• It selects the most appropriate agents by computing their suitability:
$Suit(A, T_i) = \dfrac{1}{|S(T_i)|} \sum_{T_j \in S(T_i)} Perform(A, T_j)$
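A small sketch of the addressee-learning computations above. The case-base layout, the attribute-distance function dist, the direction of the 0.85 threshold, and the perform lookup are illustrative assumptions.

```python
def similar(task_i, task_j, dist):
    """Similar(T_i, T_j): sum of domain-specific attribute distances."""
    return sum(dist(a_i, a_j) for a_i in task_i for a_j in task_j)

def similar_set(task_i, case_base, dist, threshold=0.85):
    """S(T_i): the previously solved tasks considered similar to task_i."""
    return [case for case in case_base
            if similar(task_i, case["task"], dist) >= threshold]

def suitability(agent, task_i, case_base, dist, perform):
    """Suit(A, T_i): average of Perform(A, T_j) over the tasks in S(T_i)."""
    s = similar_set(task_i, case_base, dist)
    if not s:
        return 0.0
    return sum(perform(agent, case["task"]) for case in s) / len(s)
```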
Improving Learning by
Communication
• Two forms of improving learning by
communication are distinguished:
– Learning based on low-level communication
(e.g. exchanging missing information)
– Learning based on high-level communication
(e.g. mutual explanation)
Improving Learning by
Communication
• Example: Predator-Prey domain
– Predators are Q-learners
– Each predator has a limited visual perception
– Exchange sensor data – low-level
communication
– Experiments show that it clearly leads to
improved learning results
Knowledge exchange in MAS
• More sophisticated implementations
provide knowledge exchange capabilities
– Exchange the strongest rules they have learned
– Multi-agent Mutual Learning (MAML)
Some Open Questions…
• What are the unique requirements and conditions
for Multiagent learning?
• Do centralized and decentralized learning
qualitatively differ from each other?
• Development of theoretical foundations of
decentralized learning
• Applications of Multiagent learning in complex
real-world environments
Reinforcement Learning
& Multi-agent Reinforcement
Learning
Reinforcement Learning Approach
• Features:
– The reward is not given immediately after the agent's action
– Usually, it is given only after achieving the goal
– This delayed reward is the only clue for the agent's learning
[Figure: agent-environment loop – input from the environment feeds a state recognizer; a look-up table W(S, a) and an action selector choose the action; a learner updates the table from the reward]
Overview:
• TD [Sutton 88], Q-learning [Watkins 92]
– The agent can estimate a model of the state transition probabilities of E (the environment), if E has fixed state transition probabilities (i.e., E is an MDP)
• Profit sharing [Grefenstette 88]
– The agent can estimate a model of the state transition probabilities of E, even though E does not have fixed state transition probabilities
• c.f. Dynamic programming
– The agent needs to have a perfect model of the state transition probabilities of E
Reinforcement Learning Scenario
[Figure: agent-environment interaction – at each step the agent observes state s_t and reward r_t, emits action a_t, and the environment returns the next state s_{t+1} and reward r_{t+1}]
Example (with discount factor γ = 0.9)
• Q(s, a_red) = 0 + γ × 81 = 72.9
• Q(s, a_green) = 0 + γ × 100 = 90
• Q(s, a_blue) = 0 + γ × 100 = 90
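A minimal tabular Q-learning sketch. The environment interface (env.actions, env.reset(), env.step()) and the hyperparameters are assumptions for illustration; the update itself is the standard Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)].

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    Assumes a hypothetical env exposing:
      env.actions     - list of available actions
      env.reset()     - returns an initial state
      env.step(s, a)  - returns (next_state, reward, done)
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)
            # move Q(s, a) toward the delayed-reward target r + gamma * max_a' Q(s', a')
            best_next = max(Q[(s_next, x)] for x in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```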
Multi-agent RL
• Basic idea
– Combine the learning process in an unknown environment with the interactive decision process of multiple agents
– There is no single utility function to optimize
– Each agent has a different objective, and its payoff is determined by the joint action of multiple agents
Challenges in Multi-agent RL
• Curse of dimensionality
– The number of parameters to be learned increases dramatically with the number of agents
• Partial observability
– The states and actions of the other agents, which an agent needs in order to make decisions, are not fully observable
– Inter-agent communication is usually costly
– Note: partially observable Markov decision processes (POMDPs) have been used to model partial observability in probabilistic AI
The Multi-Agent Reinforcement Learning (MARL) Model
• Multiple selfish agents in a stationary dynamic environment
• The environment is modeled as a Stochastic (a.k.a. Markov) Game (SG or MG)
• Transitions and payoffs are functions of all agents' actions
The MARL Model (cont.)
• Transition probabilities and payoffs are initially unknown to the agents
• Each agent's goal – maximize its return
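A minimal sketch of the stochastic-game model described above, for the two-player case. The class layout is an illustrative assumption; the point is that both the transition distribution and the per-agent payoffs are indexed by the joint action.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = int
Action = int
JointAction = Tuple[Action, Action]  # two players, for simplicity

@dataclass
class StochasticGame:
    states: List[State]
    actions: List[Action]  # same action set for both players, for simplicity
    # P[(s, joint_a)] = list of (next_state, probability) pairs
    P: Dict[Tuple[State, JointAction], List[Tuple[State, float]]]
    # R[(s, joint_a)] = (payoff to player 1, payoff to player 2)
    R: Dict[Tuple[State, JointAction], Tuple[float, float]]

    def step(self, s: State, joint_a: JointAction):
        """Sample one transition; payoffs and next state depend on the joint action."""
        next_states, probs = zip(*self.P[(s, joint_a)])
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.R[(s, joint_a)]
```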
Typical Multi-agent RL methods
• Value iteration learning [Sutton and Barto]
– Based on different concepts of equilibrium in game theory
• Min-max solution-based learning algorithm in zero-sum stochastic games [Littman]
• Nash equilibrium-based learning algorithm [Wellman]
– Extends Littman's algorithm to general-sum games
• Correlated equilibrium-based learning algorithm [Hall]
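As an illustration of the equilibrium-based family listed above, here is a sketch (my own, not taken from the slides) of the inner step of Littman's Minimax-Q for the zero-sum case: computing a state's value and the agent's mixed strategy by linear programming, here with scipy.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Solve V(s) = max_pi min_o sum_a pi(a) * Q_s[a, o] for one state.

    Q_s[a, o] - learned Q-values of our action a against opponent action o.
    Returns (value, mixed strategy pi).
    """
    n_a, n_o = Q_s.shape
    # Decision variables: pi(a) for each of our actions, plus the value v.
    # Maximize v  <=>  minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a pi(a) * Q_s[a, o] <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The strategy is a probability distribution: sum_a pi(a) = 1
    A_eq = np.array([[1.0] * n_a + [0.0]])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_a + [(None, None)]  # v is unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]
```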
Typical Multi-agent RL methods
• Multiple-person decision theory-based
– Assume that each agent plays a best response against stationary opponents
– Require the joint actions of the agents to converge to a Nash equilibrium in self-play
– Learn quickly while losing and slowly while winning
– Learn the best response when the opponents are stationary; otherwise move towards an equilibrium
Typical Multi-agent RL methods
• Integrating RL with coordination learning
– Joint-action learners
• Independent learners
– Ignore the existence of other agents
– Just apply RL in the classic sense (the two are contrasted in the sketch below)
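A toy sketch of the contrast mentioned above (the dictionary-based Q-tables and the simplified one-step updates are illustrative assumptions): an independent learner keys its table by its own action only, while a joint-action learner keys it by everyone's actions.

```python
from collections import defaultdict

# Independent learner: Q indexed by (state, own_action) only.
# The other agents are folded into the (non-stationary) environment.
Q_independent = defaultdict(float)

def update_independent(s, a_own, r, alpha=0.1):
    Q_independent[(s, a_own)] += alpha * (r - Q_independent[(s, a_own)])

# Joint-action learner: Q indexed by (state, own_action, other_agents_actions).
# Requires observing (or being told) what the other agents did.
Q_joint = defaultdict(float)

def update_joint(s, a_own, a_others, r, alpha=0.1):
    key = (s, a_own, tuple(a_others))
    Q_joint[key] += alpha * (r - Q_joint[key])

# (One-step reward averaging for brevity; a full Q-learning target would
#  add gamma times the value of the next state, as in the sketch earlier.)
```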
Typical Multi-agent RL methods
• Hierarchical Multi-agent RL
– Each agent is given an initial hierarchical
decomposition of the overall task
– Define cooperative subtasks to be those subtasks in
which coordination among agents has significant
effect on the performance of the overall task
– Cooperative subtasks are usually defined at highest
level(s) of the hierarchy
MAL Foundation
• The game-theoretic concepts of stochastic games and Nash equilibria
• Learning algorithms use stochastic games as a natural extension of Markov decision processes (MDPs) to multiple agents
– Equilibrium learners
• Nash-Q, Minimax-Q
• Friend-or-Foe-Q
• Gradient-ascent learners
– Best-response learners
Multiagent Q-learning desiderata
• “Performs well” vs. arbitrarily adapting other agents
– A best response is probably impossible
• Doesn't need a correct model of the other agents' learning algorithms
– But modeling is fair game
• Doesn't need to know the other agents' payoffs
– Estimate the other agents' strategies from observation
• Does not assume game-theoretic play
• No assumption of a stationary outcome: the population may never reach equilibrium, and agents may never stop adapting
• Self-play: convergence to a repeated-game Nash equilibrium would be nice but is not necessary (it is unreasonable to seek convergence to a one-shot Nash)
Finding Nash equilibrium
• A game-theoretic approach that assumes complete knowledge of the reward structure of the underlying game by all the agents
– Each agent calculates an equilibrium by using mathematical programming
– It assumes that the other agents are rational
Potential applications of MARL
• E-commerce – agents buying and selling over
the internet.
• Autonomous computing, e.g., automatic fault
recovery.
• Exploration of environments that are
inaccessible to humans: bottom of oceans,
space, etc…
The End