Learning in Multiagent Systems
Advanced AI Seminar
Michael Weinberg
The Hebrew University of Jerusalem, Israel
March 2003
Agenda






What is learning in MAS?
General Characterization
Learning and Activity Coordination
Learning about and from Other Agents
Learning and Communication
Conclusions
What is Learning

Learning can be informally defined as:

"The acquisition of new knowledge and motor or cognitive skills and the incorporation of the acquired knowledge and skills in future system activities, provided that this acquisition and incorporation is conducted by the system itself and leads to an improvement in its performance."
Learning in Multiagent Systems

Intersection of DAI and ML

Why bring them together?


There is a strong need to equip multiagent systems with learning abilities
The extended view of ML as multiagent learning is qualitatively different from traditional ML and can lead to novel ML techniques and algorithms
General Characterization



Principal categories of learning
The features in which learning approaches may differ
The fundamental learning problem known as the credit-assignment problem
Principal Categories

Centralized Learning (isolated learning)


Learning executed by a single agent, with no interaction with other agents
Several centralized learners may pursue different or identical goals at the same time
Principal Categories

Decentralized Learning (interactive learning)



Several agents are engaged in the same learning process
Several groups of agents may pursue different or identical learning goals at the same time
A single agent may be involved in several centralized/decentralized learning processes at the same time
Distinguishing Features:
The degree of decentralization

The degree of decentralization:
Distributedness
Parallelism
Distinguishing Features:
Interaction-specific features

Classification of the interactions required for realizing a decentralized learning process:
The level of interaction
The persistence of interaction
The frequency of interaction
The variability of interaction
Distinguishing Features:
Involvement-specific features

Features that characterize the involvement of an agent in a learning process:
The relevance of involvement
The role played during involvement
Distinguishing Features:
Goal-specific features

Features that characterize the learning goal:
The type of improvement that learning is intended to achieve
The compatibility of the learning goals pursued by the agents
Distinguishing Features:
The learning method

The following learning methods are distinguished:
Rote learning
Learning from instruction and by advice taking
Learning from examples and by practice
Learning by analogy
Learning by discovery
The main difference between them is the amount of learning effort required
Distinguishing Features:
The learning feedback

The learning feedback indicates the performance level achieved so far
The following types of learning feedback are distinguished:
Supervised learning (teacher)
Reinforcement learning (critic)
Unsupervised learning (observer)
The Credit-Assignment Problem


The problem of properly assigning feedback for an overall performance change to each of the system activities that contributed to that change
Can be usefully decomposed into two subproblems:
The inter-agent CAP
The intra-agent CAP
The inter-agent CAP

Assignment of credit or blame for an overall performance change to the external actions of the agents
The intra-agent CAP

Assignment of credit or blame for a particular external action of an agent to its underlying internal inferences and decisions
Learning and
Activity Coordination



Previous research on coordination focused on off-line design of behavioral rules, negotiation protocols, etc.
Agents operating in open, dynamic environments must be able to adapt to changing demands and opportunities
How can agents learn to appropriately coordinate their activities?
Reinforcement Learning


Agents choose the next action so as to maximize a scalar reinforcement or feedback received after each action
The learner's environment can be modeled by a discrete-time, finite-state Markov Decision Process (MDP)
Markov Decision Process (MDP)


An MDP is a reinforcement learning task that satisfies the Markov state property
A Markov state means the dynamics can be specified completely by

$$\Pr\{s_{t+1} = s',\; r_{t+1} = r \mid s_t, a_t\}$$
Reinforcement Learning (cont)

The environment in an MDP is represented by a 4-tuple $\langle S, A, P, r \rangle$:
$S$ is a set of states
$A$ is a set of actions
$P : S \times S \times A \to [0,1]$ is the state-transition function
$r : S \times A \to \mathbb{R}$ is the reward function
Each agent maintains a policy $\pi$ that maps states into desirable actions
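To make the 4-tuple concrete, the following is a minimal sketch, assuming a hypothetical two-state toy domain, of how $\langle S, A, P, r \rangle$ and a policy could be laid out in Python; the names and numbers are illustrative only, not part of the original formulation.

```python
# A minimal, illustrative MDP representation (hypothetical toy domain).
# P[(x, y, a)] is the probability of moving from state x to state y under action a;
# r[(x, a)] is the immediate reward for taking action a in state x.

states = ["s0", "s1"]
actions = ["left", "right"]

P = {
    ("s0", "s0", "left"): 1.0, ("s0", "s1", "left"): 0.0,
    ("s0", "s0", "right"): 0.2, ("s0", "s1", "right"): 0.8,
    ("s1", "s0", "left"): 0.9, ("s1", "s1", "left"): 0.1,
    ("s1", "s0", "right"): 0.0, ("s1", "s1", "right"): 1.0,
}

r = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 0.5, ("s1", "right"): 0.0,
}

# A policy maps each state to a desirable action.
policy = {"s0": "right", "s1": "left"}
```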
Q-Learning Algorithm

A reinforcement learning algorithm
Maintains a table of Q-values
$Q(x,a)$ – "how good is action $a$ in state $x$?"
Converges to the optimal Q-values with probability 1
Q-Learning Algorithm (cont)

At step $n$ the agent performs the following steps:
Observe its current state $x_n$
Select and perform action $a_n$
Observe the subsequent state $y_n$
Receive immediate payoff $r_n$
Adjust the $Q_{n-1}$ values
Discounted Sum of Future
Rewards


Q-learning finds an optimal policy that maximizes the total discounted expected reward
Discounted reward – a reward received $s$ steps hence is worth less than a reward received now, by a factor of $\gamma^{s}$
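As a worked form of this statement, the quantity being maximized is the expected discounted return, which in the slide's notation (a standard formulation) reads:

$$V^{\pi}(x_0) \;=\; \mathbb{E}\!\left[\sum_{s=0}^{\infty} \gamma^{s}\, r_{s} \;\middle|\; x_0,\ \pi\right], \qquad 0 \le \gamma < 1$$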
Evaluating the Policy

Under policy $\pi$ the value of state $x$ is:

$$V^{\pi}(x) = R_{x}(\pi(x)) + \gamma \sum_{y} P_{xy}[\pi(x)]\, V^{\pi}(y)$$

The optimal policy $\pi^{*}$ satisfies:

$$V^{*}(x) = \max_{a}\Big\{ R_{x}(a) + \gamma \sum_{y} P_{xy}[a]\, V^{*}(y) \Big\}$$
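Where the transition model $P$ and rewards $R$ are known, the optimality equation above can be solved directly; the sketch below shows plain value iteration in Python under that assumption (Q-learning, by contrast, needs no model). The data layout follows the illustrative MDP sketch given earlier.

```python
# Value iteration for the Bellman optimality equation (illustrative sketch;
# assumes the transition probabilities P and rewards r are known).
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-6):
    V = {x: 0.0 for x in states}
    while True:
        delta = 0.0
        for x in states:
            best = max(
                r[(x, a)] + gamma * sum(P[(x, y, a)] * V[y] for y in states)
                for a in actions
            )
            delta = max(delta, abs(best - V[x]))
            V[x] = best
        if delta < tol:
            return V

# Example (using the toy structures from the earlier MDP sketch):
# V_star = value_iteration(states, actions, P, r)
```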
Q-Values

Under policy $\pi$, define the Q-values as:

$$Q^{\pi}(x,a) = R_{x}(a) + \gamma \sum_{y} P_{xy}[a]\, V^{\pi}(y)$$

i.e., the value of executing action $a$ and following policy $\pi$ thereafter
Adjusting Q-Values

Update the Q-values as follows:

If $x = x_n$ and $a = a_n$:

$$Q_{n}(x,a) = (1-\alpha)\, Q_{n-1}(x,a) + \alpha\, \big[ r_{n} + \gamma\, V_{n-1}(y_{n}) \big]$$

Otherwise:

$$Q_{n}(x,a) = Q_{n-1}(x,a)$$

where $V_{n-1}(y) = \max_{b} Q_{n-1}(y,b)$
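A minimal sketch of this update rule as tabular Q-learning in Python; the `env` object with `reset()` and `step()` methods, the epsilon-greedy exploration, and all parameter values are assumptions for illustration, while the update line itself follows the formula above.

```python
import random
from collections import defaultdict

# Tabular Q-learning (illustrative sketch; the `env` interface is assumed).
def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(x, a)], initialized to 0
    for _ in range(episodes):
        x = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (an assumed exploration scheme)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(x, b)])
            y, r, done = env.step(a)                    # observe next state and payoff
            v_next = max(Q[(y, b)] for b in actions)    # V(y) = max_b Q(y, b)
            Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * v_next)
            x = y
    return Q
```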
Isolated, Concurrent
Reinforcement Learners


Reinforcement learners develop action-selection policies that optimize environmental feedback
Can be used in domains:
With no pre-existing domain expertise
With no information about other agents
RL can provide new coordination techniques where currently available coordination schemes are ineffective
Isolated, Concurrent
Reinforcement Learners



Each agent learns to optimize its reinforcement from the environment
Other agents are not explicitly modeled
An interesting research question is whether it is feasible for such an agent to use the same learning mechanism in both cooperative and non-cooperative environments
Isolated, Concurrent
Reinforcement Learners



An assumption of most RL techniques is that the dynamics of the environment are not affected by other agents
This assumption is invalid in domains with multiple, concurrent learners
Standard RL is probably not adequate for concurrent, isolated learning of coordination
Isolated, Concurrent
Reinforcement Learners

The following dimensions were identified to characterize domains amenable to CIRL:
Agent coupling (tightly/loosely)
Agent relationships (cooperative/adversarial)
Feedback timing (immediate/delayed)
Optimal behavior combinations
Experiments with CIRL

Conclusions:




Through CIRL, both friends and foes can concurrently acquire useful coordination information
No prior knowledge of the domain is needed
No explicit model of the capabilities of other agents is required
Limitations:
Inability to develop effective coordination when agents are strongly coupled, feedback is delayed, and there are only a few optimal behavior combinations
Experiments with CIRL

A possible fix to the last limitation is “lockstep learning”:

Two agents synchronize their behavior so that one is learning while the other is following a fixed policy, and vice versa
Interactive Reinforcement
Learning of Coordination


Agents can explicitly communicate to decide on individual and group actions
A few algorithms for interactive RL:
Action Estimation Algorithm
Action Group Estimation Algorithm
Learning about and from
Other Agents


Agents learn to improve their individual performance
They can better capitalize on available opportunities by predicting the behavior of other agents (preferences, strategies, intentions, etc.)
Learning Organizational Roles


Assume agents have the capability of playing one of several roles in a situation
Agents need to learn role assignments to effectively complement each other
Learning Organizational Roles

The framework includes Utility, Probability, and Cost (UPC) estimates of a role adopted in a particular situation:
Utility – the worth of the desired final state if the agent adopts the given role in the current situation
Probability – the likelihood of reaching a successful final state (given the role/situation)
Cost – the associated computational cost incurred
Potential – the usefulness of a role in discovering pertinent global information
Learning Organizational Roles:
Theoretical Framework



$S_k, R_k$ – the sets of situations and roles for agent $k$
An agent maintains $|S_k| \times |R_k|$ vectors of UPC values
During the learning phase, a role $r$ is chosen in situation $s$ with probability:

$$\Pr(r) = \frac{f(U_{rs}, P_{rs}, C_{rs}, Potential_{rs})}{\sum_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Potential_{js})}$$

$f$ rates a role by combining the component measures
Learning Organizational Roles:
Theoretical Framework

After the learning phase is over, the role to be played in situation $s$ is:

$$r = \arg\max_{j \in R_k} f(U_{js}, P_{js}, C_{js}, Potential_{js})$$

UPC values are learned using reinforcement learning
UPC estimates after $n$ updates: $\hat{U}^{n}_{rs},\ \hat{P}^{n}_{rs},\ \widehat{Potential}^{n}_{rs}$
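A minimal sketch of the two selection modes in Python. The rating function `f` shown here (a simple combination of the four components) is an assumption for illustration; the framework itself leaves the exact combination open.

```python
import random

# Illustrative rating function; the actual combination f is domain-dependent (assumption).
def f(u, p, c, potential):
    return max(u * p - c + potential, 1e-9)   # keep ratings positive

def choose_role_learning(upc, situation, roles):
    """During the learning phase: pick a role with probability proportional to f."""
    weights = [f(*upc[(r, situation)]) for r in roles]
    total = sum(weights)
    return random.choices(roles, weights=[w / total for w in weights])[0]

def choose_role_final(upc, situation, roles):
    """After the learning phase: pick the role maximizing f."""
    return max(roles, key=lambda r: f(*upc[(r, situation)]))

# upc[(role, situation)] = (U, P, C, Potential); toy values for illustration:
upc = {("attacker", "s0"): (0.8, 0.6, 0.1, 0.2),
       ("defender", "s0"): (0.5, 0.9, 0.1, 0.1)}
print(choose_role_final(upc, "s0", ["attacker", "defender"]))
```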
Learning Organizational Roles:
Updating the Utility


$S$ – the situations encountered between the time of adopting role $r$ in situation $s$ and reaching a final state $F$ with utility $U_F$
The utility values for all roles chosen in each of the situations in $S$ are updated:

$$\hat{U}^{n+1}_{rs} = (1-\alpha)\, \hat{U}^{n}_{rs} + \alpha\, U_F$$
Learning Organizational Roles:
Updating the Probability

$O : S \to [0,1]$ – returns 1 if the given state is successful
The update rule for the probability:

$$\hat{P}^{n+1}_{rs} = (1-\alpha)\, \hat{P}^{n}_{rs} + \alpha\, O(F)$$
Learning Organizational Roles:
Updating the Potential

$Conf(S)$ – returns 1 if, on the path to the final state, conflicts are detected and resolved by information exchange
The update rule for the potential:

$$\widehat{Potential}^{n+1}_{rs} = (1-\alpha)\, \widehat{Potential}^{n}_{rs} + \alpha\, Conf(S)$$
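All three update rules share the same exponential-averaging form; a minimal sketch in Python, with an assumed learning rate:

```python
# Exponential-averaging updates for the UPC estimates (illustrative sketch).
ALPHA = 0.1  # learning rate, an assumed value

def update_utility(u_hat, u_final, alpha=ALPHA):
    return (1 - alpha) * u_hat + alpha * u_final

def update_probability(p_hat, success, alpha=ALPHA):
    # success = O(F): 1 if the final state was successful, else 0
    return (1 - alpha) * p_hat + alpha * success

def update_potential(pot_hat, conflicts_resolved, alpha=ALPHA):
    # conflicts_resolved = Conf(S): 1 if conflicts were detected and
    # resolved by information exchange on the path to the final state
    return (1 - alpha) * pot_hat + alpha * conflicts_resolved
```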
Learning Organizational Roles:
Robotic Soccer Game


Most implementations of robotic soccer teams use the approach of learning organizational roles
They use a layered learning methodology:
Low-level skills (e.g., shooting the ball)
High-level decision making (e.g., whom to pass to)
Learning in Market
Environments


Buyers and sellers trade in electronic marketplaces
Three types of agents:
0-level agents: do not model the behavior of others
1-level agents: model others as 0-level agents
2-level agents: model others as 1-level agents
Learning to Exploit an Opponent:
Model-Based Approach

The most prominent approach in AI for developing playing strategies is the minimax algorithm
It assumes that the opponent will always choose the move that is worst for the player
An accurate model of the opponent can be used to develop better strategies
Learning to Exploit an Opponent:
Model-Based Approach

The main problem of RL is its slow convergence
The model-based approach tries to reduce the number of interaction examples needed for learning
It performs a deeper analysis of past interaction experience
Model-Based Approach

The learning process is split into two separate stages:
Infer a model of the other agent based on past experience
Utilize the learned model to design an effective interaction strategy for the future
Inferring a Best-Response
Strategy



Represent the opponent's model as a DFA
Example: the Tit-for-Tat (TFT) strategy for the Iterated Prisoner's Dilemma (IPD) game
Theorem: given a DFA opponent model $\hat{M}$, there exists a best-response DFA $M^{opt}$ that can be computed in time polynomial in the size of $\hat{M}$
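To make the DFA representation concrete, here is a minimal sketch of Tit-for-Tat encoded as a two-state automaton in Python; the encoding (a Moore-style machine whose state label is the action to play) is an illustrative choice, not the paper's exact construction.

```python
# Tit-for-Tat as a two-state automaton over the opponent's moves (illustrative).
# States are labelled by the action TFT will play in them; on observing the
# other player's move, TFT transitions to the state labelled with that move.

TFT = {
    "start": "C",                       # cooperate on the first round
    "output": {"C": "C", "D": "D"},     # action played in each state
    "delta": {                          # transition on the observed move
        ("C", "C"): "C", ("C", "D"): "D",
        ("D", "C"): "C", ("D", "D"): "D",
    },
}

def play_against(model, opponent_moves):
    """Run the automaton model against a fixed sequence of opponent moves."""
    state, history = model["start"], []
    for move in opponent_moves:
        history.append(model["output"][state])
        state = model["delta"][(state, move)]
    return history

print(play_against(TFT, ["C", "D", "D", "C"]))  # -> ['C', 'C', 'D', 'D']
```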
Learning Models of Opponents


The US-L* algorithm infers a DFA that is consistent with the sample of the opponent's behavior
The US-L* algorithm extends the model according to three guiding principles:
Consistency: the new model must be consistent with the given sample
Compactness: a smaller model is better
Stability: the new model should be as similar to the previous model as possible
Exploring the Opponent’s
Strategy



One of the weaknesses of the model-based approach is that it can converge to sub-optimal behavior
Playing the best response ignores the possibility that the current model is not identical to the opponent's strategy
This is known as the exploration vs. exploitation dilemma
Exploration vs Exploitation



The learning player has to choose between the wish to exploit the current model and the desire to explore other alternatives
For stationary opponents it is rational to explore more frequently at early stages
A Boltzmann distribution can be used for action selection
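A minimal sketch of Boltzmann action selection in Python; the value estimates and the cooling schedule are assumed for illustration. A high temperature early on favors exploration, while a low temperature later favors exploiting the current model.

```python
import math
import random

def boltzmann_select(values, temperature):
    """Pick an action with probability proportional to exp(value / temperature)."""
    actions = list(values)
    weights = [math.exp(values[a] / temperature) for a in actions]
    total = sum(weights)
    return random.choices(actions, weights=[w / total for w in weights])[0]

# Illustrative use: cool the temperature over time to shift from exploration
# to exploitation against a stationary opponent.
q = {"cooperate": 2.0, "defect": 3.0}
for step in range(1, 6):
    temperature = max(0.1, 5.0 / step)   # assumed cooling schedule
    print(step, boltzmann_select(q, temperature))
```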
Reducing Communication by
Learning


Learning is a method for reducing the communication load among agents
Consider the contract-net approach:
Broadcasting of task announcements is assumed
Scalability problems arise when the number of managers/tasks increases
Reducing Communication in
Contract-Net



A flexible learning-based mechanism called addressee learning
It enables agents to acquire knowledge about the other agents' task-solving abilities
Tasks may then be assigned more directly
Reducing Communication in
Contract-Net



Case-based reasoning is used for knowledge acquisition and refinement
Humans often solve problems using solutions that worked well for similar problems
Construct cases – problem-solution pairs
Case-Based Reasoning in
Contract Net

Each agent maintains its own case base
A case consists of:
A task specification $T_i = \{A_{i1} = V_{i1}, \ldots, A_{im_i} = V_{im_i}\}$ (attribute-value pairs)
Information about which agent already solved the task and the quality of the solution
A similarity measure for tasks is needed
Case-Based Reasoning in
Contract Net


The distance between two attributes, $Dist(A_{ir}, A_{js})$, is domain-specific
The similarity between two tasks $T_i$ and $T_j$ is:

$$Similar(T_i, T_j) = \sum_{r} \sum_{s} Dist(A_{ir}, A_{js})$$

For a task $T_i$, the set of similar tasks is:

$$S(T_i) = \{ T_j : Similar(T_i, T_j) \ge 0.85 \}$$
Case-Based Reasoning in
Contract Net


An agent has to assign task $T_i$ to another agent
Select the most appropriate agents by computing their suitability:

$$Suit(A, T_i) = \frac{1}{|S(T_i)|} \sum_{T_j \in S(T_i)} Perform(A, T_j)$$
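A minimal sketch of the similarity and suitability computations in Python; the per-attribute distance function, the normalization inside `similar`, and the case-base layout are assumptions for illustration, while `suitability` mirrors the averaging formula above.

```python
# Case-based addressee selection (illustrative sketch; data layout is assumed).
# A task is a dict of attribute -> value; each case stores a past task and the
# recorded solution quality per agent, e.g. {"task": {...}, "performance": {"A1": 0.9}}.

def dist(value_i, value_j):
    # Domain-specific attribute comparison; a toy 0/1 match score here (assumption).
    return 1.0 if value_i == value_j else 0.0

def similar(task_i, task_j):
    # Pairwise attribute comparison, normalized so a 0.85 threshold is meaningful
    # (the normalization is an assumption, not part of the slide's formula).
    pairs = [(vi, vj) for vi in task_i.values() for vj in task_j.values()]
    return sum(dist(vi, vj) for vi, vj in pairs) / len(pairs)

def suitability(agent, task_i, case_base, threshold=0.85):
    # Suit(A, T_i): average of Perform(A, T_j) over the tasks T_j similar to T_i.
    s_ti = [c for c in case_base if similar(task_i, c["task"]) >= threshold]
    if not s_ti:
        return 0.0
    return sum(c["performance"].get(agent, 0.0) for c in s_ti) / len(s_ti)
```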
Improving Learning by
Communication

Two forms of improving learning by communication are distinguished:
Learning based on low-level communication (e.g., exchanging missing information)
Learning based on high-level communication (e.g., mutual explanation)
Improving Learning by
Communication

Example: Predator-Prey domain




Predators are Q-learners
Each predator has limited visual perception
Predators exchange sensor data – low-level communication
Experiments show that this clearly leads to improved learning results
Some Open Questions…




What are the unique requirements and conditions for multiagent learning?
Do centralized and decentralized learning qualitatively differ from each other?
Development of theoretical foundations of decentralized learning
Applications of multiagent learning in complex real-world environments
Conclusions



This area is of particular interest to DAI as well as to ML
We discussed the characterization of learning in MAS
We showed several concrete learning approaches and the main streams in this area:
Learning and activity coordination
Learning about and from other agents
Learning and communication
Bibliography




"Multiagent Systems", Chapter 6
Watkins, "Q-Learning", Machine Learning, vol. 8 (1992)
Carmel & Markovitch, "Model-based Learning of Interaction Strategies in MAS"
Stone & Veloso, "Multiagent Systems: A Survey from a Machine Learning Perspective"
Thank You