Learning to Negotiate Optimally in Non-stationary Environments
Vidya Narayanan
Nick Jennings
University of Southampton
CIA Workshop, Edinburgh, September 11th-13th 2006
Overview
• Agent negotiation
• Machine learning techniques
• Model description
• Algorithm description
• Results
• Conclusion
• Future Work
Agent Negotiation
• Large multi-agent systems
• Different goals
• Resolving conflicts
Agent Negotiation
• Operating in Non-Stationary Environments
  – In real-world applications, everything is dynamic:
    • Operational objectives
    • Parameters
    • Constraints
Agent Negotiation
• Effective Performance
– Learning
– Adaptation
Machine Learning in Literature
• Reinforcement Learning
– Advantages
  • Modelling interactions in multi-agent systems
  • Q-learning: model-free learning
– Drawbacks
  • Difficult to model a specific process like negotiation
  • Computationally intensive
Machine Learning in Literature
• Bayesian Learning.
• Belief Update Process
– Data or information from environment
– Prior Distribution
– Update prior using data
• Negotiation as a Bayesian learning process (see the sketch below).
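As a rough illustration of this belief-update process (a minimal sketch, not the authors' implementation; the hypothesis count and all numbers are made up), a single Bayesian update over a small set of hypotheses looks like this in Python:

```python
import numpy as np

# One Bayesian belief update: posterior is proportional to prior x likelihood.
# Prior over three candidate hypotheses (illustrative values only).
prior = np.array([1/3, 1/3, 1/3])

# Likelihood of the observed data (e.g. an offer) under each hypothesis.
likelihood = np.array([0.4, 0.1, 0.5])

# Update the prior using the data, then renormalise.
posterior = prior * likelihood
posterior /= posterior.sum()

print(posterior)  # updated belief over the hypotheses
```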
Description of the Environment
• Memoryless Property
• Justification for assumption
• How does it help?
Non-Stationary Environment
• Determining the current strategy profile
  – Given the initial strategy profile
  – Markov property
  – Chapman-Kolmogorov equation
• Computing the current probability distribution in non-stationary environments
Description of Model
• Information about agent
– Payoff function
• Information known about opponent
– Initial strategy profile
– Offer at each stage of negotiation
• Need to compute
– Strategy yielding maximum payoff
• What varies?
– Opponent’s strategy profile
– Agent’s own payoff function
• Objective
– Determine opponent’s strategy profile and maximise own payoff
• Solution outline
  – Data --- offer price
  – Strategy profile --- transition probability matrices
  – Prior --- arbitrary distribution over strategy profiles
  – Bayesian learning --- true distribution
  – Compute own strategy profile to maximise payoff
Algorithm
Inputs to the Algorithm
• Offer Price of opponent
  – Offer price range: 100--500; initial offer: 110
• Initial strategy profile: hypothesised transition probability matrices
  H_0^1(t) = \begin{pmatrix} 0.5 & 0.25 & 0.25 \\ 0.75 & 0 & 0.25 \\ 0.2 & 0.1 & 0.7 \end{pmatrix}
  H_0^2(t) = \begin{pmatrix} 0.5 & 0.2 & 0.3 \\ 0.1 & 0 & 0.9 \\ 0.6 & 0.2 & 0.2 \end{pmatrix}
  H_0^3(t) = \begin{pmatrix} 0.1 & 0.8 & 0.1 \\ 0.7 & 0.15 & 0.15 \\ 0.1 & 0.7 & 0.2 \end{pmatrix}
• Initial distribution over strategy profiles: (0.4, 0.2, 0.4)
  \Pr\{H_0^1(0)\} = 0.4, \quad \Pr\{H_0^2(0)\} = 0.2, \quad \Pr\{H_0^3(0)\} = 0.4
• Likelihood of the opponent's offer under each hypothesis
  \Pr\{O_0^p(t) \mid H_0^1(t)\} = 0.4, \quad \Pr\{O_0^p(t) \mid H_0^2(t)\} = 0.1, \quad \Pr\{O_0^p(t) \mid H_0^3(t)\} = 0.5
• Assumptions
  – Relationship between the opponent's strategy and its offer price
  – Set of arbitrary strategy profiles
  – Initial distribution over strategies, e.g. equally likely
  – Markov property
• Compute updated probabilities over strategy profiles (these inputs appear in the sketch below)
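For concreteness, these example inputs can be written down directly; the sketch below (variable names are mine, not the paper's) encodes the three hypothesised strategy profiles, the initial distribution (0.4, 0.2, 0.4) and the stated likelihoods as NumPy arrays:

```python
import numpy as np

# Opponent's offer price: range 100-500, initial offer 110 (from the slide).
offer_range = (100, 500)
initial_offer = 110

# Hypothesised strategy profiles H_0^1, H_0^2, H_0^3: row-stochastic
# transition probability matrices over the opponent's strategies.
H = [
    np.array([[0.50, 0.25, 0.25],
              [0.75, 0.00, 0.25],
              [0.20, 0.10, 0.70]]),
    np.array([[0.50, 0.20, 0.30],
              [0.10, 0.00, 0.90],
              [0.60, 0.20, 0.20]]),
    np.array([[0.10, 0.80, 0.10],
              [0.70, 0.15, 0.15],
              [0.10, 0.70, 0.20]]),
]

prior = np.array([0.4, 0.2, 0.4])        # Pr{H_0^i(0)}
likelihood = np.array([0.4, 0.1, 0.5])   # Pr{O_0^p(t) | H_0^i(t)}
```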
Steps of the Algorithm
• Compute the new hypothesis:
  H_0^{new}(t) = \sum_{i=0}^{k} \Pr\{H_0^i(t) \mid O_0^p(t)\} \cdot H_0^i(t)
• Compute the current opponent strategy profile using the result:
  P_0^0(t) = H_0^{new}(t)
• Compute own strategy profile from the payoff function for maximum payoff by linear programming (see the sketch below), with the example payoff matrix
  u_0^x = \begin{pmatrix} 1 & 0 & 2 \\ 2.5 & -1 & 2 \\ -1.5 & 1.5 & 1 \end{pmatrix}
Learning in the algorithm
• Repeat steps for each stage of negotiation.
• Learning over repeated negotiations.
• Convergence --- True probability distribution
over repeated negotiations.
Learning in the Algorithm
[Diagram: at each step of the algorithm, the distribution over strategy profiles is a random distribution in the 1st negotiation, an updated distribution in the 2nd negotiation, and converges to the correct distribution over repeated negotiations.]
Results
Conclusions
• New framework --- non-stationary negotiation.
• Theorem --- computing the current distribution in non-stationary environments.
• Algorithm for negotiation.
• Theorem --- convergence of the algorithm to the optimal strategy.
Conclusions
• Evaluated by varying
– Number of Hypotheses
– Strategy Profiles
– Payoff Functions
• Shown empirically that convergence is rapid.
Future Work
• Analytical relationship between offer price and strategy profile.
• Comparison of reinforcement learning (RL) and Bayesian learning (BL).
Questions??
Pr{ X n1  x | X n  xn , X n1  xn1 ,...}  Pr{ X n1  x | X n  xn }
back
Chapman-Kolmogorov equation:
P_{ij}^{n} = \sum_{k=0}^{m} P_{ik}^{r} \, P_{kj}^{s}, \quad n = r + s
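A quick numeric check of the relation for a stationary chain (the matrix and the split n = r + s are illustrative, not taken from the talk):

```python
import numpy as np

# Any row-stochastic transition matrix will do for the check.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

r, s = 2, 3
lhs = np.linalg.matrix_power(P, r + s)                             # n-step probabilities
rhs = np.linalg.matrix_power(P, r) @ np.linalg.matrix_power(P, s)  # r-step times s-step

print(np.allclose(lhs, rhs))  # True: P^n_ij = sum_k P^r_ik * P^s_kj
```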
Bayesian update of the hypothesis probabilities:
\Pr\{H_n^i(t) \mid O_n^p(t)\} = \frac{\Pr\{H_n^i(t)\} \cdot \Pr\{O_n^p(t) \mid H_n^i(t)\}}{\sum_{i=0}^{k} \Pr\{H_n^i(t)\} \cdot \Pr\{O_n^p(t) \mid H_n^i(t)\}}
Current distribution in a non-stationary environment:
p_k^n(t) = \sum_{j=0}^{m} p_j^0(0) \, Q_{jk}^n(t), \quad \text{where} \quad Q_{jk}^n(t) = \big[\, P^0(0) \times P^1(1) \times \cdots \times P^{n-1}(t-1) \,\big]_{jk}
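Under my reading of this result (a sketch, not the authors' code): in a non-stationary chain, the distribution at time t is the initial distribution propagated through the product of the time-varying transition matrices. The drifting example matrix below is hypothetical.

```python
import numpy as np

def distribution_at(t, p0, P_of):
    """Return p(t) = p(0) . P(0) . P(1) . ... . P(t-1).

    p0   : initial distribution (row vector)
    P_of : function mapping a time step to that step's transition matrix
    """
    Q = np.eye(len(p0))
    for step in range(t):
        Q = Q @ P_of(step)   # accumulate Q(t) as the product of the matrices
    return p0 @ Q

# Hypothetical non-stationary chain whose transition matrix drifts over time.
def P_of(step):
    eps = 0.05 * step
    return np.array([[0.8 - eps, 0.2 + eps],
                     [0.3,       0.7      ]])

p0 = np.array([0.5, 0.5])
print(distribution_at(3, p0, P_of))
```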
Linear program for the maximum-payoff strategy:
s_{\max}^n(t) = \max \; [s_1(t), s_2(t), \ldots, s_m(t)]_n \cdot u_n^x(t) \cdot [s_1^0(t), s_2^0(t), \ldots, s_m^0(t)]^T
\text{s.t.} \quad \sum_{i=0}^{m} s_i = 1, \quad s_i \ge 0 \;\; \forall i
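A minimal sketch of this linear program using scipy.optimize.linprog; the opponent strategy s^0 is an illustrative placeholder, and since the objective is linear in the agent's own strategy s once u and s^0 are fixed, linprog (which minimises) is given the negated objective:

```python
import numpy as np
from scipy.optimize import linprog

# Example payoff matrix u_0^x (from the slides) and an assumed opponent strategy s^0.
u = np.array([[ 1.0,  0.0, 2.0],
              [ 2.5, -1.0, 2.0],
              [-1.5,  1.5, 1.0]])
s0 = np.array([0.4, 0.2, 0.4])

# Maximise s . u . s0  <=>  minimise -(u @ s0) . s.
c = -(u @ s0)

# Constraints: the strategy is a probability distribution.
res = linprog(c, A_eq=np.ones((1, 3)), b_eq=[1.0], bounds=[(0, None)] * 3)

print(res.x)     # maximising strategy profile
print(-res.fun)  # maximum expected payoff
```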