Adaptive Sequential Decision Making
with Self-Interested Agents
David C. Parkes
Division of Engineering and Applied Sciences
Harvard University
http://www.eecs.harvard.edu/econcs
Wayne State University
October 17, 2006
Context
• Multiple agents
• Self-interest
• Private information about preferences, capabilities
• Coordinated decision problem
– social planner
– auctioneer
Social Planner: LaGuardia Airport
Social Planner: WiFi @ Starbucks
Self-interested Auctioneer: Sponsored Search
This talk: Sequential Decision Making
• Multiple time periods
• Agent arrival and departure
• Values for sequences of decisions
• Learning by agents and the “center”
• Example scenarios:
– allocating computational/network resources
– sponsored search
– last-minute ticket auctions
– bidding for shared cars, air-taxis, …
– …
Markov Decision Process
State st, action at, reward r(at, st), transitions Pr(st+1 | at, st) over states st, st+1, st+2, … plus self-interest.
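
Setting self-interest aside for a moment, the underlying coordination problem is standard MDP planning. A minimal value-iteration sketch in Python, with the transition and reward arrays as assumed inputs, just to fix notation:

```python
import numpy as np

def value_iteration(P, R, beta=0.95, tol=1e-8):
    """Solve a finite MDP by value iteration.
    P[a][s, s'] = Pr(s_{t+1} = s' | a_t = a, s_t = s)
    R[a][s]     = r(a, s); beta is the discount factor."""
    n_actions = len(P)
    V = np.zeros(P[0].shape[0])
    while True:
        Q = np.array([R[a] + beta * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)  # values + greedy policy
        V = V_new
```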
Online Mechanisms
M = (π, p), where πt : S → A selects actions and pt : S → Rn selects payments.
• Each period:
– agents report state/rewards
– center picks action, payments
• Main question:
– what policies can be implemented in a game-theoretic equilibrium?
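
Schematically, one period of an online mechanism M = (π, p) looks like the sketch below; the `policy` and `payments` callables are hypothetical stand-ins, not a real API:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class OnlineMechanism:
    """Schematic M = (pi, p): a decision policy plus a payment rule."""
    policy: Callable[[Tuple], object]         # pi_t : S -> A
    payments: Callable[[Tuple], List[float]]  # p_t  : S -> R^n

    def run_period(self, reports):
        # 1. Agents report their (claimed) states / rewards.
        state = tuple(reports)
        # 2. Center picks an action and per-agent payments.
        return self.policy(state), self.payments(state)
```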
Outline
• Multi-armed Bandits Problem [agent learning]
– canonical, stylized learning problem from AI
– introduce a multi-agent variation
– provide a mechanism to bring optimal coordinated
learning into an equilibrium
• Dynamic auction problem [center learning]
– resource allocation (e.g. WiFi)
– dynamic arrival & departure of agents
– provide a truthful, adaptive mechanism
Multi-Armed Bandit Problem
• Multi-armed bandit (MAB) problem
• n arms
• Each arm has a stationary, uncertain reward process
• Goal: implement a (Bayesian) optimal learning policy
Optimal Learning as Planning
Tractability: Gittins’ result
• Theorem [Gittins & Jones 1974]: The complexity of computing an optimal joint policy for a collection of n Markov chains is linear in n.
– There exist independent index functions such that the chain with the highest “Gittins index” at any given time should be activated.
– The index can be computed as the optimal value of a “restart-in-i” MDP, solved using an LP (Katehakis & Veinott ’87)
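
A small sketch of the restart-in-i computation, using value iteration in place of the LP; the transition matrix P, reward vector r, and discount beta are assumed known:

```python
import numpy as np

def gittins_index(P, r, i, beta=0.9, tol=1e-10):
    """Gittins index of state i via the restart-in-i MDP
    (Katehakis & Veinott '87 formulation), solved here by value
    iteration rather than the LP for simplicity.

    In every state the controller either continues from s or
    restarts the chain from state i; the index equals
    (1 - beta) times the optimal value at state i."""
    V = np.zeros(len(r))
    while True:
        cont = r + beta * P @ V           # keep playing from s
        restart = r[i] + beta * P[i] @ V  # restart from state i
        V_new = np.maximum(cont, restart)
        if np.max(np.abs(V_new - V)) < tol:
            return (1 - beta) * V_new[i]
        V = V_new

# At each step, activate the arm whose current state has the highest
# index; this recovers the optimal joint policy.
```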
Self-Interest + MABP
• Multi-armed bandit (MAB) problem
• n arms (arm == agent)
• Each arm has a stationary, uncertain reward process (privately observed)
• Goal: implement a (Bayesian) optimal learning policy
Mechanism
(figure: agents A1, A2, A3 exchange reports and rewards with the mechanism)
Review: The Vickrey Auction
• Rules: “sell to the highest bidder at the second-highest price”
Alice: $10
Bob: $8
Carol: $6
• How should you bid? Truthfully! (dominant-strategy equilibrium)
• Alice wins for $8
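
The rule fits in a few lines; this sketch simply replays the slide's example:

```python
def vickrey(bids):
    """Sell to the highest bidder at the second-highest price."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    return ranked[0], bids[ranked[1]]    # (winner, price)

print(vickrey({"Alice": 10, "Bob": 8, "Carol": 6}))  # ('Alice', 8)
```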
First Idea: Vickrey auction
• Conjecture: Agents will bid the Gittins index for their arm in each round.
• Intuition?
Not truthful!
• Agent 1 may have knowledge that the mean reward for arm 2 is smaller than agent 2’s current Gittins index.
• Learning by agent 2 would decrease the price paid by agent 1 in the future ⇒ agent 1 should underbid.
Second Idea
• At every time-step:
– Each agent reports a claim about its Gittins index
– Suppose b1 ≥ b2 ≥ … ≥ bn
– Mechanism activates agent 1
– Agent 1 reports its reward, r1
– Mechanism pays r1 to each agent ≠ 1
• Theorem: Truthful reporting is a Markov-perfect equilibrium, and the mechanism implements optimal Bayesian learning.
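
As a sketch, one period of this “second idea” mechanism might look as follows; the agents' `bid` and `pull` methods are hypothetical stand-ins for their reporting strategies and reward processes:

```python
def second_idea_round(agents):
    """One period of the 'second idea' mechanism.

    agents: objects with hypothetical methods
      .bid()  -> claimed Gittins index
      .pull() -> privately observed reward of the arm
    Returns the activated index and each agent's transfer."""
    bids = [a.bid() for a in agents]
    winner = max(range(len(agents)), key=bids.__getitem__)
    reward = agents[winner].pull()          # winner reports r_1
    # Every *other* agent is paid the winner's reported reward.
    transfers = [0 if i == winner else reward
                 for i in range(len(agents))]
    return winner, transfers
```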
Learning-Gittins VCG (CPS’06)
• At every time-step:
– Activate the agent with the highest bid.
– Pay the reward received by the activated agent to all others.
– Collect from every agent i the expected value agents ≠ i would receive without i in the system.
• Sample hypothetical execution path(s), using no reported state information.
• Theorem: The mechanism is truthful, system-optimal, ex ante IR, and ex ante strongly budget-balanced in MPE.
agent | immediate reward if activated | Gittins index | payment
------|-------------------------------|---------------|---------
  1   | 7                             | 10            | -X-1
  2   | 8                             | 9             | 7 - X-2
  3   | 3                             | 5             | 7 - X-3
• where X-i is the total expected value agents other than i would have received in this period if i weren’t there.
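
A sketch of how these transfers could be computed; `simulate_without(i)` is a hypothetical sampler that runs one hypothetical execution path of the system with agent i removed, using no reported state information:

```python
import statistics

def vcg_transfers(bids, reward, simulate_without, n_samples=1000):
    """Net transfers for one period of the Learning-Gittins VCG.

    bids:    claimed Gittins indices, one per agent
    reward:  reward reported by the activated (highest-bid) agent
    simulate_without(i): hypothetical sampler of the total value
        agents != i would get this period with i absent."""
    winner = max(range(len(bids)), key=bids.__getitem__)
    transfers = []
    for i in range(len(bids)):
        x_minus_i = statistics.mean(
            simulate_without(i) for _ in range(n_samples))
        paid_reward = 0 if i == winner else reward
        # Net transfer: receive the winner's reward, pay X_{-i};
        # this reproduces the table above (-X-1, 7 - X-2, 7 - X-3).
        transfers.append(paid_reward - x_minus_i)
    return transfers
```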
Outline
• Multi-armed Bandits Problem [agent learning]
– canonical, stylized learning problem from AI
– introduce a multi-agent variation
– provide a mechanism to bring optimal coordinated
learning into an equilibrium
• Dynamic auction problem [center learning]
– resource allocation (e.g. WiFi)
– dynamic arrival & departure of agents
– provide a truthful, adaptive mechanism that converges toward an optimal decision policy
(figure: a state timeline st, st+1, st+2, st+3 with agents A1, A2, A3, A4 arriving and departing)
First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?
Illustrative Example
Selling a single right to access WiFi in each period.
Agent i: (ai, di, wi) ⇒ value wi for an allocation in some t ∈ [ai, di]
Scenario:
9am A1 (9,11,$3), A2 (9,11,$2)
10am A3 (10,11,$1)
Second-price: Sell to A1 for $2, then A2 for $1
Manipulation? e.g., A1 can report a 10am arrival: A2 then wins at 9am, and A1 wins at 10am paying only $1.
Naïve Vickrey approach fails! (NPS’02)
9am A1 (9,11,$3), A2 (9,11,$2)
10am A3 (10,11,$1)
Mechanism Rule: Greedy policy; collect the “critical-value payment”, i.e. the smallest value an agent could bid and still be allocated.
⇒ Sell to A1, collect $1. Sell to A2, collect $1.
Theorem: Truthful, and implements a 2-approximation allocation, given no early arrivals and no late departures.
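
A sketch of the greedy policy with critical-value payments for this single-item-per-period setting; it treats di as a deadline (service must occur in some period t with ai ≤ t < di) and assumes truthful reports with no early arrivals or late departures:

```python
def greedy_online_auction(agents, start, horizon):
    """Greedy policy with critical-value payments, one item sold per
    period t = start, ..., horizon - 1.  agents: (a_i, d_i, w_i)."""
    allocated, payments = {}, {}
    for t in range(start, horizon):
        live = [i for i, (a, d, w) in enumerate(agents)
                if a <= t < d and i not in allocated]
        if not live:
            continue
        winner = max(live, key=lambda i: agents[i][2])
        allocated[winner] = t
        payments[winner] = critical_value(agents, winner, start, horizon)
    return allocated, payments

def critical_value(agents, i, start, horizon):
    """Smallest value agent i could bid and still win: rerun greedy
    without i and take the cheapest period inside i's window."""
    a_i, d_i, _ = agents[i]
    others = [ag for j, ag in enumerate(agents) if j != i]
    alloc, _ = greedy_online_auction(others, start, horizon)
    prices = [max((others[j][2] for j, p in alloc.items() if p == t),
                  default=0.0)
              for t in range(a_i, d_i)]
    return min(prices)

# Slide example: sells to A1 for $1, then A2 for $1; A3 goes empty.
print(greedy_online_auction([(9, 11, 3), (9, 11, 2), (10, 11, 1)], 9, 11))
```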
Key Intuition: Monotonicity (HKMP’05)
Monotonic: πi(vi, v-i) = 1 ⇒ πi(v′i, v-i) = 1 for a higher bid w′i ≥ wi and a more relaxed interval [a′i, d′i] ⊇ [ai, di]
(figure: win/lose regions over time; raising the bid from p to p′ or widening [a, d] to [a′, d′] keeps the agent winning)
Single-Valued Domains
• Type θi = (ai, di, [ri, Li])
• Value ri for decision kt ∈ Li, or kt ∈ Lj ⊃ Li
• Examples:
– “single-minded” online combinatorial auctions
– WiFi allocation with fixed lengths of service
• Monotonic: higher r, smaller L, earlier a, later d
• Theorem: monotonicity is necessary and sufficient for truthfulness in SV domains.
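
The monotonicity condition rests on a dominance order over types; a small sketch with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SVType:
    """Hypothetical encoding of a single-valued type (a, d, r, L)."""
    arrival: int
    departure: int
    value: float
    interesting: frozenset  # L: the decisions the agent cares about

def dominates(hi: SVType, lo: SVType) -> bool:
    """True if `hi` is 'more eager' than `lo`: higher value, smaller
    interesting set, earlier arrival, later departure.  Monotonicity
    requires: if `lo` wins, every dominating `hi` must also win."""
    return (hi.value >= lo.value
            and hi.interesting <= lo.interesting
            and hi.arrival <= lo.arrival
            and hi.departure >= lo.departure)
```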
(figure: the same state timeline st … st+3 with agents A1, A2, A3, A4 arriving and departing)
Second question: how can we compute monotonic policies in stochastic, SV domains? How can we allow learning (by the center)?
Basic Idea
Epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3, …
• Model-based reinforcement learning: update the model in each epoch
• Planning: compute a new policy π0, π1, …
• Collect critical-value payments
• Key components:
1. Ensure policies are monotonic
2. A method to compute critical-value payments
3. Careful updates to the model
1. Planning: Sparse-Sampling
(figure: root state h0, sampling width w, depth L)
Sparse sampling builds a depth-L sampled tree in which each node is a state and each node’s children are obtained by sampling each action w times; value estimates are backed up to the root.
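
A minimal sketch of the sparse-sampling estimate in the style of Kearns, Mansour & Ng; `sample(s, a)` is a hypothetical generative model returning a reward and next state:

```python
def sparse_sample(state, depth, width, actions, sample, beta=0.95):
    """Estimate the value of `state` with a depth-L sparse tree.

    Each action is sampled `width` times per node and the averaged
    estimates are backed up to the root."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            r, s2 = sample(state, a)
            total += r + beta * sparse_sample(
                s2, depth - 1, width, actions, sample, beta)
        best = max(best, total / width)
    return best
```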
Monotonic? Not Quite.
Achieving Monotonicity: Ironing
• Assume a maximal patience, Δ
• Ironing: if πss allocates to (ai, di, ri, Li) in period t, then check whether πss would also allocate to (ai, di+Δ, ri, Li)
– NO: block the (ai, di, ri, Li) allocation
– YES: allow the allocation
• Also use “cross-state sampling” to make planning aware of ironing.
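
The ironing test itself is a simple wrapper around the planner; in this sketch, `policy` is a hypothetical predicate saying whether the sparse-sampling policy would allocate to the given type in the current state:

```python
def ironed_allocate(policy, state, agent_type, max_patience):
    """Ironing test (a sketch): only allow an allocation to
    (a, d, r, L) if the policy would also allocate to the same
    agent reporting the later departure d + max_patience."""
    a, d, r, L = agent_type
    if not policy(state, (a, d, r, L)):
        return False
    relaxed = (a, d + max_patience, r, L)
    # Block the allocation unless the more patient type also wins;
    # this restores monotonicity in the departure time.
    return policy(state, relaxed)
```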
2. Computing Payments: Virtual Worlds
(figure: agent A wins at t0, wins again at t1, …; one virtual world is spawned per win)
VW1: π′1 maps A’s value to vc(t0) − ε
VW2: π′2 maps A’s value to vc(t1) − ε
+ a method to compute the critical value vc(t) in any state st
3. Delayed Updates
Epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3, …
• Consider the critical payment for an agent with ai < T1 < di
• Delayed updates: only include departed agents in the revised π1
• Ensures the policy is agent-independent
Complete procedure
• In each period:
– maintain the main world
– maintain a virtual world without each agent that is active + allocated
• For planning:
– use ironing to cancel an action
– cross-state sparse-sampling to improve the policy
• For pricing:
– charge the minimal critical value across virtual worlds
• Periodically: move to a new model (and policy)
– only use departed types
• Theorem: truthful (DSE), adaptive policy for single-valued domains.
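
Putting the pieces together, one period might be organized like the skeleton below; every interface here (`planner.step`, `agent.charge`, and so on) is a hypothetical placeholder rather than the paper's actual implementation:

```python
class MechanismState:
    """Schematic container for the complete procedure."""
    def __init__(self, planner):
        self.planner = planner    # ironed, cross-state sparse sampler
        self.main_world = None    # true reported state
        self.virtual_worlds = {}  # one per active + allocated agent

    def run_period(self, t):
        # Planning: pick the (ironed) action in the main world.
        self.main_world = self.planner.step(self.main_world)
        # Advance each virtual world with the same planner so the
        # critical value vc(t) stays defined in every state.
        for agent, vw in self.virtual_worlds.items():
            self.virtual_worlds[agent] = self.planner.step(vw)
            if agent.departed(t):
                # Pricing: minimal critical value across the
                # agent's virtual worlds.
                agent.charge(min(agent.critical_values()))
```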
Future: Online CAs
• Combinatorial auctions (CAs) are well studied and used in practice (e.g. procurement)
• Challenge problem: Online CAs
• Two-pronged approach:
– computational (e.g. leveraging recent work on stochastic online combinatorial optimization by Pascal Van Hentenryck, Brown)
– incentive considerations (e.g. finding appropriate relaxations of dominant-strategy truthfulness for the online domain)
Summary
• Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous)
• Opportunity for learning:
– by agents: the multi-agent MABP; demonstrated the use of payments to bring optimal learning into an equilibrium
– by the center: adaptive online auctions; demonstrated the use of payments to bring expected-value-maximizing policies into an equilibrium
• Exciting area. Lots of work still to do!
Thanks
• Satinder Singh, Jonathan Bredin, Quang
Duong, Mohammad Hajiaghayi, Adam Juda,
Robert Kleinberg, Mohammad Mahdian, Chaki
Ng, Dimah Yanovsky.
• More information
www.eecs.harvard.edu/econcs