Adaptive Sequential Decision Making with Self-Interested Agents
David C. Parkes
Division of Engineering and Applied Sciences, Harvard University
http://www.eecs.harvard.edu/econcs
Wayne State University, October 17, 2006

Context
• Multiple agents
• Self-interest
• Private information about preferences, capabilities
• Coordinated decision problem
  – social planner
  – auctioneer

Social Planner: LaGuardia Airport
Social Planner: WiFi @ Starbucks
Self-Interested Auctioneer: Sponsored Search

This Talk: Sequential Decision Making
• Multiple time periods
• Agent arrival and departure
• Values for sequences of decisions
• Learning by agents and the "center"
• Example scenarios:
  – allocating computational/network resources
  – sponsored search
  – last-minute ticket auctions
  – bidding for shared cars, air-taxis, …

Markov Decision Process
• States s_t, actions a_t, rewards r(a_t, s_t), transitions Pr(s_{t+1} | a_t, s_t)
• … plus self-interest

Online Mechanisms
• Agents report to a mechanism M = (π, p): a decision policy π_t : S → A (actions) and a payment policy p_t : S → R^n (payments)
• Each period:
  – agents report state/rewards
  – center picks action, payments
• Main question: what policies can be implemented in a game-theoretic equilibrium?

Outline
• Multi-armed bandits problem [agent learning]
  – canonical, stylized learning problem from AI
  – introduce a multi-agent variation
  – provide a mechanism to bring optimal coordinated learning into an equilibrium
• Dynamic auction problem [center learning]
  – resource allocation (e.g. WiFi)
  – dynamic arrival & departure of agents
  – provide a truthful, adaptive mechanism

Multi-Armed Bandit Problem
• Multi-armed bandit (MAB) problem
• n arms
• Each arm has a stationary, uncertain reward process
• Goal: implement a (Bayesian) optimal learning policy

Learning as Planning / Optimal Learning as Planning
(figure slides)

Tractability: Gittins' Result
• Theorem [Gittins & Jones 1974]: The complexity of computing an optimal joint policy for a collection of n Markov chains is linear in n.
  – There exist independent index functions such that the Markov chain with the highest "Gittins index" at any given time should be activated.
  – The index can be computed as the optimal value of a "restart-in-i" MDP, solved with an LP (Katehakis & Veinott '87).

Self-Interest + MABP
• Multi-armed bandit (MAB) problem
• n arms (arm == agent)
• Each arm has a stationary, uncertain reward process (privately observed)
• Goal: implement a (Bayesian) optimal learning policy

Mechanism
(figure: agents A1, A2, A3 report to the mechanism; the mechanism activates one arm each period and a reward is realized)

Review: The Vickrey Auction
• Rules: "sell to the highest bidder at the second-highest price"
  – Alice: $10, Bob: $8, Carol: $6
• How should you bid? Truthfully! (dominant-strategy equilibrium)
• Alice wins for $8

First Idea: Vickrey Auction
• Conjecture: agents will bid the Gittins index for their arm in each round.
• Intuition? Not truthful!
  – Agent 1 may have knowledge that the mean reward for arm 2 is smaller than agent 2's current Gittins index.
  – Learning by agent 2 would decrease the price paid by agent 1 in the future ⇒ agent 1 should underbid.

Second Idea
• At every time step:
  – Each agent reports a claim about its Gittins index.
  – Suppose b_1 ≥ b_2 ≥ … ≥ b_n.
  – The mechanism activates agent 1.
  – Agent 1 reports its realized reward, r_1.
  – The mechanism pays r_1 to every agent i ≠ 1.
• Theorem: Truthful reporting is a Markov-perfect equilibrium, and the mechanism implements optimal Bayesian learning.
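The per-round structure of this "second idea" can be sketched in a few lines of Python. This is a minimal illustration under assumed interfaces: the Arm class, its report_index() and pull() methods, and the empirical-mean index used here are stand-ins of my own; a truthful agent in the mechanism would report its actual Gittins index (e.g. computed via the restart-in-i MDP above) and its actual realized reward.

```python
import random

class Arm:
    """Hypothetical stand-in for a self-interested agent/arm."""
    def __init__(self, true_mean):
        self.true_mean = true_mean   # private to the agent
        self.pulls = 0
        self.total_reward = 0.0

    def report_index(self):
        # Placeholder index report: empirical mean with an optimistic prior.
        # A truthful agent would report its true Gittins index instead.
        if self.pulls == 0:
            return 1.0
        return self.total_reward / self.pulls

    def pull(self):
        # Privately observed reward; a truthful agent reports it exactly.
        r = 1.0 if random.random() < self.true_mean else 0.0
        self.pulls += 1
        self.total_reward += r
        return r

def run_round(arms, transfers):
    """One period of the 'second idea' mechanism: activate the highest
    reported index, then pay the realized reward to every other agent."""
    bids = [arm.report_index() for arm in arms]
    winner = max(range(len(arms)), key=lambda i: bids[i])
    reward = arms[winner].pull()           # reported by the activated agent
    for i in range(len(arms)):
        if i != winner:
            transfers[i] += reward         # side payment to the non-activated agents
    return winner, reward

if __name__ == "__main__":
    arms = [Arm(0.3), Arm(0.6), Arm(0.5)]
    transfers = [0.0, 0.0, 0.0]
    for t in range(100):
        run_round(arms, transfers)
    print("payments received:", transfers)
```

Paying the realized reward to every non-activated agent makes each agent's stage payoff equal to the system-wide reward, which is the intuition behind the Markov-perfect equilibrium claimed above.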
Learning-Gittins VCG (CPS'06)
• At every time step:
  – Activate the agent with the highest bid.
  – Pay the reward received by the activated agent to all others.
  – Collect from every agent i the expected value that agents other than i would receive without i in the system.
    • Sample hypothetical execution path(s), using no reported state information.
• Theorem: The mechanism is truthful, system-optimal, ex ante IR, and ex ante strongly budget-balanced in MPE.
• Example (one period):

  agent | immediate reward if activated | Gittins index | payment
    1   |               7               |      10       | −X_{−1}
    2   |               8               |       9       | 7 − X_{−2}
    3   |               3               |       5       | 7 − X_{−3}

  Agent 1 has the highest Gittins index, so it is activated and realizes reward 7, which is paid to agents 2 and 3. Here X_{−i} is the total expected value agents other than i would have received in this period if i weren't there.

Outline
• Multi-armed bandits problem [agent learning] (above)
• Dynamic auction problem [center learning]
  – resource allocation (e.g. WiFi)
  – dynamic arrival & departure of agents
  – provide a truthful, adaptive mechanism that converges towards an optimal decision policy

(figure: agents A1–A4 arriving and departing across states s_t, s_{t+1}, s_{t+2}, s_{t+3})

First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?

Illustrative Example
• Selling a single right to access WiFi in each period.
• Agent type (a_i, d_i, w_i) ⇒ value w_i for an allocation in some period t ∈ [a_i, d_i].
• Scenario:
  – 9am: A1 (9, 11, $3), A2 (9, 11, $2)
  – 10am: A3 (10, 11, $1)
• Second-price: sell to A1 for $2, then to A2 for $1.
• Manipulation? Yes: A1 can report a later arrival (10am) and win for only $1 instead of $2.

Naïve Vickrey Approach Fails! (NPS'02)
• 9am: A1 (9, 11, $3), A2 (9, 11, $2); 10am: A3 (10, 11, $1)
• Mechanism rule: greedy policy, collect the "critical-value payment", i.e. the smallest value an agent could bid and still be allocated.
  ⇒ Sell to A1, collect $1. Sell to A2, collect $1.
• Theorem: Truthful, and implements a 2-approximation allocation, given no early arrivals and no late departures (agents cannot misreport an earlier arrival or a later departure).

Key Intuition: Monotonicity (HKMP'05)
• Monotonic: π_i(v_i, v_{−i}) = 1 ⇒ π_i(v'_i, v_{−i}) = 1 for a higher bid w'_i ≥ w_i and a more relaxed window [a'_i, d'_i] ⊇ [a_i, d_i].
• (figure: win/lose regions and critical prices p, p' over time, for the windows [a, d] and [a', d'])

Single-Valued Domains
• Type θ_i = (a_i, d_i, [r_i, L_i])
• Value r_i for a decision k_t ∈ L_i, or k_t ∈ L_j ≻ L_i
• Examples:
  – "single-minded" online combinatorial auctions
  – WiFi allocation with fixed lengths of service
• Monotonic: higher r, smaller L, earlier a, later d
• Theorem: monotonicity is necessary and sufficient for truthfulness in single-valued (SV) domains.

Second question: how can we compute monotonic policies in stochastic, SV domains? How can we allow learning (by the center)?
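A small Python sketch of the greedy, critical-value mechanism on the WiFi example above. Assumptions worth flagging: the departure time is treated as exclusive (so the decision periods are 9am and 10am, which reproduces the prices on the slide), the critical value is computed by re-running the greedy policy without the winner and taking the cheapest period in its window, and the names Bid, greedy_allocate, and critical_payment are illustrative rather than from the talk.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    name: str
    arrival: int     # first period the agent is present
    departure: int   # exclusive: active in periods [arrival, departure)
    value: float

def greedy_allocate(bids, periods, excluded=None):
    """Greedy policy: in each period, sell the single unit to the
    highest-value active agent not yet served.  Returns {period: Bid}."""
    excluded = excluded or set()
    served, schedule = set(), {}
    for t in periods:
        active = [b for b in bids
                  if b.name not in excluded and b.name not in served
                  and b.arrival <= t < b.departure]
        if active:
            winner = max(active, key=lambda b: b.value)
            schedule[t] = winner
            served.add(winner.name)
    return schedule

def critical_payment(bids, periods, winner):
    """Smallest value `winner` could have reported and still been served:
    the cheapest period in its window, priced at the value of whoever
    would win that period if `winner` were absent (0 if nobody)."""
    counterfactual = greedy_allocate(bids, periods, excluded={winner.name})
    prices = []
    for t in periods:
        if winner.arrival <= t < winner.departure:
            rival = counterfactual.get(t)
            prices.append(rival.value if rival else 0.0)
    return min(prices)

if __name__ == "__main__":
    periods = [9, 10]
    bids = [Bid("A1", 9, 11, 3.0), Bid("A2", 9, 11, 2.0), Bid("A3", 10, 11, 1.0)]
    for t, w in greedy_allocate(bids, periods).items():
        print(f"{t}am: sell to {w.name}, "
              f"collect ${critical_payment(bids, periods, w):.0f}")
    # Expected: 9am sell to A1 for $1, 10am sell to A2 for $1 (as on the slide)
```

Note that A1 now pays $1 whether it reports a 9am or a 10am arrival, so the arrival-delay manipulation from the second-price example no longer helps.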
Basic Idea
• Epochs T_0, T_1, T_2, T_3, … with corresponding policies π_0, π_1, π_2, π_3, …
• Model-based reinforcement learning: update the model in each epoch.
• Planning: compute the new policies π_0, π_1, …
• Collect critical-value payments.
• Key components:
  1. Ensure policies are monotonic.
  2. A method to compute critical-value payments.
  3. Careful updates to the model.

1. Planning: Sparse Sampling
• Sparse-sampling(s): build a depth-L sampled tree rooted at the current state; each node is a state, and each node's children are obtained by sampling each action w times; back up estimates to the root. (A generic sketch of this planner is given at the end of these notes.)
• Monotonic? Not quite.

Achieving Monotonicity: Ironing
• Assume a maximal patience Δ.
• Ironing: if sparse sampling allocates to (a_i, d_i, r_i, L_i) in period t, check whether it would also allocate to (a_i, d_i + Δ, r_i, L_i):
  – NO: block the allocation to (a_i, d_i, r_i, L_i).
  – YES: allow the allocation.
• Also use "cross-state sampling" so that planning is aware of the ironing.

2. Computing Payments: Virtual Worlds
• When agent A wins in the main world at time t_0, spawn virtual world VW1 in which A's reported value is lowered to just below the critical value vc(t_0); if A wins again in VW1 at time t_1, spawn VW2 with its value lowered to just below vc(t_1); and so on.
• Plus: a method to compute vc(t) in any state s_t.

3. Delayed Updates
• Epochs T_0, T_1, T_2, T_3, … as above.
• Consider the critical payment for an agent with a_i < T_1 < d_i, i.e. one present across an epoch boundary.
• Delayed updates: only include departed agents when building the revised policy π_1.
• This ensures the policy is agent-independent.

Complete Procedure
• In each period:
  – maintain the main world;
  – maintain a virtual world for each agent that is active and allocated.
• For planning:
  – use ironing to cancel an action;
  – use cross-state sparse sampling to improve the policy.
• For pricing:
  – charge the minimal critical value across the virtual worlds.
• Periodically: move to a new model (and policy)
  – only use departed types.
• Theorem: a truthful (DSE), adaptive policy for single-valued domains.

Future: Online CAs
• Combinatorial auctions (CAs) are well studied and used in practice (e.g. procurement).
• Challenge problem: online CAs.
• Two-pronged approach:
  – computational (e.g. leveraging recent work in stochastic online combinatorial optimization by Pascal Van Hentenryck, Brown)
  – incentive considerations (e.g. finding appropriate relaxations of dominant-strategy truthfulness for the online domain)

Summary
• Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous).
• Opportunity for learning:
  – by agents: the multi-agent MABP; demonstrated the use of payments to bring optimal learning into an equilibrium.
  – by the center: adaptive online auctions; demonstrated the use of payments to bring expected-value-maximizing policies into an equilibrium.
• Exciting area. Lots of work still to do!

Thanks
• Satinder Singh, Jonathan Bredin, Quang Duong, Mohammad Hajiaghayi, Adam Juda, Robert Kleinberg, Mohammad Mahdian, Chaki Ng, Dimah Yanovsky.
• More information: www.eecs.harvard.edu/econcs
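Appendix: A Sparse-Sampling Sketch
As referenced on the planning slide above, here is a minimal, generic Python sketch of a sparse-sampling planner. The generative model simulate(state, action), the discount gamma, and the toy example are assumptions for illustration only; the mechanism in the talk additionally applies ironing and cross-state sampling on top of this planner to obtain a monotonic policy.

```python
import random

def sparse_sampling(state, depth, width, actions, simulate, gamma=0.95):
    """Sparse-sampling value estimate for `state`.

    `simulate(state, action)` is a generative model returning
    (next_state, reward).  Builds a depth-limited lookahead tree by
    sampling each action `width` times at every node and backing the
    averaged estimates up to the root.  Returns (best_action, value)."""
    if depth == 0:
        return None, 0.0
    best_action, best_value = None, float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            next_state, reward = simulate(state, a)
            _, future = sparse_sampling(next_state, depth - 1, width,
                                        actions, simulate, gamma)
            total += reward + gamma * future
        q_estimate = total / width          # Monte-Carlo estimate of Q(state, a)
        if q_estimate > best_value:
            best_action, best_value = a, q_estimate
    return best_action, best_value

if __name__ == "__main__":
    # Toy generative model with a single recurrent state:
    # action 1 yields a higher expected reward than action 0.
    def simulate(state, action):
        reward = random.gauss(0.5 if action == 1 else 0.2, 0.1)
        return state, reward

    print(sparse_sampling(state=0, depth=3, width=5,
                          actions=[0, 1], simulate=simulate))
```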