Max Weight Learning Algorithms with Application to Scheduling in Unknown Environments
Pr(success_1, …, success_n) = ??
Michael J. Neely
University of Southern California
http://www-rcf.usc.edu/~mjneely
Information Theory and Applications Workshop (ITA), UCSD, Feb. 2009
*Sponsored in part by the DARPA IT-MANET Program, NSF OCE-0520324, NSF Career CCF-0747525

• Slotted system, slots t ∈ {0, 1, 2, …}
• Network queues: Q(t) = (Q_1(t), …, Q_L(t))
• 2-stage control decision every slot t:
1) Stage 1 decision: k(t) ∈ {1, 2, …, K}. This reveals a random vector ω(t), i.i.d. over slots given k(t), with unknown distribution F_k(ω).
2) Stage 2 decision: I(t) ∈ 𝓘 (a possibly infinite set). This affects the queue rates A(k(t), ω(t), I(t)) and μ(k(t), ω(t), I(t)), and incurs a "penalty vector" x(t) = x(k(t), ω(t), I(t)).

Stage 1: k(t) ∈ {1, …, K}. Reveals random ω(t).
Stage 2: I(t) ∈ 𝓘. Incurs penalties x(k(t), ω(t), I(t)). Also affects the queue dynamics through A(k(t), ω(t), I(t)) and μ(k(t), ω(t), I(t)).
Goal: Choose stage-1 and stage-2 decisions over time so that the time-average penalty vector x̄ solves:
Minimize: $f(\bar{x})$
Subject to: $h_n(\bar{x}) \le b_n$ for all n, and all queues stable,
where f(x) and h_n(x) are general convex functions of multiple variables.

Motivating Example 1: Min Power Scheduling with Channel Measurement Costs
[Figure: L queues; queue l has arrivals A_l(t) and channel state S_l(t)]
Minimize avg. power subject to stability.
If channel states are known every slot, we can schedule without knowing channel statistics or arrival rates! (EECA: Neely 2005, 2006) (Georgiadis, Neely, Tassiulas F&T 2006)
If there is a "cost" to measuring, we make a 2-stage decision:
Stage 1: Measure or not? (reveals the channels ω(t))
Stage 2: Transmit over a known channel? Over a blind channel?
- Li and Neely (07)
- Gopalan, Caramanis, Shakkottai (07)
Existing solutions require a priori knowledge of the full joint channel state distribution! (2^L states? 1024^L?)

Motivating Example 2: Diversity Backpressure Routing (DIVBAR)
[Figure: node 1 broadcasts to nodes 2 and 3 over error-prone links]
[Neely, Urgaonkar 2006, 2008]
Networking with lossy channels and multi-receiver diversity:
DIVBAR Stage 1: Choose a commodity and transmit.
DIVBAR Stage 2: Get success feedback, then choose the next hop.
If there is a single commodity (no stage-1 decision), we do not need the success probabilities! With two or more commodities, we need the full joint success probability distribution over all neighbors!

Equivalent Goal:
Minimize: $\overline{f(g)}$
Subject to: $\overline{h_n(g)} \le b_n$ for all n, $\bar{g}_m = \bar{x}_m$ for all m, and all queues stable,
where g(t) is an auxiliary vector that acts as a proxy for x(t) (overbars denote time averages).

Technique: Form a virtual queue for each constraint:
$U_n(t+1) = \max[U_n(t) + h_n(g(t)) - b_n, 0]$
$Z_m(t+1) = Z_m(t) - g_m(t) + x_m(t)$ (Z_m(t) is possibly negative)

Use the stochastic Lyapunov optimization technique [Neely 2003], [Georgiadis, Neely, Tassiulas F&T 2006]:
Define: Θ(t) = all queue states = [Q(t), Z(t), U(t)]
Define: L(Θ(t)) = (1/2)[sum of squared queue sizes]
Define: Δ(Θ(t)) = E{L(Θ(t+1)) − L(Θ(t)) | Θ(t)}
Schedule using the modified "max-weight" rule: every slot t, observe the queue states and make a 2-stage decision to minimize the "drift plus penalty":
Minimize: Δ(Θ(t)) + V f(g(t))
where V is a constant control parameter that controls proximity to optimality (with a corresponding delay tradeoff).

How to (try to) minimize Δ(Θ(t)) + V f(g(t)):
The proxy variables g(t) appear separably, and their terms can be minimized without knowing the system stochastics!
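The virtual-queue technique and the separable proxy step can be illustrated with a short Python sketch (not the author's code). It assumes that the part of the drift-plus-penalty bound that depends on g(t) is V f(g) + Σ_n U_n(t) h_n(g) − Σ_m Z_m(t) g_m, which is what one obtains by squaring the two virtual-queue updates above. The function names and the brute-force grid search are hypothetical choices made only for illustration; a real implementation would use a convex solver.

```python
import itertools


def update_virtual_queues(U, Z, g, x, h, b):
    """One-slot update of the virtual queues.

    U_n(t+1) = max[U_n(t) + h_n(g(t)) - b_n, 0]   (one queue per constraint h_n <= b_n)
    Z_m(t+1) = Z_m(t) - g_m(t) + x_m(t)           (may become negative)
    """
    U_next = [max(u + h_n(g) - b_n, 0.0) for u, h_n, b_n in zip(U, h, b)]
    Z_next = [z - g_m + x_m for z, g_m, x_m in zip(Z, g, x)]
    return U_next, Z_next


def lyapunov(*queue_vectors):
    """L(Theta) = (1/2) * sum of squared queue sizes, over all given queue vectors."""
    return 0.5 * sum(q * q for vec in queue_vectors for q in vec)


def solve_proxy(V, f, h, U, Z, grid, dim):
    """Minimize V f(g) + sum_n U_n h_n(g) - sum_m Z_m g_m by brute force over grid^dim.

    The finite grid is a hypothetical stand-in for a proper convex solver. Only the
    current backlogs U and Z are needed, not any probability distribution.
    """
    def objective(g):
        constraint_part = sum(u * h_n(g) for u, h_n in zip(U, h))
        proxy_part = sum(z * g_m for z, g_m in zip(Z, g))
        return V * f(g) + constraint_part - proxy_part

    return min(itertools.product(grid, repeat=dim), key=objective)
```

For instance, with V = 1, f(g) = g², no h_n constraints, Z = [2.0] and the grid 0.0, 0.1, …, 2.0, solve_proxy returns (1.0,), the minimizer of g² − 2g. Increasing V weights f more heavily relative to the backlogs.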
Proxy subproblem for slot t:
Minimize: $V f(g(t)) + \sum_n U_n(t) h_n(g(t)) - \sum_m Z_m(t) g_m(t)$
Subject to: g(t) within its (bounded) feasible range
[Z_m(t) and U_n(t) are known queue backlogs for slot t.]

Minimizing the Remaining Terms of Δ(Θ(t)) + V f(g(t)):
Solution: Define g^(mw)(t), I^(mw)(t), k^(mw)(t) as the ideal max-weight decisions (those minimizing the drift-plus-penalty expression). Define
$e_k(t) = \mathbb{E}\{\min_{I \in \mathcal{I}} Y_k(\omega(t), I, \Theta(t)) \mid \Theta(t), k(t) = k\}$,
where Y_k(ω, I, Θ) collects the drift terms that depend on the stage-1 and stage-2 decisions. Then:
k^(mw)(t) = argmin_{k ∈ {1, …, K}} e_k(t) (Stage 1)
I^(mw)(t) = argmin_{I ∈ 𝓘} Y_{k(t)}(ω(t), I, Θ(t)) (Stage 2)
g^(mw)(t) = solution to the proxy subproblem

Approximation Theorem (related to Neely 2003, G-N-T F&T 2006): If the actual decisions achieve the drift-plus-penalty value of the ideal max-weight decisions to within error parameters C, ε_Q, ε_Z, with ε_Q < ε_max and ε_Z < σ (related to the slackness of the constraints), then:
- All constraints are satisfied.
- Average queue sizes ≤ $\frac{B + C + c_0 V}{\min[\epsilon_{\max} - \epsilon_Q,\ \sigma - \epsilon_Z]}$
- The penalty satisfies $f(\bar{x}) \le f^*_{\text{optimal}} + O(\max[\epsilon_Q, \epsilon_Z]) + (B + C)/V$

It all hinges on our approximation of e_k(t):
Declare a "type-k exploration event" independently with probability θ > 0 (small). On such a slot we must use k(t) = k.
Approach 1: {ω_1^(k)(t), …, ω_W^(k)(t)} = samples over the past W type-k exploration events, and
$\hat{e}_k(t) = \frac{1}{W}\sum_{i=1}^{W} \min_{I \in \mathcal{I}} Y_k(\omega_i^{(k)}(t), I, \Theta(t))$
Approach 2: the same samples, plus {Θ_1^(k)(t), …, Θ_W^(k)(t)} = queue backlogs at these sample times, and
$\hat{e}_k(t) = \frac{1}{W}\sum_{i=1}^{W} \min_{I \in \mathcal{I}} Y_k(\omega_i^{(k)}(t), I, \Theta_i^{(k)}(t))$

Analysis (Approach 2) subtleties:
1) An "inspection paradox" issue requires using samples taken at exploration events, so that {ω_1^(k)(t), …, ω_W^(k)(t)} are i.i.d.
2) Even so, {ω_1^(k)(t), …, ω_W^(k)(t)} are correlated with the queue backlogs at time t, so we cannot directly apply the Law of Large Numbers!

Analysis (Approach 2):
[Figure: timeline with samples ω_1(t), ω_2(t), ω_3(t), …, ω_W(t) collected between t_start and t]
Use a "delayed queue" analysis: compare against the backlog at time t_start, which is constant with respect to the later samples, so the Law of Large Numbers can be applied.

Max-Weight Learning Algorithm (Approach 2): (No knowledge of probability distributions is required!)
- Have random exploration events (prob. θ).
- Choose the stage-1 decision k(t) = argmin_{k ∈ {1, …, K}} ê_k(t).
- Use I^(mw)(t) for the stage-2 decision: I^(mw)(t) = argmin_{I ∈ 𝓘} Y_{k(t)}(ω(t), I, Θ(t)).
- Use g^(mw)(t) for the proxy variables.
- Update the virtual queues and the moving averages.

Theorem (Fixed W, V): With window size W, we have:
- All constraints are satisfied.
- Average queue sizes ≤ $\frac{B + C + c_0 V}{\min[\epsilon_{\max} - \epsilon_Q,\ \sigma - \epsilon_Z]}$
- The penalty satisfies $f(\bar{x}) \le f^*_\theta + O(1/\sqrt{W}) + (B + C)/V$

Concluding Theorem (Variable W, V): Let 0 < β_1 < β_2 < 1. Define $V(t) = (t+1)^{\beta_1}$ and $W(t) = (t+1)^{\beta_2}$. Then under the Max-Weight Learning Algorithm:
- All constraints are satisfied.
- All queues are mean rate stable*: $\lim_{t \to \infty} \mathbb{E}\{|\Theta_j(t)|\}/t = 0$ for every component Θ_j(t) of Θ(t).
- The average penalty achieves exact optimality (subject to the random exploration events): $f(\bar{x}) = f^*_\theta$

*Mean rate stability does not imply finite average congestion and delay. In fact, average congestion and delay are necessarily infinite when exact optimality is reached.
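The Stage-1 and Stage-2 rules of the Max-Weight Learning Algorithm (Approach 2), together with the V(t), W(t) schedules, can be sketched in Python as below. This is a hedged illustration resting on assumptions that are not taken from the slides: the K exploration events are mutually exclusive, each occurring with probability θ per slot (so Kθ ≤ 1); Y_k(ω, I, Θ) has the form Σ_l Q_l(A_l − μ_l) + Σ_m Z_m x_m, which is what the drift-plus-penalty bound gives under the standard queue update Q_l(t+1) = max[Q_l(t) − μ_l(t), 0] + A_l(t); the Stage-2 set 𝓘 is finite; and W(t) is rounded down to an integer. All class and function names are hypothetical.

```python
import math
import random
from collections import deque


def stage2_weight(Q, Z, A, mu, x):
    """Assumed form of Y_k(omega, I, Theta): sum_l Q_l (A_l - mu_l) + sum_m Z_m x_m,
    where A, mu, x are the arrival, service and penalty vectors produced by (k, omega, I)."""
    queue_part = sum(q * (a - m) for q, a, m in zip(Q, A, mu))
    penalty_part = sum(z * xm for z, xm in zip(Z, x))
    return queue_part + penalty_part


def stage2_decision(candidates, weight):
    """I_mw(t): argmin of the Stage-2 weight over a finite candidate list (assumption)."""
    return min(candidates, key=weight)


class MaxWeightLearner:
    """Stage-1 rule with Approach-2 style moving-average estimates of e_k(t)."""

    def __init__(self, K, W, theta, rng=None):
        if K * theta > 1:
            raise ValueError("need K * theta <= 1 for disjoint exploration events")
        self.K, self.theta = K, theta
        # Last W values of min_I Y_k(omega, I, Theta) recorded at type-k exploration events.
        self.samples = [deque(maxlen=W) for _ in range(K)]
        self.rng = rng if rng is not None else random.Random()

    def estimate(self, k):
        """Moving-average estimate of e_k(t); +inf until type k has been explored."""
        s = self.samples[k]
        return sum(s) / len(s) if s else math.inf

    def choose_stage1(self):
        """Return (k, explored): type-k exploration with prob. theta each, else argmin_k."""
        u = self.rng.random()
        if u < self.K * self.theta:
            # Guard against floating-point rounding pushing the index to K.
            return min(int(u // self.theta), self.K - 1), True
        return min(range(self.K), key=self.estimate), False

    def record_exploration(self, k, min_weight):
        """Store min_I Y_k(omega(t), I, Theta(t)) observed on a type-k exploration slot."""
        self.samples[k].append(min_weight)


def schedules(t, beta1, beta2):
    """V(t) = (t+1)^beta1 and W(t) = (t+1)^beta2, with W rounded down to an integer."""
    if not 0 < beta1 < beta2 < 1:
        raise ValueError("need 0 < beta1 < beta2 < 1")
    return (t + 1) ** beta1, math.floor((t + 1) ** beta2)
```

On an exploration slot, a caller would evaluate the minimized Stage-2 weight at the current backlog and pass it to record_exploration. That is what makes the estimate a cheap moving average over backlogs at the sample times, as in Approach 2, rather than a fresh minimization over all W stored samples at the current backlog, as in Approach 1.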