WORKING PAPER
ALFRED P. SLOAN SCHOOL OF MANAGEMENT

Conservation Laws, Extended Polymatroids and Multi-Armed Bandit Problems; A Unified Approach to Indexable Systems

Dimitris Bertsimas and Jose Niño-Mora

WP# 3538-93 MSA, February, 1993

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139

Dimitris Bertsimas, Sloan School of Management and Operations Research Center, MIT, Cambridge, MA 02139. The research of the author was partially supported by a Presidential Young Investigator Award DDM-9158118 with matching funds from Draper Laboratory.

Jose Niño-Mora, Operations Research Center, MIT, Cambridge, MA 02139. The research of the author was partially supported by a MEC/Fulbright Fellowship.

Abstract

We show that if performance measures in stochastic and dynamic scheduling problems satisfy generalized conservation laws, then the feasible space of achievable performance is a polyhedron called an extended polymatroid, which generalizes the usual polymatroids introduced by Edmonds. Optimization of a linear objective over an extended polymatroid is solved by an adaptive greedy algorithm, which leads to an optimal solution having an indexability property (indexable systems). Under a certain condition, the indices have a stronger decomposition property (decomposable systems). The following classical problems can be analyzed using our theory: multi-armed bandit problems, branching bandits, multiclass queues, multiclass queues with feedback, deterministic scheduling problems. Interesting consequences of our results include: (1) a characterization of indexable systems as systems that satisfy generalized conservation laws, (2) a sufficient condition for indexable systems to be decomposable, (3) a new linear programming proof of the decomposability property of Gittins indices in multi-armed bandit problems, (4) a unified and practical approach to sensitivity analysis of indexable systems, (5) a new characterization of the indices of indexable systems as sums of dual variables and a new interpretation of the indices in terms of retirement options in the context of branching bandits, (6) the first rigorous analysis of the indexability of undiscounted branching bandits, (7) a new algorithm to compute the indices of indexable systems (in particular Gittins indices), which is as fast as the fastest known algorithm, (8) a unification of the algorithm of Klimov for multiclass queues and the algorithm of Gittins for multi-armed bandits as special cases of the same algorithm, (9) closed form formulae for the performance of the optimal policy, and (10) an understanding of the nondependence of the indices on some of the parameters of the stochastic scheduling problem. Most importantly, our approach provides a unified treatment of several classical problems in stochastic and dynamic scheduling and is able to address in a unified way their variations such as: discounted versus undiscounted cost criterion, rewards versus taxes, preemption versus nonpreemption, discrete versus continuous time, work conserving versus idling policies, linear versus nonlinear objective functions.
1 Introduction

In the mathematical programming tradition researchers and practitioners solve optimization problems by defining decision variables and formulating constraints, thus describing the feasible space of decisions, and applying algorithms for the solution of the underlying optimization problem. For the most part, the tradition for stochastic and dynamic scheduling problems has been, however, quite different, as it relies primarily on dynamic programming formulations. Using ingenious but often ad hoc methods, which exploit the structure of the particular problem, researchers and practitioners can sometimes derive insightful structural results that lead to efficient algorithms. In their comprehensive survey of deterministic scheduling problems Lawler et al. [23] end their paper with the following remarks: "The results in stochastic scheduling are scattered and they have been obtained through considerable and sometimes disheartening effort. In the words of Coffman, Hofri and Weiss [8], there is great need for new mathematical techniques useful for simplifying the derivation of the results".

Perhaps one of the most important successes in the area of stochastic scheduling in the last twenty years is the solution of the celebrated multi-armed bandit problem, a generic version of which in discrete time can be described as follows:

The multi-armed bandit problem: There are $K$ parallel projects, indexed $k = 1, \ldots, K$. Project $k$ can be in one of a finite number of states $i_k$. At each instant of discrete time $t = 0, 1, \ldots$ one can work on only a single project. If one works on project $k$ at time $t$, then one receives an immediate expected reward of $R_{i_k(t)}$. Rewards are additive and discounted in time by a discount factor $0 < \beta < 1$. The state $i_k(t)$ changes to $i_k(t+1)$ by a Markov transition rule (which may depend on $k$, but not on $t$), while the states of the projects one has not engaged remain unchanged, i.e., $i_l(t+1) = i_l(t)$ for $l \neq k$. The problem is how to allocate one's resources sequentially in time in order to maximize expected total discounted reward over an infinite horizon.

The problem has numerous applications and a rather vast literature (see Gittins [16] and the references therein). It was originally solved by Gittins and Jones [14], who proved that to each project $k$ one could attach an index $\gamma^k(i_k(t))$, which is a function of the project $k$ and the current state $i_k(t)$ alone, such that the optimal action at time $t$ is to engage the project of largest current index. They also proved the important result that these index functions satisfy a stronger index decomposition property: the function $\gamma^k(\cdot)$ only depends on characteristics of project $k$ (states, rewards and transition probabilities), and not on any other project. These indices are now known as Gittins indices, in recognition of Gittins' contribution. Since the original solution, which relied on an interchange argument, other proofs were proposed: Whittle [36] provided a proof based on dynamic programming, subsequently simplified by Tsitsiklis [30]. Varaiya, Walrand and Buyukkoc [33] and Weiss [35] provided different proofs based on interchange arguments. Weber [34] outlined an intuitive proof. More recently, Tsitsiklis [31] has provided a proof based on a simple inductive argument.

The multi-armed bandit problem is a special case of a dynamic and stochastic job scheduling system $S$. In this context, there is a set $E$ of job types and we are interested in optimizing a function of a performance measure (rewards or taxes) under a class of admissible scheduling policies.

Definition 1 (Indexable Systems) We say that a dynamic and stochastic job scheduling system $S$ is indexable if the following policy is optimal: To each job type $i$ we attach an index $\gamma_i$. At each decision epoch select a job with the largest index.

In general the indices $\gamma_i$ could depend on the entire set $E$ of job types. Consider a partition of the set $E$ into subsets $E_k$, $k = 1, \ldots, K$, which contain collections of job types and can be interpreted as projects consisting of several job types. In certain situations, the index of job type $i \in E_k$ depends only on the characteristics of the job types in $E_k$ and not on the entire set $E$ of job types. Such a property is particularly useful computationally, since it enables the system to be decomposed into smaller parts, and the computation of the indices can be done independently. As we have seen, the multi-armed bandit problem has this decomposition property, which motivates the following definition:

Definition 2 (Decomposable Systems) An indexable system is called decomposable if for all job types $i \in E_k$, the index $\gamma_i$ of job type $i$ depends only on the characteristics of the set of job types $E_k$.

In addition to the multi-armed bandit problem, a variety of dynamic and stochastic scheduling problems has been solved in the last decades by indexing rules:

1. Extensions of the usual multi-armed bandit problem such as arm-acquiring bandits (Whittle [37], [38]) and more generally branching bandits (Weiss [35]), that include several important problems as special cases.

2. The multiclass queueing scheduling problem with Bernoulli feedback (Klimov [22], Tcha and Pliska [29]).

3. The multiclass queueing scheduling problem without feedback (Cox and Smith [9], Harrison [19], Kleinrock [21], Gelenbe and Mitrani [13], Shantikumar and Yao [26]).

4. Deterministic scheduling problems (Smith [27]).

An interesting distinction, which is not emphasized in the literature, is that examples (1) and (2) above are indexable systems, but they are not in general decomposable systems. Example (3), however, has a more refined structure. It is indexable, but not decomposable under discounting, while it is decomposable under the average cost criterion (the $c\mu$ rule). As already observed, the multi-armed bandit problem is an example of a decomposable system, while example (4) above is also decomposable.

Faced with these results, one asks what is the underlying deep reason that these nontrivial problems have very efficient solutions both theoretically as well as practically. In particular, what is the class of stochastic and dynamic scheduling problems that are indexable? Under what conditions are indexable systems decomposable? But most importantly, is there a unified way to address stochastic and dynamic scheduling problems that will lead to a deeper understanding of their strong structural properties? This is the set of questions that motivates this work.

In the last decade the following approach has been proposed to address special cases of these questions. In broad terms, researchers try to describe the feasible space of a stochastic and dynamic scheduling problem as a polyhedron. Then, the stochastic and dynamic scheduling problem is translated to an optimization problem over the corresponding polyhedron, which can then be attacked by traditional mathematical programming methods. Coffman and Mitrani [7] and Gelenbe and Mitrani [13] first showed using conservation laws that the performance space of a multiclass queue under the average cost criterion can be described as a polyhedron. Federgruen and Groenevelt [11], [12] advanced the theory further by observing that in certain special cases of multiclass queues, the polyhedron has a very special structure (it is a polymatroid) that gives rise to very simple optimal policies (the $c\mu$ rule). Shantikumar and Yao [26] generalized the theory further by observing that if a system satisfies strong conservation laws, then the underlying performance space is necessarily a polymatroid. They also proved that, when the cost is linear on the performance, the optimal policy is a fixed priority rule (also called head of the line priority rule; see Cobham [6], and Cox and Smith [9]). Their results partially extend to some rather restricted queueing networks, in which they assume that all the different classes of customers have the same routing probabilities, and the same service requirements at each station of the network (see also [25]). Tsoucas [32] derived the region of achievable performance in the problem of scheduling a multiclass nonpreemptive M/G/1 queue with Bernoulli feedback, introduced by Klimov [22]. Finally, Bertsimas et al. [2] generalize the ideas of conservation laws to general multiclass queueing networks using potential function ideas. They find linear and nonlinear inequalities that the feasible region satisfies. Optimization over this set of constraints gives bounds on achievable performance.

Our goal in this paper is to propose a unified theory of conservation laws and to establish that the very strong structural properties in the optimization of a class of stochastic and dynamic systems that include the multi-armed bandit problem and its extensions follow from the corresponding strong structural properties of the underlying polyhedra that characterize the regions of achievable performance.

By generalizing the work of Shantikumar and Yao [26] we show that if performance measures in stochastic and dynamic scheduling problems satisfy generalized conservation laws, then the feasible space of achievable performance is a polyhedron called an extended polymatroid (see Bhattacharya et al. [4]). Optimization of a linear objective over an extended polymatroid is solved by an adaptive greedy algorithm, which leads to an optimal solution having an indexability property. Special cases of our theory include all the problems we have mentioned, i.e., multi-armed bandit problems, discounted and undiscounted branching bandits, multiclass queues, multiclass queues with feedback and deterministic scheduling problems. Interesting consequences of our results include:
1. A characterization of indexable systems as systems that satisfy generalized conservation laws.

2. Sufficient conditions for indexable systems to be decomposable.

3. A genuinely new, algebraic proof (based on the strong duality theory of linear programming as opposed to dynamic programming formulations) of the decomposability property of Gittins indices in multi-armed bandit problems.

4. A unified and practical approach to sensitivity analysis of indexable systems, based on the well understood sensitivity analysis of linear programming.

5. A new characterization of the indices of indexable systems as sums of dual variables corresponding to the extended polymatroid that characterizes the feasible space.

6. A new interpretation of indices in the context of branching bandits as retirement options, thus generalizing the interpretation of Whittle [36] and Weber [34] for the indices of the classical multi-armed bandit problem.

7. The first complete and rigorous analysis of the indexability of undiscounted branching bandits.

8. A new algorithm to compute the indices of indexable systems (in particular Gittins indices), which is as fast as the fastest known algorithm (Varaiya, Walrand and Buyukkoc [33]).

9. The realization that the algorithm of Klimov for multiclass queues and the algorithm of Gittins for multi-armed bandits are examples of the same algorithm.

10. Closed form formulae for the performance of the optimal policy. This also leads to an understanding of the nondependence of the indices on some of the parameters of the stochastic scheduling problem.

Most importantly, our approach provides a unified treatment of several classical problems in stochastic and dynamic scheduling and is able to address in a unified way their variations such as: discounted versus undiscounted cost criterion, rewards versus taxes,
preemption versus nonpreemption, discrete versus continuous time, work conserving versus idling policies, linear versus nonlinear objective functions.

The paper is structured as follows: In Section 2 we define the notion of generalized conservation laws and show that if a performance vector of a stochastic and dynamic scheduling problem satisfies generalized conservation laws, then the feasible space of this performance vector is an extended polymatroid. Using the duality theory of linear programming we show that linear optimization problems over extended polymatroids can be solved by an adaptive greedy algorithm. Most importantly, we show that this optimization problem has an indexability property. In this way, we give a characterization of indexable systems as systems that satisfy generalized conservation laws. We also find a sufficient condition for an indexable system to be decomposable and prove a powerful result on sensitivity analysis. In Section 3 we study a natural generalization of the classical multi-armed bandit problem: the branching bandit problem. We propose two different performance measures and prove that they satisfy generalized conservation laws, and thus from the results of the previous section their feasible space is an extended polymatroid. We then consider different cost and reward structures on branching bandits, corresponding to the discounted and undiscounted case, and some transform results. Section 4 contains applications of the previous sections to various classical problems: multi-armed bandits, multiclass queueing scheduling problems with or without feedback and deterministic scheduling problems. The final section contains some thoughts on the field of optimization of stochastic systems.

2 Extended Polymatroids and Generalized Conservation Laws

2.1 Extended Polymatroids

Tsoucas [32] characterized the performance space of Klimov's problem (see Klimov [22]) as a polyhedron with a special structure, not previously identified in the literature.
Bhattacharya et al. [4] called this polyhedron an extended polymatroid and proved some interesting properties of it. Extended polymatroids are a central structure for the results we present in this paper.

Let us first establish the notation we will use. Let $E = \{1, \ldots, n\}$ be a finite set. Let $x$ denote a real $n$-vector, with components $x_i$, for $i \in E$. For $S \subseteq E$, let $S^c = E \setminus S$ and let $|S|$ denote the cardinality of $S$. Let $2^E$ denote the class of all subsets of $E$. Let $b: 2^E \to \Re_+$ be a set function that satisfies $b(\emptyset) = 0$. Let $A = (A_i^S)_{i \in E,\, S \subseteq E}$ be a matrix that satisfies

\[ A_i^S > 0, \ \text{for } i \in S \quad \text{and} \quad A_i^S = 0, \ \text{for } i \in S^c, \quad \text{for all } S \subseteq E. \tag{1} \]

Let $\pi = (\pi_1, \ldots, \pi_n)$ be a permutation of $E$. For clarity of presentation, it is convenient to introduce the following additional notation. For an $n$-vector $x = (x_1, \ldots, x_n)^T$ let $x_\pi = (x_{\pi_1}, \ldots, x_{\pi_n})^T$. Let us write

\[ b_\pi = \bigl( b(\{\pi_1\}),\ b(\{\pi_1, \pi_2\}),\ \ldots,\ b(\{\pi_1, \ldots, \pi_n\}) \bigr)^T. \]

Let $A_\pi$ denote the following lower triangular submatrix of $A$:

\[ A_\pi = \begin{pmatrix} A_{\pi_1}^{\{\pi_1\}} & 0 & \cdots & 0 \\ A_{\pi_1}^{\{\pi_1,\pi_2\}} & A_{\pi_2}^{\{\pi_1,\pi_2\}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ A_{\pi_1}^{\{\pi_1,\ldots,\pi_n\}} & A_{\pi_2}^{\{\pi_1,\ldots,\pi_n\}} & \cdots & A_{\pi_n}^{\{\pi_1,\ldots,\pi_n\}} \end{pmatrix}. \tag{2} \]

Let $v(\pi)$ be the unique solution of the linear system

\[ \sum_{i=1}^{j} A_{\pi_i}^{\{\pi_1,\ldots,\pi_j\}} x_{\pi_i} = b(\{\pi_1, \ldots, \pi_j\}), \quad j = 1, \ldots, n, \tag{3} \]

or, in matrix notation, $A_\pi x_\pi = b_\pi$. Let us define the polyhedron

\[ \mathcal{P}(A, b) = \Bigl\{ x \in \Re^n : \sum_{i \in S} A_i^S x_i \ge b(S), \ \text{for } S \subseteq E \Bigr\} \tag{4} \]

and the polytope

\[ \mathcal{B}(A, b) = \Bigl\{ x \in \Re^n : \sum_{i \in S} A_i^S x_i \ge b(S), \ \text{for } S \subset E, \ \text{and} \ \sum_{i \in E} A_i^E x_i = b(E) \Bigr\}. \tag{5} \]

Note that if $x \in \mathcal{P}(A, b)$, then it follows that $x \ge 0$ componentwise. The following definition is due to Bhattacharya et al. [4].

Definition 3 (Extended Polymatroid) We say that the polyhedron $\mathcal{P}(A, b)$ is an extended polymatroid with base set $E$ if, for every permutation $\pi$ of $E$, $v(\pi) \in \mathcal{P}(A, b)$. In this case we say that the polytope $\mathcal{B}(A, b)$ is the base of the extended polymatroid $\mathcal{P}(A, b)$.

2.2 Optimization over Extended Polymatroids

Extended polymatroids are polyhedra defined by an exponential number of inequalities. Yet, Tsoucas [32] and Bhattacharya et al. [4] presented a polynomial algorithm, based on Klimov's algorithm (see Klimov [22]), for solving a linear programming problem over an extended polymatroid. In this subsection we provide a new duality proof that this algorithm solves the problem optimally. We then show that we can associate with this linear program certain indices, related to the dual program, in such a way that the problem has an indexability property. Under certain conditions, we prove that a stronger index decomposition property holds. We also present an optimality condition specially suited for performing sensitivity analysis.

In what follows we assume that $\mathcal{P}(A, b)$ is an extended polymatroid. Let $R \in \Re^n$ be a row vector. Let us consider the following linear programming problem:

\[ (P) \qquad \max \Bigl\{ \sum_{i \in E} R_i x_i : x \in \mathcal{B}(A, b) \Bigr\}. \tag{6} \]

Note that since $\mathcal{B}(A, b)$ is a polytope, this linear program has a finite optimal solution. Therefore we may consider its dual, and this will have the same optimum value. We shall have a dual variable $y^S$ for every $S \subseteq E$. The dual problem is:

\[ (D) \qquad \min \Bigl\{ \sum_{S \subseteq E} b(S)\, y^S : \sum_{S \ni i} A_i^S y^S = R_i, \ \text{for } i \in E, \ \text{and} \ y^S \le 0, \ \text{for } S \subset E \Bigr\}. \tag{7} \]

In order to solve (P), Bhattacharya et al. [4] presented the following adaptive greedy algorithm, based on Klimov's algorithm [22]:

Algorithm $\mathcal{A}_1$

Input: $(R, A)$.

Output: $(\pi, y, \nu, \mathcal{S})$, where $\pi = (\pi_1, \ldots, \pi_n)$ is a permutation of $E$, $y = (y^S)_{S \subseteq E}$, $\nu = (\nu_1, \ldots, \nu_n)$, and $\mathcal{S} = \{S_1, \ldots, S_n\}$, with $S_k = \{\pi_1, \ldots, \pi_k\}$, for $k \in E$.

Step 0. Set $S_n = E$. Set $\nu_n = \max\{ R_i / A_i^E : i \in E \}$; pick $\pi_n \in \operatorname{argmax}\{ R_i / A_i^E : i \in E \}$.

Step k. For $k = 1, \ldots, n-1$: Set $S_{n-k} = S_{n-k+1} \setminus \{\pi_{n-k+1}\}$; set

\[ \nu_{n-k} = \max \Bigl\{ \frac{R_i - \sum_{j=0}^{k-1} A_i^{S_{n-j}} \nu_{n-j}}{A_i^{S_{n-k}}} : i \in S_{n-k} \Bigr\}; \]

pick $\pi_{n-k}$ in the argmax of the same expression over $i \in S_{n-k}$.

Step n. For $S \subseteq E$ set $y^S = \nu_j$, if $S = S_j$ for some $j \in E$; and $y^S = 0$, otherwise.

It is easy to see that the complexity of $\mathcal{A}_1$, given $(R, A)$, is $O(n^3)$. Note that, for certain reward vectors, ties may occur in algorithm $\mathcal{A}_1$. In the presence of ties, the permutation $\pi$ generated clearly depends on the choice of tie-breaking rules. However, we will show that vectors $\nu$ and $y$ are uniquely determined by $\mathcal{A}_1$. In order to prove this point, and to understand better $\mathcal{A}_1$, whose importance will be clear later, let us introduce the following related algorithm:

Algorithm $\mathcal{A}_2$

Input: $(R, A)$.

Output: $(r, \bar{y}, \mathcal{H}, \mathcal{J})$, where $1 \le r \le n$ is an integer, $\bar{y} = (\bar{y}^S)_{S \subseteq E}$, $\mathcal{H} = \{H_1, \ldots, H_r\}$ is a partition of $E$, and $\mathcal{J} = \{J_1, \ldots, J_r\}$, with $J_k = \bigcup_{l=k}^{r} H_l$, for $k = 1, \ldots, r$.

Step 1. Set $k := 1$; set $J_1 = E$; set $\theta_1 = \max\{ R_i / A_i^E : i \in E \}$ and $H_1 = \operatorname{argmax}\{ R_i / A_i^E : i \in E \}$.

Step 2. While $J_k \ne H_k$ do:
begin
Set $k := k + 1$; set $J_k = J_{k-1} \setminus H_{k-1}$; set

\[ \theta_k = \max \Bigl\{ \frac{R_i - \sum_{l=1}^{k-1} A_i^{J_l} \theta_l}{A_i^{J_k}} : i \in J_k \Bigr\} \]

and let $H_k$ be the argmax of the same expression over $i \in J_k$.
end {while}

Step 3. Set $r = k$; for $S \subseteq E$ set $\bar{y}^S = \theta_k$, if $S = J_k$ for some $k = 1, \ldots, r$; and $\bar{y}^S = 0$, otherwise.

In what follows let $(\pi, y, \nu, \mathcal{S})$ be an output of $\mathcal{A}_1$ and let $(r, \bar{y}, \mathcal{H}, \mathcal{J})$ be the output of $\mathcal{A}_2$. Note that the output of algorithm $\mathcal{A}_2$ is uniquely determined by its input. The idea that algorithm $\mathcal{A}_2$ is just an unambiguous version of $\mathcal{A}_1$ is formalized in the following result:

Proposition 1 The following relations hold between the outputs of algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$:

(a) for $i = 1, \ldots, n$,
\[ \nu_i = \begin{cases} \theta_k, & \text{if } i = |J_k| \text{ for some } k = 1, \ldots, r; \\ 0, & \text{otherwise;} \end{cases} \tag{8} \]

(b) $y = \bar{y}$;

(c) $\pi$ satisfies
\[ J_k = \{\pi_1, \ldots, \pi_{|J_k|}\}, \quad k = 1, \ldots, r; \tag{9} \]
\[ H_k = \{\pi_{|J_k| - |H_k| + 1}, \ldots, \pi_{|J_k|}\}, \quad k = 1, \ldots, r. \tag{10} \]

Outline of the proof. Parts (a) and (c) follow by induction arguments. Part (b) follows by (a) and the definitions of $y$ and $\bar{y}$.

Remark: Proposition 1 shows that $y$ and $\nu$ are uniquely determined by algorithm $\mathcal{A}_1$ (and thus invariant under different tie-breaking rules). It also reveals in (c) the structure of the permutations $\pi$ that can be generated by $\mathcal{A}_1$.

Tsoucas [32] and Bhattacharya et al. [4] proved from first principles that algorithm $\mathcal{A}_1$ solves linear program (P) optimally. Next we provide a new proof, using linear programming duality theory.
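As an illustration only, the adaptive greedy recursion of algorithm $\mathcal{A}_1$ can be sketched in Python. The function name `adaptive_greedy`, the representation of the weights as a callable `A(i, S)` on frozensets, and the tie-breaking rule (smallest label wins) are assumptions made for this sketch; they are not part of the paper. Exact rational arithmetic is used so that ties are detected reliably.

```python
from fractions import Fraction


def adaptive_greedy(R, A):
    """Sketch of algorithm A_1 (names and data layout are assumptions).

    R: dict mapping job type i -> reward R_i.
    A: callable A(i, S) returning the weight A_i^S > 0 for i in frozenset S.
    Returns (pi, nu, y): pi[k-1] is pi_k, nu[k-1] is nu_k, and y maps each
    nested set S_k to y^{S_k} = nu_k (all other dual variables are zero).
    """
    n = len(R)
    S = frozenset(R)          # S_n = E
    pi = [None] * n
    nu = [None] * n
    y = {}
    chain = []                # pairs (S_j, nu_j) already fixed, j = n, n-1, ...
    for k in range(n, 0, -1):
        def ratio(i):
            # (R_i - sum_j A_i^{S_j} nu_j) / A_i^{S_k}, summing over fixed j > k
            used = sum(A(i, Sj) * nu_j for Sj, nu_j in chain)
            return Fraction(R[i] - used) / A(i, S)

        best = max(sorted(S), key=ratio)   # ties broken by smallest label
        nu[k - 1] = ratio(best)
        pi[k - 1] = best
        y[S] = nu[k - 1]
        chain.append((S, nu[k - 1]))
        S = S - {best}                     # S_{k-1} = S_k \ {pi_k}
    return pi, nu, y


# Polymatroid special case A_i^S = 1: the recursion sorts the rewards.
pi, nu, y = adaptive_greedy({1: 3, 2: 5, 3: 4}, lambda i, S: 1)
# pi == [1, 3, 2] and nu == [-1, -1, 5]
```

Each pass evaluates at most $n$ ratios, each a sum of at most $n$ terms, which is in line with the $O(n^3)$ bound stated above. Only $\nu_n$ may be positive; the remaining entries are nonpositive.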
Proposition 2 Let vector $y$ and permutation $\pi$ be generated by algorithm $\mathcal{A}_1$. Then $v(\pi)$ and $y$ are an optimal primal-dual pair for the linear programs (P) and (D).

Proof. We first show that $y$ is dual feasible. By definition of $\nu_n$ in $\mathcal{A}_1$, it follows that

\[ R_i - A_i^{S_n} \nu_n \le 0, \quad i \in E, \]

and since $S_{n-1} \subset S_n$ it follows that $\nu_{n-1} \le 0$. Similarly, for $k = 1, \ldots, n-2$, by definition of $\nu_{n-k}$ it follows that

\[ R_i - \sum_{j=0}^{k} A_i^{S_{n-j}} \nu_{n-j} \le 0, \quad i \in S_{n-k}, \]

and since $S_{n-k-1} \subset S_{n-k}$, it follows that $\nu_{n-k-1} \le 0$. Hence $\nu_j \le 0$, for $j = 1, \ldots, n-1$, and by definition of $y$, we have $y^S \le 0$, for $S \subset E$. Moreover, for $k = 0, 1, \ldots, n-1$ we have, by construction,

\[ R_{\pi_{n-k}} = \sum_{j=0}^{k} A_{\pi_{n-k}}^{S_{n-j}} \nu_{n-j} = \sum_{S \ni \pi_{n-k}} A_{\pi_{n-k}}^{S} y^S. \]

Hence $y$ is dual feasible.

Let $x = v(\pi)$. Since $\mathcal{P}(A, b)$ is an extended polymatroid, $x$ is primal feasible. Let us show that $x$ and $y$ satisfy complementary slackness. Assume $y^S \ne 0$. Then, by construction, it must be $S = S_k = \{\pi_1, \ldots, \pi_k\}$, for some $k$. And since $x$ satisfies (3), it follows that

\[ \sum_{i \in S} A_i^S x_i = \sum_{j=1}^{k} A_{\pi_j}^{S_k} x_{\pi_j} = b(S). \]

Hence, by strong duality $v(\pi)$ and $y$ are an optimal primal-dual pair, and this completes the proof.

Remark: Edmonds [10] introduced a special class of polyhedra called polymatroids, and proved the classical result that the greedy algorithm solves the linear optimization problem over a polyhedron for every linear objective function if and only if the polyhedron is a polymatroid. Now, in the case that $A_i^S = 1$, for $i \in S$ and $S \subseteq E$, it is easy to see that $\mathcal{A}_1$ is the greedy algorithm that sorts the $R_i$'s in nonincreasing order. By Edmonds' result and Proposition 2 it follows that in this case $\mathcal{B}(A, b)$ is a polymatroid. Therefore, extended polymatroids are the natural generalizations of polymatroids, and algorithm $\mathcal{A}_1$ is the natural extension of the greedy algorithm.

The fact that $v(\pi)$ and $y$ are optimal solutions has some important consequences. It is well known that every extreme point of a polyhedron is the unique maximizer of some linear objective function. Therefore, the $v(\pi)$'s are the only extreme points of $\mathcal{B}(A, b)$.
Hence it follows:

Theorem 1 (Characterization of Extreme Points) The set of extreme points of $\mathcal{B}(A, b)$ is

\[ \{ v(\pi) : \pi \text{ is a permutation of } E \}. \]

The optimality of the adaptive greedy algorithm $\mathcal{A}_1$ leads naturally to the definition of certain indices, which for historical reasons, that will be clear later, we call generalized Gittins indices.

Definition 4 (Generalized Gittins Indices) Let $y$ be the optimal dual solution generated by algorithm $\mathcal{A}_1$. Let

\[ \gamma_i = \sum_{S :\, E \supseteq S \ni i} y^S, \quad i \in E. \tag{11} \]

We say that $\gamma_1, \ldots, \gamma_n$ are the generalized Gittins indices of linear program (P).

Remark: Notice that by Proposition 1(a) and the definition of $y$, it follows that if permutation $\pi$ is an output of algorithm $\mathcal{A}_1$ then the generalized Gittins indices can be computed as follows:

\[ \gamma_{\pi_i} = \sum_{j=i}^{n} \nu_j \tag{12} \]
\[ \phantom{\gamma_{\pi_i}} = \sum_{k :\, |J_k| \ge i} \theta_k, \quad i \in E. \tag{13} \]

Let $\Re_- = \{ x \in \Re : x \le 0 \}$. Let $\gamma_1, \ldots, \gamma_n$ be the generalized Gittins indices of (P). Let $\pi$ be a permutation of $E$. Let $T$ be the following $n \times n$ lower triangular matrix:

\[ T = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 1 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}. \]

In the next proposition and the next theorem we reveal the equivalence between some optimality conditions for linear program (P).

Proposition 3 The following statements are equivalent:

(a) $\pi$ satisfies (9) and (10);

(b) $\pi$ is an output of algorithm $\mathcal{A}_1$;

(c) $R_\pi A_\pi^{-1} \in \Re_-^{n-1} \times \Re$, and then the generalized Gittins indices are given by $\gamma_\pi = R_\pi A_\pi^{-1} T$;

(d) $\gamma_{\pi_1} \le \gamma_{\pi_2} \le \cdots \le \gamma_{\pi_n}$.

Outline of the proof

(a) $\Rightarrow$ (b): Proved in Proposition 1(c).

(b) $\Rightarrow$ (c): It is clear, by construction in $\mathcal{A}_1$, that

\[ \nu = R_\pi A_\pi^{-1}. \tag{14} \]

Now, in the proof of Proposition 2 we showed that $\nu \in \Re_-^{n-1} \times \Re$. Moreover, by (12) we get $\gamma_\pi = \nu T$, and by (14) it follows that $\gamma_\pi = R_\pi A_\pi^{-1} T$.

(c) $\Rightarrow$ (d): By (c) we have

\[ \gamma_\pi T^{-1} = R_\pi A_\pi^{-1} \in \Re_-^{n-1} \times \Re, \]

whence the result follows, since the $i$-th component of $\gamma_\pi T^{-1}$ is $\gamma_{\pi_i} - \gamma_{\pi_{i+1}}$ for $i < n$.

(d) $\Rightarrow$ (a): By construction of $\theta_j$ in algorithm $\mathcal{A}_2$, the fact that $y = \bar{y}$ and the definition of the generalized Gittins indices, it follows that

\[ \gamma_i = \theta_1 + \cdots + \theta_h, \quad \text{for } i \in H_h \text{ and } h = 1, \ldots, r. \tag{15} \]

Also, it is easy to see that $\theta_j < 0$, for $j \ge 2$. These two facts clearly imply that $\pi$ must satisfy (10), and hence (9), which completes the proof of the proposition.

Combining the result that algorithm $\mathcal{A}_1$ solves linear program (P) optimally with the equivalent conditions in Proposition 3, we obtain several optimality conditions, as shown next.

Theorem 2 (Sufficient Optimality Conditions and Indexability) Assume that any of the conditions (a)-(d) of Proposition 3 holds. Then $v(\pi)$ solves linear program (P) optimally.

It is easy to see that conditions (a)-(d) of Proposition 3 are not, in general, necessary optimality conditions. They are necessary if the polytope $\mathcal{B}(A, b)$ is nondegenerate.

Some consequences of Theorem 2 are the following:

Remarks:

1. Sensitivity analysis: Optimality condition (c) of Proposition 3 is specially well suited for performing sensitivity analysis. Consider the following question: given a permutation $\pi$ of $E$, for what vectors $R$ and matrices $A$ can we guarantee that $v(\pi)$ solves problem (P) optimally? The answer is: for $R$ and $A$ that satisfy the condition

\[ R_\pi A_\pi^{-1} \in \Re_-^{n-1} \times \Re. \]

We may also ask: for which permutations $\pi$ can we guarantee that $v(\pi)$ is optimal? By Proposition 3(d), the answer is: for permutations $\pi$ that satisfy

\[ \gamma_{\pi_1} \le \gamma_{\pi_2} \le \cdots \le \gamma_{\pi_n}, \]

thus providing an $O(n \log n)$ optimality test for $\pi$. Glazebrook [17] addressed the problem of sensitivity analysis in stochastic scheduling problems. His results are in the form of suboptimality bounds.

2. Explicit formulae for Gittins indices: Proposition 3(c) provides an explicit formula for the vector of generalized Gittins indices. The formula reveals that the indices are piecewise linear functions of the reward vector.

3. Indexability: Optimality condition (d) of Proposition 3 shows that any permutation that sorts the generalized Gittins indices in nonincreasing order provides an optimal solution for problem (P). Condition (d) thus shows that this class of optimization problems has an indexability property.

In the case that matrix $A$ has a certain special structure, the computation of the indices of (P) can be simplified.
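Let $E$ be partitioned as $E = \bigcup_{k=1}^{K} E_k$. For $k = 1, \ldots, K$, let $x^k = (x_i^k)_{i \in E_k}$; let $\mathcal{B}(A^k, b^k)$ be the base of an extended polymatroid; let $R^k = (R_i)_{i \in E_k}$; let $(P_k)$ be the following linear program:

\[ (P_k) \qquad \max \Bigl\{ \sum_{i \in E_k} R_i x_i^k : x^k \in \mathcal{B}(A^k, b^k) \Bigr\}; \tag{16} \]

let $\{\gamma_i^k\}_{i \in E_k}$ be the generalized Gittins indices of problem $(P_k)$. Assume that the following independence condition holds:

\[ A_i^S = A_i^{S \cap E_k} = (A^k)_i^{S \cap E_k}, \quad \text{for } i \in S \cap E_k \text{ and } S \subseteq E. \tag{17} \]

Under condition (17) there is an easy relation between the indices of problems (P) and $(P_k)$, as shown in the next result.

Theorem 3 (Index Decomposition) Under condition (17), the generalized Gittins indices of linear programs (P) and $(P_k)$ satisfy

\[ \gamma_i = \gamma_i^k, \quad \text{for } i \in E_k \text{ and } k = 1, \ldots, K. \tag{18} \]

Proof. Let

\[ h_i = \gamma_i^k, \quad \text{for } i \in E_k \text{ and } k = 1, \ldots, K. \tag{19} \]

Let us renumber the elements of $E$ so that

\[ h_1 \le h_2 \le \cdots \le h_n. \tag{20} \]

Let $\pi = (1, \ldots, n)$. Permutation $\pi$ of $E$ induces permutations $\pi^k$ of $E_k$, for $k = 1, \ldots, K$, that satisfy

\[ \gamma^k_{\pi^k_1} \le \gamma^k_{\pi^k_2} \le \cdots \le \gamma^k_{\pi^k_{|E_k|}}. \tag{21} \]

Hence, by Proposition 3 it follows that

\[ \gamma^k_{\pi^k} = R^k_{\pi^k} (A^k_{\pi^k})^{-1} T_k, \quad \text{for } k = 1, \ldots, K, \]

or, equivalently,

\[ h_{\pi^k} T_k^{-1} A^k_{\pi^k} = R^k_{\pi^k}, \quad \text{for } k = 1, \ldots, K, \tag{22} \]

where $T_k$ is an $|E_k| \times |E_k|$ matrix with the same structure as matrix $T$. On the other hand, we have

\[ h\, T^{-1} A_\pi = (h_1, \ldots, h_n) \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -1 & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & -1 & 1 \end{pmatrix} \begin{pmatrix} A_1^{\{1\}} & 0 & \cdots & 0 \\ A_1^{\{1,2\}} & A_2^{\{1,2\}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ A_1^{\{1,\ldots,n\}} & A_2^{\{1,\ldots,n\}} & \cdots & A_n^{\{1,\ldots,n\}} \end{pmatrix}. \]

Now, notice that if $i \in E_k$, $j \in E \setminus E_k$ and $i < j$ then, by (17):

\[ A_i^{\{1,\ldots,j\}} = A_i^{\{1,\ldots,j\} \cap E_k} = A_i^{\{1,\ldots,j-1\} \cap E_k} = A_i^{\{1,\ldots,j-1\}}. \tag{23} \]

Hence, by (19) and (23) it follows that system (22) can be written equivalently as

\[ h\, T^{-1} A_\pi = R_\pi. \tag{24} \]

Now, (20) and (24) imply that

\[ R_\pi A_\pi^{-1} = h\, T^{-1} = (h_1 - h_2,\ h_2 - h_3,\ \ldots,\ h_{n-1} - h_n,\ h_n) \in \Re_-^{n-1} \times \Re, \tag{25} \]

and by Proposition 3 it follows that the generalized Gittins indices of problem (P) satisfy

\[ \gamma_\pi = R_\pi A_\pi^{-1} T. \]

Hence, by (24), $h_i = \gamma_i$, for $i \in E$, and this completes the proof of the theorem.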
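The index decomposition of Theorem 3 can be checked numerically on a toy instance. The sketch below is illustrative only: the helper `gittins_indices`, the weight tables `A1_weights` and `A2_weights`, and the rewards are hypothetical choices made for this example, not data from the paper. The helper runs the adaptive greedy recursion and accumulates each index as the sum of the $\nu_j$ over the nested sets containing the job type, following Definition 4.

```python
from fractions import Fraction


def gittins_indices(R, A):
    """Return {i: gamma_i} from the adaptive greedy recursion (a sketch).

    R: dict i -> R_i;  A: callable A(i, S) giving A_i^S for i in frozenset S.
    """
    S = frozenset(R)
    chain = []      # pairs (S_k, nu_k), produced for k = n, n-1, ..., 1
    gamma = {}
    while S:
        def ratio(i):
            used = sum(A(i, Sj) * nu_j for Sj, nu_j in chain)
            return Fraction(R[i] - used) / A(i, S)

        best = max(sorted(S), key=ratio)
        chain.append((S, ratio(best)))
        # gamma_i sums nu_j over the nested sets S_j that contain i
        gamma[best] = sum(nu_j for _, nu_j in chain)
        S = S - {best}
    return gamma


# Hypothetical weights A^1 on E_1 = {1, 2} and A^2 on E_2 = {3}.
A1_weights = {(1, frozenset({1})): 1, (2, frozenset({2})): 1,
              (1, frozenset({1, 2})): 2, (2, frozenset({1, 2})): 1}
A2_weights = {(3, frozenset({3})): 1}
blocks = [(frozenset({1, 2}), A1_weights), (frozenset({3}), A2_weights)]


def A_full(i, S):
    # Independence condition (17): for i in E_k, A_i^S depends only on S & E_k.
    for Ek, weights in blocks:
        if i in Ek:
            return weights[(i, S & Ek)]
    raise KeyError(i)


R = {1: 4, 2: 3, 3: 2}
full = gittins_indices(R, A_full)            # indices of the joint problem (P)
parts = {}
for Ek, weights in blocks:                   # indices of each subproblem (P_k)
    parts.update(gittins_indices({i: R[i] for i in Ek},
                                 lambda i, S, w=weights: w[(i, S)]))
# full == parts == {1: 1, 2: 3, 3: 2}
```

On this instance the joint problem and the two subproblems produce the same indices, as (18) predicts. The recursion never reads $b$, so the check exercises only the algebraic relation between $(R, A)$ and the indices.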
Theorem 3 implies that the fundamental reason for decomposition to hold is (17). An easy and useful consequence of Theorems 2 and 3 is the following:

Corollary 1 Under the assumptions of Theorem 3, an optimal solution of problem (P) can be computed by solving the $K$ subproblems $(P_k)$, for $k = 1, \ldots, K$, by algorithm $\mathcal{A}_1$ and computing their respective generalized Gittins indices.

It is important to emphasize that the index decomposition property is much stronger than the indexability property. We will see later that the classical multi-armed bandit problem has the index decomposition property. On the other hand, we will see that Klimov's problem (see [22]) has the indexability property, but in the general case it is not decomposable.

2.3 Generalized Conservation Laws

Shantikumar and Yao [26] formalized a definition of strong conservation laws for performance measures in general multiclass queues, that implies a polymatroidal structure in the performance space. We next present a more general definition of generalized conservation laws in a broader context that implies an extended polymatroidal structure in the performance space, which has several interesting and important implications.

Consider a general dynamic and stochastic job scheduling process. There are $n$ job types, which we label $i \in E = \{1, \ldots, n\}$. We consider the class of admissible scheduling policies, which we denote $\mathcal{U}$, to be the class of all nonidling, nonpreemptive and nonanticipative scheduling policies.

Let $x_i^u$ be a performance measure of type $i$ jobs under admissible policy $u$, for $i \in E$. We assume that $x_i^u$ is an expectation. Let $x^u$ be the corresponding performance vector. Let $x^\pi$ denote the performance vector under a fixed priority rule that assigns priorities to the job types according to the permutation $\pi = (\pi_1, \ldots, \pi_n)$ of $E$, where type $\pi_n$ has the highest priority, ..., type $\pi_1$ has the lowest priority.
Definition 5 (Generalized Conservation Laws) The performance vector $x$ is said to satisfy generalized conservation laws if there exist a set function $b: 2^E \to \Re_+$ such that $b(\emptyset) = 0$ and a matrix $A = (A_i^S)_{i \in E, S \subseteq E}$ satisfying (1) such that:

(a)
$$b(S) = \sum_{i \in S} A_i^S x_i^\pi, \qquad \text{for all } \pi : \{\pi_1, \ldots, \pi_{|S|}\} = S \text{ and } S \subseteq E; \tag{26}$$

(b)
$$\sum_{i \in S} A_i^S x_i^u \ge b(S), \ \text{for all } S \subset E, \quad \text{and} \quad \sum_{i \in E} A_i^E x_i^u = b(E), \qquad \text{for all } u \in \mathcal{U}. \tag{27}$$

In words, a performance vector is said to satisfy generalized conservation laws if there exist weights $A_i^S$ such that the total weighted performance over all job types is invariant under any admissible policy, and the minimum weighted performance over the job types in any subset $S \subset E$ is achieved by any fixed priority rule that gives priority to all other types (in $S^c$) over types in $S$. The strong conservation laws of Shanthikumar and Yao [26] correspond to the special case that all weights are $A_i^S = 1$.

The connection between generalized conservation laws and extended polymatroids is the following theorem:

Theorem 4 Assume that performance vector $x$ satisfies generalized conservation laws (26) and (27). Then

(a) The vertices of $B(A, b)$ are the performance vectors of the fixed priority rules, and $x^\pi = v(\pi)$, for every permutation $\pi$ of $E$.

(b) The extended polymatroid base $B(A, b)$ is the performance space.

Proof (a) By (26) it follows that $x^\pi = v(\pi)$. And by Theorem 1 the result follows.

(b) Let $X = \{x^u : u \in \mathcal{U}\}$ be the performance space. Let $B_v(A, b)$ be the set of extreme points of $B(A, b)$. By (27) it follows that $X \subseteq B(A, b)$. By (a), $B_v(A, b) \subseteq X$. Hence, since $X$ is a convex set ($\mathcal{U}$ contains randomized policies) we have

$$B(A, b) = \mathrm{conv}(B_v(A, b)) \subseteq X.$$

Hence $X = B(A, b)$, and this completes the proof of the theorem.

As a consequence of Theorem 4, it follows by Carathéodory's theorem that the performance vector $x^u$ corresponding to an admissible policy $u$ can be achieved by a randomization of at most $n + 1$ fixed priority rules.
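Condition (26), applied to the nested sets $\{\pi_1\}, \{\pi_1, \pi_2\}, \ldots, E$, pins down $x^\pi$ through a triangular linear system. The sketch below solves that system by forward substitution; the callables `A(i, S)` and `b(S)` and the function name are assumptions made for illustration, and the weights $A_{\pi_j}^{\{\pi_1, \ldots, \pi_j\}}$ are assumed to be nonzero.

```python
def priority_rule_vertex(pi, A, b):
    """Performance vector of the fixed priority rule pi, from condition (26).

    For j = 1, ..., n and S_j = {pi_1, ..., pi_j}, imposes
        sum_{i in S_j} A(i, S_j) * x[i] = b(S_j)
    and solves for x[pi_j] by forward substitution.

    pi -- tuple (pi_1, ..., pi_n); pi_n has the highest priority
    A  -- hypothetical callable A(i, S) giving the weight A_i^S
    b  -- hypothetical callable b(S) giving the set function value
    """
    x = {}
    S = frozenset()
    for j, job_type in enumerate(pi):
        S = S | {job_type}
        already_fixed = sum(A(i, S) * x[i] for i in pi[:j])
        x[job_type] = (b(S) - already_fixed) / A(job_type, S)
    return x


# Usage: with all weights equal to 1 (the strong conservation law case),
# each coordinate is an increment of b along the chain of nested sets.
ones = lambda i, S: 1.0
b_square = lambda S: float(len(S) ** 2)
vertex = priority_rule_vertex((2, 0, 1), ones, b_square)
```

With unit weights the system reduces to $x_{\pi_j} = b(\{\pi_1, \ldots, \pi_j\}) - b(\{\pi_1, \ldots, \pi_{j-1}\})$, which is the familiar polymatroid greedy vertex.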
2.4 Optimization over systems satisfying generalized conservation laws

Let $x^u$ be a performance vector for a dynamic and stochastic job scheduling process that satisfies generalized conservation laws (associated with $A$, $b(\cdot)$). Suppose that we want to find an admissible policy $u$ that maximizes a linear reward function $\sum_{i \in E} R_i x_i^u$. This optimal scheduling control problem can be expressed as

$$(P_{\mathcal{U}}) \qquad \max\Big\{ \sum_{i \in E} R_i x_i^u : u \in \mathcal{U} \Big\}. \tag{28}$$

By Theorem 4 this control problem can be transformed into the following linear programming problem:

$$(P) \qquad \max\Big\{ \sum_{i \in E} R_i x_i : x \in B(A, b) \Big\}. \tag{29}$$

The strong structural properties of extended polymatroids lead to strong structural properties in the control problem. Suppose that to each job type $i$ we attach an index, $\gamma_i$. A policy that selects at each decision epoch a job of currently largest index will be referred to as an index policy. Let $\gamma_1, \ldots, \gamma_n$ be the generalized Gittins indices of linear program $(P)$. As a direct consequence of the results of Section 2.2 we show next that the control problem $(P_{\mathcal{U}})$ is solved by an index policy, with indices given by $\gamma_1, \ldots, \gamma_n$.

Theorem 5 (Indexability) (a) Let $v(\pi)$ be an optimal solution of linear program $(P)$. Then the fixed priority rule that assigns priorities to the job types according to permutation $\pi$ is optimal for the control problem $(P_{\mathcal{U}})$;

(b) A policy that selects at each decision epoch a job of currently largest generalized Gittins index is optimal for the control problem.

The previous theorem implies that systems satisfying generalized conservation laws are indexable systems.

Let us consider now a dynamic and stochastic project selection process, in which there are $K$ project types, labeled $k = 1, \ldots, K$. At each decision epoch one project must be selected. A project of type $k$ can be in one of a finite number of states $i_k \in E_k$. These states correspond to stages in the development of the project.
Clearly this process can be interpreted as a job scheduling process, as follows: simply interpret the action of selecting a project $k$ in state $i_k \in E_k$ as selecting a job of type $i = i_k \in \bigcup_{k=1}^{K} E_k$. We may interpret that each project consists of several jobs. Let us assume that this job scheduling process satisfies generalized conservation laws associated with matrix $A$ and set function $b(\cdot)$. By Theorem 5, the corresponding optimal control problem is solved by an index policy. We will see next that when a certain independence condition among the projects is satisfied, a strong index decomposition property holds.

We thus assume that $E$ is partitioned as $E = \bigcup_{k=1}^{K} E_k$. Let $x^k = (x_i^k)_{i \in E_k}$ be the performance vector over job types in $E_k$ corresponding to the project selection problem obtained when projects of types other than $k$ are ignored (i.e., they are never engaged). Let us assume that the performance vector $x^k$ satisfies generalized conservation laws associated with matrix $A^k$ and set function $b^k(\cdot)$, and that the independence condition (17) is satisfied. Let $\mathcal{U}_k$ be the corresponding set of admissible policies. Under these assumptions, Theorem 3 applies, and together with Theorem 5(b) we get the following result:

Theorem 6 (Index Decomposition) Under condition (17), the generalized Gittins indices of job types in $E_k$ only depend on characteristics of project type $k$.

The previous theorem identifies a sufficient condition for the indices of an indexable system to have a strong decomposition property. Therefore, systems that satisfy generalized conservation laws which further satisfy (17) are decomposable systems. For such systems the solution of problem $(P)$ can be obtained by solving $K$ smaller independent subproblems. This theorem justifies the term generalized Gittins indices. We will see in Section 4 that when applied to the multi-armed bandit problem, these indices reduce to the usual Gittins indices.
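Theorems 5(b) and 6 together suggest a simple operational picture for decomposable systems: compute one index table per project type, pool the tables, and at each decision epoch engage a job with the largest pooled index. The sketch below illustrates only this bookkeeping; how each per-project table is computed is left outside it, and the data layout, function names and tie-breaking (whichever maximal candidate Python's `max` meets first) are assumptions.

```python
def merge_project_indices(per_project):
    """Pool index tables computed separately for each project type.

    per_project -- dict: project type k -> {state: index of that state}
    Returns a dict keyed by (k, state).
    """
    merged = {}
    for k, table in per_project.items():
        for state, gamma in table.items():
            merged[(k, state)] = gamma
    return merged


def index_policy(indices, present):
    """Select a job type of currently largest index among those present.

    indices -- dict: job type -> index
    present -- dict: job type -> number of jobs of that type in the system
    """
    candidates = [j for j, count in present.items() if count > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda j: indices[j])


# Usage with made-up index values for two project types.
tables = {1: {"a": 0.5, "b": 2.0}, 2: {"a": 1.0}}
pooled = merge_project_indices(tables)
chosen = index_policy(pooled, {(1, "a"): 1, (1, "b"): 0, (2, "a"): 2})
```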
Let us consider briefly the problem of optimizing a nonlinear cost function on the performance vector. Bhattacharya et al. [4] addressed the problems of separable convex, min-max, lexicographic and semi-separable convex optimization over an extended polymatroid, and provided iterative algorithms for their solution. Analogously to what we did in the linear reward case, the control problem in the case of a nonlinear reward function can be reduced to solving a nonlinear programming problem over the base of an extended polymatroid.

3 Branching Bandit Processes

Consider the following branching bandit process introduced by Weiss [35], who observed that it can model a large number of dynamic and stochastic scheduling processes. There is a finite number of project types, labeled $k = 1, \ldots, K$. A type $k$ project can be in one of a finite number of states $i_k \in E_k$, which correspond to stages in the development of the project. It is convenient in what follows to combine these two indicators into a single label $i = i_k$, the state of a project. Let $E = \bigcup_{k=1}^{K} E_k = \{1, \ldots, n\}$ be the finite set of possible states of all project types.

We associate with state $i$ of a project a random time $v_i$ and random arrivals $N_i = (N_{ij})_{j \in E}$. Engaging the project keeps the system busy for a duration $v_i$ (the duration of stage $i$), and upon completion of the stage the project is replaced by a nonnegative integer number of new projects $N_{ij}$, in states $j \in E$. We assume that given $i$, the durations and the descendants $(v_i, (N_{ij})_{j \in E})$ are random variables with an arbitrary joint distribution, independent of all other projects, and identically distributed for the same $i$. Projects are to be selected under a nonidling, nonpreemptive and nonanticipative scheduling policy $u$. We shall refer to this class of policies, which we denote $\mathcal{U}$, as the class of admissible policies. The decision epochs are $t = 0$ and the instants at which a project stage is completed and there is some project present. If $m_i$ is the number of projects in state $i$ present at a given time, then it is clear that this process is a semi-Markov decision process with states $m = (m_1, \ldots, m_n)$.

The model of arm-acquiring bandits (see Whittle [37], [38]) is a special case of a branching bandit process, in which the descendants $N_i$ consist of two parts: (1) a transition of the project engaged to a new state, and (2) external arrivals of new projects, independent of $i$ or of the transition. The classical multi-armed bandit problem corresponds to the special case that there are no external arrivals of projects, and the stage durations are 1.

The branching bandit process is thus a special case of a project selection process. Therefore, as described in Subsection 2.4, it can be interpreted as a job scheduling process. Engaging a type $i$ job in the job scheduling model corresponds to selecting a project of state $i$ in the branching bandit model. We may interpret that each project consists of several jobs. In the analysis that follows, we shall refer to a project in state $i$ as a type $i$ job.

In this section, we will define two different performance measures for a branching bandit process. The first one will be appropriate for modelling a discounted reward-tax structure. The second one will allow us to model an undiscounted tax structure. In each case we will show that they satisfy generalized conservation laws, and that the corresponding optimal control problem can be solved by a direct application of the results of Section 2.

Let $S \subseteq E$ be a subset of job types. We shall refer to jobs with types in $S$ as $S$-jobs. Assume now that at time $t = 0$ there is only a single job in the system, which is of type $i$. Consider the sequence of successive job selections corresponding to an admissible policy $u$ that gives complete priority to $S$-jobs. This sequence proceeds until all $S$-jobs are exhausted for the first time, or indefinitely. Call this an $(i, S)$ period. Let $T_i^S$ be the duration (possibly infinite) of an $(i, S)$ period. It is easy to see that the distribution of $T_i^S$ is independent of the admissible policy used, as long as it gives complete priority to $S$-jobs. Note that an $(i, \emptyset)$ period is distributed as $v_i$. It will be convenient to introduce the following additional notation:

$v_{i,k}$ = duration of the $k$th selection of a type $i$ job; notice that the distribution of $v_{i,k}$ is independent of $k$ ($v_i$).

$\tau_{i,k}$ = time at which the $k$th selection of a type $i$ job occurs.

$\nu_i$ = number of times a type $i$ job is selected (can be infinity).

$\{T_{i,k}^S\}_{k \ge 1}$ = duration of the $(i, S)$-period that starts with the $k$th selection of a type $i$ job.

$Q_i(t)$ = number of type $i$ jobs in the system at time $t$. $Q(t)$ denotes the vector of the $Q_i(t)$'s. We assume $Q(0) = (m_1, \ldots, m_n)$ is known.

$T_m^S$ = time until all $S$-jobs are exhausted for the first time (can be infinity); note that $T_m^E$ is the duration of the busy period.

$I_i(t)$ = 1, if a type $i$ job is being engaged at time $t$; 0, otherwise.

$\Delta_{i,k}^S = \inf\{\Delta \ge v_{i,k} : \sum_{j \in S} I_j(\tau_{i,k} + \Delta) = 1\}$; note that $\Delta_{i,k}^S$ is the interval between the $k$th selection of a type $i$ job and the next selection, if any, of an $S$-job. If no more jobs in $S$ are selected, then $\Delta_{i,k}^S = T_m^E - \tau_{i,k}$, the remaining interval of the busy period.

$\Delta_m^S = \inf\{t : \sum_{i \in S} I_i(t) = 1\}$; note that $\Delta_m^S$ is the interval until the first job in $S$, if any, is selected. If no job in $S$ is selected, $\Delta_m^S = T_m^E$, the busy period.

Proposition 4 Assume that jobs are selected in the branching bandit process under an admissible policy. Then, for every $S \subseteq E$:

(a) If the policy gives complete priority to $S^c$-jobs then the busy period $[0, T_m^E)$ can be partitioned as follows:

$$[0, T_m^E) = [0, T_m^{S^c}) \cup \bigcup_{i \in S} \bigcup_{k=1}^{\nu_i} [\tau_{i,k}, \tau_{i,k} + T_{i,k}^{S^c}) \qquad \text{w. p. 1.} \tag{30}$$

(b) The busy period $[0, T_m^E)$ can be partitioned as follows:

$$[0, T_m^E) = [0, \Delta_m^S) \cup \bigcup_{i \in S} \bigcup_{k=1}^{\nu_i} [\tau_{i,k}, \tau_{i,k} + \Delta_{i,k}^S) \qquad \text{w. p. 1.} \tag{31}$$

(c) The following inequalities hold w. p. 1:

$$\Delta_{i,k}^S \le T_{i,k}^{S^c}, \qquad i \in S, \tag{32}$$

and

$$\Delta_m^S \le T_m^{S^c}. \tag{33}$$

Proof (a) Intuitively (30) expresses the fact that under a policy that gives complete priority to $S^c$-jobs, the duration of a busy period is partitioned into (1) the initial interval in which all jobs in $S^c$ are exhausted for the first time, and (2) intervals in which all jobs in $S^c$ are exhausted, given that after working on a job in $S$ we clear first all jobs in $S^c$ that were generated.

More formally, let $u$ be an admissible policy that gives complete priority to $S^c$-jobs. It is easy to see then that the intervals in the right hand side of (30) are disjoint. Moreover, the inclusion $\supseteq$ is obvious. In order to show that (30) is indeed a partition, let us show the inclusion $\subseteq$. Let $t \in [0, T_m^E) \setminus [0, T_m^{S^c})$; otherwise we are done. Since $u$ is a nonidling policy, it is clear that at time $t$ some job is being engaged. Let $j$ be the type of this job. If $j \in S$ then $t \in [\tau_{j,k}, \tau_{j,k} + T_{j,k}^{S^c})$ for some $k$, and we are done. Let us assume that $j \in S^c$. Let us define

$$D = \{\tau_{i,k} : i \in S,\ k \in \{1, \ldots, \nu_i\},\ \text{and } \tau_{i,k} < t\}.$$

Since $t \ge T_m^{S^c}$ it follows that $D \ne \emptyset$. Now, since by hypothesis $E[v_i] > 0$, for all $i$, it follows that $D$ is a finite set. Let $i^* \in S$ and $k^*$ be such that

$$\tau_{i^*,k^*} = \max_{\tau \in D} \tau.$$

Assume that $t \notin [\tau_{i^*,k^*}, \tau_{i^*,k^*} + T_{i^*,k^*}^{S^c})$. Hence, it must be $t \ge \tau_{i^*,k^*} + T_{i^*,k^*}^{S^c}$. Since the policy is nonidling, it follows that $\tau_{i^*,k^*} + T_{i^*,k^*}^{S^c}$ is a decision epoch at which one starts working on some type $i \in S$ job. Since the job engaged at time $t$ is of type $j \in S^c$, this epoch is earlier than $t$, so that it belongs to $D$ and is larger than $\tau_{i^*,k^*}$, contradicting the definition of $i^*$ and $k^*$. Hence, it must be $t \in [\tau_{i^*,k^*}, \tau_{i^*,k^*} + T_{i^*,k^*}^{S^c})$, and this completes the proof of (a).

(b) Equality (31) formalizes the fact that under an admissible policy the busy period can be decomposed into (1) the interval until the first job in $S$ is selected, (2) the disjoint union of the intervals between selections of successive $S$-jobs and (3) the interval between the last selection of a job in $S$ and the end of the busy period. Note that if no $S$-job is selected, then $\nu_i = 0$, for $i \in S$, and $\Delta_m^S = T_m^E$, thus reducing the partition to a single interval.

(c) Let $\tau_{i,k}$ be the time of the $k$th selection of a type $i$ job ($i \in S$). Since the next selection (if any) of an $S$-job can occur, at most, at the end of the $(i, S^c)$ period $[\tau_{i,k}, \tau_{i,k} + T_{i,k}^{S^c})$, inequality (32) follows. On the other hand, since the time until the first selection of an $S$-job, $\Delta_m^S$, can be at most the duration $T_m^{S^c}$ of the initial period in which the $S^c$-jobs are exhausted, (33) follows.

3.1 Discounted Branching Bandits

In this subsection we will introduce a family of performance measures for branching bandits, $\{x^u(\alpha)\}_{\alpha > 0}$, that satisfy generalized conservation laws. They are appropriate for modelling a linear discounted reward-tax structure on the branching bandit process. We have already defined the indicator

$$I_i(t) = \begin{cases} 1, & \text{if a type } i \text{ job is being engaged at time } t; \\ 0, & \text{otherwise,} \end{cases} \tag{34}$$

and, for a given $\alpha > 0$, we define

$$x_i^u(\alpha) = E_u\Big[\int_0^\infty e^{-\alpha t} I_i(t)\, dt\Big] = \int_0^\infty E_u[I_i(t)]\, e^{-\alpha t}\, dt, \qquad i \in E. \tag{35}$$

3.1.1 Generalized Conservation Laws

In this section we prove that the performance measure for branching bandits defined in (35) satisfies generalized conservation laws. Let us define

$$A_{i,\alpha}^S = \frac{E\big[\int_0^{T_i^{S^c}} e^{-\alpha t}\, dt\big]}{E\big[\int_0^{v_i} e^{-\alpha t}\, dt\big]}, \qquad i \in S, \tag{36}$$

and

$$b_\alpha(S) = E\Big[\int_0^{T_m^E} e^{-\alpha t}\, dt\Big] - E\Big[\int_0^{T_m^{S^c}} e^{-\alpha t}\, dt\Big]. \tag{37}$$

The main result is the following

Theorem 7 (Generalized Conservation Laws for Discounted Branching Bandits) The performance vector for branching bandits $x^u(\alpha)$ satisfies generalized conservation laws (26) and (27) associated with matrix $A_\alpha$ and set function $b_\alpha(\cdot)$.

Proof Let $S \subseteq E$.
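Let us assume that jobs are selected under an admissible policy $u$. This generates a branching bandit process. Let us define two random vectors, $(r_i)_{i \in E}$ and $(r_i^{u,S})_{i \in S}$, as functions of its sample path as follows:

$$r_i = \int_0^\infty I_i(t)\, e^{-\alpha t}\, dt = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + v_{i,k}} e^{-\alpha t}\, dt = \sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{v_{i,k}} e^{-\alpha t}\, dt, \tag{38}$$

and

$$r_i^{u,S} = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} e^{-\alpha t}\, dt = \sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{T_{i,k}^{S^c}} e^{-\alpha t}\, dt, \qquad i \in S. \tag{39}$$

Now, we have

$$x_i^u(\alpha) = E_u[r_i] = E_u\Big[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{v_{i,k}} e^{-\alpha t}\, dt\Big] = \sum_{k=1}^{\infty} E_u\big[e^{-\alpha \tau_{i,k}} 1\{k \le \nu_i\}\big]\, E\Big[\int_0^{v_i} e^{-\alpha t}\, dt\Big] \tag{40}$$

$$= E\Big[\int_0^{v_i} e^{-\alpha t}\, dt\Big]\, E_u\Big[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Big]. \tag{41}$$

Note that equality (40) holds because, since $u$ is nonanticipative, $\tau_{i,k}$ and $v_{i,k}$ are independent random variables. On the other hand, we have

$$E_u[r_i^{u,S}] = E_u\Big[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}} \int_0^{T_{i,k}^{S^c}} e^{-\alpha t}\, dt\Big] = \sum_{k=1}^{\infty} E_u\big[e^{-\alpha \tau_{i,k}} 1\{k \le \nu_i\}\big]\, E\Big[\int_0^{T_i^{S^c}} e^{-\alpha t}\, dt\Big] \tag{42}$$

$$= E\Big[\int_0^{T_i^{S^c}} e^{-\alpha t}\, dt\Big]\, E_u\Big[\sum_{k=1}^{\nu_i} e^{-\alpha \tau_{i,k}}\Big]. \tag{43}$$

Note that equality (42) holds because, since $u$ is nonanticipative, $\tau_{i,k}$ and $T_{i,k}^{S^c}$ are independent. Hence, by (41) and (43)

$$E_u[r_i^{u,S}] = A_{i,\alpha}^S\, x_i^u(\alpha), \qquad i \in S, \tag{44}$$

and we obtain:

$$E_u\Big[\sum_{i \in S} r_i^{u,S}\Big] = \sum_{i \in S} A_{i,\alpha}^S\, x_i^u(\alpha). \tag{45}$$

We first show that generalized conservation law (26) holds. Consider a policy $\pi$ that gives complete priority to $S^c$-jobs. Applying Proposition 4 (part (a)), we obtain:

$$\int_0^{T_m^E} e^{-\alpha t}\, dt = \int_0^{T_m^{S^c}} e^{-\alpha t}\, dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} e^{-\alpha t}\, dt = \int_0^{T_m^{S^c}} e^{-\alpha t}\, dt + \sum_{i \in S} r_i^{\pi,S}. \tag{46}$$

Hence, taking expectations and using equation (45) we obtain

$$E\Big[\int_0^{T_m^E} e^{-\alpha t}\, dt\Big] = E\Big[\int_0^{T_m^{S^c}} e^{-\alpha t}\, dt\Big] + \sum_{i \in S} A_{i,\alpha}^S\, x_i^\pi(\alpha),$$

or equivalently, by (37),

$$\sum_{i \in S} A_{i,\alpha}^S\, x_i^\pi(\alpha) = b_\alpha(S),$$

which proves that generalized conservation law (26) holds.

We next show that generalized conservation law (27) is satisfied. Since jobs are selected under admissible policy $u$, Proposition 4 (part (b)) applies, and we can write

$$\int_0^{T_m^E} e^{-\alpha t}\, dt = \int_0^{\Delta_m^S} e^{-\alpha t}\, dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + \Delta_{i,k}^S} e^{-\alpha t}\, dt. \tag{47}$$

On the other hand, we have

$$\sum_{i \in S} r_i^{u,S} = \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} e^{-\alpha t}\, dt \ \ge\ \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + \Delta_{i,k}^S} e^{-\alpha t}\, dt \tag{48}$$

$$= \int_0^{T_m^E} e^{-\alpha t}\, dt - \int_0^{\Delta_m^S} e^{-\alpha t}\, dt \tag{49}$$

$$\ge \int_0^{T_m^E} e^{-\alpha t}\, dt - \int_0^{T_m^{S^c}} e^{-\alpha t}\, dt. \tag{50}$$

Notice that (48) follows by Proposition 4 (part (c)), (49) follows by (47), and (50) by Proposition 4 (part (c)). Hence, taking expectations in (50), and applying (45) we obtain

$$\sum_{i \in S} A_{i,\alpha}^S\, x_i^u(\alpha) \ \ge\ E\Big[\int_0^{T_m^E} e^{-\alpha t}\, dt\Big] - E\Big[\int_0^{T_m^{S^c}} e^{-\alpha t}\, dt\Big] = b_\alpha(S),$$

which proves that generalized conservation law (27) holds, and this completes the proof of the theorem.

Hence, by the results of Subsection 2.3 we obtain:

Corollary 2 The performance space for branching bandits corresponding to the performance vector $x^u(\alpha)$ is the extended polymatroid base $B(A_\alpha, b_\alpha)$; furthermore, the vertices of $B(A_\alpha, b_\alpha)$ are the performance vectors corresponding to the fixed priority rules.

3.1.2 The Discounted Reward-Tax Problem

Let us associate with a branching bandit process the following linear reward-tax structure: An instantaneous reward of $R_i$ is received at the completion epoch of a type $i$ job. In addition, a holding tax $C_i$ is incurred continuously during the interval that a type $i$ job is in the system. Rewards and taxes are discounted in time with a discount factor $\alpha > 0$. Let us denote

$V_{u,\alpha}^{(R,C)}(m)$ = expected total present value of rewards received minus taxes incurred under policy $u$, given that there are initially $m_i$ jobs of type $i$ in the system, for $i \in E$.

The discounted reward-tax problem is the following optimal control problem: find an admissible policy $u^*$ that maximizes $V_{u,\alpha}^{(R,C)}(m)$ over all admissible policies $u$. In this section we reduce the reward-tax problem to the pure rewards case (where $C = 0$) and show how to solve the problem using algorithm $A_1$. We also find a closed formula for $V_{u,\alpha}^{(R,C)}(m)$.

The Pure Rewards Case. Let us introduce the transform of $v_i$, i.e., $\Psi_i(\theta) = E[e^{-\theta v_i}]$. We then have

$$V_{u,\alpha}^{(R,0)}(m) = E_u\Big[\sum_{i \in E} \sum_{k=1}^{\nu_i} R_i\, e^{-\alpha(\tau_{i,k} + v_{i,k})}\Big] \tag{53}$$

$$= \sum_{i \in E} R_i\, \frac{E[e^{-\alpha v_i}]}{E\big[\int_0^{v_i} e^{-\alpha t}\, dt\big]}\, x_i^u(\alpha) \tag{54}$$

$$= \sum_{i \in E} \frac{\alpha \Psi_i(\alpha)}{1 - \Psi_i(\alpha)}\, R_i\, x_i^u(\alpha). \tag{55}$$

Notice that equality (54) holds by (41).

It is also straightforward to model the case in which rewards are received continuously during the interval that a type $i$ job is being engaged rather than at a completion epoch. Let $\tilde{V}_{u,\alpha}^{(R,0)}(m)$ be the expected total present value of rewards. Then

$$\tilde{V}_{u,\alpha}^{(R,0)}(m) = E_u\Big[\sum_{i \in E} R_i \int_0^\infty e^{-\alpha t} I_i(t)\, dt\Big] = \sum_{i \in E} R_i\, x_i^u(\alpha).$$

The Reward-Tax Problem; Reduction to the Pure Rewards Case. We will next show how to reduce the reward-tax problem to the pure rewards case using the following idea introduced by Bell [1] (see also Harrison [18], Stidham [28] and Whittle [38] for further discussion). The expected present value of holding taxes is the same whether they are charged continuously in time, or according to the following charging scheme: At the arrival epoch of a type $i$ job, charge the system with an instantaneous entrance charge of $(C_i/\alpha)$, equal to the total discounted continuous holding cost that would be incurred if the job remained within the system forever; at the departure epoch of the job (if it ever departs), credit the system with an instantaneous departure refund of $(C_i/\alpha)$, thus refunding that portion of the entrance cost corresponding to residence beyond the departure epoch. Therefore, we can write

$$V_{u,\alpha}^{(R,C)}(m) = E_u[\text{Rewards}] - \big( E_u[\text{Charges at } t = 0] + E_u[\text{Entrance charges}] - E_u[\text{Departure refunds}] \big) = V_{u,\alpha}^{(R_\alpha,0)}(m) - \sum_{i \in E} m_i (C_i/\alpha), \tag{56}$$

where

$$R_{\alpha,i} = R_i + (C_i/\alpha) - \sum_{j \in E} E[N_{ij}]\, (C_j/\alpha). \tag{57}$$

From equation (56) it is straightforward to apply the results of Section 2 to solve the control problem: use algorithm $A_1$ with input $(\hat{R}_\alpha, A_\alpha)$, where

$$\hat{R}_{\alpha,i} = \frac{\alpha \Psi_i(\alpha)}{1 - \Psi_i(\alpha)}\, R_{\alpha,i}.$$

Let $\gamma_1(\alpha), \ldots, \gamma_n(\alpha)$ be the corresponding generalized Gittins indices. Then we have

Theorem 8 (Optimality and Indexability: Discounted Branching Bandits) (a) Algorithm $A_1$ provides an optimal policy for the discounted reward-tax branching bandit problem;

(b) An optimal policy is to work at each decision epoch on a project with largest index $\gamma_i(\alpha)$.

The previous theorem characterizes the structure of the optimal policy. Moreover, since in Proposition 6 below, we find closed form expressions for the matrix $A_\alpha$, and the set function $b_\alpha(\cdot)$,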
we can compute not only the structure, but also the performance of the optimal policy (optimal profit, optimal extreme point of the extended polymatroid). Note also that the decomposition of the indices does not hold in the general case; in other words, the generalized indices of the states of a type $k$ project depend in general on characteristics of project types other than $k$, i.e., branching bandits is an example of an indexable but not decomposable system. We may also prove the following result:

Theorem 9 (Continuity of generalized Gittins indices) The generalized Gittins indices $\gamma_1(\alpha), \ldots, \gamma_n(\alpha)$ are continuous functions of the discount factor $\alpha$, for $\alpha > 0$.

Proof It is easy to see that the generalized Gittins indices depend continuously on the input of algorithm $A_1$. Also, since the function $\alpha \mapsto (\hat{R}_\alpha, A_\alpha)$ is continuous the result follows.

The Pure Tax Case: Minimizing Time-Dependent Expected Number in System. In several applications of branching bandits (for example queueing systems) one is often interested in minimizing a weighted sum of discounted time-dependent expected number of jobs in the system. Let $Q_j^{u*}(\cdot)$ denote the Laplace transform of the time-dependent expected number of type $j$ jobs in the system under policy $u$, i.e.,

$$Q_j^{u*}(\theta) = \int_0^\infty E_u[Q_j(t) \mid Q(0) = m]\, e^{-\theta t}\, dt, \qquad j \in E.$$

An interesting optimization problem is to:

$$\min_{u \in \mathcal{U}} \sum_{j \in E} C_j\, Q_j^{u*}(\alpha).$$

The problem can be modelled as a pure tax problem as follows:

$$\sum_{j \in E} C_j\, Q_j^{u*}(\alpha) = -V_{u,\alpha}^{(0,C)}(m),$$

and thus by making $R = 0$, $C_j = 1$ and $C_i = 0$ for $i \ne j$ in (56) we obtain, writing $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise,

$$Q_j^{u*}(\alpha) = \frac{m_j}{\alpha} + \sum_{i \in E} \frac{\Psi_i(\alpha)}{1 - \Psi_i(\alpha)}\, \big(E[N_{ij}] - \delta_{ij}\big)\, x_i^u(\alpha), \qquad j \in E. \tag{60}$$

See Harrison [18] for a similar result in the context of a multiclass queue.

3.1.3 Interpretation of Generalized Gittins Indices in Discounted Branching Bandits

Consider the following modification of the branching bandits problem: We modify the original problem by adding an additional project type, which we call 0, with only one state/stage of infinite duration, that is, $v_0 = \infty$ with probability 1. A reward $R_0$, continuously discounted, is received for each unit of time that a type 0 project is engaged. Notice that the choice of working on project type 0 can be interpreted as the choice of retirement from the original problem for a pension of $R_0$, continuously discounted in time. Now, the modified problem is still a branching bandits problem. Let us assume that at time $t = 0$ there are only two projects present, one of type 0 and another in state $i \in E$. We may then ask the following question: Which is the smallest value of the pension $R_0$ which makes the option of retirement (working on project type 0) preferable to the option of continuation (working on the project in state $i$)? Let us call this equitable surrender value $R_0^*(i)$. We have then the following result

Proposition 5 The generalized Gittins index of project state $i$ in the original branching bandits problem coincides with the equitable surrender value of state $i$, $R_0^*(i)$.

Proof Let $\gamma_1, \ldots, \gamma_n$ be the generalized Gittins indices corresponding to the original branching bandits problem. Let $\gamma_0^0, \gamma_1^0, \ldots, \gamma_n^0$ be the generalized Gittins indices for the modified problem. Let us partition the modified state space as $\{0\} \cup E$. It is easy to verify that the decomposition condition (17) holds. Hence Theorem 3 applies, and therefore we have

$$\gamma_0^0 = R_0 \quad \text{and} \quad \gamma_j^0 = \gamma_j, \qquad j \in E. \tag{61}$$

Now, since by Theorem 8 it is optimal to work on a project with largest current generalized Gittins index, it follows that the surrender reward $R_0$ which makes the options of continuation and of retirement (with reward $R_0$) equally attractive is $R_0 = \gamma_i$. But by definition $R_0^*(i)$ is such a breakpoint. Therefore $R_0^*(i) = \gamma_i$ and the proof is complete.

Whittle [36], [38] introduced the idea of a retirement option in his analysis of the multi-armed bandit problem, and provided an interpretation of the Gittins indices as equitable surrender values. Weber [34] also makes use of this characterization of the Gittins indices in his intuitive proof. Here we extend this interpretation to the more general case of branching bandits. From this characterization it follows that the generalized Gittins indices coincide indeed with the well known Gittins indices in the classical multi-armed bandit problem, which justifies their name.

3.1.4 Computation of $A_\alpha$ and $b_\alpha(\cdot)$

The results of the previous sections are structural, but do not lead to explicit computations of the matrix $A_\alpha$ and the set function $b_\alpha(\cdot)$ appearing in the generalized conservation laws (26) and (27) for the branching bandit problem. Our goal in this section is to compute from generic data the matrix $A_\alpha$ and the set function $b_\alpha(\cdot)$. Combined with the previous results these computations make it possible to evaluate the performance of specific policies as well as the optimal policy.

As generic data for the branching bandit process, we assume that the joint distribution of $v_i, (N_{ij})_{j \in E}$ is given by the transform

$$\Phi_i(\theta, z_1, \ldots, z_n) = E\big[e^{-\theta v_i} z_1^{N_{i1}} \cdots z_n^{N_{in}}\big]. \tag{62}$$

In addition, we have already introduced the transform of the marginal distribution of $v_i$ (denoted $G_i(\cdot)$):

$$\Psi_i(\theta) = E[e^{-\theta v_i}] = \int_0^\infty e^{-\theta t}\, dG_i(t). \tag{63}$$

Finally the vector $m = (m_1, \ldots, m_n)$ of jobs initially present is given.

As we saw in the previous section the duration of an $(i, S)$-period, $T_i^S$, plays a crucial role. We will compute its moment generating function

$$\Psi_i^S(\theta) = E[e^{-\theta T_i^S}]. \tag{64}$$

For this reason we decompose the duration of an $(i, S)$-period as a sum of independent random variables as follows:

$$T_i^S = v_i + \sum_{j \in S} \sum_{k=1}^{N_{ij}} T_{j,k}^S, \tag{65}$$

where $v_i$, $\{T_{j,k}^S\}_{k \ge 1}$ are independent. Therefore,

$$\Psi_i^S(\theta) = E\Big[e^{-\theta v_i} \prod_{j \in S} \big(\Psi_j^S(\theta)\big)^{N_{ij}}\Big] = \Phi_i\big(\theta, (\Psi_j^S(\theta))_{j \in S}, 1_{S^c}\big), \qquad i \in E. \tag{66}$$

Given $S$, fixed point system (66) provides a way to compute the values of $\Psi_i^S(\theta)$, for $i \in E$. We now have the elements to prove the following result:

Proposition 6 (Computation of $A_\alpha$ and $b_\alpha(\cdot)$) For a branching bandit process, matrix $A_\alpha$ and set function $b_\alpha(\cdot)$ satisfy the following relations:

$$A_{i,\alpha}^S = \frac{1 - \Psi_i^{S^c}(\alpha)}{1 - \Psi_i(\alpha)}, \qquad i \in S, \ S \subseteq E, \tag{67}$$

$$b_\alpha(S) = \frac{1}{\alpha} \prod_{j \in S^c} \big[\Psi_j^{S^c}(\alpha)\big]^{m_j} - \frac{1}{\alpha} \prod_{j \in E} \big[\Psi_j^E(\alpha)\big]^{m_j}, \qquad S \subseteq E. \tag{68}$$

Proof Relation (67) follows directly from the definition of $A_{i,\alpha}^S$. On the other hand, we have

$$T_m^S = \sum_{i \in S} \sum_{k=1}^{m_i} T_{i,k}^S. \tag{69}$$

Hence,

$$E\Big[\int_0^{T_m^S} e^{-\alpha t}\, dt\Big] = \frac{1}{\alpha}\Big(1 - E[e^{-\alpha T_m^S}]\Big) = \frac{1}{\alpha}\Big(1 - \prod_{i \in S} \big[\Psi_i^S(\alpha)\big]^{m_i}\Big),$$

and (68) follows from the definition (37) of $b_\alpha(S)$.

3.2 Undiscounted Branching Bandits

In this section we address branching bandits with no discounts. Clearly, in the case of pure rewards the problem is trivial, since all policies have the same reward. Under a linear undiscounted tax structure on the branching bandit process, however, the problem becomes interesting. Indeed, since an optimal policy under the time average holding cost criterion also minimizes the expected total holding cost in each busy period (see Nain et al. [24]), modelling and solving undiscounted branching bandits leads to the solution of several classical queueing scheduling problems. More importantly, our approach reveals rigorously the connections of discounted and undiscounted problems, which, in our opinion, have not been thoroughly addressed in the literature. We give a concrete example: after solving an indexable discounted scheduling problem, researchers say that the same ordering of the jobs holds for the undiscounted problem as the discount factor $\alpha \to 0$, provided there are no ties. It is not clear, however, what happens when there are ties of the corresponding indices.

We will introduce in this subsection a performance measure $z^u$ for a branching bandit process that satisfies generalized conservation laws. It is appropriate for modelling a linear undiscounted tax structure on the branching bandit process. We shall assume in the following development that all the expectations that appear are finite. We will show later necessary and sufficient conditions for this assumption to hold.
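Taking expectations in the decomposition (65), with all means assumed finite as above, gives the linear relations $E[T_i^S] = E[v_i] + \sum_{j \in S} E[N_{ij}]\, E[T_j^S]$. For $i \in S$ these form a closed linear system, and the values for $i \notin S$ then follow by substitution. The sketch below carries this out in plain Python; the function names, the list-of-lists layout for the means $E[N_{ij}]$, and the small Gauss-Jordan solver are illustrative assumptions, and the system is assumed to be nonsingular.

```python
def solve_linear(M, rhs):
    """Solve M y = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, M[r])) + [float(rhs[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


def expected_period_durations(mean_v, mean_N, S):
    """Return [E[T_i^S] for every type i], using expectations of (65).

    mean_v -- list of E[v_i]
    mean_N -- matrix (list of lists) of E[N_ij]
    S      -- set of job types given complete priority
    """
    S = sorted(S)
    n = len(mean_v)
    # Closed system for i in S: (I - N_SS) T_S = v_S.
    M = [[(1.0 if i == j else 0.0) - mean_N[i][j] for j in S] for i in S]
    T_S = dict(zip(S, solve_linear(M, [mean_v[i] for i in S]))) if S else {}
    # Every type (inside or outside S) then satisfies the same relation.
    return [mean_v[i] + sum(mean_N[i][j] * T_S[j] for j in S) for i in range(n)]


# Usage: two job types; a type 0 job spawns on average 0.5 type 0 jobs,
# and a type 1 job spawns on average one type 0 job.
durations = expected_period_durations([1.0, 2.0], [[0.5, 0.0], [1.0, 0.0]], {0})
```

With $S = \emptyset$ the function returns the mean stage durations themselves, in line with the remark that an $(i, \emptyset)$ period is distributed as $v_i$.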
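Using the indicator

$$I_i(t) = \begin{cases} 1, & \text{if a type } i \text{ job is being engaged at time } t; \\ 0, & \text{otherwise,} \end{cases}$$

we introduced earlier, we let

$$z_i^u = E_u\Big[\int_0^\infty I_i(t)\, t\, dt\Big], \qquad i \in E. \tag{72}$$

Let us define

$$A_i^S = \frac{E[T_i^{S^c}]}{E[v_i]}, \qquad i \in S,$$

and

$$b(S) = \tfrac{1}{2} E[(T_m^E)^2] - \tfrac{1}{2} E[(T_m^{S^c})^2] + \sum_{i \in S} b_i(S),$$

where

$$b_i(S) = \frac{E[\nu_i]}{2} \Big( E[v_i^2]\, \frac{E[T_i^{S^c}]}{E[v_i]} - E[(T_i^{S^c})^2] \Big). \tag{74}$$

3.2.1 Generalized Conservation Laws

We prove next that the performance measure for a branching bandit process defined in (72) satisfies generalized conservation laws. The main result is the following:

Theorem 10 (Generalized Conservation Laws for Undiscounted Branching Bandits) The performance vector for branching bandits $z^u$ satisfies generalized conservation laws (26) and (27) associated with matrix $A$ and set function $b(\cdot)$.

Proof Let $S \subseteq E$. Let us assume that jobs are selected under an admissible policy $u$. This generates a branching bandit process. Let us define two random vectors, $(r_i)_{i \in E}$ and $(r_i^{u,S})_{i \in S}$, as functions of the sample path as follows:

$$r_i = \int_0^\infty I_i(t)\, t\, dt = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + v_{i,k}} t\, dt = \sum_{k=1}^{\nu_i} \Big( v_{i,k}\, \tau_{i,k} + \frac{v_{i,k}^2}{2} \Big), \qquad i \in E, \tag{76}$$

and

$$r_i^{u,S} = \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} t\, dt = \sum_{k=1}^{\nu_i} \Big( T_{i,k}^{S^c}\, \tau_{i,k} + \frac{(T_{i,k}^{S^c})^2}{2} \Big), \qquad i \in S. \tag{77}$$

Now, we have

$$z_i^u = E_u[r_i] = \sum_{k=1}^{\infty} \Big( E[v_i]\, E_u\big[\tau_{i,k} 1\{k \le \nu_i\}\big] + \frac{E[v_i^2]}{2}\, P_u(k \le \nu_i) \Big) \tag{78}$$

$$= E[v_i]\, E_u\Big[\sum_{k=1}^{\nu_i} \tau_{i,k}\Big] + \frac{E[\nu_i]\, E[v_i^2]}{2}. \tag{79}$$

Note that equality (78) holds because, since $u$ is nonanticipative, $\tau_{i,k}$ and $v_{i,k}$ are independent random variables. On the other hand, we have

$$E_u[r_i^{u,S}] = E_u\Big[\sum_{k=1}^{\nu_i} \Big( T_{i,k}^{S^c}\, \tau_{i,k} + \frac{(T_{i,k}^{S^c})^2}{2} \Big)\Big] = E[T_i^{S^c}]\, E_u\Big[\sum_{k=1}^{\nu_i} \tau_{i,k}\Big] + \frac{E[\nu_i]\, E[(T_i^{S^c})^2]}{2}. \tag{80}$$

Note that equality (80) holds because, since policy $u$ is nonanticipative, $\tau_{i,k}$ and $T_{i,k}^{S^c}$ are independent random variables. Hence, by (79) and (80):

$$\frac{z_i^u - \tfrac{1}{2} E[\nu_i]\, E[v_i^2]}{E[v_i]} = \frac{E_u[r_i^{u,S}] - \tfrac{1}{2} E[\nu_i]\, E[(T_i^{S^c})^2]}{E[T_i^{S^c}]}, \tag{81}$$

and thus we obtain:

$$E_u\Big[\sum_{i \in S} r_i^{u,S}\Big] = \sum_{i \in S} A_i^S z_i^u - \sum_{i \in S} b_i(S). \tag{82}$$

We will first show that generalized conservation law (26) holds. Consider a policy $\pi$ that gives complete priority to $S^c$-jobs. Applying Proposition 4(a), we obtain:

$$\int_0^{T_m^E} t\, dt = \int_0^{T_m^{S^c}} t\, dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} t\, dt, \quad \text{that is,} \quad \frac{(T_m^E)^2}{2} = \frac{(T_m^{S^c})^2}{2} + \sum_{i \in S} r_i^{\pi,S}. \tag{83}$$

Hence, taking expectations and using equation (82) and the definition of $b(S)$ we obtain

$$\sum_{i \in S} A_i^S z_i^\pi = b(S),$$

which proves that generalized conservation law (26) holds.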
We next show that generalized conservation law (27) is satisfied. Let the jobs be selected under admissible policy $u$. Then, Proposition 4 (part (b)) applies, and we can write

$$\int_0^{T_m^E} t\, dt = \int_0^{\Delta_m^S} t\, dt + \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + \Delta_{i,k}^S} t\, dt. \tag{84}$$

On the other hand, we have

$$\sum_{i \in S} r_i^{u,S} = \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + T_{i,k}^{S^c}} t\, dt \ \ge\ \sum_{i \in S} \sum_{k=1}^{\nu_i} \int_{\tau_{i,k}}^{\tau_{i,k} + \Delta_{i,k}^S} t\, dt \tag{85}$$

$$= \int_0^{T_m^E} t\, dt - \int_0^{\Delta_m^S} t\, dt \tag{86}$$

$$\ge \int_0^{T_m^E} t\, dt - \int_0^{T_m^{S^c}} t\, dt. \tag{87}$$

Notice that (85) follows by Proposition 4 (part (c)), (86) follows by (84), and (87) by Proposition 4 (part (c)). Hence, taking expectations in (87), and applying (82) we obtain

$$\sum_{i \in S} A_i^S z_i^u \ \ge\ E\Big[\int_0^{T_m^E} t\, dt\Big] - E\Big[\int_0^{T_m^{S^c}} t\, dt\Big] + \sum_{i \in S} b_i(S) = b(S), \tag{88}$$

which proves that generalized conservation law (27) holds, and this completes the proof of the theorem.

Corollary 3 The performance space for branching bandits corresponding to the performance vector $z^u$ is the extended polymatroid base $B(A, b)$; furthermore, the vertices of $B(A, b)$ are the performance vectors corresponding to the fixed priority rules.

3.2.2 The Undiscounted Tax Problem

Let us associate with a branching bandit process the following linear tax structure: A holding tax $C_i$ per unit time is incurred continuously during the stay of a type $i$ job in the system. Let us denote

$V_u^{(0,C)}(m)$ = expected total tax incurred under policy $u$, given that there are initially $m_i$ type $i$ jobs in the system, for $i \in E$.

The tax problem is the following optimal control problem: find an admissible policy $u^*$ that minimizes $V_u^{(0,C)}(m)$ over all admissible policies $u$. In this section we find a closed formula for $V_u^{(0,C)}(m)$ and show how to solve the problem using algorithm $A_1$. For that purpose we need some preliminary results:

Expected System Times. Let $Q_j^{u*}(\cdot)$, $I_j(\cdot)$ and $x_j^u(\cdot)$ be as in Subsection 3.1. By definition we have

$$Q_j^{u*}(0) = \int_0^\infty E_u[Q_j(t) \mid Q(0) = m]\, dt, \qquad j \in E,$$

and

$$x_j^u(0) = E_u\Big[\int_0^\infty I_j(t)\, dt \,\Big|\, Q(0) = m\Big], \qquad j \in E.$$

From the above formulas, it is clear that

1. $Q_j^{u*}(0)$ is the expected total time spent in the system by type $j$ jobs under policy $u$.

2. $x_j^u(0)$ is the expected total time spent working on type $j$ jobs under policy $u$. Clearly, $x_j^u(0)$ does not depend on the policy $u$. Hence, we shall write $x_j(0) = x_j^u(0)$.

Now, letting $\alpha \searrow 0$ on equation (60) we obtain, writing $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ otherwise,

$$Q_j^{u*}(0) = \sum_{i \in E} \frac{E[N_{ij}] - \delta_{ij}}{E[v_i]} \Big[ (x_i^u)'(0) + \Big( \frac{E[v_i^2]}{2 E[v_i]} - E[v_i] \Big) x_i(0) \Big] \tag{89}$$

$$= -\frac{(x_j^u)'(0)}{E[v_j]} + \sum_{i \in E} \frac{E[N_{ij}]}{E[v_i]}\, (x_i^u)'(0) + h_j, \tag{90}$$

where

$$h_j = \sum_{i \in E} \frac{E[N_{ij}] - \delta_{ij}}{E[v_i]} \Big( \frac{E[v_i^2]}{2 E[v_i]} - E[v_i] \Big) x_i(0),$$

and $(x_i^u)'(0)$ denotes the right derivative of $x_i^u(\alpha)$ at $\alpha = 0$, that is:

$$(x_i^u)'(0) = -E_u\Big[\int_0^\infty I_i(t)\, t\, dt\Big] = -z_i^u. \tag{91}$$

Hence, we have

$$Q_j^{u*}(0) = \frac{z_j^u}{E[v_j]} - \sum_{i \in E} \frac{E[N_{ij}]}{E[v_i]}\, z_i^u + h_j, \qquad j \in E. \tag{92}$$

Modelling and Solution of the Tax Problem. We have, by (92),

$$V_u^{(0,C)}(m) = \sum_{j \in E} C_j\, Q_j^{u*}(0) = \sum_{i \in E} \frac{C_i - \sum_{j \in E} E[N_{ij}]\, C_j}{E[v_i]}\, z_i^u + \sum_{j \in E} C_j h_j. \tag{93}$$

From equation (93) it is straightforward to apply the results of Section 2 to solve the optimal control problem: use algorithm $A_1$ with input $(R, A)$, where

$$R_i = \frac{C_i - \sum_{j \in E} E[N_{ij}]\, C_j}{E[v_i]}. \tag{94}$$

Let $\gamma_1, \ldots, \gamma_n$ be the corresponding generalized Gittins indices. Then we have the result

Theorem 11 (Optimality and Indexability: Undiscounted Branching Bandits) (a) Algorithm $A_1$ provides an optimal policy for the undiscounted tax branching bandit problem;

(b) An optimal policy is to work at each decision epoch on a project with largest current index $\gamma_i$.

3.2.3 Computation of $A$ and $b(\cdot)$

In this section we compute the matrix $A$ and the set function $b(\cdot)$ as follows. Recall that

$$A_i^S = \frac{E[T_i^{S^c}]}{E[v_i]}$$

and

$$b(S) = \tfrac{1}{2} E[(T_m^E)^2] - \tfrac{1}{2} E[(T_m^{S^c})^2] + \sum_{i \in S} b_i(S).$$

From equation (65) we obtain, taking expectations:

$$E[T_i^S] = E[v_i] + \sum_{j \in S} E[N_{ij}]\, E[T_j^S], \qquad i \in S. \tag{95}$$

Solving this linear system we obtain $E[T_i^S]$. Note that the computation of $A_i^S$ is much easier in the undiscounted case compared with the discounted case, where we had to solve a system of nonlinear equations. Also, applying the conditional variance formula to (65) we obtain:

$$\mathrm{Var}[T_i^S] = \mathrm{Var}[v_i] + (E[T_j^S])'_{j \in S}\, \mathrm{Cov}[(N_{ij})_{j \in S}]\, (E[T_j^S])_{j \in S} + \sum_{j \in S} E[N_{ij}]\, \mathrm{Var}[T_j^S], \qquad i \in S. \tag{96}$$

Solving this linear system we obtain $\mathrm{Var}[T_i^S]$ and thus $E[(T_i^S)^2]$. Moreover,

$$E[\nu_j] = m_j + \sum_{i \in E} E[N_{ij}]\, E[\nu_i], \qquad j \in E.$$
Finally, from equation (69) we obtain

$$E[T_m^S] = \sum_{i \in S} m_i\,E[T_i^S] \tag{98}$$

and

$$\mathrm{Var}[T_m^S] = \sum_{i \in S} m_i\,\mathrm{Var}[T_i^S]. \tag{99}$$

3.2.4 Stability condition

We investigate in this section under what conditions the linear systems (95) and (96) have a positive solution for all sets $S \subseteq E$. In this way we can address the stability of a branching bandits process, in the sense that the first two moments of a busy period of a branching bandit process are finite. Let $N$ denote the matrix of $E[N_{ij}]$.

Theorem 12 (Stability of branching bandits) The branching bandits process is stable if and only if the matrix $I - N$ is positive definite.

Proof. Suppose $I - N$ is positive definite. We will show the system is stable. System (95) can be written in vector notation as follows:

$$(I - N)_S\,T_S = v_S, \tag{100}$$

where $T_S = (E[T_i^S])_{i \in S}$. Solving the system using Cramer's rule and expanding the determinant in the numerator along the column $v_S$ we obtain relation (101) [display illegible in source], whose coefficients are nonnegative numbers (which are determinants themselves). If $I - N$ is positive definite, then $\det[(I - N)_S] > 0$ for all $S \subseteq E$, and thus system (95) has a solution $E[T_i^S] \ge 0$ for all $i \in S$ and $S \subseteq E$. Similarly, (96) can be written as $(I - N)_S\,x_S = u_S$, where $x_S = (\mathrm{Var}[T_i^S])_{i \in S}$ and $u_S \ge 0$. Therefore, using the same argument, $\mathrm{Var}[T_i^S] \ge 0$. Hence, from (98) and (99) it follows that $E[T_m^S]$ and $\mathrm{Var}[T_m^S]$ have finite nonnegative values for all initial vectors $m$, i.e., the system is stable.

Conversely, if the system is stable, i.e., the first two moments of the busy periods are finite, then system (100) has a positive solution for all $S \subseteq E$. We will show by induction on $|S|$ that $\det[(I - N)_S] > 0$ for all $S \subseteq E$. The case $|S| = 1$ is immediate [display illegible in source]. Assuming that the induction hypothesis is true for $|S| \le k$, we use (101) to obtain a relation [display illegible in source] which, together with the induction hypothesis, implies that $\det[(I - N)_S] > 0$ for $|S| = k + 1$. Therefore, $I - N$ is positive definite.

Note that the condition ($I - N$ is positive definite) naturally generalizes the stability condition $\rho < 1$ in queueing systems as follows: if we interpret a queueing system as a branching bandit, then the condition that $I - N$ is positive definite translates to $E[N] = \rho = \lambda E[v] < 1$, since $N$ is the number of customers that arrive (at a rate $\lambda$) during the service time of a customer.

3.3 Relation between Discounted and Undiscounted Tax Problem

In this subsection we study the asymptotic behaviour of the optimal policies in the discounted tax problem as the discount factor $\alpha$ approaches 0, and its relation with the undiscounted tax problem, which corresponds to $\alpha$ equal to 0. It is easy to see, using the notation of Subsections 3.1 and 3.2, that

$$\lim_{\alpha \searrow 0} A_{i,\alpha}^S = A_i^S, \tag{102}$$

and

$$\lim_{\alpha \searrow 0} \alpha R_{i,\alpha} = \frac{C_i - \sum_{j \in E} E[N_{ij}]\,C_j}{E[v_i]} = R_i. \tag{103}$$

Therefore, because of the structure of the generalized Gittins indices (see Proposition 3) it follows from (102) and (103) that the generalized Gittins indices of the undiscounted and discounted tax problem are related as follows:

$$\lim_{\alpha \searrow 0} \alpha\,\gamma_i(\alpha) = \gamma_i. \tag{104}$$

A consequence of (104) is that a policy which is asymptotically optimal in the discounted tax problem for $\alpha \searrow 0$ will be optimal for the undiscounted problem.

4 Applications

In this section we apply the previous theory to several classical stochastic scheduling problems.

4.1 The Multi-armed Bandit Problem

The multi-armed bandit problem was defined in the introduction. There are $K$ parallel projects, indexed $k = 1, \ldots, K$. Project $k$ can be in one of a finite number of states $i_k \in E_k$. At each instant of discrete time $t = 0, 1, \ldots$ one can work on only a single project. If one works on project $k$ in state $i_k(t)$ at time $t$, then one receives an immediate expected reward of $R_{i_k(t)}$. Rewards are additive and discounted in time by a discount factor $0 < \beta < 1$. The state $i_k(t)$ changes to $i_k(t+1)$ by a Markov transition rule (which may depend on $k$, but not on $t$), while the states of the projects one has not engaged remain unchanged, i.e., $i_l(t+1) = i_l(t)$ for $l \ne k$. Let $P^k = (p_{ij}^k)_{i,j \in E_k}$ be the matrix of Markov transition probabilities corresponding to project $k$. The problem is how to allocate one's resources to projects sequentially in time in order to maximize expected total discounted reward over an infinite horizon. That is, if $j(t)$ denotes the state of the project engaged at time $t$, the goal is to find a nonidling and nonanticipative scheduling policy $u$ that maximizes

$$E_u\Bigl[\sum_{t=0}^{\infty} \beta^t R_{j(t)}\Bigr]. \tag{105}$$

We model the problem as a branching bandits problem in order to apply the results of the previous section. For this reason we set $e^{-\alpha} = \beta$ and $v_i = 1$. We also define the matrix $P = (p_{ij})_{i,j \in E}$ by

$$p_{ij} = \begin{cases} p_{ij}^k, & \text{if } i, j \in E_k \text{ for some } k = 1, \ldots, K; \\ 0, & \text{otherwise.} \end{cases}$$

Moreover, by (62) we obtain:

$$\Phi_i(\alpha, z_1, \ldots, z_n) = E[e^{-\alpha v_i} z_1^{N_{i1}} \cdots z_n^{N_{in}}] = \beta \sum_{j \in E} p_{ij} z_j, \quad \text{for } i \in E, \tag{106}$$

and, by (66),

$$\Psi_i^S(\alpha) = \Phi_i(\alpha, (\Psi_j^S(\alpha))_{j \in S}, 1_{S^c}) = \beta\Bigl\{1 - \sum_{j \in S} p_{ij}\,(1 - \Psi_j^S(\alpha))\Bigr\}, \quad \text{for } i \in E. \tag{107}$$

By introducing

$$t_i^S = \frac{1 - \Psi_i^S(\alpha)}{1 - \Psi_i(\alpha)}, \tag{108}$$

and noticing that, since $v_i = 1$, $\Psi_i(\alpha) = \beta$, it follows from (107) that

$$t_i^S = 1 + \beta \sum_{j \in S} p_{ij}\,t_j^S, \quad i \in E,$$

and by (107) and Proposition 6 we obtain the entries of the matrix $A$ [display illegible in source]. Moreover, the right-hand sides $b(S)$ are obtained in closed form [display illegible in source] in terms of an indicator that equals 1 if at time $t = 0$ there is a bandit in state $j$, and 0 otherwise.

The structure of the matrix $P = (p_{ij})$ implies a separability relation [display illegible in source], which implies that the index decomposition condition (17) holds, and therefore Theorem 3 applies, giving a new proof of Gittins' theorem:

Theorem 13 (Gittins and Jones [14]) For each project $k$ there exist indices $\{\gamma_i^k\}_{i \in E_k}$, depending only on characteristics of project $k$, such that an optimal policy is to engage at each time a project with largest current index.

By the results of Subsection 3.1.3 we know that the generalized Gittins indices for this bandit problem coincide with the usual Gittins indices. Further, by definition of generalized Gittins indices, we obtain a characterization of Gittins indices as sums of dual variables, a purely algebraic characterization. Also, note that Theorem 13 implies that the multi-armed bandit problem not only has an optimal index policy, but it satisfies the stronger index decomposition property, as described in Subsection 2.4. By Theorem 6, the Gittins indices can be computed by solving $K$ subproblems, applying algorithm $A_1$ to subproblem $k$, with $|E_k|$ job types, $k = 1, \ldots, K$. It is easy to verify the following complexity result:

Proposition 7 The complexity of algorithm $A_1$ applied to subproblem $k$ for computing the Gittins indices of project $k$ is $O(|E_k|^3)$.

The algorithm proposed by Varaiya, Walrand and Buyukkoc [33] has the same time complexity as algorithm $A_1$. In fact, both algorithms are closely related, as we will see next. Let $t_i^S$ be as given by (108). Let $r_i^S$ be given by

$$r_i^S = R_i + \beta \sum_{j \in S} p_{ij}\,r_j^S, \quad i \in S.$$

Let us now state the algorithm of Varaiya, Walrand and Buyukkoc:

Algorithm VWB:

Step 0. Pick $\pi_n \in \operatorname{argmax}\{\, r_i^{\{i\}} / t_i^{\{i\}} : i \in E \,\}$; let $g_{\pi_n} = \max\{\, r_i^{\{i\}} / t_i^{\{i\}} : i \in E \,\}$; set $J_n = \{\pi_n\}$.

Step k. For $k = 1, \ldots, n-1$: pick $\pi_{n-k} \in \operatorname{argmax}\{\, r_i^{J_{n-k+1} \cup \{i\}} / t_i^{J_{n-k+1} \cup \{i\}} : i \in E \setminus J_{n-k+1} \,\}$; set $g_{\pi_{n-k}} = \max\{\, r_i^{J_{n-k+1} \cup \{i\}} / t_i^{J_{n-k+1} \cup \{i\}} : i \in E \setminus J_{n-k+1} \,\}$; set $J_{n-k} = J_{n-k+1} \cup \{\pi_{n-k}\}$.

Varaiya et al. [33] proved that $g_1, \ldots, g_n$, as given by algorithm VWB, are the Gittins indices of the multi-armed bandit problem. We state, without proof, the following relation between algorithms $A_1$ and VWB:

Proposition 8 Let $(\pi, y, \nu, S)$ be an output of algorithm $A_1$. Then relations (111) and (112) hold for $j = 2, \ldots, n$ [displays illegible in source], and therefore algorithms $A_1$ and VWB are equivalent.

4.2 Scheduling Control of a Multiclass Queue with Bernoulli Feedback

Klimov [22] introduced the following queueing scheduling process: There is a single server and $n$ customer types. External arrivals of type $i$ customers form a Poisson process of rate $\lambda_i$, for $i \in E = \{1, \ldots, n\}$. Service times for type $i$ customers are independent and identically distributed as a random variable $v_i$, with distribution function $G_i(\cdot)$. When the service of a type $i$ customer is completed, the customer either joins the queue of type $j$ customers, with probability $p_{ij}$ (thus becoming a type $j$ customer), or leaves the system with probability $1 - \sum_{j \in E} p_{ij}$. The server selects the jobs according to an admissible policy $u$; the decision epochs are $t = 0$ (if there is initially some customer present), the epochs when a customer arrives to find the system empty, and the epochs when a customer completes service (and some customer remains in the system). Let us consider the following three classes of admissible policies: $U$ is the class of all nonidling, nonpreemptive and nonanticipative policies; $U_0$ is the class of all nonpreemptive and nonanticipative policies (idleness is allowed); and $U_p$ is the class of all nonidling and nonanticipative policies (preemption is allowed).

Klimov [22] solved, by direct methods, the associated optimal control problem over $U$ with a time-average holding cost criterion. Harrison [18] solved, using dynamic programming, the optimal control problem over $U_0$ with a discounted reward-cost criterion, in the special case that there is no feedback. Tcha and Pliska [29] extended Harrison's results to the case with feedback. They also solved the control problem over $U_p$, in the case that the service times are exponential.

The Discounted Case

Let us consider the following reward-cost structure: There is a continuous holding cost $C_i$ per unit time for each type $i$ customer staying in the system, and an instantaneous reward of $R_i$ at the epoch of completion of service of a type $i$ customer. There is also an instantaneous reward of idleness $R_0$ at the end of an idle period.
All costs and rewards are discounted in time by a discount factor $\alpha > 0$. The optimal control problem is to find an admissible policy to schedule the server so as to maximize the expected total discounted reward minus holding cost over an infinite horizon. Let us denote by $P_U$, $P_{U_0}$ and $P_{U_p}$ the optimal control problems corresponding to the classes of admissible policies $U$, $U_0$ and $U_p$ respectively. We will model each of these problems as a branching bandit problem. We will also prove, applying the Index Decomposition Theorem, that in order to solve problem $P_{U_0}$ we only need to solve problem $P_U$.

First, let us consider problem $P_U$. This problem can be modelled as a branching bandit problem with $n$ job types, as follows: We interpret the customers as jobs. The descendants $N_{ij}$ of a type $i$ job are composed of the transition of the job to another type (or outside the system) and of the external Poisson arrivals. The transform $\Phi_i(\cdot)$ is given by

$$\Phi_i(\alpha, z_1, \ldots, z_n) = E[e^{-\alpha v_i} z_1^{N_{i1}} \cdots z_n^{N_{in}}] = \Bigl\{1 - \sum_{j \in E} p_{ij}\,(1 - z_j)\Bigr\}\,\Psi_i\Bigl[\alpha + \sum_{j \in E} \lambda_j\,(1 - z_j)\Bigr], \quad i \in E. \tag{113}$$

Also, by (66) and (113),

$$\Psi_i^S(\alpha) = \Bigl\{1 - \sum_{j \in S} p_{ij}\,(1 - \Psi_j^S(\alpha))\Bigr\}\,\Psi_i\Bigl[\alpha + \sum_{j \in S} \lambda_j\,(1 - \Psi_j^S(\alpha))\Bigr], \quad i \in E. \tag{114}$$

Let $x^u(\alpha) = (x_1^u(\alpha), \ldots, x_n^u(\alpha))'$ denote the performance vector, as in Section 3.1. We know that $x^u(\alpha)$ satisfies generalized conservation laws. By Proposition 6, the corresponding matrix $A_\alpha$ is given in closed form in terms of the transforms $\Psi_i^S(\alpha)$ and $\Psi_i(\alpha)$ [display illegible in source].

Let us consider now problem $P_{U_0}$. In order to model the option of idleness, we modify the previous branching bandit process by adding an idling job type, which we denote 0. The duration of job type 0, $v_0$, is exponentially distributed with parameter $\Lambda = \lambda_1 + \cdots + \lambda_n$ (since it models time until the next arrival); $N_{00} = 0$; the descendants $N_{0i}$ and $N_{i0}$, for $i \in E$, are specified accordingly [definitions illegible in source]; and $N_{ij}$, for $i, j \in E$, are as in the previous case. It is easy to see that the corresponding transform $\tilde{\Phi}_i(\cdot)$ satisfies

$$\tilde{\Phi}_i(\alpha, z_0, z_1, \ldots, z_n) = \Phi_i(\alpha, z_1, \ldots, z_n), \quad i \in E.$$

Hence, it follows that

$$\tilde{\Psi}_i^{S \cup \{0\}}(\alpha) = \Psi_i^S(\alpha), \quad i \in E,\ S \subseteq E,$$

and

$$\tilde{\Psi}_0^{S \cup \{0\}}(\alpha) = \tilde{\Psi}_0^{\{0\}}(\alpha), \quad S \subseteq E.$$

Consequently, we have, for $i \in S \subseteq E$, that

$$\tilde{A}_{i,\alpha}^{S \cup \{0\}} = A_{i,\alpha}^S \quad \text{and} \quad \tilde{A}_{0,\alpha}^{S \cup \{0\}} = \tilde{A}_{0,\alpha}^{\{0\}}.$$

Therefore, condition (17) holds, and the Index Decomposition Theorem 6 applies. Now, we have

$$E[N_{ij}] = p_{ij} + \lambda_j\,E[v_i],$$

and

$$\Psi_0(\alpha) = \frac{\Lambda}{\Lambda + \alpha}.$$

By (56), the index $\gamma_0$ of the subsystem composed of job type 0 is obtained in closed form in terms of $R_0$ [display illegible in source]. The indices $\gamma_i$, for $i \in E$, are computed from algorithm $A_1$ applied to problem $P_U$. Therefore, if

$$\gamma_1 \le \cdots \le \gamma_{i^*-1} < \gamma_0 \le \gamma_{i^*} \le \cdots \le \gamma_n,$$

then an optimal policy is to serve customers of types $i^*, \ldots, n$ with a fixed priority policy, giving highest priority to $n$, and never serve customer types $1, \ldots, i^* - 1$. That is, the optimal policy is a modified static policy, as proved by Harrison [18] and Tcha and Pliska [29].

The Preemptive Case

If preemption is allowed, then the decision epochs are the arrival epochs as well as the departure epochs of customers. If the service time $v_i$ is exponential with rate $\mu_i$, for $i \in E$, then it is easy to model the possibility of preemption: model the process as a branching bandit process with $n$ job types. Job type $i$ has a duration $\tilde{v}_i$ exponentially distributed with rate $\tilde{\mu}_i = \Lambda + \mu_i$, where $\Lambda = \lambda_1 + \cdots + \lambda_n$. As for the descendants of a type $i$ job, there are three cases:

(1) One descendant, of type $j$, with probability $\frac{\mu_i}{\tilde{\mu}_i} p_{ij}$ (corresponding to the case that service of the type $i$ customer ends before any arrival occurs and the customer moves to queue $j$);
(2) two descendants, one of type $i$ and the other of type $j$, with probability $\frac{\lambda_j}{\tilde{\mu}_i}$ (corresponding to the case that a type $j$ customer arrives before service of the type $i$ customer is completed); and
(3) no descendants, with probability $\frac{\mu_i}{\tilde{\mu}_i}\bigl(1 - \sum_{j \in E} p_{ij}\bigr)$ (corresponding to the case that service of the type $i$ customer ends before any arrival occurs, and the customer moves out of the system).
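The three cases of the preemptive model can be tabulated directly. The sketch below assumes the reading that a type $i$ job lasts an exponential time with rate $\mu_i + \Lambda$, that a service completion wins the race with probability $\mu_i / (\mu_i + \Lambda)$, and that a type $j$ arrival wins with probability $\lambda_j / (\mu_i + \Lambda)$. The function name, argument names and 0-based type indices are hypothetical.

```python
from fractions import Fraction


def preemptive_descendants(i, mu, lam, p):
    """Descendant distribution of a type-i job in the preemptive model.

    mu[i]   : service rate of type i (exponential service, assumed)
    lam[j]  : Poisson arrival rate of type j
    p[i][j] : feedback probability from type i to type j
    Returns {tuple of descendant types: probability}.
    """
    total = mu[i] + sum(lam)  # rate of the job duration, mu_i + Lambda
    dist = {}
    for j in range(len(mu)):
        # Case (1): service ends first and the customer moves to queue j.
        dist[(j,)] = mu[i] / total * p[i][j]
        # Case (2): a type-j arrival comes first; descendants are i and j.
        dist[(i, j)] = lam[j] / total
    # Case (3): service ends first and the customer leaves the system.
    dist[()] = mu[i] / total * (1 - sum(p[i]))
    return dist


# Toy instance with two customer types and exact rational data.
mu = [Fraction(2), Fraction(3)]
lam = [Fraction(1), Fraction(1)]
p = [[Fraction(0), Fraction(1, 2)], [Fraction(1, 4), Fraction(0)]]
dist0 = preemptive_descendants(0, mu, lam, p)
```

Summing the values of `dist0` is a quick way to confirm that the three cases exhaust all outcomes for the chosen data.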
The Undiscounted Case: Klimov's Problem

Klimov [22] first considered the problem of optimal control of a single-server multiclass queue with Bernoulli feedback, with the criterion of minimizing the time average holding cost per unit time. He proved that the optimal nonidling, nonpreemptive and nonanticipative policy is a fixed priority policy, and presented an algorithm for computing the priorities (starting with the lowest priority type and ending with the highest priority). Tsoucas [32] modelled Klimov's problem as an optimization problem over an extended polymatroid using as performance measures

$$L_i^u = \text{time average length of queue } i \text{ under policy } u.$$

Algorithm $A_1$ applied to this problem is exactly Klimov's original algorithm. A disadvantage in this case is that priorities are computed from lowest priority to highest priority. Also, Tsoucas does not obtain closed form formulae for the right hand sides of the extended polymatroid, so it is not possible to evaluate the performance of an optimal policy. Our approach gives explicit formulae for all of the parameters of the extended polymatroid and also explains the somewhat surprising property that the optimal priority rule does not depend in this case on the arrival rates.

The key observation is that an optimal policy under the time average holding cost criterion also minimizes the expected total holding cost in each busy period (see Nain et al. [24] for further discussion). Now, we may model the first busy period of Klimov's problem as a branching bandit process with the undiscounted tax criterion, as considered in Section 3. Assuming that the system is stable, we apply the results of Subsection 3.2. We define $\beta_i = E[v_i]$ and $t_i^S = E[T_i^S]$. By (65) we have

$$t_i^S = \beta_i + \sum_{j \in S} (p_{ij} + \beta_i \lambda_j)\,t_j^S, \quad i \in E, \tag{115}$$

which, written in vector notation for the set $S^c$, becomes

$$t_{S^c}^{S^c} = (I_{S^c} - P_{S^c,S^c} - \beta_{S^c} \lambda_{S^c}')^{-1}\,\beta_{S^c}$$

and

$$t_S^{S^c} = \beta_S + (P_{S,S^c} + \beta_S \lambda_{S^c}')\,t_{S^c}^{S^c}.$$

After algebraic manipulations we obtain

$$t_i^{S^c} = \bigl(\beta_i + p_{i,S^c}'\,(I_{S^c} - P_{S^c,S^c})^{-1}\,\beta_{S^c}\bigr)\,\frac{\det(I_{S^c} - P_{S^c,S^c})}{\det(I_{S^c} - P_{S^c,S^c} - \beta_{S^c} \lambda_{S^c}')}, \quad i \in S. \tag{116}$$

Therefore, by definition of $A_i^S$ in (73) we find that $A_i^S = t_i^{S^c} / \beta_i$, for $i \in S$, while $b(S)$ is given by (74). Now, letting

$$K_S = \frac{\det(I_{S^c} - P_{S^c,S^c})}{\det(I_{S^c} - P_{S^c,S^c} - \beta_{S^c} \lambda_{S^c}')},$$

we may define $\bar{A}_i^S = A_i^S / K_S$ and $\bar{b}(S) = b(S) / K_S$, thus eliminating the dependence of the matrix $A$ on the arrival rates. As for the objective function, it is given by (93) in the form (117) [display illegible in source]. Hence the problem can be solved by applying algorithm $A_1$ with input $(\bar{R}, \bar{A})$, where $\bar{R}$ is read off from (117) [display illegible in source], and since $(\bar{R}, \bar{A})$ do not depend on the arrival rates neither does the optimal policy. Note that, as opposed to Klimov's algorithm, with this algorithm priorities are computed from highest to lowest.

This top-down algorithm was first proposed by Nain et al. [24], who proved its optimality using interchange arguments. Bhattacharya et al. [3] provided a direct optimality proof. Nain et al. [24] also proved that the resulting optimal index rule is optimal among idling policies for general service time distributions, and among preemptive policies when the service time distributions are exponential. It is also easy to verify these facts using our approach (in particular, the index of the idling state is 0, whereas all other indices are nonnegative).

Moreover, in the case that the arriving jobs are divided into $K$ projects, where a type $k$ project consists of jobs with types within a finite set $E_k$, jobs in $E_k$ can only make transitions within $E_k$, and $E$ is partitioned as $E = \bigcup_{k=1}^K E_k$, then it is easy to see that the Index Decomposition Theorem 6 applies, and therefore we can decompose the problem into $K$ smaller subproblems.

4.3 Multiclass Queueing Systems

Shanthikumar and Yao [26] showed that a large variety of multiclass queueing systems satisfy strong conservation laws. The reader is referred to their paper for a list of particular systems and performance measures that satisfy strong conservation laws. All their results correspond to the special case that the performance space $\mathcal{B}(A,b)$ is a polymatroid.

4.4 Job Scheduling Problems without Arrivals; Deterministic Scheduling

There are $n$ jobs to be completed by a single server. Job $i$ has a service requirement distributed as the random variable $v_i$, with moment generating function $\Psi_i(\cdot)$. It is immediate to model this job scheduling process as a branching bandit process in which jobs have no descendants.

Let us consider first the discounted case. It is clear by definition of $A_{i,\alpha}^S$ in (36) that $A_{i,\alpha}^S = 1$, for $i \in S$. Therefore the performance space of the vectors $x^u(\alpha)$ studied in Section 3 is a polymatroid. Consider the discounted reward-tax problem discussed in Section 3, in which an instantaneous reward $R_i$ is received at the completion of job $i$, and a holding tax $C_i$ is incurred for each unit of time that job $i$ is in the system. Rewards and taxes are discounted in time with discount factor $\alpha > 0$. By (56) it follows that the generalized Gittins index for job $i$, in the problem of maximizing rewards minus taxes, is given by (118) [display illegible in source].

Let us consider now the undiscounted case, in the case without rewards. By definition of $A_i^S$ in (73) we have $A_i^S = 1$, for $i \in S$. Hence the performance space of the performance vectors $z^u$ studied in Section 3 is also a polymatroid. By (93) it follows that the generalized Gittins index for job $i$ in the undiscounted tax problem is

$$\gamma_i = \frac{C_i}{E[v_i]}, \tag{119}$$

thus providing a new polyhedral proof of the optimality of Smith's rule (see Smith [27]).

In the case that there are precedence constraints among the jobs that form out-trees, that is, each job can have at most one predecessor, it is easy to see that the problem can also be modeled as a branching bandits problem and is thus solvable using the theory developed in Section 3.

5 Reflections

We have presented a unified treatment of several classical problems in stochastic and dynamic scheduling using polyhedral methods that leads, we believe, to a deeper understanding of their structural and algorithmic properties. Perhaps the most important idea we used is to ask the question: What is the performance space of a stochastic scheduling problem? We believe that the approach of characterizing the feasible region of a stochastic scheduling problem will lead to important new insights and methods and will bridge the artificial gap between applied probability and mathematical optimization. Indeed, we hope that our results will be of interest to applied probabilists, as they provide new interpretations, proofs, algorithms, insights and connections to important problems in stochastic scheduling, as well as to discrete optimizers, since they reveal a new fundamental structure (extended polymatroids) which has a genuinely applied origin.

References

[1] C. Bell, (1971), "Characterization and computation of optimal policies for operating an M/G/1 queue with removable server", Operations Research, 19, 208-218.

[2] D. Bertsimas, I. Paschalidis and J. Tsitsiklis, (1992), "Optimization of multiclass queueing networks: polyhedral and nonlinear characterizations of achievable performance", working paper, Operations Research Center, MIT.

[3] P. P. Bhattacharya, L. Georgiadis and P. Tsoucas, (1991), "Problems of adaptive optimization in multiclass M/G/1 queues with Bernoulli feedback".
Paper presented in part at the ORSA/TIMS Conference on Applied Probability in the Engineering, Information and Natural Sciences, January 9-11, 1991, Monterey, California.

[4] P. P. Bhattacharya, L. Georgiadis and P. Tsoucas, (1991), "Extended polymatroids: Properties and optimization", Proceedings of International Conference on Integer Programming and Combinatorial Optimization (Carnegie Mellon University), Mathematical Programming Society, 298-315.

[5] Y. R. Chen and M. N. Katehakis, (1986), "Linear programming for finite state multi-armed bandit problems", Mathematics of Operations Research, 11, 183.

[6] A. Cobham, (1954), "Priority assignment in waiting line problems", Operations Research, 2, 70-76.

[7] E. Coffman and I. Mitrani, (1980), "A characterization of waiting time performance realizable by single server queues", Operations Research, 28, 810-821.

[8] E. Coffman, M. Hofri and G. Weiss, (1989), "Scheduling stochastic jobs with a two point distribution on two parallel machines", to appear in Probab. Engin. Infor.

[9] D. R. Cox and W. L. Smith, (1961), Queues, Methuen (London) and Wiley (New York).

[10] J. Edmonds, (1970), "Submodular functions, matroids and certain polyhedra", in Combinatorial Structures and Their Applications, 69-87. R. Guy et al. (eds.), Gordon & Breach, New York.

[11] A. Federgruen and H. Groenevelt, (1988), "Characterization and optimization of achievable performance in general queueing systems", Operations Research, 36, 733-741.

[12] A. Federgruen and H. Groenevelt, (1988), "M/G/c queueing systems with multiple customer classes: Characterization and control of achievable performance under nonpreemptive priority rules", Management Science, 34, 1121-1138.

[13] E. Gelenbe and I. Mitrani, (1980), Analysis and Synthesis of Computer Systems, Academic Press, New York.

[14] J. C. Gittins and D. M. Jones, (1974), "A dynamic allocation index for the sequential design of experiments". In J. Gani, K. Sarkadi & I. Vince (eds.), Progress in Statistics: European Meeting of Statisticians 1972, vol. 1, Amsterdam: North-Holland, 241-266.

[15] J. C. Gittins, (1979), "Bandit processes and dynamic allocation indices", Journal of the Royal Statistical Society Series B, 14, 148-177.

[16] J. C. Gittins, (1989), Bandit Processes and Dynamic Allocation Indices, John Wiley.

[17] K. D. Glazebrook, (1987), "Sensitivity analysis for stochastic scheduling problems", Mathematics of Operations Research, 12, 205-223.

[18] J. M. Harrison, (1975), "A priority queue with discounted linear costs", Operations Research, 23, 260-269.

[19] J. M. Harrison, (1975), "Dynamic scheduling of a multiclass queue: discount optimality", Operations Research, 23, 270-282.

[20] M. N. Katehakis and A. F. Veinott, (1987), "The multiarmed bandit problem: decomposition and computation", Mathematics of Operations Research, 12, 262-268.

[21] L. Kleinrock, (1976), Queueing Systems, vol. 2, John Wiley.

[22] G. P. Klimov, (1974), "Time sharing service systems I", Theory of Probability and Applications, 19, 532-551.

[23] E. Lawler, J. K. Lenstra, A. H. G. Rinnooy Kan and D. B. Shmoys, (1989), "Sequencing and scheduling; algorithms and complexity", Report BS-R8909, Centre for Mathematics and Computer Science, Amsterdam.

[24] P. Nain, P. Tsoucas and J. Walrand, (1989), "Interchange arguments in stochastic scheduling", Journal of Applied Probability, 27, 815-826.

[25] K. W. Ross and D. D. Yao, (1989), "Optimal dynamic scheduling in Jackson Networks", IEEE Transactions on Automatic Control, 34, 47-53.

[26] J. G. Shanthikumar and D. D. Yao, (1992), "Multiclass queueing systems: Polymatroidal structure and optimal scheduling control", Operations Research, 40, Supplement 2, S293-299.

[27] W. E. Smith, (1956), "Various optimizers for single-stage production", Naval Research Logistics Quarterly, 3, 59-66.

[28] S. Stidham, (1972), "L = λW: A discounted analog and a new proof", Operations Research, 20, 1115-1126.

[29] D.-W. Tcha and S. R. Pliska, (1977), "Optimal control of single-server queueing networks and multi-class M/G/1 queues with feedback", Operations Research, 25, 248-258.

[30] J. N. Tsitsiklis, (1986), "A lemma on the multi-armed bandit problem", IEEE Transactions on Automatic Control, 31, 576-577.

[31] J. N. Tsitsiklis, (1993), "A short proof of the Gittins index theorem", in preparation.

[32] P. Tsoucas, (1991), "The region of achievable performance in a model of Klimov", Technical Report RC16543, IBM T. J. Watson Research Center.

[33] P. P. Varaiya, J. C. Walrand and C. Buyukkoc, (1985), "Extensions of the multiarmed bandit problem: The discounted case", IEEE Transactions on Automatic Control, 30, 426-439.

[34] R. Weber, (1992), "On the Gittins index for multiarmed bandits", The Annals of Applied Probability, 2, 1024-1033.

[35] G. Weiss, (1988), "Branching bandit processes", Probability in the Engineering and Information Sciences, 2, 269-278.

[36] P. Whittle, (1980), "Multi-armed bandits and the Gittins index", Journal of the Royal Statistical Society, 42, 143-149.

[37] P. Whittle, (1981), "Arm acquiring bandits", Annals of Probability, 9, 284-292.

[38] P. Whittle, (1982), Optimization Over Time, vol. 1, John Wiley, Chichester, UK.