Document 11036117

advertisement
DcWEy
HD28
.M414
^3
ALFRED
P.
WORKING PAPER
SLOAN SCHOOL OF MANAGEMENT
Conservation Laws, Extended Polymatroids and
Multi-Armed Bandit Problems; A Unified
Approach to Indexable Systems
Dimitris Bertsimas
Jose
WP#
-
3538-93
and
Nino-Mora
MSA
February, 1993
MASSACHUSETTS
INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139
Conservation Laws, Extended Polymatroids and
Multi-Armed Bandit Problems; A Unified
Approach to Indexable Systems
Dimitris Bertsimas
Jose
WP#
-
3538-93
and
Nino-Mora
MSA
February, 1993
MAR 2 5
1993.
Conservation laws, extended polymatroids and multi-armed
bandit problems; a unified approach to indexable s^•,stems
Dimitris Bertsimas
'
Jose Nino-Mora
-
Februarv 1993
Management and Operations Research Center. MIT. CamThe research of tiie autlior was partially supported by a Presidential ^'ouIlc
Investigator Award DDM-9158118 with matching funds from Draper Laboratory
-Jose Nifio-Mora. Operations Research Center. MIT. Cambridge. .MA 02139 The restsTicli of ihc
'Dimitris Bertsimas. Sloan Scfiooi of
bridge,
MA
02139.
author was partially supported by a MEC/Fulbright Fellowship.
Abstract
We
isfy
show that
if
performance measures
in stochastic
and dynamic scheduling problems
sat-
generalized conservation laws, then the feasible space of achievable performance
polyhedron called an extended polymatroid that generalizes the usual polymatroid^s
duced by Edmonds.
is
intro-
Optimization of a linear objective over an extended polymatroid
solved by an adaptive greedy algorithm, which leads to an optimal solution having an
dexability property {indexable systems).
Under a
a
is
in-
certain condition, then the indices have
a stronger decomposition property {decomposable systems).
The
following classical prob-
lems can be analyzed using our theory: multi-armed bandit problems, branching bandits.
multiclass queues, multiclass queues with feedback, deterministic scheduling pnDlilpnis
teresting consequences of our results include:
(
1) a
characterization of indexable systems
systems that satisfy generalized conservation laws,
new
programming proof
(3) a
property of Gittins indices
multi-armed bandit problems,
in
proach to sensitivity analysis of indexable systems.
of indexable systems as
sums
terms of retirement options
ysis of the indexability of
in
of dual variables
(-5)
a
of the decomposabilit}
(4) a unified
and practical ap-
new characterization
of the indu'^s
and a new interpretation of the indices
the context of branching bandits, (6) the
undiscounted branching bandits,
(7) a
first
is
in
rigorous anal-
new algorithm
the indices of indexable systems (in particular Gittins indices), which
known
a-s
(2) a sufficient condition for indexable
systems to be decomposable,
linear
In-
to
compute
as fast as the fastest
algorithm, (8) a unification of the algorithm of Klimov for multiclass queues and
the algorithm of Gittins for multi-armed bandits as special cases of the
same algorithm.
(9)
closed form formulae for the performance of the optimal policy, and (10) an understanding
of the nondependence of the indices on
some
of the parameters of the stochastic scheduling
problem. Most importantly, our approach provides a unified treatment of several
problems
in stochastic
variations such as:
and dynamic scheduling and
is
classical
able to address in a unified wax their
discounted versus undiscounted cost criterion, rewards versus taxes.
preemption versus nonpreemption, discrete versus continuous time, work conserving versus
idling policies, linear versus nonlinear objective functions.
1
Introduction
1
In the
tion
mathematical programming tradition researchers and practitioners solve optinuza-
problems by defining decision variables and formulating constraints, thus describing the
feasible space of decisions,
and applying algorithms
for the solution of tiie
underlying opti-
mization problem. For the most part, the tradition for stochastic and dynamic scheduling
problems has been, however, quite
different,
cis it
relies
primarily on dynamic prograninimg
formulations. Using ingenious but often ad hoc methods, which exploit the structure of the
particular problem, researchers and practitioners can sometimes derive insightful structural
results that lead to efficient algorithms.
scheduling problems Lawler
et.
al.
[23]
comprehensive survey of deterniiuistic
In their
The
end their paper with the following remarks:
results in stochastic scheduling are scattered
and they have been obtained through
con-
a
siderable and sometimes dishearting effort. In the words of Coffman, Hofri and Weiss
there
great need for
is
new mathematical techniques
[8].
useful for simplifying the dernation of
the results".
Perhaps one of the most important successes
twenty years
last
is
the area of stochastic scheduling
in
the solution of the celebrated mulit-armed bandit problem,
the
in
a g>^"iiPii<-
version of which in discrete time can be described as follows:
The multi-armed bandit problem:
.
K. Project k can be
,
.
.
time
f
=
ik(t) at
0, I,
time
.
.
<,
.
in
one of a
finite
There are
number
of states
one can work on only a single project.
<
i?
<
by a Markov transition rule (which may depend on
projects one has not engaged remain unchanged,
is
how
If
1.
one works on project
k,
i.e..
The
R,^{t)
k-
=
t),
I
stauar.^
+
if,{t
while the states of
ii{t) for
to allocate one's resources sequentially in time in order to
in
Rewards
state ik{t) changes to
but not on
ii{t+ 1)
=
k
At each instant of discrete
ifc.
then one receives an immediate expected reward of
additive and discounted in time by a factor
indexed
A' parallel projects,
/
^
k.
The
1
ilie
prnlij.-in
maximize expected imal
discounted reward over an infinite horizon.
The problem has numerous
and the references
therein). It
applications and a rather vast literature (see Gittiiis
was
originally solved by Gittins
and Jones
that to each project k one could attach an index 7'^(ijt(0)< which
is
They
also proved the
who proved
a function of the project
k and the current state ik{t) alone, such that the optimal action at time
project of largest current index.
[14],
[16]
t
is
to engage the
important result that thesp
iiidf^x
functions satisfy a stronger index decomposHton property: the function
on characteristics of project k
now known
More
Walrand and Buyukkoc
Weber
on interchange arguments.
The multi-armed bandit problem
in
on an interchange argument, other proofs
S- In this
is
[34]
[33]
and Weiss
provided
[35]
outlined an intuiti\e proof.
on a simple inductive argument
recently, Tsitsiklis [31] has provided a proof based
scheduling system
Gitrms con-
provided a proof based on dynamic programming, subsequently
[36]
simplified by Tsitsiklis [30]. Varaiya,
different proofs based
only depends
)
as Gittins indices, in recognition of
tribution. Since the original solution, which relied
were proposed: Whittle
"(
rewards and transition probabihties), and not on an\
(states,
other project. These indices are
-;
a special case of a dynamic and stochastic job
context, there
is
a set
E
of job types and
we
are interesti^d
optimizing a function of a performance measure (rewards or taxes) under
a
riass of
admissible scheduling policies.
1
system S
indexable
index,
is
if
In general the indices 7, could
of the set
E
say that
the following policy
At each decision epoch
7^.
We
(Indexable Systems)
Definition
to subsets Ek, k
1,
.
.
.
/\
,
entire set
entire set
i
E
of job types.
we attach an
of job types. Consider a partition
Such a property
is
lie
In certain situations, the index of
£ Ek depends only on the characteristics of the job types
E
/
which contain collections of job types and can
interpreted as projects consisting of several job types.
job type
To each job type
optimal:
is
stochastic jo6 scheduling
select a job with the largest index.
depend on the
=
adynamic and
in
Ek and not on the
particularly useful computationally sinci^
it
enables the system to be decomposed to smaller parts and the computation of the indices
can be done independently.
As we have seen the multi-armed bandit problem has
this
decomposition property, which motivates the following definition:
Definition 2 (Decomposable Systems) An indexable system
for
all
job types
i
£ Ek, the index
7,
of job type
i
is
called
decomposable
if
depends only on the characteristics of the
set of job types Ek-
In addition to the
multi-armed bandit problem, a variety of dynamic and stochastic
scheduling problems has been solved
1.
in
the last decades by indexing rules:
Extensions of the usual multi-armed bandit problem such as arm-acquiring bandits
(Whittle
[37],
[38])
and more generally branching bandits (Weiss
several important problems as special cases.
2
[35]).
that include
2.
The
multiclass queueing scheduling problem with Bernoulli feedback (Kliiiun
Tcha and
3.
The
[19],
Kleinrock
[21],
interesting distinction, which
and
(2)
Example
(3),
not emphasized
system, while example (4) above
problems have very
what
in
is
[26]).
is
that examples
general decomposable systems.
indexable, but not decomposable.
decomposable under the average cost
it is
results,
the literature,
in
It is
the multi-armed bandit problem
Faced with these
Shantikumar and Yao
[9].
[27]).
however, has a more refined structure.
As already observed,
particular,
is
[13],
above are indexable systems, but they are not
under discounting, while
trivial
Gelenbe and Mitrani
Deterministic scheduling problems (Smith
An
(1)
Pliska [29]).
multiclass queueing scheduling problem without feedback (Cox and Smith
Harrison
4.
[22].
is
criterion (the
an example of
a
rule).
r//
decomposable
also decomposable.
one asks what
is
efficient solutions
the underlying deep reason that these non-
both theoretically as well as practically
In
the class of stochastic and dynamic scheduling problems that ar^ uidex-
is
able? Under what conditions, indexable systems are decomposable'' But most im]3oi-taiul_\
is
there a unified
way
to address stochastic
and dynamic scheduling problems that
to a deeper understanding of their strong structural properties'' This
is
will lead
the set of questions
that motivates this work.
In the last
decade the following approach has been proposed to address special cases
of these questions.
stochastic and
In
broad terms, researchers try to describe the feasible space of
dynamic scheduling problem
dynamic scheduling problem
is
as a polyhedron.
a
Then, the stochastic and
translated to an optimization problem over the corresponding
polyhedron, which can then be attacked by traditional mathematical programming methods.
Coffman and Mitrani
[7]
and Gelenbe and Mitrani
[13] first
showed using conservation laws
that the performance space of a multiclass queue under the average cost criterion can be
described as a polyhedron. Federgruen and Groenevelt
by observing that
special structure
rule).
in certain special cases of multiclciss
(it is
[12]
advanced the theory further
queues, the polyhedron has
[26]
generalized the theory further by observing that
strong conservation laws, then the underlying performance space
polymatroid.
They
a
very
a polymatroid) that gives rise to very simple optimal policies (the c^
Shantikumar and Yao
satisfies
[1 1],
also proved that,
when
the cost
is
linear
is
if a
system
necessarily a
on the performance, the optimal
policy
a fixed priority rule (also called head of the line priority rule; see Cobliam
is
Cox and Smith
networks,
in
[6].
and
Their results partially extend to some rather restricted queueing
[9]).
which they assume that
all
the different classes of customers liave the same
routing probabilities, and the same service requirements at each station of the network (see
Tsoucas
also [25]).
scheduling a multiclass nonpreemptive
by Klimov
([22]).
performance
([32]) derived the region of achievable
M/G/1 queue
Finally, Bertsimjis et
al.
[2]
the problem of
with Bernoulli feedback, intioduced
generalize the ideas of conservation laws
to general multiclass queueing networks using potential function ideas
and nonlinear inequalities that the
in
feasible region satisfies.
They
finil
Optimization over
linear
this set of
constraints gives bounds on achievable performance.
Our
goal in this paper
is
to propose a unified theory of conservation laws
that the very strong structural properties
in
and
to establish
the optimization of a class of stochastic and
dynamic systems that include the multi-armed bandit problem and
its
extensions follow from
the corresponding strong structural properties of the underlying poiyhedra that characterize
the regions of achievable performance.
By
generalizing the work of Shantikumar and
measures
in
stochastic and
polymatroid (see Bhattacharya
is
[26]
dynamic scheduling problems
laws, then the feasible space of achievable
tended polymatroid
Yao
ei
at.
[4]).
performance
is
we show that
if
performance
satisfy generalized consa-i aiion
a polyhedron called an
Optimization of
t
a linear objective over
ri^itdtd
an ex-
solved by an adaptive greedy algorithm, which leads to an optimal
solution having an indexability property. Special cases of our theory include
lems we have mentioned,
i.e.,
all
tl>^ proli-
multi-armed bandit problems, discounted and undiscountpd
branching bandits, multiclass queues, multiclass queues with feedback and deterministic
scheduhng problems. Interesting consequences of our
1.
A
results include;
characterization of indexable systems as systems that satisfy generalized conserva-
tion laws.
2.
Sufficient conditions for indexable systems to be
3.
A
decomposable.
genuinely new, algebraic proof (based on the strong duality theory of linear pro-
gramming
as
opposed to dynamic programming formulations) of the decomposability
property of Gittins indices
in
multi-armed bandit problems.
4.
A
and practical approach
unified
to sensitivity analysis of indexable systems., based
on the well understood sensitivity analysis of linear programming
5.
A new
characterization of the indices of indexable systems as
sums of dual
variables
corresponding to the extended polymatroid that characterizes the feasible space.
6.
A new
interpretation of indices in the context of branching bandits as retiiement
options, thus generalizing the interpretation of Whittle [36] and
Weber
[34] for
the
indices of the classical multi-armed bandit problem.
7.
The
first
complete and rigorous analysis of the indexability of undiscounted branching
bandits.
8.
A new
algorithm to compute the indices of indexable systems
tins indices),
Buyukkoc
9.
The
which
is
as fast as the fastest
particular G:t-
known algorithm (Varaiya.
\\alran(l
and
[33]).
realization that the algorithm of
Klimov
for multicla.ss
of Gittins for multi-armed bandits are examples of the
10.
(in
queues and the algorithm
same algorithm.
Closed form formulae for the performance of the optimal policy. This also leads to an
understanding of the nondependence of the indices on some of the parameters of the
stochastic scheduling problem.
Most importantly, our approach provides
lems
in stochastic
variations such
bls:
a unified
and dynamic scheduling and
is
treatment of several
able to address
in
classical prob-
a unified
way
their
discounted versus undiscounted cost criterion, rewards versus taxes.
preemption versus nonpreemption, discrete versus continuous time, work conserving \ersus
idling policies, linear versus nonlinear objective functions.
The paper
is
structured as follows:
servation laws and
problem
vector
is
show that
satisfies generalized
if
Section 2 we define the notion of generalized con-
a performance vector of a stochastic and
dynamic scheduling
conservation laws, then the feasible space of this performance
an extended polymatroid.
show that
In
linear optimization
Using the duality theory of linear programming we
problems over extended polymatroids can be solved by an
adaptive greedy algorithm. Most importantly, we show that this optimization problem has
an indexability property.
In this way,
we
give a characterization of indexable systems as
)
systems that satisfy generalized conservation laws.
We
also find a sufficient condition for
an indexable system to be decomposable and prove a powerful result on sensitivity analysis.
we study
In Section 3
a natural generalization of the classical multi-armed bandit problem:
the branching bandit problem.
We
propose two different performance measures and prove
that they satisfy generalized conservation laws, and thus from the results of the previous
section their feaisible space
is
an extended polymatroid.
We
then consider different cost and
reward structures on branching bandits, corresponding to the discounted and undiscouiitpd
and some transform
case,
results. Section 4 contains applications of the previous sections to
various classical problems: multi-armed bandits, multi-class queueing scheduling prol)lenis
with or without feedback and deterministic scheduling problems.
some thoughts on the
field
The
final section contains
of optimization of stochastic systems.
Extended Polymatroids and Generalized Conservation Laws
2
Extended Polymatroids
2.1
Tsoucas
[32]
characterized the performance space of Klimov's problem (see Klimox
polyhedron with a special structure, not previously identified
et at.
called this polyhedron an extended polymatroid
[4]
erties of
it.
in
['22]) .i^ a
Bhattachar\
the literature
and proved some interesting
Extended polymatroids are a central structure
for the results
we present
n
pro|)in this
paper.
Let us
first
establish the notation
we
denote a real n-vector, with components
\S\
E=
Let
i,, for
£ E. For S C N,
/
denote the cardinality of 5. Let 2^ denote the class of
be a set function, that
/if
Let
TT
=
>
0,
(jri,
for
...,7r„)
satisfies 6(0)
i
e S
=
and
0.
{!,... ,n} be a finite set
will use.
Let
A =
A^ =
be a permutation of E.
0,
all
subsets of
i
G S"
,
.
.
.
,
i,r„)-^-
For clarity of presentation,
=
(x\
Let us write
t.
=
.
Let
for all
.
to introduce the following additional notation. For an n-vector x
(i„j
E
scE be a matrix
(.4f ),g£,
for
= f
let 5*^
(6({^i}).fc({Ti,7r2}),...,6({7ri,...,7r„})f.
it
\ 5.
6:
Lei
j-
ami ki
2^
—
h\
that satistip^
5C
is
x„)
£"
(
1
con\pnif^iit
let
j-
=
Let
An denote
the following lower triangular submatrix of
AiV^
/
.4:
...
A.=
T")
{t1
\Ai:
Let
i'(7r)
j{fl
^r,}
Ai'^i
^n}
'T2
be the unique solution of the linear system
"^^x,. =fc({;r:,...,:r,}),
tiAiV
j=l,...,n
1=1
or, in
matrix notation:
A^x^ =
b^.
Let us define the polyhedron
V{A,
6)
=
{
J
G
3?"
;
^ Afx, >
b(S),
for
5 C r
(4)
}
and the polytope
B(/l,6)= {r G3f" :^-4fr, >
Note that
is
if
r G V(A,b), then
due to Bhattacharya
et al.
it
this Ceise
>
follows that r
^.4fx, =6(f)}
and
componentwise. The following definition
We
with base set E,
if
say that the polyhedron V(A.b)
for every
we say that the polytope B(A,b)
is
permutation n of E.
v{tt)
is
an
G V{A.h).
f
r-
In
the base of the extended polymatroid 'P{A.Ij).
Optimization over Extended Polymatroids
2.2
Extended polymatroids are polyhedra defined by an exponential number
Yet, Tsoucas [32] and Bhattacharya et
on Klimov's algorithm (see Klimov
an extended polymatroid.
linear
program certain
al.
[4]
In this subsection
We
we provide
a
programming problem over
new
duality proof that this
then show that we can associate with this
indices, related to the dual
an indexability property.
of inequalities.
presented a polynomial algorithm, based
[22]) for solving a linear
algorithm solves the problem optimally.
hcis
(J)
[4].
Definition 3 (Extended Polymatroid)
tended polymatrotd
S C ^
for
6(S),
program,
in
such a way that the problem
Under certain conditions, we prove that
a
stronger index
We
decomposition property holds.
also present an optimality condition specially suited for
performing sensitivity analysis.
In
what follows we assume that V{A,b)
row vector. Let us consider the following
Therefore we
may
^
mm{
(D)
Let
R £
K"^^
be
a
programming problem;
linear
(6)
a polytope, this linear program has a finite optimal solution.
is
consider
have a dual variable y
an extended polymatroid
max{^i2,a;, :iGe(A,6)}.
(P)
Note that since B{A,b)
is
for every
b(S)y^
:
SCE
and
dual,
its
this will
S C E. The
^ Afy^ =
have the same optimum value.
dual problem
for
R,.
shall
is:
and
G E,
/
We
y-
<
for
0,
S C E}.
SBi
(7)
In order to solve (P),
Bhattacharya
ef al.
algorithm, based on Klimov's algorithm
[4]
presented the following adapftie giffdy
[22]:
Algorithm ^i
Input: {R,A).
Output: (n,y,i^,S), where
Step
0.
Set Sn
pick
5<ep
i.
S=
and
(t/i,... ,f„),
7r„
For t
=
i/„
Set S„_fc
}.
S„_fc+i \ {TTn-k+x
};
1,
=
= max{ -^
£
G argmax{ -^
=
7r„)
(ttj
5„}, with
{Si,
E. Set
=
tt
...,
n-
:
f
G
5jt
.
i
=
a
is
{tti,
permutation of E, y
.
.
,7r/;},
.
for k
5<ep
n.
For
7r„_t
5 C
(y'^)<;cE-
''
£ E.
£ E);
1:
set i/„_A:
= max{
'^°5^'_^
1,
pick
=
G argmax{
^'=1
,-„_*
'
—^^^
:
?
G Sn-k
}
£" set
{fj
0,
,
if
5 =
6j for
otherwise.
some
j
G
£';
—
^
'
€ Sn-k
)
—
n
It is
easy to see that the complexity of Ai, given {R, A),
reward vectors,
7r
may occur
ties
algorithm ^i
in
that vectors f and y are uniquely determined by A\.
will
be clear
permutation
Howmer. we
rules.
show
will
In order to prove this point,
and to understand better ^i,
later,
that, for certain
In the presence of ties, the
.
generated depends clearly on the choice of tie-breaking
importance
Note
0{n'^).
is
whose
us introduce the following
let
related algorithm:
Algorithm A2
Input: {R,A).
Output: {r,y,'H.J), where
partition of E, and J^
Step
1.
Set k :—
set 6\
Step
2.
= max{
-jf
f H^
<
r
n
U[_jt///, for
=
set Jj
1;
While Jk
=
<
I
=
1,
{y
^
)scE'
= {^1
//r}
is
a
h
]
»-.
.
.
.
,
E\
^
i
/
=
an integer, y
is
E)
H\ = argmax{ -^
and
:
'
£
G
}
do:
begin
Set t :=
set ek
fc
+
1;
=
set Jk
= max{ ^'"^^*
Jk-\ \ Hk-\\
"''"'
/
:
g
and
Jfe }
//,
A,
= argmax{ ^'"^?*
'"'''''
:
;
g
.4,
end {while}
Step
3.
Set r
for
=
SC E
t;
set
^it
=S
,
if
S =
Jfc
for
some
k
=
1
,
.
.
.
,
r;
y
0,
In
what follows
let (7r,y, i/,5)
otherwise.
be an output of
A2- Note that the output of algorithm ^2
The
idea that algorithm
A2
is
just an
is
v4i
and
(r,y,Ti,J) he the output of
let
uniquely determined by
unambiguous version of Ai
its
is
input.
formalized
in
the
following result:
Proposition
(a) for
1
The following relations hold between
the outputs of algorithms
A\ and A2:
1=1,...,
{6k,
i/'
=
\Jk\
for some k
=
1,
.
.
.
,
r;
(S)
0,
otherwise:
(b) y
(c)
n
=
V;
satisfies
=
Jk
ffk
=
{^i
^=1
^\J,\}.
r.
t
{t^\j,\-\h,\+i^--,^\j,\]-
=
(9)
(10)
r-
l
Outline of the proof
Parts (a) and (c) follow by induction arguments. Part (b) follows by (a) and the definitions
of y and
y.
Remark:
Proposition
shows that y and v are uniquely determined (and thus invariant
1
under different tie-breaking
the permutations
Tsoucas
solves linear
by algorithm Ai
rules)
also reveals in (c) the structure of
that can be generated by ^i.
tt
and Bhattacharya
[32]
It
et al.
[4]
proved from
program (P) optimally. Next we provide
a
first
new
principles that algorithm
proof, using linear
A\
programming
duality theory.
Proposition 2 Let vector y and permutation
and y are an optimal primal-dual pair for
tlte
be
rr
linear
generated by algorithm A\
Th>.n i(t)
programs (P) and(D).
Proof
We
first
show that y
is
dual feasible.
C 5„
it
follows that !/„_!
and since 5n-i
Similarly, for
it
=
1,.
.
.
—
n
,
By
definition of u^ in
,
it
follows that
<
by definition of
2,
^i
i^n-k
it
follows that
k
j=0
and since Sn-k-\
and by
C
Sn-k,
definition of y,
Moreover, for
Hence y
is
ib
=
follows that fn-t-i
it
we have y^ <
0, 1,
.
.
.
,
n —
1
0, for
we
5 C
<
0.
Hence
Uj
t"
have, by construction.
dual feasible.
10
<
0, for j
=
1,
.
.
c
-
1.
d
.
Let I
=
Since V[A,b)
t'(7r).
show that 1 and y
it
{ ""i
is
complementary slackness. Assume y^ ^
satisfy
must he S = Sk =
an extended polymatroid, x
is
>
,
""A.-
}
,
for
some
And
k.
^Afx, = J2^i?
i'(7r)
0.
Let us
Then, by const riirt ion
since x satisfies (3),
follows that
it
^'>z.^=6(5).
;=1
leS
Hence, by strong duality
primal feasible.
and y are an optimal primal-dual
pair,
and
comph^es
this
the proof.
Remark: Edmonds
[10]
introduced a special class of polyhedra called polymatroids. and
proved the classical result that the greedy algorithm solves the linear optimization |3roblem
over a polyhedron for every linear objective function
Now,
polymatroid.
that
^1
result
is
in
the case that
Af =
the greedy algorithm that sorts the
and Proposition
2
it
for
1,
/
/?, 's in
if
and only
if
£ S. and S C
the polyhedron
E
.
ii
nonincreasing order
follows that in this case B{A.b)
is
is
ea.s_\
is
a
to see
By Edmonds
a polymatroid
1 h<^refore.
extended polymatroids are the natural generalizations of polymatroids. and algorithm A\
is
the natural extension of the greedy algorithm.
The
well
fact that
known
i'(7r)
and y are optimal solutions has some important consequences
that every extreme point of a polyhedron
objective function. Therefore, the
i'(:r)'s
is
the unique maLXimizer of
som^
are the only extreme points of PiA.b).
It
:<«
liiu- ir
Uowrr
it
follows:
Theorem
1
(Characterization of Extreme Points) The
set of
eitremt points nfPiA.h)
is
{v(n)
The
:
n
ts a
permutation of
optimality of the adaptive greedy algorithm
Ai
E]
leads naturally to the definiin^n
of certain indices, which for historical reasons, that will be clear later, we
call
generahz'
Gittins indices.
Definition 4 (Generalized Gittins Indices) Let y be the optimal dual solution
y,e
lu r-
ated by algorithm ^i. Let
7,
^
=
S:
We
say that 7i
,
.
,
.
,
/,
,-€f.
E0S31
7n are the generalized Gittins indices of linear program (P).
11
(11)
Remark:
Notice that by Proposition 1(a) and the definition of
tation n
an output of algorithm A\ then the generalized Gittins indices can be coni|iutod
is
y, it follows
that
if
permu-
as follows;
12)
13)
Let 9?_
TT
=
{ar
6
3?
:
I
<
0}. Let 71,
be a permutation of E. Let
T
.
.
.
,
7„ be the generalized Gittins indices of (P). Let
be the following n x n lower triangular matrix:
/I
In the next proposition
0\
...
1
1
Vi
1
...
1/
and the next theorem we reveal the equivalence between some
optimality conditions for linear program (P).
Proposition 3 The following statements
(a)
satisfies (9)
TT
(b) T
is
are equivalent:
and (10):
an output of algorithm A\:
(c)
R^A~^ £
(d)
T^r,
<
3J!1~
7t2
X
9?,
< •• <
and then
the generalized Gittins indices are given by 'r
— R-AZ^T:
7>r„.
Outline of the proof
(a)
(b)
^
^
(b):
Proved
(c):
It is clear,
in
Proposition 1(a).
by construction
in
i^
Now,
in the
^1, that
= R.AZ\
proof of Proposition 2 we showed that
7^
and by (14)
it
i^
= ^^
follows that
7,
= R,AZ'T
12
14)
G
3?"
x
3?.
Moreover, by
(
12)
we get
By
(c) => (d):
(c)
we have
= 7,r-i ^r„a;' £^1-' x«,
whence the
result follows.
(d) => (a):
By construction
5
of
the generalized Gittins indices,
7,
Also,
it
is
=
^1
+
easy to see that
satisfy (10),
and hence
Combining the
(9).
it
...
9j
in
algorithm A2, the fact that y
y and the definition of
follows that
fovi£Hh,
4-^^,
<
=
0, for j
>
and t
=
These two facts
2.
lo)
1,
must
clearly imply that t
which completes the proof of the proposition
result that algorithm
equivalent conditions in Proposition
3,
A\
solves linear
we obtain
program (P)
optimal!.\
with the
several optimality conditions, as sliov^n
next.
Theorem
2 (Sufficient Optimality Conditions
of the conditions (a)-(d) of Proposition 3 holds.
and Indexability) Aisuwe
Then
i'(t) solves linear
fhaf
program [P)
any
opti-
mally.
It
is
easy to see that conditions (a)-(d) of Proposition 3 are not,
optimality conditions.
They
are neccessary
if
the polytope B(A,b)
is
in
general, necessary
nondegenerate
Some
consequences of Theorem 2 are the following:
Remarks:
1.
Sensitivity analysis:
Optimality condition
suited for performing sensitivity analysis
permutation n of E,
solves
for
what vectors
R
(c)
is
specially
vvi='ll
Consider the following question: given a
and matrices
problem (P) optimally? The answer
is:
for
RkAZ^ € 3?r^
13
of Proposition 3
X
R
9?.
.4
and
can we guarantee that l(t)
A
that satisfy the condition
.
We may
By
also
zisk:
which permutations
for
Proposition 3(d), the answer
thus providing an
problem of
0(n
now
is
optimal
permutations n that satisfy
for
is:
can we guarantee that uiw)
tt
log n) optimality test for
Glazebrook
tt.
addressed the
[17]
sensitivity analysis in stochastic scheduling problems.
His results are
in
the form of suboptimality bounds.
2.
Explicit formulae for Gittins indices: Proposition 3(c) provides an
mula
The formula
for the vector of generalized Gittins indices.
exi:)licit
for-
reveals that the indices
are piecewise linear functions of the reward vector.
3.
Indexability: Optimality condition (d) of Proposition 3 shows that any permutation
that sorts the generalized Gittins indices
nonincreasing order provides an optmial
in
Condition (d) thus shows that this class of optimization
solution for problem (P).
problems has an indexability property.
In the case that matrix
A
of (P) can be simplified.
let 5(^4^,6*^)
has a certain special structure, the computation of th^ indices
E
Let
be partitioned as
E =
be the base of an extended polymatroid;
U;.\_j
let x''
E'^.
=
.
For
(x|''),g£^;
^-
=
h'
1.
let (P;.
)
bo the
following linear program;
max{ Y, ^^-f
{Pk)
let {7,-
•
^^ e 5(.4^6'^)};
}i6E^ be the generalized Gittins indices of problem (Pk)-
Assume
(1(3)
that the follou ing
independence condition holds:
Af = Af^^" =
Under condition
(Pk), as
shown
Theorem
(17) there
in the
is
iA'')f''^'
f
€ 5 n Ek
and
SCE.
(17)
an easy relation between the indices of problems (P) and
next result.
3 (Index Decomposition)
dices of linear
for
,
programs (P) and (Pk)
7,
=
7*,
Under condition
(11),
the generalized Giflins in-
satisfy
forieEk
14
and k
=
I
K.
(18)
Proof
Let
hi
=
7*,
for
E
Let us renumber the elements of
of
tt
=
h
(IH)
A'
\
so that
/ij
Let T=:(l,...,n). Permutation
and
G Ek
/
<
£
• <
<
/l2
(20)
/in
induces permutations
Tr*^
of E^. for
fc
=
1.
/v.
that satisfy
(21)
Hence, by Proposition 3
follows that
it
=
7:.
=
fort
/?;.( -4^*)-' T,.
A-
l
or, equivalently,
\
r-'.4?
WjT]
>
7^2
1
•
•
'
77r*
=
/
On
an jE^I x \Ek\ matrix with the
the other hand,
(22)
sani»^
structure as matrix T. for k
=
1.
we have
/I
0\
..
•1
T-'Ar
./?:,)
u-
-I
is
i?^...
r-'.4:./
\
where Tk
(/;;,,
...
1
=
-1
...
1
.4^
-1
\
1/
/
\
{1,2}
.4
Now, notice that
if
i
'-'-A
£ Ek,
^{1
j)
j
_
"-'}
j{l
"}
\A\{1
£
E\
^{1
.{1
.4
Ek and
}}<^E,
_
i
<
--4a{1
j then,
^{\
15
"}
n-1}
.4i^
"^
I
by (17):
j-i}n£t
_
^{1
j-i}
(23)
Hence, by (19) and (23)
it
follows that system (22) can be written equnalentiy as
KT-^A,^R^.
Now,
(20)
(24)
and (24) imply that
hi
-
/13
R^A-' = KT-^ =
G D?r^ X K,
hn-l
\
and by Proposition 3
it
- hn
/
/in
follows that the generalized Gittins indices of
7.
12o)
problem (P)
satisfy
= R.a;'t
Hence, by (24),
h,
and
this
=
for
7i,
E
completes the proof of the theorem.
Theorem
3 implies that the
fundamental reason
easy and useful consequence of Theorems 2 and 3
Corollary
1
Under
assumptions of Theorem
the
can be computed by solving the
K
is
17)
An
an optimal solution of problr-m
{/')
for
decomposition to hold
3,
=
1,.
We
On
(
.
.
K
by algorithm
A\
(H'd
indices.
will see later that the classical
has the index decomposition property.
the other hand,
we
is
much
stronger th u
multi-armed bandit prolikm
will see that
(see [22]) has the indexability property, but in the general case
2.3
.
important to emphasize that the index decomposition property
the indexability property.
is
the following:
subproblems (Pk). for k
computing their respective generalized Gittins
It is
€
;•
it is
Klimov's proM<>in
not decomposalik
Generalized Conservation Laws
Shantikumar and Yao
mance meaisures
[26]
formalized a definition of strong conservation taw? for perfor-
in genera! multiclass
the performance space.
We
m
queues, that implies a polymatroidal structure
next present a more general definition of generalized foi/sfr-
vation laws in a broader context that implies an extended polymatroidal structure
in
tlie
performance space, which has several interesting and important implications. Consider
16
a
.
There are n job types, whirh we
general dynamic and stochastic job scheduling process.
label
i
E —
E:
we denote
U
We
{l,...,n}.
consider the class of admissible scheduling pohcit^. which
to be the class of
,
nonidling, nonpreemtive and nonantiripative scheduling
all
policies.
Let i" be a performance measure of type
We
assume that r"
jobs under admissible policy
u. for
£
/
E
Let r" be the corresponding performance \ector.
an expectation.
is
i
Let x^ denote the performance vector under a fixed priority rule that assigns priorities to
the job types according to the permutation n
highest priority,
.
type
,
.
.
ttx
=
(tti,
.
.
.
,
7r„)
of E, where type n„ has the
has the lowest priority.
Definition 5 (Generalized Conservation Laws) The performance vector x
satisfy generalized conservation laws
and a matrix
—
.4
if
there exist a function
(/if )teE,SC£ satisfying (1)
—
6:2
')?+
is
said to
=
such that 6(0)
such that:
(a)
= ^-4fx^
6(S)
for all
TT
= 5
and
^^ Af x"; =
b{E),
{ti
:
7r|5|}
S C E:
(2(5)
(b)
^Afx"^ >b(S),
SCf"
for all
and
i€S
In words, a
exist weights
performance vector
Af
any subset S
types (in
S'^)
is
G
</
said to satisfy generalized conservation laws
such that the total weighted performance over
under any admissible
in
for all
//
(27)
if:
there
.e£
C E
job types
is
achieved by any fixed priority rule that gives priority to
over types in 5.
The strong
correspond to the special case that
The connection between
una riant
and the minimum weighted performance over the job types
policy,
is
all
all
all
other
conservation laws of Shantikumar and ^'ao
weights are
Af =
[26]
1.
generalized conservation laws and extended polymatroids
is
the
following theorem:
Theorem
and
(26)
(a)
x"
(b)
The
=
4 Assume that
(27).
performance vector x
satisfies generalized conservation laws
Then
vertices of
v{ir),
the
B{A,b)
are
the
performance vectors of the fixed priority
for every permutation t of E.
The extended polymatroid base 3(A,b)
is
the
17
performance space.
rules,
and
Proof
By
(a)
(26)
(b) Let
it
follows that
X = {x":uGi/}be
pointsof 5(yl,6). By (27)
X
is
=
x'"
it
v(ir).
And
by Theorem
the performance space.
follows that
X C
^(/l,
the result follows.
1
Let Bi.[A,b) be the
By
6).
(a), -B„(,4, 6)
C
.set
.V.
of extreme
Hence, since
a convex set {U contains randomized policies) we have
B{A,b)
Hence
X=
B{A,b), and this completes the proof of the theorem.
As a consequence of Theorem
mance
of at
= com{B{A,b)) CX.
4, it
theorem that the perfor-
follows by Caratheodory
vector i" corresponding to an admissible policy u can be achieved by a randomization
most n
+
1
fixed priority rules.
Optimization over systems satisfying generalized conservation laws
2.4
Let x" be a performance vector for a dynamic and stochastic job scheduling process that
satisfies generalized conservation laws (associated
to find an admissible policy u that
maximizes a
with
linear
.4,
6()).
Suppose that we want
reward function J2i^£
Ri-i'"
This
optimal scheduling control problem can be expressed as
max{
(ft/)
By Theorem
4 this control
^
/?,x,"
:
uGi/}.
(28)
problem can be transformed into the following
linear
program-
ming problem;
(P)
The strong
max{^/?,z, :2:Ge(/l,6)}.
(29)
structural properties of extended polymatroids lead to strong structural prop-
erties in the control problem.
Suppose that to each job type
/
we attach an index.
-,
.
A
policy that selects at each decision epoch a job of currently largest index will be referred to
as an index policy.
Let 7i, ..., 7„ be the generalized Gittins indices of linear program (P).
As
a
direct
consequence of the results of Section 2.2 we show next that the control problem (Pu)
solved by an index policy, with indices given by 71,
Theorem
Then
5 (Indexability) (a) Let
i'(7r)
be
.
.
.
is
,7n
an optimal solution of linear program {P).
the fixed priority rule that assigns priorities to the job types according to permulation
18
n
IS
optimal for the control problem (Pn);
A
(b)
index
The
policy that selects at each decision epoch a job of currently largest generalized C'ttins
optimal for the control problem.
is
previous theorem implies that systems satisfying generalized conservation laws are
indexable systems.
Let us consider
are
K
now
a
dynamic and stochastic
project types, labeled
A
selected.
it
=
1,...,A'.
project of type k can be
At each decision epoch
one of a
in
project selection process, in which there
finite
number
project must he
a
of states
£ Ek
u-
These
states correspond to stages in the development of the project. Clearly this process can be
interpreted as a job scheduling process, as follows: simply interpret the action of selecting
a project k
in
state
i^
G Ek
as selecting a job of type
satisfies generalized conservation laws associated
will see
5,
=
We may
£ \Jki-iEk'
i^
interpret
Let us assume that this job scheduling process
that each project consists of several jobs.
Theorem
/
with matrix
the corresponding optimal control problem
is
and
.4
function b().
solved by an index policy
among
next that when a certain independence condition
set
the projects
is
By
We
satisfied, a
strong index decomposition property holds.
We
thus assume that
E
is
partitioned as
performance vector over job types
E =
[Jk=\Ek-
E^ corresponding
in
obtained when projects of types other than k are ignored
us assume that the performance vector x
with matrix
A''
and
set function b''{),
Let x
=
(x,),g£j.
to the project selection
bo the
inoMeni
they are never engaged). Let
(i.e.,
satisfies generalized
conservation laws associated
and that the independence condition
(
17)
is
satisfied.
Let Uk be the corresponding set of admissible policies.
Under these aissumptions, Theorem
3 applies,
and together with Theorem 5(b) we get
the following result;
Theorem
6 (Index Decomposition)
dices of job types in
The
E^ only depend on
Under condition
(17), the generalized Gitltn.'^ in-
characteristics of project type k.
previous theorem identifies a sufficient condition for the indices of an indexable
system to have a strong decomposition property. Therefore, systems that
satisfy generalized
conservation laws which further satisfy (17) are decomposable systems. For such systems the
solution of problem (i^) can be obtained by solving
This theorem
justifies the
term generalized Gittins
19
K
smaller independent subproblems.
indices.
We
will see in Section 4 that
when applied
to the multi-armed bandit problem, these indices reduce to the usual Gittins
indices.
Let us consider briefly the problem of optimizing a nonlinear cost function on the perfor-
mance
Bhattacharya
vector.
et al.
[4]
addressed the problems of separable convex, min-max.
lexicographic and semi-separable convex optimization over an extended polymatroid. and
provided iterative algorithms
for their solution.
reward case, the control problem
Analogously as what we did
reward function can be reduced
in the Ccise of a nonlinear
programming problem over the base of an extended polymatroid.
to solving a nonlinear
Branching Bandit Processes
3
Consider the following branching bandit process introduced by Weiss
[35].
wiio observed that
can model a large number of dynamic and stochastic scheduling processes.
it
finite
number
project.
=
ik,
It is
We
convenient
Let
E =
type
Ic
U^^^Ek-
=
development of
to stages in the
{l,...,n} be the
i
random time
of a project a
projects N,j, in states j £
Vi,
is
E We assume
for a
duration
(,
a single label
=
If
m,
is
clear that this process
The model
we denote
and the instants
project present.
the
is
//.
—
(the duration of
that given
/',
the durations and the
i.
Projects are to be selected
at which a project stage
number
in
of projects
in
state
project engaged to a
new
;'
is
state,
and
.V,
shall refer
The
decision
is
some
present at a given time, then
[37], [38]) is
(2) external arrivals of
new
m —
(mj,
.
,
lUr,
it
)•
a special case of branching
consist of two parts:
2U
We
completed and there
a semi-Markov decision process with states
which the descendants
u.
as the class of admissible policus
of arm-acquiring bandits (see Whittle
bandit process,
\,
N, are random variables with an arbitrary joint distribution, independent
to this class of policies, which
t
arrivals
replaced by a nonnegative integer
under a nonidling, nonpreemptive and nonantinpative scheduling policy
is
tiie
finite set of possible
and random
c,
other projects, and identically distributed for the same
epochs are
a
is
project can he in one of
combine these two indicators into
and upon completion of the stage the project
descendants
all
follows to
A
A'
Engaging the project keeps the system busy
number of new
of
what
associate with state
i),
,
£ E^. which correspond
i^
in
1.
There
project types.
all
{!^tj)j^E-
stage
of states
the state of a project.
states of
—
of project types, labeled k
number
a finite
i
the linear
in
(1) a transition of the
projects, independent of
/
The
or of the transition.
cliissical
multi-armed bandit problem corresponds to the special
case that there are no external arrivals of projects, and the stage durations are
The branching bandit
Subsection
fore, as described in
gaging a type
I
i
job
process
is
2.4,
In this section,
The
process.
we
first
We may
two
will define
one
of a project selection process
we
shall refer to a project in state
different
performance measures
for a
/
as a type
show that they
satisfy generalized conservation laws,
branching liandit
allow us to model an undiscounted tax structure. In each case we will
and that the corresponding optimal
control problem can be solved by a direct application of the results of Section
Assume now
job.
i
be appropriate for modelling a discounted reward-tax structure.
will
will
5 C £ be
En-
interpret that each project consists of se\eral
The second one
Let
There-
can be interpreted as a job scheduling process
it
In the analysis that follows,
jobs.
cjise
the job scheduling model corresponds to selecting a project of state
in
the branching bandit model.
in
thus a special
1.
We
a subset of job types.
that at time
t
=
there
shall refer to jobs
only a single job
is
in
with types
2.
5
in
the system, which
as 5-jobs
is
of type
/
Consider the sequence of successive job selections corresponding to an admissible polic\
n
that gives complete priority to 5-jobs. This sequence proceeds until
cl
all
5-jobs are exhaust,
for the first time, or indefinitely. Call this an (i.S) period. Let T,^ be the duration (pos>-ihl\
infinite) of
an (i.S) period.
It is
easy to see that the distribution of
the admissible policy used, as long as
(i,
0) period
is
distributed as
i',.
it
T^
is
indepeniliMit of
gives complete priority to 5-jobs
Note thai
.in
be convenient to introduce the following addiiicuMl
It will
notation:
Vi^k
=
duration of the
independent of
'''i,k
f,
=
=
ib
ibth selection
of a type /job; notice that the distribution of
of times a type
:
job
is
i
Qiit)
=
i
job for the
number
Q,{tYs.
We
is
job occurs:
selected (can be infinity);
{^i^k}k>i— duration of the (2,5)-period that starts with the tth selection of
type
,j..
(vi).
time at which the kth selection of a type
number
i
fcth
of type
i
a
type /job
time.
jobs
in
the system at time
assume Q(0) = (mj,
.
.
.
,
m„)
21
is
/.
known.
Q{t) denotes the vector of the
:
=
T^
time until
is
Aff.
=
first
time (can be
T^
note rhat
infinity);
the duration of the busy period.
a type
ri,
if
1 0,
otherwise,
ot:
A >
inf {
Vtk
between the
If
S-jobs are exhausted for the
all
job
is
Hjes
:
ifcth
no more jobs
i
being engaged at time
^jC'"'.*^
+ ^) =
selection of a type
in
5
for
1 }-
t.
^' "°t^ that
^
'
Af^^.
job and the next selection,
j
are selected, then
Af^ = Tf^,
if
is
the InttM-val
Vjob
any. of an
the remaining interval of the busy
period.
A^ =
is
inf {
t
:
^,g5
=
Ii{t)
1 };
selected. If no job in
Proposition 4 Assume
missible policy.
S
is
is
the interval until the
A^ = T^
selected,
,
first
job
in S. if any,
the busy period.
that jobs are selected in the branching bandit procfis under an ad-
Then, for every
(a) If the policy gives
A^
note that
S C E
complete priority
then the busy period [O.T",^) con
to S'^ -jobs
hi
par-
titioned as follows:
[o,r^)
= [o,rf )U
0^'^-^..'=
+
^'^')
«•
p-
1-
(30)
w.
p.
1
(31)
i6Sfc=l
The busy period [0,T^) can
(b)
[0,T^)
be partitioned as follows:
= [0,Ai)\J
(j[r,,,r..,
+ Af,)
iesfc=i
The following
(c)
inequalities hold w. p.
I:
^tk<Tf.i
(32)
A^<Tf.
(33)
and
Proof
(a) Intuitively (30) expresses the fact that
5'^-jobs, the
all
jobs
duration of a busy period
in S'^ are
exhausted for the
is
first
under a policy that gives complete priority
partitioned into (1) the initial interval
time, and (2) intervals
exhausted, given that after working on a job
generated.
22
in
S we
in
which
all
jobs
in
in
to
which
5
are
clear first all jobs in S' that were
More
is
formally, let u be an admissible policy that gives complete priority to
easy to see then that the intervals
the inclusion
D
inclusion C. Let
at time
that
G
t
£
t
some job
t
[T'j.fc. Tj.A:
the right hand side of (30) are disjoint
in
obvious. In order to show that (30)
is
[0,
[0,
\
is
indeed a partition,
otherwise we are done. Since u
7"^'),
being engaged. Let j be the type of this job.
is
+
T^)
for
Tjl)
some
k,
5'^-.|olis.
is
Moreover,
us shovv the
let
nonidlmg
€ 5 then
If j
and we are done. Let us assume
a
policy,
clear
is
it
£ S'
tiiat j
[t
Let us
.
define
D=
Since
D
> T^" €
t
follows that
D
is
{r,,fc
:
ie S,k €
D ^
follows that
it
a finite set. Let
Now,
0.
6 S and
i'
job, with
!
is
nonidling,
it
follows that at this epoch one starts working on
+
is,
=
r, j^
''I'.jf
And
r,«,fc.
+ ^r
+T,Tf..
k-
•
completes the proof of the proposition.
this
.
by definition of
(b) Equality (31) formalizes the fact that
can be decomposed into (1) the interval
is
a decision epocii at
D
it
follows that
/
G
5'~ is
which
contradicting the definition of "
and
7",f \.
/.
it
r.
Now,
^.
t
jf
0. for all
k' be such that
must be
r,.
>
t.
G 5, that
<
<t}.
<
that r,.jf +7",.
Since the policy
r.,|..
since by hypothesis E[(,]
= max
r,. t*
Assume
iy,},and
{1
.
-
['..a...
empty.
some type
Hence,
^..
.j,..
+ T^
/
it
;_. ).
under an admissible policy the bus} period
until tiie first
job
5
in
is
selected.
tiie disjoint
("2)
union of the intervals between selections of successive 5-jobs and (3) the interval between
the
laist
selection of a job in
selected, then
i/,
=
0,
for
i
5 and
G
S,
the end of the busy period.
and
A^ =
Note that
no
if
T^, thus reducing the partition
.S-job is
to a single
interval.
(c)
Let
Tj^jt
be the time of the
fcth selection of
a type
i
job
(i
G
5). Since the next selec-
tion (if any) of an 5-job can occur, at most, at the end of the (i,S^) period
inequality (32) follows.
On
the other hand, since the time until the
5-job,
A^, can be
3.1
Discounted Branching Bandits
In this subsection
at
we
most the duration of the
will
initial {i,S'^)
first
[r,
j.
.
r,
;.
+
7,"'
).
selection of an
period. (33) follows.
introduce a family of performance measures for branching bandits.
{i"(Qr)}Q>o, that satisfy generalized conservation laws.
23
They
are appropriate for modelling
a linear discounted reward-tax structure on the branching bandit process.
We
have already
defined the indicator
/.«)
a >
and, for a given
xr(a)
a type
;'
1,
if
0,
otherwise,
job
being engaged at time
is
/:
=
0,
we
34)
define
= E„
r
=
re-^'l,{t)dt
Jo
E,[I,{t)]e-'" dt,
i
e E.
(33)
Jo
Generalized Conservation Laws
3.1.1
In this section
we prove that the performance measure
satisfies generalized
for
branching bandits defined
in (3))
conservation laws. Let us define
nso'
4S
e-oUt]
3(3)
and
6a(5)
The main
Theorem
result
7
is
= E
- E
e-^'di
[37;
the following
(Generalized Conservation Laws for Discounted Branching Bandits)
The performance vector for branching bandits
(26)
e-'^'di
/;
/;
and (27) associated with matrix Aq and
(a) satisfies generalized conservntion laujs
J""
set function
bd).
Proof
Let
SC
E. Let us assume that jobs are selected under an admissible policy
ates a branching bandit process. Let us define two
as functions of
its
random
vectors, (r})i^E
u.
This gener-
and
(r,
"
),^s-
sample path as follows:
r]
=
/
I,(t)e-^'dt
Jo
= Y,
;.
^j
e-"' dt
Jr.,
e-^' di,
(38)
Jo
rrt
and
,"'^
=
V
t^x
'
e-°--.*
/
J°
24
e-^'dt,
ieS.
(39)
5
Now, we have
xr(a)
=
E,[r|]
=
E„
=
E
= E„
e-"' di
f:E[e— .Mt^,]E[r%-'^' d/
(40)
rv,
Eu
e-°'c/<
/
.Jo
Note that equality (40) holds because, since u
On
dent random variables.
Eu[r."'']
=
E.
the other hand,
"•
=
E„
=
E
=
is
nonanticipative,
r,
j.
and
i
,
jt
f-^'ty/
Eu
I
-Ka^
(A
^0
it,
/•T'r
r
dt
f
are uulepen-
we have
e-^'d/
t,'""" I
(41)
J
e-^< dt
(4-J)
Eu
L*:=l
e-°'dt
Eu
^0
Z^-
143)
L*:=l
Note that equality (42) holds because, since
u
is
nonanticipative,
r,,^
and
7","^,
are indepen-
dent. Hence, by (41) and (43)
11,5
Eu[r."-^]
=
ieS,
.4f:.xr(a).
(44)
and we obtain;
Eu
We
first
(43)
show that generalized conservation law
complete priority to
5'^-jobs.
(26) holds. Consider a policy t that gives
Applying Proposition
4 (part (a)),
r>.k
we obtain:
+ T.^,
fi-^' dt
Jo
Jo
,est=i^^.>
di
65
*:=1
II.
r
^'€5
^0
25
(46)
Hence, taking expectations and using equation (45) we obtain
= E
-^'dt
I'
Jo
e-^Ut
r
+ Y,Atx:{Q)
le*
or equivalently, by (37),
Y,AtzUa) =
b,{S),
which proves that generalized conservation law (26) holds.
We
next show that generalized conservation law (27)
under admissible policy
u.
is
satisfied. Since
jobs arc selectt^d
Proposition 4 (part (b)) applies, and we can write
'•.,*+^.-/.
e-'"d<
e-'^M/.
Jo
Jo
On
+
the other hand, we have
EE/
>
e-^" dt
(4S)
e-'^'dt
Jo
Jo
r^e-^uf-f
>
=
Jo
Jo
/
e--df-/
Notice that (48) follows by Proposition 4 (part
Proposition 4 (part
(c)).
(c)),
Hence, taking expectations
r *
>
E
=
ba{S)
/
e-^'dt
(•JOi
e-^'dt.
(51^
(49) follows by (47), and (JU)
in (51),
and applying (45) we
li>
nliiani
m
r'^' df
e"-"' dt
i:
{')>)
which proves that generalized conservation law (27) holds, and
this
completes the ptcur of
the theorem.
Hence, by the results of Subsection 2.3 we obtain:
Corollary 2 The performance space for branching bandits corresponding
to
mance vector x^{a)
the verfirfs of
B{Aa,ba)
are the
is
the extended polymatroid base
performance vectors corresponding
26
B{Aa.ba); furthermore,
to the fixed priority rvles.
the
perfor-
The Discounted Reward-Tax Problem
3.1.2
Let us associate with a branching bandit process the following linear reward-tax stniciure;
An
instantaneous reward of R,
addition, a holding tax C,
in the
is
received at the completion epoch of a type
is
incurred continuously during the interval
system. Rewards and taxes are discounted
in
tiiai
/
job
In
a type /job
time with a discount factor a > U
is
Lt^t
us denote
Vu,a'
(m)
=
expected total present value of rewards received minus taxes incurred under
policy u, given that there are initially m, jobs of type
The discounted
reward-tcix problem
is
i
i
in the
{m) over
I'u.o"
we reduce the reward-tax problem
to the pure rewards case (where
fi
closed formula for Vu.a'
o\
(t))
/
G
E
the following optimal control problem: find an ad-
ft C'\
missible policy u' that maximizes
I
system, for
and show how
all
admissible policies u
to solve the
C=
problem using
0)
In
thi.'-
We
seotinn
also find a
algoiitlini .4i
The Pure Rewards Case.
Let us introduce the transform of
n,, i.e.,
=
Vl^fHm)
"^,{0)
—
E[e"^"'
].
We
then have
E.
i6Efc=l
av.
E.
(53)
L*:
=l
E[e
(54)
(55)
Notice that equality (54) holds by (41).
It
is
also straightforward to
model the case
during the interval that a type
Let Vu,a
(ni)
i
job
is in
in
which rewards are received continuously
the system rather than at a completion epoch.
be the expected total present value of rewards. Then
re
K!,^-°'(^n)
= E.
The Reward-Tax Problem; Reduction
e-''^I,(t)dt
to the
27
X;«,xr(Q).
Pure Rewards Case
We
will
next show
how
to reduce the reward-tax problem to the pure rewards case using the
following idea introduced by Bell
for further discussion).
[1]
(see also Harrison [18],
The expected
they are charged continuously
the arrival epoch of a type
«
Stidham
present value of holding taxes
in time, or
[2!^]
is
and Whittle
the
[38]
same whether
according to the following charging scheme
At
job, charge the system with an instantaneous entrance charge
of (C/a), equal to the total discounted continuous holding cost that would be inruried
the job remained within the system forever; at the departure epoch of the job
parts), credit the
system with an instantaneous depariure refund of
(if it
if
ever de-
(C',/o)- thus refunding
that portion of the entrance cost corresponding to residence beyond the departure epoch.
Therefore, we can write
K^,o'^*(f")
E^[Rewards]- E, [Charges at/ =
=
(
Eu[ Departure refunds]
—
0]
+
Eu[ Entrance Charges]
)
teE
=
V<«^^'°'(m)-i:m.(C./a)
where
fl:
From equation
(56)
it is
= (C./a)-^E[A^.,](C,/a).
(57)
straightforward to apply the results of Section 2 to solve the control
problem: use algorithm A\ with input {Ra,Ac^), where
>
Let 7i(a),
Theorem
.
.
.
a
J
1
— ^(q)
,7n(a) be the corresponding generalized Gittins indices. Then we have
8 (Optimality and Indexability: Discounted Branching Bandits)
(a) Al-
gorithm Ai provides an optimal policy for the discounted reward-tax branching bandit problem;
(b)
An
The
in
optimal policy
is
to
work
at each decision epoch
on a project with largest inder -,(o).
previous theorem characterizes the structure of the optima! policy. Moreover, since
Proposition 6 below, we find closed form expressions for the matrix
28
.4^,
and
tiie set
function 6a(). we can compute not only the structure, but also the performance of the
optimal policy (optimal
also that the
optimal extreme point of the extended polymalroid)
profit,
decomposition of the indices does not hold
in
the general case;
the generahzed indices of the states of a type k project depend
of project types other than
decomposable system.
Theorem
branching bandits
t, i.e.,
We may
in
•
•
,
other words.
general on characteristics
an example of an indexable but not
is
also prove the following result:
9 (Continuity of generalized Gittins indices)
dices 7i(a),
in
\ore
The generalized Giffnis
7n(a) are continuous functions of the discount factor a
.
for a
>
in-
0.
Proof
It
is
easy to see that the generalized Gittins indices depend continuously on the in|iut of
algorithm ^i. Also, since the function a
>—'
[Rcc-^a)
is
continuous the result follows
The Pure Tax Case: Minimizing Time-Dependent Expected Number
D
in
Sys-
is
often
tem
In several applications of
branching bandits (for example queueing systems) one
interested in minimizing a weighted
jobs
in
the system. Let
number
sum
of discounted time-dependent e.\pected numl>er of
QJ"() denote the Laplace transform
of type j jobs in the system under policy u,
g;"(e)=/
of the time-dependent expected
i.e.,
E4Q,{t)\Q{0) = ni]e-^'dt.
jeE
m)
Jo
An
interesting optimization problem
The problem can be modelled
as a
is
to:
pure tax problem as follows:
i:QQr(a) =
-K<°/'(m).
J€E
and thus by making
i?
=
0,
C,
=
1
and C,
=
for
i
^
j in (56)
we obtain
.
1
See Harrison
[18] for
1 1'
a similar result in the context of a multiclass queue.
29
€
f
(60)
Interpretation of Generalized Gittins Indices in Discounted Branching
3.1.3
Bandits
We
Consider the following modification of the branching bandits problem:
problem by adding an additional project type, which we
original
state/stage of infinite duration, that
continuously discounted,
is
is,
dq
=
call
oo with probability
with only one
0.
A reward
1.
received for each unit of time that a type
Notice that the choice of working on project type
modify the
project
engaged.
is
can be interpreted as the choice of
retirement from the original problem for a pension of Ro, continuously discounted
Now, the modified problem
time
=
t
is still
then ask the following question: Which
and another
of continuation (working on the project in state i)?
We
have then the following
Proposition 5 The generalized
in state
/
at
£ E.
the smallest value of the pension Ro
is
which makes the option of retirement (working on project type
value Ro{i)-
time.
in
a branching bandits problem. Let us assume that
there are only two projects present, one of type
We may
of Ro-
Let us
0) preferable to the option
snrrdider
this (quitable
call
result
Gittius indei of project state
i
the original hrniicbing
in
bandits problem coincides with the equitable surrender value of state
i.
Rq(')
Proof
Let 7i
,
.
.
.
,
7n be the generalized Gittins indirps corresponding to the original branching
bandits problem.
Let 7oi7ii--.7n ^^ ^^^ generalized Gittins indices for the modifit^d
problem. Let us partition the modified slate space as
the decomposition condition (17) holds. Hence
7g
Now,
since by
Gittins index,
Theorem
it
8
=
it is
flo
and
E=
Theorem
^,"
=
7;,
{0}U£^.
3 applies,
is
Whittle
such a breakpoint. Therefore R^(i) =
armed bandit problem, and provided an
his intuitive proof.
jiavp
(61)
")o-
makes the options
\s
[34] also
is
But by
complete.
in his
of continudefinition
D
analysis of the multi-
interpretation of the Gittins indices as equitable
makes use of
Here we extend
Ro = 7o
and the proof
introduced the idea of a retirement option
surrender values. Weber
and therefore wp
optimal to work on a project with largest current generalized
follows that the surrender reward Ro which
[36], [38]
eas\ to verifv that
J€E.
ation and of retirement (with reward Rq) equally attractive
R^{i)
It is
this characterization of the Gittins indices in
this interpretation to the
30
more general
ca.se
of branching
FVom
bandits.
this characterization
it
follows that the generalized Gittins indices roiiici<le
indeed with the well known Gittins indices
which
The
name.
justifies their
Computation of
3.1.4
the classical miilti-arnied bandit piolilem.
in
and
.4o
6o(
)
results of the previous sections are structural, but
of the matrix .4^ and the set function ba{) appearing
(26)
and (27)
for the
in
branching bandit problem. Our goal
generic data the matrLx
these computations
do not lead
Aa and
make
the set function 6q(
)
to explicit
compulations
the generalized conservation laws
in this
section
Combined with
is
to
computp from
the previous results
possible to evaluate the performance of specific policies as well
it
as the optimal policy.
As generic data
of Vi,{Nij)j^E
is
for the
branching bandit process, we assume that the joint distribution
given by the transform
In addition,
we have already introduced
bution of
(denoted G,()):
V,
.'V.n
= E
<i>,{e.Zi,...,Zn)
f62)
-1
the the generating function of the marginal distri-
(iyi)
Jo
Finally the vector
As we saw
role.
We
will
in
m=
(mj,
.
.
.
,
m^)
of jobs initially present
the previous section the duration of an
compute
its
moment
(;,
is
given.
S)-period,
plays
a
mirial
generating function
^f(e)
=
E[e-^^."].
((3 1)
For this reason we decompose the duration of an (i,5)-period as a
random
7",^.
sum
of mdepeiirj.Mit
variables as follows:
•V..;
((jj)
je5 k=\
where
v,,
{Tji^}k>i are independent. Therefore.
*f(e)
=
E
= E
jes
=
<I>.(e,(*f(0))^g.,,l5c),
31
i£E.
((3(3)
Given 5, fixed point system (66) provides a way
We now
to
compute the values of
have the elements to prove the following
set function
6a()
G
£'
a
branching bandif process. luntriT
satisfy the following relations:
.4^
l-*f(a)
=
1
bais)
/
result:
Proposition 6 (Computation of Aq and 6o()) For
Ac and
^,- [0). for
= -
n
-
ieS,
^.(a)
I'^Ti'')]"''
- -
SC
E.
ni^f (^)^^
(67)
scE
(68)
Proof
Relation (67) follows directly from the definition of ^4^^
O"
'he other hand, we have
m,
(69)
.€5
Hence,
,-»(
u:
dt
fc=l
Undiscounted Branching Bandits
3.2
In this section
we address branching bandits with no discounts.
pure rewards the problem
all
policies
have the same reward.
Under
a
undiscounted tax structure on the branching bandit process, however, the |irol)lem
linear
becomes
interesting.
criterion, also
[24]),
since
trivial,
is
Clearly, in tiie case of
Indeed, since an optimal policy under the time a\erage holding cost
minimizes the expected total holding cost
in
each busy penod (see Nain
f /
nl.
modelling and solving undiscounted branching bandits leads to the solution of several
classical
queueing scheduling problems.
More importantly, our approach
undiscounted problems, which,
literature.
To
our opinion, has not been thouroughly addressed
in
indices.
We
q
as the discount factor
It is
will
^
provided there are no
0,
not clear, however, what happens
introduce
in this
when
the
there are
for the
ties of the
undiscounted
corresponding
ties.
subsection a performance measure r" for a branching bandit
process that satisfies generalized conservation laws
It is
appropriate for modelling
ear undiscounted tax structure on the branching bandit process.
following development that
in
after solving an indexable discounied scli^duling
give a concrete example:
problem, researchers say that the same ordering of the jobs holds
problem
and
reveals rigorously the connections of discounted
all
We
the expectations that appear are finite
shall
We
assume
will
show
a
lin-
m
the
later
necessary and sufficient conditions for this assumption to hold. Using the indicator
if
{1,
earlier,
we
i
job
is
being engaged at time
i:
otherwise,
0,
we introduced
a type
let
-,
—
{t)idt
C.U
i€E
(72)
Jo
Let us define
^s_^[Tr]
E[i>,]
and
fc(5)
= |E[(Tf)2]-iE[(rf )2] +
5^6.(5),
.65
where
E[u.]E[vf]
fEjTn
33
E\iTfy]\
(74)
3.2.1
We
Generalized Conservation Laws
prove next that the performance measure for a branching bandit process defined
The main
satisfies generalized conservation laws.
result
Theorem 10 (Generalized Conservation Laws
The performance vector for branching bandits
and (27) associated with matrix A and
is
for
m
(7'2)
the following:
Undiscounted Branching Bandits)
r" satisfies generalized covsert atiov /aits (26)
set function b().
Proof
Let
SC
E. Let us assume that jobs are selected under an admissible policy
ates a branching bandit process. Let us define two
random
vectors. (fj),6£
u.
This gener-
and
(r,
),e'^-
as functions of the sample path as follows;
ri
r
=
-
i,(t)tdt
= j2
r
+ -^).
H(^..in,fc
tdt
/
G
£".
76)
and
\r
ii,s
/"
r,.k
'
i£S.
tdt,
Jt,
= l •''*
,
I
fc
+ Tr
I,
Now, we have
z."
=
E„[r|]
=
E.
= E
lk=l
=
E[v,]
EW,] E[vf]
Eu
79)
2
Lfc=i
Note that equality (78) holds because, since u
dent random Ve^iables.
On
the other hand,
is
nonanticipative,
=
E,
=
E„
jt
and
t,.;
are indepen-
we have
^.*+T,-V
-..*+7,^
E„[r."'^]
r,
=
tdt
E„
tf
tdt
I/,
I
(T,';)\
.c
lk=i
=
+
EIT^] E
'fc=i
34
EHE[(Tf)2]
(80)
Note that equality (80) holds because, since policy
independent random variables. Hence, by (79) and
—
u
nonanticipativr,
is
wn
=
7,""^.
are
(80):
.r-^ENE[rf] _ E.[r;'-^]-iEHE[(7f-)^]
EN
and
t,j;
"'
'
^'^
and thus we obtain:
We
will first
show that generalized conservation law
(26) holds.
gives complete priority to 5'^-jobs. Applying Proposition 4(a),
Consider a
t that
polic.\
we obtain:
tdi
Jo
Jo
,^sk=l''^'^
(T^m
+!:'
sr^
)
2
II.
5
(^3)
165
Hence, taking expectations and using equation (82) and
tiie
definition of
/j(,s')
we
nlirain
Y^A^z' =b[S).
which proves that generalized conservation law (26) holds.
We
next show that generalized conservation law (27)
under admissible policy
u.
Then, Proposition
^0
On
the other hand,
Jo
satisfied.
4 (part (b)) applies,
f^^ ^tl
Let thejobs
and we can
h(=
selecti'd
vvrite
Jr,,,
we have
E^"-'
=
Ztj"
€>" i=l
.G5
>
=
*
•
ZE/
I
/
dt
>
Idt
/
(c)).
4
-
=
tdi
(86)
tdi.
(87)
Jo
(part
(c)).
Hence, taking expectations
Y^A^z^
(8o)
Jo
^0
Notice that (85) follows by Proposition
idt
-
Jo
Proposition 4 (part
is
E.[^r|'-^-]
35
(86) follows by (84). and (87) by
in (87),
+ ^6,(5)
and applying (82) we obtain
>
E[ /
Jo
=
"
'"tdt]-E[
/
tdt]
+ yb,{S)
Jo
f^s
6(5)
{8S)
which proves that generalized conservation law (27) holds, and
this
completes the proof of
the theorem.
Corollary 3 The performance space for branching bandits corresponding
mance vector
B{A,b) are
z"
the
is
the
extended polymatroid base S(A,b); furthermore,
performance vectors corresponding
to
the
tbt
p( rfor-
vetiufs of
to the fixed priority rules.
The Undiscounted Tax Problem
3.2.2
A
Let us associate with a branching bandit process the following linear tax structure
holding tax C, per unit time
is
incurred continuously during the stay of
a
type /job
in
the
initiall>
ni,
system. Let us denote
Vu
'
The
("i)
=
type
I
expected total tax incurred under policy
jobs
in
tax problem
is
minimizes Vu
for Vu
'
'
the system, for
/
u,
given that there are
£ E.
the following optimal control problem: find an admissible policy
(m) over
all
(m) and show how
we need some preliminary
admissible policies
to solve the
u.
In this section
we
(/'
thai
find a closed forniula
problem usmg algorithm ^i
For that purpi-)ve
results:
Expected System Times
Let Q!"(), Ij{) and x^{-) be as
in
Q;''(0)=/
Subsection
3.1.
E^[Qj{t)\Q{0)
By
=
definition
m]di,
we have
jeE
Jo
and
i,"(0)
From
the above formulzis,
1-
Q'"(0)
2.
ij(0)
is
is
=
it is
E„
[P
d<|Q(0)
lj{t)
= m]
,
jeE
clear that
the expected total time spent in the system by type j jobs under policy
the expected total time spent working on type j jobs under policy
i"(0) does not depend on the policy
u.
Hence, we shall write Xj(0)
36
=
(/
Jj(0).
u.
Clearly.
Now,
letting
\
a
on equation (60) we obtain
^'"^°^
=
^"""^''^^
" efi
^
^ ^i^'""-
*'^°^
+
^'
^
189)
-.(0)
(90)
'''
•''
where
^^=^'- 21^5) -^(0) and
(j:")'(0)
E E[^u]
(1
denotes the right derivative of z"(q) at a
(x]f(0))'
-
= -E„
^)
2E[i.,
t€£
=
0,
that
is:
dt
^0
Hence, we have
q:r(o) =
!¥-'."+''.
Fi^--;-E
E[r,]
E[^'.]
j^^
(92)
.,£
Modelling and Solution of the Tax Problem
We
have, bv (92),
=
Vl"'^\rn}
^C.Q-"(0)
6E
=
From equation
(93)
it
is
(93)
.Sl^^-^tF^l-S--
straightforward to apply the results of Section 2 to sohe the
optimal control problem: use algorithm Ai with input (R,A). where
«,
=
(9 4)
E[v,]
Let 7i,
.
.
.
Theorem
,7n be the corresponding generalized Gittins indices.
Then we have
the
11 (Optimality and Indexability: Undiscounted Branching Bandits)
Algorithm A\ provides an optimal policy for the undiscounted tax branching bandit
(b)
An
index
3.2.3
ivsiilt
optimal policy
is
to
work
at each decision
probhw
epoch on a project with largest cinrmt
7,.
Computation of A and
In this section
b(-)
we compute the matrix A and the
^'
-
E[r.]
37
'
(a)
set function 6() as follows.
'^^
Recall that
and
.65
From equation
(65)
we
obtain, taking expectations:
E[Tf]
=
E[vi]
Solving this linear system we obtain
i€S.
+ J2m.j]E{Tfl
E[7^'^].
Note that the computation of Af
compared with the discounted
easier in the undiscounted case
(9o)
case,
is
where we had
niurh
to solve
a system of nonlinear equations. Also, applying the conditional variance formula to (63)
we
obtain:
Var[7;^]
=
+ (E[T/])[,5Cov[(yV.,),es)l
Var[t;,]
(E[T/])^,5
+
^
E[.V„] \ar[7/].
,
G
5.
(96)
Solving this linear system we obtain Var[T;'^] and thus E[(7^^)^]. Moreover.
E[vj]
= mj+Y,
E[A',;] E[i^.],
j
€ E.
(97)
from equation (69) we obtain
Finally,
E[T^]
= Y.m,E[Tfl
(98)
= ^m,Var[rr'].
(99)
and
Var[T^l
t€5
3.2.4
We
Stability condition
investigate in this section under
a positive solution for
all
sets
SC
what conditions, the
Theorem 12
if
and only
if
finite.
Let
A''
systems (95) and (96) have
E. In this way we can address the stability of a blanching
bandits process, in the sense that the
bandit process are
linear
first
two moments of a busy period of
a
branching
denote the matrix of E[N,j].
(Stability of branching bandits) The branching bandiis process
the matrix I
—
N
is
is
slable
positive definite.
Proof
Suppose
I
—
N
IS
positive definite.
We
will
show the system
is
stable.
System
(95) can
lie
written in vector notation as follows:
{I-N)sTs =
38
vs,
(100)
=
where 7s
minant
where
nominator along the column us we obtain:
in the
nonegative numbers (which are determinants themselves)
^r are
definite, then dei[(I
for all
I
Solving the system using Cramer's rule and expanding the dtnor-
(£'[TJ^]),gs
- N)s] >
£ S and S C E.
for all
5 C £ and
=
if
7
the
—
first
{Var[Tf])t^s
the system
values for
will
•^t-^i']
~
the system
if
stable for
is
all
i
is
>
0.
hypothesis
is
]
-^ ^' ^"^^'ch
true for \S\
stable,
E.
show by mduction on
detUl-N)
—
k.
we
>
i.e.,
\S\
Note that the condition
/?
<
1
in
N
finite, i.e
N
<
—
- X)s] >
is
we use (101)
—N
vve
obtain that
positive definite
-
A'),]
>
0.
for all
S C E
for all
Since
Assuming
For
.s'
\S\
C
=
E.
I.
that the in.liirtion
to obtain;
I
— N
is
positive definite.
positive definite) naturally generalizes the stability
I translates to
If
we
interpret a queueing system as a
E[N] = p = ^E[v] <
of customers that arrive (at a rate A) during the service time
3.3
X
follows that
stable.
system (100) has a positive solution
queueing systems as follows:
branching bandit then
I
is
it
follows that £"[T,^] have finite nonegative
it
that del[{I
I {I
the system
.
show that
will
implies that det[{I
<
same argument
Hence, from (98) and (99)
0.
vectors rn,
from the induction hypothesis. Therefore,
condition
>
E{T,'']
"S,
Therefore, using the
ror[T','-^]
initial
all
S and S C
€.
A' is positive
thus system (95) has a solution
- N)sxs =
two moments of the busy periods are
Conversely,
We
ug
^rid
positive definite, then
is
A''
—
Similarly, (96) can be written as
(7
where 15
If /
v
1.
since A'
is
the numb'^r
of a customer.
Relation between Discounted and Undiscounted Tax Problem
In this subsection
we study the asymptotic behaviour of the optimal
counted tax problem as the discount factor a approaches
counted tax problem, that corresponds to q equal to
notation of Subsections
31 and
3.2,
0.
0.
and
It
is
its
policies in the dis-
relation with the undis-
easy to see that, using the
that
hm
4f, =.4f,
39
(102)
and
^
Um{c,-TE[N,j]Cj]
^^
=
lima«,,,
o\0
a\ol.
'
pr
-'J
^
,""
'"^
- *(a
1
=«.•
,
Eh]
(103)
Therefore, because of the structure of the generalized Gittins indices (see Proposition 3)
it
and (103) that the generalized Gittins indices of the undiscouiited and
follows from (102)
discounted tax problem are related as follows:
lim q-),(q)
A
consequence of (104)
tax problem for
a
\
that a policy which
is
will
is
=
(104)
7,.
asymptotically optimal
in
the discountHd
be optimal for the undiscounted problem.
Applications
4
In this section
we apply the previous theory
to several classical stochastic schedniing prob-
lems.
The Multi-armed Bandit Problem
4.1
The multi-armed bandit problem was
There are
number
K
parallel projects, indexed k
of states
i^
E E^- At each
only a single project.
on
k,
i.e.,
/?.
The
state
but not on
j/(t
-I-
1)
=
t),
ifc(f)
=
ni
K. Project k can be
instant of discrete time
ijt(f
L-
Rewards
R,^(t)
changes to
the introduction.
I
one works on project
If
an immediate expected reward of
factor
defined
+
1)
in
state
=
<
iki't)
are additive
at
one of
a finite
one can work on
0, 1,
time
in
t
then one receives
and discounted
by a Markov transition rule (which
in
tmio by
a
may depend
while the states of the projects one has not engaged remain unchanged.
i/(f) for
/
^
Let P*
k.
probabilities corresponding to project k
=
(p'!j),.j^Ei,
The problem
be the matrix of Markov transition
is
how
to allocate one's re.tiources to
projects sequentially in time in order to maximize expected total discounted reward over
an
infinite horizon.
goal
is
That
is,
if
jit) denotes the state of the project
engaged
to find a nonidling and nonanticipative scheduling policy u that
E.Ei' «,(,)]
(=0
40
at
time
/.
the
maximizes
(103)
We
For this reason we set
of the previous section.
P=
(.?>})<
problem
as a branching bandits
model the problem
f~"'
=
J.
order to apply the
in
v,
=
We
1
rcsultj;
also define iiiatrix
j€E by
if
{p'ly
£ Ek,
'.7
for
some k =
I,.
.
.
,K\
otherwise.
0,
Moreover, by (62) we obtain:
^.{o,z,
E[e— -.f•>.... ^"]
=
-'„)
= 0YI
P^J'J-
^oTieEi,
(106)
and, by (66)
^f(a)
=
<I>.(a,(*f(a))^^,,l5^)
=
/?{l-^p,j(l -*f(a))},
foriGE.
j€5
By introducing
1_^
1-*,(Q)
and noticing that since
u,
=
and by (107) and Proposition
Moreover, since ^^(a)
=
*,(a)
1,
6
=
/?,
it
follows from (107) that
we obtain
0,
where
at
time
1,
if
0,
otherwise.
t
=
there
41
is
a bandit
in
state j\
(107)
The
P=
structure of the matrix
(pij) implies that
which implies that the index decomposition condition (17) holds, and therefore Theorem 3
applies, giving a
Theorem 13
new proof
(Gittins
of Gittins theorem:
and Jones
[14])
For each project k there
depending only on characteristics of project
k,
exist indices {',}i^Ek-
such that an optimal policy
is
fo
engage
at
each time a project with largest current index.
By
the results of Subsection
3. 1.3
we know that the generalized Gittins
indices for this
bandit problem coincide with the usual Gittins indices. Further, by definition of generalized
we obtain a characterization of Gittins
Gittins indices,
purely algebraic characterization. Also, note that
indices as simis of dual variabiles, a
Theorem
13 implies that the multi-armi^d
bandit problem not only has an optimal index policy, but
which
the Gittins indices can be
6,
has an optimal index poliry
index decomposition property, as described
satisfies the stronger
By Theorem
it
algorithm Ai to subproblem
k,
computed by solving
with \Ek\ job types, k
=
I,
.
..
Subsection 2 4
subproblems.
/\'
K
.
in
It is
apjilyiiig
easy to \enfv thp
following complexity result:
Proposition 7 The complexity
Gittins indices of project k
of algorithm
A\ applied
svbproblem
to
A
for cninpuliiiq ihc
is
0(\E,f).
The algorithm proposed by
plexity as algorithm ^i.
Varaiya, Walrand and Buyukkoc
[33]
has the same time roni-
In fact, both algorithms are closely related, as
Let tf be as given by (108). Let rf be given by
rf
=
«,
+
/?
J^p.^rf.
ieS.
Let us now state the algorithm of Varaiya, Walrand and Buyukkoc:
Algorithm VWB:
Step
0.
Pick
TTn
e argmax
{
-\rr
i
^
E
}\ \et
g„„
I
set Jn
-
= max{ -^
t
{:r„}.
42
:
i
£
E
}.
we
will see
next
.
Step
For
k.
pick
it
=
-
n
1
e argmax
7rn_fc
1:
{
\_^u|i>
'
^
€
^V J„_i};
=
{
''j^_^u{,\
'
>
6 E\J„-k}-
y„_i = Jn_fc+1 U [nn-k]
set
Varaiya
et
al.
proved that gi,...,gn. as given by algorithm
[33]
indices of the multi-armed bandit problem.
We
set <7^„_^
state,
VWP,
are the Girtiiis
Let (Tr,y,u,S) be an output of algorithm A\
VWP
without proof, the following relation between algorithms A\ and
Proposition 8 The following
U}
D
i^nlUft}
relations hold: For j
.{t,
,--n
7r,]
—
2,
.
.
.
,n
jr„}
{tTj
111)
and
^-'=Q,
r<'^
and
Ai and VVVB
therefore, algorithms
Klimov
(112)
are equivalent.
introduced the following queueing scheduling process: There
[22]
and n customer types. External
I
£"
G
=
{1,
.
distributed as a
type
I
customer
.
.
,
n}. Service times for type
random
is
arrivals of type
variable
i',
system.
The
=
(if
t
to find the
U
customers are independent and ideniirally
customer), or with probability
is
initially
some customer
present), the epochs
is
in the
j
ser\ ice of a
customers, with
l-XI;g£Pu
u;
when
lfa\*^sthe
the decision
a
e|.-iochs
customer arrives
system). Let us consider the following three classes of admissible
nonidling, nonpreemptive and nonanticipative policies;
Uo
is
allowed); and
U^
is
by direct methods, the associated optimal control problem over
U
the class of
all
the class of
all
nonpreemptive and nonanticipative
the class of
all
nonidling and nonanticipative policies (preemption
Klimov
When
system empty and the epochs when a customer completes service (and some
customer remziins
policies;
i
a Poisson process of rate A,
server selects the jobs according to an admissible policy
there
a smgl(-~ sppv.m-
customers form
with distribution function G,()
_;'
is
/
completed, the customer either joins the queue of type
probability p,j (thus becoming a type
are
i£E.
Scheduling Control of a Multiclass Queue with Bernoulli Feedback
4.2
for
ft
[22] solved,
policies (idleness
with a time-average holding cost criterion. Harrison
[18] solved,
is
is
allowed).
using dynamic program-
ming, the optimal control problem over Uo with a discounted reward-cost criterion,
special case that there
is
no feedback. Tcha and Pliska
43
[29]
in tiie
extended Harrison's results to
.
They
the case with feedback.
also solved the control
problem over U^
,
case that the
in tiie
service times are exponential.
The Discounted Case
Let us consider the following reward-cost structure: There
per unit time for each type
ward of
at the
/?,
i
customer staying
a continuous holding cost C,
is
the system, and an mstantaneous re-
in
epoch of completion of service of a type
/
customer
There
is
also an
instantaneous reward of idleness Rq at the end of an idle period. All costs and rewards are
discounted
time by a discount factor a
in
>
0.
The optimal
control problem
is
to find an
admissible policy to schedule the server so as to maiximize the expected total discounted
reward minus holding cost over an
infinite horizon.
Let us denote
Pu
Pu^ and Pup the
optimal control problems corresponding to the classes of admissible policies
respectively.
We
model each of these problems
will
also prove, applying the Index
we only need
m
Uq and U^
problem
\W
will
order to sohe problpiii P^^
problem Py. This problem can be modelled
problem with n job types,
J
Decomposition Theorem, that
.
problem Pu.
to solve
First, let us consider
N,j of a type
as a branching bandit
U
as follows:
We
as a branching bandit
interpret the customers as jobs
The descendants
job are composed of the transition of the job to another type (or outside the
system) and of the external Poisson
^.(a,2i.
,zn)
-av,
arrivals.
N,i
^1
The transform
<J>,(.) is
given by
=
E[e
=
E (l-LPud--';))^""*"^^'^^'^''"-"^'
~n
J
je£
ieE
(113)
Also, by (66) and (113)
*f(a) = {l-^p.,(l-*f(a)}*,[a + ^Aj(l-^f(a))].
Let x^{a)
=
(ij'(a),
.
.
.
,rj}(a))-^
denote the performance vector, as
that i"(Qf) satisfies generalized conservation laws.
matrix
Aa
is
in
By Proposition
ieE.
Section
6,
(114)
31 We know
the corresponding
given by
'^••'*-
Let us consider now problem
Fii^.
l-*.(a)'
In order to
'^^-
model the option of
previous branching bandit process by adding an idling job type
44
,
idleness,
we modify the
which we denote
0.
Tlie
duration of job type
0,
it
case.
jVqo
=
0;
=
A'o,
transform $,(•)
exponentially distributed with parameter A
until the next arrival); the
models time
(since
is
I'o,
and
^
,V,o
it
i
E
G
It
with
is
G E. are
i,j
A]
+
+
A^
as ui the prp\ ions
easy to see that the corres|Mjiidiiig
satisfies
^,{a,zo,zi,.
Hence,
for
.V,^,
=
.
.,i„)
=
<J>,(q,;i
^n).
e
'
E
follows that
*f'''"'(a)
=
SCE,
ieE,
\l/f(Q),
and
SCE.
^o'^<°>(a)=^J°^a),
Consequently, we have, for
£ S C
i
E
that
-j5u{0}
_
5
and
-jSu{o}
^0,0
_
—
'
_ 3{o}
- -"'0,a
Therefore, condition (17) holds, and the Index Decomposition
Theorem
6 applies
\ow. we
have
E[iV.,]
= p„ +
A,E[i,],
and
^o(o) =
A
By
+Q
(56)
*^<
a
J
1
- *,(a)
Hence the index of the subsystem composed of job type
7,,
7i
for
<
:'*,...
•
J
•
G E,
<
are
7i*-i
<
computed from algorithm ^1 applied
7o
<
7i*
<
•
7n then an optmial policy
is
to
is
70
= ^Ro
problem P;^
The
indices
Therefore,
if
to serve customers of types
,n with a fixed priority policy, giving highest priority to n, and never serve customer
types 1,...,:'
Harrison
[18]
—
1.
That
is,
the optimal policy
and Tcha and Pliska
[29].
45
is
a modified static policy, as proved by
The Preemptive Case
In
preemption
allowed, then the decision epochs are the arrival epochs as well as the
is
departures epochs of customers.
then
it
If
the service time
/i,
=
exponential with rate
+
/i;
where A
A,
three cases: (1)
=
Ai
+
One descendant,
that service of the type
i
+
A„.
of type
_;'
j
has a duration
As
for the
v,
for
/i,.
easy to model the possibility of preemption: model the process as
is
bandit process with n job types. Job type
rate
v, is
a
S E.
/
branching
exponentially distributed with
descendants of a t\pe
/
job. there are
with probability ^p,j (correspondmg to the case
customer ends before any arrival occurs and the customer moves
to
queue
^
(corresponding to the case that a type j customer arrives before service of the rypp
customer
is
two descendants, one of type
(2)
j);
i
and the other of type
completed); and (3) no descendants, with probability
sponding to the case that service of the type
the customer
i
^ (1
customer ends before any
j
with probability
- XI/eEPu)
/
(<"oi"''?-
arrival occurs,
and
moves out of the system).
The Undiscounted Case: Klimov's Problem
Klimov
considered the problem of optimal control of a single-server mnltirlass
[22] first
queue with Bernoulli feedback, with the
cost per unit time.
tive policy
is
criterion of
He proved that the optimal
minimizing the time average hoMing
nonidling, nonpreemptive and nonantioip-i-
a fixed priority policy, and presented an algorithm for computing the
(starting with the lowest priority type and ending with the highest priority)
prioriti'^s
Tsoucas
[:32]
modelled Klimov's problem as an optimization problem over an extended polymatroul using
as performance measures
L"
=
Algorithm ^i applied to
in this case
is
time average length of queue
this
problem
that priorities are
is
it
is
depend
in this
u.
computed from lowest
A
disad\ antau,e
priority to highest priority
for the right
hand
somewhat
for all of the
Also.
sides of the e.xfe'iidid
not possible to evaluate the performance of an optimal policy
approach gives explicit formulae
also explains the
under policy
exactly Klimov's original algorithm.
Tsoucas does not obtain closed form formulae
polymatroid, so
i
Our
parameters of the extended polymatroid and
surprising property that the optimal priority rule does not
case on the arrival rates.
The
46
key observation
is
that an optimal i^oliry
under the time average holding cost criterion also minimizes the expected
in
each busy period (see Nain
first
[24] for
et al.
Assuming that the system
=
Now. ue may
niorUl the
busy period of Klimov's problem as a branching bandit process with the undi.scouiU''d
tax criterion, as considered in Section
|i,
further discussion).
total holding cost
E[t',]
in
we apply the
stable,
is
results of Subsection 3.2
\\t^ dt^fine
and tf = E[Tf]. By (65) we have
tf
which
3.
=
^i^
+
Y.^P,J
+
^i^XJ)r^.
i€E,
(II.3)
vector notation becomes:
I.e.,
=
tgc
(/s^
-
Ps'^.S^
-
/J^^
P5<:-^5c)
and
tf =
+
^is
(Ps.s^+^is^lc)tfc
After algebraic manipulations we obtain
tf =
(..
^
+pL-
(/5c
- P5c,50-'
Therefore, by definition of
M5c)
^
Af
in
(73)
we
-
^^'-^-'^
J''^%
- Psc,s-=
^
det(/5c
find that
Af =
-
^,
.
>es.
(iri)
fiS'=>^sc)
tf^ /fi,, for
;
£ S. while 6(5)
is
given by (74). Now, letting
detjlsc
^ ^
det(/5c
we may
define
arrival rates of
Af = Af /Ks, and
6(5)
=
-
Psc,s^)
- Psc^c _p5cA^c)'
6(5)/A's, thus eliminating the dependencp on the
matrix A. As for the objective function, we have by (93)
yiO,C)
= J2\
^'~ ^''''''
^'
I
^-r
- HE) Y, C,\ + X:
C,h,.
(117)
Hence the problem can be solved by applying algorithm Ai with input {R, A), where
and since {R, A) do not depend on the
arrival rates neither does the
optimal policy. Note
that as opposed to Klimov's algorithm, with this algorithm priorities are computed from
47
This top-down algorithm
highest to lowest.
proved
its
weis first
optimality using interchange arguments.
Nain
direct optimality proof.
optimal
among
policies
when the
Bhattacharya
ei
el
al.
al.
service time distributions are exponential.
It is
provided a
[3]
i^
also
and among preemptive
also easy to verify these
the index of the idling state
(in particular,
vvlio
[24].
proved that the resulting optima! index rule
al.
idling policies for general service time distributions,
approach
facts using our
et
proposed by Nain
whereas
is 0.
all
other
indices are nonnegative).
Moreover,
in
the case that the arriving jobs are divided into
project consists of jobs with types
within Ek, and
E
projects, where a t\pe
a finite set Ek, jobs in Ek can only
in
E =
partitioned as
is
K
Uj^^^Ek, then
it
is
make
Ic
transitions
easy to see that the Index
Decomposition Theorem 6 applies, and therefore we can decompose the problem into
K
smaller subproblems.
4.3
Multiclass Queueing Systems
Shantikumar and Yao
showed that a
[26]
The reader
strong conservation laws.
and performance measures that
is
large variety of multiclass queueing systems satisfy
referred to their paper for a
Job Scheduling Problems without
There are n jobs to be completed by a
tributed as the
to
model
random
variable
of Af,^, in (36), that ^4^^
i"(a) studied
in Section 3
discussed in Section
I,
n,,
moment
a polymatroid
Arrivals; Deterministic Scheduling
single server.
with
is
Job
/
has a service requa-ement
generating function ^,().
this job scheduling process as a branching bandit process in
descendants. Let us consider
of job
of particular systems
satisfy strong conservation laws. All their results correspond
to the special case that the performance space B{A,b)
4.4
list
3,
in
=
first
for
1,
is
the discounted case; For
i
£
S.
it
immediate
which jobs have no
is
clear by definiticm
Therefore the performance space of the vectors
a polymatroid. Consider the discounted reward-tax problem
which a instantaneous reward Ri
and a holding tax C,
a >
It is
dis-
is
is
received at the completion
incurred for each unit of time that job
Rewards and taxes are discounted
in
the generalized Gittins index for job
;
is
time with discount factor a. By (56)
i,
in the
in
the system.
it
follows that
problem of maximizing rewards minus
48
taxes.
IS
= {«' +
Q
''
now
Let us consider
Af
in (73)
a*,(a)
-}f^
C,
^.(-)
T
J
1
(US)
- W,{a)
the undiscounted case in the case without rewards
we have Af =
1,
for
vectors z" studied in Section 3
is
i
^
in
a
definition of
Hence the performance space of the performance
5.
Thus by equation
also a polymatroid.
the generalized Gittins index for job
By
(93)
the undiscounted tax probleni
7.
=
it
follows that
is
(HO)
i^,
E[i;,]
thus providing a
new polyhedral proof
of the optimality of Smith's rule (see Smith [27])
among
In the case that there are precedence constraints
that
is
also be
each job can have at most one predecessor,
modeled
developed
5
We
in
as a branching bandits
Section
it
is
the jobs that form out-trees,
easy to see that the problem can
problem and thus solvable using the theory we have
3.
Reflections
presented a unified treatment of several classical problems
in stochastic
and dynamic
scheduling using polyhedral methods that leads, we believe, to a deeper understanding of
and algorithmic properties. Perhaps the most important idea we used
their structural
to ask the question:
We
What
is
is
the performance space of a stochastic scheduling problem'^
believe that the approach of characterizing the feasible region of a stocha-stic scheduling
problem
will
lead to important
new
insights
and methods and
will
gap between applied probability and mathematical optimization.
our results
will
bridge the
artificial
Indeed, we hope that
be of interest to applied probabilists, as they provide new interpretations.
proofs, cdgorithms, insights
and connections
to
important problems
as well as to discrete optimizers, since they reveal a
in stochastic scheduling.
new fundamental structure
(extendf^d
polymatroids) which has a genuinely applied origin.
Refe Fences
[1]
C. Bell, (1971), "Characterization and computation of optimal policies for operating
an
M/G/l
queue with removable server", Operations Research. 19, 208-218.
49
[2]
D. Bertsimas,
Paschalidis and J. Tsitsiklis, (1992), "Optimization of multidass queue-
I.
ing networks: polyhedral and nonlinear characterizations of achievable perfoniiance'.
working paper, Operations Research Center, MIT.
[3]
P. P.
Bhattacharya, L. Georgiadis and
timization in multiclass
in
part at the
M/G//1
ORSA/TIMS
P.
Tsoucas, (1991), "Problems of adaptive op-
queues with Bernoulli feedback'. Paper presented
Conference on Applied Probability
the
in
Engineering.
Information and Natural Sciences, January 9-11, 1991, Monterey, California.
[4]
P. P.
Bhattacharya,
L.
Georgiadis and
P.
Tsoucas, (1991), "Extended polymatroids:
Properties and optimization". Proceedings of International Conference on Integer Pro-
gramming and Combinatorial Optimization (Carnegie Mellon
ical
[5]
Programming
Matliemat-
Society, 298-315.
Y. R. Chen and M. N. Katehakis, (1986), "Linear programming
armed bandit problems". Mathematics
[6]
University)
for finite stare inulii-
of Operations Research. 11, 183.
A. Cobham, (1954), "Priority assignment
m
waiting line problem.s", Opernfum-. Re-
search, 2, 70-76.
[7]
E.
CofTman and
1.
Mitrani, (1980), "A characterization of waiting time perfornianre
realizable by single server queues". Operations Research, 28, 810-821.
[8]
E. Coffman,
M.
Hofri
and G. Weiss, (1989). "Scheduling stochastic jobs with
point distribution on two parallel machines'". Probab. Engtn. In for.
[9]
D. R.
Cox and W.
L.
a
two
to appear
Smith, (1961). Queues. Methuen (London) and Wiley (New
York).
[10]
J.
Edmonds,
(1970),
"Submodular functions, niatroids and certain polyhedra".
binatorial Structures and Their Aphcattons. 69-87. R.
Breach,
[11]
New
Guy
et
al
(eds).
iii
Com-
Gordon
fc
York.
A. Federgruen and H. Groenevelt,
(1988).
"Characterization and optimization of
achievable performance in general queueing systems", Operations Research, 36. 733741.
50
[12]
A. Federgruen and H. Groenevelt, (1988), "'M/G/c queueing systems with multiple
customer
classes;
Characterization and control of achievable performance under
noii-
preemptive priority rules", Management Science, 34, 1121-1138
[13]
Gelenbe and
E.
demic Press,
[14]
J.
I.
New
Mitrani, (1980), Analysis and Synthesis of
Aca-
C. Gittins and D. M. Jones, (1974), "A dynamic allocation index for the sequential
J.
Gani, K. Sarkadi
European Meeting of Statisticians 1972,
J.
Sysifin<^
York.
design of experiments". In
[15]
Computer
C
the
Gittins, (1979), "Bandit processes
Royal Statistical Society Series,
B
&
vol. 1.
1.
Vince
(eds.), Progress tn Statistics
Amsterdam: North-Holland, 241-266.
and dynamic allocation indices". Journal of
14, 148-177.
C. Gittins, (1989), Bandit Processes and
Dynamic
Allocation Indices. John
\\ ile\.
[16]
J.
[17]
K. D. Glazebrook, (1987), "Sensitivity analysis for stochastic scheduling problems'".
Mathematics of Operations Research, 12, 205-223.
[18]
J.
M. Harrison, (1975), "A
priority
queue with discounted
linear co.sls".
Opfi<ilioii>-
Research, 23, 260-269.
[19]
J.
M. Harrison, (1975), "Dynamic scheduling
ity".
[20]
of a multiclass queue: discount optimal-
Operations Research, 23, 270-282.
M. N. Katehakis and A.
F. Veinott, (1987),
"The multiarmed bandit problem: decom-
position and computation", Mathematics of Operations Research, 12, 262-268.
[21]
L. Kleinrock, (1976),
[22]
G.
P.
Queueing Systems,
vol. 2.
John Wiley.
Klimov, (1974), "Time sharing service systems
I",
Theory of
Probabilitij
and
Applications, 19, 532-551.
[23]
E. Lawler, J.K. Lenstra,
A.HG. Rinnoy Kan and
D.B. Shmoys. (1989), "Sequencing
and scheduling; algorithms and complexity", Report BS-R8909, Centre
[24]
ics
and Computer Science, Amsterdam.
P.
Nain, P. Tsoucas and
J.
for
Walrand, (1989), "Interchange arguments
scheduling". Journal of Applied Probability, 27, 815-826.
51
Mathemat-
in stochastic
[25]
W.
K.
IEEE
[26]
Ross and D. D. Yao (1989), "Optimal dynamic scheduling
Jackson Networks',
Transactions on Automatic Control, 34, 47-53.
G. Shanthikumar and D. D. Yao, (1992), "Multiclass queueing systems:
J.
ment
W.
2,
Polyma-
and optimal scheduling control", Operations Research, 40. Supple-
troidal structure
[27]
in
S293-299.
E. Smith, (1956), "Various optimizers for single-stage production'",
Naval Re^tarch
Logistics Quarterly, 3, 59-66.
[28]
S.
= XW: A
Stidham, (1972), "L
discounted analog and a new proof". Operations
Research, 20, 1115-1126.
[29]
D.-W. Tcha and
S.
R
works and multi-class
Pliska, (1977),
M/G/l
"Optimal control of single-server queueing net-
queues with feedback". Operations Research. 25
248-
258.
[30]
J.
N. Tsitsiklis, (1986),
"A lemma on the multi-armed bandit problem". IEEE Trans-
actions on Automatic Control 31, 576-577.
[31]
J.
N. Tsitsiklis, (1993), "A short proof of the Gittins index theorem'
[32]
P.
Tsoucas, (1991), "The region of achievable performance
Technical Report RC16543,
[33]
P. P. Varaiya, J. C.
IBM
in
a
.
in
preparation.
model of Klimov".
T. J. Watson Research Center.
Walrand and C. Buyukkoc, (1985), "Extensions of the multiarmed
bandit problem: The discounted case",
IEEE
Transactions on Automatic Control. 30.
426-439.
[34]
R. Weber, (1992),
"On
the Gittins index for multiarmed bandits'".
The
Annah
of
Applied Probability, 2, 1024-1033.
[35]
G. Weiss, (1988), "Branching bandit processes". Probability
Information Sciences,
[36]
P.
2,
in
the Engineering
269-278.
Whittle (1980), "Multi-armed bandits and the Gittins index". Journal of
the
Statistical Society, 42, 143-149.
[37]
P. Whittle, (1981),
"Arm
and
acquiring bandits'". Annals of Probability. 9, 284-292
52
Royal
[38]
P.
Whittle, (1982), Opttmtzaiwn Over Time,
53
vol.
1.
John Wiley. Chichester.
IK.
5939 057
Date Due
MIT UBRftRlF? OllPl
3
TDfiO
Q0flMb321
5
Download