>> Nikhil Devanur Rangarajan: Hello everyone. Welcome to the MSR Talk
Series. Today it's my pleasure to introduce R. Ravi, who is a professor at the Tepper
School of Business at CMU, and formerly Associate Dean of Technical Strategy.
Ravi has been a regular visitor here at MSR Redmond, and it's always a great
pleasure to have him here. And so today he's going to talk about correlated
stochastic knapsacks and non-Martingale bandits.
>> R. Ravi: Thank you, Nikhil. Thank you all for coming. This is one of the
papers that grew out of teaching in the MBA classroom. I was teaching a class
on applying optimization to marketing, and, you know, after you teach a class
two or three times you actually start focusing on what the model is that you're
trying to solve. And you strip away everything, and what we came away with was
some kind of a multi-armed bandit problem with rewards that don't obey a
commonly assumed property: the Martingale property.
So that's sort of where we started. But then we worked our way -- we simplified,
simplified and we came down to a knapsack problem. So I'm going to do the talk
in the reverse order. So I'm going to start with the simplest problem, the
knapsack problem, and I'm going to add one or two complications to it.
And so it's sort of monotonically increasing in difficulty. And then once I have a
solution for stochastic knapsacks with all the extra constraints I'll try to make a
leap to this multi-armed bandit problem. But before I get started, this is joint work
with my colleague in the computer science department, Professor Anupam Gupta
and two graduate students, Ravishankar Krishnaswamy and Marco Molinaro.
They are at the computer science department and the school of business,
respectively.
The full paper is out there. It's about 30, 35 pages, and there will be a
short version in the proceedings of the FOCS conference this year.
So most of you probably know the knapsack problem. So I'll start with a very
simple extension of the knapsack problem. In the knapsack problem, we have
a knapsack with some capacity B to fill. And in the traditional knapsack
problem, the items that you're trying to fill the knapsack with have sizes and
profits -- or weights, as I sometimes call them. Rewards.
In the traditional, deterministic knapsack problem you have to pack a subset of
items that don't overfill the knapsack and maximize the reward that they fetch.
The stochastic version now has uncertainty associated with the size of the item
and/or the reward that you get from the item.
Okay. So you pick an item but you actually don't know how much of the
knapsack it's going to occupy. Okay. And maybe you also don't know how much
reward you're going to get. You have some distributions over these.
Okay. So the goal of the stochastic version is the same. But there's a slight
subtlety here. You first pick an item, and only when you pick an item do you find
out how much it's going to fill up this knapsack, right? And once you pick an item
you put it in the knapsack.
And then you look at what happened, what all items you have in your knapsack
and what the current sizes are, and you can use all that information to pick the
next item to fill your knapsack with. So you're filling the items one by one.
There's some kind of inherent online nature to this problem.
Right? And I'm again maximizing expected reward. For now let's think of the
rewards of being deterministic. Only the item sizes are uncertain.
So the first difficulty here is that the solution that I'm looking for is in general a
strategy. I'll have to -- I'll have to tell you if I saw these four items before with
these respective sizes, which are realized from the distribution, then I have to tell
you what to pick next.
So if I have to explicitly write out the whole optimal solution, it would be
exponential size, because I'll first pick an item and there will be different possible
size realizations. For each one of those choices I'll have to tell you which is the
next item to pick and so on.
And in general this problem is PSPACE-hard, because of this complexity of
describing the optimal strategy.
>>: So do the item sizes and rewards come from a succinct --
>> R. Ravi: Yes, so you're given distributions. Usually in this case -- I'm just
thinking of simple finite support here. A very simple discrete case, not a
continuous distribution.
So let's do an example. Okay. This example is from the paper that introduced the
problem, by Brian Dean, Michel Goemans, and Jan Vondrak.
We have three items, and the knapsack has size 1. All of them have the same
deterministic reward of one.
And the first item has size .2 of the whole knapsack -- 20 percent of the knapsack --
with probability half, and size .6 with probability half. And so on.
Here is the optimal strategy. So first you insert item one. If it comes up to be of
the smaller size, then you know for sure you can insert item two. Right?
Because you have enough space -- .8 -- in the knapsack. But if it doesn't, you know
item two is useless to you, because after the .6 you can't possibly fit the .8, which
comes up with deterministic size.
And so you try your luck with the third item, the .3 one, and indeed the third item will
fit with probability half. So one-fourth of the overall time, right, you'll be able to fill
it with items one and three. Right?
And one-half of the time you'll be able to fill it with one and two. And then the
remaining one-fourth of the time, the second item that you tried to put in, which is
item number three, will overflow and you'll only get value one.
Okay. So the expected profit here is, yeah, one plus .5 plus .25, which is 1.75. 1.75 is
the expected reward from the strategy, and it is optimal for these three items.
Okay. Good. So one of the big ideas of the Dean Goemans Vondrak paper,
which I'll just call DGV from now, is that they wanted to ask if I don't want to
bother with telling you about this tree, this whole strategy, if I just want to give
you a very succinct solution -- and these succinct solutions are nonadaptive policies,
they're just permutations of the items, I just give you an order, you go down that
order and when you pick an item which overfills the knapsack you stop and that's
the end of it. So I'm not allowed to change the order depending on the real size
that the item came up with previously.
Okay. So nonadaptive policy is just an ordering of the items in which I will
process them in order and no matter what the size that's realized for each item,
I'll just keep going down the row until the thing overfills.
>>: What do you mean ordering, the ordering is fixed, right?
>> R. Ravi: So I want to compare against, again, the best such order. That would be the
best nonadaptive policy.
>>: Wait. I thought the order was adversarial --
>>: It's an offline --
>> R. Ravi: It's an offline -- as I said when I started, it's an offline problem. You're allowed
to determine the -- yeah. So in fact the strategy tree will give me different
permutations, yeah.
>>: I don't need to know that. I only probably need to know the sum of the sizes
of the things that I -- or do I really -- could it depend also on the individual --
>> R. Ravi: Probably the sum is enough. If you sort of write it as a decision
process and try to solve it, yeah, that's probably enough in this case, yeah.
So in this case, for example, if the order I use is actually one, two, three, what
would be my reward of that policy, well, I lose the chance that when I schedule
one and the item comes up with size .2, right, I lose that chance, that third branch
in the tree where I schedule item three.
>>: [inaudible].
>> R. Ravi: Sorry. When it comes up with .6, I lose the chance of going to three,
thank you. So what I'll have to do then is just schedule two, and I'll lose -- so in fact
I don't get to play -- yeah, when it comes up .6 I can't insert three, if I think
of it as this tree, right? All the nodes in this level of the tree are the second item
in that ordering.
So I can't insert two here and three here, this also has to be a two, right, in which
case -- yeah, this will overfill.
>>: So if you -- I didn't understand. So if you pick item two, if you get to .6, then
you get to .8, what happens? You have to stop completely?
>> R. Ravi: Yeah, at that point the knapsack is full, and I end. That's it. That's
the end of the procedure.
>>: So as soon as you start to overflow.
>> R. Ravi: As soon as you overflow it's over. Even if there is a tiny item which
for sure will fit in later, yeah, the game's over.
>>: And that's overflowing on --
>> R. Ravi: That's right. That's right.
>>: I see.
>> R. Ravi: That's right. Good question.
>>: So size is realized after you made a decision to put in your knapsack.
>> R. Ravi: That's right. In fact, very soon I'm going to switch to the perspective
where these sizes are stopping times of jobs. So each of these items is a job.
I'm going to put it in my machine. And I actually won't know until .2 whether it's
actually going to stop at .2 or it's going to continue on until .3.
That's actually the version I'm going to work with from now on. So in that case
it's kind of clear that I've blown the time. So that's sort of a better motivation.
>>: In that case you can do preemption.
>> R. Ravi: That's precisely what I'm going to allow. That's what -- then I'm
going to ask, if I allow preemption, how much better can I do, and things like that.
That's where I'm going. Any other questions? We need to -- so what is the
expected reward from the order one, two, three? So I'll definitely get one. Right?
And only .5 of the time I'll get two and that's it. So it's 1.5. Okay. So 1.5 is not
too far from 1.75, and what DGV show is that there always exists
such a permutation, a nonadaptive ordering, for this problem where the expected
reward is at least nearly one-fourth of that of the best possible adaptive strategy.
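As a sanity check, here is a small Python sketch that enumerates the size realizations of this example and computes the expected reward of the adaptive strategy and of the fixed order one, two, three. The only assumption I'm adding is the larger realization of item three, which the talk doesn't specify beyond "it overflows"; the .7 below is a placeholder.

    from itertools import product

    # Sizes: item 1 is .2 or .6 with probability half each; item 2 is .8
    # deterministically; item 3 fits (.3) with probability half or overflows with
    # probability half -- the large realization isn't given in the talk, so the .7
    # is a placeholder assumption. All rewards are 1, knapsack capacity is 1.
    B = 1.0
    sizes = {1: [0.2, 0.6], 2: [0.8, 0.8], 3: [0.3, 0.7]}

    def reward(order, outcome):
        """Insert items in the given order; each item that fits earns 1, and the
        process stops at the first item that would overflow the knapsack."""
        used, total = 0.0, 0
        for item in order:
            size = sizes[item][outcome[item]]
            if used + size > B + 1e-9:
                break
            used += size
            total += 1
        return total

    adaptive = nonadaptive = 0.0
    for r1, r2, r3 in product(range(2), repeat=3):      # 8 equally likely outcomes
        outcome = {1: r1, 2: r2, 3: r3}
        # adaptive strategy from the talk: insert 1; then 2 if 1 came up small, else 3
        adaptive += reward([1, 2] if r1 == 0 else [1, 3], outcome) / 8
        nonadaptive += reward([1, 2, 3], outcome) / 8

    print(adaptive, nonadaptive)                        # prints 1.75 and 1.5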
Okay. But they need two key assumptions. One is that these distributions of the
item sizes and rewards are independent of each other. Okay. There's no
correlation between the amount of money you'll make and the size of the item.
Okay. That's one important assumption. And the second is, like Yosi was saying
we can't have any cancellation. That is, you can't take an item and after a .1 say
I don't like this item, I'm just kind of canceling the item. In the context of
knapsack we'll call it cancellation. In the context of scheduling, it will be a
preemption.
Okay. So it's a .2 and a .6 job. But at .1 I say I'm just going to preempt this guy.
Okay. So you can't cancel. And you have to have independence between sizes
and rewards. And then you get a permutation. And actually I'll prove to you this,
a weaker version of this. In fact, what I'm going to do is give you the solution for
the basic stochastic knapsack, a weak solution which I'll need. Then I'll add
these two things. Correlation between item sizes and rewards, and the
cancellation preemption.
And I'll still come up with a constant factor approximation. Right? And then the
way I'll do that is using a linear program with time indices and a very simple
rounding. All I'm going to use today is just Markov inequality. It's amazing how
far Markov inequality will take me.
I'll just extend it to these multi-armed bandit problems and at some point I'll have
to stop. And when I extend to the multi-armed bandit problem I have to do an LP
decomposition and rounding. I think the first part should be fairly clear.
>>: No cancellation -- so that's strengthening the DGV result; is that correct?
>> R. Ravi: It's just an assumption. If you allow cancellations --
>>: You're getting one-fourth without cancellation?
>> R. Ravi: That's right.
>>: Could it also --
>> R. Ravi: Could be much larger. In fact I'll show you some examples. You
guys are asking good questions. They're all in my talk. So from now I'll
interchangeably use item with job and size with processing time. The natural
way, right? And capacity is the deadline for the scheduling problem. So filling
the knapsack, inserting an item into the knapsack is like picking that job and
starting to schedule it. And I'll only know when it stops what the size
of that item is.
>>: All of these are inherently pure strategies that you're talking about?
>> R. Ravi: That's right.
>>: Do they say anything about if you mix the problem, nonadaptive strategies
[inaudible].
>> R. Ravi: That should not be bad. I think there's an averaging argument that
shows that one of them will get just as much, because I'm doing expectation,
expected reward.
>>: As much as --
>> R. Ravi: I want at least, right? So I want a strategy, I want to maximize
expected reward. So if I have a combination of strategies, then my expected
reward is just a linear combination of their rewards. And so one of them will be
just as good. I think I'll be okay there. Yeah.
>>: So with the job scenario, we're assuming that when it comes to [inaudible]
time, no two jobs can be running simultaneously?
>> R. Ravi: Absolutely. That's my capacity constraint of the knapsack. Right?
That's precisely what I mean by the capacity constraint. If I schedule a job, no
other job can be running at the same time. So if it runs for .2, nobody else is
running for that .2 of the knapsack. Good. Okay. I think you guys are all with
me. Perfect.
So now I'll obfuscate things a bit and write a linear program. The linear program is a
simple upper bound for the amount of reward I can get from any strategy. Okay?
So this is the way I want to think of it. So there's a slight caveat. I'm going to
change the problem on you a little bit. Okay? Again, the profit, first of all, is not
random. It's a deterministic number W for each item. The processing time is
random. It comes from some discrete distribution, let's say for now.
And in fact I'm going to allow myself to get the reward if I get to start the job. I
don't even have to complete it. I just have to get -- I just have to put it in my
knapsack.
Okay. It's just got to be space enough in the knapsack. So this is what's called
the start deadline version in scheduling, where you get the reward not for ending
the job but for actually starting it, just putting it.
But it's just sort of a scale factor, because if I have to wait for that guy to finish,
how long will I wait? At most the size of the knapsack, right? So these two
things are very close to each other.
>>: [inaudible].
>> R. Ravi: It's an MBA class.
>>: [inaudible] strategy for businesses. [laughter].
>> R. Ravi: Yeah, I mean I'm doing approximation algorithms, right. So sort of
things come up. So what is my LP? I have a variable for the item. And you
should think of it as the probability over all possible realizations of these item
sizes that an optimal strategy will schedule this item. So an optimal strategy, you
know first may pick an item and then its size comes up with something. And then
depending on the size it may change which item it picks next.
Right? So an item may be chosen at different points in this decision tree. I'm
just saying: over all the runs of these item sizes, what is the
probability that a particular item is chosen by this optimal strategy.
All right. So now think of how much money in expectation this thing is making.
Well, for every run where this is a 1, it makes WJ. So in expectation it is making
XJ times WJ. So that's the right expression for the expected profit of opt. Now
look at any particular run.
Right? In a run, if I schedule J, that is if XJ is one, this is a 01 case in the run,
right? Then the sum of the processing times that that particular item took -- this
is a random variable -- right, remember it's a distribution of sizes.
So the sum of the processing terms for the items that are scheduled, right, and I
took reward from, should be less than or equal to the start deadline D. So this is
the deadline that I'm addressing.
And the reason I'm able to truncate this processing time random variable at 1,
which is a deadline, is that I'm only averaging over the runs where opt went
through. So if opt actually went through and included some guy, all the guys
before are definitely terminated before the deadline. Otherwise opt is out of the
game.
Opt is out of the game at deadline one. So since I'm only looking at trajectories
where opt is picking some items, all the item sizes that it looked at are at most
one.
So trajectory by trajectory the sum of the truncated item sizes for opt is at most
one. At most the deadline. Right? And so if I take expectations, right, these two
are independent variables, because how big the item is, is independent of how opt
is making its decisions.
So the expectation of those 0-1 variables is this probability, and the expectation of these guys is
exactly that.
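To make that concrete, here is my reconstruction of the LP being described -- notation mine, with \(x_j\) the XJ above, \(w_j\) the reward, \(S_j\) the random size, and \(D\) the right-hand-side parameter:

\[
\phi(D) \;=\; \max\ \sum_j w_j\, x_j
\quad\text{subject to}\quad
\sum_j \mathbb{E}\big[\min(S_j,\,1)\big]\, x_j \;\le\; D,
\qquad 0 \le x_j \le 1 .
\]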
>>: So what is D again?
>> R. Ravi: D is some parameter deadline. So I have to set it to one.
>>: Row one --
>> R. Ravi: That's a parameter. It's either one or -- in fact, I'm going to argue in
the next slide that setting D equals 2 gives me an upper bound. Why? Because
the last guy that I schedule, right, his size may be as big as almost one. Right?
So if I sum up the times of everybody who has been given reward, right, all the
guys up until the last one, they fill up to at most one. The last guy adds his PJ, and
luckily I've truncated his PJ at one. So he overfills by at most one. And so definitely in
every trajectory of the algorithm if I include the last guy too, the overhang I'm
going to have is at most 1. Therefore if I set D equals 2, then every trajectory of
opt is feasible, and therefore this is the expected profit, this is an upper bound on
the expected profit of opt. So that's why I use 2. So that's the explanation. And
also because I'm looking at the objective function of a maximization LP as a
function of this right-hand side, this is a concave function. Another way to think
of it is these things scale nicely. If I take the solution for this LP with right-hand
side 2, right, and I divide the solution by two, that's a feasible solution with
right-hand side being one.
Therefore, the solution for the right-hand side being one has to be at least as
much as that previous solution. It could get better. Okay. So in fact I can use
either phi of one or phi of 2 as my upper bound and they're off by only a constant
factor. I'm going to use that next.
Okay. So I think I've set up enough. Now I'll tell you what the algorithm is. The
algorithm is a knapsack algorithm. DGV algorithm is just a greedy algorithm for
knapsack. Sort in decreasing order. You want high profit, low time. Decreasing
order of profit by time, except that my time is this expected truncated size.
This is how you fill a knapsack typically. Okay. And now you just go down that
order. That is my permutation I was telling you about. The claim is that if you go
down that permutation, it's a nonadaptive policy. It's just one order you're going
to go down. The claim is you'll get at least one-fourth of the rewards of opt. Let
me actually show you a slightly weaker algorithm which gets one-eighth of the
rewards.
Okay. I'll need the proof of this. So this is what's going to drive the rest of my
proofs. So let me solve that LP, upper bound. The right-hand side I'm going to
make one rather than 2. It doesn't matter. It's just a scale factor. And I'm going
to sort the items not in that greedy order, but any order you like. Okay. So here
is where my power is really -- this is where I'm going to extract the power of this
weak algorithm. I don't have to go down in the specific order of cost, profit by
time. Sort in any order, but just damp the probability with which you'll actually
pick that item.
So you're supposed to pick it with probability XI according to the LP. But just pick
it with probability XI over 2. Okay. And then just take them in any order and keep
scheduling them until you run out, you overflow. And the claim is this is
one-eighth. I'll show you a proof very quickly.
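Here is a small Python sketch of that damped rounding, for the start-deadline version described earlier. The item representation (a reward plus a size-sampling function) and the fixed order are my own choices for illustration; the algorithm itself is just "keep each item with probability XI over 2 and schedule the survivors in any fixed order until the knapsack overflows."

    import random

    def damped_nonadaptive_run(items, x, B=1.0):
        """One run of the weak rounding described above. `items` is a list of
        (reward, sample_size) pairs in an arbitrary but fixed order, where
        sample_size() draws a random size; x[i] is the LP value for item i.
        Reward is collected for every item that gets started before the knapsack
        is full (the start-deadline convention from the talk)."""
        total_reward, used = 0.0, 0.0
        for i, (reward, sample_size) in enumerate(items):
            if random.random() > x[i] / 2:    # damping: keep item only w.p. x_i / 2
                continue
            if used >= B:                     # knapsack already full: game over
                break
            total_reward += reward            # it started before overflowing: paid
            used += sample_size()             # its realized size now occupies space
        return total_reward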
Okay. The way I'm going to show that you're getting one-eighth of the profit is
I'm going to show that for any item you consider, when you get to consider it, I'm
going to argue with probability at least a half the knapsack is not full. That
means you can get this started and you can make your money. It's enough to
get started to make money.
And therefore what is the expected profit that I make in this weak algorithm?
Well, first I have to pick it. And then I'll have to schedule it. And if I schedule it I
make this money. By this lemma, the probability of being able to schedule it is a
half. The probability of picking it is damped by 2 -- the original probability
ought to be XI according to the LP, but I damp it by a factor of 2. And this is the
expected value. But now if I look at this expression, this is just phi of one,
the objective function of my LP, divided by 4. But because of the
relation between phi of one and phi of two, up to that scale factor this is an upper
So all I have to show is when you consider any item, any item, right, the
knapsack is not full with probability at least a half, now you just look at how much
size is occupied by everybody else when I consider a particular item. When I'm
considering an item, look at all the items other than yourself, because you're the
one just now being considered.
What's the chance I pick the other item and how much does it fill the knapsack?
Well, how much does it fill the knapsack, well, I can again truncate at one
because I'm considering -- I mean this is only happening when I'm still in the
game, right? The knapsack is not yet full and I'm considering what's the chance
that knapsack is not full?
So the probability of picking any item -- this is where I'm using my damping -- is at
most its LP value divided by 2, times this truncated size. Now, the expression in the
numerator is exactly the left-hand side of my knapsack constraint; since I used
phi of 1, it's at most 1, so damped it's at most half. So the expected size that the other damped
guys fill is at most half, and by Markov inequality the chance that I
exceed two times that, right, is at most half.
That's it. Therefore, when I consider any item, the space filled by everybody else
is, in expectation, at most half the knapsack size -- just damping, that's
all, very simple.
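Written out in that notation, the calculation being used is just:

\[
\mathbb{E}\big[\text{space filled by the others}\big]
\;\le\; \sum_{j \ne i} \frac{x_j}{2}\,\mathbb{E}\big[\min(S_j,1)\big]
\;\le\; \frac{1}{2}\sum_j x_j\,\mathbb{E}\big[\min(S_j,1)\big]
\;\le\; \frac{1}{2},
\]

so by Markov's inequality the knapsack (of size 1) is already full with probability at most one half at the moment item \(i\) is considered.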
Good. Now let's start canceling items. Okay. So let's think of -- I should note --
okay. So if I gave you an instance with N identical items, they all have
reward of one each. The size is either tiny or half the knapsack, and each of
these happens with probability half.
Now, if I don't allow you to cancel, right, very soon you'll get an item of size half
and that's the end of the game. Right? So your expected reward will be a
constant.
But if I allow you to cancel, then I'll schedule this job. I'll run it for epsilon. If it goes
for a little bit more than epsilon, I'll preempt it. I'll throw it away.
Okay. And I'll keep doing this for every job. And so I'll get all the epsilon jobs.
Right? I'll preempt away all the long jobs. I'll keep all the epsilon jobs and my
reward will be N over 2. Right? So with preemption I get much higher reward.
Sorry. Cancellation, as I call it here, for items, right? Whereas without it the
reward is actually quite low.
So there's this huge gap between what you can do. Unfortunately, if you use the
earlier LP, the DGV LP, as it is, it won't work, of course, because what is the
expected truncated size of a job? Well, a job is half the knapsack with probability
half. So its expected size is one-fourth of the knapsack. So how many such
truncated jobs can you fit in the knapsack?
So that LP will only be able to give you about four items' worth of reward. It has an upper
bound of a constant. Clearly it's not capturing the ability to cancel.
Okay. Now if you look around, you find that someone has already solved this
problem. So there's a series of papers, which I'll talk about briefly, by Sudipto
Guha and Kamesh Munagala. They have LPs like this for solving these Markovian bandit
problems. So in those bandits you can either kind of play the bandit or stop and
say I want to use this bandit and collect my reward. Either explore or exploit.
So sort of stopping and saying I'm going to exploit this is like preemption,
because you're sort of going along, right, and then at some point you're saying
I'm going to stop and take the reward. So that's kind of like preemption.
If you see how they do it, how they decide between stopping or going, you can
actually adapt the LP and work out this case.
But unfortunately if you throw in the second complication, which is correlation
between the rewards and the sizes, right, their approach doesn't
work anymore. So what is correlation? Basically the -- here's an example. N
items; each item can be very tiny and give you no reward, and that happens most
of the time.
And the -- with the remaining tiny chance, right, it fills up the whole
knapsack, and it gives you a reward of one. So how much can you get --
>>: You get -- again, the reward, when do you realize it?
>> R. Ravi: So when you realize this -- so think of it in a job setting, it's easiest.
You start running it. And if it stops at size epsilon, right, you get a reward of 0.
That's at the point where it stops.
So, unfortunately, because of that, once I've started, I'm hosed: unless that first thing that
I started with is the full-size item, I won't get any reward, because all the other
items that I put in will only fill up the knapsack, but the only thing that will give me
a reward, right, will need the whole knapsack.
So therefore the expected amount of reward I can collect is only one over N,
because with probability one over N I'll get that one reward. But now you see,
even with that fix that I talked about, that doesn't work. Okay.
So these bandit LPs that take care of the cancellation part, they don't work for the
correlation part. So now you look around further, further, you don't find anything,
now you have to solve it yourself.
So now we really have our first problem. Okay? Our problem is stochastic
knapsacks with cancellations or preemptions and correlation. And a convenient
way to think of it is we have a deadline and for every job we have a distribution
over sizes. So for each size T, I tell you what's the probability that this item will take
that size -- the time it runs for -- and what reward it will get you if it stops at
that size.
Okay. So for discrete item sizes 1 to D, I have the probability that the size of that
item is T. And the associated correlated reward. It's complete correlation.
Okay. You can cancel any job while running. What does that mean? You're
running the job and it's not stopped at time T. So that means it's still running at
that point. So at that point you can just say I'm not going to worry about this item
anymore. But that also means you can't come back to this item. So I'm also
assuming that once you like toss something away preempt something away you
never come back.
>>: You get 0 rewards.
>> R. Ravi: You get 0 reward, that's right. So only the jobs ending by the deadline count, in
the knapsack way. That's the problem we're going to work on. It's a very simple
idea that solves this also. So it's a standard idea. You basically split where
you're getting your reward from. Are you getting your reward from items which
are ending before half the size -- half the deadline, B over 2? Or are you getting
most of your rewards from the items which are ending at time B over 2 and
higher?
So if -- you're definitely getting at least half the reward from one of these two
types of items. Overall in this optimal strategy. So let's just break up our
problem into two problems. One where you zero out the reward for all the values
from B over 2 plus 1 to B. So you get reward only for small item sizes.
And then you keep the remaining part of the reward support in the second
instance, and you flip a coin and you pick one of them and you do the right thing
for each of them.
The nice thing is if you look at the second type of instance where you get your
reward if you run for at least B over 2 or more you never need to preempt
because once you start a guy he's only going to give you a reward after B over 2.
You'd never preempt that guy to get anybody else in, because everybody else is
in the same boat: they'd have to run for B over 2 plus 1 more, in which case they'd
overshoot the knapsack. The second type of job is like the DGV type of job,
one-shot thing. No need for preemption. I can just use my solution from before
and take care of the second type of jobs. I only have to worry about this first type
of job where I'm getting reward from the first half of the support.
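The accounting behind that coin flip, spelled out in symbols (notation mine): if \(\mathrm{OPT}_{\mathrm{small}}\) and \(\mathrm{OPT}_{\mathrm{large}}\) are the optimal values of the two restricted instances, then

\[
\mathrm{OPT} \;\le\; \mathrm{OPT}_{\mathrm{small}} + \mathrm{OPT}_{\mathrm{large}},
\]

so flipping a fair coin and running an \(\alpha\)-approximation on the chosen instance earns, in expectation, at least \(\tfrac{\alpha}{2}\,\mathrm{OPT}_{\mathrm{small}} + \tfrac{\alpha}{2}\,\mathrm{OPT}_{\mathrm{large}} \ge \tfrac{\alpha}{2}\,\mathrm{OPT}\).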
Okay. So I'm going to quickly run through this first part, because there's not
much going on here. And then I'm going to focus a little bit on this, and that will
shoot me into the bandit part.
>>: From the size of the chart, dealing with the strategy just one specific --
>> R. Ravi: That's it. So that final thing is just 1. Yeah.
Yeah, because -- that's all I can do in that case. Okay. So the way you can
actually do it -- so I'm going to rush through this. Don't bother if you don't follow
the next three slides, I'll blitz through it.
The LP here basically is just the DGV LP. I've just time-indexed it from 1 to B. Now it's
become a size-B knapsack. Okay. So, again, what does this variable tell me? It tells
me if I'm going to start this job I at time T, between 1 and B, right, some global
time T, okay, so if I did start this job at time T, how much reward will I make?
Well, I can only go until the deadline. The deadline is only B minus T steps left.
Right? So whatever I -- if ever I get to stop by the deadline, which happens with
these probabilities, right, I get the corresponding -- expected reward, conditional
expected reward if I start at T.
And so that's what I should be collecting in the objective function. I can't start
more than one job at a particular time. That's the capacity constraint. And this
one is just the same packing constraint. It says, look, if I start any job, right, at
any time T prime, then -- if I just sum up all the expected truncated sizes that
they add up to -- it can't be more than 2T. And the 2 is because of the
overhang of the last job that I scheduled. It's exactly the same thing; the previous
arguments tell us that this is a legitimate upper bound. Right? And like I said
before cancellation does not help. So just solve that LP. Pick an item with
probability -- so now actually for each item I have values for what different times
to start it at. So now my LP is a little more refined for each item I for each T it
tells me with what probability I should start at time T.
So I first go to an item I and then I pick a T from its X distribution, right? And I
damp it. So only with probability one-fourth I even pick anybody. With probability
three fourth I don't even bother with this item. But when I go to an item and I pick
it, I pick it to start at a particular time T with probability XIT over 4. Right? And
then I just schedule this.
So now every item which has been picked successfully, right, is at some time
slot. Now I just walk down the time slot, I pick the first guy and that's it and that's
all I have time for.
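For reference, here is my reading of that time-indexed LP (notation mine; the slides' exact truncation and right-hand side may differ): \(x_{i,t}\) is the probability that the policy starts item \(i\) at global time \(t\), and the reward coefficient is the conditional expected reward given a start at \(t\):

\[
\max\ \sum_i \sum_{t=1}^{B} x_{i,t}\,\mathbb{E}\big[\text{reward of } i \mid \text{start at } t\big]
\quad\text{s.t.}\quad
\sum_i x_{i,t} \le 1\ \ \forall t,
\qquad
\sum_i \sum_{t' \le t} x_{i,t'}\,\mathbb{E}\big[\min(S_i,\,t)\big] \le 2t\ \ \forall t,
\qquad x \ge 0 .
\]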
Okay. And the proof is kind of identical, conditioned on when you start the
probability that you get added at all, is that the previous guys have left some
space for you.
The probability that they don't fill up more than the current time T, right, is going
to -- it's the same Markov inequality. Nothing new is going on.
So I start with at least probability one-fourth of my LP value. I get to go, continue
with probability at least a half. Therefore overall with probability one-eighth I get
what my LP value tells me I should get. Therefore, I'm one-eighth of expected.
Okay. Now let's do the more interesting case. If this is confusing, it's okay, you
can record it back. But this one has a new idea. Okay. Now this LP, things start
to get a little more interesting.
So remember, all I have is a support where every job will give you a reward only if it finishes
between 1 and B over 2. Right? The upper part of the support gives me no
reward; this is the small-jobs case.
So now I have an LP which is starting to look like this, like this Markovian bandit
LP. So I have two variables for each item. One is the visit variable and the other
is the stop variable. Again, I have the variables indexed by the job and the
particular time T, right? Running time T. So if VIT is 1, that means that my
policy, right, allowed this particular item to run for T steps.
So I visit the time step T. And if SIT is 1, that means my policy said I'm going
to stop this item at time T. Now, the interesting thing is, if you run for T steps,
inherently that item will stop with some probability pi. Remember, there's some
underlying distribution according to which the job stops. So the rate at which my
policy asks an item to stop at time T should be at least that. Right?
So inherently the item, if I say run for T steps, might stop by itself at time T, but I
can forcibly terminate at T. That would be a preemption.
Okay. So this variable stop SIT is going to take care of both. So now let's look at
what this says. It says if I visit time T -- that is, if I let the job run for time T, the item
I allow to run for time T -- either I have to stop it at that time or I let it run for one more
time step.
Right? That is if I visit the state of letting the job run for time T I should either
stop it right now or I should let it run for one more time. These are the only two
options.
Now, the stopping rate should be at least what the stopping probability
distribution tells me it is. What is that? The conditional on my getting to T, right,
the remaining stopping probability is this much. And the probability of stopping
right now at T is that much. So this is the conditional stopping rate if I got to time
T. So your stopping probability better be at least the probability of visiting times
the stopping rate.
I mean because it's going to stop by itself at that rate but it could be strictly
bigger. If it's strictly bigger that means you're preempting at least this fractional
policy is preempting.
>>: The conditional probability of stopping or original probability of stopping?
>>: [inaudible].
>>: [inaudible].
>>: It seems like it's conditional probability, no?
>> R. Ravi: On visiting T, that's right. Yeah. I mean, SIT means, yeah,
conditional on visiting T. So let's look at this and see what this says.
This says that if I stop at time T -- no, it's the original probability. No, it's just --
every item has a probability of stopping at any value between, like, 0 and B. 0 means
you never even get to start the item.
Right? So if I stop at time T, then this item has actually occupied T units of my
knapsack or time. Right? So if I just sum over all possible real stopping times,
right, multiplied by the time it has run for, it has to fit in my knapsack.
>>: I see, this is why you multiply by --
>> R. Ravi: That's exactly right.
>>: So the real probability.
>> R. Ravi: To make it the real probability, right. So by the same arguments that
I've been making at the beginning this is an upper bound on how much an
optimal policy can make, again just by tracing trajectories. Okay. Now, how can
we use this LP for designing an algorithm?
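Before that, to fix notation, here is my reconstruction of the LP just described (the exact objective and boundary conditions on the slides may differ): \(v_{i,t}\) is the probability that item \(i\) is allowed to run for \(t\) steps, \(s_{i,t}\) the probability it is stopped -- by itself or by a cancellation -- at running time \(t\), and \(\pi_{i,t}\) the probability its size is exactly \(t\):

\[
\begin{aligned}
\max\ & \sum_i \sum_{t \le B/2} v_{i,t}\,\frac{\pi_{i,t}}{\sum_{t' \ge t}\pi_{i,t'}}\,R_{i,t}
 && \text{(reward collected when the job completes at size } t\text{)}\\
\text{s.t.}\ & v_{i,t} = s_{i,t} + v_{i,t+1} && \text{(visit } t \Rightarrow \text{stop now or run one more step)}\\
& s_{i,t} \;\ge\; v_{i,t}\,\frac{\pi_{i,t}}{\sum_{t' \ge t}\pi_{i,t'}} && \text{(stop at least at the inherent rate)}\\
& \sum_i \sum_t t\, s_{i,t} \;\le\; B, \qquad 0 \le v_{i,t},\, s_{i,t} \le 1 .
\end{aligned}
\]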
Well, we should try to visit with probability V in our algorithm. And we should try
to stop with probability S in our algorithm. That's what we should really try to do.
Right? And that's sort of what I'm going to do. I'm going to use my previous trick
of damping, right? So I'm not going to start this whole process off with probability
VI 0. Right? I'm just going to pick an item only with probability VI 0 over 4 or 8 or
something, and I won't even get going with an item. But once I get going, I'm just
going to simulate what the LP is telling me.
Right. That is, if the LP has an SIT value strictly bigger than the inherent stopping rate, then for the remaining
fraction I'm going to preempt that job a bit. I'm going to toss a coin, and
with that probability I'm going to preempt it.
Okay. This is not great text form, but this is just laziness on my part. So if -- so
what do you do? Again, you solve the LP. You get an optimal solution. SS and
Vs, right? And right off the bat with probability three-fourths you just forget about
an item.
Right? And then you start -- you take any item you like, that's what you're doing,
you ignore the item with probability three-fourths, and then you just go through
the time steps from 0 to B over 2, because that's when it matters after that, the
rewards are 0, you cancel at the rate that the LP is telling you.
Remember, the rate at which it inherently stops is this conditional probability
times V. Right? So anything that's left is what LP tells you should cancel with.
So with that probability you cancel. I've just extracted it out of the LP solution.
And when I cancel, that means I just throw this item away and I go to the next
item. But if not, I go on to the next time step. When I go to the next time step, I
look at what the LP value is and I cancel with the corresponding preempting
probability and I keep going around and around. But, of course, notice when you
go to the next time step, right, if I process it, it may also terminate by itself,
according to an inherent -- I'm just running this job. It may stop by itself. If it
stops by itself it's great I'll get the reward.
I'm just simulating the LP. So the reason I'm damping is so that I can use
Markov inequality to make up space. Because when I'm running a job I don't
want it to get interfered by the space used by the other jobs. So that's what this
is going to buy me. And this cancellation is going to buy me by induction this
nice property.
If I schedule a job, conditional on not throwing it away, then I'm doing exactly
what the LP is asking me to do. That's just induction. Probability of stopping is S
star. Probability of processing is V star and probability of completing is exactly
how the job is. Okay. So this is set up to do that induction.
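Here is a small Python sketch of what "simulating the LP" means for a single item that has survived the damping coin, using the interface from the LP sketch above (v, s are that item's LP values, pi its size distribution, R its rewards). This is schematic -- the paper's exact conditional probabilities may be organized differently -- but it shows the idea: whatever stopping mass the LP has beyond the inherent completion rate is applied as a cancellation coin at each step.

    import random

    def simulate_picked_item(v, s, pi, R, B):
        """Run one picked item, cancelling at the extra rate the LP prescribes.
        v[t], s[t]: LP visit/stop values at running time t; pi[t]: probability the
        item's size is exactly t (index 0 unused); R[t]: reward if it completes at
        size t; B: the integer budget. Returns the reward collected."""
        for t in range(1, B // 2 + 1):
            tail = sum(pi[t:])                        # Pr[size >= t]
            inherent = v[t] * pi[t] / tail if tail > 0 else 0.0
            extra = max(s[t] - inherent, 0.0)         # stopping mass beyond the inherent rate
            if v[t] > 0 and random.random() < extra / v[t]:
                return 0.0                            # cancel: discard the item, no reward
            if tail > 0 and random.random() < pi[t] / tail:
                return R[t]                           # the job stopped by itself at size t
        return 0.0                                    # never completed by B/2: no reward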
Now let's finish the proof. When I'm trying to schedule a particular item, right, so
that means in this loop I'm just picking this item I. Now, how much are the other
items already using up my knapsack, my time? It's exactly the same calculation.
For all the other guys I sort of scheduled that guy. Must have been in the
one-fourth probability that I actually did something with that guy, right? And then
there's some probability with which that guy stopped at some time T. If it stopped
at time T, then I've used up that many time steps from that guy. If item I prime stops
at time T, which happens with probability S star by that inductive claim, then I've
used this much space.
Now I use the damping, the probability that I even let this guy in, the algorithm is
one-fourth. Therefore, the amount that's used by this other guy is one-fourth of
this amount.
But now my LP inequality says the one-fourth of this amount is at most B over
four. So that means an expectation, the amount of space used by everybody
else is at most one-fourth of the knapsack. Great. I'm done, because,
by Markov inequality, the probability they occupy more than half of the
knapsack is at most half. That means I'm starting at time B over 2 or
earlier. Everybody's starting at time B over 2 or earlier, and they have enough
time to finish up because they're only going until time B.
Okay. That same weak DGV algorithm that I'm milking again and again but now I
have this nice LP that tells me when to cancel using the LP. So it's still the same
calculation.
All right. So now I'm going to make a real leap. So I think I'm ready to make my
last grand leap. Shoot, is it really 4:22? Wow. Okay. So in fact the stochastic
knapsack with correlated rewards that I was talking about is just an example of a
Markov chain, a Markovian bandit that looks like this.
So if I have one arm, if I have one item, okay, and its rewards are R1, R3,
and R4 for stopping at times 1, 3 and 4, right? Then it's really like going
through this Markovian process. That is, I start, and with probability half I terminate.
This is the terminal state of stopping at 1.
And I get some reward, right? Then with the remaining probability I continue. I
don't terminate at all at time 2. So I continue with probability 1. And then with
some other conditional probability on having arrived here, right, I continue and
terminate at 3.
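In symbols (notation mine): an item that stops at size \(t\) with probability \(\pi_t\) and then pays \(R_t\) becomes a path-shaped arm \(1, 2, 3, \dots\) whose conditional termination probability at node \(t\) is

\[
q_t \;=\; \frac{\pi_t}{1 - \sum_{t' < t}\pi_{t'}},
\]

with the terminal branch at node \(t\) paying \(R_t\). In the picture above, \(\pi_1 = \tfrac12\) gives \(q_1 = \tfrac12\), and \(\pi_2 = 0\) gives \(q_2 = 0\), i.e. you continue past node 2 with probability 1.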
So every (PI, RI) sequence that I got as input can be converted into one of these
path bandits. So what I was actually doing was solving a problem of trying to
figure out, given all these different path bandits, right, which one should I kind of
move now?
Right? This is sort of the most general preemptive case. I move one guy, I park
the other guys in some state. I say, okay, if I move you then I get a certain reward
from you. And so I just have a limited number of pulls -- B pulls -- in these bandits,
and I'm just trying to figure out which one of them I should move.
So that is the Markovian multi-armed bandit problem. So you have different
arms. Each arm has a set of states SI. It has a root state where you start, and it
has a transition graph, you know, which is -- the sum of the probabilities of
getting out of a state to the other states as one. So when I go to a particular
node in this Markov chain, if I say play that node, that means I might get a
potential reward from that state, from playing it. And then the chain goes to one
of the states that comes out of it, according to that transition probability.
Okay. And now I'm given a budget on the number of pulls. And I ask you,
maximize the total expected reward that you collected.
So this is the corresponding generalization. Here's some examples. And one
particular sort of assumption has been very strongly used in all the past work,
until our work.
And that assumption is that these reward numbers follow a Martingale process.
So what do I mean? If I look at the reward values, say, on these two descendants,
the expected value of the reward -- for example, if this guy was --
>>: [inaudible].
>> R. Ravi: Is half and half. So, yeah, so, for example, yeah, this guy is 6 and
this guy's 2, let's say. Right? Then the reward at this state is 4. But if I pull and I
go to two different states, the expected reward from playing them is also 4:
with probability half I'll play this guy and get 6, and with probability half I'll play this state
and get 2.
So the expected reward at the next state is going to be equal to my actual current
reward. Okay. So when -- and this kind of reward structure comes up when
you have a bias -- the bandit is really the effectiveness of a drug against a disease, or how
much money you can get, the bias of a casino machine.
And you're trying to learn that from no priors or from the uniform prior. So you
just have a distribution of how many positive and 0s and negative examples
you've got. And the expected reward that you get follows this process.
Okay. So a lot of the previous papers looked at the case when this, especially by
these authors. Looked at the case when this expected reward process has this
Martingale assumption. But before that this whole area of multi-armed bandit
exploration, this very vast -- there's some beautiful papers for infinite horizon
policies where I'm going to play this arm forever.
And the rewards I'm getting are time discounted. That would make sense, if you
look at infinite [inaudible] policy. And in that case there are some very simple
policies called index policies.
You don't have to look at any other arm, just look at one arm and compute a
number for every state and then your algorithm is just play the arm which has the
highest index. And that's optimal for these infinite horizon time discount policies.
This recent stream of work is for finite horizon problems. They're called
budgeted learning problems. Okay. And all of them arise either from
model-driven optimization and database systems so you see some parts papers,
or you see them from manufacturing -- this is an OR paper from a manufacturing
motivation.
But all of them use this kind of Martingale assumption. So where we started with
this whole project is: how do we get rid of this Martingale assumption?
Okay and the way we did that is by rewriting that last LP I had for small jobs, for
the general multi-armed bandit. Okay. Remember I said visit and stop? Right
now I have -- the LP looks kind of simpler. Again, I have two variables for each
state and remember you can be a state in any of the bandits, any one of the
different bandits. So ZUT says that at global time T, I'm actually going to pull or
play that arm U.
The arm in which U is -- U is actually a state. And W is going to be the -- if WUT is
1, then T is the time when I'm first entering that state U. Okay?
So sort of the first entering time. This is entering and this is playing. Okay.
Now, look at this. You enter at time T precisely if you were at the parent at time
T minus 1 and you pulled at your parent. Right? And then with some probability
you got to where you are. Okay. And this is valid if you have just one parent, for
example. So if I have, for example, a tree transition graph, you have exactly one
parent. And you enter your state at time T if I was at your parent and I pulled at your
parent at time T minus 1, and then with this probability. This is the transition probability of
that link.
Okay. You can't pull, right, more than you have entered. You can't pull at a state
until you entered that state. We visit that state before you get to pull it. That's
what this is saying. This is saying at any point in time you can pull only one on
average.
Right? And this says you can start at the root of everybody. That's the initial
state. And if you pull, you make the reward. Okay. So now my argument has been
simplified, because earlier I had to actually put in the conditional reward. All of
those, the messy conditional rewards, came because I had this --
>>: The budget -- you can see the time horizon is one to --
>> R. Ravi: That's right. I only get B pulls. Because at each time I can pull
once. Okay. So earlier it was easier with the knapsack to figure out how to use
the LP to simulate what the LP was doing, canceling. Right? But this one was
much more complicated. Because the solution to this will be some kind of a
fractional strategy going over this forest, over this tree.
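For orientation, here is my reconstruction of that LP for tree-shaped arms (notation mine): \(z_{u,t}\) is the probability state \(u\) is played at global time \(t\), \(w_{u,t}\) the probability \(u\) is first entered at time \(t\), \(r_u\) its reward, \(p_{\mathrm{par}(u),u}\) the transition probability from its parent, and \(\rho_i\) the root of arm \(i\):

\[
\begin{aligned}
\max\ & \sum_u \sum_t r_u\, z_{u,t}\\
\text{s.t.}\ & w_{u,t} \;=\; p_{\mathrm{par}(u),u}\; z_{\mathrm{par}(u),\,t-1} && \forall u \ne \rho_i,\ \forall t\\
& \sum_{t' \le t} z_{u,t'} \;\le\; \sum_{t' \le t} w_{u,t'} && \text{(can't pull more than you've entered)}\\
& \sum_u z_{u,t} \;\le\; 1\ \ \forall t && \text{(one pull per time step)}\\
& w_{\rho_i,\,0} \;=\; 1 \ \text{ for each arm } i, \qquad z, w \ge 0 .
\end{aligned}
\]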
Right, I have many arms. Each arm is a tree. Right? And I have for each state
in each arm a probability of visiting it of some time and also probability of pulling
it at a certain time. So it doesn't decompose that nicely. Okay? So basically in a
nutshell this is what we do. We do something like a flow decomposition. We
take the LP solution, whenever we find that in any arm there's any state which
has a positive pulling probability, if I pull a particular state at any point of time,
that must be because I was able to enter it. But if I was able to enter it I must
have pulled its parent. So its parent must have a positive pulling probability. So
this way we work our way back until we infer that the root has a positive pulling
probability.
Now what we do, we find the smallest amount of that pulling probability that we
can strip off from this whole tree that's left. Okay. So we strip off this fractional
forests of plays. That's the convex decomposition. And then we look at that
fractional forest of plays in a timeline and it sometimes has these large gaps and
what we have to do is we have to fill these gaps to make our Markovian
inequality proof work. So we have a gap filling phase.
>>: [inaudible].
>> R. Ravi: So I actually have -- so if I'm actually with you -- so this is what I just
said. As long as there's a positive play probability, I strip off a tree. Okay?
So something that I strip off may look like this, that is for a particular arm, right,
this state at time three has some positive play probability. This following state at
time four has positive play probability. But this following state has positive play
probability only at time seven, not at time four. Why? Because my LP says you
can enter you can wait and pull later.
So when I look at this forest in a timeline, right, this guy's seven. So after this
happens, there's actually a gap in my strategy. Okay. So this is what I mean by
gaps in my schedule. So if I take one of these fractional strategies that comes
out and I actually think of what it is asking me to do, it's asking me to play this at
three, and then it's asking me to wait and play this at seven. So it's actually
asking me to do this.
>>: With the gaps it's a perfect decomposition, right?
>> R. Ravi: With the gaps?
>>: Yeah, as long as you leave the gaps in --
>> R. Ravi: Right. Now we have one problem with that. So why can't I just run
this strategy? Okay. So what would running the strategy mean? I start this guy.
Right? And in case it actually transitions to that state, I wait for three steps.
Right? And then I run this guy. And in case it transitions here, then I wait and
actually that's the end of this -- this strategy just ended it, preempted it. That
means it's the end of it.
The problem is this waiting, because these waits will add to the capacity
or load of everybody else. Remember, I'll pick a random strategy, random arm,
and pick a random forest from it and I'm going to try to do this; these things will
add up.
So in fact the particular problem is if I schedule a guy at a particular time, okay,
so I'll tell you exactly why the gap is a problem. If I schedule a guy who is at
depth six from the root, okay, and he's scheduled at time ten, okay, so in this
particular strategy, if I actually -- this is global time, by the way, between one and
B. Okay. So if I actually do this, then I know that there will be absolutely no time
left for any of the other forests in the first few time steps. Right? So this is bad, because I don't
want to fill the whole time B with just one guy. So in fact the real problem is if
there is a node, a component in this forest whose depth, right, is less than -- if
twice the depth is greater than the current time that it wants to get scheduled,
right? So that means if I'm scheduled at time ten and my depth is six, then more
than half the ten units will be used just on me alone. That's not good, because
remember the way I tried to do the previous proof was I only wanted to run this
for half of the budget. So I don't want myself to fill more than half of the budget.
So the quick fix to this is you just take the strategy forest, sweep your way from
the back, and whenever anybody is scheduled at a time, right, which is smaller
than twice my depth -- another way of saying it is that more than half of the time slots
before me are filled by my own ancestors -- then I'm just going to push this guy back,
I'm just going to push all these guys back. So I'm going to compress this. But
there's a problem with compressing because when I compress I'm adding more
play on previous time steps when there was no play.
Okay. If you're still with me, right, then adding that is not really going to be a
problem, that much of a problem because how many different guys might come
and crowd a particular time step? Well, if somebody ever got pushed back, it
was because its own ancestors filled more than half of the slots before it. So if I'm
coming back to time T, there must be more than T over 2 of my ancestors in
the past, but the total play capacity up until time T is only T. So how many such guys
can come fill me up? On average two.
Okay. That's really -- it's a very simple argument. Just an averaging argument.
When I push things back, I'm claiming that no time step will get overcrowded for play:
from its current capacity of one it goes to a value that is at most three. It will just get
an extra two units of play. Fine, damp it by another factor of three.
So instead of damping by one-eighth, damp by one twenty-fourth. So the algorithm is:
you compress the forest, and you just pick a particular strategy forest with
probability equal to the root's playing probability damped by 24. And then you
just run this strategy. And the proof is just exactly what I did before.
There's nothing more you have to do.
Okay. I'm way past my time. So thanks for hanging in there. And there's really
only -- the one comment I want to make, even though I did this only for a tree,
you know I said this analysis works only for a tree, it actually works for very
general Markovian bandits. Because general bandits can be converted to
layered DAGs because we have a finite time horizon, only running for T steps.
So whatever transitions are happening in that general Markov chain just let it
happen in a time index chain like this.
And DAG bandits can really be blown up into an exponential-sized tree. Right?
Depending on -- I keep track of where I came to this state C from. Did I come from this
node A or this node B? If I blow up that thing I get an exponential-sized tree, so I
can keep a representation of that exponential sized tree implicitly and run this
algorithm.
So in fact everything I said will work even for the most general multi-armed bandit
with non-Martingale rewards. So really the interesting things are can we use
these kind of LP decompositions to come up with adaptive approximations?
Because our algorithm, once it strips out this forest, depending on the sizes, it
sort of does different things.
And we know from all the previous examples that you need to do that. You need
to be adaptive to get any kind of constant factor approximation for this problem.
So this seems to be a very appealing hammer to use for other sort of adaptive
approximations.
And there's some generalization we have some results for. Okay. Thank you
again. Sorry for going over time.
[applause]