>> Yuval Peres: It's a pleasure to have Ravishankar Krishnaswamy. He's going
to talk about approximation techniques for stochastic optimization.
>> Ravishankar Krishnaswamy: Thanks, Yuval. Thanks for the invitation. It's a
pleasure to be here. And okay. So like Yuval was saying, I'll talk about
stochastic optimization from an approximation algorithm point of view. And feel
free to stop me at any point in the talk if you have questions and so on.
So it's joint work with Anupam Gupta and Ravi, who are professors at CMU, Viswanath,
who is over at IBM T.J. Watson, and Marco, who is a student at CMU.
So what's the outline of the talk. Very high level outline. A little more formally,
we'll talk about the models that we study. I'll concretely present the problems
that we look at before moving on to our results. And the main technical part of this
talk will focus on this problem called orienteering; I'll define the problem in
that section, and then we'll come up with a general solution framework for this
problem and apply it to a couple of other problems, which I'll describe before the
conclusions and future work.
So let's move on to the model that we study. So one thing we all know and
realize is that we have to be solving optimization problems all over the place.
And there are different models of looking at this notion of optimization problems.
So the one extreme is the notion of deterministic optimization, where before
solving the problem we assume we know the entire input in advance and we
solve a deterministic problem, a combinatorial problem it may be. And the downside
of that is that we assume that we know everything about the input before we
come up with the solution. So I call it the know-all model, and in some sense it's a
little optimistic in real life.
We don't always know the entire input ahead of time. So the other extreme is the
model of online optimization, where we assume we know as little as possible
about the input before coming up with the solution.
And the solution is built incrementally as we learn more about the input. Maybe
it's a little harsh to call it the know-nothing model -- we know something about the input --
but just to illustrate the point.
And the flip side of it is that typically the algorithm assumes that it knows very
little about the input, and only as the input is revealed can it figure out
solutions. So it's a little pessimistic in its modeling assumptions.
And the gray area in between is sort of how most problems in real life are, where
we know more than what is assumed by online but not as much as what is
assumed by deterministic, which is what brings us to the notion, the concept of
stochastic optimization.
And it's been studied in different flavors all the way from 1955, when it was
initiated by Dantzig. Right after he came up with the simplex method for solving
LPs, one of the things that he started thinking about is how do you handle
uncertainty in the matrix and the input.
And we're not going to be looking at that exact model of stochastic optimization. We're
going to be looking at some different models, which I'll talk about when I get
there. Okay. And so here we try to use the nice results that
we know about deterministic optimization and also handle some sort of
robustness to the input uncertainty. So that's the problem. Since it was
initiated, in the last 50 years there's been a lot of work on this problem, and I'm
just going to paraphrase and abstract from this book by Birge and Louveaux on
stochastic optimization to motivate why we're interested in this problem.
The first thing it says is about the framework of stochastic optimization: the
goal is to come up with decisions, optimal decisions, to handle uncertain data.
And it's received a lot of contributions from different areas in science. And it has a
lot of practical applicability.
And our goal in some sense is to take more steps towards adding
computing to that list; it's absent from the abstract.
And what do I mean by adding computer science, or computing, to
this list? So the first thing that comes to mind: you want to compute the decision
efficiently, in some notion of polynomial time in the input. But not always can
we find efficient solutions; in particular, if the problem is NP-complete, you can't
expect to find optimal decisions efficiently, so you add the notions of near
optimality and efficiency. That's sort of what the high level goal of this talk is.
What does that do to our model? What do we need to do to address these
questions at a high level? So the first thing is we need to come up with nice
modeling, nice ways to model uncertainty in such combinatorial
optimization problems.
Second thing is, once we introduce uncertainty, how does the
solution space change? And the third thing is what techniques do we need to
handle uncertainty. And these are the three things that we look at in this talk
from a point of view of approximation algorithms.
So all the problems we look at are NP-complete. Yes? Okay. So the canonical
NP-complete problem, the knapsack problem -- the maximum knapsack problem --
is going to be the focus of the initial part of this talk, at least to answer these
questions at a high level. What is this problem? So we're given N jobs. Each
job has a reward and a size.
And we're also given a knapsack of size B. So in this example
the rewards are the numbers written inside each job. There are five jobs, and the
sizes are their lengths. And the knapsack has a capacity of B. And the goal is
to identify a subset of these jobs which all pack into the knapsack, such that
the total reward is maximized.
So in this example you can notice that all five jobs don't fit in, but four out of five
do, and you get a large reward. And, yes, you can think of it as a scheduler
which has a time reserve of B units of time. Each job takes a deterministic
amount of time, and you want to schedule as many jobs as possible before you
run out of time.
Again, it's an NP-complete problem. We know extremely good approximation
algorithms for this. In fact, you can get a one plus epsilon type approximation in
time that's polynomial in N and 1 over epsilon, for every epsilon.
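To make the baseline concrete, here is a minimal sketch of the deterministic problem, assuming integer sizes and budget. It is the exact pseudo-polynomial dynamic program rather than the one plus epsilon scheme just mentioned, and the numbers in the example are made up.

    # A minimal sketch: exact pseudo-polynomial 0/1 knapsack by dynamic programming.
    def knapsack_max_reward(sizes, rewards, B):
        best = [0] * (B + 1)                 # best[c] = max reward using capacity at most c
        for s, r in zip(sizes, rewards):
            for c in range(B, s - 1, -1):    # go downward so each job is used at most once
                best[c] = max(best[c], best[c - s] + r)
        return best[B]

    # Made-up example: five jobs, not all of which fit, but a good subset does.
    print(knapsack_max_reward([4, 3, 3, 2, 5], [7, 6, 5, 4, 8], B=10))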
So to address the first question: how do we introduce uncertainty into this
framework? The way we do that is through this problem called stochastic
knapsack. Again, it's been very widely studied in operations research and also in
approximation algorithms. So let me just talk about what this problem is.
So here instead of jobs having a deterministic size and reward they have a
random size and reward.
But the variant that we study is where the distribution is given as input. So to think
about this, you may assume that each job is a randomized
algorithm. It tosses internal coins, and there's a distribution over what its utility
will be in the end and a distribution over its running time. That's given as part
of the input.
What do we need to do? The only catch is that only after we run the algorithm
do we get to see the true realization of the algorithm: how useful it was to us
on that given run and how long it took to run.
So we know the distribution, and the true realization is revealed only when we run the job.
For example, if this was the original deterministic reward and size, now it's
given as a distribution: it's either a small size with some reward, with probability
half, or a large size and reward zero with probability half. So for each job we now
have a distribution in this form. And we need to somehow come up with a
scheduling policy.
So that addresses the first part: this is how we model uncertainty for
this combinatorial optimization problem. What about the second question: how
does this change the solution space?
Recall that in the earlier model, a solution is just a set of items that you pick
and that fit into the knapsack. Now it's no longer as simple as that. In the
most general case, solutions can actually be adaptive policies.
So what do we mean by that? All I mean is that subsequent actions can
depend on how the randomness turned out. In this example maybe the
algorithm decides to schedule the green job first, just by looking at all these
distributions, and depending on how the randomness panned out it can take different
actions.
Perhaps in this example the green job came up with small size but reward only five.
So maybe it chooses to do the -- for no particular reason, I don't have any reason for
these scheduling choices, they're all ad hoc -- maybe it chooses to run
the brown job.
On the other hand, if the green job had come up with a different instantiation it
could have done a different thing; that's the point where adaptivity is the key.
Say it decides to do the last job. Maybe the last job instantiates to some size, and it then
decides to do the middle job, which is deterministic -- it comes up in only
one possible way.
And then it decides to do the fourth job. And when it does the fourth job, the knapsack
has already expired by then. It does not get any reward from that job because
it's not completed in time, and the game ends there.
>>: So when you say adaptive policy, are you saying the next action -- is
there an order on the items, where the randomization appears in order? I don't
understand --
>> Ravishankar Krishnaswamy: Yes. So it's a completely offline problem.
Every item is given as a distribution. The algorithm has to first pick an item to
schedule, just by looking at all the distributions. Then it is irrevocably
committed to scheduling that item.
>>: Then it sees the random coins.
>> Ravishankar Krishnaswamy: Then it sees that this item took five time units to
complete and gave me some reward, and it can choose a different job. That different job
can change depending on how this randomization occurred.
>>: I see.
>> Ravishankar Krishnaswamy: And there are different models here. So in one
model it can, if a job is taking too long, it can just abruptly cancel the job maybe
run a different job and come back to this job and so on. But we will look at the
basic model. It has to commit to scheduling a job completely just before running
it.
And after it completes, it can decide to run the next one. Yes. So earlier while it
was just a subset of items to schedule, now it's a complete decision tree.
To look at it in a more formal way: maybe the algorithm decides to run job two
just by looking at all the distributions, and job two has two ways in which it
randomizes. For each of these branches the algorithm can choose a different job
to schedule. So maybe if it comes up with reward five and size four it decides to insert job
five; on the other branch it decides on job three, and so on.
On each of these sample paths, the knapsack gets expired at some point, and you
have to count the total reward you have collected before it expires, and the goal
is to come up with this decision tree which maximizes the total expected
reward. So that's the problem. Earlier this was just a single path and you wanted to
maximize the reward.
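To make the notions of an adaptive policy and its expected reward concrete, here is a small sketch that evaluates a given policy by recursing over all realizations. The encoding of jobs as (probability, size, reward) outcome lists and the greedy example policy are purely illustrative, not the algorithm of the talk.

    def expected_reward(jobs, policy, budget):
        # jobs:   list of distributions; jobs[j] is a list of (prob, size, reward) outcomes
        # policy: function(remaining_budget, frozenset_of_unscheduled_indices) -> index or None
        # The policy commits to one job, observes its realization, then decides the next one.
        def go(remaining, todo):
            j = policy(remaining, todo)
            if j is None:
                return 0.0
            value = 0.0
            for p, size, reward in jobs[j]:
                if size <= remaining:
                    # job finished in time: collect its reward and continue adaptively
                    value += p * (reward + go(remaining - size, todo - {j}))
                # otherwise the knapsack expires mid-job: no reward, the game ends
            return value
        return go(budget, frozenset(range(len(jobs))))

    # Illustrative (hypothetical) greedy policy: among unscheduled jobs whose smallest
    # outcome still fits, pick the one with the highest expected reward.
    def greedy(remaining, todo):
        fits = [j for j in todo if min(s for _, s, _ in JOBS[j]) <= remaining]
        return max(fits, key=lambda j: sum(p * r for p, _, r in JOBS[j])) if fits else None

    JOBS = [
        [(0.5, 1, 5), (0.5, 4, 0)],   # small size with reward 5, or large size with reward 0
        [(1.0, 2, 3)],                # a deterministic job
    ]
    print(expected_reward(JOBS, greedy, budget=4))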
Okay. So that's addressing question two about this model. And what about
question three? And the answer I'd love to say is you'd have to stay for the rest
of the talk.
Okay. So what's another motivation -- aside from being interesting in its own
right, what's another motivation for studying such problems? These are a
branch of problems called stochastic scheduling problems, where, like I said,
the jobs are random and we want to come up with policies to schedule them.
And they have been widely looked at from an OR point of view. There's a lot of
research; I'm just going to highlight some of it, and in fact there's a good
survey I'll point to. And there's also another interesting point: a lot of
these algorithms use nice heuristics that ignore almost everything about the
distribution, look only at quantities like the expected value and the variance of the
distribution, and run deterministic algorithms on the reduced problem.
And one meta question for focusing on this problem is: can we
explain their effectiveness from a worst case point of view, for arbitrary distributions,
or do these algorithms require some nice properties of the distributions? That's the
motivation. The problems that we look at in this talk are, like I said, the
stochastic knapsack problem; then an extension called stochastic routing
problems -- if you just want a quick insight into this problem, these
are problems where random jobs are located at different vertices of a metric, and you
actually have to travel to a vertex before you can process that job. And another
generalization is the multi-armed bandit problem, which I guess I'll get to at
the end of the talk. If you know what it is, that's fine; otherwise I'll define it
at the end of the talk.
So we look at all these problems and try to provide a unified framework toward
solving them, viewing them from the approximation algorithm's point of view.
And, yes, so more concretely, our results. We give efficient
approximation algorithms for the knapsack problem, for the bandits problem, and
for the orienteering problem. And like I said, previously we only knew theoretical
guarantees for special cases of these problems.
Okay. Again, to reiterate the meta problem: we want to understand the
techniques that we need, A, for algorithm design, because the
solution space has changed, and B, for analyzing the algorithms, for combinatorial
optimization problems from a stochastic point of view. And the takeaway: optimal
solutions can be very adaptive, and algorithms may need to be also. So that's the
first part of the talk.
So now we move to the main technical part of this talk -- in fact the only
technical part of this talk -- which will focus on the problem of stochastic orienteering,
and I'll define that problem now.
Once I define the problem, we'll look at the known results, and then I'll motivate
the reduction before moving on to proofs.
>>: How is this problem related to the decision tree.
>> Ravishankar Krishnaswamy: This is one of the problems I mentioned. This is
the routing section. This is an example of a routing problem.
>>: Okay.
>> Ravishankar Krishnaswamy: So what's this problem? If you want the
quick one-sentence takeaway for this slide: it's essentially stochastic
knapsack, but the jobs are located at different vertices of a metric.
In more detail, we're given a graph with metric distances, and we're also given as
part of the input a distribution of processing time, or waiting time, at each vertex.
Think of these as the random jobs we have to do, and each has a distribution
over how long it takes.
And we're given fixed rewards at these vertices: if I complete a job I get a fixed
reward. There's a total budget of B on how long I have. That's combined, both for
traveling -- so we need to first travel according to these distances -- and for
waiting. So there's a combined budget.
>>: Objective is time?
>> Ravishankar Krishnaswamy: Yes. So these two are in the same -- yes.
They're in the same unit.
>>: When you said graph with metric distances, did you just mean the graph metric?
>> Ravishankar Krishnaswamy: Yes. Exactly. The metric is a notion of time: how
long it takes to travel from A to B. And additionally you have to wait at
the vertex because there's a random job you have to process.
>>: Uncertainty is only in the waiting?
>> Ravishankar Krishnaswamy: Uncertainty is only at the vertices and not in the
travel times.
>>: Not in the rewards.
>> Ravishankar Krishnaswamy: Not in the rewards, correct. So we can handle
uncertainty in the rewards but currently we have to have a deterministic metric in
this problem.
>>: Looking at the problem, is it how much time it takes to switch from one job to
the other?
>> Ravishankar Krishnaswamy: You could think of it like that, although that
is not the motivation, because at least when I talked to an architecture person he
sort of said that's not really realistic in that framework. But you can think of it
that way -- I'll give an example with a more naturally motivated
objective.
>>: Total distance traveled.
>> Ravishankar Krishnaswamy: Plus the weighting time.
>>: I see.
>> Ravishankar Krishnaswamy: Okay.
>>: So stochastic knapsack is the case where the metric is --
>> Ravishankar Krishnaswamy: So, yeah, that reduces to the previous problem:
there's no traveling at all, the metric is zero. Okay, so what is the objective?
The objective, like I said, is to come up with an adaptive strategy. A strategy
says which vertices to visit: maybe it's vertex two first, and
depending on how long the job at vertex two took it can visit different vertices, and
so on.
At some point it runs out of time, and you want to
come up with the strategy that maximizes the total expected reward.
>>: The starting vertex is fixed?
>> Ravishankar Krishnaswamy: The starting vertex is fixed. There's a given root.
So, yes, it's fixed. So here's an example to address all these questions that you guys
asked. So let's say we start at home at 9:00 a.m. and we have to try to do a
bunch of chores. We have to send mail at the Post Office, we have to maybe
deposit some checks at the bank, get some food, and also buy some tickets. And
all these places close at 5:00 p.m., so that's the budget for this problem: eight
hours is the budget for this problem.
And there's some distribution over how long we have to wait at each of these
vertices -- you can think of queues at these vertices -- and this is the distribution of how
long we have to wait.
We start at home at 9:00 a.m., and maybe one possible solution says visit the
Post Office first. It takes two hours to travel to the Post Office -- the traveling
times are indicated there. It's 11:00 now, and maybe the distribution at the Post
Office is such that we're extremely unlucky and it takes six hours: we are still
waiting at the Post Office and haven't gotten to the head of the queue, which means in
that case we haven't done anything and all the shops have closed. Alternatively we could be
lucky and it only takes two hours, so in the second sample path by the time it's
1:00 we've completed the Post Office, and we can do more actions.
So maybe we decide to get some food, but it takes three hours to travel to the
restaurant, and by the time it's 4:00 we've reached the restaurant.
>>: Does the distribution depend on the time of day?
>> Ravishankar Krishnaswamy: Yes, it's an assumption we make that it's
completely fixed, independent of the time. And maybe the restaurant, as I said, is
like a pipeline: it deterministically takes half an hour to get the food. After half an hour
we've got the food, and maybe we decide to go to the bank, but we run out of
time walking to the bank. Notice that if we were lucky at the Post Office, we got two
tasks done; otherwise we got zero tasks done. And the goal is to come up with the
best such policy to maximize the number of tasks -- each task has
unit reward here.
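A small Monte Carlo sketch of this chores example, just to pin down the semantics of the shared budget; the route, travel times, and waiting-time samplers below are made-up stand-ins.

    import random

    def estimate_tasks_done(route, travel, wait, budget, trials=100000):
        # route:  vertices in visiting order, starting from 'home' (a fixed, nonadaptive route)
        # travel: dict (u, v) -> deterministic travel time
        # wait:   dict vertex -> zero-argument sampler of the random waiting time there
        # A task counts only if we finish waiting at its vertex before the budget runs out.
        total = 0
        for _ in range(trials):
            t, here, done = 0.0, 'home', 0
            for v in route:
                t += travel[(here, v)] + wait[v]()
                here = v
                if t <= budget:
                    done += 1
                else:
                    break
            total += done
        return total / trials

    # Made-up numbers loosely following the example above.
    travel = {('home', 'post'): 2, ('post', 'food'): 3, ('food', 'bank'): 2}
    wait = {
        'post': lambda: random.choice([2, 7]),  # lucky two hours or unlucky seven hours
        'food': lambda: 0.5,                    # deterministic half hour
        'bank': lambda: 1,
    }
    print(estimate_tasks_done(['post', 'food', 'bank'], travel, wait, budget=8))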
So, the interest on the practical side: these orienteering problems, and more
generally routing problems, have been looked at from an approximation algorithms
point of view and an OR point of view, in both the deterministic case and
stochastic settings. For the deterministic box there's very extensive literature, and
you can refer to the citation I made earlier.
And the stochastic routing problems have been looked at, but the results are only
for special cases, for special types of distributions. We're instead
interested in a worst case guarantee over all distributions. What about the
deterministic problem from an approximation point of view? This problem,
where the waiting time at every vertex is deterministic, has received attention in
the past, and there's a very nice, elegant constant approximation for it.
>>: What is the name?
>> Ravishankar Krishnaswamy: It's called the orienteering problem. And the fourth box is
what we're interested in, and that's the focus of this talk. Okay. Again, just
reiterating one more fact: if you look at the heuristics, we know a
lot of nice heuristics work for this problem in practice and simulations, and
they all throw away almost everything about the distribution and keep only very few
factors, like the variance and the mean and so on.
So can we explain their effectiveness from an
approximation point of view? And what's the set of tools required? So with
that, our results. We find nonadaptive solutions: after talking about adaptivity,
we end up finding a solution which is nonadaptive, a sequence of vertices to visit,
which is a factor O(log log B) approximation to the best adaptive policy.
And recall B is the budget of this problem. And the nice thing about that
nonadaptive solution is that it's simultaneously a constant approximation if you're only
comparing against nonadaptive policies. So those are the results we get.
Yes, we assume that everything in the problem is discrete; this is without much loss of
generality. The focus of the next section of this talk is the log log B
approximation and how we get it.
The main motivation for our reduction is from the tools that work in practice
and simulations: throw away everything about the distribution, just look at
expectations, and try to work with them. How well does that perform?
Here's the high level approach. We are very good at solving deterministic
optimization problems; in particular for orienteering there's a very good approximation
algorithm, a factor three. So why not try to reduce the stochastic instance to a
deterministic instance? Let's try to use these expectations and reduce it to a
deterministic version of the problem and solve the deterministic version; and the
deterministic version naturally gives a nonadaptive solution, since there's no notion
of adaptivity. This is the high level approach that we're indeed going to follow.
The natural candidate for this deterministic problem is orienteering itself, for which
we know constant approximations.
So, as the slide title suggests, it's not a reduction that's going to work in
its entirety. Let's first look at what the reduction is, then present the bad example
for such a reduction. For every job we're just going to replace the random variable
by a deterministic value equal to its expectation.
So I'm going to think of the waiting time as the expected waiting time. Now we want
to find a path whose length plus waiting time is at most B and whose reward is
maximized.
So this is a problem on a completely deterministic instance, and in fact this is the
deterministic version of the orienteering problem, for which we know constant
approximations.
The real question is how well this performs on the original instance. And
as you might have guessed, as I already indicated, this does not work
for the original instance. So here's a bad example.
Imagine that the home is here, and all the random tasks are located at a
distance of B minus 1, all at the same vertex -- essentially look
at them as being at the same vertex. And there's an alternate community, a neighboring
community, also at a distance of B minus 1. Now what do these jobs look like? In the
first community every job has extremely high variance.
So each job has a waiting time of zero with overwhelming probability, or a
waiting time of B with some tiny probability. And in the other community
they're all completely deterministic jobs with waiting time one-half.
So how does this reduction perform? When we throw out everything about the
distribution, in particular we throw out the variance, and we get that the expected waiting
time here is 1 and the waiting time here is half. So when we solve the deterministic
version of the problem, it will prefer traveling here and picking two jobs out of
this community, so it gets a reward of 2. You can see the
deterministic solution gets just two jobs;
it prefers this cloud over that cloud. On the other hand, for the stochastic instance
you can notice that because these jobs have very high variance, they're going to turn up 0
with overwhelming probability. In fact, Omega(B) of these jobs will all finish with
size 0, and you can collect the reward for all of them. So with constant probability
you can get B jobs, and therefore it gets Omega(B) profit in total. You can notice that by
throwing out the variance, the deterministic instance got fooled into going into the
second community instead of preferring the first community. So, yes, the takeaway is
that the deterministic reduction here captures the mean perfectly,
but it ignores the variance.
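The gap in this bad example can be checked with a few lines of arithmetic; the following sketch just carries out that calculation for an arbitrary choice of B.

    import math

    B = 1000
    residual = 1   # budget left after traveling B - 1 to either community

    # Community 1: each of B unit-reward jobs has size 0 w.p. 1 - 1/B and size B w.p. 1/B.
    # Scheduling them one by one, job k completes only if the first k jobs all instantiate
    # to size 0, so the true expected reward is sum_k (1 - 1/B)^k, which is Omega(B).
    true_reward = sum((1 - 1 / B) ** k for k in range(1, B + 1))
    print("community 1, true expected reward ~", round(true_reward, 1),
          "(about B(1 - 1/e) =", round(B * (1 - math.exp(-1)), 1), ")")

    # Community 2: deterministic jobs of size 1/2, so exactly 2 fit in the residual budget.
    print("community 2, reward =", int(residual / 0.5))

    # The mean-only surrogate sees expected size 0*(1 - 1/B) + B*(1/B) = 1 in community 1
    # versus 1/2 in community 2, so it prefers community 2 and settles for reward 2,
    # an Omega(B) factor worse than the stochastic optimum.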
>>: Are there equally bad counterexamples for natural distributions like normal,
Poisson, anything --
>> Ravishankar Krishnaswamy: I would say no. Yeah, I don't know for sure.
But my guess would be -- so one thing is, if the reward can also be correlated with
the size, then I think there are some easier bad examples, not such harsh bad
examples. But again they may not hold if the distributions are nice in
their respective marginals.
Okay. So the main thing that we observed was that we need to
somehow control the variance in that reduction. And at least in the previous example
we know a lot more about the optimal solution, so we can
do a lot more to control the variance. In particular, suppose we know that
the optimal solution always travels for some fixed T star time units. It's not true
in general, but suppose this were true. In the previous
example it's always B minus one: regardless of which community it goes to, it travels
B minus one time units. And again the traveling may not occur all up front, but
let's say we rearrange and put all the traveling up front. So then what it's doing is
solving a residual knapsack problem on the jobs that are located there, with a
knapsack budget of B minus T star. And this may be adaptive -- it can depend on
the sample path -- but on all the sample paths it's solving a knapsack problem with
budget B minus T star. In the previous example T star was B minus 1 and the residual
budget was just 1.
Therefore, the insight that we get is that we should really not look at the
distributions beyond B minus T star. In particular, what I'm saying is there is no
point in the distribution supporting larger sizes, because we're not going to get any
profit from those large instantiations anyway. Because we're traveling T
star, if any size crosses B minus T star the game ends and we're not going to get
a reward from that job. So why not truncate the distribution at B minus T star
and look at every large size as just being essentially B minus T star. So what do I
mean by that?
One thing in particular is that because we don't look at very large sizes, this
gives us much better control over the variance.
So what does this do in the previous example? Now we'll argue that the first
community on its own does very well. Why is that?
Earlier the mean waiting time was 1. But because we always have to travel B
minus 1 units, the knapsack problem we're solving always has budget 1. Therefore,
I'm not going to treat the large size as B anymore; I'm just going to treat it as 1 plus epsilon,
and I won't collect any reward in that case. If I do that, then the
expected truncated size becomes 1 over B, and the deterministic version after truncation can
collect all the B jobs here.
So now the deterministic instance has this good property. The questions that
come up: the main assumption was that the optimal solution always travels for some fixed
amount of time on all its sample paths, which is probably a dubious assumption,
as you might have guessed.
The question is: is there such a single good threshold at which we can truncate, and
is the resulting solution good in general? We answer that affirmatively.
To address the question of whether there is a good truncation in general, the main
issue is that we don't know the structure of the optimal solution as well as in the prior
example that we considered.
In particular it's a very adaptive solution, and it can have different traveling
times on different sample paths.
Okay. So the answer is yes, as you might have guessed, and in fact there's a
very simple fix for it: try all possible truncation thresholds, and one of
them will work out well.
So what's the final algorithm that we look at? It's more or less just reducing to
the deterministic instance, but after a preprocessing step of truncation. We'll try all
possible values of the truncation threshold: one, two, up to B -- log many values, powers of
two -- and choose the one which has the most deterministic reward. What's the subproblem
that we solve?
The first step is to truncate every job size distribution at the value B minus
T. What I mean by that is: suppose this was the size distribution of the job. I'll
draw a line at B minus T and think of every large size as being B minus T itself.
And we'll zero out the reward in this region, because there's no way you can
collect the reward there. After this truncation, define
the size of the job, which is now a deterministic quantity, to be the expectation after
truncation.
So now what's the problem that we need to solve? Notice that we
assumed that we are traveling for T units of time; that's what led us to truncating
at B minus T. We want to find a path that travels at most T units of time, such
that -- there's a separate constraint on the size -- the size
of the jobs it picks up along the path is at most B minus T, where the size
is as computed in step two, and the reward is maximized. That's the final subroutine we
solve. So barring the truncation, this is almost the simple heuristic suggested
up front.
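A sketch of the preprocessing in steps one and two, assuming each size distribution is given as a finite list of (probability, size) pairs; the knapsack-orienteering solver invoked in step three is not shown.

    def truncated_expected_size(dist, cap):
        # dist: list of (prob, size) pairs; any size at or above cap is treated as cap,
        # and the reward of those outcomes is zeroed out elsewhere in the algorithm.
        return sum(p * min(size, cap) for p, size in dist)

    def build_surrogates(size_dists, B):
        # One deterministic knapsack-orienteering instance per guess T = 1, 2, 4, ..., B of
        # the optimum's travel time: truncate every distribution at B - T, then ask for a
        # path of length at most T whose truncated sizes total at most B - T, maximizing
        # reward (the solver for that subproblem is not shown here).
        surrogates, T = [], 1
        while T <= B:
            cap = B - T
            det_sizes = {v: truncated_expected_size(d, cap) for v, d in size_dists.items()}
            surrogates.append({'travel_budget': T, 'size_budget': cap, 'sizes': det_sizes})
            T *= 2
        return surrogates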
The first observation is that since we're solving a deterministic
problem, it gives a nonadaptive solution, which is what I claimed up front. Two
things linger: how good is the solution for the stochastic instance? And is it
even polynomial time? Why am I raising that? It's unclear if step 3 can
be solved in polynomial time.
We answer both these questions affirmatively and show the theorem that
this algorithm is an O(log log B) approximation to the best adaptive policy.
Okay. So let's move into the proofs. What do we need to show? The first thing we
need to show is that, since the algorithm chooses the candidate solution with the most
deterministic reward, there is indeed a good candidate solution with high reward
compared to the stochastic optimum.
The second question is: can we implement all three steps in polynomial time?
For this we need to implement a subroutine where we find a tour which has bounded
length and, in a separate dimension, a bounded total size of the jobs it picks up, while the
reward is maximized. This is a problem we call the orienteering problem with a
knapsack constraint, which is a subroutine for our overall goal. And we show we can
approximate this within constant factors.
I won't talk about that in this talk, but trust me on that. And the third step: the
solution we find for this is nonadaptive. The question is how does it relate to
the original instance? Having ignored everything and just looked at the
expected values, does it still do well in the original instance?
So these are the three questions that we address now. In
particular, if you go back to the old outline, we're
just using as the surrogate problem the orienteering problem with knapsack
constraints. To answer the first question, we show that if you look at the
optimal adaptive solution, it can be embedded into the deterministic instance for some
guess T star with a gap of log log B. The second question asks can we
implement knapsack orienteering; we answer it by giving a constant approximation
for that. And the third question is, once we've found a solution, how well does it do
in the original stochastic instance? We again show it does very well, up to
constant factors. There are three parts to address. Let's look at the easiest part first,
which is moving from the knapsack orienteering solution to the original stochastic
instance. Okay, so what do we have? Assume that we have a good solution
to the knapsack orienteering instance. In particular, you have a path which travels
at most T, has reward at least R, and the total size of the items it visits is at most B
minus T, where each of these sizes is the expected value that we have computed.
So now what's the candidate nonadaptive solution? We have a candidate
path; let's follow the same prescription: visit all the vertices in that order and try
to fetch the reward. But one thing this did not capture was the variance. So
to account for some slack, we also do a preprocessing step: when we visit a vertex we
simply toss a coin -- with probability half we don't process the job at that vertex, and with
probability half we actually process the job at that vertex.
So this is our final nonadaptive policy. How well does it do? The analysis
is extremely simple, and it fits into this little box here. It's just an application
of Markov's inequality and linearity of expectation. First, we know that since we
follow this prescription, the total traveling we do is at most T. But what about the
total size? The total size, if we were to process everything, has an expectation of
B minus T, because each size is in itself an expected size.
But we also subsample with probability one-half, so the total size in expectation is
now at most B minus T over 2. We don't have any control over the
variance, so the best we can do is apply Markov's inequality here: all the jobs together
finish within a total of B minus T with probability at least half. In that case we get
the claimed Omega(R) reward in the original instance. So this gives us a nonadaptive
policy which fetches an expected reward that is a constant fraction of R.
>>: So [inaudible] distribution.
>> Ravishankar Krishnaswamy: Right. So I was cheating a
little bit here. We have to handle the truncation separately, but we can handle
it, and the proof mostly follows along the same lines.
The truncation actually reduces the size a little bit, but we don't have any
control over the truncation here, so we need to handle that; it can be done. Okay.
So we've seen that if we solve the knapsack orienteering problem well, we can get a good
solution for the original stochastic instance. Let's move on to the harder part -- I'm
going to skip this part. The other part is to show there's indeed a good feasible
solution for the deterministic surrogate that we have created.
Okay. So notice what this entails; we need to show the following. We have
an optimal stochastic solution. You can think of this as a stochastic process
which visits some vertex up front and, depending on how the
randomness turns out, can do different things on different sample paths. Think of it as a
stochastic process. And what do we need to show?
We need to show the following: that there is some sample path in this process,
as indicated by the curved line there, which has the following nice properties.
If the sample path travels for T star units of time in total, then if you
look at all the vertices it visits along the way and truncate their means at B minus T
star, the sum of these truncated means is also at most B minus T star, as we need.
And furthermore it has good reward.
Why are we done if we show this? If we show this then we obtain a feasible
solution for the deterministic surrogate we created. In particular, for this
choice of T star -- T equal to T star -- we will find that sample path, or at least
something as good, in step three.
So we are done if we show this: that there's some nice sample path which
has a good bound on the sum of the means, and good reward. Okay, so let's first show
this under some assumptions. Before showing this, I need a
little bit of notation.
I'm going to call a sample path well behaved if it satisfies the following
property: if it travels at most T units of time in total, then the sum of
these truncated means is at most some alpha times B minus T. Why would you
expect this to be true? One intuition is: we know that if you look at any sample path,
if it travels at most T, then the sum of the actual processing times is at
most B minus T, because it's a feasible solution -- it has to
spend at most B time in total, and if it spends T traveling it waits for at most B
minus T. We know that's true for every sample path. So for some constant alpha, even if you
look at the means, it's reasonable to expect that this concentration holds.
So it's reasonable to expect that most of these sample paths are well behaved
with constant alpha.
The easiest case is if each sample path is well behaved. In that case it's
just going to be a convexity argument. If each sample path is well behaved,
it's very easy: we know that since the average of all these sample paths
fetches a reward of opt, one in particular must have reward at least opt in
expectation. And for all of them, if you sum up the truncated means,
it's at most alpha times B minus T, and so I can pick the best fraction of them, an alpha
fraction, and get a feasible solution for the deterministic problem we create.
So the real case is when not every sample path is well behaved. The question is how
do we handle the badly behaved sample paths? The answer, the fix we have,
is also very simple.
We look at the stochastic process and terminate the process when it
misbehaves. What does that mean formally? We look at the stochastic process and
travel down every sample path. So this was the first vertex it visited; on the left
branch it visited some second and third vertex, and so on. We travel down each sample
path and look at the first node along any sample path where this condition is
violated.
That is, the first node where, if you sum up the truncated means of all previous vertices,
the sum exceeds alpha times B minus T. The first time T when this
happens I'll call a stopping time, and let's mark that node with a star.
One technical aspect here is that these means are truncated at the
stopping time: I look at all the previous random variables and truncate their
expectations at B minus T, where T is the depth at which this happens.
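A sketch of this stopping-time definition on a single sample path (purely illustrative; the actual analysis works over the whole decision tree):

    def first_star_node(path, trunc_mean, alpha, B):
        # path:       list of (vertex, travel_so_far) pairs in the order visited
        # trunc_mean: function(vertex, cap) -> E[min(size at vertex, cap)]
        # A node at travel depth T is a star node if the means of all previously visited
        # jobs, each truncated at B - T, already sum to more than alpha * (B - T); the
        # analysis chops every sample path of the decision tree at its first star node.
        for i, (_, T) in enumerate(path):
            cap = B - T
            prior = sum(trunc_mean(u, cap) for u, _ in path[:i])
            if prior > alpha * cap:
                return i          # first misbehaving node: truncate the tree here
        return None               # this sample path is well behaved throughout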
And those are the star nodes. On each sample path it computes this quantity and marks the
node; that's the first time when opt misbehaves. What we do is truncate the tree
at the star nodes, so we don't look below them. Now it's clear that the new
solution we've constructed is well behaved by definition. The only thing we
need to show is that by doing this chopping off we don't lose too much reward in
expectation.
In particular, the loss can be upper bounded by the probability that the
optimal solution reaches such a star node times the expected reward it gets
below a star node, and below each star node it cannot get more than opt -- otherwise that
would contradict optimality. So the main crux of the argument is showing that the probability
that you reach a star node is at most half.
And to help along the way, I'm just going to explain
how you should view the optimal solution as a stochastic process. Imagine
this was the optimal solution: you start at home, then you go to the Post
Office. You have to travel three units of time to go to the Post Office, or maybe
you have to travel four units of time to go to the Post Office.
The random variables we're going to create, the stochastic process we're
going to look at, is the following. Look at the optimal solution; look at where it
is after it has traveled t units of time. If, after it's traveled t units of time, it's
actually at a vertex and processing a job, then X sub t is going to be the random
variable for the processing time of that job. On the other hand, if it's along an
edge, traveling between two vertices, then X sub t is zero.
So in this example, suppose it took four units of time to reach the
Post Office from home. Then X sub 1, X sub 2, X sub 3 are all zero,
and X sub 4 is the random variable corresponding to the Post Office distribution.
Depending on how long it took at the Post Office, X sub 5 can be a different random
variable. In this case it's traveling on both branches at time five, so X sub 5 is zero. But on
the left branch it's still traveling at the sixth unit of time, whereas it's already reached the
restaurant on the right side, so X sub 6 is different on the two branches.
So what is the motivation behind all this? The nice thing is that if you sum up X sub t
from 1 to capital T, you get the total time that opt has waited after traveling for T
units of time. That's one thing. And the second thing is, if you look at the
difference X sub t minus its expected value, that forms a process. These are not
i.i.d., because X sub 6, for instance, can depend on what happened with X sub 4, but they
form a martingale difference process nonetheless.
So now, using this notation, what does proving the lemma
translate to? Recall that we were truncating opt at the first point where it
was misbehaving. That is, looking at it from this definition: the sum of all these
expectations, truncated at B minus T -- the first point where it exceeds alpha times B
minus T is the stopping time in the process.
So I look at the smallest T such that this condition is satisfied, and we want to
show that the optimal solution does not cross such star nodes with too high a
probability.
So what does that translate to? It says, firstly, there must exist a time where the
optimal solution can actually cross it; that means the total time for which it has
waited is at most B minus T, because T is the time it travels for. And it must
satisfy the stopping time criterion, which says that the sum of these expected
values exceeds alpha times B minus T. We want to bound the probability of this
event and show it's at most half. That's what we want to show. If we show that,
we are done.
The natural thing might be to use some sort of martingale concentration bounds.
The only thing I'm going to mention here, without going into too much detail, is
that typically martingale bounds control the deviation of a sum of
random variables from the sum of their expected values. Whereas what we have here,
metaphorically, is different.
What we want here is to bound the deviation of a sum of random variables from a
bunch of expected values, where each expected value is a function of the
stopping time.
So if you look at the stopping time T, you go back in time, look at all the previous
random variables, alter them based on the stopping time, and compute their
expected values, and you want to bound this deviation. That is why we need
non-standard concentration bounds: because it's comparing apples and oranges,
and in particular it's like picking a bad apple out of a whole bunch of oranges.
So that's the takeaway from this slide. We do that by running several
martingale processes, one for each truncation threshold. And for each
truncation threshold the stopping time of the optimal solution is a random
variable in itself.
A standard concentration bound alone does not suffice, and we use a variance-based
inequality for martingales to get the result.
And the reason for log log B is: we can get exponential concentration for each of
them, but we have log B different martingales, so we need a deviation of log log B in
each; the union bound overall gives log log B.
So we've seen this. In the last couple of minutes I'm just going to talk about the
bottom arrow, and all I'm going to say is that we can approximate this problem
well. It uses the Lagrangian relaxation
method: it lifts the knapsack constraint into the objective function, and for a
suitable choice of the Lagrange parameter you can solve this problem and get a
constant approximation.
Okay. So what did we learn from this whole sojourn? The first thing is we followed this
framework: we have a high entropy problem here, with all the distributions, and we say
let's ignore the entropy in its entirety and use expectations.
So it's a deterministic problem; solve it, and then recover a solution
for the original problem. But the natural candidate did not work, because it only
encoded the mean and did not capture the variance. Again, this is the problem that we
saw: if you don't encode variance, then there's a problem.
So we got around that by enforcing structure on the optimal solution, by saying, look,
it has to travel for some T units of time. And now we can look at every random
variable, truncate it, and get much better control over the variance, and this
turned out to be good enough up to log log B factors.
And the bulk of the analysis goes into showing that
enforcing this better structure on the optimal solution does not change things too much.
So that's the martingale proof that I skipped. What did we learn? We
identified a general framework using a deterministic surrogate
problem.
But there are things we need to think about when we come up with the surrogate: it
should be able to handle the variance in the problem -- that's what the previous
example showed. Then, if indeed there are correlations, it should be
able to handle correlations between random variables; in the previous case I
used independence all over the place.
And thirdly, sometimes the surrogate must be general enough to yield adaptive
solutions. There are some problems you cannot do well with nonadaptive
solutions -- there's a large adaptivity gap -- and your surrogate should be general
enough to handle adaptive solutions. These are considerations you need to keep
in mind when you construct a surrogate.
For a lot of problems our framework says you can come up with nice surrogates,
but the difficulty is often pushed into the analysis. Why is that? At a very high
level, the surrogate is a strengthening instead of being a relaxation. Usually when
we write an LP relaxation or a mathematical programming relaxation, the relaxation is truly a
relaxation: the optimal solution can be embedded into it. Here we are saying let's
ignore everything about the entropy -- we start with a high entropy problem and create a
low entropy one: I'm going to look only at the means and variances and solve the problem
using only those parameters. So the surrogate assumes a lot more about the input than
the original problem, and that pushes the difficulty into the
analysis.
Now having seen this framework, how does it apply to the other problems that I
mentioned up front? So that's the focus of this part of the talk.
So now going back to the knapsack problem: recall that now the jobs have
distributions over sizes and rewards, and these can be correlated. This
part of the talk will address how to handle correlations: the size variable can be
correlated with the reward. A very easy example of that is a randomized
algorithm: depending on the internal coin tosses, both the running time and
the utility are affected, and they could be correlated.
And there's no metric here. All the jobs are at the same vertex. What can we
show you can get a constant factor for this problem. Mentioning up front we can
even handle general models where you can preemptively stop a job if it's taking
too long and try to run two jobs and so on. Maybe the distributions are such that
you can infer the conditional information about the distribution after the job has
run for a little while. So for five units of time maybe the distribution is such that
it's always going to take a million units and I guess you can throw this job and
move to different jobs.
>>: So it's more adaptive.
>> Ravishankar Krishnaswamy: It's not adaptive still. So figures when to stop
and when to move and all that up front.
>>: Thank you. Any correlation between --
>> Ravishankar Krishnaswamy: Yeah, let's talk about that. One thing I'd like
to say is that prior work assumed independence between these parameters.
There have been very nice algorithms that sort of motivated the framework that
we also use in an implicit manner, and they assume
independence. One example where they're not independent: for example,
let's say that when you start a task, it can crash immediately with some
probability. Clearly crashing does not give you any utility, right? If it crashes, then
you get a tiny size and zero reward; otherwise you get the real size and real
reward. That's a perfect example on which prior algorithms fail.
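A tiny illustration of the kind of correlation meant here, with made-up numbers: the crash outcome ties a small size to zero reward, so looking at the size and reward marginals independently is misleading.

    def crash_job(crash_prob, real_size, real_reward):
        # Hypothetical correlated distribution: with probability crash_prob the task dies
        # immediately (negligible size, no reward); otherwise it runs for its real size
        # and pays its real reward.
        return [(crash_prob, 0, 0), (1 - crash_prob, real_size, real_reward)]

    job = crash_job(crash_prob=0.9, real_size=5, real_reward=10)
    expected_size = sum(p * s for p, s, _ in job)     # 0.5
    expected_reward = sum(p * r for p, _, r in job)   # 1.0
    print(expected_size, expected_reward)
    # Treating the marginals as independent suggests a job that is usually cheap and still
    # worth 1 in expectation, but no single realization is both cheap and rewarding: the
    # cheap outcome pays nothing and the rewarding outcome costs the full 5 units.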
And just a few lines about how we handle correlations here. We go back to
the same framework: we need to come up with a deterministic surrogate,
solve the surrogate, and recover a good solution. What's a good choice of surrogate for
these problems?
Based on prior work, we want to use linear programming formulations as
our surrogates. These LPs use some kind of expected value: we throw
away everything about the distribution, use the expected values in the linear
programming formulation, and think of that as the surrogate.
So the surrogates are LPs. The only thing I'd like to state is that the natural LP does not
work as a surrogate, so we need to come up with a stronger LP to handle
correlations.
And maybe to address Nico's question, we can bring up the LP at the end of the
talk and argue how it helps, after the talk maybe. But if you use LPs as
surrogates, this embedding argument becomes easier, because it's just an averaging
argument -- mainly because LPs can tolerate fractional solutions. If you use a deterministic
problem like in the previous case, we could not tolerate fractional solutions.
And moving to a nonadaptive solution is simply an argument of randomized
rounding. These are both very nice tools that LPs provide for us. One reason
why this framework would not apply to the previous problem is that we don't know a good
LP relaxation for the orienteering problem.
So, like I said, the natural LP doesn't work; we have to use a strengthened LP.
Now, the second problem that I mentioned at the beginning of the talk is what's known
as the multi-armed bandits problem. Here's the problem: imagine we're given N
Markov chains up front, and each state in each Markov chain has
an associated reward and transition probabilities.
So the input consists of all these parameters: the input consists of the
Markov chains, the transition probabilities, and the reward of each state. And there's
also a starting state. And there's a budget of B.
Okay, I'll tell you what that budget means. What should the algorithm do?
At any time step, the algorithm should pick one Markov chain, and that Markov
chain will make a random transition according to its distribution, move to a
new state, and give us the reward of the new state.
In this example let's say that the budget is three, so I'm allowed to make
three pulls. I make the first move: maybe the algorithm chooses Markov
chain one. Then it makes a random transition and gives me the reward of
the blue state, which is $1. It gives me a reward of $1.
Now, depending on what happened here, the algorithm can choose
a Markov chain to move again. In the second time step maybe it chooses this
other Markov chain, because there's a very high chance that this one would give a zero
reward here -- so maybe it chooses to play the second one to be safe.
The second one makes a transition and gives me a third of a dollar. In the
third time step maybe the algorithm chooses to risk it all, because there's a tiny
chance you can get a large reward here; maybe it chooses the first chain again,
and so on. So the goal is to come up with such a policy to maximize the total
reward of all the states that you traverse within the budget of B.
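A minimal simulator of one run of this budgeted bandit process; the chain encoding, the policy interface, and the toy instance are assumptions made only for illustration.

    import random

    def play_budgeted_bandits(chains, policy, B, rng=random):
        # chains: list of arms; chains[i][s] = (reward of state s, [(prob, next_state), ...])
        # policy: function(tuple_of_current_states, pulls_left) -> index of the arm to pull
        # Each pull advances only the chosen arm by one random transition and collects the
        # reward of the state it lands in; the game lasts exactly B pulls.
        states = [0] * len(chains)                 # assume every arm starts in its state 0
        total = 0.0
        for pulls_left in range(B, 0, -1):
            i = policy(tuple(states), pulls_left)
            _, transitions = chains[i][states[i]]
            u, acc = rng.random(), 0.0
            for p, nxt in transitions:             # sample the chosen arm's next state
                acc += p
                if u <= acc:
                    states[i] = nxt
                    break
            total += chains[i][states[i]][0]       # reward of the newly reached state
        return total

    # Toy instance: a risky arm that may jump to a high-reward absorbing state, and a safe arm.
    risky = {0: (0, [(0.9, 1), (0.1, 2)]), 1: (0, [(1.0, 1)]), 2: (10, [(1.0, 2)])}
    safe = {0: (0, [(1.0, 1)]), 1: (1, [(1.0, 1)])}
    print(play_budgeted_bandits([risky, safe], lambda s, left: 0 if left > 1 else 1, B=3))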
Alternately, you could consider different objectives, where after playing for B
time steps the objective is to maximize the maximum over the final
configurations of all the Markov chains, and so on. The max version has been
studied from a machine learning point of view; we're looking at the sum objective.
So that's the problem here. For those of you who know the exploration/exploitation
framework, this is completely the exploration phase, because the objective
tells you what you can get in the exploitation phase: this is what you can do in B
time steps to discover the most about the chains. If time permits I'll talk
about this at the end of the talk. We look at this problem -- as Nico was asking,
this is a problem where adaptivity is key. We have examples to show that
nonadaptive solutions can perform arbitrarily worse than adaptive solutions. And
what do we show? We show constant factor
algorithms which are adaptive, and we also show adaptivity is necessary. Prior
work assumed that each Markov chain in itself should be a martingale -- it should
satisfy the martingale property -- and we don't need that. A simple example where this is
not true is the case of knapsack with correlated sizes and rewards. I'm not going
into the reduction, but that's a very simple example that's not a martingale.
So, again, what's the framework? We apply the same framework, and we want
to use LPs again -- strengthened LPs again, because it has to handle the
correlated knapsack as a special case. But the addition is that the LP rounding
should recover an adaptive strategy, not a nonadaptive
strategy. So this is the interesting part of this step here. So we've
looked at these three problems: for the orienteering problem the interesting part was that
the surrogate could not tolerate fractional solutions; in the second part we could strengthen
the LP to capture correlations; and in the third part we have to come up with rounding
algorithms that are adaptive.
So what about the future directions for this framework?
The first thing is that we have looked at approximation algorithms for problems from a
stochastic point of view, and there are lots of problems out there; we've just scratched the
surface, metaphorically speaking, and we only understand very basic
problems right now. Another example of problems we don't understand is settings
where there are inactive transitions in the bandit problem: Markov chains making
transitions even when you're not playing them actively. One thing we
don't know how to handle is that sort of correlation; there are much more global
correlations that we don't know how to handle in some sense.
Another thing, from a technical point of view: all our algorithms only work if the objective
is linear, so it's all expected reward and so on and so forth. One question is what about
nonlinear objectives; we need to come up with different approaches, different
analysis techniques for handling those. For example, one problem might be
that the objective itself is an optimization problem: do adaptive probing today to solve an
optimization problem tomorrow. One practical application of this, if you
think of a learning problem -- an immediate connection between our framework and the
learning setting -- is the domain of active learning. There the question is:
you have a prior distribution over possible classifiers, and the goal is to probe
some data points and figure out the labels so that, after probing say K data
points, you get the most information about the classifier and can minimize the
resulting error of the classifier. So it fits exactly into our framework; the only thing is that
the objective, the error of a classifier, is nonlinear rather than linear. So the question is
whether we can generalize our framework to nonlinear objectives so we can
solve such problems.
Some of the other research areas that I've been
looking at: one of them is network design. We look at problems like survivable
network design and so on, and we give good online algorithms for this
problem; the nice contribution is that we can use embeddings into
random subtrees to come up with a very nice set-cover-like encoding of
the survivable network design problem. Some of the other things we've looked at
are scheduling problems -- we have some nice algorithms for
broadcast scheduling -- and one thing I'm interested in looking at is how
we can solve online problems with convex
objectives using some primal-dual framework, which has applications in
scheduling and so on.
To summarize, we looked at
stochastic optimization from the approximations point of view and presented a general
framework for solving such problems using a deterministic surrogate. The key is to come up
with a nice surrogate which can handle whatever you want it to handle, and it
works for the three problems here. As for future directions, can we use this framework for
nonlinear objectives, which can help in machine learning and other problems?
And, yes, so this should be MSR -- this should still be [inaudible] because I
don't have this. Thanks a lot. And questions?
[applause].
>> Ravishankar Krishnaswamy: Any questions, comments, suggestions? Okay.
Thank you.
[applause]