>> Kostya Makarychev: Okay. I will start. So it's great to have Nikhil Devanur today speaking about fast algorithms for online stochastic convex programming. Nikhil is an expert on approximation algorithms, online algorithms, and algorithmic game theory, and today he's talking about, I guess, online algorithms, right?
>> Nikhil Devanur: Thanks, Kostya, and it's nice to speak here. I think I'm speaking here after a long time, so it's nice to be on the podium. And I will talk about this work, which is joint work with Shipra Agrawal from MSR India. As many of you know, I spent about a year in India, and this is one of the two papers I wrote while I was there; I really like both of them.
And it's a very simple talk. I'll define a problem,
I'll give you the solution, and I'll point to some
future work.
So what is the motivation? As, again, many of you know, I have been working on problems coming up in display advertising. This is the thing where we advertise on people's T-shirts. Just kidding. And prior to this work, the algorithms we designed have actually been used by Microsoft for the past few years to run this [indiscernible].
And here's the basic problem formulation. The basic decision you have to take in this problem is: you get an ad slot, or an ad opportunity, and this is called an impression; it's a technical term. And you have to assign this impression to one of many competing ads. So which ad do you assign the impression to?
So if you assign an ad J to impression T, then you get some value VTJ that depends on this pair. Now, you could say, okay, why don't you just assign each impression to the advertisement with the highest value? You don't do that because there are constraints. The constraints are that each ad has a target number of impressions, and you don't want to assign more than this many impressions to that advertisement.
Okay. It's still an easy problem if you know everything. Here's [indiscernible]: you can solve the LP, and you can also round it very easily, right? So you want to maximize the total value generated by the allocation, subject to the capacity constraints: for every ad J, you cannot assign more than GJ impressions, and every impression you can assign to one ad.
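A sketch of the offline LP being described, writing x_{tj} for the fractional assignment of impression t to ad j (the variable name is an assumption; the values v_{tj} and targets G_j follow the talk):

\[
\max \sum_{t,j} v_{tj}\, x_{tj}
\quad \text{s.t.} \quad
\sum_{t} x_{tj} \le G_j \ \ \forall j,
\qquad
\sum_{j} x_{tj} \le 1 \ \ \forall t,
\qquad
x \ge 0.
\]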
Okay. The difficulty comes from the online part. The thing is, you don't know the entire thing ahead of time. The impressions come one at a time, and you have to assign each impression without knowing what impressions you're going to get in the future. So there's uncertainty about the future; that's what makes the problem difficult.
>>: [inaudible] meaning that just one --
>> Nikhil Devanur: One impression.
>>: Just assign it to one of the ads. [inaudible]
>> Nikhil Devanur: It's just one impression.
>>: [inaudible]
>> Nikhil Devanur: I'm sorry? Yeah. It's just impression by impression.
>>: There is a sum over G --
>>: It should be J.
>> Nikhil Devanur: Oh, this should be J. Sorry. Sorry. Yeah, yeah. This should be a sum over J. I changed this thing here and didn't change this one.
Okay. So in an earlier paper, joint with Balu and others, we designed a near-optimal algorithm in a particular stochastic model -- I'll mention what the stochastic model is later. And this algorithm, with some modifications, is what is being used by Microsoft.
And then we met this guy called reality. So I'm exaggerating a little bit. The thing is, this was an essentially linear formulation, and in reality there are all these nonlinear things that creep up. One thing is called under-delivery. The way this works is, this target is not just an upper bound. It's essentially an upper and a lower bound: this is the number of impressions you promise the advertiser, and if you don't deliver this many impressions, then you pay a penalty. So if you promised a million impressions and you only deliver 90 percent, you have to pay a penalty for this ten percent.
And if you assume that this penalty is linear, then it still remains a linear formulation. But in reality, this penalty is nonlinear; it's some kind of convex function. So if you deliver only 50 percent, then it's really bad. So it's a nonlinear penalty.
The other thing is, advertisers require a mixture of different impression types, and if you deviate from this mixture, then it's not so good. So what does this look like? Again, it's not a linear relationship; it's some kind of convex penalty. There is an ideal mixture -- let's say you want 50/50 of men and women -- and the farther you deviate from this ideal, the worse it is, and again, it's not linear.
>>: [inaudible]
>> Nikhil Devanur: Yes. We know a lot of the time. Most of the time we know because they're logged in; they've told us what the [indiscernible] is. So that's just an example. We might know other things.
>>: You're saying it's not linear, but it's well defined. I mean, someone knows exactly how bad it is to have 48 percent men and 52 percent women. There is a score defined by some human?
>> Nikhil Devanur: Yeah, so you can model it. It's a little fuzzy. So advertisers don't really give you this function, but you can kind of capture this. Advertisers won't tell you, here is the penalty, but you know. So someone who knows the business can model this and say, okay, look, if we deviate too much, then this is how bad it should be. So some human will come up with this function.
>>: At least for under-delivery it's precisely known.
>> Nikhil Devanur: Under-delivery, yes, actually this is part of the contract. So this is part of the contract. So we know exactly --
>>: The algorithm designer --
>> Nikhil Devanur: For the algorithm designer, this is [inaudible]. This function is given. Okay? And another place where this comes up is this value that I mentioned: there are many things we care about. Revenue, of course, but also relevance, and maybe clicks, conversions. All kinds of things we care about.
One thing you can do is just take a linear combination, so then it becomes this linear objective again, but that may not capture what you really want. What you really want optimized might depend in some nonlinear way on all these different objectives.
So all these things are not captured, but the nice thing is that they occur either as a concave objective or as a convex constraint. So this is like a convex programming version, right? From linear programming we can go to convex programming. In the offline world we still know we can solve this, so there is hope.
So in practice what we did is, we took the algorithm for the linear version and, you know, did some hacks, some changes that kind of make sense, and it still works pretty well. But as a theoretician, that was not a principled approach, so I really wanted to figure out, okay, how can we handle these convex things that come up. So that's the motivation for this problem.
Okay. So now I'm going to give you a very general problem formulation that is going to capture all these things, and we call it online convex programming. We wanted to call it online convex optimization, but that was already taken, so we had to settle for online convex programming, but I think it's a very apt name.
So here you're given a convex function F, a constraint set S, and a time horizon T. These you're given ahead of time, and they're not changing. So this is what you're working with: the algorithm is given this objective function F, okay, and the same [indiscernible].
Now, this definition, it's going to look very
abstract, so bear with me for a moment and I'll show
you how this corresponds to the kind of problems
we're talking about.
So, abstractly, at every time T a request arrives. And what is a request? It's just a set of options. What the algorithm is required to do is pick an option from the set. What is an option? Each option is associated with a certain vector, okay? And this vector can correspond to costs or rewards: some dimensions can correspond to costs, some dimensions can correspond to rewards, and so on. So you encode everything in this vector. Okay?
So the choice set is just a set of vectors in R to the D, and the algorithm has to pick one vector from the set -- one option from this choice set -- at every time. And again, this is an online problem, so at every time you don't know what the future is going to be. You have to do it in an online fashion.
And what is your goal? The goal is to maximize this given function F of the average of all the vectors that you pick. You can think of it as a function of the sum; it's just more convenient for us to think of it as an average. The two are equivalent: if you know the time horizon, you can just scale everything.
Okay.
So this is -- and then the constraint is that the average has to lie in S. So this is a global constraint and a global objective, because it binds across the decisions you take across time. It's not per-time-step stuff; it's across what you do through the entire time horizon. So the average of everything has to lie in S.
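In symbols, the online convex programming problem as set up here, with v_t the vector picked at time t from the choice set A_t contained in [-1,1]^d (notation paraphrased from the talk):

\[
\max\ f\Big(\frac{1}{T}\sum_{t=1}^{T} v_t\Big)
\quad \text{s.t.} \quad
\frac{1}{T}\sum_{t=1}^{T} v_t \in S,
\qquad
v_t \in A_t \ \ \forall t,
\]

where each v_t must be chosen before A_{t+1}, ..., A_T are revealed.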
>>: [indiscernible] corresponds to some stochastic
constraint?
>> Nikhil Devanur: Yeah.
>>: Then does it make sense to average it over time?
I mean, I don't want to violate any capacity
throughout time.
>> Nikhil Devanur: So, yes. That's a very good question. In our linear programming formulation you had this capacity constraint, and when you specialize to that, we can actually make sure that you always satisfy it. But if you have arbitrary constraints, then it's not possible: in the intermediate steps it may not be possible to satisfy everything; it may only be possible to satisfy everything at the end. So at this level of generality, everything is only required to hold at the end.
>>: [inaudible]
>> Nikhil Devanur: You know the time horizon.
>>: [inaudible] because otherwise it could end at any time and expect you to be feasible all the time.
>> Nikhil Devanur: Yeah, you know the time horizon. Think of the opposite of packing, the covering version. In that case, you can only hope to satisfy it at the end, for instance.
So let's go back to -- yeah.
>>: Just to make sure I understand. So the thing you don't know is AT?
>> Nikhil Devanur: AT. You don't know what the ATs are.
>>: I see. And if you knew all the ATs ahead of time, then you would need just a [inaudible].
>> Nikhil Devanur: It's like a convex program, yeah.
>>: [indiscernible]. Oh, I see.
>> Nikhil Devanur: So, okay. If each AT is convex, if you want, then it becomes convex programming. Otherwise, if your ATs are discrete, there's still this issue of discrete versus continuous, but because of the averaging, that is only like a one over T thing. And I should say, this is not really R to the D; we assume everything is bounded, so it's minus one to one, to the D. So then the discreteness is not a big issue, and it's essentially convex programming. Okay? So think of AT as just a polytope.
>>: [indiscernible]
>> Nikhil Devanur: Convex [indiscernible], yeah. So
there are going to be more rounding issues, but it's
going to be easy rounding.
All right. So what about the display ads problem? This is a generalization of the display ads problem. How is it a generalization? Here's how we can encode it. The vector has N plus one dimensions: if there are N advertisers, it has N plus 1 dimensions. The first N dimensions encode which advertiser this corresponds to, so there's a one in the dimension corresponding to that advertiser. The last dimension corresponds to the value. Okay?
And your objective is just the last dimension; you just care about maximizing the total value. And what are the constraints? The capacity constraints are box constraints, right? I just divide by T because the constraint is on the average, okay? So in the J-th dimension, you cannot exceed GJ over T.
Okay. So if you pick this vector, it corresponds to assigning this impression to this advertiser: it uses up one unit of capacity and it gives you value VTJ. It's clear how this encodes the display ads problem, right?
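Spelling out the encoding just described (a paraphrase): the option of assigning impression t to advertiser j is the vector (e_j, v_{tj}) in R^{n+1}, with e_j the j-th standard basis vector; the objective is f(x) = x_{n+1}, and the constraint set is the box

\[
S = \{\, x : x_j \le G_j / T \ \text{ for } j = 1, \dots, n \,\}.
\]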
And now let's say you had a nonlinear under-delivery penalty. Here's how you would change it. Instead of taking F to be just the last coordinate, you would have this penalty term here, where GJ encodes what the penalty looks like. GJ only depends on XJ, which is the count of the number of impressions you assign to advertiser J. Okay. So you would have something like this. And if GJ is convex, this objective is concave, and you can maximize it.
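In symbols, the penalized objective being described is plausibly of the form

\[
f(x) = x_{n+1} - \sum_{j=1}^{n} g_j(x_j),
\]

where g_j is the under-delivery penalty for ad j; if each g_j is convex, f is concave.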
This also generalizes a more general linear version called online packing. You have these packing constraints and, you know, you can do a similar encoding. One example of online packing is a network routing problem: the i-th request is a request to route one unit of flow from a source SI to a sink TI in a network, and the edge capacities are given ahead of time. So that's a packing problem.
Another example is a combinatorial auction. Buyers arrive; they have combinatorial valuations for sets of items, and let's say you have many copies of each item. So a buyer arrives with some valuation; you have to maybe offer prices or, you know, do something else. You ask him to tell you his valuation in some way, then you assign a bundle of items to him and you charge him something, and then the next buyer arrives.
So this is also an online packing problem. So these kinds of things can be encoded in this format.
And even though I said that online packing is a linear version that is already solved, even for this we get an improvement. The improvement is technical: earlier, the guarantee of the kind I'm going to show later depended on the number of constraints, and the number of constraints in such a formulation, for instance a network routing formulation, could be exponential in the number of variables. Right? Because if the graph has M edges, the number of paths could be exponential. So the number of constraints would be exponential, and we get a guarantee depending only on the number of variables. So even in some cases for the linear version, we get an improvement.
And there are potentially other applications. We are looking into this, and people are studying similar problems. Some of them fit exactly in our model, some need some work, and for some it's not clear that they fit, but hopefully we can extend these techniques to them. There's a lot of work, especially in the operations research community, that looks at similar problems.
Okay. So one thing that I haven't mentioned yet is that this is in a stochastic model. So what is the stochastic model? This is not the usual worst-case model of competitive analysis. The model we consider is called the random permutation model. This should be familiar to people who have seen the secretary problem.
In this model, the collection of sets AT is adversarial. So an adversary picks the collection of ATs, but the order in which these are presented to the algorithm is random: it's a uniformly random order. That's the random permutation model. The stochasticity is only in the order in which they come.
There's a related model called the IID model. In this case there is a distribution on sets of vectors -- not a distribution on vectors, a distribution on sets of vectors -- and at every time this AT is generated from this distribution as an IID sample. The difference between these two is like the difference between sampling with and without replacement. And it's known that random permutation is the more difficult model: any algorithm that works for random permutation also works for IID, but not necessarily vice versa.
And there is also a model with time-varying distributions. This is like a generalization of IID that is not covered by random permutation; random permutation is inherently stationary. We can allow the distributions to change over time, but I'll not go into the details of this. For this talk, I'm just going to talk about the random permutation order. So just think of this.
>>: So chosen adversarially, but not known to me.
>> Nikhil Devanur: Unknown to you. All you know is that they're coming in a random order. Okay. And how do we measure the performance of the algorithm? We look at an additive notion, which is called the competitive difference. This is simply the difference between OPT -- which is: if I knew everything, what would I do? I would pick some vectors so that the average is in the set and maximizes the function of the average; that's OPT, the offline optimal solution -- and what we get.
The thing is, we cannot make sure that our average is actually in the set S. But we will make sure that the distance from the set is small. So it's like a bicriteria guarantee: we have to relax the constraints a little bit. Both of these quantities will go to zero, as you will see.
And in the special case -- so the traditional thing is the competitive ratio, which is a multiplicative notion, and that is what is used in the linear version, for instance. The difference there is, again, as I said, that the constraints have to be satisfied at all times for this packing problem. That is too strong for the general framework, as I said. But when you specialize to packing, we can make sure you satisfy the constraints at all times, and we can also make sure that we get the competitive ratio. So it's not like we are playing some games here; it's just the nature of the problem: because of this generality, we have to be additive. If you specialize to packing, our technique also gives you multiplicative guarantees.
>>: Is there some constraint on F except that it's convex or --
>> Nikhil Devanur: It's Lipschitz.
>>: It's Lipschitz.
>> Nikhil Devanur: Okay. So what has been done previously? As I said, mostly people have studied the linear version, this online packing especially, and there are a bunch of results that are dual based. Essentially the idea is to learn an optimal dual variable for this packing problem and use these dual variables to do the assignment. This is efficient in the sense that it has to solve a batch LP only a logarithmic number of times, so log T times; at every step, picking the vector from the choice set is fast. So in that sense it's efficient, but it is suboptimal: the guarantees that you get are not optimal.
And then there is the paper that I mentioned earlier, the one that is used in practice. It uses something called a hybrid argument. It is also efficient -- it also solves a batch LP log T times -- and it gets the optimal bound for the IID model, but for the random permutation model we didn't know any bound. So it only works for the IID model.
And then last year there was this paper by Kesselheim, et al., that used a primal approach -- there are no duals there. It was optimal for both the random permutation and the IID models, but it was really inefficient: it had to solve an LP at every time step, just to pick the vector. This is polynomial time, but it is not something we would ever use in the advertising application, or most of the applications, for that matter.
And so this result gives the best of both worlds, and also generalizes to convex programs. It is efficient, and it's a primal-dual approach. We have to solve a batch LP or a batch convex program: in the convex programming case, we again have to solve it log T times, but for LPs we have to solve it only once, which improves even on the earlier results that had to solve it log times. And it is optimal for both the random permutation and IID models.
And another thing that I like a lot about this: for these earlier ones, all of them, the proof is quite complicated; it's not completely clear what's happening. Here we give a very simple and modular proof. We have different components, and it's easy to see how these components interact with each other.
And simultaneously with our work, people did similar things, but again only for the linear version. So they give similar results for the linear version that match ours. And this other one is also a linear version, but with some local convex/concave objectives, not the global concave objectives that we have.
Okay. So what do we get? Essentially we get close to optimal results. We get a competitive difference of one over square root of T, so this goes to zero as T goes to infinity. This should look familiar to the people in online learning: the square root T regret. In particular, for the objective we get this one over square root T times Z plus L. L is the Lipschitz constant of the function that you asked about, and Z is another parameter of the problem; it's a problem-dependent parameter that I will talk about later. And for the constraint, there's no such thing; it's just one over root T. Okay?
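Suppressing constants, the guarantees being stated are roughly

\[
\mathrm{OPT} - f(\bar v) \le O\Big(\frac{Z + L}{\sqrt{T}}\Big),
\qquad
d(\bar v, S) \le O\Big(\frac{1}{\sqrt{T}}\Big),
\]

where \bar v is the algorithm's average vector, L is the Lipschitz constant of f, d(\cdot, S) is the distance to the constraint set, and Z is the problem-dependent parameter defined later in the talk.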
>>: Assuming everything on the vectors are [indiscernible] --
>> Nikhil Devanur: Minus one. C is just some [indiscernible] constant.
>>: Oh, okay.
>> Nikhil Devanur: Yeah, I don't know why I have order and C, yeah. And the thing is, we get even better for some special cases: if the function is smooth, the objective function is smooth, you can actually get log regret.
And this was not known earlier. The one over root T is tight in general, even for linear problems, but if the function is smooth, you get log regret. This comes from the corresponding log regret for strongly convex functions in online learning.
So we get the corresponding things here, and you'll see why, because of this kind of modular proof: we use online learning as a blackbox, and any improvement in the regret there translates into an improvement here.
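For reference, the standard duality being invoked here: if f is beta-smooth, i.e., its gradient is beta-Lipschitz, then its Fenchel conjugate

\[
f^{*}(\theta) = \sup_{x} \big( \theta \cdot x - f(x) \big)
\]

is (1/beta)-strongly convex, and vice versa; online learning with strongly convex losses admits O(log T / T) average regret.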
>>: [inaudible]
>> Nikhil Devanur: Yeah.
>>: So smooth --
>> Nikhil Devanur: So smooth here translates to strongly convex in the online learning, because that's the Fenchel conjugate of this. The online learning will have the Fenchel conjugate of F. So smooth becomes strongly convex there, and then --
>>: [inaudible] be able to see that immediately?
>> Nikhil Devanur: That the Fenchel conjugate of smooth is strongly convex?
>>: That is fine, but --
>> Nikhil Devanur: Yes.
>>: But why do you get to use it?
>> Nikhil Devanur: No, I haven't told you yet, so you shouldn't -- yeah. Maybe you can see it, but I couldn't.
Okay. So, yeah. And as I said, in the special case of online packing, we get the optimal competitive ratio that was known before.
Okay. So now to the good part, the algorithm, right? What I'm going to do is consider a very simple special case: a one-dimensional problem, okay? And there are no constraints; there's only an objective. And for the sake of making the diagrams nice, I'm going to switch to minimizing a convex function, okay? It's equivalent, right?
So you're given a convex function ahead of time -- that's all you're given -- and the number of time steps. And in every period you see a set of points. Again, these should be in minus 1 to 1. So you see a set of real numbers, and you have to pick one number. You see some numbers on this axis, you pick one, and then you repeat.
And the goal is to minimize the value of this convex function at the average of all the numbers that you picked. Okay? So this is a game. And, you know, we measure this competitive difference: we compare to the optimal choice in hindsight. Okay?
So here's an example. You're given this function ahead of time, and in the first time step you get two points; you have to pick one. Then in the next step you get, let's say, two more points; you have to pick one, and you repeat. Every time step you get two choices and you have to pick one. Okay?
And these are the points you picked. We're at the end of the algorithm, and we compute the average of these points and look at what the function is at this average. This is what we got. And you can look at all the points, all the choices you had, in hindsight, after everything is done, and you can say, okay, if I knew everything, I should have done something else, and that is your optimal solution. If you had picked these, you would have got this as the average, and you would have got something here.
And this is the competitive difference. Earlier we used to call it regret, and somebody pointed out that this is called competitive difference, not regret. So, yeah.
>>: Computing the best thing in hindsight, how do you do that?
>> Nikhil Devanur: Doesn't matter. You can use exponential time and do it.
>>: I thought it was something that couldn't be done.
>> Nikhil Devanur: So you can approximate it very well, once again. You could solve a convex program and round it, and it's going to be very close. But for this purpose, you might as well take exponential time and compute the optimum. Doesn't matter. You're going to see how we're not really going to care -- we get away from having to characterize this exactly, so you will see how it doesn't matter.
>>: [indiscernible]
>> Nikhil Devanur: So here there's no constraint; there's only an objective. So this is a very special case, but I just picked it to illustrate the algorithm's main ideas. And everything that I say generalizes very nicely. Okay?
>>: So this choice of the average regret will depend on your random samples? Like if you did the same random sample --
>> Nikhil Devanur: So in the random permutation model, the optimal solution doesn't depend on the stochasticity because the order doesn't matter. In the IID model it would matter; then you take the expectation of the optimum.
Okay. So here's the algorithm. In every time step, the algorithm is going to pick a tangent to this convex function. A tangent is parameterized by its slope; say the slope of this tangent is theta one, and if you fix theta, you can think of this tangent as a linear function of X. And I'm going to pretend that this is the function that I'm really optimizing. I'm going to forget about H; I'm going to just try to optimize L. Okay? In one dimension, that just means that if the slope is positive, you pick the point on the left, and if the slope is negative, vice versa, you pick the point on the right, because I'm minimizing.
But you can imagine that in many dimensions this is going to be nontrivial. Right? In many dimensions I'm going to have a hyperplane, which gives me a linear function, and given all the points, the algorithm will optimize this linear function over the choice set AT.
>>: So here you will always choose either the leftmost or the rightmost point?
>> Nikhil Devanur: Yes.
>>: Should it be clear to me that --
>> Nikhil Devanur: It's a linear function, right?
>>: I understand. The offline -- you take the offline problem.
>> Nikhil Devanur: Not offline. This is an algorithm.
>>: No, I understand. But now go to the hindsight. You have the offline problem. One problem is the original problem: just choose the optimal location. The second one is: choose the optimal location, but you're restricted --
>> Nikhil Devanur: Yeah, the offline --
>>: -- to the left or the right one.
>> Nikhil Devanur: Yeah.
>>: Isn't there even there a gap?
>> Nikhil Devanur: There shouldn't be, because of [inaudible].
>>: Should it be clear to us why that's the --
>> Nikhil Devanur: It shouldn't be clear. It's not clear [indiscernible]. That's a very good point. It's only because this works that there shouldn't be a gap. Because here, the algorithm is only picking one of the end points. It's a convex function, so -- yeah, that's a very good point. It's not clear a priori that you can restrict the algorithm to one of the two and get something close.
>>: [indiscernible] offline --
>> Nikhil Devanur: Yeah.
>>: [inaudible]. AT is always, you know, zero and X [indiscernible] always. 0X, 0X. I mean, maybe your algorithm will do something smart.
>>: You just choose --
>> Nikhil Devanur: No, no.
>>: [indiscernible]. At the beginning it could choose zero.
>> Nikhil Devanur: No, no, no.
>>: But think of the off -- just the offline version. You know everything.
>> Nikhil Devanur: You have to have three or more points for this to be nontrivial -- I mean, if there are only two points, of course you're either picking the leftmost or the rightmost. So you have to have three points or more. And you can imagine, actually, that if you have many points, maybe you always want to keep picking the points in the middle, whereas the algorithm will keep picking the extreme points, and it's not at all clear why this could do as well as a point in the middle. I mean, that's a very good point. It's not really clear; it's only through the analysis that, you know, it has to be.
>>: The gradient you're looking at, that's at the current average?
>> Nikhil Devanur: No. I have somehow picked theta one, and I'm looking at the tangent to H whose slope is theta one. And think of this as a linear function of X.
>>: Sure.
>> Nikhil Devanur: And I'm going to pick the point that minimizes this linear function.
>>: And that is your current --
>> Nikhil Devanur: No. So now I'm going to -- you know, think of this as a linear function of X, and I have two choices. Which one minimizes --
>>: [indiscernible]. Is the slope of the current --
>> Nikhil Devanur: It's the slope of my current linear function that I'm going to pick.
>>: Yeah, but I guess the question [indiscernible]. How do you pick --
>> Nikhil Devanur: Yeah, okay. So, yeah. The first time I pick the slope somehow, okay? So I just pick the point that minimizes. Now, how do I pick the slope in the next time step? I'm going to use this online learning as a blackbox. I'm going to feed what happened, essentially this XT, to it, and that is going to spit out the next slope. Okay? And I'm going to use that slope now. So it spits out the slope, and now when I get these two points, I'm going to pick this one, okay?
>>: [indiscernible]
>> Nikhil Devanur: And I repeat.
>>: [indiscernible] what happened.
>> Nikhil Devanur: Okay. I picked theta one somehow, arbitrarily. Doesn't matter. And that told me to pick this point. Okay. Now I'm going to feed this to this online learning blackbox. Exactly how this thing works I'll tell you later, but it's some blackbox: I'm going to feed it, and that blackbox is going to tell me what to do, what slope to pick next. It's going to say, okay, pick this slope. I'm going to next pick the tangent with this slope, right? And I'm going to repeat. So now I get these two points, and this thing says I should pick this one. So I pick this point, and then I go back to the blackbox, and that's going to tell me what the next slope should be. So now you see where the Fenchel comes in.
>>: There's something more, right? You keep saying it's a tangent.
>> Nikhil Devanur: Yeah.
>>: That's a very specific linear function.
>> Nikhil Devanur: Okay. So actually, the Y intercept doesn't matter.
>>: It doesn't matter.
>> Nikhil Devanur: It doesn't matter for the choice. But it matters for the analysis.
>>: Okay. [indiscernible].
>> Nikhil Devanur: For the choice it doesn't matter, right? You can move the tangent function up and down. The Y intercept is where the Fenchel conjugate comes in. So that's the Fenchel conjugate at theta.
>>: You basically look at, like, where the line with this slope touches --
>> Nikhil Devanur: Yeah. Usually there's an explicit formula. So this is always going to be the Fenchel conjugate, F star of theta. So it's just X times theta minus F star of theta.
>>: Well, what the blackbox takes is just the slope. The online learning --
>> Nikhil Devanur: Yeah. So, yeah. It takes this XT. XT is all that matters, really.
>>: Oh.
>> Nikhil Devanur: Assuming it knows H. It knows H, and it takes -- I'll tell you exactly what the blackbox does later, but think of it as a blackbox. There's one blackbox [indiscernible] and another blackbox for online learning. So think of it like that.
>>: [indiscernible] some kind of framework.
>> Nikhil Devanur: Yeah. I use this to define some kind of reward for that time step, and that's going to tell me the next step.
>>: [indiscernible] blackbox includes not only your slopes, but also the previous XT where it touches.
>> Nikhil Devanur: Yeah. I'm going to tell you exactly what it is. So for now, you know, wait. Have some patience, all right? I'll tell you. It's coming soon.
Okay. The algorithm is clear: there is some blackbox that keeps updating the theta, and that uses online learning. What I have not told you is how to define the reward. Okay. So that's the algorithm. Every time step I just do this, repeatedly. That's how I pick the points. So we're done.
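A minimal runnable sketch of this one-dimensional loop, with everything concrete filled in by assumption: the function h(x) = x^2 (so h*(theta) = theta^2/4) and online gradient ascent as the learning blackbox are illustrative choices, not the paper's prescription -- any low-regret learner would do.

    import random

    def run(T=1000):
        """Minimize h(average of picks) for h(x) = x^2 over random 1-D choice sets."""
        eta = 1.0 / T ** 0.5       # standard O(1/sqrt(T)) step size
        theta, picks = 0.0, []     # theta_1 is arbitrary, as in the talk
        for _ in range(T):
            A_t = [random.uniform(-1, 1) for _ in range(2)]   # two options
            # Pretend the objective is the tangent line: minimize theta * x over A_t.
            x_t = min(A_t, key=lambda x: theta * x)
            picks.append(x_t)
            # Online learning blackbox: gradient ascent on the concave reward
            # L(x_t, theta) = theta * x_t - h*(theta), whose theta-gradient is
            # x_t - theta/2 when h*(theta) = theta^2 / 4.
            theta += eta * (x_t - theta / 2.0)
            theta = max(-2.0, min(2.0, theta))  # keep slopes within h'([-1, 1])
        avg = sum(picks) / T
        return avg * avg           # h(average): the value we tried to minimize

    print(run())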
This is the average, this was the optimum in hindsight, and this is the competitive difference. So now, what's happening? Here I want to look at the average of the linear functions that I used, okay? So I used this L of X comma theta T; this is what I pretended I was optimizing, and I want to look at what happened to it. And what I'm going to argue is that this is a lower bound on this H of X star. Okay. And then, since this difference is the competitive difference, I only have to bound this gap; I really don't care about what X star is. Okay? Assuming that this is a lower bound -- and it's not clear why this should be a lower bound.
Okay? So what is this gap? This gap is the actual function H at my average -- the actual function that I should have been minimizing -- versus my pretend function that I was pretending to minimize. So what's the difference between the actual function and my pretense? Right. That is the gap I have to bound, and it gives me a bound on the competitive difference.
>>: It is a lower bound because you used the --
>> Nikhil Devanur: No. The lower bound uses the stochasticity. It's not true always. Because I wanted to -- you'll see in a little bit; I'll show you why this is a lower bound. It is a tangent, yes, but there is more to it, because this is only true in expectation.
So now think of -- I said, you know, think of random permutation, but for the moment this holds actually only for the IID case. Okay? So think of drawing from the empirical distribution, right? You've seen everything, and now let's say you draw one of the sets at random; think of it as sampling from the empirical distribution. And let's consider the expectation of this draw, the expectation of this L of XT comma theta T, okay? For each set, the one that we picked minimizes L, and so it is smaller than the one that would have been picked in OPT. And X star is the average of the optimal choices. So, because we're optimizing for L, we pick the best one, so it is better than the one in OPT. And that is why the expectation of this is less than L of X star comma theta T.
>>: So XT is your choice then?
>>: What is this expectation over?
>>: This XT is your choice to [indiscernible].
>> Nikhil Devanur: So actually, okay. So --
>>: Why is this not true without expectation? You're just saying XT is optimal for theta T, so it's better than X star.
>> Nikhil Devanur: So actually, here what it should be is --
>>: X star is not necessarily --
>> Nikhil Devanur: X star is an average.
>>: Oh, it's not an intercept. Gotcha. I see.
>> Nikhil Devanur: Yeah. So if you fix theta T, and now you're picking this set at random and you're picking the optimal one, then for every set, what you would pick is the point that does better.
>>: Yes.
>> Nikhil Devanur: So when you take the expectation, you just get X star.
>>: So this expectation is over which randomness?
>> Nikhil Devanur: So this is -- you fix theta T and you take the expectation over -- you pick the AT at random.
>>: [indiscernible]
>> Nikhil Devanur: Which set you're optimizing over. So this holds for the IID model, because there, every time step is a random sample. And this is exactly the difference between random permutation and IID. For random permutation, the expectation is going to be slightly off, and what we need to do is look at the difference between the expectations for random permutation and for IID, cumulatively over all the time steps; that's exactly the extra error term we get because of that.
And that is also like one over root T, so everything works out. So this is exactly the point where random permutation and IID differ. This was one of the things we didn't understand: okay, why do they differ? The earlier analyses didn't throw any light on it, whereas here we capture exactly where the two differ.
>>: The difference [indiscernible] is one over root T also [indiscernible].
>> Nikhil Devanur: No, no, actually the smooth version only works for --
>>: [indiscernible]
>> Nikhil Devanur: Yeah, so that only works for IID.
>>: It's going to be there, right? Because this root T is --
>> Nikhil Devanur: Okay. Yeah.
>>: And is this the only place where you use the stochastic model?
>> Nikhil Devanur: Yeah.
>>: You could also say that the AT could be completely adversarial, but you just want the X star to be in the convex hull of the AT at each time step. If that's true --
>> Nikhil Devanur: Yeah. Yeah.
>>: [indiscernible]
>> Nikhil Devanur: Yeah. There's a lemma that we use [inaudible]. And yeah, this part is because we're using a tangent. But this step is the nontrivial step where we use the stochasticity.
So the next slide: in expectation, this is the lower bound. It's not true point-wise -- at some points it will not be -- but in expectation it is true. And we can also convert it into high probability. So that's fine.
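In symbols, the step just argued (a paraphrase, in the minimization picture, with the expectation over the random draw of A_t and x_t* denoting OPT's choice from A_t):

\[
\mathbb{E}\big[ L(x_t, \theta_t) \big]
\le \mathbb{E}\big[ L(x_t^{*}, \theta_t) \big]
= L(x^{*}, \theta_t)
\le h(x^{*}),
\]

where the first inequality holds because x_t minimizes L(., theta_t) over A_t, the equality because L is linear in x and x* is the average of OPT's choices, and the last because a tangent lower-bounds the convex h.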
Okay. So the summary of all this is: we can forget about X star. We just have to bound this gap now, the gap between H at the average and the average of these linear functions.
Okay. And this is where -- okay, I already gave away some of the proof -- this is where online learning comes into the picture. Okay? And what is online learning? I think most people here know, but let me do a quick summary.
You want to make accurate predictions using what happened in the past. At every time T, you want to predict some vector theta T in some domain. And after you predict this, you get a reward, some reward function LT; this is a concave function. And you want to maximize -- so here, again, these are rewards; you want to maximize them. And the regret is: the reward of the single best theta in hindsight, if you knew all the reward functions, versus the reward of the thetas that you picked online.
Okay. So this is an online learning problem. Notice here that these rewards are local, unlike in our case where the objective is global: this is just a sum of the LTs, right? And there are many algorithms available here that essentially get this one over square root T regret. Here I define everything in terms of averages, so that the regret becomes one over square root T.
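With the averaged convention the talk uses, the online learning guarantee being invoked is

\[
\max_{\theta} \frac{1}{T} \sum_{t=1}^{T} L_t(\theta)
\;-\;
\frac{1}{T} \sum_{t=1}^{T} L_t(\theta_t)
\;=\;
O\Big(\frac{1}{\sqrt{T}}\Big),
\]

achievable by standard algorithms such as online gradient ascent.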
Okay. So how do we use this? In fact, it turns out that, the way I've set things up, the gap we have to bound is exactly the online learning regret, if you define this L accordingly.
Okay. So the reward that we use is this tangent. Now think of it as a function of theta. Earlier we were fixing theta and thinking of it as a function of X, but now I'm going to fix XT and think of it as a function of theta. So this is the reward, as a function of theta, that I'm going to feed back into the online learning. Okay. So what is this function? I fix XT, and as you change theta, the tangent changes, and the reward is the value of the tangent at XT. So this is a concave function, because here we are maximizing.
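Concretely, in the minimization picture the tangent to H with slope theta is the line x -> theta * x - H*(theta), so the reward handed to the learner at step t is, as a function of theta with x_t fixed,

\[
L(x_t, \theta) = \theta \cdot x_t - H^{*}(\theta),
\]

which is concave in theta because the Fenchel conjugate H* is convex.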
>>: Sorry. I'm a bit confused. Like I've got a set AT, right?
>> Nikhil Devanur: Yeah, XT is what I picked. I picked some point XT. Now I have to go to the online learning and define what the reward function is for this time step, for it to spit out the next slope, right? And what is the reward function? It's a function of theta, right? Now fix XT and look at this L of XT comma theta. That is a function of theta.
>>: [indiscernible] because XT could change. I mean, two XTs [indiscernible]. You have a slope at some point. If I tilt it that way, then its [indiscernible] are XT, right?
>> Nikhil Devanur: No, no. XT is whatever I picked -- the vector that I picked in the last time step.
>>: So this is [indiscernible] plus one.
>>: Oh, okay.
>> Nikhil Devanur: So my algorithm picked a particular XT in time step T.
>>: Okay. So here you're picking theta T plus one.
>> Nikhil Devanur: Here I'm picking theta T plus one.
>>: [indiscernible]
>> Nikhil Devanur: Okay? The algorithm picked a particular XT. Now I'm going to use this XT to define the reward for the online learning.
Okay. So people had questions about what this reward is, what it uses. This is what it uses: if you fix XT, that defines a function of theta, namely, you look at the tangent with slope theta and look at the value of this tangent at XT, and that is the reward. And this is a concave function. Okay?
So, now, what is this online learning regret? It is the reward of the optimal choice in hindsight, minus whatever the algorithm got. The latter is exactly the second term in our gap, right, which we have. And I'm going to show you in a moment that the first term, the optimum in hindsight, is going to be exactly this H of the average. Okay? So this regret is exactly the gap that we wanted to bound. Why is this? So note that L is linear in X. Okay? So for any theta -- theta star will give the optimum, right, but for any theta -- I can just take the average inside, because L is linear in X. It's a linear function, right, if you fix theta.
Okay. So now what is this function? You look at the average, and for any theta you look at the tangent with slope theta and evaluate the tangent at this average. Now you want to pick the theta that maximizes this. So what would you pick? You would, of course, pick the tangent at the average itself. Right? Because if you pick anything else, you'd only get a smaller value, so obviously the best thing to do is to pick the tangent at the average. And then you just get the value of the function.
>>: [indiscernible]
>> Nikhil Devanur: No, no. What is this function? This function is: you pick any tangent and look at what the value of the tangent is at the average. That is this function.
>>: [indiscernible]
>> Nikhil Devanur: I mean, this is my definition of L, right? L of XT comma theta, as a function of theta, is: if you take the tangent with slope theta, what is the value of this tangent at XT?
>>: [indiscernible]
>> Nikhil Devanur: Yeah, yeah. Convex H, of course.
Right. So this best thing in hindsight is exactly H of the average. So this regret is exactly the gap we wanted to bound. Okay?
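Written out, the two steps just described use linearity of L in x and then Fenchel biconjugation for the convex H (with \bar x the average of the picks):

\[
\max_{\theta} \frac{1}{T} \sum_{t=1}^{T} L(x_t, \theta)
= \max_{\theta} L(\bar x, \theta)
= \max_{\theta} \big( \theta \cdot \bar x - H^{*}(\theta) \big)
= H(\bar x).
\]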
So that's it. Note that there is a switch here, in some sense: the best thing in hindsight is somehow the performance of the algorithm, right? How much the algorithm got. And the online learning algorithm's reward is somehow the lower bound on OPT. So there is a switch, which is interesting. Maybe it's natural. I don't know.
Okay. So here's the overall algorithm. Of course, the same thing extends to many dimensions. Now, consider the case where, as in the special case, you have either only an objective or only a constraint. If you have only a constraint, it's like having only an objective: just the distance to the set. Okay. So if you have only the objective, in many dimensions you do exactly this.
And so these thetas are like Fenchel conjugates. These thetas give you some kind of linear function; you can interpret them as shadow prices and so on, based on the specific scenario.
And all you're doing is picking -- optimizing the linear function given by theta over this choice set. Okay. And then you're taking this and feeding it to the online learning, defining the reward appropriately. It's the same thing: take the tangent, which is now still a concave function of the thetas, and that's going to tell you what the next theta, the next slope, should be. And then you repeat. It's the exact same thing that I said; it works in any dimension.
>>: [indiscernible] for finding theta one.
>> Nikhil Devanur: So, okay. In this case, when you have either only an objective or only a constraint, I don't have to solve any LPs. I have to solve LPs when I have both. Even in the earlier work, you don't have to solve any LP if there is only an objective or only a constraint, right?
Okay. So now, what happens when you have both an objective and a constraint? We're going to run two separate instances of the scheme I showed you on the previous slide. For the objective, I'm going to have these thetas; for the constraint, I'm going to have lambdas. But now how do I pick the vector? I'm going to combine these two, and I'm going to combine them using a value Z. Okay?
So what is this Z? It's the tradeoff between the objective and the constraint: if I violate the constraint by epsilon, how much does the objective change? So OPT of epsilon is: I take an epsilon ball around S -- all points within epsilon distance of S -- and look at the OPT of this relaxed problem. That's OPT of epsilon. OPT of zero is just OPT. And Z is like the derivative of OPT of epsilon with respect to epsilon.
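In symbols, a paraphrase of the definition as described (the paper's exact normalization may differ):

\[
\mathrm{OPT}(\epsilon) = \max \{\, f(\bar v) : d(\bar v, S) \le \epsilon \,\},
\qquad
Z \approx \frac{\mathrm{OPT}(\epsilon) - \mathrm{OPT}(0)}{\epsilon},
\]

i.e., how much objective you can gain per unit of constraint violation.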
So this is what I said: if I violate the constraint a little bit, how much should I expect to gain in the objective? And you assume that you're given this Z ahead of time, or you can estimate it from a sample. This is where we need to solve a batch LP or a batch convex program. And the thing is, we only need a constant factor approximation to Z, so we don't need very accurate estimates. It's a much easier estimation problem. Okay?
And then what we're going to do is, in every time step, combine these two: we have the theta and we have the lambda, and we combine them using Z. This is again a linear function, and then we optimize this linear function over the choice set. Okay.
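A sketch of the combined selection step, with the exact weighting an assumption in the spirit of the talk: at time t, with objective slopes theta_t and constraint duals lambda_t, pick

\[
v_t \in \arg\max_{v \in A_t} \ \big( \theta_t + Z\, \lambda_t \big) \cdot v,
\]

and feed v_t back to both online learners.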
>>: This is also the reason why you do not solve LPs
often, right?
>> Nikhil Devanur: Exactly.
>>: Because with a small sample, you get a constant factor anyway.
>> Nikhil Devanur: Very good point. So for the LP version, you only have to solve it once to get a constant factor.
>>: Small samples of --
>> Nikhil Devanur: Yeah, a small sample, and only once is enough to get a constant. I just need an epsilon squared fraction of all the requests: if there are T requests, I take epsilon squared times T samples, and that already gives me a constant factor approximation. And I only have to do it once.
So why does this work? Okay. The analysis is a little more complicated and I don't have the time to go through it, but this works.
Okay. So let me summarize. Here are, I think, the salient features of this algorithm. It's very general: it works for essentially arbitrary convex constraints and concave rewards. It's [indiscernible], and the Fenchel duality -- I didn't say things in terms of Fenchel duality, but that is what is underlying it -- seems to be a very powerful, or at least the right, tool to attack this problem.
And for a long time we were struggling with this random permutation versus IID question. They should be the same or very similar, but we didn't really have a good explanation. This captures exactly where the difference comes from: for IID, the expectation is the same throughout; for random permutation, the expectation is changing, and it's just that difference.
So, you know, if you take the random permutation and at every time step you take the expectation over the remaining requests and see how different it is from the expectation over the whole, then you sum that up, and that's exactly the extra term you get for random permutation. So it really captures the difference between the two.
And the other nice thing is this modular proof, where we use online learning as a blackbox and we don't really care what algorithm you use for the online learning. Earlier proofs were very specific -- even the other work that I mentioned, the simultaneous or later work, was very specific to the particular algorithm used to solve the online learning. But here we show that it doesn't matter; all I care about is that it has low regret.
And this was actually conjectured back in 2007, when Mehta, et al. came up with the AdWords problem. They said, oh, there should be some relation to this experts algorithm, and all the subsequent papers hinted at this connection, but there was no formal connection, and we make the first formal connection.
And the other advantage beyond this is: if, for the learning problem, you can get better regret, then that translates to a better competitive difference, as we saw with the smooth functions. And of course, that is limited to IID; for random permutation, this extra term is already one over root T, so [indiscernible].
Okay. I think there's more to do here. One direction that could be useful is to go beyond the stochastic models that we have. We have IID, random permutation, and some time-varying distributions, but these are mostly stateless. So maybe with some kind of Markovian process or something, we might be able to extend it.
The other problem is -- there is a scheduling version of this, which doesn't quite fall into the general framework that I mentioned, and I think figuring out similar algorithms and similar guarantees for scheduling would be very interesting, so we are working on this with Balu and others.
And for me personally, another thing I would like to see is more applications of this. It seems like a very general framework, so I'm interested in: what are the other problems that people have worked on that we can formulate in this framework? Maybe that will point to some other ways of refining the model. Maybe they won't fit in exactly, but close enough that we can extend it. So, thanks.
[applause]
>>: Questions?
>>: Can you go the other way? Is there a connection [indiscernible]?
>> Nikhil Devanur: That's a good question. Maybe.
>>: Just like [inaudible] --
>> Nikhil Devanur: I actually haven't thought about that. But it is definitely possible.
>>: There is some recent work on the worst-case analysis of online algorithms with convex costs. How are these algorithms related? I mean, I noticed --
>> Nikhil Devanur: The problems are a little different.
>>: The resulting algorithms are similar because --
>> Nikhil Devanur: Not sure. I think -- the algorithms, are they similar? No, probably not. The results are different. There you have to keep covering at every time step, so you have to maintain feasibility throughout, and that makes a difference.
>>: So you have more freedom intuitively to --
>> Nikhil Devanur: Yeah. We only have to satisfy at
the end. So yeah, I can say we have more freedom.
And also we're crucially using the stochasticity.
Maybe not so crucially as I mentioned. Yeah.
Actually, that's a good point.
>>: You can just take the condition.
>> Nikhil Devanur: Yeah, but -- or else it's only -- it holds only in expectation or something, so it doesn't follow; that condition may be too strong, right?
>>: I was saying --
>> Nikhil Devanur: No, no. It's like two statements: either this or that.
>>: You can say -- I mean, that you have a distribution that satisfies this. The distribution could be, you know -- [indiscernible] I mean, something.
>> Nikhil Devanur: Okay, okay, yeah. That's a good point.
>>: Yeah. Some factor [indiscernible].
>> Nikhil Devanur: So this is -- what I'm saying is actually very similar to the time-varying distribution model that I mentioned but didn't describe.
>>: Okay.
>> Nikhil Devanur: It's very similar to that. In some sense, that's what the time-varying distribution model does. Still, each step has to be independent of the other steps, but that distribution, in some sense, has to include this X star.
>>: [indiscernible]
>> Nikhil Devanur: Yeah, so if you go to the linear version, this one over root T is tight.
>>: For a constraint as well?
>> Nikhil Devanur: Yeah. Yeah. Unconstrained. So you can, I guess with a linear [indiscernible], you can just think of it as the [indiscernible] from the set kind of thing, I think. I don't know, Balu. Even for just minimizing a function -- yeah, I don't know. But at least for the packing problem, where you have a linear function and you have packing constraints, it's tight. For smooth functions it's not; I mean, you can get this log T, for instance.
>>: [indiscernible]
>> Nikhil Devanur: That's a good question. So the thing is, even though it's only the linear version that we solved and the rest is hacks, it works very well. It works so well, they're so happy with it, that they're not developing the algorithm anymore; they're focusing on other things. Currently they're happy with the algorithm as it is. Maybe at some point they'll come back to it and want to revamp it; maybe then, you know, we will get them to use this. But as of now, they are quite happy with it, so --
>>: So right now it's a linear model that --
>> Nikhil Devanur: Yeah. Yeah.
>>: No need to solve an LP, you're not solving an LP --
>> Nikhil Devanur: So, yeah, they are solving an LP. They solve this LP offline, and then that kind of feeds the online part. The online part doesn't have an LP. It's just --
>>: [indiscernible]
>> Nikhil Devanur: The offline LP is big.
>>: Offline LP [indiscernible].
>>: But what if they use --
>> Nikhil Devanur: So actually, they're using the same algorithm.
>>: What tools are they using for that?
>> Nikhil Devanur: They're using the same algorithm. When you have to solve it offline, you can shuffle the order, so you know you're getting a random order. That way you can solve it [indiscernible].
>>: I think there's another interesting model. So in your [indiscernible] in terms of T, [indiscernible] you have this [indiscernible]. But think about it: the variety is not really [indiscernible]. That means your AT is actually -- somehow from a finite set. [indiscernible].
>> Nikhil Devanur: Yeah.
>>: The same requests, like, a third time.
>> Nikhil Devanur: But that's something -- I don't think the support of this distribution is small. Right? It's a distribution on sets of vectors, and the support of this distribution, I don't think, is small. I think the support is much bigger than T or something. That's how I would model it, right? Or at least as big as T.
>>: What happens as far as [indiscernible]?
>> Nikhil Devanur: If the support is small, then yes, you could actually learn the distribution and do something simple. But every impression is essentially different. So --
>>: [indiscernible]
>> Nikhil Devanur: So the support is huge. And
that's why we need to do something other than learn
the distribution.
[applause]