>> Sebastien Bubeck: All right. So it's my pleasure to introduce
Hariharan Narayanan from the University of Washington. So many of us
in the room have been working with self-concordant barriers. And Hari
will tell us how to use them for sampling polytopes.
>> Hariharan Narayanan: Thank you for the invitation to speak here.
It's a pleasure to be here. The title of my talk is randomized
interior point methods for sampling and optimization. So in this talk
I'll put together work that appears in three papers. The first was
joint work with Ravi Kannan, called "Random walks on polytopes and an affine interior point method for linear programming." Then in later work I generalized this to hyperbolic barriers and certain other barriers. And I will also talk about an application, which is joint work with Alexander Rakhlin, to regret minimization in online optimization. So the task of randomly sampling a polytope has the following form: we are given P, an n-dimensional polytope given by m linear inequalities, and an interior point x. The task is to sample the polytope at random, and we will do this by discretizing a diffusion process designed to mix fast.
So why should one be interested in sampling polytopes? Well, one
application is to sampling lattice points in polytopes. This gives
rise to algorithms for sampling combinatorial structures such as contingency tables. It also gives rise to algorithms for volume computation: the task of sampling a polytope is used as a black box in volume computation algorithms, which in turn lead to algorithms for counting lattice points in polytopes. And finally, as I'll discuss today, these give rise to randomized interior point methods for convex optimization. The model in past work in this area has been somewhat different: the convex set is specified using what is called a membership oracle. Sorry -- so the convex set is invisible, unfortunately, but it is sandwiched between two balls here, an inner ball and an outer ball. And given a query point x, the answer is yes if x belongs to the convex set and no otherwise.
So past approaches to sample convex sets have included the grid walk
due to Dyer, Frieze, and Kannan, which had a mixing time of n to the 23 from a warm start after preprocessing. By a warm start, I mean a starting distribution the L-infinity norm of whose Radon-Nikodym derivative with respect to the stationary measure is bounded by a universal constant. This is how the grid walk works: you basically take small steps on a grid. Another walk, considered by Kannan and Lovász, is the ball walk: you take a ball of a certain radius around the current point, pick a random point in the ball, and repeat this procedure. This was shown to have a much faster mixing time, namely O*(n^3) from a warm start
after preprocessing. Then there's the random walk called hit-and-run, which was first analyzed by Lovász, and then the analysis was elaborated upon and improved by Lovász and Vempala. Here, in one transition, you pick a random chord through the current point and pick a random point on that chord, and repeat; this is called hit-and-run. The mixing time from a fixed point was of order n^3 (R/r)^2 log(R/(d epsilon)), where d is the distance of the initial starting point from the boundary, R is the radius of the circumscribed ball, r is the radius of the inscribed ball, and epsilon is the total variation distance to uniformity that you desire. So the mixing time from a fixed point is a factor of order n more than the mixing time from a warm start. This is related to the fact that in high dimensions, if you start at a single point and take one step, then you essentially start with a distribution whose L2 norm is exponentially large in n, and so you pay a penalty of n when starting from a fixed point. Now, for a polytope with at most poly(n) faces, R over r cannot be made better than of order n^(1/2 - delta), and even this is in fact achievable directly only for symmetric polytopes.
If you have asymmetric polytopes, then what you need to do is put them in a certain position and then shave off their corners by intersecting them with a ball of radius about root n. So it's not that you can actually sandwich the entire convex set between balls of radius 1 and root n; you have parts that stick out of the outer ball, but they're very small in measure, so you can ignore them. But this gives the n^(1/2) factor, and the overall bound on the mixing time from a fixed point is therefore n^(4 - delta). So: n^3 from a warm start, and n^(4 - delta) from a fixed point. The mixing time of the new Markov chain that I'll discuss today is as follows. For sufficiently large n, the mixing time from a fixed point is this quantity on the slide, involving the product mn: m is the number of faces of the polytope, n is the dimension, and s is a notion of centrality of the starting point x -- the maximum, over all chords passing through x, of the ratio of the longer part to the shorter part. So --
>>:
[indiscernible].
>> Hariharan Narayanan: x is the starting point. So, for sufficiently large n, the mixing time to get to within epsilon total variation distance from a warm start is of order mn. I have a somewhat crude upper bound on the number of arithmetic operations it takes for one step, which is m to the gamma. There is a better analysis which gives essentially the number of nonzeros in the constraint matrix of the polytope plus order n squared, so this m to the gamma can be replaced by something like that. As a corollary, we see that the number of random walk steps to mix is smaller for this walk if the number of faces is order n squared, and it takes fewer arithmetic operations if m is order n to the 1.46; again, with the better analysis this threshold becomes order n squared as well.
>>: Is there any condition on the polytope, like being isotropic?
>> Hariharan Narayanan: No. The random walk that I'm going to discuss is affine invariant, so it won't matter.
>>: Interesting.
>> Hariharan Narayanan: So, in order to define the random walk, I'm going to have to define Dikin ellipsoids, because this is how one step of the Markov chain will be made. The Dikin ellipsoid of a polytope around a point x is defined as the set of all points y such that (x minus y), transposed, multiplied by the matrix that is the sum over i of a_i a_i transpose divided by (1 minus a_i transpose x) squared, multiplied again by (x minus y), is at most r squared. Here a_i transpose x less than or equal to 1, for i = 1 through m, is the polytope, and a_i is the i-th row of the matrix A. So for x belonging to P, D_x is the set of all y such that this deviation vector from x, measured in the quadratic form given by this matrix -- which is actually the Hessian of a certain barrier function that I'll come to -- is less than or equal to r squared, and for us r is the square root of 3/40 in the polytope case. To take a step back: there is some convex function whose Hessian, if you take it, gives you exactly this matrix, and the quadratic form being at most r squared defines the ellipsoid. These ellipsoids have nice properties. For one, if you take, around any point x, the symmetrization of the polytope, then the Dikin ellipsoid dilated by the square root of m contains that symmetrization. It's too much to ask that the Dikin ellipsoid dilated by square root m contain the polytope itself, because you can go very close to a corner and then the Dikin ellipsoid will be tiny. But what is almost as good for our purposes is that when you dilate it by square root m, it contains the symmetrization of the polytope at that point.
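As a concrete illustration (a minimal sketch, assuming the polytope is stored as the rows a_i of a matrix A with a_i^T x <= 1; the helper names are my own):

```python
import numpy as np

def barrier_hessian(A, x):
    """Hessian of the log barrier F(x) = -sum_i log(1 - a_i^T x) for the
    polytope {x : A x <= 1}: H(x) = sum_i a_i a_i^T / (1 - a_i^T x)^2."""
    s = 1.0 - A @ x                         # slacks; positive in the interior
    return A.T @ ((1.0 / s**2)[:, None] * A)

def in_dikin_ellipsoid(A, x, y, r):
    """Check whether y lies in D_x = {y : (y - x)^T H(x) (y - x) <= r^2}."""
    d = y - x
    return float(d @ barrier_hessian(A, x) @ d) <= r**2
```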
>>:
Excuse me, what do you mean by symmetrization?
>> Hariharan Narayanan: It's the set of all points with the property that both the point and its reflection about the central point belong to the polytope. So it's P intersected with minus P when that point is the origin. Now, Dikin's affine scaling LP algorithm -- this was where Dikin proposed the Dikin ellipsoid in the first place -- does the following:
You start at a point, take a Dikin ellipsoid, and you optimize the
linear function over the Dikin ellipsoid. Pick the new point, pick a
new ellipsoid around the new point. And repeat. So this algorithm has
no polynomial time guarantees and is believed to not be in polynomial
time. So although it's a very natural-looking interior point method, it is believed to be not polynomial time. On the other hand, the Dikin walk, which I'm going to describe, with some modifications leads to polynomial time algorithms for linear programming. And the Dikin walk -- now that
I've defined the Dikin ellipsoid, I can talk about the Dikin walk. It's defined as follows: you take a point, take the Dikin ellipsoid around that point, and pick a random point inside the Dikin ellipsoid. Then look at the new Dikin ellipsoid around the proposed point. If the new Dikin ellipsoid contains the old point, then accept the new point with probability equal to the minimum of one and vol(D_x0)/vol(D_x1) -- the volume of the initial Dikin ellipsoid divided by the volume of the new one. If the old point x0 does not belong to D_x1, then don't accept the move at all; just reject it. So this is the random walk. As you can see, if you start at a corner, the ellipsoid is small, and as you move closer to the interior -- although in high dimensions you never really go deep into the interior; you kind of move along the surface of the polytope -- still, when you're far from very low dimensional corners, the sizes of these ellipsoids become bigger. So this is a natural discretization of a Brownian motion with drift on a certain manifold, where the metric is the one for which these ellipsoids are the unit balls in the tangent spaces. That diffusion process satisfies a Fokker-Planck equation which is very much akin to the heat equation, but whose stationary measure is not the uniform measure: once you equip the polytope with this different metric, it also gets a natural measure, whose density with respect to the uniform measure is proportional to the inverse of the volume of the Dikin ellipsoids. So that's that measure, actually. And so there's a natural process in the background here.
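As a rough sketch of one step of this walk (not the exact constants or step radius from the paper; it reuses barrier_hessian from the earlier snippet and assumes a step radius r), note that vol(D_x) is proportional to det H(x)^(-1/2), so the acceptance ratio vol(D_x)/vol(D_y) equals sqrt(det H(y)/det H(x)):

```python
import numpy as np

def sample_ball(n, rng):
    """Uniform random point in the unit Euclidean ball in R^n."""
    g = rng.standard_normal(n)
    g /= np.linalg.norm(g)
    return g * rng.uniform() ** (1.0 / n)

def dikin_walk_step(A, x, r, rng):
    """One step of a Dikin-walk-style chain on {x : A x <= 1}: propose a
    uniform point in D_x, then apply the Metropolis filter described above
    so that the uniform distribution on the polytope is stationary."""
    Hx = barrier_hessian(A, x)                 # helper from the earlier sketch
    Lx = np.linalg.cholesky(Hx)
    y = x + np.linalg.solve(Lx.T, r * sample_ball(len(x), rng))
    if np.any(A @ y >= 1.0):                   # proposal left the polytope
        return x
    Hy = barrier_hessian(A, y)
    d = x - y
    if d @ Hy @ d > r**2:                      # old point not in D_y: reject
        return x
    # vol(D_x)/vol(D_y) = sqrt(det H(y) / det H(x))
    (_, ldx), (_, ldy) = np.linalg.slogdet(Hx), np.linalg.slogdet(Hy)
    accept = min(1.0, np.exp(0.5 * (ldy - ldx)))
    return y if rng.uniform() < accept else x
```

Here rng would be a numpy Generator such as np.random.default_rng(), and r a small constant like the square root of 3/40 mentioned above.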
>>: Is this even with the rejection step? I mean, if we are not doing the rejection --
>> Hariharan Narayanan: I'm not claiming an exact correspondence. I'm just saying that this is the related continuous process -- yeah, I don't know if, with the rejection step, when you take the step size delta to 0, you actually get this or not. I don't know if formally that's true. So
the hit-and-run walk of Lovász involves a transition where, starting at x, you draw a random chord and pick a random point on the chord -- and the mixing from a warm start -- I'm sorry, that should not be there. So the mixing time from a warm start here, for the Dikin walk, is mn, whereas the mixing time from a warm start for hit-and-run was n cubed. This algorithm was used in integer programming by Huang and Mehrotra, and the idea there is that you want to find not only a point with a large objective value, but a point which is integral and has a large objective value. To do this, instead of running a normal interior point method, they used a variant of the Dikin walk -- short and long step Dikin walks -- and got a point that was not exactly optimal but close to optimal and a bit random; then they did some additional work to make it an integer point. They basically rounded it, and if it landed outside the convex polytope they took the nearest point in the convex set to that new point, and then again rounded it, and did this repeatedly. In the end they would get an integer solution, add a cutting plane at that integer solution, and repeat the whole process. So that was what they did.
So the way we're going to analyze the Markov chain here is by getting a lower bound on the conductance. The conductance phi of this Markov chain is obtained by taking the infimum, over all sets S of measure at most one half, of the probability of escaping from S: pick a random point x in S according to the stationary measure, take one step, and look at the probability that during this experiment you move from one side to the other. And Lovász and Simonovits in '93 proved the following bound. They proved that if the starting distribution has density rho with the supremum of rho(x) equal to M, and you run the random walk, then for all sets S the probability that X_k belongs to S differs from mu(S) by at most the square root of M times e to the minus k phi squared over 2. So even if M is exponentially large in n, that only gets translated into an additional factor of n in k, and so you can get bounds from a fixed point as well. So the mixing time from a warm start is of order 1 over phi squared, and now the question is how do we bound the conductance.
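Written out (a restatement of the definitions just described, with $\mu$ the stationary measure, $P_x$ the one-step transition kernel from $x$, and $\rho$ the starting density):

$$\phi \;=\; \inf_{S:\ \mu(S)\le 1/2}\ \frac{\int_S P_x(P\setminus S)\, d\mu(x)}{\mu(S)},\qquad
\big|\Pr[X_k\in S]-\mu(S)\big| \;\le\; \sqrt{M}\; e^{-k\phi^2/2},\qquad M=\sup_x \rho(x).$$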
The way the conductance is bounded in this situation is: you first get a lower bound, which is purely geometric, on the isoperimetry of the convex polytope equipped with this particular metric, and then show that the transition kernels of the Markov chain are, in some sense, faithful to this metric -- if you take two points that are geometrically close, then the total variation distance between the transition kernels corresponding to those two points is bounded away from 1 by Omega(1). So, as I discussed, you take the convex set and you give it a metric where tiny distances are measured by the ellipsoids; the ellipsoids are the unit balls of that metric, locally. For that metric space we want an isoperimetric inequality. It turns out that Lovász analyzed hit-and-run using a different metric, called the Hilbert metric, and our metric is approximately isometric to the Hilbert metric up to a square root m factor. That allows us to analyze the isoperimetry of our metric using that of the Hilbert metric. Here is what the Hilbert metric is.
If you take two points x and y, with u and v the endpoints of the chord through x and y (u beyond x, v beyond y), you define sigma(x, y) as |x minus y| times |u minus v|, divided by |u minus x| times |y minus v|. This is not a Euclidean distance; this is a projective cross-ratio. If you take the log of 1 plus this, then that becomes the distance. So log(1 + sigma(x, y)) is the Hilbert distance.
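A small numeric sketch of this cross-ratio distance inside a polytope {z : A z <= 1} (my own helper names; it assumes the chord through x and y actually hits the boundary in both directions):

```python
import numpy as np

def chord_endpoint(A, x, direction):
    """Boundary point of {z : A z <= 1} reached from x along `direction`
    (assumes the polytope is bounded in that direction)."""
    num = 1.0 - A @ x
    den = A @ direction
    t = np.min(num[den > 0] / den[den > 0])
    return x + t * direction

def hilbert_distance(A, x, y):
    """Cross-ratio (Hilbert) distance between interior points x, y."""
    d = y - x
    v = chord_endpoint(A, y, d)      # chord endpoint beyond y
    u = chord_endpoint(A, x, -d)     # chord endpoint beyond x
    sigma = (np.linalg.norm(x - y) * np.linalg.norm(u - v)) / (
        np.linalg.norm(u - x) * np.linalg.norm(y - v))
    return np.log1p(sigma)
```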
And what Lovász proved was that if you put the uniform measure on the polytope P and you partition it into three parts -- S1 prime is one part, S2 prime is another part, and P minus S1 prime minus S2 prime is the part in between -- then the measure of P multiplied by the measure of the part in between is greater than or equal to the distance between the two parts multiplied by mu of S1 prime multiplied by mu of S2 prime. By the distance between the two parts, I mean the cross-ratio distance between the nearest pair x and y, with x in the first part and y in the second part. So this gives rise to an isoperimetric inequality for our setting also, where the metric is given locally by the Dikin ellipsoids. So there's something not working right with my slides.
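Stated symbolically (my reconstruction of the inequality as described, with $d_\sigma$ the cross-ratio distance between the two parts):

$$\mu(P)\,\mu\big(P\setminus(S_1'\cup S_2')\big)\;\ge\; d_\sigma(S_1',S_2')\,\mu(S_1')\,\mu(S_2').$$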
>>:
[indiscernible].
>> Hariharan Narayanan: Yes. As I mentioned before, the Dikin ellipsoids, if you dilate them by square root m, contain the symmetrization of the polytope. And what I forgot to mention was that, as defined, they are themselves contained inside the polytope. So the symmetrization of the convex polytope about a point is sandwiched between the Dikin ellipsoid and its dilation by square root m. Because of that, the Hilbert metric is within a factor of square root m of the Dikin metric, and this gives rise to the isoperimetric inequality for the Dikin metric. If you have product polytopes you can do better, because if you take manifolds which are products of smaller manifolds, then the isoperimetric constant is not much worse than that of the worst factor. So this m can be replaced by the m of the smaller polytopes which are used in the [indiscernible] product. But that's a special case. So next, the
second step is to be able to relate the total variation distance between the transition density functions for two nearby points. Here I want to make a point: I said that every step is made within a Dikin ellipsoid and so on, but once you apply the rejection step -- and that rejection step is quite extreme in the case of polytopes, because you're taking pretty large steps and in the worst case you end up rejecting one whole side -- the effective one-step distributions don't really look like ellipsoids anymore. They look like these sorts of truncated regions, and you need to argue about the overlap of these sorts of things. So the lemma is that if the distance between x and y is less than 1 over root n, then the total variation distance between P_x and P_y is less than 1 minus Omega(1). And how does that go? Basically we're going to use the isoperimetric inequality. This is a lemma which I'll discuss a little later, perhaps; first I'm going to talk about the consequence of this lemma for the conductance. We want to bound the conductance, and we want to do so using the isoperimetric inequality. So what we do is take a cut, and then we want to find out what the probability is of moving from S1 to S2, and we want to give a lower bound on that. So we associate with S1 a set S1 prime, which is the set of all points x in S1 such that P_x(S2) is less than delta over 2. These points are somehow deeper inside S1, and they don't go into S2 with good probability. Similarly we associate with S2 a set S2 prime, the points deeper inside S2 which don't go into S1 with good probability. What we want to say is that the mass of the intermediate portion in between these is actually large; that portion corresponds to points that go to the other side with good probability, and so the probability of going to the other side is large. So that's going to be the argument.
So let x belong to S1 prime and y to S2 prime. Then this implies that the total variation distance between P_x and P_y is greater than 1 minus delta, because P_x is mostly supported inside S1 and P_y is mostly supported inside S2. And because these kernels are so far apart, by the lemma x and y must be far apart in the metric: if x and y were close by, then d_TV(P_x, P_y) would be small, so here x and y are far apart. But now we are in a position to use the isoperimetric inequality: by the bound on the Cheeger constant, the measure of the part in between is large -- it has measure Omega(1 over root(mn)). The 1 over root n comes from the lemma, and the 1 over root m comes from the isoperimetric inequality for the Dikin metric. Therefore a point in this band jumps to the other side with probability Omega(1), which implies that phi is at least Omega(1 over root(mn)). That is the argument that the conductance is Omega(1 over root(mn)). So to prove the
lemma, we have to -- I'm not going to prove the full lemma, but let me just tell you what the steps are. Certainly, if you want to prove that when two points are close, d_TV(P_x, P_y) is bounded away from one, you need to be able to show that if two points are at infinitesimally small distance from each other, then d_TV(P_x, P_y) is bounded away from 1. That is related to the probability of a proper move -- the probability that you're not stuck at a given point. So suppose you want to show that you're not stuck at a given point with good probability. What you need to show is that, first of all, when you make a move from x -- when you pick a proposed point w -- then x is likely to be contained in D_w. Because if x is not likely to be contained in D_w, then we are rejecting the move with 100 percent probability.
>>:
Sigma, what distance is this one?
>> Hariharan Narayanan: Sigma here is the Lovász cross-ratio distance. So this lemma is not in terms of the Dikin metric; it's in terms of the Lovász metric, but the scaling is right. So sigma(x, y) is going to be about 1 over root n in this case.
>>: I forgot, what is D omega?
>> Hariharan Narayanan: Omega is the proposed point, and D omega is the Dikin ellipsoid centered at omega.
>>: Oh, okay.
>> Hariharan Narayanan: Yes. Sorry -- x is actually contained in D_w. So this is something you need to show, because otherwise, every time you make a move it is going to be impossible to come back, and you'll be forced to reject it. When you impose a Metropolis filter, you need to impose it in a way such that the moves that go in one direction are also moves that can go the other way; otherwise the chain is not going to be reversible, and we don't know how to analyze non-reversible Markov chains in this setting. Second, the volume of the new ellipsoid is unlikely to be much different from the previous one. Because, if you recall, there are two things involved in the rejection: the first is whether the new Dikin ellipsoid contains the old point, and the second was the ratio between the volumes of the two Dikin ellipsoids. So we don't want the volume of the new ellipsoid to be much larger than the current one, and we prove something like that.
>>:
What's the [indiscernible].
>>:
Error function.
>> Hariharan Narayanan: Error function. So step two follows from these two facts. First, the gradient of the log of the volume of the Dikin ellipsoid -- this is essentially the volumetric barrier -- is less than or equal to square root n in the local metric. If you take a Dikin ellipsoid, rescale it so that the Dikin ellipsoid is the unit ball, and then measure the gradient of the log of the volume of the Dikin ellipsoid around it, then that is less than or equal to root n. And we know that when you pick a random point in high dimensions, its dot product with a fixed direction is going to be like 1 over root n. That 1 over root n and this root n cancel, and you get the bound you want. That's of course only the linearization, but it has been shown that the negative log of the volume of D_x is in fact a barrier, which means it is convex; so the log of the volume of D_x itself is a concave function, and this concavity, together with the gradient inequality, completes the proof. So that gives you a flavor of what kind of arguments go into proving that the total variation distance between the transition kernels of nearby points is going to be small.
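In symbols, under the assumption that $H(x)$ is the log-barrier Hessian (so that $\operatorname{vol}(D_x)\propto \det H(x)^{-1/2}$), the two facts just described are:

$$\big|\langle \nabla \log \operatorname{vol}(D_x),\,h\rangle\big|\;\le\;\sqrt{n}\,\|h\|_x,\qquad \|h\|_x^2 = h^{\top} H(x)\,h,$$

and $x \mapsto \log \operatorname{vol}(D_x) = \mathrm{const} - \tfrac12 \log\det H(x)$ is concave (equivalently, the volumetric barrier $\tfrac12\log\det H(x)$ is convex).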
The final result is that if you let tau be greater than a certain number of steps, and you let X_0, X_1, ... be a Dikin walk, then for any measurable S in P, the probability that X_tau belongs to S is within epsilon of mu(S); here s measures the centrality of the starting point X_0 in the polytope. So that was the first part, based on joint work with Ravi Kannan. Now I'll talk a little bit about how to extend it to arbitrary convex sets. So this is going to be using the
concept of a self-concordant barrier. A self-concordant barrier is a convex function F from the interior of P to R such that, as x tends to the boundary of P, F(x) tends to infinity, and for any point x in P and any vector h the following hold: the derivative of F at x in the direction h -- the inner product of the gradient of F with the vector h -- is at most square root of mu times the local norm of h, the norm defined by the Hessian of F; and the third derivative of F in the direction h is at most twice the Hessian quadratic form in the direction h, raised to the power 3 over 2.
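In the standard notation (with $\mu$ the barrier parameter, as in the talk), the conditions are:

$$F:\operatorname{int}(P)\to\mathbb{R}\ \text{convex},\qquad F(x)\to\infty\ \text{as}\ x\to\partial P,$$

$$\big|\langle\nabla F(x),h\rangle\big|\;\le\;\sqrt{\mu}\,\big(h^{\top}\nabla^{2}F(x)\,h\big)^{1/2},\qquad
\big|D^{3}F(x)[h,h,h]\big|\;\le\;2\,\big(h^{\top}\nabla^{2}F(x)\,h\big)^{3/2},$$

for all $x$ in the interior of $P$ and all directions $h$.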
>>: There are a bunch of things to define here. For example, what is mu?
>> Hariharan Narayanan: Mu is the barrier parameter.
>>: And these are quantified for all x?
>> Hariharan Narayanan: Yes -- for all x, for all h.
>>: And the last thing is a smoothness condition? It's [indiscernible].
>> Hariharan Narayanan: Yes, it bounds the third degree form by the second degree form.
>>: Sure. That was not exactly the question, but okay.
>>: So if instead in the last inequality on the right-hand side
instead of the Hessian you put the identity.
>>:
Yes.
>>: You're saying that the second derivative is just Lipschitz, that it's bounded? So with the identity you'd be saying the second derivative is Lipschitz, and now you're saying that the second derivative is Lipschitz with respect to itself?
>>:
[indiscernible] these words before [laughter].
>>:
[indiscernible].
>> Hariharan Narayanan: So a hyperbolic barrier is a very special kind of barrier, which comes from taking the negative logarithm of a polynomial -- a polynomial that has only real roots along a certain fixed direction. You take a multivariate polynomial p, fix x, take a vector v, and look at p(x + t v). This is now a univariate polynomial in t for fixed x, and it has only real roots. Such a p is called a hyperbolic polynomial, and if you take the negative log of a hyperbolic polynomial, you get a hyperbolic barrier. The hyperbolicity cone of p is defined as, basically, the following set: you fix v, and you look at all those x such that along the ray from x in the v direction -- for all values of t greater than or equal to 0 -- you never encounter a root. That's the hyperbolicity cone, and minus log p is a hyperbolic barrier for any affine section of it. This is very important: it means you can intersect the cone with an affine subspace, and this gives you a large class of convex sets which you can express as sections of these cones.
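Stated symbolically (my restatement, with v the fixed direction):

$$p\ \text{hyperbolic w.r.t.}\ v\ \iff\ t\mapsto p(x+t\,v)\ \text{has only real roots for every}\ x,$$

$$\Lambda_{+}(p,v)\;=\;\{x:\ p(x+t\,v)\neq 0\ \text{for all}\ t\ge 0\},\qquad F(x)\;=\;-\log p(x).$$

For example, $p(x)=x_1 x_2\cdots x_n$ with $v$ the all-ones vector gives the positive orthant and the familiar log barrier, as comes up below.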
>>: That's always true; it doesn't have to be a hyperbolic barrier.
>> Hariharan Narayanan: Sure. But you get a barrier for those convex sets. True. So, for example, the log det barrier defined by the semidefinite cone: p(X) is det X, the hyperbolicity cone is the cone of positive semidefinite matrices, and this is used a lot. And so we consider the intersection of a polytope, a section of a hyperbolicity cone, and a set with a self-concordant barrier --
>>: So that people get another example -- it's the same example, really: if you take the product of the x_i's, then you get the positive orthant --
>>: Right.
>> Hariharan Narayanan: So, yes, if you take the hyperbolic polynomial x_1 times ... times x_n, the hyperbolicity cone is the positive orthant, so the LP log barrier that I spoke about is actually also a hyperbolic barrier. So what we do in this case -- if you have the intersection of a polytope, a section of a hyperbolicity cone, and a set with a self-concordant barrier -- is construct a new barrier by adding a weighted sum of the previous barriers: F_L, the original logarithmic barrier, plus n times F_H, the hyperbolic barrier, plus n squared times F_S, the self-concordant barrier. So I need to --
>>: Sorry, what is F_L?
>> Hariharan Narayanan: F_L is the log barrier for the polytope.
>>: The polytope setting, okay.
>> Hariharan Narayanan: Now I'm looking at a convex set that's the intersection of a polytope, a convex set coming from, say, an SDP, and some general convex set that has a self-concordant barrier. This is the most general setup you could look at. I could have just used a single self-concordant barrier for the whole set, but my bounds are not as good for a general self-concordant barrier, so I wanted to show a setting where I get reasonable bounds. That's why this scaling is what makes the bound work for the self-concordant barrier and for the hyperbolic barrier. The mixing time of the Dikin walk from a warm start involves the number of faces m, n times the parameter of the hyperbolic barrier, and n squared times the parameter of the self-concordant barrier; you recover the polytope result, up to constants, when you set the latter two to 0. So this is the general setup. Now I'll move on to linear programming -- any questions?
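The combined barrier, as I read the construction just described (F_L the log barrier of the polytope, F_H the hyperbolic barrier, F_S the self-concordant barrier):

$$F(x)\;=\;F_{L}(x)\;+\;n\,F_{H}(x)\;+\;n^{2}\,F_{S}(x).$$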
>>:
So on the previous slide.
So FS is a general --
>>:
Convex body for which you have a barrier.
>>: And what property of hyperbolic barriers do you use to remove the
factor N?
>> Hariharan Narayanan: The fact that you can take the fourth derivative.
>>: And could it be that the entropy barrier satisfies this in general, for any convex body?
>> Hariharan Narayanan: I think it's true.
>>: Yes, but it's not computable.
>> Hariharan Narayanan: I forget what the bounds are.
>>: I see.
>>:
[indiscernible].
>>:
Yes.
>>:
Yes.
>> Hariharan Narayanan: Now I'll move on to linear programming. Here the model for linear programming is: you have a polytope Q, given by constraints of the form By less than or equal to 1, and you want to find the maximizer of c transpose y over y in Q. Okay. This is what Karmarkar's algorithm basically does: you write the polytope as the intersection of a simplex with an affine subspace containing the origin, and then you take a ball and make a move, within the ball intersected with the affine subspace, in the direction of the objective. So you get a new point. And now here is the special thing you do: you do a projective transformation that maps the new point back onto the origin. Of course the affine subspace also changes in this process, but it is still an affine subspace. The projective transformation preserves the boundary of the simplex -- not point-wise, but as a set it preserves it. And now you do the same thing, you repeat, and so on. This is actually a slight simplification of the algorithm, but this is the key idea of the projective transformation that he uses. The Dikin algorithm that I described has a similar way of looking at it. Here you do an affine transformation: you write the polytope as the intersection of an affine subspace with the nonnegative orthant, and then you do optimization over the ball in the subspace. Then you do an affine transformation to move the point back to the center, and you repeat. So in Karmarkar's view the polytope might look something like this. Interestingly, the ellipsoids there are not centered; they're projectively invariant, in the sense that if I took this picture, took a point and took the ellipsoid, and did a projective transformation of the whole picture, then the corresponding Karmarkar ellipsoid would be the projective image of these things. That's not true for the Dikin ellipsoids. So you do something like this. Whereas for Dikin, the Dikin ellipsoids are affine invariant and centered. So you keep moving like this.
So we'll consider the following approximate version of the standard formulation: given a polytope Q of this inequality form, if there is a y in Q such that c transpose y is greater than or equal to some target value, find a y in Q whose objective value is within epsilon of that target. This is the kind of program we're interested in. The algorithm involves doing a random walk, but without the ratio of volumes in the Metropolis filter: you are still required to reject samples corresponding to moves that you cannot reverse, but you don't need the other Metropolis filter for the ratio of the volumes. So you do this random walk, and it takes a surprisingly small number of steps. The first move, however, is very important: you have to do a projective transformation. This is very much like Karmarkar's algorithm, except we do it only once. And when you do this to a polytope, it really blows up -- the parts near the top, where this thing goes off to infinity. Over here it looks like it's very easy to optimize over the polytope because you know the direction, but actually it's going to be a high dimensional polytope and there will be some direction in which it is unbounded. You don't know that direction, and detecting that direction from the facets is very difficult, which is why you need to do something iterative. So here you map that to infinity. There is a special subspace -- a slightly translated version of it -- that gets mapped here, and we basically choose a random point inside this polytope. What happens is that this tiny band here, which looks very tiny here, is actually huge over there. So if you start here and do a small number of random walk steps, you end up there with high probability. The analysis again involves the cross-ratio: because of the projective transformation the cross-ratio is preserved, so you can do the analysis neatly.
So we do a modified Dikin walk for a certain number of steps and output the result, and you get a good point. This is how the walk might look -- I didn't draw any rejected steps -- but basically it goes faster and faster. So now I'll quickly go through
one application; this is joint work with Alexander Rakhlin, on online convex optimization. Here the model is that nature chooses bounded loss functions f_1, f_2, and so on, and at each time t an agent has to choose a point x_t in a convex set K; then nature reveals f_t and the agent suffers the loss f_t(x_t). Just note that nature reveals f_t only after the agent chooses x_t. The agent's goal is to minimize the regret, which is the total loss that he incurred, minus the optimal loss that any fixed point x would have incurred with hindsight -- so you subtract the infimum over all x in K. You're comparing your strategy against the best fixed move x, chosen with knowledge of all the f_t's. And so here is the scheme.
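In symbols, the regret after T rounds (with K the decision set):

$$R_T \;=\; \sum_{t=1}^{T} f_t(x_t)\;-\;\inf_{x\in K}\ \sum_{t=1}^{T} f_t(x).$$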
>>: So you compare your strategy to the best possible fixed point.
Not the best possible strategy.
>> Hariharan Narayanan: The best fixed point -- a trivial strategy, but yes, the best fixed point. So the scheme is as follows. At each time t you sample x_t from mu_t, the density proportional to e to the minus S_t(x), where S_t(x) is eta times the sum of the f_t(x) seen so far. This is an exponential-weights distribution restricted to the convex set, and the learning rate is about 1 over root T. The thing is, if you sample from this, it turns out to be good enough. The question is: can you sample from such a moving distribution? Because the f_t's are now changing, and you want to sample from it efficiently. It turns out that the Dikin walk can do that, with the appropriate Metropolis filter, but you need to change the filter at every time step based on the knowledge of f_t.
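As an illustration of the kind of filter change this requires (not the exact walk from the paper: a sketch reusing barrier_hessian and sample_ball from the earlier snippets, with S(x) = eta times the accumulated losses supplied as a callable), the Metropolis-Hastings acceptance for a target density proportional to exp(-S(x)) just picks up an extra factor exp(S(x) - S(y)) on top of the volume ratio:

```python
import numpy as np

def dikin_step_exp_weights(A, x, r, S, rng):
    """One Metropolis-Hastings step whose proposal is uniform in the Dikin
    ellipsoid D_x and whose stationary density on {x : A x <= 1} is
    proportional to exp(-S(x)), e.g. S(x) = eta * sum of losses so far."""
    Hx = barrier_hessian(A, x)                 # helpers from earlier sketches
    Lx = np.linalg.cholesky(Hx)
    y = x + np.linalg.solve(Lx.T, r * sample_ball(len(x), rng))
    if np.any(A @ y >= 1.0):
        return x
    Hy = barrier_hessian(A, y)
    d = x - y
    if d @ Hy @ d > r**2:                      # irreversible proposal: reject
        return x
    (_, ldx), (_, ldy) = np.linalg.slogdet(Hx), np.linalg.slogdet(Hy)
    # q(x->y) = 1/vol(D_x), so the MH log-ratio picks up S(x) - S(y)
    log_ratio = (S(x) - S(y)) + 0.5 * (ldy - ldx)
    return y if np.log(rng.uniform()) < min(0.0, log_ratio) else x
```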
>>: Another comment: on the previous slide, the f_t could be convex functions, right?
>> Hariharan Narayanan: Actually, I'm looking at linear functions.
>>: But they could be convex functions, and then it's enough to just look at the linearization in the previous algorithm, and everything works.
>> Hariharan Narayanan: I see. So this is the theorem we have: if the loss functions are Lipschitz with constant 1, then an appropriately defined time-homogeneous Dikin walk provides a sequence x_1, x_2, x_3, et cetera, that does well -- it's a good strategy in that sense.
>>: Meaning just one step at a time, or more?
>> Hariharan Narayanan: Yeah, one step at a time.
>>: That goes to the next --
>> Hariharan Narayanan: So what happens is, at time t it is very close to a stationary distribution depending on t, and at time t plus 1 it's very close to the stationary distribution for t plus 1; compared to time t, the target distribution moves only a little, basically, so you have to track the distribution, and the random walk moves fast enough that it can track the moving distribution.
>>: So x_t and x_{t+1} are one step after another, or do you take a bunch of steps?
>> Hariharan Narayanan: One step after the other.
>>: One step after the other, but it still makes -- okay. Linear function distribution.
>> Hariharan Narayanan: So thank you.
[applause]
>> Sebastien Bubeck: Okay. Questions?
>>: Is the proof completely different for the last thing that you showed us, when you track a moving distribution?
>> Hariharan Narayanan: No. In L2 -- if you can get good enough bounds in L2, then I think for the varying distributions involved you can bound those also. The key is to get an L2 bound on the distribution with respect to the stationary distribution, via the conductance.
>>: Is there some dependence on the [indiscernible] in the mixing --
>> Hariharan Narayanan: On the constant, yes. I didn't tell you what the dependence is; I just said it's of order square root T.
>>: Polynomial dependence on s. I meant: is the dependence on the number of facets necessary at this point?
>> Hariharan Narayanan: So --
>>: It's a little odd, because you said -- okay, in the membership oracle world, where you want a well-rounded set, somehow polytopes, with their facets, are the worst, because you can't have -- I mean, balls are the best.
>> Hariharan Narayanan: Balls are the best in the sense of sandwiching.
>>: But here it's the opposite: for you, if you just look at the bounds, if you take a polytope with a lot of facets which is very close to a sphere, and so can be sandwiched very nicely, from your analysis at least --
>>: I think that's a problem with Dikin: write the same constraint a hundred times. [indiscernible].
>>:
[indiscernible] is what makes this.
Makes sense.
>>: I guess. My question was supposed to be: do any other natural ellipsoids make sense -- say, take the maximum volume ellipsoid contained in the body at that point in time.
>> Hariharan Narayanan: So I had gotten a polynomial bound using that. It was not a very good polynomial bound. But max volume -- John's ellipsoid gives --
>>: That ellipsoid is chosen depending on the volume, not on the presentation.
>> Hariharan Narayanan: The thing is extremely invariant, true, but there's a trade-off: those ellipsoids don't vary smoothly, so you lose out in the bound on the total variation distance, because the shape changes very rapidly when you move the point a little bit, in that analysis.
>>: But you could use the universal barrier, and then -- there's a computational cost, but in terms of varying smoothly, it does vary smoothly.
>> Hariharan Narayanan: But I told you that the n squared in front of nu is there for me because I don't have control over the volumetric barrier of a universal barrier. So that is one reason. And another --
>>: And the volumetric barrier of the [indiscernible] barrier is the [indiscernible] barrier.
>> Hariharan Narayanan: Can you repeat that again?
>>: So, you know, there are three universal barriers; one is the canonical barrier, and the canonical barrier has the property that if you take the volumetric barrier of it, it doesn't change. It's a fixed point of this operation.
>> Sebastien Bubeck: All right. Thanks.
[applause]