>> Sebastien Bubeck: All right. So it's my pleasure to introduce
Hariharan Narayanan from the University of Washington. So many of us
in the room have been working with self-concordant barriers. And Hari
will tell us how to use them for sampling polytopes.
>> Hariharan Narayanan: Thank you for the invitation to speak here.
It's a pleasure to be here. The title of my talk is randomized
interior point methods for sampling and optimization. So in this talk
I'll put together work that appears in three papers. The first was
joint work with Ravi Kannan, called "Random walks on polytopes and an affine interior point method for linear programming." Then in later work I generalized this to hyperbolic barriers and certain other barriers. And I will also talk about an application, which is joint work with Alexander Rakhlin, to regret minimization in online optimization. So the task of randomly sampling a polytope has the following form: we are given P, an n-dimensional polytope given by m linear inequalities, and an interior point x. The task is to sample the polytope at random, and we will do this by discretizing a diffusion process designed to mix fast.
So why should one be interested in sampling polytopes? Well, one
application is to sampling lattice points in polytopes. This gives
rise to algorithms for sampling combinatorial structures such as contingency tables. It also gives rise to algorithms for volume computation: the task of sampling a polytope is used as a black box in volume computation algorithms, which in turn lead to algorithms for counting lattice points in polytopes. And finally, as I'll discuss today, these give rise to randomized interior point methods for convex optimization. The model in past work in this area has been somewhat different: the convex set is specified using what is called a membership oracle. Sorry -- so the convex set is invisible, unfortunately, but it is sandwiched between two balls here, an inner ball and an outer ball. And given a query point x, the answer is yes if x belongs to the convex set and no otherwise.
So past approaches to sample convex sets have included the grid walk
due to Dyer, Frieze, and Kannan, which had a mixing time of n to the 23 from a warm start after preprocessing. By a warm start, I mean a starting distribution the L-infinity norm of whose Radon-Nikodym derivative with respect to the stationary measure is bounded by a universal constant. This is how the grid walk works: you basically take small steps on a grid. Another walk, considered by Kannan and Lovász, is the ball walk: you take a ball of a certain radius around the current point, pick a random point in the ball, and repeat this procedure. This was shown to have a much faster mixing time, namely O*(n^3) from a warm start
after preprocessing. Then there's the random walk called hit-and-run, which was first analyzed by Lovász, and then the analysis was elaborated upon and improved by Lovász and Vempala. Here, in one transition, you pick a random chord through the current point and pick a random point on that chord, and repeat; this is called hit-and-run. The mixing time from a fixed point was of order n^3 (R/r)^2 log(R/(d epsilon)), where d is the distance of the initial starting point from the boundary, R is the radius of the circumscribed ball, r is the radius of the inscribed ball, and epsilon is the total variation distance to uniformity that you desire. So the mixing time from a fixed point is a factor of order n more than the mixing time from a warm start. This is related to the fact that in high dimensions, if you start at a single point and take one step, then you essentially start with a distribution whose L2 norm is exponentially large in n, and so you pay a penalty of n when starting from a fixed point. Now, for a polytope with at most poly(n) faces, R over r cannot be made better than of order n^(1/2 - delta), and even this is in fact achievable directly only for symmetric polytopes.
If you have asymmetric polytopes, then what you need to do is put them in a certain position and then shave off their corners by intersecting them with a ball of radius about root n. So it's not that you can actually sandwich the entire convex set between balls of radius 1 and root n; you have parts that stick out of the outer ball, but they're very small in measure, so you can ignore them. But this gives the n^(1/2) factor, and the overall bound on the mixing time from a fixed point is therefore n^(4 - delta). So: n^3 from a warm start, and n^(4 - delta) from a fixed point. The mixing time of the new Markov chain that I'll discuss today is as follows. For sufficiently large n, the mixing time from a fixed point is this quantity on the slide, involving the product mn: m is the number of faces of the polytope, n is the dimension, and s is a notion of centrality of the starting point x -- the maximum, over all chords passing through x, of the ratio of the longer part to the shorter part. So --
>>:
[indiscernible].
>> Hariharan Narayanan: x is the starting point. So, for sufficiently large n, the mixing time to get to within epsilon total variation distance from a warm start is of order mn. I have a somewhat crude upper bound on the number of arithmetic operations it takes for one step, which is m to the gamma. There is a better analysis which gives essentially the number of nonzeros in the constraint matrix of the polytope plus order n squared, so this m to the gamma can be replaced by something like that. As a corollary, we see that the number of random walk steps to mix is smaller for this walk if the number of faces is order n squared, and it takes fewer arithmetic operations if m is order n to the 1.46; again, with the better analysis this threshold becomes order n squared as well.
>>: Is there any condition on the polytope, like being isotropic?
>> Hariharan Narayanan: No. The random walk that I'm going to discuss is affine invariant, so it won't matter.
>>: Interesting.
>> Hariharan Narayanan: So, in order to define the random walk, I'm going to have to define Dikin ellipsoids, because this is how one step of the Markov chain will be made. The Dikin ellipsoid of a polytope around a point x is defined as the set of all points y such that (x minus y), transposed, multiplied by the matrix that is the sum over i of a_i a_i transpose divided by (1 minus a_i transpose x) squared, multiplied again by (x minus y), is at most r squared. Here a_i transpose x less than or equal to 1, for i = 1 through m, is the polytope, and a_i is the i-th row of the matrix A. So for x belonging to P, D_x is the set of all y such that this deviation vector from x, measured in the quadratic form given by this matrix -- which is actually the Hessian of a certain barrier function that I'll come to -- is less than or equal to r squared, and for us r is the square root of 3/40 in the polytope case. To take a step back: there is some convex function whose Hessian, if you take it, gives you exactly this matrix, and the quadratic form being at most r squared defines the ellipsoid. These ellipsoids have nice properties. For one, if you take, around any point x, the symmetrization of the polytope, then the Dikin ellipsoid dilated by the square root of m contains that symmetrization. It's too much to ask that the Dikin ellipsoid dilated by square root m contain the polytope itself, because you can go very close to a corner and then the Dikin ellipsoid will be tiny. But what is almost as good for our purposes is that when you dilate it by square root m, it contains the symmetrization of the polytope at that point.
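As a concrete illustration (a minimal sketch, assuming the polytope is stored as the rows a_i of a matrix A with a_i^T x <= 1; the helper names are my own):

```python
import numpy as np

def barrier_hessian(A, x):
    """Hessian of the log barrier F(x) = -sum_i log(1 - a_i^T x) for the
    polytope {x : A x <= 1}: H(x) = sum_i a_i a_i^T / (1 - a_i^T x)^2."""
    s = 1.0 - A @ x                         # slacks; positive in the interior
    return A.T @ ((1.0 / s**2)[:, None] * A)

def in_dikin_ellipsoid(A, x, y, r):
    """Check whether y lies in D_x = {y : (y - x)^T H(x) (y - x) <= r^2}."""
    d = y - x
    return float(d @ barrier_hessian(A, x) @ d) <= r**2
```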
>>:
Excuse me, what do you mean by symmetrization?
>> Hariharan Narayanan: It's the set of all points with the property that both the point and its reflection about the central point belong to the polytope. So it's P intersected with minus P when that point is the origin. Now, Dikin's affine scaling LP algorithm -- this was where Dikin proposed the Dikin ellipsoid in the first place -- does the following:
You start at a point, take a Dikin ellipsoid, and you optimize the
linear function over the Dikin ellipsoid. Pick the new point, pick a
new ellipsoid around the new point. And repeat. So this algorithm has
no polynomial time guarantees and is believed to not be in polynomial
time. So although it's a very natural-looking interior point method, it is believed to be not polynomial time. On the other hand, the Dikin walk, which I'm going to describe, with some modifications leads to polynomial time algorithms for linear programming. And the Dikin walk -- now that
I've defined the Dikin ellipsoid, I can talk about the Dikin walk. It's defined as follows: you take a point, take the Dikin ellipsoid around that point, and pick a random point inside the Dikin ellipsoid. Then look at the new Dikin ellipsoid around the proposed point. If the new Dikin ellipsoid contains the old point, then accept the new point with probability equal to the minimum of one and vol(D_x0)/vol(D_x1) -- the volume of the initial Dikin ellipsoid divided by the volume of the new one. If the old point x0 does not belong to D_x1, then don't accept the move at all; just reject it. So this is the random walk. As you can see, if you start at a corner, the ellipsoid is small, and as you move closer to the interior -- although in high dimensions you never really go deep into the interior; you kind of move along the surface of the polytope -- still, when you're far from very low dimensional corners, the sizes of these ellipsoids become bigger. So this is a natural discretization of a Brownian motion with drift on a certain manifold, where the metric is the one for which these ellipsoids are the unit balls in the tangent spaces. That diffusion process satisfies a Fokker-Planck equation which is very much akin to the heat equation, but whose stationary measure is not the uniform measure: once you equip the polytope with this different metric, it also gets a natural measure, whose density with respect to the uniform measure is proportional to the inverse of the volume of the Dikin ellipsoids. So that's that measure, actually. And so there's a natural process in the background here.
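As a rough sketch of one step of this walk (not the exact constants or step radius from the paper; it reuses barrier_hessian from the earlier snippet and assumes a step radius r), note that vol(D_x) is proportional to det H(x)^(-1/2), so the acceptance ratio vol(D_x)/vol(D_y) equals sqrt(det H(y)/det H(x)):

```python
import numpy as np

def sample_ball(n, rng):
    """Uniform random point in the unit Euclidean ball in R^n."""
    g = rng.standard_normal(n)
    g /= np.linalg.norm(g)
    return g * rng.uniform() ** (1.0 / n)

def dikin_walk_step(A, x, r, rng):
    """One step of a Dikin-walk-style chain on {x : A x <= 1}: propose a
    uniform point in D_x, then apply the Metropolis filter described above
    so that the uniform distribution on the polytope is stationary."""
    Hx = barrier_hessian(A, x)                 # helper from the earlier sketch
    Lx = np.linalg.cholesky(Hx)
    y = x + np.linalg.solve(Lx.T, r * sample_ball(len(x), rng))
    if np.any(A @ y >= 1.0):                   # proposal left the polytope
        return x
    Hy = barrier_hessian(A, y)
    d = x - y
    if d @ Hy @ d > r**2:                      # old point not in D_y: reject
        return x
    # vol(D_x)/vol(D_y) = sqrt(det H(y) / det H(x))
    (_, ldx), (_, ldy) = np.linalg.slogdet(Hx), np.linalg.slogdet(Hy)
    accept = min(1.0, np.exp(0.5 * (ldy - ldx)))
    return y if rng.uniform() < accept else x
```

Here rng would be a numpy Generator such as np.random.default_rng(), and r a small constant like the square root of 3/40 mentioned above.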
>>: Is this even with the rejection step? I mean, if we are not doing the rejection --
>> Hariharan Narayanan: I'm not claiming an exact correspondence. I'm just saying that this is the related continuous process -- yeah, I don't know if, with the rejection step, when you take the step size delta to 0, you actually get this or not. I don't know if formally that's true. So
the hit-and-run walk of Lovász involves a transition where, starting at x, you draw a random chord and pick a random point on the chord -- and the mixing from a warm start -- I'm sorry, that should not be there. So the mixing time from a warm start here, for the Dikin walk, is mn, whereas the mixing time from a warm start for hit-and-run was n cubed. This algorithm was used in integer programming by Huang and Mehrotra, and the idea there is that you want to find not only a point with a large objective value, but a point which is integral and has a large objective value. To do this, instead of running a normal interior point method, they used a variant of the Dikin walk -- short and long step Dikin walks -- and got a point that was not exactly optimal but close to optimal and a bit random; then they did some additional work to make it an integer point. They basically rounded it, and if it landed outside the convex polytope they took the nearest point in the convex set to that new point, and then again rounded it, and did this repeatedly. In the end they would get an integer solution, add a cutting plane at that integer solution, and repeat the whole process. So that was what they did.
So the way we're going to analyze the Markov chain here is by getting a lower bound on the conductance. The conductance phi of this Markov chain is obtained by taking the infimum, over all sets S of measure at most one half, of the probability of escaping from S: pick a random point x in S according to the stationary measure, take one step, and look at the probability that during this experiment you move from one side to the other. And Lovász and Simonovits in '93 proved the following bound. They proved that if the starting distribution has density rho with the supremum of rho(x) equal to M, and you run the random walk, then for all sets S the probability that X_k belongs to S differs from mu(S) by at most the square root of M times e to the minus k phi squared over 2. So even if M is exponentially large in n, that only gets translated into an additional factor of n in k, and so you can get bounds from a fixed point as well. So the mixing time from a warm start is of order 1 over phi squared, and now the question is how do we bound the conductance.
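Written out (a restatement of the definitions just described, with $\mu$ the stationary measure, $P_x$ the one-step transition kernel from $x$, and $\rho$ the starting density):

$$\phi \;=\; \inf_{S:\ \mu(S)\le 1/2}\ \frac{\int_S P_x(P\setminus S)\, d\mu(x)}{\mu(S)},\qquad
\big|\Pr[X_k\in S]-\mu(S)\big| \;\le\; \sqrt{M}\; e^{-k\phi^2/2},\qquad M=\sup_x \rho(x).$$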
The way the conductance is bounded in this situation is: you first get a lower bound, which is purely geometric, on the isoperimetry of the convex polytope equipped with this particular metric, and then show that the transition kernels of the Markov chain are, in some sense, faithful to this metric -- if you take two points that are geometrically close, then the total variation distance between the transition kernels corresponding to those two points is bounded away from 1 by Omega(1). So, as I discussed, you take the convex set and you give it a metric where tiny distances are measured by the ellipsoids; the ellipsoids are the unit balls of that metric, locally. For that metric space we want an isoperimetric inequality. It turns out that Lovász analyzed hit-and-run using a different metric, called the Hilbert metric, and our metric is approximately isometric to the Hilbert metric up to a square root m factor. That allows us to analyze the isoperimetry of our metric using that of the Hilbert metric. Here is what the Hilbert metric is.
If you take two points x and y, with u and v the endpoints of the chord through x and y (u beyond x, v beyond y), you define sigma(x, y) as |x minus y| times |u minus v|, divided by |u minus x| times |y minus v|. This is not a Euclidean distance; this is a projective cross-ratio. If you take the log of 1 plus this, then that becomes the distance. So log(1 + sigma(x, y)) is the Hilbert distance.
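A small numeric sketch of this cross-ratio distance inside a polytope {z : A z <= 1} (my own helper names; it assumes the chord through x and y actually hits the boundary in both directions):

```python
import numpy as np

def chord_endpoint(A, x, direction):
    """Boundary point of {z : A z <= 1} reached from x along `direction`
    (assumes the polytope is bounded in that direction)."""
    num = 1.0 - A @ x
    den = A @ direction
    t = np.min(num[den > 0] / den[den > 0])
    return x + t * direction

def hilbert_distance(A, x, y):
    """Cross-ratio (Hilbert) distance between interior points x, y."""
    d = y - x
    v = chord_endpoint(A, y, d)      # chord endpoint beyond y
    u = chord_endpoint(A, x, -d)     # chord endpoint beyond x
    sigma = (np.linalg.norm(x - y) * np.linalg.norm(u - v)) / (
        np.linalg.norm(u - x) * np.linalg.norm(y - v))
    return np.log1p(sigma)
```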
And what Lovász proved was that if you put the uniform measure on the polytope P and you partition it into three parts -- S1 prime is one part, S2 prime is another part, and P minus S1 prime minus S2 prime is the part in between -- then the measure of P multiplied by the measure of the part in between is greater than or equal to the distance between the two parts multiplied by mu of S1 prime multiplied by mu of S2 prime. By the distance between the two parts, I mean the cross-ratio distance between the nearest pair x and y, with x in the first part and y in the second part. So this gives rise to an isoperimetric inequality for our setting also, where the metric is given locally by the Dikin ellipsoids. So there's something not working right with my slides.
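Stated symbolically (my reconstruction of the inequality as described, with $d_\sigma$ the cross-ratio distance between the two parts):

$$\mu(P)\,\mu\big(P\setminus(S_1'\cup S_2')\big)\;\ge\; d_\sigma(S_1',S_2')\,\mu(S_1')\,\mu(S_2').$$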
>>:
[indiscernible].
>> Hariharan Narayanan: Yes. As I mentioned before, the Dikin ellipsoids, if you dilate them by square root m, contain the symmetrization of the polytope. And what I forgot to mention was that, as defined, they are themselves contained inside the polytope. So the symmetrization of the convex polytope about a point is sandwiched between the Dikin ellipsoid and its dilation by square root m. Because of that, the Hilbert metric is within a factor of square root m of the Dikin metric, and this gives rise to the isoperimetric inequality for the Dikin metric. If you have product polytopes you can do better, because if you take manifolds which are products of smaller manifolds, then the isoperimetric constant is not much worse than that of the worst factor. So this m can be replaced by the m of the smaller polytopes which are used in the [indiscernible] product. But that's a special case. So next, the
second step is to be able to relate the total variation distance between the transition density functions for two nearby points. Here I want to make a point: I said that every step is made within a Dikin ellipsoid and so on, but once you apply the rejection step -- and that rejection step is quite extreme in the case of polytopes, because you're taking pretty large steps and in the worst case you end up rejecting one whole side -- the effective one-step distributions don't really look like ellipsoids anymore. They look like these sorts of truncated regions, and you need to argue about the overlap of these sorts of things. So the lemma is that if the distance between x and y is less than 1 over root n, then the total variation distance between P_x and P_y is less than 1 minus Omega(1). And how does that go? Basically we're going to use the isoperimetric inequality. This is a lemma which I'll discuss a little later, perhaps; first I'm going to talk about the consequence of this lemma for the conductance. We want to bound the conductance, and we want to do so using the isoperimetric inequality. So what we do is take a cut, and then we want to find out what the probability is of moving from S1 to S2, and we want to give a lower bound on that. So we associate with S1 a set S1 prime, which is the set of all points x in S1 such that P_x(S2) is less than delta over 2. These points are somehow deeper inside S1, and they don't go into S2 with good probability. Similarly we associate with S2 a set S2 prime, the points deeper inside S2 which don't go into S1 with good probability. What we want to say is that the mass of the intermediate portion in between these is actually large; that portion corresponds to points that go to the other side with good probability, and so the probability of going to the other side is large. So that's going to be the argument.
So let x belong to S1 prime and y to S2 prime. Then this implies that the total variation distance between P_x and P_y is greater than 1 minus delta, because P_x is mostly supported inside S1 and P_y is mostly supported inside S2. And because these kernels are so far apart, by the lemma x and y must be far apart in the metric: if x and y were close by, then d_TV(P_x, P_y) would be small, so here x and y are far apart. But now we are in a position to use the isoperimetric inequality: by the bound on the Cheeger constant, the measure of the part in between is large -- it has measure Omega(1 over root(mn)). The 1 over root n comes from the lemma, and the 1 over root m comes from the isoperimetric inequality for the Dikin metric. Therefore a point in this band jumps to the other side with probability Omega(1), which implies that phi is at least Omega(1 over root(mn)). That is the argument that the conductance is Omega(1 over root(mn)). So to prove the
lemma, we have to -- I'm not going to prove the full lemma, but let me just tell you what the steps are. Certainly, if you want to prove that when two points are close, d_TV(P_x, P_y) is bounded away from one, you need to be able to show that if two points are at infinitesimally small distance from each other, then d_TV(P_x, P_y) is bounded away from 1. That is related to the probability of a proper move -- the probability that you're not stuck at a given point. So suppose you want to show that you're not stuck at a given point with good probability. What you need to show is that, first of all, when you make a move from x -- when you pick a proposed point w -- then x is likely to be contained in D_w. Because if x is not likely to be contained in D_w, then we are rejecting the move with 100 percent probability.
>>:
Sigma, what distance is this one?
>> Hariharan Narayanan: Sigma here is the Lovász cross-ratio distance. So this lemma is not in terms of the Dikin metric; it's in terms of the Lovász metric, but the scaling is right. So sigma(x, y) is going to be about 1 over root n in this case.
>>: I forgot, what is D omega?
>> Hariharan Narayanan: Omega is the proposed point, and D omega is the Dikin ellipsoid centered at omega.
>>: Oh, okay.
>> Hariharan Narayanan: Yes. Sorry -- x is actually contained in D_w. So this is something you need to show, because otherwise, every time you make a move it is going to be impossible to come back, and you'll be forced to reject it. When you impose a Metropolis filter, you need to impose it in a way such that the moves that go in one direction are also moves that can go the other way; otherwise the chain is not going to be reversible, and we don't know how to analyze non-reversible Markov chains in this setting. Second, the volume of the new ellipsoid is unlikely to be much different from the previous one. Because, if you recall, there are two things involved in the rejection: the first is whether the new Dikin ellipsoid contains the old point, and the second was the ratio between the volumes of the two Dikin ellipsoids. So we don't want the volume of the new ellipsoid to be much larger than the current one, and we prove something like that.
>>:
What's the [indiscernible].
>>:
Error function.
>> Hariharan Narayanan: Error function. So step two follows from these two facts. First, the gradient of the log of the volume of the Dikin ellipsoid -- this is essentially the volumetric barrier -- is less than or equal to square root n in the local metric. If you take a Dikin ellipsoid, rescale it so that the Dikin ellipsoid is the unit ball, and then measure the gradient of the log of the volume of the Dikin ellipsoid around it, then that is less than or equal to root n. And we know that when you pick a random point in high dimensions, its dot product with a fixed direction is going to be like 1 over root n. That 1 over root n and this root n cancel, and you get the bound you want. That's of course only the linearization, but it has been shown that the negative log of the volume of D_x is in fact a barrier, which means it is convex; so the log of the volume of D_x itself is a concave function, and this concavity, together with the gradient inequality, completes the proof. So that gives you a flavor of what kind of arguments go into proving that the total variation distance between the transition kernels of nearby points is going to be small.
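In symbols, under the assumption that $H(x)$ is the log-barrier Hessian (so that $\operatorname{vol}(D_x)\propto \det H(x)^{-1/2}$), the two facts just described are:

$$\big|\langle \nabla \log \operatorname{vol}(D_x),\,h\rangle\big|\;\le\;\sqrt{n}\,\|h\|_x,\qquad \|h\|_x^2 = h^{\top} H(x)\,h,$$

and $x \mapsto \log \operatorname{vol}(D_x) = \mathrm{const} - \tfrac12 \log\det H(x)$ is concave (equivalently, the volumetric barrier $\tfrac12\log\det H(x)$ is convex).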
The final result is that if you let tau be greater than a certain number of steps, and you let X_0, X_1, ... be a Dikin walk, then for any measurable S in P, the probability that X_tau belongs to S is within epsilon of mu(S); here s measures the centrality of the starting point X_0 in the polytope. So that was the first part, based on joint work with Ravi Kannan. Now I'll talk a little bit about how to extend it to arbitrary convex sets. So this is going to be using the
concept of a self-concordant barrier. A self-concordant barrier is a convex function F from the interior of P to R such that, as x tends to the boundary of P, F(x) tends to infinity, and for any point x in P and any vector h the following hold: the derivative of F at x in the direction h -- the inner product of the gradient of F with the vector h -- is at most square root of mu times the local norm of h, the norm defined by the Hessian of F; and the third derivative of F in the direction h is at most twice the Hessian quadratic form in the direction h, raised to the power 3 over 2.
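In the standard notation (with $\mu$ the barrier parameter, as in the talk), the conditions are:

$$F:\operatorname{int}(P)\to\mathbb{R}\ \text{convex},\qquad F(x)\to\infty\ \text{as}\ x\to\partial P,$$

$$\big|\langle\nabla F(x),h\rangle\big|\;\le\;\sqrt{\mu}\,\big(h^{\top}\nabla^{2}F(x)\,h\big)^{1/2},\qquad
\big|D^{3}F(x)[h,h,h]\big|\;\le\;2\,\big(h^{\top}\nabla^{2}F(x)\,h\big)^{3/2},$$

for all $x$ in the interior of $P$ and all directions $h$.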
>>: There are a bunch of things to define here. For example, what is mu?
>> Hariharan Narayanan: Mu is the barrier parameter.
>>: And these are quantified for all x?
>> Hariharan Narayanan: Yes -- for all x, for all h.
>>: And the last thing is a smoothness condition? It's [indiscernible].
>> Hariharan Narayanan: Yes, it bounds the third degree form by the second degree form.
>>: Sure. That was not exactly the question, but okay.
>>: So if instead in the last inequality on the right-hand side
instead of the Hessian you put the identity.
>>:
Yes.
>>: You're saying that the second derivative is just Lipschitz, that it's bounded? So with the identity you'd be saying the second derivative is Lipschitz, and now you're saying that the second derivative is Lipschitz with respect to itself?
>>:
[indiscernible] these words before [laughter].
>>:
[indiscernible].
>> Hariharan Narayanan: So a hyperbolic barrier is a very special kind of barrier, which comes from taking the negative logarithm of a polynomial -- a polynomial that has only real roots along a certain fixed direction. You take a multivariate polynomial p, fix x, take a vector v, and look at p(x + t v). This is now a univariate polynomial in t for fixed x, and it has only real roots. Such a p is called a hyperbolic polynomial, and if you take the negative log of a hyperbolic polynomial, you get a hyperbolic barrier. The hyperbolicity cone of p is defined as, basically, the following set: you fix v, and you look at all those x such that along the ray from x in the v direction -- for all values of t greater than or equal to 0 -- you never encounter a root. That's the hyperbolicity cone, and minus log p is a hyperbolic barrier for any affine section of it. This is very important: it means you can intersect the cone with an affine subspace, and this gives you a large class of convex sets which you can express as sections of these cones.
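Stated symbolically (my restatement, with v the fixed direction):

$$p\ \text{hyperbolic w.r.t.}\ v\ \iff\ t\mapsto p(x+t\,v)\ \text{has only real roots for every}\ x,$$

$$\Lambda_{+}(p,v)\;=\;\{x:\ p(x+t\,v)\neq 0\ \text{for all}\ t\ge 0\},\qquad F(x)\;=\;-\log p(x).$$

For example, $p(x)=x_1 x_2\cdots x_n$ with $v$ the all-ones vector gives the positive orthant and the familiar log barrier, as comes up below.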
>>: That's always true; it doesn't have to be a hyperbolic barrier.
>> Hariharan Narayanan: Sure. But you get a barrier for those convex sets. True. So, for example, the log det barrier defined by the semidefinite cone: p(X) is det X, the hyperbolicity cone is the cone of positive semidefinite matrices, and this is used a lot. And so we consider the intersection of a polytope, a section of a hyperbolicity cone, and a set with a self-concordant barrier --
>>: So that people get another example -- it's the same example, really: if you take the product of the x_i's, then you get the positive orthant --
>>: Right.
>> Hariharan Narayanan: So, yes, if you take the hyperbolic polynomial x_1 times ... times x_n, the hyperbolicity cone is the positive orthant, so the LP log barrier that I spoke about is actually also a hyperbolic barrier. So what we do in this case -- if you have the intersection of a polytope, a section of a hyperbolicity cone, and a set with a self-concordant barrier -- is construct a new barrier by adding a weighted sum of the previous barriers: F_L, the original logarithmic barrier, plus n times F_H, the hyperbolic barrier, plus n squared times F_S, the self-concordant barrier. So I need to --
>>: Sorry, what is F_L?
>> Hariharan Narayanan: F_L is the log barrier for the polytope.
>>: The polytope setting, okay.
>> Hariharan Narayanan: Now I'm looking at a convex set that's the intersection of a polytope, a convex set coming from, say, an SDP, and some general convex set that has a self-concordant barrier. This is the most general setup you could look at. I could have just used a single self-concordant barrier for the whole set, but my bounds are not as good for a general self-concordant barrier, so I wanted to show a setting where I get reasonable bounds. That's why this scaling is what makes the bound work for the self-concordant barrier and for the hyperbolic barrier. The mixing time of the Dikin walk from a warm start involves the number of faces m, n times the parameter of the hyperbolic barrier, and n squared times the parameter of the self-concordant barrier; you recover the polytope result, up to constants, when you set the latter two to 0. So this is the general setup. Now I'll move on to linear programming -- any questions?
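The combined barrier, as I read the construction just described (F_L the log barrier of the polytope, F_H the hyperbolic barrier, F_S the self-concordant barrier):

$$F(x)\;=\;F_{L}(x)\;+\;n\,F_{H}(x)\;+\;n^{2}\,F_{S}(x).$$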
>>:
So on the previous slide.
So FS is a general --
>>:
Convex body for which you have a barrier.
>>: And what property of hyperbolic barriers do you use to remove the
factor N?
>> Hariharan Narayanan: The fact that you can take the fourth derivative.
>>: And could it be that the entropy barrier satisfies this in general, for any convex body?
>> Hariharan Narayanan: I think it's true.
>>: Yes, but it's not computable.
>> Hariharan Narayanan: I forget what the bounds are.
>>: I see.
>>:
[indiscernible].
>>:
Yes.
>>:
Yes.
>> Hariharan Narayanan: Now I'll move on to linear programming. Here the model for linear programming is: you have a polytope Q, given by constraints of the form By less than or equal to 1, and you want to find the maximizer of c transpose y over y in Q. Okay. This is what Karmarkar's algorithm basically does: you write the polytope as the intersection of a simplex with an affine subspace containing the origin, and then you take a ball and make a move, within the ball intersected with the affine subspace, in the direction of the objective. So you get a new point. And now here is the special thing you do: you do a projective transformation that maps the new point back onto the origin. Of course the affine subspace also changes in this process, but it is still an affine subspace. The projective transformation preserves the boundary of the simplex -- not point-wise, but as a set it preserves it. And now you do the same thing, you repeat, and so on. This is actually a slight simplification of the algorithm, but this is the key idea of the projective transformation that he uses. The Dikin algorithm that I described has a similar way of looking at it. Here you do an affine transformation: you write the polytope as the intersection of an affine subspace with the nonnegative orthant, and then you do optimization over the ball in the subspace. Then you do an affine transformation to move the point back to the center, and you repeat. So in Karmarkar's view the polytope might look something like this. Interestingly, the ellipsoids there are not centered; they're projectively invariant, in the sense that if I took this picture, took a point and took the ellipsoid, and did a projective transformation of the whole picture, then the corresponding Karmarkar ellipsoid would be the projective image of these things. That's not true for the Dikin ellipsoids. So you do something like this. Whereas for Dikin, the Dikin ellipsoids are affine invariant and centered. So you keep moving like this.
So we'll consider the following approximate version of the standard formulation: given a polytope Q of this inequality form, if there is a y in Q such that c transpose y is greater than or equal to some target value, find a y in Q whose objective value is within epsilon of that target. This is the kind of program we're interested in. The algorithm involves doing a random walk, but without the ratio of volumes in the Metropolis filter: you are still required to reject samples corresponding to moves that you cannot reverse, but you don't need the other Metropolis filter for the ratio of the volumes. So you do this random walk, and it takes a surprisingly small number of steps. The first move, however, is very important: you have to do a projective transformation. This is very much like Karmarkar's algorithm, except we do it only once. And when you do this to a polytope, it really blows up -- the parts near the top, where this thing goes off to infinity. Over here it looks like it's very easy to optimize over the polytope because you know the direction, but actually it's going to be a high dimensional polytope and there will be some direction in which it is unbounded. You don't know that direction, and detecting that direction from the facets is very difficult, which is why you need to do something iterative. So here you map that to infinity. There is a special subspace -- a slightly translated version of it -- that gets mapped here, and we basically choose a random point inside this polytope. What happens is that this tiny band here, which looks very tiny here, is actually huge over there. So if you start here and do a small number of random walk steps, you end up there with high probability. The analysis again involves the cross-ratio: because of the projective transformation the cross-ratio is preserved, so you can do the analysis neatly.
So we do a modified Dikin walk for a certain number of steps and output the result, and you get a good point. This is how the walk might look -- I didn't draw any rejected steps -- but basically it goes faster and faster. So now I'll quickly go through
one application; this is joint work with Alexander Rakhlin, on online convex optimization. Here the model is that nature chooses bounded loss functions f_1, f_2, and so on, and at each time t an agent has to choose a point x_t in a convex set K; then nature reveals f_t and the agent suffers the loss f_t(x_t). Just note that nature reveals f_t only after the agent chooses x_t. The agent's goal is to minimize the regret, which is the total loss that he incurred, minus the optimal loss that any fixed point x would have incurred with hindsight -- so you subtract the infimum over all x in K. You're comparing your strategy against the best fixed move x, chosen with knowledge of all the f_t's. And so here is the scheme.
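In symbols, the regret after T rounds (with K the decision set):

$$R_T \;=\; \sum_{t=1}^{T} f_t(x_t)\;-\;\inf_{x\in K}\ \sum_{t=1}^{T} f_t(x).$$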
>>: So you compare your strategy to the best possible fixed point.
Not the best possible strategy.
>> Hariharan Narayanan: The best fixed point -- a trivial strategy, but yes, the best fixed point. So the scheme is as follows. At each time t you sample x_t from mu_t, the density proportional to e to the minus S_t(x), where S_t(x) is eta times the sum of the f_t(x) seen so far. This is an exponential-weights distribution restricted to the convex set, and the learning rate is about 1 over root T. The thing is, if you sample from this, it turns out to be good enough. The question is: can you sample from such a moving distribution? Because the f_t's are now changing, and you want to sample from it efficiently. It turns out that the Dikin walk can do that, with the appropriate Metropolis filter, but you need to change the filter at every time step based on the knowledge of f_t.
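As an illustration of the kind of filter change this requires (not the exact walk from the paper: a sketch reusing barrier_hessian and sample_ball from the earlier snippets, with S(x) = eta times the accumulated losses supplied as a callable), the Metropolis-Hastings acceptance for a target density proportional to exp(-S(x)) just picks up an extra factor exp(S(x) - S(y)) on top of the volume ratio:

```python
import numpy as np

def dikin_step_exp_weights(A, x, r, S, rng):
    """One Metropolis-Hastings step whose proposal is uniform in the Dikin
    ellipsoid D_x and whose stationary density on {x : A x <= 1} is
    proportional to exp(-S(x)), e.g. S(x) = eta * sum of losses so far."""
    Hx = barrier_hessian(A, x)                 # helpers from earlier sketches
    Lx = np.linalg.cholesky(Hx)
    y = x + np.linalg.solve(Lx.T, r * sample_ball(len(x), rng))
    if np.any(A @ y >= 1.0):
        return x
    Hy = barrier_hessian(A, y)
    d = x - y
    if d @ Hy @ d > r**2:                      # irreversible proposal: reject
        return x
    (_, ldx), (_, ldy) = np.linalg.slogdet(Hx), np.linalg.slogdet(Hy)
    # q(x->y) = 1/vol(D_x), so the MH log-ratio picks up S(x) - S(y)
    log_ratio = (S(x) - S(y)) + 0.5 * (ldy - ldx)
    return y if np.log(rng.uniform()) < min(0.0, log_ratio) else x
```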
>>: Another comment: on the previous slide, the f_t could be convex functions, right?
>> Hariharan Narayanan: Actually, I'm looking at linear functions.
>>: But they could be convex functions, and then it's enough to just look at the linearization in the previous algorithm, and everything works.
>> Hariharan Narayanan: I see. So this is the theorem we have: if the loss functions are Lipschitz with constant 1, then an appropriately defined time-homogeneous Dikin walk provides a sequence x_1, x_2, x_3, et cetera, that does well -- it's a good strategy in that sense.
>>: Meaning just one step at a time, or more?
>> Hariharan Narayanan: Yeah, one step at a time.
>>: That goes to the next --
>> Hariharan Narayanan: So what happens is, at time t it is very close to a stationary distribution depending on t, and at time t plus 1 it's very close to the stationary distribution for t plus 1; compared to time t, the target distribution moves only a little, basically, so you have to track the distribution, and the random walk moves fast enough that it can track the moving distribution.
>>: So x_t and x_{t+1} are one step after another, or do you take a bunch of steps?
>> Hariharan Narayanan: One step after the other.
>>: One step after the other, but it still makes -- okay. Linear function distribution.
>> Hariharan Narayanan: So thank you.
[applause]
>> Sebastien Bubeck: Okay. Questions?
>>: Is the proof completely different for the last thing that you showed us, when you track a moving distribution?
>> Hariharan Narayanan: No. In L2 -- if you can get good enough bounds in L2, then I think for the varying distributions involved you can bound those also. The key is to get an L2 bound on the distribution with respect to the stationary distribution, via the conductance.
>>: Is there some dependence on the [indiscernible] in the mixing --
>> Hariharan Narayanan: On the constant, yes. I didn't tell you what the dependence is; I just said it's of order square root T.
>>: Polynomial dependence on s. I meant: is the dependence on the number of facets necessary at this point?
>> Hariharan Narayanan: So --
>>: It's a little odd, because you said -- okay, in the membership oracle world, where you want a well-rounded set, somehow polytopes, with their facets, are the worst, because you can't have -- I mean, balls are the best.
>> Hariharan Narayanan: Balls are the best in the sense of sandwiching.
>>: But here it's the opposite: for you, if you just look at the bounds, if you take a polytope with a lot of facets which is very close to a sphere, and so can be sandwiched very nicely, from your analysis at least --
>>: I think that's a problem with Dikin: write the same constraint a hundred times. [indiscernible].
>>:
[indiscernible] is what makes this.
Makes sense.
>>: I guess. My question was supposed to be: do any other natural ellipsoids make sense -- say, take the maximum volume ellipsoid contained in the body at that point in time.
>> Hariharan Narayanan: So I had gotten a polynomial bound using that. It was not a very good polynomial bound. But max volume -- John's ellipsoid gives --
>>: That ellipsoid is chosen depending on the volume, not on the presentation.
>> Hariharan Narayanan: The thing is extremely invariant, true, but there's a trade-off: those ellipsoids don't vary smoothly, so you lose out in the bound on the total variation distance, because the shape changes very rapidly when you move the point a little bit, in that analysis.
>>: But you could use the universal barrier, and then -- there's a computational cost, but in terms of varying smoothly, it does vary smoothly.
>> Hariharan Narayanan: But I told you that the n squared in front of nu is there for me because I don't have control over the volumetric barrier of a universal barrier. So that is one reason. And another --
>>: And the volumetric barrier of the [indiscernible] barrier is the [indiscernible] barrier.
>> Hariharan Narayanan: Can you repeat that again?
>>: So, you know, there are three universal barriers; one is the canonical barrier, and the canonical barrier has the property that if you take the volumetric barrier of it, it doesn't change. It's a fixed point of this operation.
>> Sebastien Bubeck: All right. Thanks.
[applause]