
John Langford: Okay. So we have two papers this year at NIPS. One of them was the active learning paper, and the other one is this paper. This one has been kicked around a little bit, but there might be some nice things in it, so let me go through it carefully. So this is logarithmic time online prediction. I worked on this with my coauthor Anna Choromanska when she was visiting as an intern, and Alec was also involved in the project.
So we're doing multi-class prediction. It doesn't get simpler than that: you have features, you have the label, and we're thinking about this in kind of an online setting. We see features, we predict, and then we see the label. And our goal is just to find the classifier which minimizes the error rate. Pretty straightforward. And this is the twist: we want this to be fast.
So why do you want this? Well, if you take a look at the ways classification is often applied, they're often kind of confusing. I mean, often people think about ranking in returning a search result. But really what's going on is you have some sort of index into billions of web pages that pulls up a few plausibly relevant ones, and then you have some sort of reranking process on top of that, which runs over and over again, until finally you spit out the final results.
But in some sense, logically, the problem is really this multi-class problem right here. It means you have to return things fast, and you have eight billion possible labels, or I don't know how many it is these days, but many billions, which are all of the individually unique web pages.
Or another example is here: it's the "who is that?" question, right? There are billions of people, so it's a big multi-class prediction problem. This one I happen to be able to solve, because I know that's my brother and his family. But otherwise you wouldn't necessarily know.
>>: [Indiscernible].
John Langford: What's that? No, he did not authorize me. I'm probably doing something bad.
Okay. So there are a lot of approaches people use to avoid this large multi-class prediction problem. And I wanted to go through those and discuss the obvious things that we're not doing, because I guess we've discovered that a lot of people are used to the obvious ways of doing it that we're not using.
So trick No. 1 is when K is small. Often K is small and, you know, that's easy. Trick No. 2 is if you have a preexisting hierarchy: you just reduce to the K-is-small case by building a predictor at each node in the hierarchy. And that can work fine if you have a good preexisting hierarchy.
Trick No. 3 is a shared representation. This is very common in ImageNet applications or [indiscernible] applications. You have a bunch of shared processing, and then you have a final output layer. And because everything is shared here, the cost is amortized across the final outputs, so you save a lot computationally. That makes a lot of sense; it's very useful. But it's still the case that you have that final output layer, and if you have ten billion classes, things are going to blow up: there are going to be too many parameters.
Okay. So another trick is structured prediction. The idea here is that you want to make a bunch of individual predictions, and maybe you have some order that you impose over the set of predictions that you want to make, or some structure over the predictions. So you choose that structure in advance, and then you start thinking about exactly how to do that. This is very similar to the hierarchy case in some sense, although some of the techniques vary a little bit. So that's a good trick. Sometimes the structure is not very clear. So this is exactly like the hierarchy case: if you have a good hierarchy then you can use it; if you don't, then it's not so clear.
Trick No. 5 is the GPU. People are hitting things with GPUs quite hard these days. And GPUs are great, but as far as we can tell, they're only constant-factor great. The physics of computation are such that GPUs give you maybe a constant factor more efficient computation, but it's still only a constant factor more cost-effective than general-purpose systems. So these are the tricks people like to use.
Now, let's think about this problem a little more fundamentally. How fast can we hope to go? And the answer is log K; you can't really hope to go faster than log K. The canonical example is very simple: suppose my label is uniform on 1 through K. Then information theory tells me I really have to put out log base 2 of K bits in order to even specify a label. And so there you go, log K time. Right? That's kind of the end of the story: you can't really hope to go faster than log K.
And then if you look at the way things work right now, typically order K is what you see: the amount of computation that goes on to predict one of K things is order K. So, okay, K divided by log K, what does this mean? When K is kind of small, it doesn't really matter very much. When K is 100, you might say, okay, maybe a factor of 10 is important, but not too important. If you get up to a thousand, you get a factor of a hundred or so; maybe it starts to become important.
But you know, GPUs are really good at doing a bunch of parallel computations, so maybe even a factor of a hundred is not so important, because [indiscernible] the GPU. But when you get to a million, you're looking at a factor of fifty thousand or so, and even if you have a GPU, you're going to start to care.
So the claim is there is some K for which you care to be sublinear in K. And in particular it would be
nice to be logarithmic in K. So now the question is how do we actually get there.
So there are a lot of approaches people have taken, including myself; I worked on this. So for sparse error-correcting output codes, what happens is you say: I'm going to create order log K binary vectors, each of length equal to the number of labels, so with K 0/1 entries. And then I'm going to train some binary classifiers to minimize the error rate; each one just wants to predict its bit, essentially. Right? So I have order log K different binary classifiers, each of which predicts one bit position of my binary code. And then I want to predict by finding the Y with minimal disagreement. So if the value of $b_{iy}$ is 1, then you really want $h_i(x)$ to equal 1, or vice versa. You guys can figure it out.
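Concretely, with $b_{iy} \in \{0, 1\}$ the $(i, y)$ entry of that code matrix and $h_i$ the learned bit predictors, the decoding step is
$$\hat{y} = \arg\min_{y \in \{1, \dots, K\}} \sum_{i=1}^{O(\log K)} \mathbf{1}\big[h_i(x) \neq b_{iy}\big],$$
and it is this arg-min over all K codewords that step 3 below refers to.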
>>: [Indiscernible].
John Langford: So what I mean is there's really a matrix where one dimension has size order log K and the other dimension has size order K. And $b_{iy}$ is the (i, y) entry of that matrix. But each predictor is only trying to minimize the error rate on one row of that matrix. So this is really $h_i(x)$.
>>: That should be B sub Y maybe? You create log K binary vectors, each one is B sub Y, and B sub iY is --
John Langford: This is the entry. This is the entry.
>>: Yeah, you're confusing the entries with the vector.
>>: B sub Y is the value.
>>: Yeah.
John Langford: Oh, man. So here's my fundamental flaw: I took general relativity and I do Einstein summation notation. So I always notate things this way: I talk about the entry and define its indices. So this is a scalar, which is the (i, y) entry of the matrix. And then the whole transpose thing people do, that really confuses me.
>>: [Indiscernible]
John Langford: I'm trying but I don't really understand --
>>: Because you don't have log K, [indiscernible].
>>: [Indiscernible].
John Langford: I don't understand why it's unclear, I guess.
>>: It's clear.
John Langford: Maybe we should discuss it later. Seems like a distraction.
So the question is why is this not logarithmic time.
>>: [Indiscernible].
John Langford: Say again?
>>: [Indiscernible].
John Langford: Yeah, so the length-K vectors, I think that's not really a problem, because we kind of imagine that we have at least one example of each label. So we can amortize the length of the vectors across individual labels.
>>: [Indiscernible].
John Langford: Yeah, we want to do logarithmic time prediction per example.
>>: [Indiscernible]. So if you represent the label as just a number, the vector could be the binary expansion of this number.
John Langford: Could be. But I will still say this is not logarithmic time.
Yeah, step 3, when you find the Y with minimal error that's typically an order K operation.
>>: You've had zero [indiscernible].
John Langford: Your lookup table, that could be true. This is order log K with a constant larger than 1, so the lookup table is not size K in that case; it's going to be kind of large.
>>: It would depend on [indiscernible]. Matrix, any prediction produced can be found in constant time
[indiscernible].
John Langford: Say it again?
>>: So if I say it's really that 1 is very dense.
John Langford: So B is dense.
>>: Yeah and then whatever you produce is step 2, you just find the binary representation of the
[indiscernible].
John Langford: Why is it constant time?
>>: Well, to make sure that, well, there's no, the label produced in step 2 gives you an error of zero in step 3.
>>: If you have [indiscernible].
John Langford: If you have no error. We don't want to worry about errors.
>>: [Indiscernible] let's say you have [indiscernible].
John Langford: Maybe. This particular paper doesn't do that. But I'm not saying there isn't an
approach along these lines, but the paper certainly didn't do that. The paper was just saying, look,
we're going to check all the individual labels.
>>: So he said kind of like a [indiscernible]. So why should that be linear? This instance would be a really horrible way to solve this problem.
>>: [Indiscernible].
>>: [Indiscernible] like to me it seems to be more an issue of the problems that you create in step 2 being difficult. That's the thing that I think is the worse problem here. Because step 3 you can just solve with a trie data structure, I think.
John Langford: I will get into that. And I think these two problems are related. When you have a particular problem and a representation that you're working with, you really want to be tuning the bits here to match the problem representation that you care about. Which means that it's not going to be the bit representation which happens to give you logarithmic time. And I mean --
>>: I think this is the problem. The problem is not, if you didn't have the problem that some combination [indiscernible] is not linear, right, you can solve it. But the other issue is that the problems you generate, from the B matrix, are such that you're just going to get horrible results.
>>: [Indiscernible].
John Langford: So prediction is, I think, okay, let me rephrase this in a way that is falsifiable. The prediction is omega of K if you're tuning the Bs to match your representations. Because then you don't [indiscernible]
>>: [Indiscernible].
John Langford: So another approach people use: step 1, you build a confusion matrix of errors. Step 2, you recursively partition to create a hierarchy, maybe using some sort of spectral method. Step 3, you have a hierarchy now and you just use your favorite hierarchy solution. And I would say this is not what we want either. So why is that?
>>: You have cumulative errors.
John Langford: You have cumulative errors coming down, that's true. But step 1, I think, is the really killer one. If we go back and think about this graph, you can afford to spend more time on training than on testing, maybe up to 10,000 labels or something, but beyond that, something which is logarithmic time in training also matters.
>>: By this you mean logarithmic overhead compared with the size of datasets. Right?
John Langford: I'm thinking about logarithmic time per sample, per example.
>>: Per example. Yeah.
John Langford: So this doesn't seem like the right thing. And then there's another trick here, which is pretty cool: unnormalized learning. So this is a simple one-against-all approach: you have a predictor for each class label, predicting whether it's this class label or not. So [indiscernible]. There's a trick you can use to make the training very fast. You say: here's the example, we're going to train the Y regressor with (x, 1), and then pick a random other regressor and train it with (x, -1).
And now, for every example, you just do two updates, so the training is essentially constant or logarithmic or something like that. So that's fine. But the prediction is still going to be omega of K. So this approach is actually kind of effective: it's good for training but not for prediction.
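For concreteness, here is a minimal sketch of that training trick (not the authors' code; the squared-loss update and the learning rate are illustrative assumptions):

```python
import numpy as np

def sampled_oaa_update(W, x, y, lr=0.1):
    """One online step of one-against-all with negative sampling: update the
    regressor of the true label toward +1 and one random other regressor
    toward -1, so training costs O(1) per example instead of O(K)."""
    K = W.shape[0]
    # positive update: push the true class's score toward +1 (squared loss)
    W[y] += lr * (1.0 - W[y].dot(x)) * x
    # negative update: pick one random other class, push its score toward -1
    j = np.random.randint(K - 1)
    if j >= y:
        j += 1  # skip the true label
    W[j] += lr * (-1.0 - W[j].dot(x)) * x
```

Prediction, by contrast, still has to score all K regressors and take an arg-max, which is the omega-of-K cost just mentioned.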
Okay, so now we want to think about how we actually do this. And we're going to think about learning a tree, which has the problem of accumulating errors, but we have to start somewhere. And we need to think about how we actually want to do things. So let's imagine we're doing digit recognition, and we have the digits 1, 7, 3, and 8. We can do an easy split at the root, between {1, 7} and {3, 8}, and then have the hard splits further down, between 1 and 7 and between 3 and 8. Or we can have a hard split at the root, between {1, 8} and {3, 7}, and have the easy splits further down. So the question is: do you prefer having the hard split at the root, or would you rather have the hard splits further down?
Any opinions?
>>: [Indiscernible].
John Langford: You think you prefer this one? See, we preferred this one.
Okay, so the claim is that it's better to have the confusing, difficult problems near the leaves. And the reason is that you have some representation which is being used at the root, and that representation is getting hit by all the samples, so it becomes very constrained in the sense that it needs to get a lot of things right. Further down, near the leaves, you get fewer samples and you need to get fewer things right.
>>: [Indiscernible] mass samples, you want the easy to classify. You have more samples, you hope
you can deal with your distribution there.
>>: That is like making a status symbol from where's your main [indiscernible], what kind of
expressivity points.
John Langford: So, yeah, this is a point about representation and capacity. Because I think often representation capacity ends up constraining your performance. I think the way it works is: representation capacity constrains your performance at the root, and sample complexity constrains your performance towards the leaves.
>>: Who is B W G?
John Langford: Bengio, Weston, and Grangier.
>>: So there's another point, which is, for a lot of these algorithms, if you make a mistake at the root then you have nowhere to go but back up. So mistakes at the root are more costly, so you might want to put the easy problems there.
>>: [Indiscernible]
John Langford: We will actually have some ability to recover from mistakes because we'll have the
same label maybe in multiple leaves.
Okay. So we need to think about what a good partition into a left and a right branch is. And this is easy when you only have two classes, but it's fairly subtle when you have many classes, because it's very fundamentally non-convex. So we have a bunch of examples and we want to look at the quality of some partitioner; we want some way to judge how good different partitioners are. And the claim here is that a decent way to do this is: if the probability of A times the probability of B were equal to the probability of A-and-B, it would be as if A and B were independent. But instead, what's happening here is that we're trying to maximize that difference. So we're trying to maximize the dependence between the left/right decision and the label. So it should at least intuitively be something along the lines of what you want.
>>: This is like a purity criteria of sorts?
John Langford: It's not just a purity criterion, because you also want a similar number of examples to be going left as right. So we're weighting this by the fraction of examples and not just uniformly over classes.
>>: So is it going to be something like a correlation coefficient, like a Pearson correlation coefficient [indiscernible]?
>>: Purity usually weighs, no?
>>: This correlation between H and Y; right?
>>: [Indiscernible].
John Langford: Could be. I don't know if there's some correlation coefficient [indiscernible]. So in the paper there's a property called purity and a property called balance, and you kind of want both of them. And this achieves both of them, or at least it drives you toward both of them.
>>: I was thinking about the kind of typical purity measure or impurity when you kind of just growing
the trees; right?
John Langford: Yes.
>>: So that one usually weights by the sizes? It's not purely just, you know, where your computation
looks like?
John Langford: So if you look at a typical entropy like thing, what you see often in decision trees, it
looks kind of like this. I wish I could draw better. And then what we're doing looks a little bit
different. It looks kind of like this.
>>: Right. Okay.
John Langford: It's a bit sharper. Since we're trying to maximize things, this is somehow a bit less
non-convex than that, which makes it a bit easier to optimize later.
So a different form of this, which I think is a bit more operational: you can factor out the distribution on Y, and then you can see that you're looking at the fraction of the time that you go right versus the fraction of the time that you go right conditioned on the label. Those are the fundamental quantities that you're working with.
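For reference, the criterion from the paper, as I recall it (up to constants), scores a partitioner $h$ by
$$J(h) = 2 \sum_{y=1}^{K} \pi_y \,\Big| P\big(h(x) > 0\big) - P\big(h(x) > 0 \mid y\big) \Big|,$$
where $\pi_y$ is the frequency of label y: it is large only when the split is both balanced ($P(h(x) > 0) \approx 1/2$) and pure (each label goes almost entirely to one side).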
And then you can see why this is non-convex. When you're doing classification, if you take all of your labels and you swap them, you get something terrible performance-wise. But if we take our partition and we swap it, you end up with the same value here. And now if you take an average of those two, you get something terrible, because the average of a good partition and the complemented partition is going to be a tie everywhere, and nothing good is going to happen if everything is a tie.
So that's kind of rough. I mean we're used to convex optimization. Convex optimization is much
easier. But this seems kind of hard.
So we don't just need to optimize this criterion at an individual node; we need to optimize it at many nodes across the entire structure. And one question is: can we build things in a bottom-up fashion? A natural approach you might imagine: you take two random labels, you build a predictor between the two, you combine with other subtrees to make a supertree, and you build up some sort of tree-like structure.
And the claim is this does not work very well. The canonical example is here. You have three labels on the line, a single feature, which is position on the line, and you're doing linear prediction. You pair 1 and 3 and build a great predictor between 1 and 3. And then you pair that with 2, and there's no linear predictor which can distinguish 2 from 1-or-3.
So some sort of randomized matching and then build up to get the big structure approach seems to not
be viable in general. At least we haven't been able to figure out a way to do it.
So now the question is: can we do things in a top-down fashion? Right? And it turns out you can. There's this notion of decision tree learning as boosting, which people worked on previously, and you can take this proof and redo it for the multiclass case. So we imagine that we make some progress gamma on our criterion at each node, just like a boosting assumption, and then you can prove that the number of nodes in your structure is something like this. And the gamma also gets you a log K [indiscernible].
So this is something nice: it says if we optimize this criterion effectively at every individual node, then what we can hope to get is a small error rate.
>>: So this is like the basic boosting argument: you reduce entropy at each step, so at the bottom you must have pure leaves?
John Langford: Yeah, so inside the proof it's doing that. Now, this is going to be a little bit distressing: it says that if gamma is small (and it's in the exponent), that's bad. So we can't really afford to have a small gamma. You need to be doing strong learning, not weak learning, in the individual nodes. We really want to strongly learn the predictors in the nodes of the tree; otherwise the tree is going to be enormously large.
So now the question is: how do we actually get a good gamma, a big gamma, at the individual nodes of the tree? And there are a couple of common tricks that we use. The first one is that you can relax the optimization criterion; this is kind of a heuristic. We relax it to use a score, a margin-like thing, rather than the hard left/right decision. So if you go back: here we were looking at just, did we go left, did we go right. Here we're looking at the expected score on a minus 1 to plus 1 range, and then everything is the same.
But even this is kind of hard to optimize, because every time you change the parameters of your predictor, you have to recalculate this expectation exactly over all of your examples. That doesn't work very well, so we approximate it with running averages: you keep some running average of what this value is, and running averages of what these values are, one for each Y. And then as your learning rate converges toward zero, the running averages should converge to the expectations.
So the algorithm at the node then looks something like this. You start with all the running averages at zero. You compare $E_y$ to $E$; here $E$ is the unconditional expectation of the score, and $E_y$ is the expectation conditioned on label y. You create a label which is plus 1 or minus 1 depending on that comparison, and then you update the weights using that binary label. So we're using an online learning algorithm, an online classifier, underneath. And then you update the running averages just to keep everything up to date.
So this is kind of the core update rule that you apply to an individual node. And then of course you're
going to go left or go right depending upon what your update says. And then you're just going to
recurse, doing this at all the nodes in your structure.
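Here is a minimal sketch of that per-node update (not the released implementation; the margin-style step, the decay schedule, and the sign convention are illustrative assumptions):

```python
import numpy as np

class Node:
    """One internal node: an online binary router plus running averages."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # linear scorer: s = w . x
        self.e = 0.0             # running average of scores, all examples
        self.ey = {}             # running average of scores, per label y
        self.lr = lr

    def update(self, x, y):
        s = float(self.w.dot(x))
        ey = self.ey.get(y, 0.0)
        # induced binary label: route label y toward the side where its
        # conditional average score sits relative to the overall average
        b = 1.0 if ey > self.e else -1.0
        if b * s <= 1.0:                     # margin-style online step
            self.w += self.lr * b * x
        # keep the running averages up to date
        self.e += self.lr * (s - self.e)
        self.ey[y] = ey + self.lr * (s - ey)
        return 1 if s > 0 else -1            # which child to recurse into
```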
Yeah?
>>: So why is this W transposed?
John Langford: Yeah, so we're using linear predictors in our experiments. But it should work with any online classifier outputting a score that you can use up there.
So now the question is what happens experimentally. There are many different experiments you can do. One experiment: say we have a fixed amount of time, and we subsample the datasets so that the amount of time used by the individual algorithms is the same. So we can compare this logarithmic online multiclass tree, the LOMtree, to one-against-all, looking at the accuracy; higher is better, for a bunch of different datasets that we found. And what you see is: for this one, I think one-against-all is better, and then it just gets worse and worse and worse as you get to more and more classes.
So on a per-computational-time basis, the LOMtree is getting much better accuracy than an order-K approach.
>>: There's one correct label?
John Langford: Yeah.
>>: Oh, there's always just one correct label?
John Langford: Yeah.
>>: And everything else is bad.
John Langford: Yeah. So this is the 22K version of ImageNet, the big version of ImageNet. This is the ODP hierarchy that we got from Paul Bennett.
>>: One could argue that if you have a hierarchy you could measure your error proportional to the
graph distance in the hierarchy or something.
John Langford: You could argue that. We're trying to see how good this is at just flat-out, straight-up large multi-class learning.
>>: If you have a hierarchy and you want to use the method, is there some way you could exploit the hierarchy using this method?
John Langford: There's no interface for that right now. I could certainly imagine making the training process obey a hierarchy to some extent. Right now, this is really just an online procedure, so the order of examples really matters. If you had some preexisting hierarchy, maybe you could use it to make the order of examples matter less.
>>: Is there a way to see this as some kind of stochastic sub-gradient descent or something like that? Just trying to understand what high-level update rule these five steps implement.
John Langford: It's really just these two steps. [indiscernible].
>>: [Indiscernible] so. I see. So then the plus 1 / minus 1 feels like a sub-gradient on the absolute loss. Okay.
John Langford: We have this loss structure, so it's --
>>: So I mean, in that case, is there a way to incorporate hierarchy? So essentially, I guess, [indiscernible] rather than having 0/1 error you might have some version of nearest-common-ancestor error, and there might be a way to write it in [indiscernible] to be able to create --
John Langford: I think it's possible. We haven't done it. But I think it's possible.
>>: So it sounds like your measure of purity, right, would be not treating all the classes equally
[indiscernible] multiplier; right?
John Langford: Yeah.
>>: So but instead you would be incorporating it into the measure of purity.
John Langford: Another thing to think about for a moment: you can see how this optimization fairly directly implies a plus 1 / minus 1 classification. If you have this kind of thing that you care about, it's a bit less obvious exactly how you want to do it. Since we started this, we have found some other approaches for optimizing linear predictors on this kind of criterion, but they involve things like creating a sigmoid: you pretend that there's some fraction of the time that you go left and some fraction of the time that you go right, and the optimization continues on from there. And then you have to worry about regularization maybe a little bit harder, and so forth.
So again for fixed training time, and this is test error now, so lower is better, we can look at the test error for a bunch of other logarithmic-time approaches. You can have a random tree; you can use the filter tree, an early paper that I worked on, which is also logarithmic time. It doesn't learn the structure, but there's something nice there. So we see that generally the test error is smaller, sometimes much smaller, depending upon the degree of compatibility between the representation and the problem.
>>: So hierarchy is used for [indiscernible] tree?
John Langford: It's just a bit representation of a label.
>>: [Indiscernible] and random is what?
John Langford: Just a randomly created hierarchy.
>>: [Indiscernible]
John Langford: Just training in the way that you normally would. So with the filter tree, the issue is: if you train top down, "is this subset or that subset" is the problem, because it takes an existing training algorithm. The filter tree fixes that, and I think it trains bottom up on a fixed hierarchy.
So the fixed tree is better than a random tree. And then the LOMtree seems to be as good as or better, sometimes much better, than the other trees.
So now we can also run one-against-all, at least on some of the smaller datasets: we don't care about computation, we just try to optimize for the best prediction performance we can get. And what we see is there's still some gap, right? But this is where we are right now: we've closed the gap statistically, so logarithmic-time prediction is much closer to one-against-all than it was previously. But it's not the case yet that you just want to use this no matter what; you need to care about the computation to prefer this.
And the last question you might ask is: do we actually get logarithmic time in prediction? And the answer is yes, we do. The reason why this question is nontrivial is that when you have these branching structures, they're not nearly as cache-local and highly optimized as [indiscernible] operations or whatever. So you expect to pay some penalty associated with the fact that you're branching. And there is a penalty, if you look at this, but indeed it's still generally logarithmic time.
On one of these numbers, let me go back: for this ImageNet one, Leon Bottou trained a [indiscernible] network, scraped off the top layer, and then used that to create feature vectors for all of the examples in the 22K version of ImageNet. And the prediction here takes half a millisecond. So that's decent.
>>: John?
John Langford: Yeah.
>>: Do you have a sense of where [indiscernible] on that graph?
John Langford: Yeah, so I think Manik's approach is going to be slower. We haven't done side-by-side comparisons; I've tried to understand the performance characteristics of [indiscernible]. I think it's roughly a factor of 10 more state and a factor of 10 slower than this approach, on the datasets that he was applying things to. But the accuracy, I think, is better than what we're getting.
>>: Should be around the same as performance, right [indiscernible]?
John Langford: I'm not entirely, I expect it will perform better. I don't know if it completely closes the
gap.
>>: [Indiscernible].
John Langford: What's that?
>>: So say it's going to have the better accuracy but slower?
John Langford: It's better accuracy but slower and more state.
>>: So what complexity does it actually have?
John Langford: FastXML?
>>: Yeah.
John Langford: It's not analyzed, but Manik is not using online learning; he's using some sort of iterative optimization. And the number of iterations is not very controlled, but in practice it's relatively small.
>>: At prediction time for example.
John Langford: Well, it's logarithmic time for each individual tree, but then there's a bunch of trees; he's doing [indiscernible] things or random-forest-like things. So it's dependent upon how many individual trees he uses.
>>: I see, so is there any intuition for how the number of trees you need scales with the number of classes, or is that one of those things that depends more on the problem than on the number of classes?
John Langford: [Indiscernible].
Okay, so we can predict in time logarithmic in K.
>>: There's a reason that that [indiscernible] talk.
>>: [Indiscernible].
John Langford: So right now the LOMtree does not incorporate the benefits of the filter tree. When you have a high noise rate, it's good to have a training rule which is consistent, and that's what you encounter with multi-class prediction. So maybe there's a way to fix that; when I get a moment I'm going to try some things. I think we do know how to do that. And there's also this issue that your representation complexity is constrained at the root and your sample complexity is constrained at the leaves. There's something fundamentally uncomfortable about a decision tree there. So maybe there's some other approach, less decision-tree-like, that also gives you logarithmic time.
Other questions? Yeah?
>>: So when the problem [indiscernible]?
John Langford: No. So this is an online approach. And there's actually a bit more to making things online that I didn't describe. You can have nodes that become obsolete because all the examples go away from them, and there's a process for recycling nodes in an online fashion, which is described in the paper. So when the distribution is changing and you want logarithmic time, I think there is no good alternative to this; I'm aware of nothing else that plausibly works.
Okay. Thanks a lot.
[applause]
Sebastian Bubeck: So now we're going to be Bayesian, and instead of minimizing a convex loss, what we want to do is to sample from a distribution which is derived from this loss. I will tell you exactly what I mean. And what we will see is that the algorithms that we know and love for minimizing a convex function can also be used to sample from the measure derived from this loss.
So, as always for the past year, my story starts with a convex body. So K is a convex body: a convex set, compact, with nonempty interior. And we're given a function F, from K to R, which is convex. So we have a convex function defined on our convex body, and I'm going to assume this function is regular in two senses. It is L-Lipschitz, so the norm of the gradient is bounded by L; and it's beta-smooth, so the gradient map itself is Lipschitz: the Euclidean norm of $\nabla F(x) - \nabla F(y)$ is at most beta times the norm of $x - y$.
And the object we're going to be interested in is the log-concave measure derived from F. So mu is going to be a probability measure on K, and I'm going to write its Radon-Nikodym derivative with respect to the Lebesgue measure, the density of mu, as
$$\frac{d\mu}{dx}(x) = \frac{1}{Z}\, e^{-F(x)}\, \mathbf{1}\{x \in K\}.$$
Okay. And Z is just the normalization constant that makes this a probability measure:
$$Z = \int_K e^{-F(x)}\, dx.$$
And the problem we're going to be interested in, the goal, is to sample from mu, given some oracle access to mu and K. So that's going to be our problem. For instance, this morning [indiscernible] talked about decision trees and how to optimize the weights on the decision trees; what he was interested in was minimizing F over those weights. But now we're taking this Bayesian approach where every weight vector has a certain probability, and then you might be interested in doing different kinds of inference using the distribution itself. So this is a basic task. And the constraint set K: for instance, in our first talk, you had this set of proper, valid weightings of the tree, so that gives you a K. So this directly appears in real-life examples.
So this problem has a long history; I'm going to tell you about it in a minute. But first let me tell you about the result, the theorem that we proved. This is, by the way, the paper that I have at NIPS with [indiscernible]. So the algorithm is the simplest you can think of, in some sense. Let $\xi_1, \xi_2$, et cetera, be an i.i.d. sequence of Gaussians $N(0, I)$, and let $\eta > 0$ be a step size. And what we're going to do is the following. We start our initial point at 0: $\tilde X_0 = 0$. And I'm going to call my sequence X tilde; there will be another sequence, which is more deserving of the name X. So the update is the following. I'm at $\tilde X_k$. What do I do? Well, the points where F is small have higher probability, so I want to drift towards the minimum of F. To do this, I know how to do it: I just do a gradient step, minus $\frac{\eta}{2} \nabla F(\tilde X_k)$. But if I only do that, I will go to the minimum, and I want to keep some variability. So I'm just going to add one of those Gaussians, I'm going to add some noise: plus $\sqrt{\eta}\, \xi_k$. So it's very much related to stochastic gradient descent: the noise has mean zero, so in expectation this is just a gradient step. The catch is that the variance of my noise is exactly equal to the step size, so the usual theorems saying that stochastic gradient descent finds the minimum do not apply, exactly because the variance is matched with the step size.
And the theorem is very simple. We need some kind of normalization somewhere, because K was just an arbitrary convex body, so I'm going to assume that K contains the unit ball and that the diameter of K is at most R. And to simplify, we're going to assume that L and beta are numeric constants; the theorem in the paper has all the parameters. Now the theorem goes as follows. Total variation distance: you remember, the TV between nu and mu is the supremum over all events A of the difference between the probability that nu assigns to A and the probability that mu assigns to A. So if I control this, I control the probabilities of all events uniformly; it's a very strong notion of convergence. So we are going to say that the total variation distance between $\tilde X_N$, after doing capital-N steps, and mu is less than epsilon if I set eta to be a constant times $R^2 / N$ (that's what we prescribe for the step size) and, most importantly, if capital N is at least something like $(\sqrt{n}\, R / \epsilon)^{12}$. This looks pretty good. Except that it's to the 12th power. That looks less good.
But the point is that it's polynomial in n. So what is the difficulty of this problem?
Yes? Of course.
>>: Mu is a measure.
Sebastian Bubeck: Mu is a measure. Yes.
>>: And X [indiscernible] N.
Sebastian Bubeck: Yes, this is a point. Yes.
>>: So what's the total variation between [indiscernible].
Sebastian Bubeck: First of all, I'm conflating two things. One of them is: here, you know, I identify the random variable with its law. So this is a random variable, and this is the underlying law; I identify the two. But more importantly, I forgot the most important thing, which is that, just like in projected stochastic gradient descent, when you do a step of gradient descent you might step outside of your constraint set, so you need to project back. So you project back, and this will actually be the whole difficulty. So there's a $P_K$ here, the Euclidean projection onto K: if I step outside, I project back onto K.
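For concreteness, here is a minimal sketch of the whole algorithm, taking K to be a Euclidean ball so the projection has a closed form (the ball, the Gaussian potential in the usage line, and the step counts are illustrative assumptions, not the paper's experiments):

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius (our K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_lmc(grad_f, dim, n_steps, eta, radius=1.0):
    """Projected Langevin Monte Carlo: gradient step plus Gaussian noise,
    then project back onto K. Returns the final iterate, approximately
    distributed as mu proportional to exp(-F) restricted to K."""
    x = np.zeros(dim)                        # start at the center of K
    for _ in range(n_steps):
        xi = np.random.randn(dim)            # fresh N(0, I) noise
        x = x - 0.5 * eta * grad_f(x) + np.sqrt(eta) * xi
        x = project_ball(x, radius)          # step back inside if we left K
    return x

# usage: one approximate sample from exp(-|x|^2 / 2) on the unit ball in R^5
sample = projected_lmc(grad_f=lambda x: x, dim=5, n_steps=20000, eta=1e-3)
```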
So then I do indeed get a point in K, and I say that the total variation distance is less than epsilon provided I run that many steps. And if F is 0, meaning that I'm really interested in the uniform measure on K, then we can improve it a little bit: the 12 becomes a 7.
So let me tell you some of the history of this work. Not much, but a little bit.
>>: So what does F equal to 0 do?
Sebastian Bubeck: I'm sorry? F equal to 0 just means F is constant, so I want to sample a uniform point in K.
So the first ones to do this, to prove that there exists a polytime algorithm to sample from this type of measure, were Dyer, Frieze, and Kannan, in '91, I think. And what they did is essentially this thing, exactly this; they were interested in the case where F is constant, so there is no gradient. But instead of the projection, what they were doing is: if you step outside, then you just repeat. You don't move, and you resample. The thing is that this type of Markov chain, called the ball walk, will not mix from any starting point. If you start in a corner, you can get stuck, and it will take exponential time to make a move. You see, if you are in a corner like this and I am here, then I have my small ball, and most of the time I'm going to step outside. It's going to take a very, very long time for me to start moving.
And what they proved is that from a good starting point, you will mix in polynomial time. Whereas what we do mixes from anywhere; I didn't say anything about where you start. And why do you mix from anywhere? Because even if I start here (you see, the picture I'm going to draw will make sense in twenty minutes), I will move like this, I will escape like this. I will keep bouncing off the boundary, and this will make me escape very fast.
So the first one was Dyer, Frieze, and Kannan; I think they got something like N to the 21, I don't remember. And the current record is by Lovász and Vempala, and they get an amazing N to the 4, using what is called hit-and-run. Hit-and-run is the following.
It's very simple. You are at the point $X_k$, and $X_{k+1}$ is going to be sampled from a certain distribution, which is the following. I first take a direction L uniformly at random from the sphere. Then I sample $X_{k+1}$ from mu restricted to the line through $X_k$ in direction L. So I was here, that's my $X_k$. I take a random line, okay, and I take a point at random: not uniformly, but from the restriction of mu to this line. So: sample from mu restricted to the line $\{X_k + t L\}$. This is a one-dimensional problem, and you can solve it in many ways.
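A minimal sketch for the uniform case (F constant) on a Euclidean ball, where the chord through the current point has a closed form; for a general mu you would sample from mu restricted to the chord instead of uniformly (the ball and the quadratic chord computation are illustrative assumptions):

```python
import numpy as np

def hit_and_run_ball(x, n_steps, radius=1.0):
    """Hit-and-run for the uniform measure on a ball: pick a uniform
    direction, intersect the line with the ball, sample on the chord."""
    for _ in range(n_steps):
        u = np.random.randn(x.size)
        u /= np.linalg.norm(u)            # uniform direction on the sphere
        # endpoints of the chord: solve |x + t u|^2 = radius^2 for t
        b = x.dot(u)
        disc = np.sqrt(b * b - (x.dot(x) - radius ** 2))
        x = x + np.random.uniform(-b - disc, -b + disc) * u
    return x
```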
The point is that these two algorithms are very different, and they definitely get a better dependency than this. But hit-and-run is a zeroth-order method: all it needs is to query F at points, the value of F. What we have is a first-order method: you can query the gradient of F. So what I expect, or what I hope, is that this type of method should be better in practice; it uses much more information about the function, so it should mix faster. It's not yet shown in the theory, but that's what I hope to obtain; I will tell you towards the end about real experiments. So that's one thing. Hit-and-run also mixes from any starting point.
Yes?
>>: [Indiscernible] it had to go through a succession of convex bodies. You don't just start with an arbitrary --
Sebastian Bubeck: So that's if you want to compute a volume.
>>: Oh, so for the something that he had [indiscernible].
Sebastian Bubeck: For sampling, this works directly here. And they proved (I mean, you know, there are many papers, but one of the most recent ones, which is still, I think, seven years old, shows this) that it mixes from any starting point. So you don't need a warm start or anything like that. Which is just like what I do.
The one thing which is tricky is: yes, we're competing with them, and they worked a lot on this; it's very optimized. But still, I mean, okay, our point was to prove polynomial time. And let me say something else: theirs is N to the 4 times log of 1 over epsilon, whereas we get polynomial in 1 over epsilon. And this is a key difference.
Yes?
>>: So let's kind of not waste [indiscernible].
Sebastian Bubeck: Yes.
>>: So suppose I can solve the optimization problem. Would you expect I can get a better algorithm if I just do a series of full minimization problems on [indiscernible] functions?
Sebastian Bubeck: I don't know. I don't think so, but I don't know; it's a good question. But here, you will see, what is underlying this is a certain diffusion process which has mu as its stationary measure. So it's very natural to just discretize the diffusion process, and that leads to this stochastic-gradient-like update, in a sense. But it's in that order that you do it: you start with the continuous thing, you discretize, and you get this; and then you see that there is a connection with optimization. But maybe there are other ways to do it.
>>: Probably in that process you're doing sort of a [indiscernible] approximation somewhere, which
[indiscernible]?
Sebastian Bubeck: So you will see. It's a bit subtle, because you're working with those random variables and you need to couple them. So you will see; we're going to do that.
>>: Just to make sure, in your main selling point is that you think this one does better.
Sebastian Bubeck: My main selling point is that this is a natural algorithm that you may want to use, and one that is actually used a lot. This is called Langevin Monte Carlo, going back a long time. And it has been analyzed without the projection; that was going to be my next point. People in statistics really look at this, and it's known how to analyze it when you don't project. The projection induces a lot of real difficulties, and we have a way to deal with them, which can give you new insight about how to deal with those projections. Because this is something you really deal with quite often.
>>: If you don't project, what happens if you step out of K? You just ignore that step?
Sebastian Bubeck: If you don't do the projection?
>>: Yeah, you just ignore it.
Sebastian Bubeck: Yeah, that's what Dyer, Frieze, and Kannan did. You just say, okay, I don't move, I try again, and you count the number of failures. But this will not be polytime from any starting point.
So Dalalyan in '14 gave an analysis of LMC for K equal to $R^n$ and F strongly convex. So our starting point was this very nice paper of Arnak Dalalyan, who showed how to analyze this in the case where the potential has curvature and there is no constraint. And I'm going to walk you through it. It's very nice, very simple, and then you will see what the difficulties are and how we deal with them.
Just one last thing that I want to say about this. Here, with hit-and-run, you're playing with the boundary at every step: at every step you have to go and see how far you can go in this direction. Whereas what happens with this guy is that most of the time you don't have to play with the boundary; it's only from time to time that you step outside, you project, and then you wander around inside again until you project at some point. So there is some notion of average complexity which should be lower for this algorithm than for hit-and-run. This we didn't do, but in practice you see it. With hit-and-run, you just have to do it every time.
Yes?
>>: You just said it has to do with, you're assuming that the minimum of the function is somewhere deep in the interior of the convex body. If it's on the boundary then you are --
Sebastian Bubeck: That's a good point. So I mean, yes, it's almost a good point. [laughter] No, no, I will make this precise; we will need exactly this. But the point is that for any F which is smooth enough, there is actually not too much mass near the boundary. The only way to really have a lot of mass near the boundary is for the minimum to be at the boundary and for the function to explode very fast there, and this is prevented by those conditions.
>>: So these are the units for, so F is just constant.
Sebastian Bubeck: Yes.
>>: And [indiscernible] most of the volume is [indiscernible].
Sebastian Bubeck: So it depends on what you mean; it always depends on what you mean by "close". The thing is, if I have my convex body here and I take a little slab here of width like poly(1/n), then this has exponentially small mass. The thing is, it's in between: there is no mass too close to the boundary and no mass too close to the center. It's really in this thin shell. Yes, so that's the picture, I think. At least to me.
So let's see how to analyze this in the strongly convex, unconstrained case, and you will see it's simple and nice. So the first thing that I want to do is tell you about the continuous process which is underlying this thing. So let's introduce a Brownian motion. You remember: $W_{t+s} - W_t$ is a Gaussian with variance s (times the identity), the increments are independent of each other, and $W_0 = 0$. So this is the definition of Brownian motion.
Now, the point is that this type of update you can write using the Brownian motion. So now we are in the case where K is equal to $R^n$; there is no projection. In this case, I can rewrite the recursion as follows. Define the continuous-time interpolation $\bar X$ by $\bar X_{k\eta} = \tilde X_k$. And now what I want to say is: what is the increment of $\bar X_t$ in time dt? From $\tilde X_k$ to $\tilde X_{k+1}$, how much time passes? I think of this as time, so an $\eta$ time step passes. In time $\eta$ I move by $\eta$ times the gradient term, so in time dt I move by dt times it, with the gradient evaluated at the previous grid point; and I get the increment of the Brownian motion, which is just $dW_t$. So:
$$d\bar X_t = -\frac{1}{2}\, \nabla F\big(\bar X_{\eta \lfloor t/\eta \rfloor}\big)\, dt + dW_t.$$
This equation really gives me this process. From one grid point to the next, by how much do I increase? Well, over an $\eta$ time step I get $\eta$ times the gradient at the previous point, and I get the $dW_t$ increment over an $\eta$ time step, which is exactly a Gaussian with variance $\eta$. Which is exactly what I want.
But you see, as soon as you see this, what do you want to write? What you want to write is
$$dX_t = -\frac{1}{2}\, \nabla F(X_t)\, dt + dW_t.$$
I don't want to freeze the gradient at grid points; I want to take the gradient continuously and move continuously in the direction of the gradient.
So this is a diffusion process, and its stationary measure is mu. Okay, how do you see that? I mean, at least for me, when I look at it, it's not clear at all that this has mu as its stationary measure. But I will leave this as an exercise; it's an easy one. You can write the Fokker-Planck equation. What does that mean? The Fokker-Planck equation just describes the time evolution of the density of X. X is a random variable, it keeps changing, and its density, call it $\rho_t$, keeps changing. So let's say $\rho_t$ is the density of $X_t$.
Then what you can write, and this is a calculation, is
$$\frac{\partial \rho_t}{\partial t} = \frac{1}{2}\, \nabla \cdot \big(\rho_t\, \nabla F + \nabla \rho_t\big),$$
where $\nabla \cdot$ is the divergence; let's not worry too much about it.
So how does the density evolve with respect to time? It evolves according to this formula. So now let's look at this formula for a minute. If I plug in $\rho_t = \mu$, let's see what happens. What is the gradient of mu? You know what mu is: mu is $e^{-F}$ over Z. So what I get is exactly minus the gradient of F times $e^{-F}$ over Z, that is, $\nabla \mu = -\mu\, \nabla F$. So you see that mu makes this term vanish: the gradient of mu is exactly $-\mu \nabla F$, those two things cancel, and we see that if you start at mu, you stay at mu. Okay, the density doesn't change.
So there are many ways to derive this. The simplest one, which I prefer, is just Itô calculus and integration by parts. The fancy way is to write down the generator of this diffusion; this operator is the adjoint of the generator. That's how you move from describing the Markov process through its generator to the Fokker-Planck equation. But anyway, this is all very standard and easy.
So now we know that mu is the stationary measure, but we want to know how fast $\rho_t$ goes to mu. And this is also very standard in this case, when you have some convexity; the keyword is the Bakry-Émery curvature condition, which gives the following. If I look at the total variation distance between $X_t$ and mu (and right now I'm only talking about the continuous-time process; there are two steps, first understanding how fast the continuous-time process goes to the stationary measure, and then understanding the discrepancy between the discrete-time process and the continuous-time one), it's easy to see that
$$TV(X_t, \mu) \le e^{-\alpha t} \sqrt{\chi^2(X_0, \mu)}.$$
So what is the chi-squared distance between two densities, say G and H? It's just
$$\chi^2(G, H) = \int H \left(\frac{G}{H} - 1\right)^2.$$
Okay, you remember the relative entropy between G and H is just $\int G \log \frac{G}{H}$; for the chi-squared, I just replace $\log x$ by $(x-1)^2$. So we get this. Why do we get this? Here is how you do it. There is this theory of Bakry-Émery curvature, which is a certain condition on the generator. You verify that the generator satisfies it, and then you get a Poincaré inequality, and the Poincaré inequality directly tells you that you get exponential convergence in L2. And then you want TV, so you do a Cauchy-Schwarz, and what comes out is the TV on one side and the chi-squared on the other side. So this is all standard. And now the point is that this chi-squared is at most exponential in n for $X_0$ a Gaussian.
So if the distribution of the starting point is a Gaussian, the chi-squared is exponential in n, and then you decrease exponentially fast. So what time T do you need? It's of order $\frac{n}{\alpha} \log \frac{1}{\epsilon}$, and then you're mixed: you're at distance epsilon, for the continuous-time process. Now, how do you compare the discrete time and the continuous time? There is an exact equality, not for the total variation, but for the relative entropy between the two. This is called Girsanov, and it is very simple. It goes like this.
So:
$$TV(\bar X_T, X_T) \le \sqrt{\tfrac{1}{2}\, \mathrm{KL}\big(\bar X_T \,\|\, X_T\big)} = \sqrt{\frac{1}{8} \int_0^T \mathbb{E}\, \Big\| \nabla F(\bar X_t) - \nabla F\big(\bar X_{\eta \lfloor t/\eta \rfloor}\big) \Big\|^2\, dt}.$$
The first step is Pinsker's inequality; the second is an equality, so nothing is lost there.
And this is Girsanov, which is just the fact that our two processes are two Brownian motions with different drifts, so they are absolutely continuous with respect to each other, and you can write down exactly, you can make sense of, $\frac{d\bar X_T}{dX_T}$, which is what you need for this type of quantity. And you get this formula. Hopefully I will have time to tell you what we did, because none of this is [indiscernible], but it's important background.
So now we use the beta-smoothness: this is less than $\beta^2$ times the expectation of $\|X_t - X_{\eta \lfloor t/\eta \rfloor}\|^2$, and this is all X now, no bar, the continuous-time thing. Now, how far apart are those two points? There is the part that comes from the drift: during a time interval of length eta, the drift moves you by order eta, so you get an $\eta^2$. And what about the Brownian motion part? During an eta time step it has variance eta; but it has variance eta in every dimension, and you have n dimensions, which are summed for the squared Euclidean norm. So this is upper bounded by, let's say, $2\eta n$.
So now you see how small you should take eta. Because what you get in the end is that this is upper bounded by something like $\sqrt{\frac{\beta^2}{4}\, \eta\, n\, T}$. Now, we said we take T to be of order $n/\alpha$; so let's forget about alpha and beta, those are constants, and focus on eta and n. T is like n, so we get $\sqrt{\eta\, n^2}$. So you want to take eta to be $\epsilon^2 / n^2$: if you take eta to be $\epsilon^2 / n^2$, you get that this TV is less than epsilon.
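Putting the two pieces together (a summary of the calculation just sketched, with alpha, beta and constants suppressed): the triangle inequality for TV gives
$$TV(\bar X_T, \mu) \le TV(\bar X_T, X_T) + TV(X_T, \mu) \lesssim \sqrt{\eta\, n\, T} + e^{-\alpha T} \sqrt{\chi^2(X_0, \mu)},$$
so taking $T \sim \frac{n}{\alpha} \log \frac{1}{\epsilon}$ and $\eta \sim \epsilon^2 / n^2$ makes both terms of order $\epsilon$, at the cost of $N = T/\eta \sim n^3/\epsilon^2$ discrete steps.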
Yes?
>>: This little t, so it says TV of X bar little t.
Sebastian Bubeck: Yes.
>>: And X little t. What is this little t, this is for --
Sebastian Bubeck: For any fixed little t this is true.
>>: What's the relationship to N or big T?
Sebastian Bubeck: So I'm going to apply this to big T. Oh, yeah, I see what you mean. Okay. Good.
Thanks.
>>: So as time goes by they drift farther away.
Sebastian Bubeck: No, the point is exactly that, because of Girsanov, they don't escape from each other; they stay close. Those two things, that's what I was saying, are absolutely continuous with respect to each other, so the Brownian motion part brings them closer again. And it makes sense, you know. This is just -- yeah?
>>: [Indiscernible].
Sebastian Bubeck: I mean there is some magic here. I mean they are drifting away. Because you see
this T: when T gets large, okay, that's why, as T gets bigger, you need a smaller eta. If you
want to be good at the very last time you need a small eta. But still this is pretty sharp; I guess that's
what I was saying. But you're right. At a fixed eta, if you look at too large times,
then they will drift away.
So okay, I didn't say it, but of course you have the triangle inequality [indiscernible]. Good, you tune those
things and you get what you need. So, capital N, what is it? What you want is capital N times eta to
be T, which means that capital N is like N cubed.
This needed N cubed steps, and that is what we improved.
So now what is the issue for us with the projection?
>>: [Indiscernible].
Sebastian Bubeck: Capital N is the number of steps; so you have the prior [indiscernible] in terms of
continuous time, and this one in terms of [indiscernible]. So capital N corresponds to capital T.
Sorry, this is not the best notation.
So now what are the issues for us? Okay, everything breaks. Nothing works. First of all, it's not clear
what is the diffusion process to write down. Girsanov is not going to apply because, you know,
the two processes are not going to be absolutely continuous with respect to each other, so
we need something completely different. Thankfully mathematicians did all the work for us, so we
don't have much to do except put the things together.
There is this problem called the Skorokhod problem, which goes like this. You are given W, which
is, let's say, a map from [0, T] to R^N, piecewise continuous. It's a piecewise continuous function. And here
is the question. Does there exist X and Phi from [0, T] to R^N such that, one, X(t) is
always in K, your convex body?
The thing is, I'm going to define for you what it means to take a reflected version of a piecewise
continuous path. So, one, you want that X stays in K. Two, you want that X(t) can be written as W(t)
plus Phi(t). So far I'm not doing much. But the last one is the critical one. The last one is that Phi
basically only changes when X is on the boundary. So Phi(t) can be written as minus the integral
between 0 and t of nu_s L(ds), where nu_s is the outer normal at X_s; that's nu_s if X_s is here on the
boundary. And L is a measure, a measure supported on the set of times such that X(t) is on the
boundary of K.
So this Phi only changes when X is on the boundary. I'm just pushing X_t; I'm just allowed to push W
when X_t reaches the boundary.
>>: So without the [indiscernible], because you can just wait until you get to the boundary and then
stick it there?
Sebastian Bubeck: Yes. Yes. Yes. Absolutely. So the theorem, from Tanaka '79, is: yes.
Does there exist X and Phi? Yes. And if W is continuous, so are X and Phi.
>>: So can you just describe [indiscernible].
Sebastian Bubeck: Oh, the motivation is very clear. [indiscernible]. So you have K, you have a set of
constraints. And you have some path, I don't know, like this. This is a path in R^N. Does that
make sense? And now what I want is to make sense of what it means to reflect W so
that it stays in K, but mimics W as much as possible. So what I'm going to draw now is X. So
X goes like this, it follows W, and now it says, oh shit, he's asking me to go out. So it just sticks, you
know. Okay, maybe this drawing was not perfect. But whenever it gives me the opportunity, I
follow it again.
>>: Using that Phi.
Sebastian Bubeck: Here I follow it again. And now I can basically mimic it, and here I'm going to stick
and --
>>: Reflection [indiscernible].
Sebastian Bubeck: So the thing is, this is definitely not a Brownian motion path. For a
Brownian motion it would look like a reflection, but not in general.
>>: Oh, it would look very much like a reflection, but it wouldn't look like a reflection of the original.
A reflection implies that they actually reflect in reverse if you apply the transformation.
Sebastian Bubeck: But it is true that it's equivalent.
>>: It's a reflection [indiscernible]. It's not a reflection of the original path.
Sebastian Bubeck: Yes, that is true, yes. That is true. But in distribution it is equal. But otherwise
you're totally right.
Okay, so that's what we're going to use. Maybe, okay, what is the construction? The construction is very
simple; it's done by projection. So, okay, let's not do the construction. But the thing
is, I can also prove a certain continuity, or rather a certain Lipschitzness, of this process. The map from
W to X is Lipschitz. So if W and W bar are two paths which are close by, then the reflected versions will
be close by.
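For concreteness, here is a minimal sketch of this projection construction in Python, assuming K is the Euclidean unit ball so the projection has a closed form; proj_ball and skorokhod_map are illustrative names, and the nearby-paths check at the end is only a numerical hint at the Lipschitz property, not a proof:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    # Euclidean projection onto the ball of the given radius (our stand-in for K).
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def skorokhod_map(increments, x0):
    # Discrete Skorokhod map: follow the increments of W, projecting back into K
    # whenever the path tries to leave; returns the reflected path X.
    x, path = x0.copy(), [x0.copy()]
    for dw in increments:
        x = proj_ball(x + dw)
        path.append(x.copy())
    return np.array(path)

rng = np.random.default_rng(0)
n, steps, eta = 5, 1000, 1e-3
dw = np.sqrt(eta) * rng.standard_normal((steps, n))    # Brownian increments of W
dw_bar = dw + 1e-4 * rng.standard_normal((steps, n))   # a nearby driving path W bar
x = skorokhod_map(dw, np.zeros(n))
x_bar = skorokhod_map(dw_bar, np.zeros(n))
# Lipschitz-type behavior: nearby driving paths give nearby reflected paths.
print(np.max(np.linalg.norm(x - x_bar, axis=1)))
```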
Okay, so now let's focus on f equal to 0. So now I just want to sample from the uniform measure on my
convex body. So what I claim is that X bar T is the reflected path, the Skorokhod path, of the Brownian motion
taken at discrete times, that is, of W_{η⌊t/η⌋}. So what I do is I have this Brownian motion that I
only look at at multiples of eta. And when I look at it, what I want is to project. So when it
stays inside, I stay inside, but when it goes outside, I project it back. That's exactly what my process is
doing. When it can do the Gaussian jump, it does it, and if the Gaussian jump takes me out, then I
project back. So X bar t is exactly the Skorokhod path of W_{η⌊t/η⌋}. And what is the natural equivalent? It is X_t:
the continuous one, X_t, is the Skorokhod path of W_t. It makes a lot of sense.
By the way, just so you see this as a differential equation, I can also rewrite it. What is this? It
means that dX_t is equal to dW_t minus nu_t L(dt). So this is the key difference with respect to
before: this term, the term with the local time L.
So now we understand that our algorithm, our discrete-time algorithm, is the Skorokhod path of a discrete
Brownian motion. And we want to do two things. We want to analyze the
mixing time of this guy with respect to the uniform measure, which is not so easy. And we want to
understand the discrepancy between the two. So the discrepancy, let me tell you quickly.
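And here is what the resulting discrete algorithm looks like as code, in the same toy setup as above (K the unit ball; with grad_f = 0 this is the uniform-measure case just discussed). This is a sketch of a projected Langevin step, not the exact code used for the experiments mentioned later:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_langevin(grad_f, n_dim, eta, n_steps, seed=0):
    # One Gaussian jump per step: drift by -eta * grad f, diffuse by sqrt(2*eta),
    # and project back into K whenever the jump lands outside.
    rng = np.random.default_rng(seed)
    x = np.zeros(n_dim)
    for _ in range(n_steps):
        x = proj_ball(x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(n_dim))
    return x

# f = 0: approximate uniform sampling from the unit ball in dimension 10.
sample = projected_langevin(lambda x: np.zeros_like(x), n_dim=10, eta=1e-3, n_steps=20000)
print(np.linalg.norm(sample))
```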
>>: So, Sebastien?
Sebastian Bubeck: Yes.
>>: I have a question. Rather than doing the projection in the first place, could
you have tried to put a penalty there and just do the whole thing unconstrained?
Sebastian Bubeck: Yes. But this is not so easy. But we're thinking about it and other people are
thinking about it. And probably this will go much beyond what I present. So the hope is to get
mixing time of N, or even square root of N, or even less in the case where you have gradients. But this goes
way beyond what I present.
Okay, so let me just tell you the Lipschitz statement, which is the following thing. I take two paths W and W
bar, I solve the Skorokhod problem, I get X and X bar and Phi and Phi bar, and:

\[
\|X(t) - \bar X(t)\|^2 \;\le\; \|W(t) - \bar W(t)\|^2 + \int_0^t \big\langle\, \big(W(t) - \bar W(t)\big) - \big(W(s) - \bar W(s)\big),\; \varphi(ds) - \bar\varphi(ds) \,\big\rangle.
\]
So this is a direct result. So it's giving me some notion of smoothness of the map from W to X. And it
is easy to see that it implies that the expectation of X_T minus X bar T is something small. I'm going to
be less precise now because I have five minutes. So this is a bound in Wasserstein distance; this is
a W_1 bound between X_T and X bar T. So the Wasserstein distance between two measures mu
and nu is the infimum, in L^p, over (X, Y) which is a coupling of mu and nu,
so X has distribution mu and Y has distribution nu, of the expected L^p distance
between X and Y.
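Written out, the definition just stated is the standard one (using ν for the second measure to keep the two apart):

\[
W_p(\mu,\nu) \;=\; \inf_{(X,Y)} \big(\mathbb{E}\,\|X-Y\|^p\big)^{1/p},
\]

where the infimum runs over couplings (X, Y) with X distributed as μ and Y distributed as ν; here p = 1.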
So this is exactly a W_1 bound. But what I care about is a total variation bound. And of course, I
mean, there is no comparison in general, but intuitively W_1 can be much smaller than the total variation
distance, because it is scaled by the distances. So if the distances are small, W_1 would be small, but the
TV could still be very large. So there is an issue of how you go from W_1 to TV, and I will give
you that in a minute. So from W_1 to TV, how do you do it? It's not clear in general. And we'll do it
now.
But before that, what about the mixing? The mixing of X bar T, of X T. The mixing of X T is actually
very simple; it's, I think, something very well known. So to evaluate the mixing, what you need is to evaluate
how much time it takes to couple from two different starting points. So if I start at x and x prime, I
want to know, for the worst x and x prime, how much time it's going to take before the two meet. So I
get two paths, X_t and let's say X'_t, and I want to know how much time it takes before the two
can be coupled. What I want is the TV between X_t and X'_t. And this is actually a one-dimensional
problem. Because, you see, what I can do is couple the two Brownian motions so
that whenever this guy wants to go to the right, that guy goes to the left, and vice versa. And then this is
really just about the distance between the two.
So what I can do is I look at those two points. I look at the hyperplane that separates those two
things, and what I do is a reflection, a mirror coupling, with respect to this hyperplane. So when this guy
goes like this, that guy goes like that. So now I just have a one-dimensional Brownian motion for the
distance between the two. And the question that you need to ask is: I have a Brownian motion that
starts at the distance ‖x − x′‖; what is the probability that it doesn't hit 0 before time T? This is exactly
the question I need to ask, to get the probability that those two did not meet before time T. Because once
they meet, I can just keep them together forever. Okay, this is a simple bound. It gives you
‖x − x′‖ over the square root of 2 pi T. So if this is a constant, then you just need constant time, and we
know from general principles, as you told me, that once the TV has dropped below a constant it decreases
exponentially after that. So here you get log 1 over epsilon. So this is for the mixing time.
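A quick numerical sanity check of this one-dimensional reduction; the values of d and T are arbitrary, the grid discretization is an approximation that can miss crossings between grid points, and the exact survival probability P(τ₀ > T) = erf(d/√(2T)) is included for comparison:

```python
import numpy as np
from math import erf, sqrt, pi

# Monte Carlo for the reflection-coupling question: a 1-D Brownian motion
# started at distance d; estimate the probability it has NOT hit 0 by time T.
def survival_mc(d, T, n_paths=20000, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, d)
    alive = np.ones(n_paths, dtype=bool)
    for _ in range(n_steps):
        x = x + sqrt(dt) * rng.standard_normal(n_paths)
        alive &= (x > 0)  # once a path hits 0 the two copies are coupled forever
    return alive.mean()

d, T = 0.3, 4.0
exact = erf(d / sqrt(2 * T))     # P(tau_0 > T) = P(|N(0,T)| < d)
bound = d * sqrt(2 / (pi * T))   # the d / sqrt(T)-type bound from the talk
print(survival_mc(d, T), exact, bound)
```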
So now we know that the mixing is fast, and we have this coupling argument. So let's go back to going from W_1
to TV, which is the last part, and it's just getting back to the question that you asked in the beginning.
So the key point is the following. Here is a simple lemma; it's really three lines. The measure of
the set of points whose distance from the boundary is at least epsilon
is at least 1 minus N epsilon over R, where R is the radius.
So again, this is the uniform measure case. In the general case, for a general potential, you can do this too; it's
also very easy. So if you take epsilon of order 1 over N or 1 over N squared, most of the mass is at that
distance from the boundary. This is just a trivial bound.
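One plausible three-line argument for this lemma, assuming K contains a Euclidean ball of radius R centered at the origin: shrinking K by a factor 1 − ε/R keeps you at distance at least ε from the boundary,

\[
(1-\varepsilon/R)\,K \;\subseteq\; K_\varepsilon := \{x \in K : \mathrm{dist}(x,\partial K) \ge \varepsilon\},
\]

so

\[
\frac{\mathrm{vol}(K_\varepsilon)}{\mathrm{vol}(K)} \;\ge\; (1-\varepsilon/R)^N \;\ge\; 1 - \frac{N\varepsilon}{R}.
\]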
Now here is what we do. We know that the two processes X_T and X bar T are close in W_1. And
we want to say that they are close in TV, meaning we want to really couple them so that they actually
meet.
So let's say this distance to the boundary is epsilon. We have X_T and X bar T,
and those guys are close in W_1. So we know we can wait until their distance is, let's say, epsilon
prime, which is much smaller than epsilon. So those two are close. And we know we are at a time, we
saw this above, where X_T is already mixed. So X_T is distributed like the uniform measure. So we know that
it has to be away from the boundary. So this picture is really the correct picture: X_T is away from the
boundary, at a distance of at least epsilon, and X bar T is close to it. But now observe: unless there
is a projection, X_T and X bar T follow the exact same process. It's only when there is a
projection that the two processes are different.
So what we can do is apply the previous argument. As long as they meet before they touch the
boundary, we're done. We know exactly how much time it takes for them to meet; this is the
reflection-coupling argument. And we know how much time it takes for them to reach the boundary, because it's at a distance
of at least epsilon. So we can show, you know, you set epsilon and epsilon prime so that they meet
before they touch the boundary. And now we have a bound in TV instead of having a bound in W_1.
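Roughly, and as a reconstruction of the tuning rather than the talk's exact constants: started at distance ε′, the mirror-coupled gap is a one-dimensional Brownian motion, so by the hitting bound above

\[
\Pr[\text{no meeting before time } t] \;\lesssim\; \frac{\varepsilon'}{\sqrt{2\pi t}},
\]

while covering the distance ε to the boundary takes time of order ε². On that time scale the failure probability is of order ε′/ε, so taking ε′ much smaller than ε makes the coupling succeed; each such gap is where factors polynomial in N get lost.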
And now, how do you generalize this story to the case where F is not equal to 0? It's just Girsanov
everywhere.
And that's it. Okay. Thank you.
[Applause]
>>: So just where does the giant [indiscernible] come in? Can you just kind of point to it along the way?
Sebastian Bubeck: Well, I'm saying epsilon prime has to be much smaller than epsilon. So here, every
time, you lose things in terms of N.
>>: Oh, I see. So the gaps that you require always --
Sebastian Bubeck: Exactly; in this part of the argument you actually lose quite a bit. And
there should be a nicer way to do it, but we don't know how to do it. And just a comment: we did
experiments. And these are currently only small scale, because those things do not work in really high
dimension. So it's dimension up to 100, and we're trying to estimate the volume of some convex
bodies. So the [indiscernible] and the students, they have an implementation where they use a
[indiscernible]. So I went into the code and I replaced every line of hit-and-run by just a stochastic
gradient descent step, and what you get is something that achieves the same accuracy and is faster in most
cases. It's not by an order of magnitude; it's just a little bit faster.
Now, the thing is, we plugged the stochastic gradient step in the middle of this big loop that they
have to compute volumes. What I really would like is to estimate how much better stochastic gradient
is compared to hit-and-run at mixing. But how do you evaluate whether you have mixed or not? That's
not such an easy problem either. That being said, things can be done, and I think it's an interesting
problem to work on.
>>: What can be done?
Sebastian Bubeck: Well, you can, okay, so, well, one thing you could do is you could try to estimate
online the spectral gap. If you have already mixed then you should have a good estimate of this.
>>: But the argument doesn't go the other way.
Sebastian Bubeck: Yes. Okay, so definitely what you can do is come up with a battery of statistics
that you want to test. Whether they cover everything that you want is of course not clear at all. But
maybe you can make some assumptions. I mean, maybe under some assumptions on the Markov chain,
there is a set of statistics that you can test to check whether you have mixed or not.
>>: This is a big problem.
Sebastian Bubeck: Yeah, of course.
>>: Can you at least do it in special cases where you can actually sample? So if you can sample exactly from
the solution, then you could try to evaluate it, right?
>>: Then it's just a standard test, right?
Sebastian Bubeck: Yes. Yeah, what kind of test?
>>: Oh, I mean like cases where you can sample. Like you could then do exact sampling [indiscernible].
Sebastian Bubeck: Yes. You can definitely do that. Yes. And then you can run it; I mean, even if you
run those things on a hypercube, they are doing nontrivial things. And you know sampling from a
hypercube is easy.
>>: You could also alternate between hit-and-run and this, actually.
Sebastian Bubeck: That's a good point also, yes. You can do anything.
>>: [Indiscernible].
>>: [Indiscernible]
Sebastian Bubeck: Yes. Yes. So you can always, because it's log-concave, it has light tails. So after
some point there is no more mass. So you can always [indiscernible]. You see what I mean? So at which
level are you asking the question, for the algorithm or for the --
>>: [Indiscernible].
Sebastian Bubeck: I mean, it's always the case that far enough away -- what?
>>: [Indiscernible].
Sebastian Bubeck: Yes, far from the minimizer, there won't be much mass left, because it's log-concave. So
that's always true. Now, the bound that you get on how far away is far away is going to be again
polynomial in N, so it will again enter, you know, you can get some discrepancy. So you're asking
if we can apply all of this to the case where K is unbounded, is that the question?
>>: The question was [indiscernible].
Sebastian Bubeck: Thank you.
[Applause]