
John Langford: Okay. So we have two papers this year at NIPS. One of them was the active learning paper, and the other one is this paper. This one has been kicked around a little bit, but there might be some nice things in it, so let me go through it carefully. So this is logarithmic time online prediction. I worked on this with my coauthor Anna Choromanska when she was visiting as an intern, and Alec was also involved in the project.
So we're doing multi-class prediction. It doesn't get simpler than that: you have features, you have the label, and we're thinking about this in kind of an online setting. We see features, we predict, and then we see the label. And our goal is just to find the classifier which minimizes the error rate. Pretty straightforward. And this is the twist: we want this to be fast.
So why do you want this? Well, if you take a look at the ways classification is often applied, they're often kind of confusing. I mean, often people think about ranking in returning a search result. But really what's going on is you have some sort of index into billions of web pages that pulls up a few plausibly relevant ones, and then you have some sort of reranking process on top of that, which runs over and over again, until finally you spit out the final results.
But in some sense, logically, the problem is really this multi-class problem right here. It means you have to return things fast, and you have eight billion possible labels, or I don't know how many it is these days, but many billions, which are all of the individually unique web pages.
Or another example is here: it's the "who is that?" question, right? There are billions of people, so it's a big multi-class prediction problem. This one I happen to be able to solve, because I know that's my brother and his family. But otherwise you wouldn't necessarily know.
>>: [Indiscernible].
John Langford: What's that? No, he did not authorize me. I'm probably doing something bad.
Okay. So there are a lot of approaches people use to avoid this large multi-class prediction problem. And I wanted to go through those and discuss the obvious things that we're not doing, because I guess we've discovered that a lot of people are used to the obvious ways of doing it that we're not using.
So trick No. 1 is when K is small. Often K is small and, you know, that's easy. Trick No. 2 is if you have a preexisting hierarchy: you just reduce to the K-is-small case by building a predictor at each node in the hierarchy. And that can work fine if you have a good preexisting hierarchy.
Trick No. 3 is a shared representation. This is very common in ImageNet applications or [indiscernible] applications. You have a bunch of shared processing, and then you have a final output layer. And because everything is shared here, the cost is amortized across the final outputs, so you save a lot computationally. That makes a lot of sense; it's very useful. But it's still the case that you have that final output layer, and if you have ten billion classes, things are going to blow up: there are going to be too many parameters.
Okay. So another trick is structured prediction. The idea here is that you want to make a bunch of individual predictions, and maybe you have some order that you impose over the set of predictions that you want to make, or some structure over the predictions. So you choose that structure in advance, and then you start thinking about exactly how to do that. This is very similar to the hierarchy case in some sense, although some of the techniques vary a little bit. So that's a good trick. Sometimes the structure is not very clear. So this is exactly like the hierarchy case: if you have a good hierarchy then you can use it; if you don't, then it's not so clear.
Trick No. 5 is the GPU. People are hitting things with GPUs quite hard these days. And GPUs are great, but as far as we can tell, they're only constant-factor great. The physics of computation are such that GPUs give you maybe a constant factor more efficient computation, but it's still only a constant factor more cost-effective than general-purpose systems. So these are the tricks people like to use.
Now, let's think about this problem a little more fundamentally. How fast can we hope to go? And the answer is log K; you can't really hope to go faster than log K. The canonical example is very simple: suppose my label is uniform on 1 through K. Then information theory tells me I really have to put out log base 2 of K bits in order to even specify a label. And so there you go, log K time. Right? That's kind of the end of the story: you can't really hope to go faster than log K.
And then if you look at the way things work right now, typically order K is what you see: the amount of computation that goes on to predict one of K things is order K. So, okay, K divided by log K, what does this mean? When K is kind of small, it doesn't really matter very much. When K is 100, you might say, okay, maybe a factor of 10 is important, but not too important. If you get up to a thousand, you get a factor of a hundred or so; maybe it starts to become important.
But you know, GPUs are really good at doing a bunch of parallel computations, so maybe even a factor of a hundred is not so important, because [indiscernible] the GPU. But when you get to a million, you're looking at a factor of fifty thousand or so, and even if you have a GPU, you're going to start to care.
So the claim is there is some K for which you care to be sublinear in K. And in particular it would be
nice to be logarithmic in K. So now the question is how do we actually get there.
So there are a lot of approaches people have taken, including myself; I worked on this. So for sparse error-correcting output codes, what happens is you say: I'm going to create order log K binary vectors, each of length equal to the number of labels, so with K 0/1 entries. And then I'm going to train some binary classifiers to minimize the error rate; each one just wants to predict its bit, essentially. Right? So I have order log K different binary classifiers, each of which predicts one bit position of my binary code. And then I want to predict by finding the Y with minimal disagreement. So if the value of $b_{iy}$ is 1, then you really want $h_i(x)$ to equal 1, or vice versa. You guys can figure it out.
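Concretely, with $b_{iy} \in \{0, 1\}$ the $(i, y)$ entry of that code matrix and $h_i$ the learned bit predictors, the decoding step is
$$\hat{y} = \arg\min_{y \in \{1, \dots, K\}} \sum_{i=1}^{O(\log K)} \mathbf{1}\big[h_i(x) \neq b_{iy}\big],$$
and it is this arg-min over all K codewords that step 3 below refers to.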
>>: [Indiscernible].
John Langford: So what I mean is there's really a matrix where one dimension has size order log K and the other dimension has size order K. And $b_{iy}$ is the (i, y) entry of that matrix. But each predictor is only trying to minimize the error rate on one row of that matrix. So this is really $h_i(x)$.
>>: That should be B sub Y maybe? You create log K binary vectors, each one is B sub Y, and B sub iY is --
John Langford: This is the entry. This is the entry.
>>: Yeah, you're confusing the entries with the vector.
>>: B sub Y is the value.
>>: Yeah.
John Langford: Oh, man. So here's my fundamental flaw: I took general relativity and I do Einstein summation notation. So I always notate things this way: I talk about the entry and define its indices. So this is a scalar, which is the (i, y) entry of the matrix. And then the whole transpose thing people do, that really confuses me.
>>: [Indiscernible]
John Langford: I'm trying but I don't really understand --
>>: Because you don't have log K, [indiscernible].
>>: [Indiscernible].
John Langford: I don't understand why it's unclear, I guess.
>>: It's clear.
John Langford: Maybe we should discuss it later. Seems like a distraction.
So the question is why is this not logarithmic time.
>>: [Indiscernible].
John Langford: Say again?
>>: [Indiscernible].
John Langford: Yeah, so the length-K vectors, I think that's not really a problem, because we kind of imagine that we have at least one example of each label. So we can amortize the length of the vectors across individual labels.
>>: [Indiscernible].
John Langford: Yeah, we want to do logarithmic time prediction per example.
>>: [Indiscernible]. So if you represent the label as just a number, the vector could be the binary expansion of this number.
John Langford: Could be. But I will still say this is not logarithmic time.
Yeah, step 3, when you find the Y with minimal error that's typically an order K operation.
>>: You've had zero [indiscernible].
John Langford: Your lookup table, that could be true. This is order log K with a constant larger than 1, so the lookup table is not size K in that case; it's going to be kind of large.
>>: It would depend on [indiscernible]. Matrix, any prediction produced can be found in constant time
[indiscernible].
John Langford: Say it again?
>>: So if I say it's really that 1 is very dense.
John Langford: So B is dense.
>>: Yeah and then whatever you produce is step 2, you just find the binary representation of the
[indiscernible].
John Langford: Why is it constant time?
>>: Well, to make sure that, well, there's no, the label produced in step 2 gives you an error of zero in step 3.
>>: If you have [indiscernible].
John Langford: If you have no error. We don't want to worry about errors.
>>: [Indiscernible] let's say you have [indiscernible].
John Langford: Maybe. This particular paper doesn't do that. But I'm not saying there isn't an
approach along these lines, but the paper certainly didn't do that. The paper was just saying, look,
we're going to check all the individual labels.
>>: So he said kind of like a [indiscernible]. So why should that be linear? This instance would be a really horrible way to solve this problem.
>>: [Indiscernible].
>>: [Indiscernible] like to me it seems to be more an issue of the problems that you create in step 2 being difficult. That's the thing that I think is the worse problem here. Because step 3 you can just solve with a trie data structure, I think.
John Langford: I will get into that. And I think these two problems are related. When you have a particular problem and a representation that you're working with, you really want to be tuning the bits here to match the problem representation that you care about. Which means that it's not going to be the bit representation which happens to give you logarithmic time. And I mean --
>>: I think this is the problem. The problem is not, if you didn't have the problem that some combination [indiscernible] is not linear, right, you can solve it. But the other issue is that the problems you generate, from the B matrix, are such that you're just going to get horrible results.
>>: [Indiscernible].
John Langford: So prediction is, I think, okay, let me rephrase this in a way that is falsifiable. The prediction is omega of K if you're tuning the Bs to match your representations. Because then you don't [indiscernible]
>>: [Indiscernible].
John Langford: So another approach people use: step 1, you build a confusion matrix of errors. Step 2, you recursively partition to create a hierarchy, maybe using some sort of spectral method. Step 3, you have a hierarchy now and you just use your favorite hierarchy solution. And I would say this is not what we want either. So why is that?
>>: You have cumulative errors.
John Langford: You have cumulative errors coming down, that's true. But step 1, I think, is the really killer one. If we go back and think about this graph, you can afford to spend more time on training than on testing, maybe up to 10,000 labels or something, but beyond that, something which is logarithmic time in training also matters.
>>: By this you mean logarithmic overhead compared with the size of datasets. Right?
John Langford: I'm thinking about logarithmic time per sample, per example.
>>: Per example. Yeah.
John Langford: So this doesn't seem like the right thing. And then there's another trick here, which is pretty cool: unnormalized learning. So this is a simple one-against-all approach: you have a predictor for each class label, predicting whether it's this class label or not. So [indiscernible]. There's a trick you can use to make the training very fast. You say: here's the example, we're going to train the Y regressor with (x, 1), and then pick a random other regressor and train it with (x, -1).
And now, for every example, you just do two updates, so the training is essentially constant or logarithmic or something like that. So that's fine. But the prediction is still going to be omega of K. So this approach is actually kind of effective: it's good for training but not for prediction.
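For concreteness, here is a minimal sketch of that training trick (not the authors' code; the squared-loss update and the learning rate are illustrative assumptions):

```python
import numpy as np

def sampled_oaa_update(W, x, y, lr=0.1):
    """One online step of one-against-all with negative sampling: update the
    regressor of the true label toward +1 and one random other regressor
    toward -1, so training costs O(1) per example instead of O(K)."""
    K = W.shape[0]
    # positive update: push the true class's score toward +1 (squared loss)
    W[y] += lr * (1.0 - W[y].dot(x)) * x
    # negative update: pick one random other class, push its score toward -1
    j = np.random.randint(K - 1)
    if j >= y:
        j += 1  # skip the true label
    W[j] += lr * (-1.0 - W[j].dot(x)) * x
```

Prediction, by contrast, still has to score all K regressors and take an arg-max, which is the omega-of-K cost just mentioned.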
Okay, so now we want to think about how we actually do this. And we're going to think about learning a tree, which has the problem of accumulating errors, but we have to start somewhere. And we need to think about how we actually want to do things. So let's imagine we're doing digit recognition, and we have the digits 1, 7, 3, and 8. We can do an easy split at the root, between {1, 7} and {3, 8}, and then have the hard splits further down, between 1 and 7 and between 3 and 8. Or we can have a hard split at the root, between {1, 8} and {3, 7}, and have the easy splits further down. So the question is: do you prefer having the hard split at the root, or would you rather have the hard splits further down?
Any opinions?
>>: [Indiscernible].
John Langford: You think you prefer this one? See, we preferred this one.
Okay, so the claim is that it's better to have the confusing, difficult problems near the leaves. And the reason is that you have some representation which is being used at the root, and that representation is getting hit by all the samples, so it becomes very constrained in the sense that it needs to get a lot of things right. Further down, near the leaves, you get fewer samples and you need to get fewer things right.
>>: [Indiscernible] mass samples, you want the easy to classify. You have more samples, you hope
you can deal with your distribution there.
>>: That is like making a status symbol from where's your main [indiscernible], what kind of
expressivity points.
John Langford: So, yeah, this is a point about representation and capacity. Because I think often representation capacity ends up constraining your performance. I think the way it works is: representation capacity constrains your performance at the root, and sample complexity constrains your performance towards the leaves.
>>: Who is B W G?
John Langford: Bengio, Weston, and Grangier.
>>: So there's another point, which is, for a lot of these algorithms, if you make a mistake at the root then you have nowhere to go but back up. So mistakes at the root are more costly, so you might want to put the easy problems there.
>>: [Indiscernible]
John Langford: We will actually have some ability to recover from mistakes because we'll have the
same label maybe in multiple leaves.
Okay. So we need to think about what a good partition into a left and a right branch is. And this is easy when you only have two classes, but it's fairly subtle when you have many classes, because it's very fundamentally non-convex. So we have a bunch of examples and we want to look at the quality of some partitioner; we want some way to judge how good different partitioners are. And the claim here is that a decent way to do this is: if the probability of A times the probability of B were equal to the probability of A-and-B, it would be as if A and B were independent. But instead, what's happening here is that we're trying to maximize that difference. So we're trying to maximize the dependence between the left/right decision and the label. So it should at least intuitively be something along the lines of what you want.
>>: This is like a purity criteria of sorts?
John Langford: It's not just a purity criterion, because you also want a similar number of examples to be going left as right. So we're weighting this by the fraction of examples and not just uniformly over classes.
>>: So is it going to be something like a correlation coefficient, like a Pearson correlation coefficient [indiscernible]?
>>: Purity usually weighs, no?
>>: This correlation between H and Y; right?
>>: [Indiscernible].
John Langford: Could be. I don't know if there's some correlation coefficient [indiscernible]. So in the paper there's a property called purity and a property called balance, and you kind of want both of them. And this achieves both of them, or at least it drives you toward both of them.
>>: I was thinking about the kind of typical purity measure or impurity when you kind of just growing
the trees; right?
John Langford: Yes.
>>: So that one usually weights by the sizes? It's not purely just, you know, where your computation
looks like?
John Langford: So if you look at a typical entropy like thing, what you see often in decision trees, it
looks kind of like this. I wish I could draw better. And then what we're doing looks a little bit
different. It looks kind of like this.
>>: Right. Okay.
John Langford: It's a bit sharper. Since we're trying to maximize things, this is somehow a bit less
non-convex than that, which makes it a bit easier to optimize later.
So a different form of this, which I think is a bit more operational: you can factor out the distribution on Y, and then you can see that you're looking at the fraction of the time that you go right versus the fraction of the time that you go right conditioned on the label. Those are the fundamental quantities that you're working with.
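For reference, the criterion from the paper, as I recall it (up to constants), scores a partitioner $h$ by
$$J(h) = 2 \sum_{y=1}^{K} \pi_y \,\Big| P\big(h(x) > 0\big) - P\big(h(x) > 0 \mid y\big) \Big|,$$
where $\pi_y$ is the frequency of label y: it is large only when the split is both balanced ($P(h(x) > 0) \approx 1/2$) and pure (each label goes almost entirely to one side).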
And then you can see why this is non-convex. When you're doing classification, if you take all of your labels and you swap them, you get something terrible performance-wise. But if we take our partition and we swap it, you end up with the same value here. And now if you take an average of those two, you get something terrible, because the average of a good partition and the complemented partition is going to be a tie everywhere, and nothing good is going to happen if everything is a tie.
So that's kind of rough. I mean we're used to convex optimization. Convex optimization is much
easier. But this seems kind of hard.
So we don't just need to optimize this criterion at an individual node; we need to optimize it at many nodes across the entire structure. And one question is: can we build things in a bottom-up fashion? A natural approach you might imagine: you take two random labels, you build a predictor between the two, you combine with other subtrees to make a supertree, and you build up some sort of tree-like structure.
And the claim is this does not work very well. The canonical example is here. You have three labels on the line, a single feature, which is position on the line, and you're doing linear prediction. You pair 1 and 3 and build a great predictor between 1 and 3. And then you pair that with 2, and there's no linear predictor which can distinguish 2 from 1-or-3.
So some sort of randomized matching and then build up to get the big structure approach seems to not
be viable in general. At least we haven't been able to figure out a way to do it.
So now the question is: can we do things in a top-down fashion? Right? And it turns out you can. There's this notion of decision tree learning as boosting, which people worked on previously, and you can take this proof and redo it for the multiclass case. So we imagine that we make some progress gamma on our criterion at each node, just like a boosting assumption, and then you can prove that the number of nodes in your structure is something like this. And the gamma also gets you a log K [indiscernible].
So this is something nice: it says if we optimize this criterion effectively at every individual node, then what we can hope to get is a small error rate.
>>: So this is like the basic boosting argument: you reduce entropy at each step, so at the bottom you must have pure leaves?
John Langford: Yeah, so inside the proof it's doing that. Now, this is going to be a little bit distressing: it says that if gamma is small (and it's in the exponent), that's bad. So we can't really afford to have a small gamma. You need to be doing strong learning, not weak learning, in the individual nodes. We really want to strongly learn the predictors in the nodes of the tree; otherwise the tree is going to be enormously large.
So now the question is: how do we actually get a good gamma, a big gamma, at the individual nodes of the tree? And there are a couple of common tricks that we use. The first one is that you can relax the optimization criterion; this is kind of a heuristic. We relax it to use a score, a margin-like thing, rather than the hard left/right decision. So if you go back: here we were looking at just, did we go left, did we go right. Here we're looking at the expected score on a minus 1 to plus 1 range, and then everything is the same.
But even this is kind of hard to optimize, because every time you change the parameters of your predictor, you have to recalculate this expectation exactly over all of your examples. That doesn't work very well, so we approximate it with running averages: you keep some running average of what this value is, and running averages of what these values are, one for each Y. And then as your learning rate converges toward zero, the running averages should converge to the expectations.
So the algorithm at the node then looks something like this. You start with all the running averages at zero. You compare $E_y$ to $E$; here $E$ is the unconditional expectation of the score, and $E_y$ is the expectation conditioned on label y. You create a label which is plus 1 or minus 1 depending on that comparison, and then you update the weights using that binary label. So we're using an online learning algorithm, an online classifier, underneath. And then you update the running averages just to keep everything up to date.
So this is kind of the core update rule that you apply to an individual node. And then of course you're
going to go left or go right depending upon what your update says. And then you're just going to
recurse, doing this at all the nodes in your structure.
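Here is a minimal sketch of that per-node update (not the released implementation; the margin-style step, the decay schedule, and the sign convention are illustrative assumptions):

```python
import numpy as np

class Node:
    """One internal node: an online binary router plus running averages."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)   # linear scorer: s = w . x
        self.e = 0.0             # running average of scores, all examples
        self.ey = {}             # running average of scores, per label y
        self.lr = lr

    def update(self, x, y):
        s = float(self.w.dot(x))
        ey = self.ey.get(y, 0.0)
        # induced binary label: route label y toward the side where its
        # conditional average score sits relative to the overall average
        b = 1.0 if ey > self.e else -1.0
        if b * s <= 1.0:                     # margin-style online step
            self.w += self.lr * b * x
        # keep the running averages up to date
        self.e += self.lr * (s - self.e)
        self.ey[y] = ey + self.lr * (s - ey)
        return 1 if s > 0 else -1            # which child to recurse into
```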
Yeah?
>>: So why is this W transposed?
John Langford: Yeah, so we're using linear predictors in our experiments. But it should work with any online classifier outputting a score that you can use up there.
So now the question is what happens experimentally. There are many different experiments you can do. One experiment: say we have a fixed amount of time, and we subsample the datasets so that the amount of time used by the individual algorithms is the same. So we can compare this logarithmic online multiclass tree, the LOMtree, to one-against-all, looking at the accuracy; higher is better, for a bunch of different datasets that we found. And what you see is: for this one, I think one-against-all is better, and then it just gets worse and worse and worse as you get to more and more classes.
So on a per-computational-time basis, the LOMtree is getting much better accuracy than an order-K approach.
>>: There's one correct label?
John Langford: Yeah.
>>: Oh, there's always just one correct label?
John Langford: Yeah.
>>: And everything else is bad.
John Langford: Yeah. So this is the 22K version of ImageNet, the big version of ImageNet. This is the ODP hierarchy that we got from Paul Bennett.
>>: One could argue that if you have a hierarchy you could measure your error proportional to the
graph distance in the hierarchy or something.
John Langford: You could argue that. We're trying to see how good this is at just flat-out, straight-up large multi-class learning.
>>: If you have a hierarchy and you want to use the method, is there some way you could exploit the hierarchy using this method?
John Langford: There's no interface for that right now. I could certainly imagine making the training process obey a hierarchy to some extent. Right now, this is really just an online procedure, so the order of examples really matters. If you had some preexisting hierarchy, maybe you could use it to make the order of examples matter less.
>>: Is there a way to see this as some kind of stochastic sub-gradient descent or something like that? Just trying to understand what high-level update rule these five steps implement.
John Langford: It's really just these two steps. [indiscernible].
>>: [Indiscernible] so. I see. So then the plus 1 / minus 1 feels like a sub-gradient on the absolute loss. Okay.
John Langford: We have this loss structure, so it's --
>>: So I mean, in that case, is there a way to incorporate hierarchy? So essentially, I guess, [indiscernible] rather than having 0/1 error you might have some version of nearest-common-ancestor error, and there might be a way to write it in [indiscernible] to be able to create --
John Langford: I think it's possible. We haven't done it. But I think it's possible.
>>: So it sounds like your measure of purity, right, would be not treating all the classes equally
[indiscernible] multiplier; right?
John Langford: Yeah.
>>: So but instead you would be incorporating it into the measure of purity.
John Langford: Another thing to think about for a moment: you can see how this optimization fairly directly implies a plus 1 / minus 1 classification. If you have this kind of thing that you care about, it's a bit less obvious exactly how you want to do it. Since we started this, we have found some other approaches for optimizing linear predictors on this kind of criterion, but they involve things like creating a sigmoid: you pretend that there's some fraction of the time that you go left and some fraction of the time that you go right, and the optimization continues on from there. And then you have to worry about regularization maybe a little bit harder, and so forth.
So again for fixed training time, and this is test error now, so lower is better, we can look at the test error for a bunch of other logarithmic-time approaches. You can have a random tree; you can use the filter tree, an early paper that I worked on, which is also logarithmic time. It doesn't learn the structure, but there's something nice there. So we see that generally the test error is smaller, sometimes much smaller, depending upon the degree of compatibility between the representation and the problem.
>>: So hierarchy is used for [indiscernible] tree?
John Langford: It's just a bit representation of a label.
>>: [Indiscernible] and random is what?
John Langford: Just a randomly created hierarchy.
>>: [Indiscernible]
John Langford: Just training in the way that you normally would. So with the filter tree, the issue is: if you train top down, "is this subset or that subset" is the problem, because it takes an existing training algorithm. The filter tree fixes that, and I think it trains bottom up on a fixed hierarchy.
So the fixed tree is better than a random tree. And then the LOMtree seems to be as good as or better, sometimes much better, than the other trees.
So now we can also run one-against-all, at least on some of the smaller datasets: we don't care about computation, we just try to optimize for the best prediction performance we can get. And what we see is there's still some gap, right? But this is where we are right now: we've closed the gap statistically, so logarithmic-time prediction is much closer to one-against-all than it was previously. But it's not the case yet that you just want to use this no matter what; you need to care about the computation to prefer this.
And the last question you might ask is: do we actually get logarithmic time in prediction? And the answer is yes, we do. The reason why this question is nontrivial is that when you have these branching structures, they're not nearly as cache-local and highly optimized as [indiscernible] operations or whatever. So you expect to pay some penalty associated with the fact that you're branching. And there is a penalty, if you look at this, but indeed it's still generally logarithmic time.
On one of these numbers, let me go back: for this ImageNet one, Leon Bottou trained a [indiscernible] network, scraped off the top layer, and then used that to create feature vectors for all of the examples in the 22K version of ImageNet. And the prediction here takes half a millisecond. So that's decent.
>>: John?
John Langford: Yeah.
>>: Do you have a sense of where [indiscernible] on that graph?
John Langford: Yeah, so I think Manik's approach is going to be slower. We haven't done side-by-side comparisons; I've tried to understand the performance characteristics of [indiscernible]. I think it's roughly a factor of 10 more state and a factor of 10 slower than this approach, on the datasets that he was applying things to. But the accuracy, I think, is better than what we're getting.
>>: Should be around the same as performance, right [indiscernible]?
John Langford: I'm not entirely, I expect it will perform better. I don't know if it completely closes the
gap.
>>: [Indiscernible].
John Langford: What's that?
>>: So say it's going to have the better accuracy but slower?
John Langford: It's better accuracy but slower and more state.
>>: So what complexity does it actually have?
John Langford: FastXML?
>>: Yeah.
John Langford: It's not analyzed, but Manik is not using online learning; he's using some sort of iterative optimization. And the number of iterations is not very controlled, but in practice it's relatively small.
>>: At prediction time for example.
John Langford: Well, it's logarithmic time for each individual tree, but then there's a bunch of trees; he's doing [indiscernible] things or random-forest-like things. So it's dependent upon how many individual trees he uses.
>>: I see, so is there any intuition for how the number of trees you need scales with the number of classes, or is that one of those things that depends more on the problem than on the number of classes?
John Langford: [Indiscernible].
Okay, so we can predict in time logarithmic in K.
>>: There's a reason that that [indiscernible] talk.
>>: [Indiscernible].
John Langford: So right now the LOMtree does not incorporate the benefits of the filter tree. When you have a high noise rate, it's good to have a training rule which is consistent, and that's what you encounter with multi-class prediction. So maybe there's a way to fix that; when I get a moment I'm going to try some things. I think we do know how to do that. And there's also this issue that your representation complexity is constrained at the root and your sample complexity is constrained at the leaves. There's something fundamentally uncomfortable about a decision tree there. So maybe there's some other approach, less decision-tree-like, that also gives you logarithmic time.
Other questions? Yeah?
>>: So when the problem [indiscernible]?
John Langford: No. So this is an online approach. And there's actually a bit more to making things online that I didn't describe. You can have nodes that become obsolete because all the examples go away from them, and there's a process for recycling nodes in an online fashion, which is described in the paper. So when the distribution is changing and you want logarithmic time, I think there is no good alternative to this; I'm aware of nothing else that plausibly works.
Okay. Thanks a lot.
[applause]
Sebastian Bubeck: So now we're going to be Bayesian, and instead of minimizing a convex loss, what we want to do is to sample from a distribution which is derived from this loss. I will tell you exactly what I mean. And what we will see is that the algorithms that we know and love for minimizing a convex function can also be used to sample from the measure derived from this loss.
So, as always for the past year, my story starts with a convex body. So K is a convex body: a convex set, compact, with nonempty interior. And we're given a function F, from K to R, which is convex. So we have a convex function defined on our convex body, and I'm going to assume this function is regular in two senses. It is L-Lipschitz, so the norm of the gradient is bounded by L; and it's beta-smooth, so the gradient map itself is Lipschitz: the Euclidean norm of $\nabla F(x) - \nabla F(y)$ is at most beta times the norm of $x - y$.
And the object we're going to be interested in is the log-concave measure derived from F. So mu is going to be a probability measure on K, and I'm going to write its Radon-Nikodym derivative with respect to the Lebesgue measure, the density of mu, as
$$\frac{d\mu}{dx}(x) = \frac{1}{Z}\, e^{-F(x)}\, \mathbf{1}\{x \in K\}.$$
Okay. And Z is just the normalization constant that makes this a probability measure:
$$Z = \int_K e^{-F(x)}\, dx.$$
And the problem we're going to be interested in, the goal, is to sample from mu, given some oracle access to mu and K. So that's going to be our problem. For instance, this morning [indiscernible] talked about decision trees and how to optimize the weights on the decision trees; what he was interested in was minimizing F over those weights. But now we're taking this Bayesian approach where every weight vector has a certain probability, and then you might be interested in doing different kinds of inference using the distribution itself. So this is a basic task. And the constraint set K: for instance, in our first talk, you had this set of proper, valid weightings of the tree, so that gives you a K. So this directly appears in real-life examples.
So this problem has a long history; I'm going to tell you about it in a minute. But first let me tell you about the result, the theorem that we proved. This is, by the way, the paper that I have at NIPS with [indiscernible]. So the algorithm is the simplest you can think of, in some sense. Let $\xi_1, \xi_2$, et cetera, be an i.i.d. sequence of Gaussians $N(0, I)$, and let $\eta > 0$ be a step size. And what we're going to do is the following. We start our initial point at 0: $\tilde X_0 = 0$. And I'm going to call my sequence X tilde; there will be another sequence, which is more deserving of the name X. So the update is the following. I'm at $\tilde X_k$. What do I do? Well, the points where F is small have higher probability, so I want to drift towards the minimum of F. To do this, I know how to do it: I just do a gradient step, minus $\frac{\eta}{2} \nabla F(\tilde X_k)$. But if I only do that, I will go to the minimum, and I want to keep some variability. So I'm just going to add one of those Gaussians, I'm going to add some noise: plus $\sqrt{\eta}\, \xi_k$. So it's very much related to stochastic gradient descent: the noise has mean zero, so in expectation this is just a gradient step. The catch is that the variance of my noise is exactly equal to the step size, so the usual theorems saying that stochastic gradient descent finds the minimum do not apply, exactly because the variance is matched with the step size.
And the theorem is very simple. We need some kind of normalization somewhere, because K was just an arbitrary convex body, so I'm going to assume that K contains the unit ball and that the diameter of K is at most R. And to simplify, we're going to assume that L and beta are numeric constants; the theorem in the paper has all the parameters. Now the theorem goes as follows. Total variation distance: you remember, the TV between nu and mu is the supremum over all events A of the difference between the probability that nu assigns to A and the probability that mu assigns to A. So if I control this, I control the probabilities of all events uniformly; it's a very strong notion of convergence. So we are going to say that the total variation distance between $\tilde X_N$, after doing capital-N steps, and mu is less than epsilon if I set eta to be a constant times $R^2 / N$ (that's what we prescribe for the step size) and, most importantly, if capital N is at least something like $(\sqrt{n}\, R / \epsilon)^{12}$. This looks pretty good. Except that it's to the 12th power. That looks less good.
But the point is that it's polynomial in n. So what is the difficulty of this problem?
Yes? Of course.
>>: Mu is a measure.
Sebastian Bubeck: Mu is a measure. Yes.
>>: And X [indiscernible] N.
Sebastian Bubeck: Yes, this is a point. Yes.
>>: So what's the total variation between [indiscernible].
Sebastian Bubeck: First of all, I'm conflating two things. One of them is: here, you know, I identify the random variable with its law. So this is a random variable, and this is the underlying law; I identify the two. But more importantly, I forgot the most important thing, which is that, just like in projected stochastic gradient descent, when you do a step of gradient descent you might step outside of your constraint set, so you need to project back. So you project back, and this will actually be the whole difficulty. So there's a $P_K$ here, the Euclidean projection onto K: if I step outside, I project back onto K.
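For concreteness, here is a minimal sketch of the whole algorithm, taking K to be a Euclidean ball so the projection has a closed form (the ball, the Gaussian potential in the usage line, and the step counts are illustrative assumptions, not the paper's experiments):

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius (our K)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_lmc(grad_f, dim, n_steps, eta, radius=1.0):
    """Projected Langevin Monte Carlo: gradient step plus Gaussian noise,
    then project back onto K. Returns the final iterate, approximately
    distributed as mu proportional to exp(-F) restricted to K."""
    x = np.zeros(dim)                        # start at the center of K
    for _ in range(n_steps):
        xi = np.random.randn(dim)            # fresh N(0, I) noise
        x = x - 0.5 * eta * grad_f(x) + np.sqrt(eta) * xi
        x = project_ball(x, radius)          # step back inside if we left K
    return x

# usage: one approximate sample from exp(-|x|^2 / 2) on the unit ball in R^5
sample = projected_lmc(grad_f=lambda x: x, dim=5, n_steps=20000, eta=1e-3)
```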
So then I do indeed get a point in K, and I say that the total variation distance is less than epsilon provided I run that many steps. And if F is 0, meaning that I'm really interested in the uniform measure on K, then we can improve it a little bit: the 12 becomes a 7.
So let me tell you some of the history of this work. Not much, but a little bit.
>>: So what does F equal to 0 do?
Sebastian Bubeck: I'm sorry? F equal to 0 just means F is constant, so I want to sample a uniform point in K.
So the first ones to do this, to prove that there exists a polytime algorithm to sample from this type of measure, were Dyer, Frieze, and Kannan, in '91, I think. And what they did is essentially this thing, exactly this; they were interested in the case where F is constant, so there is no gradient. But instead of the projection, what they were doing is: if you step outside, then you just repeat. You don't move, and you resample. The thing is that this type of Markov chain, called the ball walk, will not mix from any starting point. If you start in a corner, you can get stuck, and it will take exponential time to make a move. You see, if you are in a corner like this and I am here, then I have my small ball, and most of the time I'm going to step outside. It's going to take a very, very long time for me to start moving.
And what they proved is that from a good starting point, you will mix in polynomial time. Whereas what we do mixes from anywhere; I didn't say anything about where you start. And why do you mix from anywhere? Because even if I start here (you see, the picture I'm going to draw will make sense in twenty minutes), I will move like this, I will escape like this. I will keep bouncing off the boundary, and this will make me escape very fast.
So the first one was Dyer, Frieze, and Kannan; I think they got something like N to the 21, I don't remember. And the current record is by Lovász and Vempala, and they get an amazing N to the 4, using what is called hit-and-run. Hit-and-run is the following.
It's very simple. You are at the point $X_k$, and $X_{k+1}$ is going to be sampled from a certain distribution, which is the following. I first take a direction L uniformly at random from the sphere. Then I sample $X_{k+1}$ from mu restricted to the line through $X_k$ in direction L. So I was here, that's my $X_k$. I take a random line, okay, and I take a point at random: not uniformly, but from the restriction of mu to this line. So: sample from mu restricted to the line $\{X_k + t L\}$. This is a one-dimensional problem, and you can solve it in many ways.
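A minimal sketch for the uniform case (F constant) on a Euclidean ball, where the chord through the current point has a closed form; for a general mu you would sample from mu restricted to the chord instead of uniformly (the ball and the quadratic chord computation are illustrative assumptions):

```python
import numpy as np

def hit_and_run_ball(x, n_steps, radius=1.0):
    """Hit-and-run for the uniform measure on a ball: pick a uniform
    direction, intersect the line with the ball, sample on the chord."""
    for _ in range(n_steps):
        u = np.random.randn(x.size)
        u /= np.linalg.norm(u)            # uniform direction on the sphere
        # endpoints of the chord: solve |x + t u|^2 = radius^2 for t
        b = x.dot(u)
        disc = np.sqrt(b * b - (x.dot(x) - radius ** 2))
        x = x + np.random.uniform(-b - disc, -b + disc) * u
    return x
```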
The point is that these two algorithms are very different, and they definitely get a better dependency than this. But hit-and-run is a zeroth-order method: all it needs is to query F at points, the value of F. What we have is a first-order method: you can query the gradient of F. So what I expect, or what I hope, is that this type of method should be better in practice; it uses much more information about the function, so it should mix faster. It's not yet shown in the theory, but that's what I hope to obtain; I will tell you towards the end about real experiments. So that's one thing. Hit-and-run also mixes from any starting point.
Yes?
>>: [Indiscernible] it had to go through a succession of convex bodies. You don't just start with an arbitrary --
Sebastian Bubeck: So that's if you want to compute a volume.
>>: Oh, so for the something that he had [indiscernible].
Sebastian Bubeck: For sampling, this works directly here. And they proved (I mean, you know, there are many papers, but one of the most recent ones, which is still, I think, seven years old, shows this) that it mixes from any starting point. So you don't need a warm start or anything like that. Which is just like what I do.
The one thing which is tricky is: yes, we're competing with them, and they worked a lot on this; it's very optimized. But still, I mean, okay, our point was to prove polynomial time. And let me say something else: theirs is N to the 4 times log of 1 over epsilon, whereas we get polynomial in 1 over epsilon. And this is a key difference.
Yes?
>>: So let's kind of not waste [indiscernible].
Sebastian Bubeck: Yes.
>>: So suppose I can solve the optimization problem. Would you expect I can get a better algorithm if I just do a series of full minimization problems on [indiscernible] functions?
Sebastian Bubeck: I don't know. I don't think so, but I don't know; it's a good question. But here, you will see, what is underlying this is a certain diffusion process which has mu as its stationary measure. So it's very natural to just discretize the diffusion process, and that leads to this stochastic-gradient-like update, in a sense. But it's in that order that you do it: you start with the continuous thing, you discretize, and you get this; and then you see that there is a connection with optimization. But maybe there are other ways to do it.
>>: Probably in that process you're doing sort of a [indiscernible] approximation somewhere, which
[indiscernible]?
Sebastian Bubeck: So you will see. It's a bit subtle, because you're working with those random variables and you need to couple them. So you will see; we're going to do that.
>>: Just to make sure, in your main selling point is that you think this one does better.
Sebastian Bubeck: My main selling point is that this is a natural algorithm that you may want to use, and one that is actually used a lot. This is called Langevin Monte Carlo, going back a long time. And it has been analyzed without the projection; that was going to be my next point. People in statistics really look at this, and it's known how to analyze it when you don't project. The projection induces a lot of real difficulties, and we have a way to deal with them, which can give you new insight about how to deal with those projections. Because this is something you really deal with quite often.
>>: If you don't project, what happens if you step out of K? You just ignore that step?
Sebastian Bubeck: If you don't do the projection?
>>: Yeah, you just ignore it.
Sebastian Bubeck: Yeah, that's what Dyer, Frieze, and Kannan did. You just say, okay, I don't move, I try again, and you count the number of failures. But this will not be polytime from any starting point.
So Dalalyan in '14 gave an analysis of LMC for K equal to $R^n$ and F strongly convex. So our starting point was this very nice paper of Arnak Dalalyan, who showed how to analyze this in the case where the potential has curvature and there is no constraint. And I'm going to walk you through it. It's very nice, very simple, and then you will see what the difficulties are and how we deal with them.
Just one last thing that I want to say about this. Here, with hit-and-run, you're playing with the boundary at every step: at every step you have to go and see how far you can go in this direction. Whereas what happens with this guy is that most of the time you don't have to play with the boundary; it's only from time to time that you step outside, you project, and then you wander around inside again until you project at some point. So there is some notion of average complexity which should be lower for this algorithm than for hit-and-run. This we didn't do, but in practice you see it. With hit-and-run, you just have to do it every time.
Yes?
>>: You just said it has to do with, you're assuming that the minimum of the function is somewhere deep in the interior of the convex body. If it's on the boundary then you are --
Sebastian Bubeck: That's a good point. So I mean, yes, it's almost a good point. [laughter] No, no, I will make this precise; we will need exactly this. But the point is that for any F which is smooth enough, there is actually not too much mass near the boundary. The only way to really have a lot of mass near the boundary is for the minimum to be at the boundary and for the function to explode very fast there, and this is prevented by those conditions.
>>: So these are the units for, so F is just constant.
Sebastian Bubeck: Yes.
>>: And [indiscernible] most of the volume is [indiscernible].
Sebastian Bubeck: So it depends on what you mean; it always depends on what you mean by "close". The thing is, if I have my convex body here and I take a little slab here of width like poly(1/n), then this has exponentially small mass. The thing is, it's in between: there is no mass too close to the boundary and no mass too close to the center. It's really in this thin shell. Yes, so that's the picture, I think. At least to me.
So let's see how to analyze this in the strongly convex, unconstrained case, and you will see it's simple and nice. So the first thing that I want to do is tell you about the continuous process which is underlying this thing. So let's introduce a Brownian motion. You remember: $W_{t+s} - W_t$ is a Gaussian with variance s (times the identity), the increments are independent of each other, and $W_0 = 0$. So this is the definition of Brownian motion.
Now, the point is that this type of update you can write using the Brownian motion. So now we are in the case where K is equal to $R^n$; there is no projection. In this case, I can rewrite the recursion as follows. Define the continuous-time interpolation $\bar X$ by $\bar X_{k\eta} = \tilde X_k$. And now what I want to say is: what is the increment of $\bar X_t$ in time dt? From $\tilde X_k$ to $\tilde X_{k+1}$, how much time passes? I think of this as time, so an $\eta$ time step passes. In time $\eta$ I move by $\eta$ times the gradient term, so in time dt I move by dt times it, with the gradient evaluated at the previous grid point; and I get the increment of the Brownian motion, which is just $dW_t$. So:
$$d\bar X_t = -\frac{1}{2}\, \nabla F\big(\bar X_{\eta \lfloor t/\eta \rfloor}\big)\, dt + dW_t.$$
This equation really gives me this process. From one grid point to the next, by how much do I increase? Well, over an $\eta$ time step I get $\eta$ times the gradient at the previous point, and I get the $dW_t$ increment over an $\eta$ time step, which is exactly a Gaussian with variance $\eta$. Which is exactly what I want.
But you see, as soon as you see this, what do you want to write? What you want to write is
$$dX_t = -\frac{1}{2}\, \nabla F(X_t)\, dt + dW_t.$$
I don't want to freeze the gradient at grid points; I want to take the gradient continuously and move continuously in the direction of the gradient.
So this is a diffusion process, and its stationary measure is mu. Okay, how do you see that? I mean, at least for me, when I look at it, it's not clear at all that this has mu as its stationary measure. But I will leave this as an exercise; it's an easy one. You can write the Fokker-Planck equation. What does that mean? The Fokker-Planck equation just describes the time evolution of the density of X. X is a random variable, it keeps changing, and its density, call it $\rho_t$, keeps changing. So let's say $\rho_t$ is the density of $X_t$.
Then what you can write, and this is a calculation, is
$$\frac{\partial \rho_t}{\partial t} = \frac{1}{2}\, \nabla \cdot \big(\rho_t\, \nabla F + \nabla \rho_t\big),$$
where $\nabla \cdot$ is the divergence; let's not worry too much about it.
So how does the density evolve with respect to time? It evolves according to this formula. So now let's look at this formula for a minute. If I plug in $\rho_t = \mu$, let's see what happens. What is the gradient of mu? You know what mu is: mu is $e^{-F}$ over Z. So what I get is exactly minus the gradient of F times $e^{-F}$ over Z, that is, $\nabla \mu = -\mu\, \nabla F$. So you see that mu makes this term vanish: the gradient of mu is exactly $-\mu \nabla F$, those two things cancel, and we see that if you start at mu, you stay at mu. Okay, the density doesn't change.
So there are many ways to derive this. The simplest one, which I prefer, is just Itô calculus and integration by parts. The fancy way is to write down the generator of this diffusion; this operator is the adjoint of the generator. That's how you move from describing the Markov process through its generator to the Fokker-Planck equation. But anyway, this is all very standard and easy.
So now we know that mu is the stationary measure, but we want to know how fast $\rho_t$ goes to mu. And this is also very standard in this case, when you have some convexity; the keyword is the Bakry-Émery curvature condition, which gives the following. If I look at the total variation distance between $X_t$ and mu (and right now I'm only talking about the continuous-time process; there are two steps, first understanding how fast the continuous-time process goes to the stationary measure, and then understanding the discrepancy between the discrete-time process and the continuous-time one), it's easy to see that
$$TV(X_t, \mu) \le e^{-\alpha t} \sqrt{\chi^2(X_0, \mu)}.$$
So what is the chi-squared distance between two densities, say G and H? It's just
$$\chi^2(G, H) = \int H \left(\frac{G}{H} - 1\right)^2.$$
Okay, you remember the relative entropy between G and H is just $\int G \log \frac{G}{H}$; for the chi-squared, I just replace $\log x$ by $(x-1)^2$. So we get this. Why do we get this? Here is how you do it. There is this theory of Bakry-Émery curvature, which is a certain condition on the generator. You verify that the generator satisfies it, and then you get a Poincaré inequality, and the Poincaré inequality directly tells you that you get exponential convergence in L2. And then you want TV, so you do a Cauchy-Schwarz, and what comes out is the TV on one side and the chi-squared on the other side. So this is all standard. And now the point is that this chi-squared is at most exponential in n for $X_0$ a Gaussian.
So if the distribution of the starting point is a Gaussian, the chi-squared is exponential in n, and then you decrease exponentially fast. So what time T do you need? It's of order $\frac{n}{\alpha} \log \frac{1}{\epsilon}$, and then you're mixed: you're at distance epsilon, for the continuous-time process. Now, how do you compare the discrete time and the continuous time? There is an exact equality, not for the total variation, but for the relative entropy between the two. This is called Girsanov, and it is very simple. It goes like this.
So:
$$TV(\bar X_T, X_T) \le \sqrt{\tfrac{1}{2}\, \mathrm{KL}\big(\bar X_T \,\|\, X_T\big)} = \sqrt{\frac{1}{8} \int_0^T \mathbb{E}\, \Big\| \nabla F(\bar X_t) - \nabla F\big(\bar X_{\eta \lfloor t/\eta \rfloor}\big) \Big\|^2\, dt}.$$
The first step is Pinsker's inequality; the second is an equality, so nothing is lost there.
And this is Girsanov, which is just the fact that our two processes are two Brownian motions with different drifts, so they are absolutely continuous with respect to each other, and you can write down exactly, you can make sense of, $\frac{d\bar X_T}{dX_T}$, which is what you need for this type of quantity. And you get this formula. Hopefully I will have time to tell you what we did, because none of this is [indiscernible], but it's important background.
So now we use the beta-smoothness: this is less than $\beta^2$ times the expectation of $\|X_t - X_{\eta \lfloor t/\eta \rfloor}\|^2$, and this is all X now, no bar, the continuous-time thing. Now, how far apart are those two points? There is the part that comes from the drift: during a time interval of length eta, the drift moves you by order eta, so you get an $\eta^2$. And what about the Brownian motion part? During an eta time step it has variance eta; but it has variance eta in every dimension, and you have n dimensions, which are summed for the squared Euclidean norm. So this is upper bounded by, let's say, $2\eta n$.
So now you see how small you should take eta. Because what you get in the end is that this is upper bounded by something like $\sqrt{\frac{\beta^2}{4}\, \eta\, n\, T}$. Now, we said we take T to be of order $n/\alpha$; so let's forget about alpha and beta, those are constants, and focus on eta and n. T is like n, so we get $\sqrt{\eta\, n^2}$. So you want to take eta to be $\epsilon^2 / n^2$: if you take eta to be $\epsilon^2 / n^2$, you get that this TV is less than epsilon.
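Putting the two pieces together (a summary of the calculation just sketched, with alpha, beta and constants suppressed): the triangle inequality for TV gives
$$TV(\bar X_T, \mu) \le TV(\bar X_T, X_T) + TV(X_T, \mu) \lesssim \sqrt{\eta\, n\, T} + e^{-\alpha T} \sqrt{\chi^2(X_0, \mu)},$$
so taking $T \sim \frac{n}{\alpha} \log \frac{1}{\epsilon}$ and $\eta \sim \epsilon^2 / n^2$ makes both terms of order $\epsilon$, at the cost of $N = T/\eta \sim n^3/\epsilon^2$ discrete steps.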
Yes?
>>: This little t, so it says TV of X bar little t.
Sebastian Bubeck: Yes.
>>: And X little t. What is this little t, this is for --
Sebastian Bubeck: For any fixed little t this is true.
>>: What's the relationship to N or big T?
Sebastian Bubeck: So I'm going to apply this to big T. Oh, yeah, I see what you mean. Okay. Good.
Thanks.
>>: So as time goes by they drift farther away.
Sebastian Bubeck: No, the point is exactly that, because of Girsanov, they don't escape from each other; they stay close. Those two things, that's what I was saying, are absolutely continuous with respect to each other, so the Brownian motion part brings them closer again. And it makes sense, you know. This is just -- yeah?
>>: [Indiscernible].
Sebastian Bubeck: I mean there is some magic here. I mean they are drifting away. Because you see
this T: when T gets large, okay, that's why, as T gets bigger, you need a smaller eta. If you
want to be good at the very last time you need a small eta. But still this is pretty sharp; I guess that's
what I was saying. But you're right. At a fixed eta, if you look at too large times,
then they will drift away.
So okay, I didn't say it, but of course you have the triangle inequality [indiscernible]. Good, you tune those
things and you get what you need. So, capital N, what is it? What you want is capital N times eta to
be T, which means that capital N is like N cubed.
This needed N cubed steps, and that is what we improved.
So now what is the issue for us with the projection?
>>: [Indiscernible].
Sebastian Bubeck: Capital N is the number of steps; so you have the prior [indiscernible] in terms of
continuous time, and this one in terms of [indiscernible]. So capital N corresponds to capital T.
Sorry, this is not the best notation.
So now what are the issues for us? Okay, everything breaks. Nothing works. First of all, it's not clear
what is the diffusion process to write down. Girsanov is not going to apply because, you know,
the two processes are not going to be absolutely continuous with respect to each other, so
we need something completely different. Thankfully mathematicians did all the work for us, so we
don't have much to do except put the things together.
There is this problem called the Skorokhod problem, which goes like this. You are given W, which
is, let's say, a map from [0, T] to R^N, piecewise continuous. It's a piecewise continuous function. And here
is the question. Does there exist X and Phi from [0, T] to R^N such that, one, X(t) is
always in K, your convex body?
The thing is, I'm going to define for you what it means to take a reflected version of a piecewise
continuous path. So, one, you want that X stays in K. Two, you want that X(t) can be written as W(t)
plus Phi(t). So far I'm not doing much. But the last one is the critical one. The last one is that Phi
basically only changes when X is on the boundary. So Phi(t) can be written as minus the integral
between 0 and t of nu_s L(ds), where nu_s is the outer normal at X_s; that's nu_s if X_s is here on the
boundary. And L is a measure, a measure supported on the set of times such that X(t) is on the
boundary of K.
So this Phi only changes when X is on the boundary. I'm just pushing X_t; I'm just allowed to push W
when X_t reaches the boundary.
>>: So without the [indiscernible], because you can just wait until you get to the boundary and then
stick it there?
Sebastian Bubeck: Yes. Yes. Yes. Absolutely. So the theorem, from Tanaka '79, is: yes.
Does there exist X and Phi? Yes. And if W is continuous, so are X and Phi.
>>: So can you just describe [indiscernible].
Sebastian Bubeck: Oh, the motivation is very clear. [indiscernible]. So you have K, you have a set of
constraints. And you have some path, I don't know, like this. This is a path in R^N. Does that
make sense? And now what I want is to make sense of what it means to reflect W so
that it stays in K, but mimics W as much as possible. So what I'm going to draw now is X. So
X goes like this, it follows W, and now it says, oh shit, he's asking me to go out. So it just sticks, you
know. Okay, maybe this drawing was not perfect. But whenever it gives me the opportunity, I
follow it again.
>>: Using that Phi.
Sebastian Bubeck: Here I follow it again. And now I can basically mimic it, and here I'm going to stick
and --
>>: Reflection [indiscernible].
Sebastian Bubeck: So the thing is, this is definitely not a Brownian motion path. For a
Brownian motion it would look like a reflection, but not in general.
>>: Oh, it would look very much like a reflection, but it wouldn't look like a reflection of the original.
A reflection implies that they actually reflect in reverse if you apply the transformation.
Sebastian Bubeck: But it is true that it's equivalent.
>>: It's a reflection [indiscernible]. It's not a reflection of the original path.
Sebastian Bubeck: Yes, that is true, yes. That is true. But in distribution it is equal. But otherwise
you're totally right.
Okay, so that's what we're going to use. Maybe, okay, what is the construction? The construction is very
simple; it's done by projection. So, okay, let's not do the construction. But the thing
is, I can also prove a certain continuity, or rather a certain Lipschitzness, of this process. The map from
W to X is Lipschitz. So if W and W bar are two paths which are close by, then the reflected versions will
be close by.
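For concreteness, here is a minimal sketch of this projection construction in Python, assuming K is the Euclidean unit ball so the projection has a closed form; proj_ball and skorokhod_map are illustrative names, and the nearby-paths check at the end is only a numerical hint at the Lipschitz property, not a proof:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    # Euclidean projection onto the ball of the given radius (our stand-in for K).
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def skorokhod_map(increments, x0):
    # Discrete Skorokhod map: follow the increments of W, projecting back into K
    # whenever the path tries to leave; returns the reflected path X.
    x, path = x0.copy(), [x0.copy()]
    for dw in increments:
        x = proj_ball(x + dw)
        path.append(x.copy())
    return np.array(path)

rng = np.random.default_rng(0)
n, steps, eta = 5, 1000, 1e-3
dw = np.sqrt(eta) * rng.standard_normal((steps, n))    # Brownian increments of W
dw_bar = dw + 1e-4 * rng.standard_normal((steps, n))   # a nearby driving path W bar
x = skorokhod_map(dw, np.zeros(n))
x_bar = skorokhod_map(dw_bar, np.zeros(n))
# Lipschitz-type behavior: nearby driving paths give nearby reflected paths.
print(np.max(np.linalg.norm(x - x_bar, axis=1)))
```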
Okay, so now let's focus on f equal to 0. So now I just want to sample from the uniform measure on my
convex body. So what I claim is that X bar T is the reflected path, the Skorokhod path, of the Brownian motion
taken at discrete times, that is, of W_{η⌊t/η⌋}. So what I do is I have this Brownian motion that I
only look at at multiples of eta. And when I look at it, what I want is to project. So when it
stays inside, I stay inside, but when it goes outside, I project it back. That's exactly what my process is
doing. When it can do the Gaussian jump, it does it, and if the Gaussian jump takes me out, then I
project back. So X bar t is exactly the Skorokhod path of W_{η⌊t/η⌋}. And what is the natural equivalent? It is X_t:
the continuous one, X_t, is the Skorokhod path of W_t. It makes a lot of sense.
By the way, just so you see this as a differential equation, I can also rewrite it. What is this? It
means that dX_t is equal to dW_t minus nu_t L(dt). So this is the key difference with respect to
before: this term, the term with the local time L.
So now we understand that our algorithm, our discrete-time algorithm, is the Skorokhod path of a discrete
Brownian motion. And we want to do two things. We want to analyze the
mixing time of this guy with respect to the uniform measure, which is not so easy. And we want to
understand the discrepancy between the two. So the discrepancy, let me tell you quickly.
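And here is what the resulting discrete algorithm looks like as code, in the same toy setup as above (K the unit ball; with grad_f = 0 this is the uniform-measure case just discussed). This is a sketch of a projected Langevin step, not the exact code used for the experiments mentioned later:

```python
import numpy as np

def proj_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_langevin(grad_f, n_dim, eta, n_steps, seed=0):
    # One Gaussian jump per step: drift by -eta * grad f, diffuse by sqrt(2*eta),
    # and project back into K whenever the jump lands outside.
    rng = np.random.default_rng(seed)
    x = np.zeros(n_dim)
    for _ in range(n_steps):
        x = proj_ball(x - eta * grad_f(x) + np.sqrt(2 * eta) * rng.standard_normal(n_dim))
    return x

# f = 0: approximate uniform sampling from the unit ball in dimension 10.
sample = projected_langevin(lambda x: np.zeros_like(x), n_dim=10, eta=1e-3, n_steps=20000)
print(np.linalg.norm(sample))
```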
>>: So, Sebastien?
Sebastian Bubeck: Yes.
>>: I have a question. Rather than doing the projection in the first place, could
you have tried to put a penalty there and just do the whole thing unconstrained?
Sebastian Bubeck: Yes. But this is not so easy. But we're thinking about it and other people are
thinking about it. And probably this will go much beyond what I present. So the hope is to get
mixing time of N, or even square root of N, or even less in the case where you have gradients. But this goes
way beyond what I present.
Okay, so let me just tell you the Lipschitz statement, which is the following thing. I take two paths W and W
bar, I solve the Skorokhod problem, I get X and X bar and Phi and Phi bar, and:

\[
\|X(t) - \bar X(t)\|^2 \;\le\; \|W(t) - \bar W(t)\|^2 + \int_0^t \big\langle\, \big(W(t) - \bar W(t)\big) - \big(W(s) - \bar W(s)\big),\; \varphi(ds) - \bar\varphi(ds) \,\big\rangle.
\]
So this is a direct result. So it's giving me some notion of smoothness of the map from W to X. And it
is easy to see that it implies that the expectation of X_T minus X bar T is something small. I'm going to
be less precise now because I have five minutes. So this is a bound in Wasserstein distance; this is
a W_1 bound between X_T and X bar T. So the Wasserstein distance between two measures mu
and nu is the infimum, in L^p, over (X, Y) which is a coupling of mu and nu,
so X has distribution mu and Y has distribution nu, of the expected L^p distance
between X and Y.
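Written out, the definition just stated is the standard one (using ν for the second measure to keep the two apart):

\[
W_p(\mu,\nu) \;=\; \inf_{(X,Y)} \big(\mathbb{E}\,\|X-Y\|^p\big)^{1/p},
\]

where the infimum runs over couplings (X, Y) with X distributed as μ and Y distributed as ν; here p = 1.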
So this is exactly a W_1 bound. But what I care about is a total variation bound. And of course, I
mean, there is no comparison in general, but intuitively W_1 can be much smaller than the total variation
distance, because it is scaled by the distances. So if the distances are small, W_1 would be small, but the
TV could still be very large. So there is an issue of how you go from W_1 to TV, and I will give
you that in a minute. So from W_1 to TV, how do you do it? It's not clear in general. And we'll do it
now.
But before that, what about the mixing? The mixing of X bar T, of X T. The mixing of X T is actually
very simple; it's, I think, something very well known. So to evaluate the mixing, what you need is to evaluate
how much time it takes to couple from two different starting points. So if I start at x and x prime, I
want to know, for the worst x and x prime, how much time it's going to take before the two meet. So I
get two paths, X_t and let's say X'_t, and I want to know how much time it takes before the two
can be coupled. What I want is the TV between X_t and X'_t. And this is actually a one-dimensional
problem. Because, you see, what I can do is couple the two Brownian motions so
that whenever this guy wants to go to the right, that guy goes to the left, and vice versa. And then this is
really just about the distance between the two.
So what I can do is I look at those two points. I look at the hyperplane that separates those two
things, and what I do is a reflection, a mirror coupling, with respect to this hyperplane. So when this guy
goes like this, that guy goes like that. So now I just have a one-dimensional Brownian motion for the
distance between the two. And the question that you need to ask is: I have a Brownian motion that
starts at the distance ‖x − x′‖; what is the probability that it doesn't hit 0 before time T? This is exactly
the question I need to ask, to get the probability that those two did not meet before time T. Because once
they meet, I can just keep them together forever. Okay, this is a simple bound. It gives you
‖x − x′‖ over the square root of 2 pi T. So if this is a constant, then you just need constant time, and we
know from general principles, as you told me, that once the TV has dropped below a constant it decreases
exponentially after that. So here you get log 1 over epsilon. So this is for the mixing time.
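A quick numerical sanity check of this one-dimensional reduction; the values of d and T are arbitrary, the grid discretization is an approximation that can miss crossings between grid points, and the exact survival probability P(τ₀ > T) = erf(d/√(2T)) is included for comparison:

```python
import numpy as np
from math import erf, sqrt, pi

# Monte Carlo for the reflection-coupling question: a 1-D Brownian motion
# started at distance d; estimate the probability it has NOT hit 0 by time T.
def survival_mc(d, T, n_paths=20000, n_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, d)
    alive = np.ones(n_paths, dtype=bool)
    for _ in range(n_steps):
        x = x + sqrt(dt) * rng.standard_normal(n_paths)
        alive &= (x > 0)  # once a path hits 0 the two copies are coupled forever
    return alive.mean()

d, T = 0.3, 4.0
exact = erf(d / sqrt(2 * T))     # P(tau_0 > T) = P(|N(0,T)| < d)
bound = d * sqrt(2 / (pi * T))   # the d / sqrt(T)-type bound from the talk
print(survival_mc(d, T), exact, bound)
```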
So now we know that the mixing is fast, and we have this coupling argument. So let's go back to going from W_1
to TV, which is the last part, and it's just getting back to the question that you asked in the beginning.
So the key point is the following. Here is a simple lemma; it's really three lines. The measure of
the set of points whose distance from the boundary is at least epsilon
is at least 1 minus N epsilon over R, where R is the radius.
So again, this is the uniform measure case. In the general case, for a general potential, you can do this too; it's
also very easy. So if you take epsilon of order 1 over N or 1 over N squared, most of the mass is at that
distance from the boundary. This is just a trivial bound.
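One plausible three-line argument for this lemma, assuming K contains a Euclidean ball of radius R centered at the origin: shrinking K by a factor 1 − ε/R keeps you at distance at least ε from the boundary,

\[
(1-\varepsilon/R)\,K \;\subseteq\; K_\varepsilon := \{x \in K : \mathrm{dist}(x,\partial K) \ge \varepsilon\},
\]

so

\[
\frac{\mathrm{vol}(K_\varepsilon)}{\mathrm{vol}(K)} \;\ge\; (1-\varepsilon/R)^N \;\ge\; 1 - \frac{N\varepsilon}{R}.
\]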
Now here is what we do. We know that the two processes X_T and X bar T are close in W_1. And
we want to say that they are close in TV, meaning we want to really couple them so that they actually
meet.
So let's say this distance to the boundary is epsilon. We have X_T and X bar T,
and those guys are close in W_1. So we know we can wait until their distance is, let's say, epsilon
prime, which is much smaller than epsilon. So those two are close. And we know we are at a time, we
saw this above, where X_T is already mixed. So X_T is distributed like the uniform measure. So we know that
it has to be away from the boundary. So this picture is really the correct picture: X_T is away from the
boundary, at a distance of at least epsilon, and X bar T is close to it. But now observe: unless there
is a projection, X_T and X bar T follow the exact same process. It's only when there is a
projection that the two processes are different.
So what we can do is apply the previous argument. As long as they meet before they touch the
boundary, we're done. We know exactly how much time it takes for them to meet; this is the
reflection-coupling argument. And we know how much time it takes for them to reach the boundary, because it's at a distance
of at least epsilon. So we can show, you know, you set epsilon and epsilon prime so that they meet
before they touch the boundary. And now we have a bound in TV instead of having a bound in W_1.
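Roughly, and as a reconstruction of the tuning rather than the talk's exact constants: started at distance ε′, the mirror-coupled gap is a one-dimensional Brownian motion, so by the hitting bound above

\[
\Pr[\text{no meeting before time } t] \;\lesssim\; \frac{\varepsilon'}{\sqrt{2\pi t}},
\]

while covering the distance ε to the boundary takes time of order ε². On that time scale the failure probability is of order ε′/ε, so taking ε′ much smaller than ε makes the coupling succeed; each such gap is where factors polynomial in N get lost.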
And now, how do you generalize this story to the case where F is not equal to 0? It's just Girsanov
everywhere.
And that's it. Okay. Thank you.
[Applause]
>>: So just where does the giant [indiscernible] come in? Can you just kind of point to it along the way?
Sebastian Bubeck: Well, I'm saying epsilon prime has to be much smaller than epsilon. So here, every
time, you lose things in terms of N.
>>: Oh, I see. So the gaps that you require always --
Sebastian Bubeck: Exactly; in this part of the argument you actually lose quite a bit. And
there should be a nicer way to do it, but we don't know how to do it. And just a comment: we did
experiments. And these are currently only small scale, because those things do not work in really high
dimension. So it's dimension up to 100, and we're trying to estimate the volume of some convex
bodies. So the [indiscernible] and the students, they have an implementation where they use a
[indiscernible]. So I went into the code and I replaced every line of hit-and-run by just a stochastic
gradient descent step, and what you get is something that achieves the same accuracy and is faster in most
cases. It's not by an order of magnitude; it's just a little bit faster.
Now, the thing is, we plugged the stochastic gradient step in the middle of this big loop that they
have to compute volumes. What I really would like is to estimate how much better stochastic gradient
is compared to hit-and-run at mixing. But how do you evaluate whether you have mixed or not? That's
not such an easy problem either. That being said, things can be done, and I think it's an interesting
problem to work on.
>>: What can be done?
Sebastian Bubeck: Well, you can, okay, so, well, one thing you could do is you could try to estimate
online the spectral gap. If you have already mixed then you should have a good estimate of this.
>>: But the argument doesn't go the other way.
Sebastian Bubeck: Yes. Okay, so definitely what you can do is come up with a battery of statistics
that you want to test. Whether they cover everything that you want is of course not clear at all. But
maybe you can make some assumptions. I mean, maybe under some assumptions on the Markov chain,
there is a set of statistics that you can test to check whether you have mixed or not.
>>: This is a big problem.
Sebastian Bubeck: Yeah, of course.
>>: Can you at least do it in special cases where you can actually sample? So if you can sample exactly from
the solution, then you could try to evaluate it, right?
>>: Then it's just a standard test, right?
Sebastian Bubeck: Yes. Yeah, what kind of test?
>>: Oh, I mean like cases where you can sample. Like you could then do exact sampling [indiscernible].
Sebastian Bubeck: Yes. You can definitely do that. Yes. And then you can run it; I mean, even if you
run those things on a hypercube, they are doing nontrivial things. And you know sampling from a
hypercube is easy.
>>: You could also alternate between hit-and-run and this, actually.
Sebastian Bubeck: That's a good point also, yes. You can do anything.
>>: [Indiscernible].
>>: [Indiscernible]
Sebastian Bubeck: Yes. Yes. So you can always, because it's log-concave, it has light tails. So after
some point there is no more mass. So you can always [indiscernible]. You see what I mean? So at which
level are you asking the question, for the algorithm or for the --
>>: [Indiscernible].
Sebastian Bubeck: I mean, it's always the case that far enough away -- what?
>>: [Indiscernible].
Sebastian Bubeck: Yes, far from the minimizer, there won't be much mass left, because it's log-concave. So
that's always true. Now, the bound that you get on how far away is far away is going to be again
polynomial in N, so it will again enter, you know, you can get some discrepancy. So you're asking
if we can apply all of this to the case where K is unbounded, is that the question?
>>: The question was [indiscernible].
Sebastian Bubeck: Thank you.
[Applause]