>> Yuval Peres: Hello, we're delighted to have Sebastien Bubeck from
Princeton tell us about two basic problems in stochastic optimization.
>> Sebastien Bubeck: Thank you. I think I prepared way too much, so maybe it will only be one problem in stochastic optimization. Let me see. So feel free to interrupt me. I'm going to start with, and maybe only talk about, roughly my favorite problem in research. So this is the problem. You have K unknown probability distributions which are sub-Gaussian, with variance proxy bounded by 1, something like that. And what you want is to find the distribution which has the maximum mean. So the goal is to find the argmax of mu_i, where mu_i is defined as the expectation of X when X is drawn from the i-th distribution. Let's call it i*, and I'm going to assume it's unique.
So my goal is to find i*. And I need to tell you how I interact with these probability distributions. What I can do is this: I have a budget of N samples, so I can sequentially query the probability distributions and I get realizations from them. So I want to find i* using N observations, which are sequentially chosen.
More precisely, a little bit more formally, we have a sort of sequential game. Time goes from 1 to N, and at each time step, what do I do? I choose I_t, which is in {1, ..., K}: the index of one probability distribution. And what I receive is a realization: I receive Y_t, which is drawn from mu_{I_t}, and this draw is independent of everything else conditionally on I_t.
Once I've used my N samples, at the end what I want is to output one of the probability distributions. So I output J_N, which is in {1, ..., K}, and my hope is that J_N will be i*, that one. And how am I evaluated? I'm evaluated by how the mean of this guy compares to the true best mean. So my regret, the regret at time N, is the difference between mu_{i*}, the best mean that I could have obtained, and mu_{J_N}, the mean of the selected probability distribution. I'm going to call these K probability distributions arms; that's the terminology I use. And I put an expectation here. This is what I call the simple regret: r_N = E[mu_{i*} - mu_{J_N}].
It's just my optimization error. I want to optimize over the mu_i, and instead of the max I got this guy, and I look at the distance, and I want this thing to be as small as possible. If you know about -- yes?
>>: [indiscernible] when you go out, you inspect this guy one by one?
>> Sebastien Bubeck: Yeah.
>>: Then you select that guy, so you have the option to select this guy, or you can choose not to select it and go --
>> Sebastien Bubeck: You have total freedom when you choose I_t in {1, ..., K}; it can depend on all the previous observations you made. I don't know, imagine I made plenty of trials and now I choose to observe probability distribution No. 3, and at the next time step I choose probability distribution No. 10.
>>: Can you return?
>> Sebastien Bubeck: I can return. I can do anything I want.
>>: Would you have to return to the same things --
>> Sebastien Bubeck: Exactly. So to estimate the mean of arm 1, for instance, I'll need to try it many, many times to have an accurate estimate. Because imagine this is a Bernoulli distribution: the first time I try it I get 1, the second time I get 0. I want to learn -- essentially I want to learn the mu_i's.
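To fix ideas, here is a minimal Python sketch of this sequential game; the `strategy` object with `choose` and `recommend` methods is a hypothetical interface, not something from the talk.

```python
import numpy as np

def simple_regret(mu, strategy, N, rng=None):
    # Play the sequential game: at each step the strategy picks an arm,
    # observes one sample, and after N rounds recommends a single arm J_N.
    rng = np.random.default_rng(rng)
    history = []                               # list of (arm, observation)
    for t in range(N):
        i = strategy.choose(history)           # I_t may depend on the past
        y = mu[i] + rng.standard_normal()      # Y_t ~ N(mu_i, 1)
        history.append((i, y))
    j = strategy.recommend(history)            # J_N
    return max(mu) - mu[j]                     # realized simple regret
```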
Okay. So that's what I want to do. Now, I'm not really the first one to look at this problem, as you can expect. This is as basic as it gets: I just want to find the max of K finite things. But the issue is that roughly there have been two approaches, which are minimax and Bayesian. Because what I want is to find the optimal, in some sense, allocation strategy. I want to find the best allocation strategy so as to minimize this simple regret. But this problem is not well defined a priori: there is an optimal allocation strategy if I know the mu_i -- I do whatever I want, and at the end I output the best guy. This is optimal.
Now, in minimax what you say is: you design your allocation strategy, and then I can choose which set of probability distributions I'm going to throw at you, and I look at what your regret is in that case. That's minimax. And then the best is well defined: it's a min of a max.
>>: The normalization, there's no normalization.
>> Sebastien Bubeck: There's no normalization. There's no normalization here. So -- right. So in the minimax sense, the answer is that the optimal r_N is of order square root of K over N. That's the best you can do. And that's not very difficult to prove, and I think that's totally uninteresting. A trivial strategy will get you something like square root of K log K over N; you have to work a little bit to remove the log, but that's not really the point. We'll see that you can gain orders of magnitude with a new point of view.
>>: [indiscernible].
>> Sebastien Bubeck: What does what mean?
>>: Sub-Gaussian.
>> Sebastien Bubeck: It just means the usual thing.
>>: That's the normalization. So does that have a constant? So which --
>> Sebastien Bubeck: Right. Right. So sub-Gaussian with constant one, with a known constant.
>>: [indiscernible].
>>: In terms of Gaussians can be used.
>> Sebastien Bubeck: No, that's okay. I said it quickly: I said the variance proxy is bounded by one. It's not the main point. Imagine everything is Gaussian with variance one; the talk makes sense and is nontrivial.
So that's minimax, and this has been studied since the '50s, et cetera. Bayesian: you put a prior on the possible parameters of the probability distributions, and then you want to find the allocation strategy that minimizes the expected simple regret, where the expectation is with respect to the draw of the parameters from the prior. This is also well defined, but then you have to come up with a prior, et cetera, and it's not clear how to do it. So I want to go beyond these two things, and I will propose a sort of new perspective which allows you to talk about optimal strategies without committing to either the minimax or the Bayesian viewpoint. We'll come back to it.
So for those of you who know multi-armed bandits, this feels like a multi-armed bandit problem, except the performance measure is different. In bandits what we look at is this capital R_N, the cumulative regret: the sum of the instantaneous regrets at every time step. So we look at the sum for t = 1 to N: at every time step we could have played the optimal arm and gotten mu_{i*}, but instead we played I_t, so we got mu_{I_t}. And I'm going to look at this in expectation: R_N = E[sum_{t=1}^N (mu_{i*} - mu_{I_t})]. That's called the cumulative regret. And the trivial thing is that the simple regret is upper bounded by the cumulative regret divided by N. What I can do is, at the end, when you ask me to output something, I just select a time step at random and output the action that I played at that time, and that gives me this bound: r_N <= R_N / N. So this is always true.
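That reduction is easy to sketch; the `history` format below follows the hypothetical game-loop interface above.

```python
import numpy as np

def recommend_random_round(history, rng=None):
    # Turn any cumulative-regret strategy into a simple-regret one:
    # recommend the arm played at a uniformly random time step, so that
    # E[simple regret] = E[cumulative regret] / N.
    rng = np.random.default_rng(rng)
    t = rng.integers(len(history))
    return history[t][0]                  # the arm index I_t
```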
And this already gets you the minimax rate. But now I'm going to show you that you can do much better. So okay -- yes?
>>: [indiscernible] you think it's because -- the I_t?
>> Sebastien Bubeck: Yes. So the expectation is with respect to everything, in particular with respect to I_t. I_t could be randomized. But even if it's deterministic, it depends on the previous observations, which are themselves random. So I take the expectation. It's a complicated expectation -- I mean, it's simple to write, but it's not obvious how to analyze it.
>>: Expecting this thing, how many do you want to select from them -- just the maximum of them you want to select?
>> Sebastien Bubeck: Right. That's exactly what I'm going to talk about: how do you do this? What is the allocation strategy? How do you choose this guy?
>>: You choose several times.
>> Sebastien Bubeck: You choose one at a time, but you will make N selections.
>>: Make N selections. N is given to you.
>> Sebastien Bubeck: N is given to you. N is given to you.
>>: If N is really large, can't you find it --
>> Sebastien Bubeck: Yes, but the question is that you want the optimal rate. You want to make the most out of it. Like, with N, you will see.
So let me show you one trivial thing so that we're all on the same page. What is trivial is: you have a budget of N samples and K options, so you just allocate N/K samples to each arm. You try each option N/K times. So what does this give you? One thing, to simplify for the talk: I'm going to look at the error rate. The regret -- let's say all the means live in [0, 1], for instance -- is certainly smaller than e_N, the error rate, which is the probability that J_N is not equal to i*. And let me introduce a notation that is going to be important for the rest of the talk: Delta_i = mu_{i*} - mu_i. So Delta_i is the suboptimality gap, the distance between the quality of i and the best quality that you can get.
The regret is also certainly larger than Delta times e_N, where Delta is the smallest gap: Delta = min of the Delta_i over i not equal to i*. All right. So to a first-order approximation it's fine to focus on e_N, the error rate. So for most of the talk I'm going to talk about the error rate rather than the simple regret. So let's see what the error rate of this simple thing is.
about the error rate rather than the simple regret. So let's see what
is the error rate of this simple thing just apply general, so deviating
from epsilon is going to be bounded by exponential minus T times
epsilon squared. If I sample T times someone what I mean is everybody
stays in an interval roughly size data I if this is the case there's no
way that at the end I output somebody else and the best guy. If
everybody's in the interval of delta size I then I get the best thing.
So I have mu I. If it stays within let's say one-half of delta I, if
the empirical -- so going to denote by mu hat I. So that's my
empirical estimate using my samples, if the empirical estimate stays
within an interval of size one-half delta I for every I then at the end
I just look at the best one and I will get the true best. So what does
this give me? So with a union bound we just get the sum for I calls 1
to K exponential minus N over K. That's the number of samples that I
get times delta I squared. That's the deviation that I'm looking at
and the constant. Okay. So let me -- so to commence on this, the
The first comment is that this goes exponentially fast to zero: the probability of making a mistake goes exponentially fast to zero. And when is it small? The sum is dominated by its largest term, and the largest term comes from the i which minimizes the gap. So this is small when (N/K) Delta^2 is large: it is smaller than delta as soon as (N/K) Delta^2 is larger than log(1/delta). So what we mean is: it's small when N is at least of order K over Delta^2. If the number of samples, the number of experiments that I can make, is at least K over Delta^2, just uniform allocation will find the best action.
>>: Another log K?
>> Sebastien Bubeck: Right, I want another log K. I absolutely want another log K. If this is log(K/delta), then each term is delta/K, the sum is delta, and I find it. So that's what I get.
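Here is a minimal sketch of that uniform-allocation baseline, under the Gaussian-with-variance-one simplification used in the talk.

```python
import numpy as np

def uniform_allocation(mu, N, rng=None):
    # Pull each of the K arms N // K times, then recommend the arm with
    # the best empirical mean.
    rng = np.random.default_rng(rng)
    pulls = max(1, N // len(mu))
    emp = [np.mean(m + rng.standard_normal(pulls)) for m in mu]
    return int(np.argmax(emp))
```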
Okay. So this is fine. But clearly it doesn't feel very good, because you try everybody, while it could be that very early on you identify that some of these guys are not competitive: there is almost no chance that this guy will be the best one, but still you keep sampling him. What you want, maybe, is to focus your attention on the guys which look difficult to distinguish.
Okay. So what I'm going to do now is show you first what are the limits of this problem. Here we said that if N is that large, we can find it. Now, how large -- I mean, if N is smaller than something, you also can't find it. This was the main theorem that we proved back in 2010, Jean-Yves Audibert, myself and Remi Munos. And this is what gives us the new point of view, which is minimax over a class of permutations. It goes as follows. Take any mu, where mu is the product of the mu_i, and I'm going to take mu to be a Gaussian distribution -- it's a lower bound, so if I stick to Gaussian distributions, that's fine. It's a Gaussian distribution with mean mu and covariance the identity, in dimension K: just a product of N(mu_i, 1). For any algorithm -- in particular, the algorithm can depend on mu; maybe we know mu -- there exists sigma, a permutation of the indices, such that the probability of error of this algorithm on mu_sigma, the permuted distribution (so instead of having mean mu_i on arm i, I have mu_{sigma(i)} on arm i), is larger than some constant times exp(-c N log(K) / H), where H is the sum of the inverse gaps squared: H = sum over i not equal to i* of 1 / Delta_i^2. So what this means is that e_N smaller than delta must imply that N is larger than a constant times H log(1/delta) / log K. If you have a small probability of error, then the number of samples that you have must be larger than this complexity -- we call H the hardness of the problem -- divided by log K. And H is the sum of the 1 / Delta_i^2. Now if you compare this to the bound downstairs, you see that here, instead of H, I have K over Delta^2. But K over Delta^2 can be much bigger than H, right? So potentially, for some mu's, H could be much smaller than K over Delta^2.
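To see how large that gap can be, here is a quick numeric check on a made-up gap profile (one nearly-tied arm, the rest far away; the numbers are purely illustrative, not from the talk).

```python
import numpy as np

K = 1000
gaps = np.array([0.01] + [0.5] * (K - 2))   # Delta_i for the suboptimal arms
H = np.sum(1.0 / gaps ** 2)                 # ~ 1e4 + 4 * (K - 2) ~ 1.4e4
worst_case = K / gaps.min() ** 2            # K / Delta^2 = 1e7

print(f"H = {H:.3g}, K/Delta^2 = {worst_case:.3g}, "
      f"ratio = {worst_case / H:.0f}")
# Only one arm is genuinely hard to separate, but the uniform-allocation
# bound charges every arm at the hardest gap: here it is off by a factor ~700.
```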
>>: [indiscernible].
>> Sebastien Bubeck: In the worst case they are the same. And that's why the minimax analysis is not interesting: in the minimax analysis, the worst case is basically when H is equal to K over Delta^2, and in that case uniform allocation is almost optimal. So this is a thing that goes beyond that. It's distribution dependent -- it depends on the distribution -- and it tells you the right measure of hardness. Not yet, but yes. So far.
>>: But at the end of the day, e_N is just a proxy for the simple regret.
>> Sebastien Bubeck: Absolutely.
>>: Because if they're not the same, e_N could be terrible but the simple regret could be easy.
>> Sebastien Bubeck: Absolutely. Yes, that's definitely correct. So this is only a first step; it's step by step. This tells us something very precise about e_N, but it's not very precise for little r_N, the simple regret; it's an approximation only for the simple regret. So this is not the end of the story -- it is the end of the story, I think, for the error rate, but not for the simple regret, I agree. I don't know what the full answer is for the simple regret.
>>: [indiscernible] the results so.
>> Sebastien Bubeck: Exactly, it's along the lines of the Lai and Robbins results.
>>: [indiscernible].
>> Sebastien Bubeck: Exactly. In some sense, for those who know the Lai and Robbins results, the way to understand it is that it's the version of Lai and Robbins for the simple regret rather than for the cumulative regret.
Now, the proof techniques are completely different. Let me spend two minutes on this. I know how to prove the analogous theorem for the cumulative regret. But Lai and Robbins is not that: Lai and Robbins have to assume something about the algorithm. They prove a lower bound for the cumulative regret, but the algorithm has to be consistent for any possible distribution -- no matter what the distribution is, you have to be consistent and converge at a certain rate. Here I assume nothing about the algorithm; it could be terrible for some distributions. And actually Lai and Robbins in that sense is not true: you can get constant regret if you know the distribution up to a permutation. So it's similar, but it's not the same.
>>: [indiscernible].
>> Sebastien Bubeck: Exactly. So now, it's not the end of the story: I need to tell you, can you achieve this H? And you can -- almost. The strategy is basically just writing down mathematically the intuition from before: you try everybody a little bit, and if there is one guy that clearly does not look like he is going to be the best, you just stop sampling him, then focus on the rest, and so on and so forth. This is successive rejects. Successive rejects works like this. It goes by phases: you have phases k = 1 to K - 1, and there is a set A of active arms. At the beginning A is everything, and during phase k you sample n_k - n_{k-1} times everyone in A -- I will give you the formula for these n_k. So all the arms that are still active get sampled a certain number of times, and then you remove from A the empirically worst guy. You do that, and at the end of phase K - 1 you're left with only one arm, the winner: you say, okay, this is the guy I believe to be the best. And the theorem is that with n_k given by some formula -- n_k is proportional to N / (K + 1 - k) -- the probability of error is bounded by a constant times K times exp(-N / (H log^2 K)). So this means that N needs to be of order H log^3 K. Here capital K is the number of arms, the number of actions, and little k is a phase index in the algorithm: at phase k, the formula tells you how many times you sample. So compare to the lower bound over there: the lower bound was that you need at least H over log K, and here, if you have more than this, you are smaller than delta. So it's tight up to the logs. This was in 2010.
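A minimal sketch of successive rejects follows. The log-bar(K) normalization of n_k is taken from the published 2010 paper; the talk itself only says n_k is proportional to N / (K + 1 - k), so treat the exact constants as an assumption.

```python
import numpy as np

def successive_rejects(mu, N, rng=None):
    # K - 1 phases; each phase tops every active arm up to n_k total pulls,
    # then eliminates the arm with the worst empirical mean.
    rng = np.random.default_rng(rng)
    K = len(mu)
    logbar = 0.5 + sum(1.0 / i for i in range(2, K + 1))   # \bar{log}(K)
    active = list(range(K))
    sums = np.zeros(K)
    counts = np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        n_k = max(n_prev + 1, int(np.ceil((N - K) / (logbar * (K + 1 - k)))))
        for i in active:
            draws = mu[i] + rng.standard_normal(n_k - n_prev)
            sums[i] += draws.sum()
            counts[i] += n_k - n_prev
        n_prev = n_k
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0]                                       # the surviving arm
```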
Just recently there's been very, very nice work from a group of people at Yahoo! -- Karnin, Koren and Somekh in ICML 2013 -- and they get down to H times log K times log log K. They modify successive rejects; it's still the overall same idea, but the analysis is a little bit different. I'm not going to explain it in detail. So we still have a gap. But I can also improve
the lower bound: I can remove this log K. Except that it's not so easy to remove. This proof is difficult -- it's the only thing I've written on the board so far which is difficult -- and you cannot really go into the proof and tweak some things; everything is tightly coupled. But I can modify the assumption a little bit and get a much simpler proof. So maybe this is of some value, because the statement is in a sense less powerful, but the proof technique is much simpler and could be applied to other settings. That's what I want to talk about now.
This is a theorem to be written up with Emilie Kaufmann -- so let's put 2014. It goes as follows. For any mu, again a product of N(mu_i, 1), and for any algorithm -- so the beginning is the same. But now the invariance isn't going to be over permutations, because that's what makes the earlier proof difficult: when you permute this guy with that guy, there's not much you can work with, and you can be in trouble. Instead of looking at permutations, I'm going to take one of the suboptimal arms and push it up. So I define F_i(mu), a map from R^K to R^K, such that the j-th component of F_i(mu) is equal to mu_j if j is not equal to i, and equal to mu_i plus 2 Delta_i if j equals i. So I have my vector mu, and this operator F_i takes the i-th coordinate and pushes it up to be the best one. The first thing to observe is that the hardness measure can only decrease: H(mu) is always larger than H(F_i(mu)). I can only make the problem simpler by doing this, because when I pull this guy up, everybody is further away from the best arm than it was previously: in mu you have a certain distance to the best guy, but in F_i(mu) the best guy becomes the i-th one, which is above the previous best, so the gaps increase. So the hardness measure decreases under this operation. And what I can show is that for any mu and any algorithm there exists i such that the simple regret on F_i(mu) is lower bounded by exp(-c N / H). So this gets rid of the log K over there. And this proof is five lines -- let's say maybe even four lines -- compared to four pages for the other one. Now, it's hard to say if it's weaker or stronger: there the class over which we take the maximum is a class of size K factorial; here it's only a class of size K. But anyway, now we know the result up to logs: the lower bound is H, exactly H, and the upper bound is H log K log log K. I think the truth is H, perhaps up to a log-log factor. And this is really a fundamental problem. And we only know basically one algorithm, which is this; everything else is a variant.
>>: [indiscernible].
>> Sebastien Bubeck: This one? It's a variant. It's not this exact algorithm: instead of having K phases they have log K phases, for instance, but it's still the same structure. And I think that to get rid of the logs you need something much smoother -- you shouldn't be tied to a schedule that you fixed beforehand; the schedule should be adaptive. But we don't know how to analyze this.
>>: [indiscernible] a little more than --
>> Sebastien Bubeck: Absolutely. With the log K phases, you remove half the arms every time. Yes, so that's the algorithm. Now I've said it.
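That log K-phase halving variant can be sketched as follows; the equal per-phase budget split is an assumption, since the talk only describes the halving structure.

```python
import numpy as np

def halving(mu, N, rng=None):
    # ~log2(K) phases; each phase spends an equal share of the budget on
    # the surviving arms, then keeps the better half.
    rng = np.random.default_rng(rng)
    active = list(range(len(mu)))
    rounds = max(1, int(np.ceil(np.log2(len(mu)))))
    for _ in range(rounds):
        if len(active) == 1:
            break
        pulls = max(1, N // (rounds * len(active)))
        emp = {i: np.mean(mu[i] + rng.standard_normal(pulls)) for i in active}
        active.sort(key=lambda i: emp[i], reverse=True)
        active = active[: (len(active) + 1) // 2]     # keep the top half
    return active[0]
```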
>>: [indiscernible] detection was to wait --
>> Sebastien Bubeck: No, because -- no. I think it's very far from it. But it's not clear. Multiplicative weights are typically designed for the minimax rate; they're not adaptive in this sense. That's their great weakness: they work with almost no assumptions, but if it turns out that the world is much nicer, then they don't adapt. And here it's really about trying to adapt as much as possible. We even want to adapt as much as if we knew the mu exactly. So --
>>: [indiscernible].
>> Sebastien Bubeck: So for UCB there is a 20-minute story; let me try to do it in two minutes. For UCB, you can provably show that it cannot converge at an exponential rate: the probability of error will be polynomial. Now, you can do a modification of UCB which actually does go at this rate. UCB plays at time step t the action that maximizes the empirical mean plus square root of 2 log t over T_i(t), where T_i(t) is the number of times arm i has been played so far. You can show that this is not going to work. What you can do is a modification called UCB-E, which we also did in 2010, and which goes like this: replace the 2 log t by a constant times N over H. You need to know H; if you know H and run this algorithm, then you get exactly this rate. Not knowing H, you could try to adapt to it online, but we don't know how to prove anything about it.
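A minimal sketch of UCB-E under the same Gaussian model; the exploration constant c in a = c N / H is a tuning assumption the talk does not pin down.

```python
import numpy as np

def ucb_e(mu, N, H, c=1.0, rng=None):
    # UCB with the 2 * log(t) exploration term replaced by a ~ N / H.
    rng = np.random.default_rng(rng)
    K = len(mu)
    a = c * N / H
    sums = np.zeros(K)
    counts = np.zeros(K)
    for i in range(K):                        # initialize: one pull per arm
        sums[i] = mu[i] + rng.standard_normal()
        counts[i] = 1
    for t in range(K, N):
        index = sums / counts + np.sqrt(a / counts)
        i = int(np.argmax(index))
        sums[i] += mu[i] + rng.standard_normal()
        counts[i] += 1
    return int(np.argmax(sums / counts))      # recommend best empirical arm
```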
But let me say something from a practical point of view, going back to one of your earlier questions. In practice, this problem is only interesting in the range where N is of the order of H. If N is much, much larger than H, anything will find the best action -- just do uniform allocation and you will find it. What is interesting is the regime where the problem is hard, where N over H is close to a constant. And if N over H is close to a constant, then the 2 log t term behaves like a constant too. So you can view the analysis of UCB-E as an analysis of the true UCB in the regime where N is of order H; this suggests that the basic UCB should work in that regime. So now let me just quickly show you some pictures. Can I get the thing down? Thanks. Yes.
>>: My question: what if you have two or three at the very top, so --
>> Sebastien Bubeck: Right. So I'm going to show you an experiment like this.
>>: [indiscernible] [laughter].
>> Sebastien Bubeck: So I'm just going to show you two experiments. In the first one we have 15 arms, Bernoulli distributions everywhere, and the best arm is arm one with mean .5. In this first experiment the means go down in an arithmetic progression: mu_i = .5 - .025 i, for i = 2 to 15. And what I plot here -- the bars are the probability of error for different algorithms. The first bar is just uniform sampling. You see uniform sampling has the worst probability of error; as expected, it's the most naive algorithm. In this problem, for instance, with N = 4,000, uniform sampling gets the wrong arm with probability .45 or something like that. Bars 2 to 4 are algorithms, such as Hoeffding races, that had appeared in the literature previously; they perform a little better than uniform, but not too much. Bar 5 is successive rejects, which I just told you about. Bars 6 to 9 are the finely tuned UCB, and the rest are UCB-E, where I try to estimate H online using a procedure which is nontrivial.
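For a rough, illustrative reconstruction of this first experiment, reusing the Gaussian sketches above (the talk used Bernoulli arms, so the numbers will not match the plot):

```python
import numpy as np

mu = [0.5] + [0.5 - 0.025 * i for i in range(2, 16)]   # arm 0 is the best
N, trials = 4000, 500

errors = {"uniform": 0, "successive rejects": 0}
for seed in range(trials):
    errors["uniform"] += uniform_allocation(mu, N, rng=seed) != 0
    errors["successive rejects"] += successive_rejects(mu, N, rng=seed) != 0

for name, e in errors.items():
    print(f"{name}: error rate {e / trials:.3f}")
```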
So you see successive rejects does better than all the previous versions. It almost divides the error by three: in the second experiment the probability of error of uniform is close to .6, and the probability of error of successive rejects is close to .2. So you can get a real improvement. This second experiment, by the way, is what you just asked about: we have one good guy with mean .5, then five guys with mean .45 -- so close by -- then another group at .43, and a third group below. So there are three groups at the bottom. And with three groups at the bottom like this, it's really worth trying to adapt: these algorithms will quickly remove all the very bad arms and focus on the good ones. Okay, so these are the numerical experiments. Now let me move on to something else: what if, instead of finding the best option, you want to find the M best? Maybe you want to find the five best arms instead of just the best one. Here I have the same two experiments. On the Y-axis I have the probability of error, and on the X-axis I have how many arms I want the algorithm to output. The first point here corresponds to the previous slide: I just want to find the best one. Blue corresponds to uniform allocation, where at the end I return the M empirically best. Yes?
>>: [indiscernible].
>> Sebastien Bubeck: Yes, this one is more demanding. You are good only if you get all M correct. So it's a difficult task.
>>: [indiscernible].
>> Sebastien Bubeck: Sorry?
>>: Ordering as well.
>> Sebastien Bubeck: No, no, no, no. That's the key point: you don't care about the ordering. You just want to find the M best; within the M best you don't care about the ordering. Right. So uniform is in blue and successive rejects is in red. In the previous slide, red was much below blue. Now what happens when you increase M? Successive rejects, which was optimal for M equals 1, becomes bad even compared to uniform.
What this says -- and it is expected at some point -- is that because successive rejects was designed to find the best one, it has only a very rough idea of the ranking below the best one, below the first two best. For the other guys it doesn't really know what's going on. So if you ask it to output the five best arms, it does a very bad job. This says you fundamentally need to modify the algorithm when you want to find the M best; you just cannot do successive rejects. (This red line is a variant of successive rejects where we rightly tuned the number of phases and the samples per phase.) So the modification that we did with two students is this green line, which is called successive accepts and rejects. The key, and only new, idea of this paper is that when you want to find the M best, you not only reject bad guys, but you also need to stop sampling guys that look good: because they look good, you just stop sampling them and say, okay, this one is going to be in the batch that I accept, and that's it. So that's successive accepts and rejects.
Now, the difficulty is: at the end of a phase, how do you decide whether in this phase you should accept someone or reject someone? What you do is compare how confident, in some sense, you are about the acceptance and about the rejection, and you do whichever is best for you. The analysis is harder than for M equals 1, but you see that in practice it really works -- that's the green line, and for the other experiment too it really works; it's really better. And Gap-E is a variant. All right. Is there any question on this? Can you get the screen up? Thanks. Okay. So is there any question on this? Yes?
>>: Why is the successive -- [indiscernible].
>> Sebastien Bubeck: So different from what?
>>: The other one ESR.
>> Sebastien Bubeck: Yes.
>>: When you try to accept the best guys, can you just treat it as the inverse problem, the [indiscernible] problem of rejecting? Now you have two versions. Just invert --
>> Sebastien Bubeck: Absolutely. The question is, where do you put the baseline? Where do you invert? You see what I mean? You say you want to take the negative, but when do you stop taking the negative? That's exactly the key point. Basically, when you want to find the M best, let's assume that mu_1 is larger than mu_2, et cetera, up to mu_K. Then the new gaps that I define are: if you are one of the M best, you look at the distance between mu_i and mu_{M+1}, the first guy not among the M best; and conversely, if you are not among the M best, you look at your distance to mu_M, the last guy who is one of the M best. So you have these gaps, and what the algorithm does is estimate them and then decide to accept or reject based on those estimated gaps. It's along the lines of what you just said. And the complexity -- what you can show in this case is that the complexity is the sum over the arms of 1 over these gaps squared. So maybe just one open problem, and maybe I'll spend ten minutes on the other topic.
What's important is that here we want to find the M best arms, but we put no structural assumption on the M best arms. What would be very interesting is the following more combinatorial problem. Assume that you have a graph G on K edges, and on each edge you have a probability distribution. Now what you want to find is, say, the best spanning tree. Let's say the spanning trees have a certain size M in this example. So we want to find a subset of size M of the K edges, but this subset also has to satisfy some structural property. More generally, we are given a collection C of subsets of {1, ..., K}, and what you want is to find the argmax over S in C of the sum over i in S of mu_i. So you want to find a subset within the combinatorial class C which maximizes the sum of the values inside it.
I think this is completely open, and I don't think there's a general theory -- I don't think there's a general algorithm that takes C as an input and then finds the best structure at the optimal rate. I think you need to think hard about the algorithm for every single problem. The first one is the best spanning tree; I don't know how to do it. You could try to think about finding the best matching: say you have a complete bipartite graph and you want to find the best perfect matching. I have no idea how to do that. You can go through the list of all combinatorial optimization problems and try to redo things in this stochastic framework, with this point of view: the point of view of optimal distribution-dependent rates.
>>: Is this known for the minimax rate, say for the spanning trees?
>> Sebastien Bubeck: No. The issue with minimax is that more or less you gain only a log factor; here you gain orders of magnitude -- you move from K over Delta^2 to H. But no, we don't know it for minimax either.
>>: But is it something like before, something on the spanning trees?
>> Sebastien Bubeck: It could be, but it's going to be much smaller than this. I mean, something trivial would be to view each spanning tree as an arm and then find the best one -- but that's exactly what you don't want to do. What you're saying is a trivial upper bound, and the key is that the answer is not going to be like this. One thing it could be is: how influential is this edge? Say you look at the best spanning tree that contains this edge and the best spanning tree that does not contain this edge; there's a gap between these two values, and maybe the complexity is a sum of one over these gaps squared. I don't think so -- I think there are nontrivial correlations between the edges. So I don't know what the answer is.
Okay. Now I want to quickly talk about something else. One nice thing about this theory is that I've talked with quite a few people about practical problems of discrete optimization, and often they can be cast within this framework. But sometimes they can't, and I will give you one example. Well, you have to think to get something. So I have a set X, a countable set that is known to me, and A is a subset of X. Think of X as the set of integers and A as the set of prime numbers. So A is a set of interesting elements, in some sense. But I don't know A. And what I want is to discover A: I want to discover as many elements of A as possible. Now I need to tell you how I access these sets. I access them through experts: I have experts, which are probability distributions mu_1 up to mu_K supported on X. And now I can play the same game as before: sequentially I make requests to these experts, and when I make a request to one of the probability distributions, I get a random variable drawn from it, and I observe whether or not I got an interesting element. So the game goes like this: choose I_t in {1, ..., K}, receive Y_t drawn from mu_{I_t}, and also observe whether or not Y_t is in A -- so I observe the indicator that Y_t is in A. And what I want, after N samples, is to look at F(N), the number of interesting items that I found. How many interesting items did I find? Well, I found the items Y_1 up to Y_N; that's the set of items I found. Which ones were interesting? The intersection with A. How many did I find? The cardinality of this: F(N) = |{Y_1, ..., Y_N} intersect A|. So now I want to maximize F(N).
>>: Don't you accept when --
>>: Except when I receive --
>>: You're told --
>>: Whether it's interesting or not.
>> Sebastien Bubeck: Yes, so that makes sense. Imagine I have distributions on the integers, and I don't know what the prime numbers are. I sample from one of the distributions, I get an integer, and then somebody tells me whether it's a prime or not.
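For the running example, a tiny sketch of the objective; `is_prime` stands in for whoever tells you whether an item is interesting.

```python
def is_prime(n):
    # Toy membership oracle for the running example: A = prime numbers.
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def f_of_n(observed, is_interesting=is_prime):
    # F(N): number of distinct interesting items among Y_1, ..., Y_N.
    return len({y for y in observed if is_interesting(y)})
```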
>>: [indiscernible] the scenario a little more -- why?
>> Sebastien Bubeck: Yes, now you want a concrete one. No, absolutely. I actually think there are many, but I don't know many yet; I know one. Imagine you have a big graph -- it's really enormous, say the electrical grid in the United States. You have nodes, and the nodes are going to be your elements of X. Now there are some nodes in this network which are faulty: there is something going on with those nodes, and you actually need to physically go there and fix them. Given a node, you can run a security analysis and test whether or not the node is faulty, and if it's faulty you can go there and fix it. But of course you cannot run the security analysis for every node in the network. So what you do is hire some engineers; they think hard about the problem, and what they come up with, maybe, are some kind of random walks on this network which are biased towards faulty nodes. Maybe they're very smart and they were able to do that. So you have K of these engineers, and each came up with their own heuristic: K random walks, or rather probability distributions on the network. And you have only one computer that can run the security analysis, so every day you need to choose one of the K engineers, run his or her heuristic, then run the security analysis, and move on to the next day. Okay, is this a good example? You could also have, I don't know -- you have computer code and you want to find all the bugs in your code, and you have different heuristics to find elements of the code that could be wrong, and you want to combine these heuristics in the best way.
>>: [indiscernible] problems, how do you apply different heuristics to show --
>> Sebastien Bubeck: Exactly. [indiscernible].
>>: [indiscernible] problem with gathering --
>> Sebastien Bubeck: Is it equal to the previous problem? I don't think so, because this one is much more dynamic. Look at what happens if mu_1 is a Dirac on an interesting item. This guy gives me an interesting item, but it gives it to me only once: when I come back to him, he always gives me the same interesting item, so it's not interesting to me anymore. This guy could be super good for one time step and then bad forever. So there's a dynamic aspect that was not in the previous framework.
>>: [indiscernible] the set -- multiset?
>> Sebastien Bubeck: No, it's a set, just a regular set. Meaning exactly that if you see the same interesting item twice, it doesn't count twice; it counts only once. That's where the -- if it weren't like that, it would be exactly like the previous problem. Because it counts only once.
>>: [indiscernible].
>> Sebastien Bubeck: Yes, exactly: discovery. We want to discover A. So we called it optimal discovery with expert advice. Now, what can you do? Well, imagine you knew the distributions mu_i; what would be a simple thing to do? For each distribution you could estimate the probability that it gives you an interesting item that you have not seen yet. So you can define these quantities M_i(t), the missing mass: the probability mass that mu_i puts on the set A minus everything observed so far. If I knew A and the mu_i, I could compute those things, and what I would do is just follow the argmax: go to the guy that maximizes this quantity. But I don't know the M_i(t). What I can do is estimate them. This is a famous problem: this is the Good-Turing estimator. Estimating the missing mass is something very famous; you can have an unbiased estimator of this guy and even concentration inequalities. And so the algorithm we did is: you sample the argmax of this estimate plus a confidence term, which is given by the deviations. I'm going to show you now some experiments that we did with this algorithm. Can I get the screen again? So the algorithm is sort of clear: for each expert we estimate the missing mass, and we add the confidence term, which is given by deviations that you can derive.
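Here is a hedged sketch of that strategy. The Good-Turing estimate for an expert is the fraction of its draws that are interesting items it has seen exactly once; the exact confidence width is an assumption (the talk only says it comes from deviation inequalities), as is the whole interface.

```python
import numpy as np
from collections import Counter

def discover(experts, is_interesting, N, c=1.0, rng=None):
    # experts: list of callables, each returning one random item from X.
    # is_interesting: membership oracle for the unknown set A.
    rng = np.random.default_rng(rng)
    K = len(experts)
    counts = [Counter() for _ in range(K)]       # per-expert item counts
    pulls = np.zeros(K)
    found = set()

    def query(i):
        x = experts[i]()
        counts[i][x] += 1
        pulls[i] += 1
        if is_interesting(x):
            found.add(x)

    for i in range(K):                           # initialize: one query each
        query(i)
    for t in range(K, N):
        # Good-Turing estimate of each expert's missing mass on A:
        # fraction of its draws that are interesting items seen exactly once.
        gt = np.array([
            sum(1 for x, n in counts[i].items() if n == 1 and x in found)
            / pulls[i]
            for i in range(K)
        ])
        ucb = gt + c * np.sqrt(np.log(t + 1) / pulls)   # assumed width
        query(int(np.argmax(ucb)))
    return found
```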
And so now this is an example.
So --
>>: The mu_i here, in order to --
>> Sebastien Bubeck: No. The M_i(t) was if I knew the distributions; but then I can do the Good-Turing estimator, which does not need to know the mu_i. What you do is, for each expert, estimate how many interesting items you have seen exactly once in its sequence, suitably normalized; that's the Good-Turing estimator. So I look at the problem where I have seven experts. The q_i are the proportions of interesting items for each distribution: q_1 equals 50 percent, q_2 equals 25 percent, et cetera. So expert one has probability mass 50 percent on the interesting items. The distributions are disjoint, and N is the overall size of the problem: each distribution is uniform on a set of size 128 in this plot, 500 in this plot, 1,000 in this one, and 10,000 in this one. So I have distributions which are uniform on sets of different sizes, and different proportions of the sets are interesting. And you see something is going on, right? You can see this convergence. What I plot is the number of interesting items that you found: this is time, and this is the number of interesting items found. The top curve is the oracle strategy that plays the arm maximizing the true missing mass; the second one is our algorithm; and this one is just uniform -- allocate requests uniformly. And you can see that the number of interesting items found by our strategy converges uniformly to the oracle's as the size of the problem grows. That's why we gave it the name of macroscopic limit: as the size of the problem grows, our strategy is uniformly optimal in time. So this is a completely different kind of limit from the multi-armed bandit problem: there, and in the problem before, I was looking at N going to infinity; here, for any fixed N, as the problem size goes to infinity, I'm optimal. So it's a very different kind of limit. This was for disjoint distributions; this other one is the example with integers and prime numbers, and it gives the same result -- we obtain the same theorem, et cetera, and I don't have time to explain.
And so these are just some references: the 2010 paper with the lower bound, the optimal discovery paper, and the book with Nicolo Cesa-Bianchi on bandit problems, if you want to know more about this. And that's it.
[applause]
>> Yuval Peres: Thank you. Questions?
>>: So just on the first part you described: you had explained to us the H over log K lower bound, and then you said you would remove the log K, but that was changing the problem. What about the problem with the permutations --
>> Sebastien Bubeck: No idea how to do it.
>>: So the rest [indiscernible].
>> Sebastien Bubeck: Yes.
>>: And the other bound, the upper bound from the --
>> Sebastien Bubeck: Yes.
>>: So that bound was for which problem -- or for both of them?
>> Sebastien Bubeck: Right. So the upper bounds assume nothing: they hold when you know nothing about the problem, and still you can adapt at the right rate. So the upper bound holds for both settings, if you want.
>>: And if you know it, does assuming more not lead to an improvement there?
>> Sebastien Bubeck: Well, what the theorem says is that at most it could give you an improvement by a log factor. But it could be; we don't know. It's an excellent question. Maybe with the log it's -- I don't think so. But it would be nice to have a better proof for the case where you just have permutations. The key trick was to change the problem so as to simplify the proof. That was the point. Yes?
>>: What about lower bounds -- here, are there distribution-dependent lower bounds? Can you say anything about the problem?
>> Sebastien Bubeck: The lower bounds are basically trivial: it's immediate to get the optimal one-over-square-root rate. There's nothing really interesting to do there. And that's why, despite the simplicity of the problem, it has not really been looked at in the past: from the simple minimax point of view it's not interesting. You have to do something other than minimax to make it interesting, basically. But people in practice have been looking at it, because in practice it's an interesting problem that people face.
>>: Does it not say much about little r_N, the story over there?
>> Sebastien Bubeck: No. Originally -- I can just tell you one thing, which is this lower bound, from before actually, from 2009: if you look at the simple regret -- this was the thing that could go to zero exponentially -- I can lower bound it, for any strategy, by exp(-c R_N), the exponential of minus a constant times the cumulative regret. The optimal cumulative regret is of order log N, so if you're optimal for the cumulative regret, then you have only a polynomial decay for the simple regret. This lower bound was the start of this work, because it says that minimizing the cumulative regret and minimizing the simple regret are completely different problems: if you optimize for the cumulative regret, you don't get optimal strategies for the simple regret. Yep.
>>: [indiscernible].
>> Sebastien Bubeck: Oh, yeah, yeah, sorry, sorry, yes, yes. So for the capital R_N, we know strategies for which capital R_N is smaller than some constant times log N.
>>: For small r_N.
>> Sebastien Bubeck: For small r_N and not e_N, is that your point? No, we don't know anything beyond applying those strategies. But I believe that those strategies are good for little r_N -- I think those strategies are good for the simple regret. I just don't know how to do a better analysis than the trivial one.
>>: Won't give you anything initially.
>> Sebastien Bubeck: No.
>>: [indiscernible].
>> Sebastien Bubeck: I think -- yes. But I don't know how to prove anything beyond trivialities. You can say trivialities, but anything beyond that I don't know. Thank you.