>>: It's a pleasure to introduce Professor Csaba Szepesvari, who is a professor at the University of Alberta. He has worked a lot on reinforcement learning, bandit models, and statistical learning theory. And today he's going to talk about sparse contextual bandit problems.
>> Csaba Szepesvari: Contextual or not, sparsity is going to be there: sparse context, very sparse.
So it's a pleasure to be here, and I'm going to be here for four months, so any time you can knock on my door or send me an e-mail; I will be very happy to talk to any of you.
This is joint work with my former student Yasin Abbasi-Yadkori, who is a postdoctoral fellow with [indiscernible] at the moment. And what I'm going to talk about is an online-to-confidence-set conversion and its application to sparse stochastic bandits.
And so here are the contents. First we're going to talk about linear bandits. How many of you are familiar with linear bandits? Okay, we will have some motivation, but maybe we can cut it a little bit short. So why should we care about sparsity, and how does sparsity come into the picture? And then I'm going to talk about a generic optimistic algorithm and how it actually reduces the problem of designing bandit algorithms for the stochastic setting to designing tight confidence sets for certain prediction problems, plus bounding the [indiscernible] regret. So that's the generic side. And then we're going to talk about how to construct these confidence sets, or confidence regions, for the underlying prediction problems. And here, if you want honest guarantees, as opposed to merely [indiscernible], we need what are called honest confidence sets, which means that if you declare that the true parameter is in your confidence set with a certain probability, it has to be there with that probability; this is not an asymptotic statement, so you need finite-sample results. So first I will define what this problem is, and then I'm going to argue that these confidence sets actually matter a lot from the perspective of obtaining practically good results. And then I'm going to jump into the main part of the talk, which is a new construction for confidence sets. We call it the online-to-confidence-set conversion. It takes an online algorithm for an adversarial setting, and it converts it into a confidence set. It's kind of a neat result. So for that, we're going to talk about the framework of online linear prediction, and then we just look at the conversion. And then I'm going to explain on a slide why it works; it's actually pretty simple. And then we talk about the application to linear bandits, and in particular to sparse bandits, which is the main topic, and look at the results.
Okay, so linear bandits. I don't have to explain this to you; this is actually from a paper coauthored by [indiscernible]. You want to recommend news articles to users coming to your website, and the news article could be something -- oops, the laser pointer doesn't work -- something like this. Or it could be about politics, or it could be about, you know, computer games, or it could be about the Olympics, or it could be about science, and so on and so forth. So there are zillions of articles the user could be interested in, and you want to put the article on the front page that matters the most to the user, and you're hoping to collect clicks. Okay, so here the goal is to maximize what's called the click-through rate. Okay.
So how does this work? You have a number of rounds, and users coming to your website. In round t, you're given a set of articles D_t. Given the set of articles, you have to choose one article, which is going to be denoted by X_t, to put on the front of the web page. And you're going to receive a reward of one if the user clicks on the article; otherwise, you don't receive a reward. And your goal is to maximize the total reward.
So how are we going to deal with large action sets? We have this article, for example; it's just one of many possible articles. You could have zillions of articles in your database. So how do you generalize to articles that you have never shown to the users? One of the possibilities is to work out some features for the articles, as we all know. So we can ask whether the article is about sports; in this case, it's yes. Whether it's about politics; in this case, it's no. Whether it's about the Olympics; in this case, it's yes. Whether it's about games; in this case, it's no. Whether it's about science; in this case, it's no. Whether this is a trending article or not; let's say it's a trending article, and so on and so forth. So you can tick off your features: these features give binary answers, and you can collect the answers, code them with numbers like zero-one values, and collect them into a vector that will be the feature vector underlying your article. And for every article you can do this, this way turning the problem into a problem that is defined over some vector space of dimension d; so d is the number of features. And then the probability of a click on x can be modeled by, let's say, a simple linear model: you just linearly combine the features of the article, and that gives the probability of the click. And I'm going to use this notation for the inner product of this θ vector and the feature vector x. The click is going to be a binary random variable, and the assumption is that this P(x), for some magical reason, lies between zero and one, and the probability of a click is just P(x). Okay.
So you can alternatively write this in the following way. If you define η as the difference between Y and P(x), then you can write that the click is equal to the inner product plus this η. Eta is just noise; if you take the expectation of η, it's zero mean right away.
Okay. Are we good? So far so good, right?
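A minimal sketch of the click model just described; the feature values and the coefficients of θ* below are made up for illustration, and in the real problem θ* is of course unknown:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary features for one article, in the order used in the
# talk: sports, politics, Olympics, games, science, trending.
x = np.array([1, 0, 1, 0, 0, 1], dtype=float)

# theta_star is unknown in practice; these values are invented so that
# P(x) = <x, theta_star> lands in [0, 1], as the model assumes.
theta_star = np.array([0.10, 0.05, 0.08, 0.02, 0.03, 0.15])

p_click = float(theta_star @ x)   # P(x) = <x, theta_star>
y = rng.binomial(1, p_click)      # observed click (0 or 1)
eta = y - p_click                 # zero-mean noise: Y = <x, theta_star> + eta
```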
So, more generally and more abstractly, linear bandit problems are defined over some vector space of, let's say, finite dimension d. In this case, let's assume it's the Euclidean space, and we have an unknown parameter vector that I'm going to denote by θ*. And the game is played in rounds. In round t, you receive a convex set D_t. Why can I assume that you receive a convex set? Because you can randomize: with randomization you can always achieve any point that is in the convex hull of the original set. Okay, and in terms of the expected reward, nothing changes if you introduce this randomization. So you're going to receive a convex set D_t, the set of articles. And then you have to choose an action in this convex set, and you're going to receive a reward, which is a random variable: the inner product of the action that you have chosen and the unknown parameter vector, with some noise confounding this inner product. And we're going to assume that this noise is zero-mean noise, by which we mean that if you take the expectation of the noise conditioned on everything that you have seen up to that point, like all the previous actions and all the previous clicks or rewards, then the expectation is just zero. And we're going to assume, of course, that the noise has controlled tails as well. And the goal is to maximize the total expected reward. Okay. So that's just the standard framework that you see for this [indiscernible].
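In symbols, the round-t protocol just described, as I reconstruct it:

```latex
% Round t of the stochastic linear bandit:
% receive a convex action set D_t in R^d, choose X_t in D_t, and observe
\[
  Y_t \;=\; \langle X_t, \theta_* \rangle + \eta_t,
  \qquad
  \mathbb{E}\bigl[\eta_t \mid X_1, Y_1, \dots, X_{t-1}, Y_{t-1}, X_t\bigr] = 0,
\]
% with the goal of maximizing the total expected reward.
```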
So what are linear bandits good for? You'll see it's a beautiful model, and it's also very general. It allows you to model dependence between the rewards of the arms, and it handles large action sets. You can model multi-armed bandits, even with many arms, with it. You can model a lot of other interesting settings, like computer games and interactive learning, and it has a bunch of applications: user interface optimization, which is what we started with, online product recommendation; even matrix prediction problems can be put into this form, which is kind of cool. And network routing and so forth; the list continues. There are tons of applications.
Okay.
So how do we evaluate a learner? The learner is evaluated by its regret. Since we allow the action set to change every timestamp, we have to be a little bit careful about how we define the regret. So if we started round t knowing the parameter, we would choose the action that maximizes the expected reward; no question about it, everyone would do that. And so the regret is how much we lose, in expectation if you wish, by not doing this. Some people call this pseudo-regret, because you're not comparing the actual rewards; you're comparing conditional expectations given the choices anyway. So here is the total reward that you're going to receive, and here is what you would have achieved if you knew θ* from the start.
Okay, so that defines the regret, and we want the regret to grow as slowly as possible. If you divide the regret by the number of rounds, we want this quantity, the average regret, to go to zero as fast as possible, because this would mean that the average reward that you could have obtained is close to the average reward that you actually obtained.
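In symbols, the pseudo-regret over n rounds, as I reconstruct it:

```latex
\[
  R_n \;=\; \sum_{t=1}^{n} \max_{x \in D_t} \langle x, \theta_* \rangle
        \;-\; \sum_{t=1}^{n} \langle X_t, \theta_* \rangle,
\]
% and sublinear R_n means the average regret R_n / n goes to zero.
```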
Yes?
>>: [inaudible]. Just 0-1 rewards?
>> Csaba Szepesvari: Right. Right. That's what it is, yeah. So, I mean, if you are interested in expectations, there is no difference. If you're interested in deviation bounds, like high-probability bounds, there is a slight difference between the two.
>>: [inaudible].
>> Csaba Szepesvari: I think the setting is just -- like, the problem definition, you mean?
>>: [inaudible]. In the past.
>> Csaba Szepesvari: It's like you have this unknown vector θ*, and the reward is the inner product of your choice with θ*, which is an unknown vector, plus noise. So if you take the expected reward -- this is the expected reward, right, because we said that the noise has a zero mean.
>>: Do you see that the construction of that D may also be generated by the --
>> Csaba Szepesvari: Yeah.
>>: [inaudible].
>> Csaba Szepesvari: No. First -- the first, yeah. So in the stochastic linear setting, this is the assumption. So if the model is not correct, then whatever I say here you can throw out of the window; you are going to have uncontrolled regret in that case anyway. But still you could study that case, right? So it's outside the scope of this. Okay. Good.
So we want the average regret to go to zero as fast as possible; put differently, we don't want the regret to grow faster than necessary. So when do we say that we have a sparse problem? Simply when θ* is sparse. The hope here is that sparsity is going to allow us to add many features without slowing down learning, right? So in this news article recommendation example, you're just ticking off these features, and you are hoping to capture everything that could possibly influence the reward. If you start to add too many features, it inevitably slows down learning, except, maybe, if many of the features are irrelevant to your predictions. Then their coefficients could be taken to be zero; the other features already capture those things. Then you don't have to be that careful about how you're constructing your features. So it's a good thing if you can exploit sparsity.
Okay. Note that there is a difference here between a sparse parameter vector and sparse features. You could use sparse features to design different algorithms, to adapt your algorithms; these are different problems, though related to each other. And, of course, if you have both of them, then you can look into the intersection, and that's interesting, but I'm not going to talk about that. I'm just going to talk about the sparse parameter vector case today.
Okay. So how do we play in stochastic linear bandits? One of the standard algorithms uses what's called the optimism in the face of uncertainty principle. And the way it works is that you maintain a high-probability confidence set for the unknown parameter vector. Okay, so this confidence set is a region in R^d that contains the unknown parameter vector θ* with high probability. And the algorithm is just one line, and that's the beauty of this thing: you take the joint maximizer of the inner product, where the first component is your action x and the second one is the parameter vector. The first component ranges over the action set, and the second component ranges over the confidence set that you have at that moment in time.
Okay. So this way you're looking at the best possible world. You are hoping to see the best possible world, and you're playing the best possible arm in the best possible way. So you're optimistic. Okay.
So the chosen parameter is going to be optimistic; we're going to call it an optimistic estimate of the unknown parameter vector [indiscernible]. And this principle goes back to at least Lai and Robbins, maybe even before that. And there are a lot of algorithms, like UCB1, which was proposed by [indiscernible]; that's a special case, and that's probably not the right reference, but anyway. It's a widely applied principle, and it's a very active area of research.
So you might worry about the implementation of this, but if you have a finite action set, then you can just enumerate all of the actions, and then for each of the actions you can compute its optimistic reward. If your confidence set has some nice shape, then that's going to be a convex optimization problem; you can do that, it's not a big deal. If the number of actions becomes very large, then you have to be a little bit more organized about this computation, but I'm going to sweep this under the rug and just move on. Okay.
But it's an interesting problem on its own how to actually compute these things. So this is just an illustration of what's going on. You have this confidence set C_t, and θ* must be inside it somewhere, and usually θ̂_t, which is an estimate of the unknown parameter, is the center of it. And the set usually has this ellipsoidal shape. And the optimistic estimate is going to be on the boundary, because you're maximizing a linear objective over this confidence set. Okay. So that's kind of the picture that you should have in mind.
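A minimal sketch of the optimistic choice for a finite action set, assuming the ellipsoidal confidence set {θ : ||θ - θ̂||_V ≤ β} just described; for such a set the inner maximization over θ has the standard closed form ⟨x, θ̂⟩ + β||x||_{V⁻¹}, and the function name here is made up:

```python
import numpy as np

def optimistic_action(actions, theta_hat, V, beta):
    """Pick argmax over x in D_t of max over theta in C_t of <x, theta>,
    for the ellipsoid C_t = {theta : ||theta - theta_hat||_V <= beta}.
    The inner max equals <x, theta_hat> + beta * ||x||_{V^{-1}}."""
    V_inv = np.linalg.inv(V)
    ucb = [x @ theta_hat + beta * np.sqrt(x @ V_inv @ x) for x in actions]
    return actions[int(np.argmax(ucb))]
```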
And so why does it work? You can actually analyze the immediate regret of this algorithm, and it goes as follows. In round t, you're choosing this action, whereas you should have chosen that action, okay? So this is your pseudo-reward for that round, and that is the best possible pseudo-reward. So you have to use θ*, the unknown parameter vector, when computing the regret, and this can be bounded by this quantity, because we're guaranteed that θ* belongs to the confidence set. And, of course, this belongs to the decision set, and this pair is maximizing the inner product over the cross product of the decision set and the confidence set, so this is at least as large as the other quantity. So this step follows because of the optimistic choice. And if you rewrite this by linearity, you can introduce some norm and some dual norm; you can just use Hölder's inequality, and you get this inequality.
So what you see there is that you have the freedom to choose any norm, and the regret at time t is going to depend on the dual norm of the action vector that you chose. So if the decision set D_t has some nice shape, is nicely bounded, then this is going to be a bounded quantity, right? The other quantity is more interesting: the difference between θ* and your optimistic estimate of θ*. If that difference cannot be big, then you will be in good shape, okay? So you want to choose this norm in such a way that you can show that this difference shrinks.
Okay, so that's the whole idea. In some sense, this difference is going to measure the size of the confidence set, and you can think about it as the confidence width. And you can use a different norm in every timestamp, and most of the time you would do that.
Okay. So from this you can derive a bound. I'm going to show some regret bounds later, but this is just the basic idea of how this goes.
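The chain of inequalities being described, as I reconstruct it, where x_t* is the optimal action of round t, θ̃_t the optimistic parameter, and ||·||, ||·||_* a pair of dual norms:

```latex
\[
  r_t \;=\; \langle x_t^*, \theta_* \rangle - \langle X_t, \theta_* \rangle
  \;\le\; \langle X_t, \tilde\theta_t \rangle - \langle X_t, \theta_* \rangle
  \;=\; \langle X_t, \tilde\theta_t - \theta_* \rangle
  \;\le\; \|X_t\| \,\bigl\|\tilde\theta_t - \theta_*\bigr\|_* ,
\]
% where the first inequality uses optimism: (X_t, \tilde\theta_t) jointly
% maximizes the inner product over D_t x C_t, which contains (x_t^*, \theta_*).
```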
And so why optimism? Could you get away without optimism? Well, actually, you can. So this was the previous algorithm, but now you can choose any θ_t that is an element of the confidence set C_t. Then you choose the maximizing action with respect to your chosen parameter vector, and with the same calculation you get this other inequality. This other inequality looks very much like the previous inequality except for the appearance of this extra term. So what you see is that what the optimistic choice buys you is that you don't have to analyze this term, and this term might actually be large or small; the optimistic choice just makes it vanish. So that's what you gain. But if you are really worried about how expensive this optimistic choice is, how expensive it is to compute, maybe you go with some other choice and just analyze that; for finite bandits you can actually analyze it like that. Okay, so far so good. So we see that it's -- yeah?
>>: So could you go back? So if you take an [indiscernible] from C, that inequality holds, but it doesn't necessarily have to in terms of [indiscernible].
>> Csaba Szepesvari: That's right. If you cannot show that you're controlling this term -- this is a new term -- it doesn't help you, right? So you can control this term as before, and this term is as before; this is the only difference, that's the new term. And if you think about finite bandits, in finite bandits you can actually control that term. For other bandits, I don't know. It's a meaningful term, actually; it's an interesting question whether you can control it.
>>: Just to make sure I understand: ⟨x, θ*⟩ is a constant; it doesn't change across the rounds. But the norm -- the dual norm -- is something that could change?
>> Csaba Szepesvari: Yeah. Yeah. It can change from round to round.
>>: So if you have the same -- [indiscernible].
>> Csaba Szepesvari: It's because I allow the action set to change with time. So in my model, every timestamp you receive a different set D_t, which is a convex set in the space. So the set of articles in every round is different.
>>: Okay.
>> Csaba Szepesvari: It's coming from there. This way you actually can have the model [indiscernible]. But if you didn't have a changing decision set, this index would go away; you would have a fixed norm there, and this would actually measure how confident you are about the reward of the optimal arm. If you could show that it shrinks fast enough, then you would be done.
Okay. So where do the confidence sets come from? We see that it's all about the design of the confidence sets: once you have good confidence sets, then you have good regret. And since we hope that the algorithm is not going to have much regret, with high probability, it's better to go with honest confidence sets in this case; maybe even be a little bit conservative, I would suggest. But anyway, let's be honest. So how do we design these honest confidence sets?
If you look at the problem in a general way, you have this linear prediction problem. So there are covariates x_1, ..., x_n and responses y_1, ..., y_n. Those are the covariates you happened to choose in the bandit case, but more generally they could come from any sequential procedure. Then you assume that the response is a linear function of the covariates, where θ* is an unknown vector, plus some noise; and here I'm being more specific about what I assume about the noise. You could assume that the noise is subgaussian with some constant. Subgaussian means that the tails of the noise fall off as fast as those of the Gaussian. It's a reasonable assumption under many circumstances.
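In symbols, the standard conditional subgaussianity condition being referred to, as I reconstruct it:

```latex
% Conditionally R-subgaussian noise: for all real \lambda,
\[
  \mathbb{E}\!\left[ e^{\lambda \eta_t} \,\middle|\, x_{1:t},\, \eta_{1:t-1} \right]
  \;\le\; \exp\!\left( \frac{\lambda^2 R^2}{2} \right),
\]
% which forces zero mean and Gaussian-like tails; noise bounded in
% [-R, R] is a special case.
```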
And so here, as I said, x_t is often chosen based on the [indiscernible]. So there are different problems associated with this setting: one is just estimating θ*, and the other is to construct a confidence set that contains θ* with high probability. We're going to be interested in the second problem. So that's the definition of subgaussianity, and I'm just arguing that subgaussianity is something you're familiar with even if you think you are not. Okay. So what do we mean by honest confidence sets? We mean a random set in R^d such that, given some δ, which is your confidence parameter, the unknown parameter vector lies in the confidence set with probability at least one minus δ. Yes?
>>: I thought you assumed the noise is bounded, because it [indiscernible].
>> Csaba Szepesvari: It could be bounded, yeah. That's a special case of subgaussianity.
>>: You find it to be a possibility?
>> Csaba Szepesvari: That was for motivation, and then we generalize. So I'm moving to something more general, slightly more general. If you have a bandit like this, then for certain bandit cases it's [indiscernible], certainly. Okay. So we want this confidence set.
>>: The probability here -- what exactly is the measurable space?
>> Csaba Szepesvari: You fix a measurable space, and then you have all these random variables that are supported on this measurable space. So, I'm sorry, I didn't say: x_1, ..., x_n and y_1, ..., y_n are random variables; they're supported on this measurable space.
>>: It's not a product space, right?
>> Csaba Szepesvari: No, it's not. These can be totally correlated.
>>: Right.
>> Csaba Szepesvari: It's like -- these can be totally correlated. You could construct the space, and all these conditions you could contract.
>>: But unlike regression, where you assume this, it's not a --
>> Csaba Szepesvari: This is not a random design, nor a fixed design with [indiscernible] covariates. So it's a little bit harder, if you want. But, actually, it happens to be not much harder to analyze. It's a little bit harder, but you actually get the same results. Okay. So this is what we want.
And this is just a picture showing what we want. So one approach is to design confidence sets based on [indiscernible]. You have your data, and you stack it into these matrices, the X matrix with the covariates as rows and the vector of responses, and then ridge regression with this positive ridge parameter produces this estimate. This is going to be the center of the confidence set, and you proceed as usual: you define the Gram matrix, which is slightly adjusted to take into account the ridge parameter. If you do that, then with some work you can prove that the following set is a confidence set. And here what's important to remark is that the result is pretty cool, because it holds for every timestamp, so uniformly in time, not just for a fixed horizon. You avoid the union bound basically by using a martingale argument; it's learned from [indiscernible]. It's a stopping-time argument, yes?
>>: So in some ways, the dimensions of the matrix don't seem to match.
>> Csaba Szepesvari: Well, it's -- yeah. It should be -- that should be [indiscernible], I guess. Obviously that thing is divided.
>>: Divided. [indiscernible] shared space.
>> Csaba Szepesvari: Okay, thank you. Yeah? Okay.
So that's the coolest thing about this, and the proof is based on the method of mixtures, which goes back to [indiscernible] in the '70s, and it's kind of like using a [indiscernible] technique to actually prove this thing. It's kind of a cool technique. Anyway, so this is a confidence set. Okay. So what you see here is that this is basically the distance that you're going to use later on to derive the bounds. It's the Gram-matrix norm. So the set is going to have much smaller radii in the directions where you see a lot of covariates; in the other directions it's going to be very large, potentially really large, and this is kind of the radius. Okay. So here this norm is this matrix norm. Okay.
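A minimal sketch of the ridge-regression construction just described; the true radius β comes from the self-normalized bound and involves the log-determinant of the Gram matrix and log(1/δ), so here it is left as an input, and the function name is made up:

```python
import numpy as np

def ridge_confidence_ellipsoid(X, Y, lam, beta):
    """Sketch of the ridge-based confidence set: center and Gram matrix.
    X: n x d matrix of covariates; Y: n-vector of responses; lam > 0 the
    ridge parameter; beta the radius (its exact formula is omitted)."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X            # regularized Gram matrix
    theta_hat = np.linalg.solve(V, X.T @ Y)  # ridge estimate, the center
    def contains(theta):
        # theta is in the set iff ||theta - theta_hat||_V <= beta
        diff = theta - theta_hat
        return float(diff @ V @ diff) <= beta ** 2
    return theta_hat, V, contains
```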
So this also carries over to RKHSs, if you care about RKHSs or Gaussian processes. So that's cool. But is this a good bound? Of course, there's been tons of work in the literature on producing similar bounds. For example, [indiscernible]'s paper proved this bound, or showed that this is a confidence set, and the difference you see here is that this one is for a fixed time. So if you want to get uniform-in-time behavior, then it's pretty common that you take a union bound, and here you see the cost of that union bound; that's probably a covering argument. And here we are avoiding it, because we use this method of mixtures that kind of integrates, and then you use the stopping-time argument to get rid of the union bound. So that's kind of neat. The determinant of the Gram matrix could be as large as -- well, it could be as large as this, but you are basically shaving off one of those terms.
And [indiscernible] proved a similar bound. You can see that qualitatively these two bounds are similar; our bound looks different, it kind of adapts. This one also adapts to the Gram matrix, which you hope is a good thing. Whether it's a good thing or not is not so easy to see. So what you can do is -- well, first, you can derive a regret bound for the underlying optimistic algorithm. Yes?
>>: So you say that your bound is derived using ridge regression. This is a regression-based bound. They're the same.
>> Csaba Szepesvari: That's basically the same. The algorithm is different. Yeah, I think they're the same, yeah.
>>: The assumptions are different -- the assumption on x_t.
>> Csaba Szepesvari: No. No. No. No. These guys are -- okay. So here I was too lazy to check how to extend our bound when these norms are different, but we have a general result. Like, you know, S is a bound on the two-norm of the unknown parameter; I was just too lazy to calculate how S would come into this bound, too. And the same holds here: here they have it, or I had it, but -- okay. That's the only difference. But you could take the common case when everything is smaller than one. So then S equals one, and then probably that's the only thing missing from the bound, I guess. Yeah.
Okay. So you could compare these by running some experiments. But before we go there, if you just plug these bounds into the previous regret bounds for these specific algorithms, you get these regret bounds for stochastic linear bandits, and you see where the dependence on d shows up; it's going to be linear in d. In the worst case you can show that the d has to be there. So a natural question is whether these new confidence sets are any tighter than the previous ones, or whether we are just going around in circles. So I'm going to show some experimental results. Here on the picture, what you see is the regret, and this is time, and this is the bound based on [indiscernible]'s paper. And what you can see is that the regret is not starting to curve; it's kind of like almost linearly growing. If you take our confidence set, the regret is more gentle. It's much better behavior, kind of curving like root t, which is what it should do. On the picture I'm also showing that you can modify these algorithms to rarely switch arms, and the regret is not going to change by much if you do that. So you can save a lot of computation steps, exponentially many, because it's enough to recompute everything [indiscernible]. Okay. So these are ways to construct confidence sets, right?
These confidence sets, you see, are really important; they're crucial for achieving good results. Can we get tighter results in particular cases? That's where sparsity, our main subject, comes in. So if you only care about the prediction error in the linear regression setting, then what you know is that if there is no sparsity, the prediction error is going to depend on the dimension d, and if you have sparsity p within the d dimensions, then the prediction error -- you can replace d by p log d, which is pretty cool. So you can play with very large d and smallish p. But we also know from the literature that the least-squares estimator won't cut it, so you have to look for something else, you know.
So how do we design confidence sets under sparsity? The idea is to do a conversion; it's a reduction from one problem to another. If you can solve online linear prediction with squared loss with small regret in the sparse case, then we claim that you're going to have very, very good confidence sets for the sparse case, and that should be enough. So the idea is to create confidence sets based on how well you do in online linear prediction, and it's pretty cool, because whenever you improve something on the prediction side in this online setting, then you're going to improve your confidence sets automatically. Okay. And hopefully it will give you good bounds for the sparse case.
>>: I'm confused. I thought the goal was linear prediction. The covariates are just [indiscernible] to the prediction. Now you're --
>> Csaba Szepesvari: Yeah, I'm going back and forth. It seems that I'm going in circles, yeah, but not exactly. So now the linear prediction problem is going to be made tougher, because it's going to be adversarial. And why adversarial? The reason is that if it's a bandit algorithm which is pulling these arms, then you don't really control -- like, you don't really control how much the arms are spread out. They happen the way they happen to be. So you have this prediction problem where the arms could be very correlated. So now you just assume, okay, let's say there is a sparse predictor, and you have an algorithm which has a low regret against it. Then you can turn that back into a confidence set. So it's a bit of a leap, but you will see, it's going to work out.
Okay. So what is online linear prediction? What do I call online linear prediction? It's not anything that we talked about before. This is the sequential worst-case framework. In round t, you receive x_t, you need to produce a prediction, which is just a real number, and then you receive the correct label, or the correct response, y_t, and you suffer the squared loss. Okay, that's the whole thing. We didn't say anything about whether there is any statistical relationship between the x_t's and the y_t's or so forth; there's no statistical assumption here. Okay, this is a worst-case framework. And you want to compete with the best predictor in hindsight, because there is no other thing you could do, right? Because there are no statistical assumptions here.
So there are algorithms for this problem: all kinds of gradient descent algorithms would be one example, online least squares, exponentiated gradient, and so on. There is an algorithm that probably not many people know about, but it's actually a simple adaptation of the continuous exponentially weighted average algorithm to this setting. It's just continuous exponentiated weights; the [indiscernible] is for "sequential," and the rest is for the sparsity -- that's sparse exponentiated weights. So, basically, you use an exponentiated weights algorithm with a prior which prefers sparsity. Okay. So that's online linear prediction. So what's the regret?
So the regret -- I'm going to use a different letter to denote this quantity. The regret against a parameter vector θ is the total loss that you suffered minus the total loss you would have suffered had you used this parameter vector in every timestamp to make your predictions. But there is no requirement that you stick to some parameter vector θ for your predictions; you don't have to do that, you can produce the predictions in many ways. Okay. So all of these prediction algorithms that we talked about come with some regret bounds. So what's a regret bound? The guarantee is that, no matter how the data is selected -- the x's and the y's -- the algorithm is guaranteed to have a regret that is below some quantity B_n, which you can either compute from the data or from prior information, which could be things like how big the vectors can be, how big the y_t's can be, and so on and so forth.
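In symbols, the regret against a fixed comparator θ and the bound B_n just described, as I reconstruct them:

```latex
% Online prediction regret under squared loss, with \hat y_t the
% algorithm's prediction in round t:
\[
  \rho_n(\theta)
  \;=\; \sum_{t=1}^{n} (y_t - \hat y_t)^2
  \;-\; \sum_{t=1}^{n} \bigl(y_t - \langle x_t, \theta \rangle\bigr)^2
  \;\le\; B_n .
\]
```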
So there are different kinds of bounds in the literature. There are bounds that people traditionally study -- how fast you can learn -- which only depend on the magnitudes of the things that are coming in. And there are bounds that depend on the data; so, data-dependent bounds. For us, data-dependent bounds are going to be a little bit better, because we are going to use these bounds to actually design our confidence sets, so a tighter regret bound gives a tighter confidence set. The only thing that is required is that you come up with some algorithm, you have a regret bound, and the regret bound can be computed based on known quantities. That's it. And then we turn the wheels, and it will spit out the confidence set.
Okay. So typical results in the literature show that the regret is of size log n or root n: if you exploit the curvature of the loss, then you can work a little bit harder and get the log n type of regret bound; if you kind of throw away that information, you typically end up getting root-n regret.
Okay. So before we do all of this, let's look at a reduction that was done previously by other people. The question here is whether small regret implies small risk. The result is kind of cool, and this reduction makes sense. So let's say you have a statistical problem where you have i.i.d. data, okay, and you want to do linear prediction. So you have x_t and y_t i.i.d., and you assume that y_t is linearly related to x_t, and you choose an online learning algorithm A that produces a sequence of estimates and uses them to make its predictions. So this algorithm is special, because the predictions are based on parameter vectors; not all algorithms do that. And you have a regret bound. What can we say about how well we can predict, like how small a risk can be achieved?
Okay. So what's the risk of a vector θ? The risk of a vector θ is the expected squared error, where the expectation is over the joint distribution of x and y. So these guys proved that if you take the average of the vectors produced by this online learning algorithm, then for any δ, with probability one minus δ, the risk of this average vector is going to be bounded as follows, where the main term is B_n divided by n. So if you had an algorithm that has a regret of log n, then you have a log n divided by n risk. So it's like turning the online algorithm into an algorithm that has a small risk. So online learning is powerful. So that's kind of where we started.
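A minimal sketch of that online-to-batch averaging; the `algo.update` interface is hypothetical, standing in for any online learner that maintains a parameter vector θ_t:

```python
import numpy as np

def online_to_batch(algo, data):
    """Run an online learner over (x, y) pairs and return the average of
    its parameter iterates. `algo.update(x, y)` is assumed to return the
    iterate theta_t the learner used to predict <x_t, theta_t>."""
    thetas = [algo.update(x, y) for x, y in data]
    # With regret bound B_n, the averaged iterate has risk of order
    # B_n / n (plus lower-order terms) with high probability.
    return np.mean(thetas, axis=0)
```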
So how do we use an online algorithm to produce a confidence set? That's done here. So we have this data, where we don't make the i.i.d. assumption; the x_t's can be sequentially generated, but y_t is related to x_t as before, in a linear manner. And we have an algorithm A, an online algorithm, that we're going to feed with this data, x_t and y_t, and it produces the prediction ŷ_t. Okay. And it comes with a regret bound, okay, against this unknown parameter vector θ*; the regret bound is B_n. Then you can show that the following set is a high-probability confidence set, and in fact a uniform high-probability confidence set for all times.
So what's in the set? It's kind of an ellipsoid-shaped set again. You have this quadratic quantity, and you say that the quadratic quantity cannot be larger than this. So you take all the parameter vectors for which this quadratic quantity is below this other quantity. And you can see that if you expanded the quadratic, the Gram matrix would come in, and it would have the exact same flavor as the previous confidence sets.
Okay. So the shape is not surprising; it's exactly the same shape. What's different here is that the radius-type quantity depends on the regret of the algorithm. So if you design an algorithm with a smaller regret, your B_n goes down, and then you have a much smaller confidence set. And if you worry that this is too big -- it's actually not too big, because here you're summing n terms, while the confidence level is a constant. So if B_n is o(n), the radius is o(n), which means the confidence set is effectively shrinking. Okay.
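The shape of the conversion's confidence set, as I reconstruct it; the exact radius β_n(δ) in the paper is an explicit function of B_n, the subgaussianity constant R, and δ, so the O(·) form below is only indicative:

```latex
\[
  C_n \;=\; \Bigl\{ \theta \in \mathbb{R}^d \;:\;
    \sum_{t=1}^{n} \bigl( \hat y_t - \langle x_t, \theta \rangle \bigr)^2
    \;\le\; \beta_n(\delta) \Bigr\},
  \qquad
  \beta_n(\delta) \;=\; O\!\Bigl( B_n + R^2 \log\tfrac{1}{\delta} + R^2 \log(1 + B_n) \Bigr).
\]
```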
>>: Why are you using the ŷ_t's, the predictions, in defining the confidence set?
>> Csaba Szepesvari: Because that's how we can do it.
>>: But what you actually observe is the y_t's.
>> Csaba Szepesvari: Well, you have this algorithm A that produces those predictions, so you actually do observe those, too, right? So you take this algorithm for this online setting, it produces these predictions, and then you build your set. The idea is that if that algorithm is so good at predicting things, we can use what it predicts. So it's kind of like a filtered version, if you want, of y_t; you're reducing the noise by using the predictions of the algorithm. It looks a little bit dangerous, because it's not the actual data, but you have to trust the algorithm. The algorithm actually uses y_t, so you're indirectly using y_t. Okay. So that's how it works.
And the proof -- it's just one slide. So you take the definition of the regret: it compares your loss with the loss that you could have achieved if you had used θ* in every timestamp, and that is bounded by B_n. You have this quadratic expression; you do the expansion, you do the algebra, and from that regret bound you derive this. And you see that this quantity is what appears in the confidence set. You have the B_n, and you have this other guy, and that guy is just a martingale, okay? Because here you have the noise multiplying these other things, which are measurable with respect to the past. So you have to analyze the martingale, and as soon as you're done with that, you're done. So that's it. Okay.
So you use standard techniques to analyze the martingale, and then you get your confidence set. It's a little involved; it's five pages. Okay.
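The expansion being described, as I reconstruct it: substitute y_t = ⟨x_t, θ*⟩ + η_t into the regret bound ρ_n(θ*) ≤ B_n to get

```latex
\[
  \sum_{t=1}^{n} \bigl( \hat y_t - \langle x_t, \theta_* \rangle \bigr)^2
  \;\le\; B_n \;+\; 2 \sum_{t=1}^{n} \eta_t \bigl( \hat y_t - \langle x_t, \theta_* \rangle \bigr).
\]
% The last sum is a martingale, since \hat y_t depends only on the past
% and on x_t while \eta_t is conditionally zero mean; controlling it with
% standard self-normalized tail bounds yields the confidence set.
```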
So if we combine things, how can we get good confidence sets for the sparse case? Well, we need an algorithm that achieves small regret in the sparse case -- an algorithm that achieves sublinear regret, because if your regret is linear, then your confidence set would be just too large. So you have to have an algorithm that achieves sublinear regret and which adapts to, or exploits, the sparsity of the unknown parameter vector.
And this algorithm by Gerchinovitz, which was published in 2011 and was based on earlier ideas of [indiscernible], actually does it, and it's a fairly simple-to-describe algorithm. The idea is that you have the parameter space, and that's the [indiscernible]. You put a prior on it, which is going to be a distribution that prefers sparsity, and then you basically run continuous exponentiated weights on it. That means you compute the posterior -- the data is treated as if it were Gaussian -- or you compute the expected prediction based on the posterior; that's kind of the same thing. And they showed, with a not too difficult analysis, that the regret of the algorithm actually scales with p log n, where p is the sparsity of the vector θ* and n is the number of rounds [indiscernible]. And so, as a corollary, you get this confidence set; it's all just plugging into the previous inequalities. So you will see that p appears there and log d appears there, but d itself doesn't appear.
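A crude Monte Carlo sketch of the mechanism of exponentiated weights with a sparsity-preferring prior; this is an illustration of the idea only, not Gerchinovitz's actual algorithm, prior, or tuning:

```python
import numpy as np

rng = np.random.default_rng(0)

# Particle approximation of continuous exponentiated weights with a
# heavy-tailed, zero-peaked prior, which prefers sparse parameters.
M, d, eta = 10_000, 20, 0.5
thetas = rng.laplace(scale=0.1, size=(M, d))  # particles drawn from the prior
log_w = np.zeros(M)                            # running log-weights

def predict(x):
    w = np.exp(log_w - log_w.max())            # normalize for stability
    return float(w @ (thetas @ x)) / w.sum()   # posterior-mean prediction

def update(x, y):
    # Exponential weighting by the squared loss of each particle.
    log_w[:] = log_w - eta * (thetas @ x - y) ** 2
```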
So if you apply this to bandits, what do we get? There is a very general statement: because of this reduction, the generality is that you can take any online prediction algorithm with a regret of B_n, and then you get this regret for the stochastic linear bandit problem. So you can see that root d and root n appear, and B_n appears under the square root as well. This might be a little bit worrisome, but we are going to come back to that.
So if you plug in the regret bound of the sparse algorithm, you will see that your regret is going to scale with the root of p d n. Previously it was d times root n, so d gets replaced by root of p d. It's probably not what everyone was hoping for, which is that you could get totally rid of the dependence on d, so that the regret would depend on d only through polylog terms. So that didn't happen.
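The regret shapes being compared, up to log factors and as I reconstruct them:

```latex
\[
  R_n \;=\; \tilde O\!\bigl(\sqrt{d\, B_n\, n}\bigr)
  \quad \text{in general},
  \qquad
  B_n = \tilde O(p) \;\Longrightarrow\;
  R_n \;=\; \tilde O\!\bigl(\sqrt{p\, d\, n}\bigr),
\]
% versus \tilde O(d \sqrt{n}) for the non-sparse construction.
```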
So can we do better? Actually, the answer, unfortunately, is that under these conditions you can't do any better. So there is a lower bound, which is based on another paper where they studied bandits in the adversarial setting, and the lower bound goes as follows. You take this decision set, which is basically all the unit vectors with a one as the first component, and you pick some epsilon, which happens to be square root of d over n. And the unknown parameter vector is going to be any of the following d parameter vectors: the first component is fixed to 0.5, and you have a plus or minus epsilon on the [indiscernible] component. So basically the sparsity here is two. And if you want to play this game well, you have to guess where the epsilon is -- you have to find where the epsilon is positive -- if you want reward. And the standard result goes that if you take the stochastic bandit problem where Y is the inner product of the vector chosen with the parameter vector, plus noise, then the regret of any bandit algorithm on this problem is at least root dn. Okay. So what you see is that the root d is never going to go away. It's never going to go away, okay, no matter what sparsity you're talking about -- and this is a very sparse problem.
Okay. So there is a result which works with an alternative noise model, where the noise perturbs the parameter in every timestamp, and there is a strong assumption that this has to be i.i.d. noise, meaning the components are uncorrelated, or independent of each other, and they are zero mean. And if you do this, and let's say your decision set is [indiscernible], so it has large vectors in it in all kinds of directions, then there is an algorithm that achieves a better regret in the end. So this result is due to [indiscernible]; it's concurrent to our result. So if you're willing to make stronger assumptions, then you can improve the situation, but whether this set of assumptions, or some other set of assumptions, is going to be the one that you care about is another question.
Okay. So back to our results: do we actually improve empirically? Well, we generated an artificial problem where you have 200 dimensions and the sparsity is ten, so only ten coordinates are nonzero; that shows the actual setting. And you have some noise. And what you see is that if you don't apply this reduction result and just apply least squares, then the regret is not going to flatten out any time soon; and if you apply this reduction result, then you get a much gentler behavior on this problem. So you gain something.
So, in summary, we took on sparse stochastic bandits, and the main tool is this online-to-confidence-set conversion. And we think that this is the first confidence set for sparse linear prediction under general conditions. We got good empirical results.
There are other results; I didn't talk about the Yahoo! news article recommendation benchmark that [indiscernible] devised a while ago. In terms of future work, currently I'm looking at designs for other problems, like matrix prediction, where you can just [indiscernible] the framework, and it seems like every step goes through; so there doesn't seem to be any major difficulty there.
And one of the challenging questions is whether you can adapt to unknown sparsity. We don't know the answer to this, but if you asked me, I would say probably no, unfortunately. Then there are interesting questions like, when the action set has a few extreme points, then [indiscernible]; can you design algorithms that take into account things like that and achieve the best of both results in all cases?
And this algorithm that I talked about, the one based on continuous exponentiated weights, is pretty expensive. The authors say they can run it for tens of thousands of dimensions, and it requires some approximate computations. So the question is whether you can get away with cheaper algorithms. In the experiments, where we used this approach, we also used exponentiated gradient, which is a cheaper algorithm that comes with a worse regret bound; but we were careful to use a better, data-dependent regret bound, and that's where we were winning. So the question is whether other algorithms, you know, exploiting the structure of stochastic bandits, would cut it. I don't know.
And, lastly, the question is whether there is a tradeoff between computation and statistical adaptation to the data from which you are learning. These days, a lot of results appear in this direction, and, of course, you can ask the same question here. You can relate these problems to other problems that we are starting to have some knowledge about, and then you can hope to be able to study this problem in a formal fashion. So far, what we have observed is that you can have cheaper algorithms, like [indiscernible] algorithms, which come with worse regret bounds, so your regret seems to go up if you use an algorithm which is cheaper to run; and if you have a more expensive algorithm, which maybe doesn't even admit nice complexity bounds, you can get much better results. So the question is whether this tradeoff is real.
All right. So thank you.
>>: Is this on? Okay. The confidence set for this -- so how did you do this when you choose the [indiscernible]?
>> Csaba Szepesvari: Yes. So, like I said, you enumerate each of the actions. Then for each of the actions, you need to find the θ vector that maximizes the value of the action, okay -- the θ vector that lies inside this confidence set. So that's a quadratic [indiscernible] problem, not a hard optimization problem; you can solve it in closed form. It entails inverting a matrix: you can do an SVD, and once you've done an SVD, then you're done. Okay. Any questions?
[applause]