>> Ofer Dekel: Okay. So let's get started. It's our pleasure today to have
Karthik Sridharan visiting us. He's now from U Penn and he was in TTI before
and he was our intern a couple years ago in the summer so we know him very well
and like him very much. And this is going to be Relax and Randomize, a recipe
for online learning algorithms. So thank you, Karthik.
>> Karthik Sridharan: Thank you. Thanks for the introduction. All right. So
I'm going to basically talk about online learning. What does that mean? We
have basically time going from 1 to T. The learner picks a mixed distribution over
a set of actions, script F. The adversary picks an action, ZT, belonging to some
set script Z. The learner draws his action from the distribution that he
picked, and he suffers the loss L of FT, ZT.
So and what we are interested in is minimizing this notion of regret. So what
does that mean? It means that we are interested in [indiscernible], because we
are playing mixed strategies. So the idea is you have your cumulative loss and
you want it to be as good as the best single action in hindsight.
So if I had known Z1 through Z2 -- sorry. If I had known Z1 through ZT in
advance, then I could have picked the best action, F, and I don't want to have
too much regret with respect to that. And so this is the basic framework that
we're going to be looking at.
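For reference, the regret being described can be written as follows; this is just a standard rendering of the spoken definition, with f_t the learner's draw and z_t the adversary's move at round t:

```latex
\mathrm{Reg}_T \;=\; \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t)
```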
So we are interested in how well we can do in this scenario, and can we come up
with some kind of generic tool for dealing with online learning.
So the outline of the talk is this. So first, we're going to look at some
nonconstructive -- by that I mean I'm not going to give you algorithms, but
we're going to look at how well we can do in these online learning games. So
it's coming from Minimax Analysis. I'll describe what that means in a bit.
So basically, we're going to look at the notion of the Minimax value of the
online learning game. That is, what's the best I can do versus the worst or
the [indiscernible] trying to hurt me the most.
And from this, what we're going to do is we're going to look at some notions of
sequential complexities for online learning that I'll be defining. So this
part is, well, it's non-constructive, basically kind of mirrors what happened
in empirical process theory and statistical machine learning.
So a lot of empirical process theory is used for giving complexity tools for
statistical learning theory. And we're going to kind of develop analogous
tools for online learning. Of course, as I mentioned, the first part is going
to be nonconstructive.
So we're also interested in how can we get algorithms for these. So the next
part we're going to look at relaxations and algorithms. I'll define what
relaxation is more later. But the basic idea is we basically want to take
these nonconstructive proofs that we had out here for getting upper bounds on
the value and we want to convert them into algorithms. We want to develop
algorithms from these. And hopefully, efficient algorithms for at least
certain cases.
And I'll also talk about random play-out, what this means and how it can be
used for analysis of things like follow the [indiscernible] leader, how
basically you can look at things like follow the [indiscernible] leader as
coming from a Minimax type analysis, all the way down to when you relax, you
can get this.
And I'll look at some -- I probably won't touch upon many of the examples.
I'll mention them, but I won't have time to go into details. So all of this
kind of comes from Minimax analysis, and all of these algorithms that we develop
again are going to show you that irrespective of what the adversary does, I can
get this kind of bound.
But it doesn't tell you that if the adversary gave you something nicer, you can
do something better. So okay. So in the worst case, for instance, let's say I
can get a square root T type of bound. But I want to also have the assurance
that, okay, maybe I'm okay with square root T. I'm okay with it being four
times square root T. But then [indiscernible] if the adversary
played something nicer, which I was expecting, then I get much better results.
And we look basically or we just touch upon how we can do this using some of
the techniques from here, inculcating them, and we look at localizing and
adapting. I'll again define what these mean. And we look at online learning
with predictable sequences. So when we have some type of model -- so the basic
idea of online learning with predictable sequences is that in online learning,
you deal with an adversary who is worst case. Sometimes he just throws adversarial
[indiscernible] at you.
Online learning with predictable sequences is: you have a predictable sequence.
You have like some kind of model of what you think the world should be. But then
there is an adversary who tries to corrupt your model. And the less he
corrupts your model, the better you want to be able to do. So we want to capture
that.
Okay. So now let's move to the first part. That's sequential complexities for
online learning. So before we go to that, let's see, okay, so we have the
statistical learning where we have this nice story where basically, nature
picks some distribution D on the set of instances Z and you're given
instances drawn i.i.d. from this distribution D, Z1 through ZT, and the goal of
the learner is to pick some F hat from F based on the sample.
And our goal is to have a low expected loss. So the expected loss of what the
learner picks should be close to the best expected loss that could have been
done. And if this can go to zero, we say that it's statistically learnable.
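In symbols, the statistical goal just described is that the excess expected loss of the learner's F hat vanishes with the sample size; this is a standard rendering of what was said, not a formula from the slides:

```latex
\mathbb{E}_{Z \sim D}\big[\ell(\hat f, Z)\big] \;-\; \inf_{f \in \mathcal{F}} \mathbb{E}_{Z \sim D}\big[\ell(f, Z)\big] \;\longrightarrow\; 0
```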
In online learning again, we have kind of analogous scenario. Scenario here,
of course, is all sequential for time T equal to 1 to T. The learner picks
some action. The adversary simultaneously picks some ZT in Z, and we play the
game. And again, as I mentioned before, the goal is to minimize regret. And
if you can take this regret, the average regret to go to zero, then we say that
the game is online learnable.
Now, let's look at how we give certificates for learnability and learning
rates in general in statistical learning theory. In statistical learning theory,
we have this slew of tools from empirical process theory, and we can
essentially use these to kind of give complexity measures on F that will give
us nice bounds on the learning rate.
And examples of these are the VC dimension, the Rademacher complexity, covering
[indiscernible]. There are a lot of these. And the algorithm here is mostly
generic. You just do empirical risk minimization and for all of these, you can get
these as your upper bound. So that gives you an upper bound on the learning
rate.
However, for the online learning case, at least until fairly recently, the
certificate for learnability has been kind of case by case. So you have a
particular problem, you come up with an algorithm for this problem and you
[indiscernible] bound, and that's the way you kind of show online learnability and
you show the rate.
Of course, for when the problem is convex, you have like tools that are
slightly better in the sense that you can at least for convex scenario, you
have generic tool box to some extent. But in general, our question is can we
come up with complexity measures on the action set F to basically say that
a problem is online learnable. Or can we come up with like a generic tool box.
And to look at this, we look at what's called the value of the game. That is,
so before I get into the definition, I'm going to make, at least for the first
part, I'm going to make a slight change in that instead of Z belonging to Z,
I'm going to take X belonging to X, and instead of using L of F, Z, I just
use the mapping F of X. It's just to kind of mirror what happens in empirical
process theory in terms of complexity.
I just don't want to hold on to the L in terms of notation; it's just to simplify
notation, that's it.
Okay. So what is the value of the game? Well, it's the natural quantity that
you would want to, I mean that you would like to bound, which is basically
what's the best regret I can get in terms of the worst adversary. So if I play
optimally and the adversary plays optimally, then what's the regret that I
suffer? So this is the regret, so the sum from T equals 1 through T of my cumulative
loss minus the best loss.
And we are basically -- so we go: at time step one, we pick the best mixed
strategy. Adversary goes next. He picks the worst, or in his view the best
action that he could have picked. Of course, this is a randomized game, we
draw F1 from Q1 and so on. So this is defined as the value of the game.
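Set out in symbols, the minimax value being defined is the nested optimization below, with q_t the learner's mixed strategy at round t; again, this is a standard rendering of the spoken definition:

```latex
\mathcal{V}_T(\mathcal{F}) \;=\;
\inf_{q_1} \sup_{z_1} \mathbb{E}_{f_1 \sim q_1} \cdots
\inf_{q_T} \sup_{z_T} \mathbb{E}_{f_T \sim q_T}
\left[ \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t) \right]
```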
Now, we can do slight manipulations to this to get it to a form that kind of
doesn't have to explicitly deal with [indiscernible]. And what do I mean by
that? Before I show you what I mean by that, just to make things simpler, when
I have a sequence of this form of some operator, some operator, some operator,
I'm going to use this notation. So T equal to 1 to T of this basically means
that you just write this out in this format.
Is everyone clear about this? It's just because I don't want long things in my
slide.
Okay. So this is just the definition of this operator. Now, [indiscernible]
we can make this mixed and then we can put the expectation here because we have a
[indiscernible] at each term and it's the same thing because it's going to be
attained at a corner. And once we have this, we can do a Minimax swap. Well, you
need some mild assumptions on the set script F and set script X. But by those
assumptions, you can basically swap these two terms and once you do that, you
can push this expectation in here and pull it out. Which basically gives you
this term here.
So it's [indiscernible] over distributions, so at each time step we pick a
conditional distribution and then we look at the inf over FT belonging to
script F of the conditional expected loss.
Okay. So from this, we can move to what I'm going to term Sequential
Rademacher Complexity. So we have this from the previous slide. The first
thing that we can do is, instead of the inf over FT in each single one here, we can
just take -- the inf comes out of the minus, it's a sup -- and we can just go to an
upper bound of this one.
And this already should start looking familiar to those of you who have seen
empirical process theory. So the idea, I mean, you have this maximal
deviation, right. So you have a supremum over a function class of the maximal
deviation. In empirical process theory generally you take [indiscernible]
data, so all of this is just like a single expectation. There's no piece of
[indiscernible] each time. And you're just asking, supremum over F, what's the
speed at which my average converges to the expectation. And this is kind
of the [indiscernible] version of martingale [indiscernible], because these
things are martingale difference sequences here.
Okay.
>>: We can do a little more. XT, so the first expectation and XT for the second
expectation.
>> Karthik Sridharan: Pardon?
>>: The XT on the very right.
>> Karthik Sridharan: This one?
>>: No, more right.
>> Karthik Sridharan: This one? Here?
>>: That one. That's the one that's in the expectation. The last XT. Yeah,
sorry. [indiscernible].
>> Karthik Sridharan: Yeah, I should have used -- yeah, yeah. So the one in
here is the one from here. And the one here is the one from [indiscernible].
Yeah. Sorry.
Okay. So now the next thing you can do is use the [indiscernible] inequality,
pull these expectations out and you can have pairs drawn from this P sub T and
then you get this difference. But now you can essentially include a Rademacher
random variable, by which I mean a random coin flip that takes one or minus one
with equal probability. You can put that in here for free and just multiply this
because both of these are, conditionally, their distributions are the same.
And once you do this, instead of a supremum over distributions, you can always upper
bound it by a supremum over two points, X prime T and XT. And you get this and, of
course, you can divide it into two terms and write it as two times this. So
this is the conditional Rademacher complexity. The sequential Rademacher
complexity. The idea is that you pick -- so, okay, let me describe what this
term is doing here.
So I pick a particular X1, which is kind of the worst X1. Then we flip a coin,
then based on what I see in the coin, I pick the worst X2 and so on. And where
the final term that I want to maximize is this term in here. Now, notice that
basically all of this is generated by coin flips, and we are looking at the
coin flips. Like what are the total number of possibilities? They basically
form a tree, right? Because at the first step, I flip one.
So when I choose X1, I just have one choice. Now when I choose X2, I can
have -- I have two choices. Either my first flip could have been a plus or a
minus. And depending on that, X2 would vary and so on. So basically, it forms
a tree, and you can rewrite this in a kind of -- or at least depends on your
viewpoint, but in a nicer form.
Yes?
>>: In the last bound in your operation, so the very last, if F has a large
offset, the [indiscernible] instead of being 0, 1 becomes 1,000, 1,001.
>> Karthik Sridharan: True. So it could blow up. So here, there is no
centering, and so here there is an automatic centering and then when you go
here, there is no centering. So it turns out that, for instance, like if you
look at, say, a classification or things like regression with, like, standard
losses, in the worst case, the centering doesn't matter. But we'll actually
come back to this kind of in the third part again and we use the centering in a way
that actually can give us more meat: if you knew some predictable
sequence, you could do some centering through the predictable sequence.
That's a good point. There is no centering. So if my F values were in sort
of -- so I'm here, in a sense, implicitly assuming that F is between minus 1
and 1. But if your F was between a thousand and a thousand and one, you're losing
the centering here so you're adding like a huge constant. Yes, that's true.
You had a question? Okay. So as I mentioned, basically, the sup over X, you
can basically look at it as a sup over trees. And this is another form of
defining the Sequential Rademacher Complexity. So here we take a supremum, and
here when I write the bold X, it's a tree. It's a binary tree, by which I
mean each X sub T is a mapping from plus minus one to the T minus one to
script X.
And basically, notice that this would have been essentially the same as the
classical Rademacher Complexity if this weren't a tree and we had this sup over X1
through XT. But here the path actually tells us which XT to go to next. So
pictorially, let me tell you what this is instead of going through the map. So
you have this tree here. This X is like this tree here and you have these
nodes out here. And as an example, let's draw the path epsilon equal to plus
one, minus one, minus one. So that would correspond to this path out here. So
how do we calculate this term in here for a given F? It is the sum from t equal to
one to T of epsilon t F of bold X t of epsilon, right.
But what is X1? For X1 there is no choice. It's just the root. So you have an X1,
but what's the sign that goes with X1? It's epsilon 1. So epsilon 1 comes in
there. I don't remember if I had any -- sorry.
Okay. So the next -- so now we went to the plus one, which means we go to the
left or right. And so that's an X3. So the next one is X3 and its
corresponding sign is minus so we go to the left, that gives us X6 and so on.
So that's the way the inner term is taken. And we take a supremum over
[indiscernible], expectation over the path, and a supremum over the tree. And this
is the Rademacher Complexity. And as we saw in the previous slide, essentially
the result is that the value of the game or of the online learning game is
bounded by two times Rademacher.
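As a compact summary of the definition and the result just stated, the sequential Rademacher complexity over X-valued trees bold-x, and the bound on the value, read roughly as follows (treat this as a sketch of what is on the slides):

```latex
\mathfrak{R}_T(\mathcal{F}) \;=\; \sup_{\mathbf{x}} \; \mathbb{E}_{\epsilon}
\left[ \sup_{f \in \mathcal{F}} \sum_{t=1}^{T} \epsilon_t \, f\big(\mathbf{x}_t(\epsilon)\big) \right],
\qquad
\mathcal{V}_T(\mathcal{F}) \;\le\; 2\, \mathfrak{R}_T(\mathcal{F})
```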
As we'll see, in many scenarios it's actually tight, so what we get here is, in a
sense, the best we can do. And I'd like to mention that properties like
Lipschitz contraction and other properties of Rademacher Complexity also hold
true for the sequential Rademacher Complexity, which kind of allows you to go
through and get rid of the loss or do other nice operations.
Okay. So the next kind of thing that comes to mind when you're talking -- when
we think of empirical process theory is covering numbers. So you can also get
analogous covering number for online learning. So instead of going through
this exact definition, let me show it to you pictorially.
So the idea is that we say that a set V of real valued trees [indiscernible] is
an alpha cover if for all F and all paths there exists a tree such that on
that path it's close. What do we mean by that? Let's evaluate F on all these
X's, and these are the values that you'd get. So I'm just taking the
binary case, for example. So you get these trees.
Now, the idea is that for all F. So let's pick some F, say F2 or F3, and let's
pick a path. So we pick this path here, which is, I guess, minus and plus. So
it's X1, X2 and X5. So what we want is we want to cover this, right? So we
have two real valued trees. In this case, it's enough for binary, and now
notice that basically, this path out here is covered by this path. I mean,
[indiscernible] in this case.
So essentially what we see is the definition of the covering, of what it
means to have a cover in the tree sense. And the idea of covering numbers is
the smallest cover on a particular tree. So I give you a particular tree,
what's the smallest number of such real valued trees such that you can cover the
script F.
And the covering number, without a given tree X, is basically a supremum over X.
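Roughly, in symbols -- this is the sup-norm flavor of the definition just described, stated here only as a sketch -- a set V of real-valued trees is an alpha-cover of F on the tree x if

```latex
\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^T,\ \exists \mathbf{v} \in V:\quad
\big| \mathbf{v}_t(\epsilon) - f(\mathbf{x}_t(\epsilon)) \big| \le \alpha \quad \text{for all } t \le T,
```

and the covering number is the size of the smallest such V, with the worst case taken over trees:

```latex
\mathcal{N}(\alpha, \mathcal{F}, T) \;=\; \sup_{\mathbf{x}} \; \mathcal{N}(\alpha, \mathcal{F}, \mathbf{x}).
```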
All right. So why is this useful? Well, it's useful because we can
essentially -- okay. So we have this Dudley integral bound in statistical
learning theory, and you can essentially get exactly an analog [indiscernible]
in an online learning world.
This is, if you've seen Dudley integral bound before, this is essentially
exactly the same thing. It's just that now instead of the usual covering
number, we have the tree-based covering number. And the nice thing is, an
analogous result to the statistical learning world holds: whenever F is in
minus 1 -- I mean, whenever F maps to between minus 1 and 1 -- we basically have that
the Rademacher Complexity is bounded by this Dudley integral complexity. Yes?
>>: So what are the [indiscernible]?
>> Karthik Sridharan: These are the functions in your function class. So it's
your actions.
>>: So that might be, the way you did all was --
>> Karthik Sridharan: Yeah, yeah, absolutely. If it's all possible functions
you can possibly learn, right. So yeah, definitely. So it's some subset of
functions F. So, for instance, if you could take like the class of all linear
functions and then you'll get a covering number for the class of all linear
functions, by which we mean that I play a vector W and the adversary plays a vector
X and I suffer W transpose X as the loss. So like a lot of online
convex optimization is basically this.
But here we want to capture in general more, like not just convex scenario, but
arbitrary function classes like [indiscernible] and neural networks and all kinds
of non-convex function classes. And you can do that using this.
>>: So I might be missing something. If I'm restricting myself to zero, one
functions, didn't the example you just showed show that using two trees, you can
cover a whole binary function?
>> Karthik Sridharan: Oh, no, these are not all binary functions.
>>: But [inaudible] therefore these trees --
>> Karthik Sridharan: There's a value on every, and there's a zero, zero.
>>: Oh, it's not just a --
>> Karthik Sridharan: No.
>>: Oh, you have to cover.
>> Karthik Sridharan: Yes.
>>: That means if you take a function of zero, one, and you take
[indiscernible], the total number of [indiscernible] is not the number of steps,
but it's also the width of the tree.
>> Karthik Sridharan: Right, in the worst case, it could be really bad.
>>: [inaudible] growth function and the numeric --
>> Karthik Sridharan: Right, right. It could be really bad. So for instance,
if you take the set of all threshold functions, like just in even one
dimension, if you take the threshold functions, the VC dimension is one or two,
right. But it's a little [indiscernible], I'll come to it in a little bit. But
basically, the growth function is as bad as it gets. It's basically not
learnable.
So because you basically can give examples that force you like -- because you
don't have a resolution, you can make it kind of forced, like a search that just
kind of goes on to smaller and smaller intervals. But you never reach that.
>>: [indiscernible].
>> Karthik Sridharan: Right. Okay. So now we have an analog of covering
numbers. Next is, well, next is the famous one that kind of started all of it,
which is the VC dimension, and we want to get [indiscernible], which has already
been done by Littlestone in '88 and then the agnostic case was by Ben-David,
Pal and Shalev-Schwartz.
So the idea is that you're given, again, an X valued tree X, and we say that an
X valued tree X of depth D is shattered by a function class if for every path
we can realize it. What do we mean by that? So this is a tree, and
basically, for every path that I choose, like minus, plus, minus, I should
be able to find a function, for instance F2, such that F2 of X1 is minus, F2
of X2 is plus, F2 of X5 is minus. And so for every path, I should have a
corresponding function that can actually attain those values.
So if that happens, then we say that F shatters this tree. And I'd like to
point out that this tree is really not the same as a covering tree. These are
two different objects.
Okay. So you can also -- of course, the Littlestone dimension, as I mentioned, is
the largest D such that F shatters some script X valued tree of depth D.
So we can also get similar kind of analogs, combinatorial parameter for the
real valued case. Again, we have this X tree. Now we have to have some notion
of margin, right. So the notion of margin is given by this witness tree. So
we have a real valued tree S, which kind of tells you what point you want the
margin across. And the idea is that if I pick, for instance, this path, plus,
minus, and plus, then I should be able to find a function, say, for instance,
F6 here, such that F6 of X1 minus S1 is larger than alpha over 2. So when I was
going to the right, it has to be larger by alpha over 2. And when I
was going to the left, it has to be smaller. And so on. So basically, we
should be able to not just get these values, but basically be able to move away
from them by [indiscernible].
So this is basically, it's very similar to how you get fat shattering dimension
in the usual case, except it's on a tree.
And how do we use these parameters? I just defined the parameters, but how do
we use them? So you can basically show an analog of the VC lemma, also
known as the Sauer-Shelah lemma. The idea is that if you have a function
class, say a multivalued function class whose fat shattering value is D, then
you can get a bound of this sort.
Then there is also a real valued version of it. So the covering number at scale
alpha is basically upper bounded by something like T over alpha to the
[indiscernible] dimension at scale alpha. So now you can use these
combinatorial parameters in there to get your bound. So you plug them in, you
get the bound on the covering number, put this back in your Dudley integral bound
and you get the bounds you want on the value of the game.
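As a rough statement of the combinatorial bound being invoked (the exact constants are in the paper, so treat this as schematic): the tree covering number at scale alpha is controlled by the sequential fat-shattering dimension at that scale,

```latex
\mathcal{N}(\alpha, \mathcal{F}, T) \;\le\; \left( \frac{2 e T}{\alpha} \right)^{\mathrm{fat}_\alpha(\mathcal{F})}.
```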
So let's just take -- so up to here, of course, I was trying to give you some
complexity measures for online learning. Let's just take a slight detour for a
few seconds. So I just want to mention that, I mean, so, of course, one of
the, like, being a machine learning person, I kind of like my main view of
things like Rademacher Complexity and VC dimension are in terms of learning.
Like basically, they help you give bounds for machine learning algorithms. But
in empirical process community, it's also studied because it gives you -- I
mean, it basically tells you when you have uniform [indiscernible] convergence,
or when, uniformly over the function class, averages converge to
the expectation.
And basically, you can kind of give an analogous result in the -- I'm sorry.
In the world where you have different sequences. So basically, let's define
class F to satisfy uniform universal convergence if for all alpha. I'm sorry.
I didn't complete the definition. So basically what you want is supremum over
F irrespective of what distribution you pick. Supremum over F of average of F
of XT minus the conditional expectation goes to zero.
So you want this to go to zero, almost surely. That's basically what I wanted
to write. And the supremum here is the distributions over the infinite
sequence, X0, X1.
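Written out, as I understand the spoken definition, the requirement is that uniformly over the class the averages track the conditional expectations, no matter what joint distribution generates the sequence:

```latex
\sup_{f \in \mathcal{F}} \left| \frac{1}{T} \sum_{t=1}^{T} \Big( f(X_t) - \mathbb{E}\big[ f(X_t) \mid X_{1}, \ldots, X_{t-1} \big] \Big) \right| \;\longrightarrow\; 0
\quad \text{almost surely as } T \to \infty.
```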
If you saw basically how we built it from the sequential Rademacher and so on,
one direction is kind of obvious, that if you have finite sequential
complexities, like if you have finite fat shattering in the tree, then this
almost sure convergence holds. That is an easy direction. But it's not too
hard to basically show that this -- that at all scales, fat shattering
dimension of alpha being finite is also a necessary condition for this uniform
universal convergence. So that's all I wanted to say about that. We can get
back to learning.
Okay. So let's look at online supervised learning. The idea is that you have,
I guess the first example is the binary classification example. You're given X
and you're given Y and you basically want to predict Y and you have a function
class F such that it takes F of X and gives you binary labels. And in the
statistical learning world, we have a complete story in that things are
statistically learnable if and only if we have finite VC dimension, and
this is a [indiscernible]. And, well, in the online learning world again, it was
proved by Ben-David, Pal and Shalev-Schwartz that online binary classification
is learnable if and only if you have a finite [indiscernible] dimension, the
tree-based one that I described before.
And you can basically go to the general supervised learning problem, where F
of X is like a real valued thing and you're looking at, say, regression with
absolute loss. But you can also extend it to other losses. But for now, let's
just think of absolute loss.
And for this, again, there is this result by Alon, et al., and Bartlett, et al.,
which basically says that the problem is statistically learnable if and only if at
all scales the fat shattering dimension is finite. And, well, you would want the
analogous result for the online supervised problem, and it turns out that it is true.
So if you have a set of functions that are bounded by one, then the online
supervised problem is learnable if and only if we have a finite fat shattering
dimension in the [indiscernible] case. And moreover, what you can show is that the
value of the game, the Sequential Rademacher Complexity, the Dudley integral
complexity are all within a [indiscernible] factor of each other.
And, in fact, you can show that the Sequential Rademacher Complexity both
upper and lower bounds the value to within a factor of at most two.
So you cannot really do better than what -- than the bound that the Sequential
Rademacher Complexity gives.
Okay. So in terms of where you can apply the results, by that I mean you can get
bounds for these dimensions, as I mentioned, it's nonconstructive. So you can
give nonconstructive bounds -- of course, you can give bounds for online convex
optimization and learning with linear functions. But this we already knew.
But what you can kind of deal with are things that are non-convex like neural
networks, decision trees. You can give generic margin bounds where we have an
arbitrary function class, and lots of other examples. And the main thing here
is that you can deal with non-convex scenarios, but then it's all
nonconstructive.
Well, we started off with question, can we come up with generic toolbox to kind
of tell us how well we can do in online learning, and we said that things are
case by case. So we kind of said okay, we don't want to construct an algorithm,
get a bound, because that becomes case by case. We want a generic tool box.
And we have a generic tool box that tells us how well we do, but -- yes?
>>: You're now comparing to the statistical regression in that one. In the
[indiscernible] the examples come from a distribution. In the case where you lose
the IID part, you're [indiscernible] and you're looking at the worst sequence of
[indiscernible], whatever they are. So you must pay for that.
>> Karthik Sridharan: Right.
>>: Where do you pay?
>> Karthik Sridharan: Where do you pay? So the complexities are different,
right. Now it's the tree complexities. So it could be much higher. But if
you look at all these examples, like neural networks, decision trees, you don't
pay much, because the statistical learning and online learning are actually
equally bad. But then, for instance, just the threshold function, I mean,
threshold is like the simplest example, and we know that it's like very easily
statistically learnable with an efficient algorithm. But in the online learning
world, it's not at all learnable. So you pay because the tree complexity and the --
>>: [inaudible].
>> Karthik Sridharan: No, I said it's not at all learnable.
>>: Oh, okay.
>> Karthik Sridharan: Sorry, yeah. I just said not at all online learnable.
So basically, the tree complexity. You can also say -- the tree complexity is
always lower bounded by the classical complexity. Because you can just take a
tree, and at each level, just make all the nodes equal, then you exactly get
back the classical complexities.
So it basically means that the tree complexities are larger. In some cases they
can be much larger, but in a lot of applications they're actually more or less the
same.
>>: [inaudible] online learning as statistical learning is equal to uniform
convergence.
>> Karthik Sridharan: It's only for supervised learning.
>>: Only for the supervised.
>> Karthik Sridharan: Seen here.
>>: And here is uniform convergence with [indiscernible].
>> Karthik Sridharan: Right, exactly. Yeah. Then the [indiscernible] part,
we got only very recently; it took us quite some time to show that it's also a
necessary condition, that finite fat shattering is necessary for this uniform
universal convergence. Sufficiency is easy to show, but the necessary part
takes -- I mean, I guess it's again easy. It's just that we didn't get it for
a while.
>>: [indiscernible].
>> Karthik Sridharan: So there have been, like, I think like I've seen like a
couple of ones which are -- there's one in limited scenarios, not like a
generic function class. I have seen like one which is about like Lipschitz
function. But the basic thing, like if you look at a lot of empirical process
theory for, like, non-independent -- for dependent data, most of it kind of
assumes things like [indiscernible] and then the idea is to use VC dimension
with things like blocking technique to say that you get [indiscernible] it, but
then essentially the same tools kind of apply.
Yeah, so as far as we know, this is the only one. All right. So, well, we
have all these tools, but all of it is nonconstructive; we don't have algorithms.
We want algorithms that get these bounds. Otherwise, why is it useful? And that's
kind of what we're going to look at next, which is how we get from all these
analysis to actual algorithms.
So let's get back to the root, the Minimax algorithm, where we started, the
value of the game. So what's the definition of the value of the game? If you
remember, it's basically I do best, adversary does best and we kind of do this
for many rounds and we look at what regret is. And this was defined as the
value of the game.
And let's just rewrite this in a slightly different equivalent fashion. You
can basically take a few of these [indiscernible] inside by writing the sum out,
in the sense that you can do this. So the competitor is always going to --
>>: [inaudible].
>> Karthik Sridharan: Oh, yeah. So yeah, I mean, you can basically -- the
competitor always [indiscernible] but these losses kind of come out a little
bit. And let's just kind of rewrite this definition recursively by looking at
this form.
So I think you have this here for absolute loss. So basically, you can
rewrite the value in a recursive form which is inf over Q, sup over Z, and then
you have expectation of F drawn from Q of the expected loss, plus this term, which
goes all the way up to capital -- I mean, ZT, Z. And the starting condition
is, of course, when you're given everything, it has to be the competitor, that
is this term.
And once you have this, it's easy enough to see that the value of the game is
basically this thing given nothing. So all the way when you're given nothing,
it's at value of the game. When you're given everything, it's just minus the
competitor.
Okay. So the Minimax strategy again is simple. You play for T rounds. You
get F1 through FT, you get Z1 through ZT, and now for the next step what you do
is you take your Z1 through ZT, plug it into this, and then you take the argmin
over the mixed strategies of the sup over Z of this thing. This is just simple
rewriting of stuff.
And so the exact Minimax algorithm for particular cases have been done. One by
Nicolai and Gabor for absolute loss, which nicely kind of takes the value of
the game and shows that you can go to the classical Rademacher Complexity and
so on. And there's also, for another case, by Abernethy, Warmuth and I don't
remember the third person. But in general, the problem is that this Minimax
strategy is not computationally feasible. So we cannot really solve this
exactly.
So, well, the idea is to basically replace this by some other function,
which -- let's call it relaxation. Basically, good upper bounds. And, well,
relaxation, we say, is admissible if this condition is satisfied. So I add my
current loss. I take info over distributions, sup over Z. My expected current
loss plus the relaxation with the Z included should be upper bounded by
relaxation of only Z1 through ZT without the Z.
And the initial condition is that the relaxation is larger than minus the
competitor. Actually, it's not initial. It's final. But we are kind of going
from inside out, so it's initial --.
>>: You have kind of [indiscernible] and instead of obeying exactly the
recursive formula, it's just the upper bound.
>> Karthik Sridharan: Yes.
>>: And [inaudible].
>> Karthik Sridharan: It is very similar. And it resembles something else
which I'll quiz you on in just two seconds. Okay. So basically, again, just
like how we did for value, you can basically show that for any admissible
relaxation, the value is upper bounded by relaxation given nothing. And this
is just dynamic programming.
The nice part is that this -- I mean, like if you look at this in here, the Q
kind of comes in only in the first part. It doesn't come in the second part.
This is kind of special to external [indiscernible]. I guess even for
[indiscernible], but it doesn't work all the time. Like if you go to games
that are not regret form, it doesn't work. And if you go to partial
information games, then there is coupling. But for this, it has a nice form.
Okay. And the strategy again is kind of similar to what we did to the value.
You basically find the Q that minimizes this.
And again, the expected regret of the strategy is bounded by relaxation.
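To make the recipe concrete, here is a minimal Python sketch of one step of the meta-algorithm for finite action and move sets. It is an illustration only: the function and parameter names are mine, the relaxation `rel` is assumed to be supplied by the user, and the inner min-max is solved as a small linear program (assuming scipy is available).

```python
import numpy as np
from scipy.optimize import linprog

def relaxation_step(actions, moves, loss, rel, past):
    """One round of the relaxation meta-algorithm (illustrative sketch):
    return  q_t = argmin_q  max_z [ E_{f~q} loss(f, z) + rel(past + [z]) ],
    where 'rel' is an assumed, user-supplied admissible relaxation."""
    L = np.array([[loss(f, z) for z in moves] for f in actions])  # loss matrix
    n, m = L.shape
    # LP variables (q_1, ..., q_n, v); minimize v subject to
    #   sum_f q_f * loss(f, z_j) + rel(past + [z_j]) <= v   for every move z_j,
    #   q lying in the probability simplex.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    A_ub = np.hstack([L.T, -np.ones((m, 1))])
    b_ub = np.array([-rel(past + [z]) for z in moves])
    A_eq = np.zeros((1, n + 1))
    A_eq[0, :n] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0.0, None)] * n + [(None, None)])
    q = np.clip(res.x[:n], 0.0, None)
    return q / q.sum()
```

The admissibility condition from the slide is exactly what guarantees that playing such a q at every round keeps the cumulative loss plus the relaxation from growing, so the final regret is bounded by the relaxation evaluated on the empty prefix.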
Okay. So this is very -- I mean, like this is very similar to something that's
even used in online learning world, and Nicola knows the answer. So it's
basically exactly potential method. And initially, I mean, so we kind of
started with value and we wanted to look at how -- so we didn't actually write
it in terms of relaxation. We wrote it in terms of what we had in terms of the
sequential Rademacher complexity. We wrote it down, we said we had that and then
we said let's make this more general. And once we made it more general, we
just realized that this is potential method. Nothing different.
Okay. So the first kind of result is that the conditional sequential
Rademacher relaxation, forget that long name. Basically, what it is is kind of
a conditional version of the Sequential Rademacher Complexity that we had
before. What we had before, though, was no Z1 through ZT given. It's all the
way to the end. And we had the same sup over Z and then this expectation and
all that. Now, the only difference is that when we're given Z1 through ZT, you
just subtract the sum of the losses before.
And you get this. And what you can -- I mean, basically what happens is you
can look at the proof of how we showed the value is bounded by the Sequential
Rademacher Complexity. And exactly the same proof shows you that the conditional
sequential relaxation is admissible and that you can -- and that's basically how the
[indiscernible] value is upper bounded by two times Rademacher. But now you
have an actual strategy that comes from what we had in the previous slide. So
the strategy given by this corresponding -- that corresponds to this
relaxation.
Now, the observation that we make is that most algorithms found in the online
learning world actually come out from relaxations that are upper bounds on the
conditional sequential Rademacher relaxation. And so the basic idea is that
you want a general recipe. And notice that here we have the sup over Z, right.
And the Z is indexing over the future. It indexes from T plus 1 through capital T.
And in a sense what you're saying is, I want to have a look at what my current
losses are, and I want to look at how my adversary can hurt me, and try to minimize
this.
That's not just going to happen for this round. It's going to happen for the
future rounds. And the sup over Z and the sum over T plus 1 through T is
basically kind of discounting for the future. And the idea in all of this is
to get rid of the future trees Z, the Z out here, by passing to as [indiscernible]
an upper bound on the sequential Rademacher as possible.
And sometimes what kind of happens in all of this is you can go through all the
results that I give in a nonconstructive way, and all of them kind of go in a
way that is inside-out. Which basically means that you can take each of these
and you can convert them into an algorithm. You can convert them into a
relaxation and the relaxation gives you an algorithm.
And the idea is that you keep going to upper bounds, to larger and larger upper
bounds until you have a method that is actually efficient or has some property
that you desire.
Okay. So a kind of easy example, like the first try, would be to look at when you
have a finite set F, and it's easy enough to see that if you look at, for
instance, the proof of how you show [indiscernible] finite lemma, you basically
show an inequality like this. Of course, this is for the conditional version,
because it has the sum over this one, the sum over previous losses. If you don't
have the previous conditioning, you basically would have one over lambda log size of
F plus lambda times capital T. And that's how you prove, like, Massart's finite
lemma, saying that, uniform over script F, what's my -- uniform over script F,
what's the -- like how fast do [indiscernible] difference sequences converge.
And you basically go through the proof, you write this down and you can
basically put this inf over lambda in here for each step. It just kind of, you can
derive this relaxation out by just plugging in the first step of Massart's finite
lemma, and this automatically leads to a parameter-free version of the
multiplicative weights algorithm.
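As a concrete instance, here is a small Python sketch, my own illustration rather than code from the talk, of the parameter-free step this kind of relaxation suggests for a finite class with bounded losses: at each round, tune lambda against the observed cumulative losses and the remaining horizon, then play the corresponding exponential weights (the constant in front of the horizon term is schematic).

```python
import numpy as np
from scipy.optimize import minimize_scalar

def parameter_free_weights(cum_loss, rounds_left):
    """Pick lambda by minimizing a relaxation of the assumed form
       (1/lambda) * log sum_f exp(-lambda * cum_loss[f]) + c * lambda * rounds_left
    and return the resulting exponential-weights distribution over the finite class."""
    def relaxation(lam):
        if lam <= 0:
            return np.inf
        shifted = -lam * cum_loss
        m = shifted.max()
        log_sum_exp = m + np.log(np.exp(shifted - m).sum())   # stable log-sum-exp
        return log_sum_exp / lam + 2.0 * lam * rounds_left    # '2.0' is a schematic constant
    lam_star = minimize_scalar(relaxation, bounds=(1e-6, 10.0), method="bounded").x
    weights = np.exp(-lam_star * (cum_loss - cum_loss.min()))
    return weights / weights.sum()

# usage sketch: q = parameter_free_weights(np.array([3.0, 5.0, 4.5]), rounds_left=97)
```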
So in the multiplicative weights algorithm, you have this step size eta. You either
set the step size based on knowing capital T, or you set it as one over square root
of T at each time step. Here, basically, what it says is: do this inf over
lambda here. So this is a [indiscernible] problem. Find out the lambda that
minimizes this. This only depends on my previous losses. So at the time
step where you want to calculate QT plus 1, look at my sum over previous losses,
take this term and try to find out what the best lambda is and plug that lambda
into your --
>>: How does the [indiscernible] come up?
>> Karthik Sridharan: This is an upper bound --
>>: The [indiscernible], where did you get it from?
>> Karthik Sridharan: Log what?
>>: Under sequence -- [indiscernible] complexity.
>> Karthik Sridharan: Yeah, basically it's just taken from the -- okay. So
the idea is you take like the, for instance, how you prove bound on the massage
finite lemma and take the step that you use there, and basically the proof is
really proof for showing the relaxation is admissible. And that's about it.
Yeah, sure, I mean, there is one like step which you cannot derive. I mean,
there has to be one step which you cannot derive because it's an upper bound,
right. So it's not the exact thing. So unless it's -- I mean, as long as it's
not equality, there has to be some kind of a, like some kind of an observation
that you need to use.
>>: [inaudible] random variable.
>> Karthik Sridharan: So here, I'm assuming they are bounded. So that's why
you have this T minus -- here it's bounded by --
>>: [inaudible].
>> Karthik Sridharan: Okay. So you can actually reason out why this is the
[indiscernible]. So if you had -- oh, I guess I didn't. Without any other
structural assumptions, if you didn't know any other kind of -- if you
didn't -- if F were just arbitrary, then you can show that this is a
[indiscernible] relaxation, because at least asymptotically, because of -- I don't
remember. It's Cramer something theorem. Chernoff, maybe. Cramer-Chernoff,
something like that. Which kind of shows that asymptotically the log partition
function is like the right way to get your -- I don't remember the exact
statement. But basically, you can show like your -- I mean, it's the asymptotic
version of [indiscernible] inequality. And it shows that the log partition is the
right function.
And so without any other extra assumptions, you basically, this relaxation kind
of comes out. You use that because that's the [indiscernible]. But I am using
some -- I mean, I'm using like Massart's finite lemma as the first step to
getting to this.
So the basic idea is, I mean, I guess I can't pull it out of the thin air, but
basic idea is that if I have a proof in the empirical process theory world and
I take that proof and I can [indiscernible] algorithm. And if I have an
algorithm in the online learning world, which kind of fits in this framework,
then I take this and I want to prove something in the empirical process theory
world. And kind of get this thing that I can use results from here into there.
And so, for instance, you can -- so for all of this, we have like the
algorithm, but for things like neural networks and for -- okay, for generic
binary class, online binary classification, there is the Ben-David, Pal and
Shalev-Schwartz algorithm. But you can show that if you can kind of only
evaluate the Littlestone dimension of some function class, like if you can get an
upper bound on that, then you kind of get algorithms -- I mean, you can get a
much more efficient version of it.
So for all the bounds that we got where there were no bounds before, you get an
algorithm out of this. Okay. Other examples are you can kind of derive -- you
can recover gradient descent and mirror descent and their variants. And in terms of
[indiscernible] there is like this universality of mirror descent, which kind
of comes from this result in [indiscernible] based theory by [indiscernible].
And a nice thing is that you can kind of, in a sense you can see that the
regularization term that you get there is, in a sense, almost just derived from
this. Because you can see that it is just like if you want to [indiscernible]
rate, then like the only option you're left with is like to go -- you go to the
tightest bound on the Rademacher in terms of getting rid of this future tree
that kind of gives you the exact relaxation that corresponds to that
regularizer. So in a sense, you derive it.
Okay. And you can also get relaxations based on sequential complexity measures
that I mentioned before and get algorithms out of it. So you get constructive
way of getting these upper bounds. Of course, even when the function class F is
[indiscernible], even though you get these constructive methods, they may not
be efficient at all, or they may not be [indiscernible] even in the sense of being
computationally feasible.
Okay. To get -- okay. So if you want to get more efficient algorithms, you
basically find like nicer relaxations that are easier to compute.
Okay. So now let's look at the idea of random play-out. So let's go back to the
Sequential Rademacher Complexity. So at round T, basically, the conditional
Sequential Rademacher Complexity relaxation was this guy here, and we had like
some horrible terms in this square bracket. But the basic idea is that, okay,
so what we want to do is imagine that I had an oracle that could give me this
Z, that could compute this sup over Z if I gave it all the terms that are
required in there.
Then the idea is to basically do a random play-out, and do online learning based
on that. So given any Z1 through ZT minus 1, let's say that I can compute this
arg max. So I can find the particular tree that maximizes this term in here.
Now, what we can do is, okay, so the basic idea in all of this is going to be
that notice that sup basically can come out of all the way to here. And now
expectation is linear, right. I mean, linearity of expectation we can use.
So the basic idea is that the Q is a distribution that we can pick here.
So
we're going to try to mirror Q to reflect what's happening here. That's the
idea in both this slide and the next one. So we pick Q to mirror what happens
here. And then essentially, we get, of course, we can calculate the sup over Z
for each term, and then we can put that out and you basically get a single
expectation due to linear expectation. You pull that out and effectively what
you get is that you can get a randomized strategy.
The idea is that at round T, you draw epsilon T plus one through epsilon capital T.
That is the coin flips for the future, and then you basically calculate this -- I
mean, we are assuming that we can calculate this arg max. So if you can
calculate this worst case tree, you basically find the Q that minimizes this.
So the idea here is that you can solve this efficiently. And, of course, I'm
assuming we have an oracle for this, which is not a realistic assumption, but
in the next slide, we'll see that in a lot of cases you can actually get rid of
it. And this gives you a randomized algorithm that essentially suffers the
same regret, expected regret bound as the Sequential Rademacher Complexity.
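Schematically, one round of this randomized strategy might look like the sketch below. Everything here is assumed rather than taken from the talk: `oracle` stands for the routine that returns the worst-case future given the drawn coin flips, and `best_response` for whatever solver computes the resulting mixed strategy.

```python
import numpy as np

def random_playout_round(past_moves, oracle, best_response, horizon, rng):
    """One round of random play-out (schematic): draw the future coin flips,
    ask the assumed oracle for the worst-case future, then solve the
    now-deterministic problem for the learner's mixed strategy."""
    t = len(past_moves) + 1
    eps_future = rng.choice([-1, 1], size=max(horizon - t, 0))   # flips for rounds t+1..T
    worst_future = oracle(past_moves, eps_future)                # assumed worst-case tree/path
    # with the coin flips and the future fixed, choosing the learner's mixed
    # strategy is an ordinary optimization; 'best_response' is assumed to solve it
    return best_response(past_moves, eps_future, worst_future)
```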
Okay. So the second idea is, again, you start with the same thing, but instead,
imagine that -- so this kind of stems from the observation that lots of times,
basically, in the worst case adversary scenario, the lower bounds are actually
obtained by forming the worst case [indiscernible] distribution. And this
happens a lot of times in online learning.
So the worst case statistical learning is not too far away from the worst case
online learning. And you can actually come up with these distributions. And
so you want to kind of use this knowledge in some sense.
And so imagine that I could actually do this; then I would basically replace
the sup over Z with this distribution D. And now, again, instead of F
from Q, I can replace -- I mean in some sense what I do with Q is I kind of
replicate this process, and then by linearity of expectation, it comes out -- and
due to convexity, it comes outside the sup and stuff. And basically what you
have is something similar to the last slide, but to kind of make it simpler,
let's look at the linear case. Now, there is a key difference here.
So on the previous slide, the Z, the trees that we found on each round
were different. Here, I'm putting a single D for all rounds. So this D is not
superscripted or subscripted by small T, which means that I want to find a
single distribution that's really bad.
So for this to happen -- I told you that it stems from the
observation that in lots of cases, the online and the
statistical complexity are the same, and the statistical and Sequential
Rademacher Complexity are the same. So basically, that means that there is
some kind of underlying thing that makes this happen.
So one -- yes?
>>: So [inaudible] loss, which is the same. But this here, you are not
going -- this is not an upper bound. You're restricting the adversary by
restricting [indiscernible].
>> Karthik Sridharan: Right, but you cannot do it for free, right. I mean, as
long as I can find a distribution that's equal. But in general, I'm going to
pay a constant. You'll see that I'll pay this constant C here. It will be
that I cannot find a distribution that's as bad in terms of without the
constant. But if I pay like an extra -- so, for instance, the worst case
adversarial might be like a factor three worse in the worst case. But that's
fine for me. I'm okay with constant factors.
But I need like, I mean, it's not just that these two are off by a constant.
You need a step-wise version, which is like the assumption that I'll be
introducing here. In the linear case, I'll explain what the assumption is.
For instance, the online [indiscernible] are basically static experts. You can
again show that basically, it falls under the same framework.
So what you can do is you can get like a randomized version whenever you have a
randomized algorithm, whenever you have -- you get an efficient randomized
algorithm whenever you have an online transductive learning problem with convex
losses. When it is non-convex, you get a mixed strategy. But there's no
guarantee that it will be computationally, like, efficient.
But when it's convex, you can get computationally, like something that just
solves a convex problem, what it does is it flips coins. It gets some -- I
mean, in a sense, it will start resembling follow the [indiscernible] leader.
And for the linear case, you can actually get the follow the [indiscernible]
leader. But the basic idea is you flip coins and instead of using this -- the
idea that these relaxations are -- so notice that basically this was like a
relaxation, right? You just took Rademacher as an example, but this proof kind
of goes through whenever you have a relaxation of the form: expectation of some
function of X1 through XT. Now this expectation may not be easy to compute.
But the idea is that you mirror this expectation: if you knew what
the draws were or what the distribution should be, you mirror what this
distribution is, you draw the same random variables. The expectation itself may
not be easy to optimize. But when you're given everything, lots of -- I mean,
oftentimes it becomes easy enough to optimize.
And then you use that fact to kind of get a randomized algorithm.
>>: So just like [indiscernible] technical amateur who is going to do technical
stuff.
>>: You're right.
>> Karthik Sridharan: Okay. So for linear loss, you can basically get like a
simple condition that basically says that sup over Z, this perturbation over Z
can be replaced by a draw from some distribution D that we know. And so the
idea would be that you draw -- so in terms of algorithm, what it's doing is it
draws the future from this distribution D and also the Rademacher variable.
And then it solves the simple problem where there's no expectation here. It's
just this exact problem.
And it turns out that -- and, of course, the bound that it enjoys is the
classical Rademacher Complexity up to constant factor C and from this you can
easily derive new versions of follow the perturbed leader for L1/L-infinity. If
you've seen the follow the perturbed leader proof, for instance, for
L1/L-infinity in this case, it's kind of, I mean, in some sense it's
magical, because this [indiscernible] distribution is used in a very, like in a
crucial way to actually get this proof to move through.
So here, first of all, we get this to work with Gaussian distribution, and the
proof is kind of more like, it kind of follows this general framework. And it
gives you a more -- yeah?
>>: [inaudible].
>> Karthik Sridharan: Yeah, L1/L-infinity is Gaussian. For L2/L2, we use
uniform sampling on the unit hypersphere. And the nice thing about L2/L2 is that
we don't know of a version, at least up to now, where you get a dimension-free
result for L2/L2. And so this gives you like a dimension-free follow the
perturbed leader for L2/L2.
The idea, okay, so to kind of give you a view of what this follow the perturbed
leader is, is you minimize the loss, like the sum up to time T minus 1, plus an
extra -- so it's like a linear term with the sum of X1 through XT minus one,
plus some extra perturbation. And this perturbation, in a sense, you can view
it as regularization. You can view it as inculcating. There are lots of ways
of viewing what it does.
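Concretely, a follow-the-perturbed-leader step of the kind being described might look like the following minimal sketch for linear losses over an L2 ball; the Gaussian perturbation and the square-root scaling of the noise are illustrative choices (the talk uses different perturbation distributions for different geometries), and the names are mine.

```python
import numpy as np

def ftpl_step(past_xs, dim, radius, rounds_left, rng):
    """Follow the perturbed leader for linear losses <f, x> with f constrained to
    an L2 ball of the given radius: minimize the perturbed cumulative loss."""
    cum_x = np.sum(past_xs, axis=0) if past_xs else np.zeros(dim)
    noise = rng.standard_normal(dim) * np.sqrt(max(rounds_left, 1))  # schematic scaling
    direction = cum_x + noise
    norm = np.linalg.norm(direction)
    # the minimizer of <f, direction> over the ball points opposite to 'direction'
    return -radius * direction / norm if norm > 0 else np.zeros(dim)

# usage sketch:
# rng = np.random.default_rng(0)
# f_t = ftpl_step(past_xs=[np.ones(5)], dim=5, radius=1.0, rounds_left=99, rng=rng)
```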
>>: [indiscernible] F is constrained to the one --
>> Karthik Sridharan: Yeah, the first is F and the second is X. And we
suspect that basically, I mean, we haven't tried too hard, but we suspect that
basically, you can derive follow the perturbed leader with the optimal rates for
any LP class, for like any small LP. But, I mean, we haven't worked that out
yet.
>>: [inaudible].
>> Karthik Sridharan: Yeah, LP, LQ.
>>: LQ is the Holder conjugate? [inaudible].
>> Karthik Sridharan: Yeah, the non-dual I don't know. But for dual, I think
it can go through. Okay. So this is random play-out. Now let me, like, kind of
give you a list of other stuff that I didn't have much time to cover.
So we kind of get parameter-free variants for most of the commonly used online
learning algorithms. Basically we just went through [indiscernible] book, tried to
look for all the online learning algorithms, see if it comes out of the
framework. And most of them did.
And we get some new online learning algorithms for the binary classification,
online binary classification problem, and in some sense, at least -- okay. So
if you have a particular oracle, you can show that given that oracle, our
algorithm is much more efficient compared to what was known before.
And we derive algorithms for the nonconstructive bounds that I showed earlier.
But I'm not claiming that they're efficient. But we do get new efficient
algorithm for online transductive learning with convex losses, and I guess the
more interesting one is for online matrix completion problem with trace norm,
we get a new algorithm -- I mean, we actually get two variants, both of which
are efficient. And one of the variants only needs to calculate like spectral
norms at each time step.
So in terms of worst case complexity, it's as bad as the other variant, or as
good as the other variant. But in reality, lots of times the spectral norm is
very easy -- in practice it gives you really fast convergence. So at least on
small scale experiments, we got some pretty nice results.
>>: [inaudible].
>> Karthik Sridharan: Yeah. At least that's what he told me. Okay. So the
other thing is that this bound, we don't need some of the assumptions that the
other works need. Like there's also the Shy and Hazan and [indiscernible] so
they have this optimal algorithm for this based on a particular decomposition.
Theirs also gives an efficient version with optimal bounds. But then they
also need like these boundedness assumptions and so on. We can basically get the
result that gives you the same guarantee as the Rademacher Complexity with no
other assumptions.
So the final part which I'm going to kind of like blaze through is adaptive
online learning algorithms and benign adversary. So the first thing I mean,
all of this, even the slides I made, I made at a high level so I'm not going to
give too much detail.
So all of what we did in the previous part were I have this worst case
adversary and I want to get the optimal guarantee or some nice guarantee
against this worst case adversary. But what if, I mean, what if you had like
some notion in mind saying that I think the adversary behaves like this, but
maybe he doesn't. But if he does behave like this, then I want to get a nice
bound.
And you want to kind of capture this notion. You kind of want to adapt to when
the adversary is nicer, you want to actually enjoy better regret bounds. But
you don't want to lose too much when he's not.
So we want to basically make use of, take advantage of the sub-optimality of
the observed sequence so that the idea is that -- so let me define what this
set is. You have like some Z1 through ZT given before.
So the idea is you want to kind of look at where the competitor is going to be.
So if I'm given Z1 through ZT for any like any kind of -- any further sequence
into the future, I look at the heuristic class of only items for the future,
because that's where my competitor comes through. And you can get this kind of
decomposition of regret bound.
The basic idea is you choose a partition on the fly using this thing called the
blocking technique, and then you use any relaxation algorithm that we
introduced before as a meta-algorithm, but you run it only on a localized set.
And, well, that's basically all there is to it.
Now I'll tell you what you get out of this. The first thing is something that
people have been looking for for a long time, a local Sequential Rademacher
Complexity: you can derive all of these local complexity measures for online
learning, which give fast rates.
And you can get adaptive algorithms that automatically adapt to the data norm.
So, for instance, one example would be that you adapt automatically: if the
adversary plays strongly convex functions, you get 1 over T, but if he plays
linear functions, you get 1 over square root T, and you automatically adapt to
it. And you can also adapt automatically if the adversary plays exp-concave
functions and so on.
>>:
[inaudible].
>> Karthik Sridharan: Yeah, this was known already. But the main thing is
that, for instance, you can also adapt in the same algorithm to concave losses.
It doesn't need to know anything beforehand. I mean, of course, there is a
trade-off between how computationally efficient you want it to be versus how
much you want it to adapt.
But if you [indiscernible] computational efficiency, you can adapt to almost
anything that happens. So you can kind of get the best thing that you would
expect.
And you also get an adaptive experts algorithm that takes advantage of when the
adversary is benign or when some experts are really good, and so on.
Okay. Another view of all of this is that you can have a constrained
adversary, where the adversary at each time step picks a mixed distribution,
but from a constrained set. So P sub t is a constrained set that depends on
what happened previously, Z1 through Zt minus 1.
So before, the sup was over the set of all distributions. Now we are saying
that, given the observed sequence so far, he's constrained to act in a
particular way.
And you can again use all the non-constructive analysis that we did before,
like the Rademacher analysis. You can get some results, but I don't want to
put them on the slide, because they are notationally complex.
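To make the change concrete, and assuming the same notation as the minimax
value defined at the start of the talk, the constrained version should look
roughly like the following, with the only difference being that each sup now
runs over the constrained set $\mathcal{P}_t(z_1,\dots,z_{t-1})$ rather than
over all distributions:

$$
\mathcal{V}_T(\mathcal{F}) \;=\; \sup_{p_1 \in \mathcal{P}_1} \inf_{q_1 \in \Delta(\mathcal{F})} \mathop{\mathbb{E}}_{f_1 \sim q_1,\, z_1 \sim p_1} \;\cdots\; \sup_{p_T \in \mathcal{P}_T(z_{1:T-1})} \inf_{q_T \in \Delta(\mathcal{F})} \mathop{\mathbb{E}}_{f_T \sim q_T,\, z_T \sim p_T} \left[ \sum_{t=1}^{T} \ell(f_t, z_t) \;-\; \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} \ell(f, z_t) \right]
$$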
Okay. But let me give you the linear example. So let's say that you have some
function, M sub t of X1 through Xt minus 1. It's an arbitrary function. The
idea is that I want to do linear learning, but I have a model of what Xt I'm
going to receive. So in some sense I have a prediction of what I expect the
adversary to do.
So in a sense, I have an expectation that the adversary is most likely going to
play this vector in the next round. Of course, he may not play that, but the
idea is that if he did play that, or if he doesn't play too far away from
that -- he plays only like [indiscernible] far away from that, he's constrained
to do that -- then you can show that the value is bounded by this.
And note that this can be any arbitrary function. It need not be of a
particular form. Examples of this are, of course: one is Mt equals Xt minus 1,
that is, he's constrained to not play too far away from the previous round.
Another is he's constrained to not play too far away from the average of the
previous rounds. And the average of the previous rounds is basically what
variance bounds are all about.
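The bound being pointed to on the slide should be, up to constants, of the
following form -- I'm writing it only to indicate its shape, so treat the
constants and the exact choice of norm as approximate:

$$
\mathcal{V}_T \;\lesssim\; \sqrt{\sum_{t=1}^{T} \big\| x_t - M_t(x_1, \dots, x_{t-1}) \big\|_*^2 }
$$

So the regret is controlled by how far the adversary's actual plays end up from
the predicted sequence, rather than by T directly.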
And so we had this result which said that, okay, you can do it with an
arbitrary Mt, and basically it's going to point to this notion of predictable
sequences. The idea is that in online learning with predictable sequences, you
have this predictable sequence -- I'm talking about linear losses. So you have
a prediction of what Xt should be, and the adversary adds some adversarial
noise. If the noise is too huge, then it's as bad as worst-case online
learning. But you want to do something better if things are better.
So there was this nice paper recently at this [indiscernible] which basically
proposed an algorithm for the case when Mt is equal to Xt minus 1. What we had
was a nonconstructive method, and we were kind of hoping that we could take the
idea from this paper and see if we can do it with any predictable sequence.
And indeed, you can do that. You can also reduce some of the extra steps that
they make. And you basically get a variant of mirror descent. So this is the
usual mirror descent, where you just kind of follow the gradient.
And if you had Euclidean spaces, it would just be the usual gradient descent.
And now the extra step is that you have a trade-off between how far you want to
be from Gt plus 1, which is the usual online learning step, versus how well you
want to optimize what you think should be the next point in the sequence.
And by choosing [indiscernible] -- I'm not going to mention how -- you can
basically choose it such that you get a good bound that looks like this. So if
your predictable sequence is good, you get a good bound. If the predictable
sequence wasn't good, if it was completely arbitrary, then, of course, you pay
like a factor of four. So you might pay like 16 times square root D, but you
still get a square root [indiscernible]. So you pay this extra factor. But
when things are good, you do better.
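To make the mirror descent variant concrete, here is a minimal Euclidean sketch
in Python. The l2-ball constraint, the fixed step size eta, and the specific
two-step update below are my assumptions -- they follow the standard
"optimistic" gradient descent pattern for linear losses with a predictable
sequence, not necessarily the exact update on the slide.

import numpy as np

def project_l2_ball(w, radius=1.0):
    # Euclidean projection onto the l2 ball of the given radius.
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def optimistic_gradient_descent(gradients, predictions, eta=0.1, dim=2):
    # Sketch of mirror descent with a predictable sequence (Euclidean case).
    # gradients:   the linear losses x_t, revealed after playing at round t.
    # predictions: the guesses M_t for x_t, available before playing.
    g = np.zeros(dim)  # secondary iterate, updated with the true gradient
    plays = []
    for x_t, m_t in zip(gradients, predictions):
        f_t = project_l2_ball(g - eta * m_t)  # lean toward the predicted gradient
        plays.append(f_t)
        g = project_l2_ball(g - eta * x_t)    # standard step on the observed x_t
    return plays

# Toy usage: a slowly drifting adversary, predicted by M_t = x_{t-1}.
xs = [np.array([np.cos(0.01 * t), np.sin(0.01 * t)]) for t in range(100)]
ms = [np.zeros(2)] + xs[:-1]
print(optimistic_gradient_descent(xs, ms, eta=0.1)[-1])

When the predictions Mt track the actual Xt closely, the first step barely
moves away from where the true gradient would have taken you, which is exactly
the "do better when things are better" behavior described above.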
And without giving much more information, you can basically do all of this in
the bandit setting, by which I mean that you don't actually get to see the Xt's
themselves. You only get to see, like, the dot product with the F that you
chose.
So here the way it's written is in terms of self-concordant functions, but you
can also do it with the entropy function, which gives you [indiscernible] on
the bandit.
And basically what you get is this term here; this is like an extra term,
because you don't know the Mt's. You have to estimate M bar. And the result
is given for any arbitrary estimate of M bar. If I can get some kind of
estimate of M bar, I can plug that in, and it depends on how bad this estimate
is in this term.
And, for instance, Hazan et al., in '09, had this bandit against benign
adversaries, which basically gives a variance bound, and you can essentially
see that what they're doing is trying to get this term to be small. And they
use this trick of [indiscernible] sampling. This term need not always be
small. So if Mt was equal to Xt minus 1, then it's not going to be small. If
it's the average, or some notion of average, you can make it small.
Okay. So the next thing -- I'll be very fast -- is: what if you didn't know
the predictable sequence, and you want to learn the predictable sequence? So
can we learn a good model for M1 through MT? So what do we mean by this? We
have a set of models, capital Pi.
For each pi in Pi, we get a predictable sequence. So these M sub t's are
actually functions of the past. Each model gives you a prediction. So, for
instance, think of investing in the stock market. We invest in the stock
market and we basically gain based on what the price Xt is on each day, and so
on. The usual experts setting.
Now, imagine that you also have these hedge funds that also give you a
prediction about each of the different companies' share values, and I want to
somehow use that. I want to use the fact that if one of these guys is doing
well, then I should be able to get away with much smaller loss.
So can we do as well as knowing the best pi in Pi in hindsight? So there is
some model that is good, but we don't know which one beforehand, and we want to
be [indiscernible] as good as that. It turns out you can actually do that:
basically, what's happening is we're running an experts algorithm with this as
the loss, which is the norm of Mt minus Xt squared. And what you can show is
that you get a bound like this. Again, the parameter here can be chosen using
the doubling trick to get the optimal bound. But now you have an
[indiscernible] over pi, like with respect to the best model. The only extra
thing that you pay is this log cardinality of the number of strategies.
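If it helps to make the experts-over-models idea concrete, here is a minimal
sketch in Python. The two toy models, the fixed learning rate eta, and the
choice of combining predictions by a weighted average are my assumptions; the
talk only specifies the squared-error loss on the predictions and that the
parameter would really be set by the doubling trick.

import numpy as np

def learn_predictable_sequence(models, xs, eta=0.5):
    # Hedge / multiplicative-weights over a finite set of predictor models.
    # Each model maps the observed history to a guess M_t for the next x_t,
    # and is charged the squared error between its guess and the actual x_t.
    weights = np.ones(len(models))
    combined, history = [], []
    for x_t in xs:
        preds = np.array([m(history) for m in models])  # each model's guess for x_t
        p = weights / weights.sum()
        combined.append(p @ preds)                      # weighted prediction to use as M_t
        losses = np.sum((preds - x_t) ** 2, axis=1)     # squared-error loss per model
        weights *= np.exp(-eta * losses)                # multiplicative-weights update
        history.append(x_t)
    return combined

# Toy usage with two hypothetical models: "repeat the last x" and "average of the past".
last = lambda h: h[-1] if h else np.zeros(2)
avg = lambda h: np.mean(h, axis=0) if h else np.zeros(2)
xs = [np.array([np.cos(0.01 * t), np.sin(0.01 * t)]) for t in range(50)]
print(learn_predictable_sequence([last, avg], xs)[-1])

The combined predictions could then be plugged in as the Mt sequence for the
optimistic update sketched earlier.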
And extensions to this, which, of course, I absolutely have no time to go into:
you can learn the predictable sequences with partial information, and there are
quite a few variants of this. One variant is when you have bandit feedback on
the loss.
So you only see the loss you suffer on the arm you pick.
Another variant is when you only see one model -- so imagine that you had to
pay for a model. You're paying these hedge fund companies to give you
predictions, and you don't have an infinite budget. You only have money enough
to pay for one of these guys. So each day, I choose one of these guys and I
pay him. I only see what his predictions are. I don't get to see what the
other companies' predictions are. So this is the other environment.
Of course, the third variant is you kind of put both of these together. And
the more interesting variant is where, I mean, if you were in the multiarm
setting, then the idea is that you don't pay for, like, the hedge fund company
to give you all predictions, but you pay for each individual prediction
separately.
So in that case, you basically want to pick an arm, and you want to ask a model
what it predicts about this particular arm. I don't get to see what it
predicted about the other arms; I pay it to tell me what it predicted on this
arm, and I pick a particular arm. And so that's the fourth variant.
Okay. So in summary, we basically started with complexity measures for online
learning using Minimax analysis. We started from the value and went to some
complexity measures that drew inspiration from empirical process theory.
And then we basically showed how to get a principled way to move from this
nonconstructive Minimax analysis to online learning algorithms. And finally, I
gave you some hints on how to build adaptive online learning algorithms and on
learning with predictable sequences.
So in terms of further directions, one of the things is: can we extend the
relaxation mechanism to notions of performance beyond regret? So things like
[indiscernible] approachability, calibration, et cetera. There are some key
differences between what we did here and the structure of the losses or the
performance measures in these other contexts.
And so I'd like to mention the reason why we're asking this: it's because, on
the non-constructive side, in terms of things like the Rademacher Complexity,
we have nonconstructive tools that can actually get you these bounds. But the
question is, just like how we took the nonconstructive tools for the usual
regret and converted them to algorithms, is there a way to convert these
nonconstructive tools into algorithms?
Another thing is learning with partial information. As I was mentioning
before, when you have partial information, there are two terms whose dependence
on Q actually starts getting coupled, and so that creates extra problems. We
have some partial results, but we still don't have what we would like to have.
And also, can we extend all this analysis to online learning with states, and
basically to a competitive ratio type scenario rather than a [indiscernible]
type scenario?
So that's it.
>> Ofer Dekel: Thank you. So questions?
>>: [indiscernible] what I've found is that in this work, you have a lot of
connections between statistical learning and online learning. And I think a
very [indiscernible]. I find that interesting.
>> Karthik Sridharan: Oh, yeah. Yeah. I guess I was making these slides
sequentially, so by the time I came here I had a limited memory -- a faded
memory of what I had. Yeah, definitely.
So basically, it's like you have the whole empirical process theory for
statistical learning, and you have almost the same story for online learning.
>>: [inaudible] you have the relaxation is based on the [indiscernible]. And
you have the other thing where you do the opposite in parts [inaudible].
>> Karthik Sridharan: Oh, the predictable sequence.
>>: The predictable sequence. You look at what you have, and just like
[indiscernible] what you have to restrict, and then you consider the worst of
the futures, like you had before. There is also a kind of [indiscernible] and
then testing on the rest.
>>: So if you didn't understand, let's release the audience. He will be here
for the rest of the week, everybody is invited to keep badgering him to
understand all this stuff.