>> Ofer Dekel: Let's start. So in our problem we're going to have a
player, and player is synonymous with maybe an agent or learner or
decision-maker. And this player plays in the world and sometimes we
call the world nature or the environment. And I'd like you to imagine
maybe an application of an algorithm that's making some automated
investment decisions for us in handling our portfolio and the world is
the stock market or maybe a spam filtering algorithm that's observing a
stream of emails coming in and it has to determine which
ones are spam, which ones are not.
That's the player world. In the machine learning world we make
assumptions that the world is simple. We assume that the world behaves
like some stochastic process, very often an i.i.d. stochastic process, and in online learning we like to say that the world is oblivious, meaning the world doesn't react to us. We have some formulations of worlds that react, but even then we assume how they react.
And I want to make the point this is often an unrealistic assumption.
When we work in the world, when we make choices, when we make decisions
the world reacts and it's often in a way that we really can't model in
any good way. For example, if you think of the spam filtering example
that I gave, as I managed to detect spam better and better, the
spammers are going to notice that and they're going to change their
strategy.
This is a little bit of an arms race going on. As I improve and make better decisions, they get better as well. If you think of the online
investment or portfolio management problem also what's the world? The
world is the thousands of other investors that are trying to make a
profit at my expense.
It's kind of a zero-sum game in some sense; I mean, they want me to lose so they can win, and they're strategic. So the world is not i.i.d.; they're trying to hurt me.
So the assumptions are unrealistic, and the problem is we don't know how to model them very well: how does the world react to my actions?
In this talk what I'd like to ask, the question I'd like to ask is how
far can I go if I assume that the world reacts to me in the worst
possible way. So in fact the world isn't a world, it's an adaptive adversary, a bad guy. He's malicious. He knows my algorithm. He knows
the pseudo code of the algorithm I am going to run and has infinite
computational power. And assume that's the world. If I can deal with
that guy, then I can certainly deal with the real world. That's why
it's an interesting question to ask. That's going to be the topic of
the talk.
So let me explain how the player interacts with the adaptive adversary.
A little protocol that goes on. A repeated game. Assume that the
player has the power of randomization. This is critical. The player
has access to random bits and the player has to perform one of a finite
number of actions. So here they're depicted by the slot machines, by
these arms. So the player has to choose which arm she pulls.
And this is a repeated game. So it goes on for a while. Let me
describe one round of the game, round t of the game. So t minus one rounds have already finished, concluded, and they've played those
rounds so they've learned a little bit about each other. The player
has learned about the adversary and the adversary has learned a little
bit about the player.
Here's round T. It starts when the adversary chooses a loss for each
one of these possible arms and immediately conceals them. So the
player doesn't see what the losses are.
And the player defines a probability distribution over the arms, draws
from random bits, and according to the distribution chooses the arm she
wants to pull. She pulls the blue arm and suffers a loss of 0.8. She
gets to see what this loss is and she doesn't get to see what the other
losses are. This is why we use slot machines. This is the idea when
you pull one of the arms you don't know what you would have gotten had
you pulled any one of the other arms. This is called bandit feedback, and the name is for historic reasons: it's because these slot machines
are nicknamed one-armed bandits. This is called bandit feedback. It's
when I only see the value that I actually suffered, I don't see the
loss I would have gotten if I pulled something else.
This is the game they played. There's another game, similar game,
which is called the full information feedback game, which is the same
thing but there's one additional step at the end of each round. And that's a step where the other loss functions are in fact revealed.
At the end of each round the player sees what would have happened had
she done something else. That's called full information. If she doesn't, that's called bandit feedback. We'll talk about these settings
in parallel.
So let's define things a little bit more formally. So this is the K
armed online learning problem. So it's a capital-T round repeated game. So let's assume that capital T, the number of rounds that's going
to be played is known in advance and agreed upon between the adversary
and the player.
The player is randomized. The adversary is a deterministic adaptive adversary. So he's adaptive: he sees what the player plays and adapts his strategy based on what the player does, and he does so deterministically. Here's a more formal way to state the game. So, first of all, the notation. The player has an action set; on each round she plays one of K actions.
And here's something that makes things simpler. We can notice that the
adversary, because we assume that he's deterministic, we can actually
have him do all his decision making before the game begins.
So, right, he has infinite computational power. He knows the code, the
pseudo code of the algorithm that the player is playing. So he can run
all these simulations beforehand and just specify to us before the game
begins how he will react to any sequence of actions played by the
player. This is exploiting the fact that the adversary is
deterministic.
So actually the adversary doesn't participate actively in the game
itself. Before the game begins, he defines a sequence of T loss
functions where each loss function is a function of the entire history.
So this is how we have him specify how he's going to react to the
player's actions.
So he defines T functions: f_t takes t actions, the history of everything that happened until time t, and maps it to a loss. Now the game begins, and only the player plays in the game actively. So for T rounds, what happens is the player chooses a distribution theta_t over the action space, draws the action, and then suffers the loss f_t defined by the adversary, evaluated on the concrete sequence that the player played. Of course the player doesn't get to see the loss functions themselves; they're kept secret from him, but they're well defined for the purpose of the formal setup. The full information feedback game has one additional step that happens here.
This is at the end of each round: the player gets to observe, for every
possible X what would have happened had I played X now. I don't get to
see the counterfactual, what would have happened if I changed my
strategy from the beginning of time, but today I get to see what would
have happened had I done something else.
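To make the protocol concrete, here is a minimal sketch of the interaction loop in Python. It is only an illustration of the setup just described; the particular constants and the placeholder loss and strategy are mine, not anything from the talk.

    import random

    K = 3        # number of arms
    T = 1000     # number of rounds, agreed upon in advance

    def adversary_loss(t, history):
        # Placeholder loss in [0, 1]; a real adaptive adversary may use the
        # entire history (x_1, ..., x_t), not just the last action.
        return ((history[-1] + t) % K) / max(K - 1, 1)

    history = []
    cumulative_loss = 0.0
    for t in range(1, T + 1):
        # The player chooses a distribution theta_t over the K actions
        # (uniform here, purely as a placeholder) and draws an action.
        theta_t = [1.0 / K] * K
        x_t = random.choices(range(K), weights=theta_t)[0]
        history.append(x_t)

        # Bandit feedback: only the loss of the chosen arm is observed.
        loss = adversary_loss(t, history)
        cumulative_loss += loss

        # Full information feedback would additionally reveal, for every
        # alternative action x, the value adversary_loss(t, history[:-1] + [x]).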
Here are two examples. Just to motivate. So bandit feedback. Here's
one little toy example: Online content optimization. So I assume that
I have a news website, and my editors or my reporters give me a bunch
of different articles. And I can only advertise three of them on the
top of the page. Which three do I show? And I want people to click on
these ads.
So I choose three to show. Some user comes to my website. I show
these three little ads, either get a click, which is great. That's a
loss of 0. Or if the guy doesn't click that's a 1. That's bad. But I
don't know what would have happened if I showed some other ads; that's a bandit example. And for full information feedback a very real example is investing in one stock. Let's say I just want to own one stock every day. I choose one stock. I buy it. I own it. But at the end of the day I know how the stock market did; I know how much money I would have made
had I invested in some other stock so I get full information in this
case. This is just a little bit of a motivation.
Okay. So now let's say I have a player, how do I evaluate the quality
of this player? So first let's define the expected cumulative loss of
the player. So what we do is we just sum up over these T rounds the
loss suffered by the player. We take expectation because the player
plays randomly. We define L of T to be the loss accumulated by the
player over the rounds of the game. But notice that this really isn't
meaningful. We didn't restrict the adversary in any way. The
adversary can assign a maximal loss to all actions. So he can inflate
my loss as much as he likes. That's not interesting. So we need to
compare to something. We need a reasonable basis of comparison. We
need to compare to some alternative which makes sense and I'm going to
use the very, very simple comparison which is to compare against
constant actions.
So I'm going to say how much do I regret not just sticking to maybe the
blue arm and just pulling that for the duration of the game. So I
choose one action. And I say my regret for not having chosen and
played this action throughout the entire game is just the amount of
loss in expectation that I suffered minus the cumulative loss I would
have suffered had I played that action again and again.
Now, my regret against the best action is, again, my cumulative loss.
The difference between my cumulative loss and the loss of the best
action in hindsight. So after the game is over, assume that all the
loss functions are now exposed. I can say if I had only played this
one best action all the time, this is what I would have gotten.
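Written out, with my own symbols for the quantities just described (the second sum plays the fixed action x on every round, so the entire counterfactual history is x, ..., x):

    L_T = \mathbb{E}\Big[ \sum_{t=1}^{T} f_t(x_1, \dots, x_t) \Big],
    \qquad
    R_T = L_T \; - \; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x, \dots, x).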
This is my notion of regret. Okay. So that kind of clears up the problem that the adversary can artificially inflate everything: if he inflates everything he'll hurt me and the competitor equally.
This is the notion of regret. What we'll try to do is bound regret, prove bounds on regret. Notice that as the game goes along the loss accumulates, the regret accumulates, and the regret can grow; the question is how fast does it grow?
So if the regret is guaranteed, if I can prove to you that it grows
only sublinearly, so maybe it grows like big O of T to the power of Q
for some Q less than 1, it means that as time goes by, the per-round gap between me and the best arm in hindsight is shrinking; I'm doing better, I'm improving. On the other hand, if I can prove to you that the regret is lower bounded by a linear function of T, it's Omega of T, I'm not learning anything. It means that round after round there's still a big gap between me and the best guy, a gap that persists for the entire game, so I'm not learning. These are the two kinds of things we want to talk about.
And then Q, the exponent here, is the learning rate. Smaller is
better. If we have a smaller Q we are learning faster.
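As a one-line restatement of this dichotomy (my own summary of the slide):

    R_T = O(T^q),\ q < 1 \;\Rightarrow\; R_T / T \to 0 \quad \text{(the per-round gap vanishes)},
    \qquad
    R_T = \Omega(T) \;\Rightarrow\; R_T / T \not\to 0 \quad \text{(no learning)}.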
So we've defined regret. We've defined the game. Can we prove some
bounds on regret? Yes. Here's the first result. It's kind of
depressing. No, we can't learn. So there always exists an adaptive
sequence of loss functions such that my regret grows linearly. It's
impossible to learn against adaptive adversaries. How do I prove this?
I can tell you what the adversary does. I can define to you an
adversarial strategy and the algorithm for the adaptive adversary that
will cause me to suffer linear regret. Here it is. He knows my pseudo
code. Before I've seen anything, he knows the distribution that I'm going to start out with. The player on each round defines some distribution over the arms, and theta one is just this distribution in the first round, before having seen anything. So he just chooses some action X hat that has positive probability under theta one. Now he defines his adaptive sequence of loss functions as follows. He ignores everything but the first action. It doesn't matter whether t equals two or three or four; he only looks at the first action. If that first action was X hat, which happens with some positive probability, then the loss is one.
Otherwise, the loss is zero. He doesn't care if I prove -- he doesn't
care if I learn, he just looks at my one first play. If that was X
hat, I will suffer a unit loss forever. After T rounds I'll suffer a
loss of T. I'll accumulate T units of loss.
There's some positive probability of that occurring. So my expected loss is T times that positive probability. What's the alternative? The best action in hindsight could be anything that isn't X hat. If I just
stuck to that I would have suffered zero throughout. So the regret
indeed is T times this constant. Very, very easy. That's the
adversary. And I can't learn.
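Here is a minimal sketch of that adversarial strategy in code, following the description above; the function and variable names are mine, and x_hat stands for any action that the player's first-round distribution theta_1 gives positive probability.

    # Sketch of the adaptive adversary that forces linear regret.
    # x_hat: an action with theta_1[x_hat] > 0, which the adversary knows
    # because he can simulate the player's algorithm before the game starts.
    def make_linear_regret_losses(x_hat):
        def f_t(history):
            # The loss ignores everything except the player's first action.
            return 1.0 if history[0] == x_hat else 0.0
        return f_t

    # If the first draw is x_hat (probability theta_1[x_hat] > 0), the player
    # pays 1 on every round, so her expected cumulative loss is at least
    # theta_1[x_hat] * T.  Any fixed action other than x_hat pays 0 throughout,
    # so the regret is at least theta_1[x_hat] * T, which is linear in T.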
So why did we bother and set up all these things? So we'll do what we
always do in computer science or machine learning. When a problem is
too hard, we just have to make it a little bit easier. We have to
restrict the adversary just a little bit. So this is the class of all
adaptive adversaries. We have to restrict it. We showed a point in
here which causes me to suffer a linear regret. I can't learn. So
here's another little subclass in there called oblivious adversaries.
Oblivious adversaries are a very special case of adaptive adversaries: they adapt in a trivial way, which is that in fact they do not adapt at all. So let's define that. Oblivious adversaries are adversaries
that don't rely on the past. We'll still use the same notation, the
function still takes the entire history, but it ignores the first T
minus 1 actions in that history, just looks at my current action.
So formally for any sequence of T actions and any other prefix of the
sequence of T minus 1 actions, if I take these first T minus one
actions replace them by the other guys, the last one is the same, then
the loss is the same. This is an oblivious adversary. And this is in
fact the standard assumption in 99 percent of papers in online
learning. We like to assume that the world is oblivious.
We can use the convenient shorthand, since we only depend on the last
guy, we can write F, give it that one argument. And then our
definition of regret simplifies to this simpler thing. You'll see this
in most papers. Most papers will use this kind of abbreviated notion
of regret; but just keep in mind that if we don't assume that the
adversary is oblivious, then this doesn't make sense. We have to refer
to the more complicated one that uses the full expressive form, and
you'll see actually quite a few papers that talk about adaptive
adversaries and use this notion of regret. You should be very
suspicious of those results. This only makes sense in the oblivious
case.
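In symbols (my transcription of the definitions just described): the adversary is oblivious if for every round t, every history x_1, ..., x_t and every alternative prefix x'_1, ..., x'_{t-1},

    f_t(x_1, \dots, x_{t-1}, x_t) \;=\; f_t(x'_1, \dots, x'_{t-1}, x_t),

so one may use the shorthand f_t(x_t), and the regret simplifies to

    R_T \;=\; \mathbb{E}\Big[ \sum_{t=1}^{T} f_t(x_t) \Big] \; - \; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x).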
To again give an example, when is it reasonable to assume the world is
oblivious, let's say I'm investing a thousand dollars in a stock, no
one is going to care. The world isn't going to change by my little
investment. If I'm investing $100 million in a stock, the stock market
is going to react; it's going to change. Big institutional investor
cannot assume that the world is oblivious to his actions, can't assume
that he's some negligible little speck of dust that doesn't affect the
world. His actions will cause the world to change and affect his
losses in the future. I don't know if this is true -- I didn't know
whether to say a billion -- I've never invested $100 million, so I
don't know.
So we have oblivious adversaries. This is online learning. This is
impossible. Let's define some intermediate steps in between. The
first one I want to define is the switching cost adversary. What's
that? It's more general than the oblivious, less general than the
adaptive. Oblivious adversary with switching cost is like an oblivious
adversary but he charges me an additional cost of C, some constant
switching cost whenever I change my action. So whenever my action on
time T is different than my action on T minus 1, I pay an additional
loss of C. Or, in other words, he defines this oblivious sequence
which depends only on the current action and then the real loss is that
oblivious loss plus C times the indicator function of did I change, did
I swap arms. And again the example is that maybe I'm investing a
little bit of money in the stock market so the stock market is
oblivious, but maybe I pay some transaction cost. Go on the Web, pay a $7 commission for each trade; you don't want to hop around every day, because every switch from one stock to another costs a little bit of a penalty. Notice this loss relies on the current action and the
previous action. So we say it has a memory of one; whereas in that
counterexample, remember that, he remembered all the way, every
iteration. He penalized me for the very first thing I did.
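Spelled out, the switching-cost loss described here has the form (my rendering of the slide, with \ell_t the oblivious part and C the constant switching cost):

    f_t(x_1, \dots, x_t) \;=\; \ell_t(x_t) \;+\; C \cdot \mathbf{1}\!\left[ x_t \neq x_{t-1} \right].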
So switching costs. They have a memory of one. So they are a special case of adaptive adversaries that only look at my current action and my previous action, which are in turn a special case of adversaries that have a finite memory, that can look M steps back in time but can't look all the way infinitely back in time. This is the little hierarchy I
want to talk about.
Just to mention -- I mean, most of machine learning is here. Most people like to assume this i.i.d. thing, online learning assumes the oblivious thing, and I want to talk about the rest of this hierarchy.
Now we have a table. When theoreticians see empty tables, they get
excited, sparkle in their eye. We have to fill in the values of this
table, here are different types of adaptive adversaries. We want upper
bounds and lower bounds on regret, in both the bandit feedback and the full information case.
I have to make some technical assumptions, just to fit everything in a
table, because I'm going to refer to previous work in other papers and
sometimes the assumptions are a little bit different. So I won't go
through the details of this but I'm just going to make two very general
assumptions. One is bounded range which means that whether I choose
this action or that action at time T, the difference isn't going to be
unbounded. There's some bound on that. And bounded drift means if I
play action X today or tomorrow, again there's not going to be a huge
gap between them. So these are necessary. You can go without them but
let's skip that.
An upper bound of T is trivial. I said everything is bounded. If I play any action, the regret I can accumulate in each round is bounded by one
or some constant. So if I play T rounds I'm going to have a regret of
T.
We proved a few slides ago that in the fully adaptive case I can't do better than that. Here I have a perfect characterization: when these
two things are equal I understand it. So I understand this so far.
Now I can take all the previous work done in this field. So a bunch of papers. They prove a lot of things. Again, online learning, or classic online learning, is what we understand the best. Here we have the best characterization: we have square root T, sublinear growth of the regret. We do have nice, quick learning, and the upper and lower bounds match, both in the bandit and the full information case. So this was the
state of the art when I started working on this problem.
Based on ten years of past work on online learning. And then the nice
thing about this table is that when one problem is a special case of another problem, the lower bound of course still applies. Lower bounds propagate this way and upper bounds propagate that way. If I could get T to the two-thirds against any guy with bounded memory, I could certainly get T to the two-thirds against unit memory; whenever we have a result here, it propagates this way.
So this is the state of the art.
I want to talk about two results. One is a result I had last summer with Raman Arora and Ambuj Tewari. There we used a blocking technique to give an algorithm that achieves T to the two-thirds against bounded memory adversaries in the bandit case, and, bam, bam, it propagates to the left. Blocking just means we take a standard off-the-shelf online learning algorithm and run it in blocks: it chooses some action and sticks to it for the duration of a predefined block, and only when a block ends can it change to a different action.
So, I mean, intuitively that would reduce the number of switches, but we can also prove that it actually works against memory-bounded adaptive adversaries. This is the first result; a rough sketch of the blocking idea is below.
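Here is a rough Python sketch of the blocking idea. It is schematic rather than the paper's exact algorithm: base_algorithm stands in for any off-the-shelf bandit method with hypothetical choose() and update(loss) methods, adversary_loss is the bandit-feedback loss oracle, and the block length B is a parameter that the analysis tunes, roughly on the order of T^(1/3).

    def run_in_blocks(base_algorithm, adversary_loss, T, B):
        # Blocking meta-algorithm (sketch): commit to one arm per block.
        history = []
        t = 0
        while t < T:
            arm = base_algorithm.choose()          # pick an arm once per block
            block_loss = 0.0
            for _ in range(min(B, T - t)):
                t += 1
                history.append(arm)                # play the same arm all block
                block_loss += adversary_loss(t, history)
            # Feed the whole block's loss back as if the block were one round.
            base_algorithm.update(block_loss)
        return history

    # With block length B of roughly T^(1/3), the player switches at most
    # T / B (about T^(2/3)) times, and the analysis converts the base
    # algorithm's ordinary regret guarantee into a T^(2/3) policy-regret
    # bound against bounded-memory adversaries.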
And now very, very recently, so hot off the presses, this is just this summer, with Nicolò Cesa-Bianchi, who came as our visitor to Microsoft Research, and Ohad Shamir from New England -- sorry, he's now in our group; he used to be in MSR New England and is now moving to our group. So this is yet unpublished work. And we prove this lower bound. So it's a lower bound on the number of switches. So again we're looking at the bandit
feedback case. And we're asking how much does an oblivious adversary penalize me for the number of switches, how much do I have to pay, and we showed that in order to get something better than T to the two-thirds regret on the oblivious part of the loss, we have to make at least T to the two-thirds switches.
So this is how we proved that. And, again, we have this magic
propagation to the right. And there's also this nontrivial propagation of why this implies that. It's very simple, but I can't go into it right now. So how do you prove these lower bounds? I want to say one
word about that. I have to prove the existence of an adversary.
Remember when we showed this lower bound, I just told you what that
adversary is. I don't know what this adversary is, but I have to prove
that he exists. So we used a very convenient technique called the
probabilistic method. There, if you want to prove something exists,
the way you do it is you define some probability distribution. Here it's a distribution over adversaries, or over sequences of loss functions, and you show there's a positive probability, when you sample from that
distribution of getting a sequence of loss functions that will inflict
this much damage. If there's a positive probability of finding that
guy, then he must exist. If he doesn't exist, then there's probability
0 of finding him when I just want to sample randomly.
All we have to do to use this technique is define what the probability distribution is, and the one that we just used is a Gaussian random walk. We define two arms whose losses hop around according to a Gaussian random walk, do some theory, and prove that that's enough to show that with positive probability we'll get a guy that will hurt us that much.
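For intuition only, here is a tiny sketch of the kind of randomized construction being described: two arms driven by a shared Gaussian random walk, with a small hidden gap on a randomly chosen arm. The step size and gap are illustrative placeholders, not the values used in the actual analysis, which is considerably more delicate.

    import random

    def sample_random_walk_losses(T, sigma=0.1, gap=0.01):
        # Sketch of the probabilistic-method construction: both arms share a
        # Gaussian random walk, and one randomly chosen arm is better by `gap`.
        better_arm = random.randrange(2)
        walk = 0.5
        losses = []                 # losses[t] = (loss of arm 0, loss of arm 1)
        for _ in range(T):
            walk += random.gauss(0.0, sigma)
            pair = [walk, walk]
            pair[better_arm] -= gap  # hide a small advantage on one arm
            losses.append(tuple(pair))
        return better_arm, losses

    # The probabilistic method then shows that, with positive probability over
    # this draw, any player that switches too rarely must suffer regret on the
    # order of T^(2/3); since that event has positive probability, such a hard
    # loss sequence must exist.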
So this is now the current state of the art. You can see that it's all
tied up here. So we understand, we understand what happens with switching; we have upper and lower bounds on regret that match up for memory one and for memory M. We have one little glitch here, which is in the full information feedback case: we're not exactly sure what the situation is when you have a memory of one. But when you have a memory of two or more we already understand it perfectly. So we almost have a perfect characterization of what's going on. The super interesting thing that just pops out here is this. You can see that there's a fundamental difference between learning with bandit feedback and learning with full information. And this isn't trivial, because maybe intuitively it seems trivial, but in what we did before they always looked the same. Somehow the bandit result was always a few years' delay until the paper came out, but you often saw that the regret behaved the same in the bandit and full information settings. Here we have a very clear indication that the bandit case and the full information case are not the same.
Let me summarize. The world reacts to our actions. We ask the
question how far can we go under the paranoid assumption that the world
is out to get us. It reacts in the worst possible way. We model this
problem as online learning against an adaptive adversary and then we
derive an almost complete characterization of policy regret in the
setting and we got this very, very interesting result which shows that
bandit feedback and full information feedback have a qualitative difference between them.
And that's it.
[applause]
Thanks.
Are there any questions?
Pedro.
>>: So what you said brings up [inaudible] spam in the [inaudible] of
spam your move is actually not classified particular spam but a whole
filter in the case of link spamming your whole function your link is
not -- there's a very large number of possible moves right there. And
your learning is choosing that function.
>> Ofer Dekel: Yes, I made things very easy for myself. This thing
completely ignores the dependence on the number of arms and certainly
none of this works if the number of arms is infinite. This is -- I
have research on this but it wasn't presented in this talk.
>>: That's my question, what happens in that case?
>> Ofer Dekel: So I mean each result is different. They all have a
polynomial dependence on the number of arms. Is it logarithmic or
square root or linear. We have to go into each one of the results and
kind of refresh our memory about it. Here I really just cared about
how these things developed with time. But definitely the dependence on
number of arms is an active, interesting field of research that I'm
engaged in. Another one is well I just compared against constant
actions, that kind of sucks, can we find more interesting comparison
classes and certainly we can do that. We can define structured comparators, we can define deterministic Markov decision processes, we can compete against constant actions that occasionally are allowed to switch, and so on. So we have all those results but I just didn't have
time to go into them.
We certainly don't have this beautiful kind of complete picture there.
>>: Two questions which are about the same. So the first one is, what's the dependency on M? Because do you lose linearly on M --
>> Ofer Dekel: Yeah, do you?
>>: Lose linearly on M.
>> Ofer Dekel: Yes, you lose -- yes.
>>: So when -- lower bound?
>> Ofer Dekel: Lower bound, I don't know. The lower bound is two
arms. We just need to show an adversary that gets this dependence on T, one with two arms. I don't think we really thought the M dependence through, but this upper bound depends on M linearly, and linear is good, because when you take MDP, kind of reinforcement learning type of algorithms, they will sometimes have an exponential dependence on the history, because the number of states is the number of actions to the power of the history length. So you're linear in the number of states, which is exponential --
>>: Much more complex there. The policy space is exponentially large; not a fair comparison in that sense.
>> Ofer Dekel: You're right. You're competing against policies, not fixed actions. That's why it's a higher bar.
>>: So the other question is do you have like a linear setting bounds
for this memory, when you have memory, so for the featurized case, I
guess [inaudible] this question like linearly bounded [inaudible].
>> Ofer Dekel: These results -- so, okay, so you're asking me -- so
here we had a discrete choice type of setting. We had one of a finite
number of actions what happens -- many of the upper bound -- these
upper bounds work for pretty much any online learning setting. The
lower bound is very, very specific to two arms, which is good enough
for this. And it's also very recent work. I don't think we've thought
about how this affects the linear case, but the upper bounds always work.
Yeah?
>>: I have a question about is the assumption about [inaudible]
situation [inaudible], for example, I'm analyzing click through rate
and I wanted to use choose optimal advertisement. And user calculated
also gives me one [inaudible] and let's say he finds [inaudible] in the
third position in the second position. You can count the strategy
affect the strategy to second position.
So he's not i.i.d., but he is going to get maximum click through rate. So through this area to improve i.i.d. -- can he calculate it with you?
>> Ofer Dekel: I don't know. I'd have to think about it. Let's take it off line and chat about it. I'm not fully sure who cooperates with who. In this world it's me against the world, everyone's against me and I'm running away. Let's definitely take it off line.
So now I'm putting on my organizer hat. So we have lunch. And here's
how it's going to work. Is there something I need to know?
>>: [inaudible] the option today we have those actually set aside out
front, up here, so if you want people to go this way [inaudible] and
then know what the options are.
>> Ofer Dekel: So here's how it's going to work. So we have three stations that are
giving out the food. One is right behind this wall. It's out these doors and to the left.
Another one -- the other two are in the two rooms that are just opposite here. So if
we go around this way or that way you'll see there's two other smaller classrooms set up
with tables and there's food there. What I'd like to ask if you're sitting on this side of
the room, if you could please go out these doors and then either take a left or a right
and get to that room, if you're sitting on that side of the room if you could exit
through those doors and go around. You could go to the first or second room. The
atrium of the building is open to us. So if you -- once you take your food you can
either sit in any of these rooms anywhere you see a chair you can sit down you can walk
through the glass doors there's an atrium, there's security guards that will prevent you
from going where you shouldn't be going. If you find yourself in a headlock or handcuffs,
you'll know you've gone too far. Please go wherever you like. There are soft drinks at
the end of this corridor. And bon appétit.