
>> Yuval Peres: All right. Good afternoon, everyone. Welcome to theory
day. We have I'm sure a bunch of great talks coming. In fact, when I -- so
we have a combination of some [inaudible] veterans and some young researchers
and postdocs. And I told our postdocs that they should not feel too
pressured -- that if the talks don't go well, then in a few years they can
recover from it. Somehow this reassurance didn't work so well because I saw
them at 11 p.m. yesterday practicing their talks.
So, anyway, I'm sure they'll all be fantastic. And we're starting with a
talk by Bobby Kleinberg on incentivizing exploration.
>> Bobby Kleinberg: Thanks so much, Yuval, for organizing this workshop and
inviting me to present this paper. I'm going to be telling you about some
work I did on incentivizing exploration with Peter Frazier, David Kempe, and
Jon Kleinberg.
This picture you might be wondering about. This is Christopher Columbus
receiving his gifts from Ferdinand and Isabella for discovering North
America. So this is a depiction of how incentivizing exploration used to
work at some time in the past. We had all-powerful monarchs who would
command their loyal subjects to go out and explore. The subjects would
obediently do it, and then, if successful, they would earn their rewards.
You know, in the online world, also there are many powerful entities that
depend on their subjects now called users to explore a world of possibilities
for them, you know, so Amazon, for example, is a store where you can buy
almost anything, you can read reviews of almost anything, but it's not as if
Amazon employs full-time reviewers to figure out which products are good and
which ones are bad. They just depend on the autonomous activities of users
going about their business.
You see the same pattern repeated in many, many other contexts, both in the
commercial world and elsewhere, you know, social news readers need to
recommend articles to you, but in order to do it, they need you to recommend
articles to them. There are these collective citizen science efforts where
amateurs at home with their telescopes are mapping the sky, but there's no
global governing organization that can tell them which part of the sky to
aim their telescopes at. And it's a bit of a stretch, but you could even say
that the same types of issues play out when we talk about things like
national funding of scientific research. We have an organization like the
NSF that may have priorities for what research projects it wants people to
undertake. But in the end the research is being done by individual
scientists who will pursue their own aims that may align with but are not obedient
to the dictates of the funding agency.
In all these situations, what we have really is a misalignment of incentives
where we have a principal whose goal is to explore a broad space of
alternatives and collect information about all of them, and we have
individual users whose goal is to select the best alternative for them.
To state it in a more pithy way, we have a principal whose goal is to explore
and individual users whose goal is to exploit. Stated in this way, it's very
tempting to model this dilemma of misaligned incentives as some type of
multi-armed bandit problem. So in the next slide I'm going to introduce to
or recap for you what is the multi-armed bandit problem.
And the version of the problem summarized on this slide is what generally
gets called the Bayesian multi-armed bandit problem. So this is a problem
where there are K alternatives, conventionally called arms. The strange name
stems from the metaphor of K different slot machines, each of which has an
arm that you can pull and get some random payout.
And we'll assume that the payout distribution of one of these slot machines
is determined by some unobservable parameter called its type. And so the K
arms have independent random types but the type itself is unobservable. All
you can do to find out the type of an arm is to pull it, observe one random
sample from its distribution, and as you accumulate more and more samples,
you accumulate more and more certainty about what its type is.
Okay. In the model of decentralized crowd sourced exploration of a set of
alternatives, I'm going to assume that you have a sequence of users -- in
this talk it will be modeled as an infinite sequence -- indexed by time. So
the user that comes in at time T will participate in one and only one round
of decision-making in which it chooses one of K arms. I'll call it I sub-T.
Think of this as a user coming to amazon.com to buy an 8-megapixel camera and
there's some finite set of 8-megapixel cameras that are sold at Amazon, and
the user selects one of them. Afterwards it observes how good that alternative
was,
maybe on a scale of one to five stars, and reports what reward it got from
pulling that arm.
The sequence then repeats in the next time step. And in this paper, we're
observing -- we're assuming that the history is fully observable. So the
user that comes in at time T can see all of the star ratings that were
provided in all the time steps 1 through T minus 1.
And then to reiterate with a bit more formalism, what I said on the previous
slide, the principal's goal is to maximize the long run average of people's
payoffs. I'm going to do the standard thing in this field and assume that
there's some geometric discount factor gamma less than one, and payoffs that
people get T time steps in the future contribute with weight gamma to the T,
into this weighted average that the principal is trying to optimize.
The user at time T, on the other hand, has the goal of just maximizing their
individual constituent of this weighted average. And so the policies that
would maximize this quantity and this one respectively I call the optimal
policy and the myopic policy. So the optimal policy chooses a sequence of
ITs that maximizes the expectation of this average reward. The myopic policy
is a completely greedy algorithm that in each step T chooses the IT that
maximizes expected RT given the history encoded in the public log file of the
past T minus 1 observations.
What's known about these two alternatives? So the optimal policy in
principle could be very, very complex. It's a function that maps every
possible history of T minus 1 steps into a decision at time T. So there are
exponentially many possible histories. So encoding the truth table of this
policy could potentially take doubly exponential space.
And, in fact, for a long time this problem of finding the optimal policy for
the multi-armed bandit was thought to be so hard that statisticians who were
working on it on the Allied side during World War II were joking that if we
wanted to defeat the Germans, what we should do is we should drop leaflets
containing descriptions of the multi-armed bandit problem over Germany and
their scientists would become so preoccupied with this unsolvable problem
that they would get distracted from the war effort and we would vanquish
them.
So then it came as quite a surprise in the 1970s when this man, John Gittins,
came along and solved the multi-armed bandit problem with an amazingly simple
and succinct solution. So he defined a number called the Gittins index that
you can compute for each arm that depends only on the past history that you
observed for that arm. I'm not going to tell you how this number is computed
because it won't be relevant for my talk, but the optimal policy is simply at
every time compute the Gittins index for every arm, pull the one with the
maximum index. Question.
>>: What's the model of -- how are these RT distributed? Do you need to -- I
mean, if you know this or you don't know this, this could change your
approach.
>> Bobby Kleinberg: So you need to have a prior belief about how the reward
distribution of an arm depends upon its type parameter. The Gittins index
policy is generic in the sense that no matter what your prior is, it gives
a procedure for computing a score such that the optimal thing to do is to
pull an arm with the highest score. The specific function that that procedure
computes will depend on the form of your prior. So this might be easy or
hard to compute depending on how complex your prior is. For standard priors,
like a beta prior -- a common one that's used in reality is you believe that
the arm has some unknown parameter theta between 0 and 1.
>>:
Independently drawn from this.
>> Bobby Kleinberg: Independently drawn from their beta distribution. When
you pull it, you see a binary reward, which is either 0 or 1, and you do a
Bayesian update that gives you a beta distribution with different parameters.
So like that would be a typical prior that people would use in practice.
>>:
0, 1 [inaudible]?
>> Bobby Kleinberg: In the specific instantiation of the model that I just
talked about with beta distributions, outcomes are 0 or 1. For Gittins
theorem, all he needs is that there exists a bounded subinterval of the real
line, and outcomes are distributed on that interval. So the result is really
quite general.
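
(To make the beta-Bernoulli example concrete, here is a minimal sketch --
illustrative, not code from the paper -- of the update just described. The
class and method names are hypothetical.)

    # Minimal sketch of the beta-Bernoulli setting described above: each arm
    # has an unknown success probability theta drawn from a Beta prior, a pull
    # returns a 0/1 reward, and the Bayesian update just increments the Beta
    # parameters. The myopic score of an arm is its posterior mean; the
    # Gittins index is a different, more involved function of the same
    # (alpha, beta) state.
    import random

    class BetaArm:
        def __init__(self, alpha=1.0, beta=1.0):
            self.alpha, self.beta = alpha, beta            # Beta prior parameters
            self.theta = random.betavariate(alpha, beta)   # latent type, never observed

        def pull(self):
            reward = 1 if random.random() < self.theta else 0
            # Posterior update: Beta(alpha, beta) -> Beta(alpha + r, beta + 1 - r)
            self.alpha += reward
            self.beta += 1 - reward
            return reward

        def posterior_mean(self):
            return self.alpha / (self.alpha + self.beta)   # the myopic score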
Okay. For myopic users, the policy is even simpler to describe. I described
it on the last slide. It's a purely greedy policy that always computes the
posterior expected reward of each arm and pulls the one with the highest
expectation.
A lemma that we prove in our paper, whose proof I'll skip in the interest
of time, is that the value of the myopic policy is always at least 1 minus
gamma times the value of the optimum. And this multiplicative factor 1 minus
gamma can't be improved in the worst case.
>>:
Gamma is the discounting [inaudible]?
>> Bobby Kleinberg: Gamma is the discounting factor. That's right. So if
the principal is very patient, gamma is close to 1, it might be .999, and
then this is a 1 over 1,000 factor, which is not very desirable.
If the principal is very impatient -- so let's consider gamma equals one
half. This is a principal who is so impatient that a reward of one in the
present day is as good as getting 0 in the present day and one every day from
tomorrow until eternity.
And so in -- if you're discounting so steeply that the present period is
worth as much as the future combined, this says that being myopic is half as
good as being optimal, which is actually not so surprising. Right? Myopic
is doing exactly the best thing you could do in the present, and the
present accounts for half of all the value you could get over time. Okay.
So the proof of this lemma is not quite as brief as I just made it out to be
because, you know, neither of these policies is getting a stationary reward
sequence whose expected value in each time period is the same. The expected
values are probably increasing over time.
So you need to do a little bit more work to prove this approximation. But
trust me that it's valid. So the trouble is that when the principal is very
patient, this approximation factor is very close to zero and we want to
know if we can do something better. To do something better, we're going to
introduce a new feature into our model. We're going to allow the principal
to pay the agents a bonus for exploring alternatives that are not myopically
optimal for them. So, you know, in the simplest implementation of these
bonus payments, you might just post a sign that says, you know, here -- for
arms 1 through K here's the bonus that you would get if you were to pull each
one of them right now. That changes the user's decision problem so that
they're maximizing posterior expected reward plus bonus.
And in these applications that I talked about, you presumably would not put
up a sign that literally says I'm paying you to do suboptimal explorations
and here's how much I'm paying you. But you might instead, for example, if
you were Amazon, just silently offer a discount on the eight-megapixel camera
that does not yet have as many reviews as its competitors. And in that way
you might hope to accumulate a more diverse set of training data without the
users knowing that you were making them do your experimentation for you.
And some of these other environments like social news readers where they're
not actually exchanging money with their users, then you can think of these
payments as being in some kind of artificial virtual currency like your
reputation points of the user of the system. And indeed you often find on
these systems that they give their users reputation points, and when you
reach certain milestones, you get a badge and other people can see your
badge. And there's a research literature that I have not contributed to, but
that people like my brother are quite interested in, having to do with how to
design these virtual reward systems to encourage the maximum amount of
participation.
Okay. Now in my model where we have this publicly observable log file of
every transaction that's ever taken place, if the principal and the users are
correctly doing Bayesian updates on the evidence of that log file, they will
always have the same posterior beliefs about the arms at all points in time.
Which means that I'm -- if I'm a principal that's trying to get somebody to pull
arm I instead of arm J, I know exactly how much bonus I need to pay them to
bridge the difference in expected rewards between those two alternatives.
So an equivalent description of what the principal is trying to do is it can
adopt any policy it wants for selecting which arm to play at time T based on
past history. But if it makes a selection other than the myopically optimal
arm at any time, it needs to pay this much cost to bridge the gap in the
user's utility between doing what's myopic and doing what the principal is
asking.
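
(A small illustration of that bookkeeping -- a hypothetical helper, not from
the paper -- assuming arms expose a posterior_mean() as in the sketch above:)

    # With a public log and a common prior, principal and users share the same
    # posterior, so the bonus needed to move a myopic user from the myopically
    # best arm to a target arm is exactly the gap in posterior expected reward.
    def required_bonus(arms, target):
        myopic_value = max(arm.posterior_mean() for arm in arms)
        return max(0.0, myopic_value - target.posterior_mean())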
Okay. This is a good -- so that specifies the model of the problem that
we're investigating. This is a good time for me to take a break and tell you
about a bunch of very interesting related work, much of it coming out of MSR.
That has to do with models that share the same motivation of incentivizing
users to explore a space of alternatives for you but don't make use of this
assumption that you can pay people to pull arms that are not myopically
optimal for them.
So in the absence of that assumption, what other mechanisms do you have for
manipulating their behavior? Well, you could withhold certain information
from them. So in my model I'm assuming you can see the full history of every
transaction that's ever happened in the past.
But in many recommendation systems they don't show you every review that's
ever been produced, they just sort of say here's one or a couple of
alternatives that we're recommending to you and, you know, maybe here's some
small amount of evidence why we think this is a good recommendation. Think
of Netflix, for example. Netflix doesn't make public the set of all star
ratings that all of its users have ever given to movies.
Okay. So in those models, an exemplary one being this 2013 paper of Kremer,
Mansour, and Perry, which with apologies to Yishay who's in the audience, I
have the conference citation, but I think it's now in some journal ->>:
JPE.
>> Bobby Kleinberg: JPE, yes. So not just some journal but a top five one.
They have a multi-armed bandit problem where the rewards are privately
observed and the principal controls the information channel by which
information about this past history is funneled back to the users.
And their paper for the most part focuses on the case of two arms, one of
which is a priori better than the other and which are collapsing in the sense
that after you pull the arm once and observe a reward, you have no remaining
uncertainty about its type. So your prior collapses to a point mass
distribution the very first time you pull the arm.
And in nonstrategic settings, it's trivial to design an optimal policy for
these problems, but in the strategic setting where you have to give people
advice and they have to be willing to take your advice, it turns out to be
quite challenging to solve for the optimal policy even in this setting.
Their paper is primarily devoted to doing that. They have some follow-up
results on -- dealing with a greater number of arms. But then there's this
paper from two years later where Yishay, Alex, and Syrgkanis did an extension
from two to many arms; they allow for much more general prior distributions
over the arms.
And they give a policy which, while not being optimal, achieves a regret that
has the optimal scaling in terms of the number of time steps up to constant
factors. I haven't defined regret in this talk, but for non-Bayesian
analysis of the multi-armed bandit problem, this is sort of the gold standard
for how you evaluate the quality of algorithms.
And in this discussion of related work, I skipped over another important
reference, which is this working paper of Yeon-Koo Che and Johannes Horner,
two economists who are looking at a continuous time model that's very similar
to the Kremer, Mansour, Perry discrete time paper and obtaining qualitatively
similar conclusions. Okay.
As I said, all of these papers are on mechanisms without money that try to
achieve the same effect as the mechanism in our paper. But now let me return
to our model in which rather than withholding information about the past we
are paying people side payments to get them to do things that are not greedy
in the present time step.
So to state our main results, I need to define a term called the achievable
region. And the way to think about it is that if you're the principal in
this problem, there are sort of two measures of costs that you want to
simultaneously minimize. One is the opportunity cost relative to the first
best policy, the Gittins index policy that you would use to explore this
space of alternatives in a perfect world where everybody's incentives were
perfectly aligned.
And I can normalize this to be a number between 0 and 1. So if we scale all
of the rewards so that the expected geometric discounted reward of the
optimal policy is exactly one and the principal instead adopts some
suboptimal policy that gets 1 minus A, then I'm going to call that number A
the opportunity cost and plot it on the X axis. Okay. And, you know, the
other thing the principal would like to minimize is the amount of bonuses
that it has to pay out to the user. So I'm also going to express that as a
multiplicative factor B times the value of the optimal policy. We now have
two parameters A and B that we'd like to drive both of them down to zero, but
in general you can't do that.
Okay. And so just to make sure the meaning of the two axes is transparent to
everybody, let's take that result that I presented a few slides back that
said the value of the myopic policy is at least 1 minus gamma times the value
of the optimal one.
Okay. So the myopic policy never has to pay anybody any bonuses. So it's at
0 on the Y axis. And the theorem says that it's getting at least 1 minus
gamma times the optimum. So that says that on the X axis it lies somewhere
between 0 and gamma.
Okay. So that's one example of plotting a point in the achievable region.
So, oh, right, I forgot to define achievability. So if a policy satisfies
these two inequalities, we say it achieves the loss pair (A, B). And then we say
that a loss pair is achievable, maybe I should say universally achievable, if
for every instance of the multi-armed bandit problem, no matter what priors
you have on the arms, there always exists a policy that achieves that loss
pair.
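
(In symbols, writing OPT for the expected discounted value of the optimal
Gittins policy, the two inequalities on the slide are presumably of the form

\[
\mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r_{t}^{\pi}\Big] \;\ge\; (1 - A)\cdot \mathrm{OPT},
\qquad
\mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, c_{t}^{\pi}\Big] \;\le\; B \cdot \mathrm{OPT},
\]

where r_t^pi are the rewards and c_t^pi the bonus payments under policy pi.)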
So the previous result with the 1 minus gamma in it says that the point (gamma, 0) is
universally achievable, and the objective of the paper is to map out the
entire achievable region and not just its X intercept.
>>:
Is this an expectation [inaudible]?
>> Bobby Kleinberg: This is -- yes, yes, yes. Very important question.
Both of these should be interpreted as expected values. In a lot of these
problems it would be interesting to be able to solve, for example, for the
maximum expected reward under a hard constraint on the payment. Right? Like
it -- you know, I give you a budget of $100,000 of bonus payments that you
can pay out to your users, and I don't want you to satisfy it just in expectation.
You should never exceed the budget. We don't know how to solve that problem,
but I think it's a really interesting one.
Okay. You know, once we had formulated the problem this way, one thing that
struck me as really appealing about it is that this is a model that in some
concrete sense lets you plot or depict the exploration versus exploitation
tradeoff inherent in multi-armed bandit problems. So I've written a lot of
papers in my life about multi-armed bandits. I can't tell you how many times
I've stood in front of a projector screen saying you should think of the
multi-armed bandit problem as a theoretical construct that abstracts the
exploration-exploitation tradeoff that decision-makers often face.
Okay. But here's a model where if you want to explore, you have to pay
people to do the exploration. So the cost of exploration is very nicely
encoded on the Y axis. And if you let people exploit, they do something
that's in general suboptimal, and so the cost of allowing people to exploit
is nicely encoded on the X axis. And so the shape of this curve is the shape
of the exploration-exploitation tradeoff curve for the multi-armed bandit
problem.
Okay. So once I like had conceptualized our problem in that way, I became
very curious to know what the shape of this curve is. And our main theorem
tells you exactly what the curve is. So the achievable region is the set of
all pairs A and B that satisfy this inequality. The square root of A plus
the square root of B is greater than or equal to the square root of gamma.
Let's pause and reflect on what this result tells you about the incentive
dilemma. Okay. So the first qualitative takeaway from this main theorem is
that the achievable region is convex, closed, and upward monotone; meaning
that if I have any A, B and I increase one or the other coordinate, I remain
in the achievable set.
Okay. Except for being a closed set, these other two properties are actually
obvious in hindsight. If I have a policy that achieves A, B and another one
that achieves A prime, B prime, for example, I can achieve their midpoint by
tossing a coin at time 0 and deciding whether to use policy 1 or policy 2.
The achievable region is set-wise decreasing in gamma. As I increase gamma
towards 1, the set shrinks. This is also consistent with our intuition
that as a principal becomes more and more patient, its incentives are less
and less aligned with those of the myopic users, and so it should become
harder and harder to achieve the points that are close to zero.
A more interesting qualitative takeaway is that there are certain points,
even ones that are not very far away from zero, that belong to the
universally achievable region no matter how patient the principal is. So any
time root A plus root B is greater or equal to 1, even as the principal
approaches infinite patience, it still remains possible to achieve that loss
pair.
So you could state this more interpretably as saying the point (0.1, 0.5)
corresponds to the statement that the principal can always run their system
at 90 percent
of the optimal learning policy, while giving back only 50 percent of the
surplus to the users in the form of these bonus payments. And that holds
even in the infinite patience limit.
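
(Checking that example against the main theorem:

\[
\sqrt{0.1} + \sqrt{0.5} \;\approx\; 0.316 + 0.707 \;=\; 1.023 \;\ge\; 1 \;\ge\; \sqrt{\gamma},
\]

so the loss pair (0.1, 0.5) satisfies the achievability condition for every
discount factor gamma less than 1.)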
A final thing that I should say, although I forgot to put it on the slide, is
that I told you about the result that the X intercept of this region is at
gamma comma 0. I said that that's a pretty easy lemma, although I skipped
the proof. The result says that the Y intercept is also at zero comma gamma.
As far as I know, that's a hard result. And even proving that the Y
intercept is finite, as far as I know, is a hard result.
>>:
[inaudible]?
>> Bobby Kleinberg: No. No. So the Y intercept means -- so the opportunity
cost is pinned at zero. So you have to run the Gittins index policy and pay
them whatever it takes to get them to keep doing what the Gittins index
policy wants them to do.
There's no reason -- so you keep paying them the difference between the
myopically best and the best according to the Gittins index arm. There's no
reason that those payments have to be bounded above by the value of the arms
that Gittins is pulling.
I guess maybe if I could depict it using a whiteboard marker that works. You
have the payoff sequence of the optimal policy which presumably looks like
this. And you have the payoff sequence of the myopically best arm, which is
also increasing. It must eventually increase more slowly. If it was better
than this one out to time infinity, then this one wouldn't be optimal.
So these two eventually cross each other. But initially this one might be
way more than twice as high as that one. Might be a thousand times as high.
So the gap -- so the -- right. The policy that achieves the Y intercept is
paying for these gaps and the optimal policy is only gaining this amount.
And so if you chop at any finite initial time, there's actually no bounded
approximation time factor between what the optimal policy is getting and how
much you have to pay people to play it. It's only if you let them go out to
infinity and take advantage of this geometric averaging that the
approximation kicks in.
Okay. I want to devote the remainder of my talk to giving you some sketch of
how we prove this result. It's an if and only if. So there's an
unachievability part and an achievability part. And the one that's easier to
talk about is the unachievability. So let me do that first. There's a
particular type of stereotypical hard instance of incentivized exploration
that we call Diamonds in the Rough. And this is searching through a bunch of
risky alternatives that are probably worthless but have a huge upside if
they pay off, while there's an outside option that's a safe bet that
everybody would rather do instead.
Okay. So to link this back to the citizen science example where, you
know, like you're trying to get birdwatchers to go out and record
observations of what birds they see in order to collect ecological data.
This is like everybody in Ithaca wants to go to Sapsucker Woods and look at
the birds there because it's the most beautiful place to go birding. But
maybe if someone would go and sit next to the airport runway for a while and
watch the birds there, we would actually learn some very ecologically
relevant information about the ways that air traffic interferes with bird
migratory patterns.
So you want to somehow get people to do these risky but potentially very
valuable bets, but none of them want to do it on their own.
Okay. So how does that work quantitatively? We're going to have infinitely
many collapsing arms, each of which is -- you know, think of it as like a
sealed envelope that the first time you pull the arm you see a payoff which
completely reveals its type. Either it's an arm that always gives you some
high payoff, capital M, or it's an arm that always gives you 0.
The probability of giving the high payoff is 1 over capital M times 1 minus
gamma squared. This magic number is chosen so that if you spend your whole
life opening these sealed envelopes until you finally find the one that
yields the big payoff and then you always play that one, the expected value
of that policy is normalized to be equal to 1.
And there's an outside option with a fixed, known payoff, and this is
normalized so that if you always play that one, then your geometric
discounted reward is going to be phi. Okay. So there are two obvious
policies to pursue here. I
just told you what they are. One of them searches through the blues until it
finds a winner; the other one always picks the yellow.
>>:
Your phi is less than one?
>> Bobby Kleinberg: Phi is less than one, otherwise the problem is not
interesting. Yeah. So the optimal policy gets a reward of essentially one.
What cost does it pay? It pays the difference between this and the a priori
expected value of one of those. And it does that from time zero to infinity,
which gets rid of the 1 minus gamma factor.
Okay. So there's an opportunity cost of zero and there's an incentive cost
of phi minus 1 plus gamma, if you worked it out for what the optimal policy
does. The myopic policy has no incentive costs and it only gets phi, which
is less than 1, so there's an opportunity cost of 1 minus phi. Okay.
And it's not obvious but it's true that the achievable region for this
instance of the multi-armed bandit problem in the limit as the number of
these blue arms goes to infinity is just an unbounded polygon with two
corners at these two points and everything in the first quadrant that sits
above them.
And so as I vary this parameter phi, I get a bunch of unachievable points.
All the white points lying below this line are unachievable. And as I vary
phi, I get a bunch of different lines and a bunch of unachievable points that
lie below them.
So the union of all those white regions gives me a bunch of unachievable
points. The theorem statement that I had on the previous slide exactly says
that those are all the unachievable points.
I want to take a second to point out to you something cute about the form of
this curve. So if you look at all these tangent lines that I derived by
varying the value of phi, you'll notice that the sum of the X intercept and Y
intercept doesn't depend on phi. It's always equal to gamma. Okay. So the
envelope that you trace out is the one that you get by starting with a point
at the origin and one on the Y axis at (0, gamma), and letting them move at
equal speed until the Y axis point drops down to the origin and the X one is
at (gamma, 0).
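
(A quick way to verify that this string construction traces out exactly the
theorem's curve: the line with X intercept a and Y intercept gamma - a is
x/a + y/(gamma - a) = 1, and its envelope over a is obtained by setting the
derivative with respect to a to zero:

\[
-\frac{x}{a^{2}} + \frac{y}{(\gamma-a)^{2}} = 0
\;\Longrightarrow\;
a = \frac{\gamma\sqrt{x}}{\sqrt{x}+\sqrt{y}},
\qquad\text{and substituting back gives}\qquad
\frac{(\sqrt{x}+\sqrt{y})^{2}}{\gamma} = 1,
\]

that is, sqrt(x) + sqrt(y) = sqrt(gamma).)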
That's actually an art project that lots of people do when they're
elementary -- when they're in elementary school. Okay. I found on the Web
somebody who had done it with string. There's the curve. I remember doing
this when I was in elementary school. My brother did too. And it's
mysterious. I loved the beautiful shape of the envelope that came out of
that. And at the time it was unimaginable to me that I would someday write a
research paper where the answer to the question was that curve.
There are other people besides me who are captivated by this curve. Here's
somebody who constructed it between a fallen tree and a standing one. Okay.
On to the achievability result. So now I want to -- I need to take a point
that lies inside the purple region and I need to show that there is a policy
that achieves it.
And so it's going to be a proof by contradiction. Suppose that, rather than
the purple region, the achievable region was some other subset denoted here
in yellow. I explained to you that trivial reasoning justifies that the
achievable region is convex. Okay. So if there's a purple point that
doesn't belong to the achievable region, then there's a separating
hyperplane, a line that passes through that point that separates it from
everything that's achievable.
And that line has some slope lambda, and to say that the line separates this
point from the achievable region is to say that there's some instance of the
multi-armed bandit problem where no policy can achieve reward minus lambda
times cost as large as the value that that objective function attains at this
putatively unachievable point A, B.
So reward minus lambda times cost, I'm going to call that the Lagrangean
objective, or I'll abbreviate it as the lambda objective. It's an objective
function parameterized by this parameter lambda that measures kind of the
tradeoff between learning and earning, the tradeoff between these incentive
payments that you have to make to agents and the value of the rewards that
they reap.
Okay. And in our proof by -- okay. So our theorem that the achievable
region is this purple one can be reinterpreted as saying that for every
lambda you can always guarantee that there's a policy whose lambda objective
value is at least some approximation factor times the opt. To extract what
that approximation factor is, you would merely have to calculate for the
curve root A plus root B equals root gamma, where's the -- if I draw a line
of slope lambda, I guess negative lambda, where's the point of tangency,
what's the value of 1 minus A minus lambda B at that tangency point.
Okay. I've done the calculation for you. The value is 1 minus lambda gamma
over 1 plus lambda. And our theorem is simply equivalent to the assertion
that not only for Diamonds in the Rough but for every multi-armed bandit
instance you can find a -- for any specified value of lambda, you can find a
policy whose lambda objective is at least this big. That's what I need to
prove for you.
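
(For the record, that tangency calculation: maximizing 1 - A - lambda*B over
the boundary curve sqrt(A) + sqrt(B) = sqrt(gamma) gives

\[
\sqrt{A} = \lambda\sqrt{B}
\;\Longrightarrow\;
B = \frac{\gamma}{(1+\lambda)^{2}},
\qquad
A = \frac{\lambda^{2}\gamma}{(1+\lambda)^{2}},
\qquad
1 - A - \lambda B = 1 - \frac{\lambda\gamma}{1+\lambda},
\]

matching the value quoted above.)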
And, okay, so let me check in with Yuval. It's -- in a minute it will be
1:20, which is when I was scheduled to stop talking.
>> Yuval Peres: You have ten more minutes.
>> Bobby Kleinberg: I have ten more minutes. Excellent.
>> Yuval Peres: [inaudible].
>> Bobby Kleinberg: Okay. I'm planning to spend less than ten more minutes.
Okay. So proving that the answer to this question is yes is the bulk of the
technical work of our paper. And I just want to present maybe two slides
that give you a sketch of why this is true.
So rather than computing the policy that optimizes the lambda objective,
which we believe is probably a very, very hard, maybe PSPACE-hard, problem,
we design a simple but suboptimal policy that we're able to analyze. The
policy incorporates two features that are crucial to its analysis, time
expansion and random censorship. So I'll call it the TERC policy, for time
expansion with random censorship.
And how does this policy work? At initialization time, before it pulls any
arms, it tosses a biased coin for each arm. And with this probability marks
the arm as censored. Censoring an arm basically means I'm never going to try
learning that that arm is the best. Okay. After initialization at every
time step it tosses an independent coin with bias 1 over lambda plus 1.
You'll see why this 1 over lambda plus 1 in a second. If it gets heads, it
plays the next step of the optimal policy limited to the set of uncensored
arms. If it gets -- Yishay, yeah.
>>:
[inaudible].
>> Bobby Kleinberg: It's a finite number K and -- I mean, the parameter K is
not represented in the pseudocode because --
>>:
[inaudible].
>> Bobby Kleinberg: It's assumed to be finite. Yeah, that's right. So when
I was informally analyzing the Diamonds in the Rough instance, I said that
there are an infinite number of those blue arms, but it's really a limit of a
sequence of finite instances, and so the proof of unachievability for the
white points requires actually doing it for every finite number of arms and
taking a limit.
Okay. If the coin toss comes up tails, you just don't offer people any
bonuses and they pull an arm myopically. I didn't say it on the slide, but
you also ignore whatever observation came out of that arm. So your posterior
in this policy is actually not conditioning on the entire history, but you
only do Bayesian updates on the histories that you saw when your coin toss
came out heads.
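
(Putting those pieces together, here is a rough sketch of the TERC loop --
illustrative only, not the paper's pseudocode. The censorship probability and
the inner policies are stand-ins passed in as parameters, since the talk does
not spell them out.)

    # Illustrative sketch of TERC (time expansion with random censorship).
    # censor_prob, restricted_opt_choose, myopic_choose, pull, and
    # update_posterior are hypothetical stand-ins for quantities the paper
    # specifies (the censorship probabilities, the restricted Gittins policy,
    # the environment, and the Bayesian update).
    import random

    def terc(arms, lam, censor_prob, horizon,
             restricted_opt_choose, myopic_choose, pull, update_posterior):
        # Initialization: independently censor each arm with probability censor_prob.
        uncensored = [a for a in arms if random.random() >= censor_prob]
        knowledge = None  # posterior state the policy conditions on
        for _ in range(horizon):
            if random.random() < 1.0 / (lam + 1):
                # Heads (prob 1/(lambda+1)): pay whatever bonus is needed so the
                # user plays the next step of the optimal policy restricted to
                # the uncensored arms, and update the knowledge state.
                arm = restricted_opt_choose(uncensored, knowledge)
                reward = pull(arm)
                knowledge = update_posterior(knowledge, arm, reward)
            else:
                # Tails: no bonus; the user pulls the myopic arm, and the policy
                # ignores the observation, so its knowledge state stays paused.
                arm = myopic_choose(arms, knowledge)
                pull(arm)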
Okay. So this trick where we interleave the optimum and the myopic
policy, that's what we call time expansion. Why does it work and why this
one over lambda plus one in the bias? So here's a simple calculation.
In any step, if you look at the expected value of lambda times the bonus that
you pay out, that's lambda times 1 over lambda plus 1, because you only pay a
bonus in the event that the coin lands on heads, times the amount of the
bonus which is the payoff of the myopic arm minus that of the one that you
told the person to play instead of the myopic arm.
Now, in the remaining time steps where you tossed tails, you actually got
some surplus reward above what the optimum policy on the uncensored arms
thought it was going to get. That is to say, had you been playing this
restricted optimal policy, you would have gotten this payoff, but instead you
told people to just play myopically, and they got, in expectation, a higher
payoff. Okay.
So in terms of the Lagrangean objective, that surplus, when you toss tails,
exactly cancels this deficit that you get when you toss heads. Because tails
comes up with probability lambda over lambda plus 1. And when it comes up,
the surplus that you get in your Lagrangean objective is this thing in
parentheses.
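
(Written out, with mu_myopic and mu_opt denoting the posterior expected
rewards of the myopic arm and of the arm the restricted optimal policy would
pull in that knowledge state, the per-round cancellation is

\[
\underbrace{\frac{1}{\lambda+1}\cdot\lambda\,(\mu_{\mathrm{myopic}} - \mu_{\mathrm{opt}})}_{\text{expected }\lambda\times\text{bonus on heads}}
\;=\;
\underbrace{\frac{\lambda}{\lambda+1}\,(\mu_{\mathrm{myopic}} - \mu_{\mathrm{opt}})}_{\text{expected reward surplus on tails}},
\]

so the two contributions to the Lagrangean objective cancel exactly in
expectation.)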
Okay. So the time expansion trick brings about this cancellation where from
now on we can ignore the bonus payoffs that we're paying to the agents
because they're, in expectation, getting exactly canceled out by these
windfalls that happen when the coin toss comes up tails.
Okay. So, in other words, what I've said is that the Lagrangean objective of
the TERC policy at time step T is equal to the expected payoff that the
restricted optimal policy gets in the same time step. Okay. Now, this
analysis has a bit of a something for nothing sound to it. We're giving
people these bonus payments and yet I just told you that on the expectation
our reward minus the bonus payment is equal to the reward of the optimum
policy.
So why are we censoring arms at all? We should just leave every arm
uncensored and always get as much payoff as the [inaudible] in every time
step, and then we would have -- you know, we'd be achieving the magical point
(0, 0) on the achievability plot.
Okay. So the reason that this something for nothing reasoning doesn't work
is that the TERC policy is learning more slowly than the [inaudible]. The
opt policy gets to pull an arm and see a feedback in every time step. The
TERC policy in step T is getting the same payoff that the opt -- restricted
opt policy was expected to get in step T based on the knowledge state that
the TERC was actually in.
But this policy always gets a new data point and advances its knowledge
state. This one only gets new data with this probability. And the rest of
the time its knowledge state remains paused. So it's basically trying to
simulate this policy but it's playing more slowly.
Okay. So, in other words, the Lagrangean objective for the TERC policy is
the same as the expected payoff for a stuttering opt that every time it pulls
an arm it stays on it for some random geometrically distributed number of
steps and then advances.
Okay. Stuttering slows down the rate of learning and reduces payoff. The
censorship is designed to compensate for this slowdown. So, in other words,
conditional on arm one remaining uncensored, the slowdown in learning delays
the time when I get to pull arm one for the hundredth time. But censoring
the other arms accelerates the time when I get to pull arm one for the
hundredth time because it didn't have to wait behind all those pulls of all
those other arms that got censored.
And the censorship probability is designed to exactly trade off this slowdown
with this speedup, and that's about as much as I can say about the analysis
of the censorship.
So I'm moving on to my concluding slide now. So in summary, this paper
presented a model of crowdsourced information discovery that models phenomena
like reviewing products online or articles or crowdsourced scientific
exploration. We saw that situations like this involve a misalignment of
incentives that has to do with explore exploit tradeoffs and the analysis
that I provided allows a surprisingly precise quantification of that tradeoff
as a function of the principal's patience.
It allows you to make statements like the principal can always achieve at
least 75 percent of the social surplus while paying back only 25 percent to
the users. The structure of the TERC policy lets you say that simple
policies that randomize between just letting people do what they want and
providing incentives for them to do something optimal are sufficient to
achieve the approximation guarantees in this bullet. And the analysis of the
lower bound gave us some insight into what the hardest instances look like.
And so this was going to be the last slide of my talk, but yesterday I was
waiting in the Overlake transit center for my bus, and I saw a quote from
Mahatma Gandhi that I had never seen before. So he said: Live as if you
were to die tomorrow, learn as if you were to live forever.
I hope you can see the relevance of that quote to this talk. But maybe you
can also see that uncharacteristically for Gandhi, his wisdom was sort of
incomplete here. He was overlooking a fundamental tradeoff in life. We
learn by living. We learn by living. So you can't adopt a different policy
when learning from the policy that you adopt when living.
>>:
[inaudible] randomization.
>> Bobby Kleinberg: Oh, yeah. So that's right. So replacing Gandhi's maxim
with the much more wise Yuval Peres maxim, what he should have said was live
and learn as if you were to die tomorrow with probability 1 over lambda plus
1 and live forever with probability lambda over [inaudible] plus 1.
[laughter]
>> Bobby Kleinberg:
And with that, I conclude.
[applause]
>>: [inaudible] make sense in practice [inaudible] in principle [inaudible]
I have to see the whole history of all the [inaudible].
>>:
Thousands and thousands of reviews.
>>:
[inaudible] because otherwise.
>> Bobby Kleinberg: You wouldn't have to see that because if we all have a
common prior and you have a correct belief about what discount policy I'm
using, then you could compute for yourself what discounts were applied.
>>: But I'm just wondering [inaudible] I'm wondering whether there's any
middle ground where, you know, you could publicly announce [inaudible] X
which is not the optimal thing but it's very easy for us to understand.
>>: Why do you care about the discount? You don't have [inaudible] you just
care that you get to see the results. That's all the information there is
about the arms. You don't actually care why people chose [inaudible].
>>:
[inaudible] how can I find a policy [inaudible].
>> Bobby Kleinberg: Sorry, are you portraying that as a modification of
Ander's question, or just --
>>: It's purely [inaudible] I just want to say is it clearly to see them
only [inaudible].
>> Bobby Kleinberg: Yeah. So no. Given A and B it's not easy to define a
policy [inaudible]. What's easy to do is, given a lambda, I showed you how
to get a simple policy whose Lagrangean objective achieves 1 minus lambda
gamma over lambda plus 1 approximation to the optimum.
But the business of achieving a specified A and B depends on knowing more
about the parameters of your multi-armed bandit instance. You can't simply
be -- it can't be described as a simple transformation of the restricted optimal
policy.
Ander, I think your question is a good one and a perceptive one. I don't
really have -- I haven't thought about the question enough to say anything of
substance. So I think the one thing I'll say in response to your question is
that I object that I was here for an entire morning of the Algorithms for
Technology Transfer workshop, and I did not once hear a question that was
prefaced by, you know, your theory is all fine and good, but bringing it a
step closer to practice --
>>:
[inaudible].
>> Yuval Peres: Okay. I think we have to move on, so let's thank.
[applause]