>> Ran Gilad-Bachrach: Umar is not new to Microsoft Research. He spent a summer
in Mountain View working with Nina Mishra and others on search related problems. He
also spent a summer at AT&T working on voice activated interactive applications.
Umar has won several prestigious awards, including the best student paper award at UAI
2007 and the computer science department award at Princeton in 2004. He also won the
Google student award at the machine learning symposium twice, in 2007 and 2009.
Therefore, without further ado, let's hear from Umar Syed about robust semi-supervised
learning.
>> Umar Syed: Okay. Thank you, Ran, for that very kind introduction. So as Ran said,
today I'll be talking about robust semi-supervised learning. This is joint work with Ben
Taskar at the University of Pennsylvania.
So just to set the scene a little bit, let me review supervised learning, which is the oldest,
most traditional learning model in machine learning. So in supervised learning we
are given a training set of examples, and a labeler carefully examines this set of
examples and reveals all the labels of the examples to a learning algorithm, and the
learning algorithm's job is to try to fit a function that predicts the labels of these examples
as accurately as possible.
But it has been commonly observed that labeling examples is an expensive and
time-consuming process, and many alternative approaches to supervised learning have
been proposed to deal with this. One of the most popular is semi-supervised learning.
Semi-supervised learning is based on the observation that while labeled data is
expensive, unlabeled data is typically quite cheap and easy to obtain, and so we would
like our learning algorithm to exploit this abundance of unlabeled data.
So in semi-supervised learning we're given a training set of unlabeled examples. A
labeler examines the examples and reveals the labels of some of the examples to the
learning algorithm.
Now a question that immediately comes to mind is: how are the examples selected
to be labeled? In most analyses of semi-supervised learning, and in the way most
algorithms are derived, it is assumed that the examples to be labeled are chosen
randomly, usually uniformly at random; that is, the unlabeled data and the labeled
data are drawn independently from the same distribution.
But in most naturally occurring partially labeled datasets, the examples that are labeled
are not selected at random. So consider a website like Flickr. Users on Flickr can tag
pictures with tags that indicate the content of the picture. Something similar happens on
Facebook. You can tag pictures. You can also indicate whether you like the picture or
not. And the process by which users are selecting pictures to label is unknown to us but
it's almost certainly not a random process.
In fact, most websites on the Internet that are soliciting user feedback are generating in
one way or another a partially labeled dataset, and the examples that are labeled in that
dataset are selected according to a process that is difficult to model and difficult even to
know.
Now, let me just point out that in all these examples I'm showing you, the user selects an
example to label not in a sort of sterile, isolated environment but rather in the context of a
webpage, typically in the context of other examples. So the selection bias
by which the user chooses examples to label is highly correlated across
examples and seems difficult to model.
And so today I'm going to talk about an approach to semi-supervised learning that
attempts to totally lift this modeling burden. We're going to assume that the examples
in the training set that have been selected to be labeled are chosen arbitrarily, even
adversarially. So we don't want to make any probabilistic assumptions about which
examples are labeled.
Within this framework, I'm going to derive some generalization bounds, tight upper and
lower generalization bounds on learning in this framework. And using those bounds, I'll
derive an efficient and scalable algorithm that can learn a predictor from
partially labeled datasets. And then towards the end, I'll describe some experiments that
illustrate the benefits of our approach. These experiments will show that our
approach is more robust than traditional semi-supervised learning algorithms when the
labels are provided by a particularly unhelpful or confusing labeler.
Okay. So here's a roadmap of the talk. After I describe robust semi-supervised learning
in detail, at the end I will briefly discuss some other projects I've worked on related to
bandit algorithms. I've worked on these projects during both my PhD and my post-doc.
And I think some people here might be interested in perhaps talking about these projects
later. So I'll just briefly summarize them at the end, and maybe we can discuss
them later.
Okay. So let's talk about the learning model. What do we want from this learning
model? Remember that we want to lift the burden of having to model how users select
examples to label. So we don't want to make any probabilistic assumptions about how
those examples are chosen. Additionally, and somewhat orthogonally, we don't just want
the labeler to be able to give us discrete label information about each example, but also
to be able to tell us some global information about the labeling of the entire dataset. And
this kind of soft global information is going to be particularly useful for our experiments.
I'll describe it in more detail in a little bit.
So here's our framework in more precise detail. We're given a training set of
examples, X and Y. Here X and Y are vectors, so X is a vector of examples and Y is a
vector of labels corresponding to those examples. The labeler examines the training set,
and then he reveals to the algorithm a label regularizer function.
Now, this function is chosen arbitrarily, perhaps even adversarially, from some known
function class. And the function is going to encode all of the information that the labeler
wishes to reveal to the learning algorithm. So what are the semantics of this label
regularizer? The semantics are those of an asymmetric penalty function
on labelings of the training set.
So, in other words, if R assigns a large value to some full labeling of the training set, this
means that that labeling is not likely to be the correct labeling of the data. If R assigns a
small value to some complete labeling of the training set, then that labeling
may or may not be the right labeling of the dataset. So, roughly
speaking, the minima or near minima of the function R, which is defined on
labeling space, are the possible labelings of the dataset.
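To make the interface concrete, here is a minimal sketch of what a label regularizer could look like in code. The function name and the particular penalty are hypothetical illustrations, not the paper's implementation; the semantics match the description above, with low values marking plausible labelings and high (here infinite) values ruling them out.

```python
import numpy as np

def label_regularizer(labeling, revealed):
    """A hypothetical label regularizer R.

    labeling: array of candidate labels for all m training examples.
    revealed: dict mapping example index -> label the labeler disclosed.
    Returns 0 if the candidate labeling agrees with every revealed label
    (a possible labeling), and infinity otherwise (ruled out).
    """
    for i, y in revealed.items():
        if labeling[i] != y:
            return np.inf
    return 0.0
```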
>>: This doesn't talk about choice when you get to that next slide, this just seems to talk
about [inaudible].
>>: [inaudible] function.
>>: Okay.
>> Umar Syed: All right. So we're just given this function. That's it.
>>: Okay.
>> Umar Syed: So just --
>>: One kind of clarifying question. How is the expert supposed to give you this
function? It is given to you by something, so what [inaudible] give you the sense of how
the expert is going to come up with this function?
>> Umar Syed: So you're going to need to be able to follow the gradient of this function.
>>: How are we supposed to expect an expert to give you this function?
>> Umar Syed: I'll give you some examples of specific functions and maybe that will
clear it up. Yeah. So that will be the next slide. But just to recap, this is the learning
model. Again, a training set X and Y is drawn independently from some distribution.
The labeler looks at that training set, replaces the labels with this regularizer function,
and that is the only information that the learning algorithm sees.
Okay. So here are some examples. This framework is quite
general, and it can capture several existing approaches to semi-supervised learning. So
here's an example. Suppose that the labeler is going to tell you, for each example, a set
of possible labels that the example could have. A regularizer function R that would
encode that information is a function that assigns zero to any full labeling of the dataset
that is consistent with these sets and infinity otherwise. It
looks like in the transfer some symbols got lost. So this should be infinity and this should
be element. Sorry about that.
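Reconstructing the lost symbols, the regularizer on the slide presumably has the form

```latex
R(\mathbf{y}) =
\begin{cases}
0 & \text{if } y_i \in S_i \text{ for all examples } i, \\
\infty & \text{otherwise,}
\end{cases}
```

where S_i is the set of possible labels the labeler allows for example i.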
Oh, yeah. Oh, boy. I hope this doesn't happen too much. Okay. So here's
another kind of information that a regularizer function R could encode. Suppose the
labeler wants to give you a similarity measure on examples, and he wants
to assert that two examples that are similar have similar labelings. So
this is basically what's called Laplacian regularization. And it can be encoded using this
form here.
And notice here that I've replaced a labeling of the dataset Y with Q, which is a
distribution on labelings of the dataset. This is basically a soft version of a labeling
of the dataset. And for the rest of the talk when I say a labeling, I really mean this
softer version of a labeling: a distribution on labelings.
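For reference, Laplacian regularization of this kind is commonly written as something like

```latex
R(Q) = \sum_{i,j} W_{ij}\,\big(\mathbb{E}_Q[y_i] - \mathbb{E}_Q[y_j]\big)^2,
```

where W_{ij} is the similarity between examples i and j; this is the standard form, not necessarily the exact expression on the slide.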
Another thing the labeler could tell you is the expected values of some
features under the true posterior distribution. This kind of information is called
posterior regularization, and it is yet another approach to semi-supervised learning. I
hope that answers your question.
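In symbols, posterior regularization constraints are usually written as something like

```latex
R(Q) =
\begin{cases}
0 & \text{if } \big\|\mathbb{E}_Q[\phi(X, Y)] - \mathbf{b}\big\| \le \eta, \\
\infty & \text{otherwise,}
\end{cases}
```

where phi are the features and b the expected feature values asserted by the labeler; again this is a standard form rather than the slide's exact notation.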
These regularizers can also be combined. For example, if the labeler wants to tell you a
set of labels per example and also similarity information, then
that can be encoded using this simple sum.
>>: So if I wanted to just [inaudible] let's say I wanted to be really adversarial and I want
to choose the worst possible examples in each class to show you, how would I
[inaudible]. I want to just go and I want to mess you up. Let's say there's two
overlapping Gaussians and I want to take the labels from the wrong --
>> Umar Syed: Right, right, right. So I'm going to do an experiment with exactly that kind
of noise later.
>>: So how do you express R -- how do you express that R?
>> Umar Syed: Oh, so you would say that -- so he wants to tell you information about,
say, two examples. So for any labeling, the function will
be independent of what labels are assigned to the other examples. And for those two
examples, you know, it will have a low value for those labelings that the labeler
says are allowed and a high value for those that
[inaudible].
>>: So how do you -- okay. Well, keep going. Keep going.
>> Umar Syed: So one question you might want to ask is -- so I've just explained how
several existing approaches to semi-supervised learning can be put in our framework.
So how is this any different from ordinary semi-supervised learning?
So all existing approaches to semi-supervised
learning -- at least the ones that we're familiar with -- are optimistic, in the sense that they
tend to find a predictor that has good performance on the best possible complete labeling
of the partially labeled dataset. And our algorithm, as I'll demonstrate to you, finds the
predictor that has the best performance for the worst possible complete labeling of the
dataset. And so in this sense existing approaches are optimistic while our approach is
more pessimistic.
And this is particularly good for scenarios where the examples whose labels have been
revealed are not representative, or are confusing or misleading in some way. And our
experiments will illustrate this.
And moreover, as kind of a bonus, this pessimistic versus optimistic approach has the
effect of convexifying some existing approaches to semi-supervised learning that are
non-convex. For example, posterior regularization, one of the examples I showed you, is
a non-convex formulation of the problem, and our pessimistic approach convexifies that
approach. Okay. So I hope the learning model is clear. So now let me give you a few
bounds about this model.
So before I jump into stating what the theorems are, let me step back and ask
what is the point of these theorems. I'm going to give you an upper and a lower
bound on generalization error in this learning model. And the algorithm I present later in
the talk is going to minimize the upper bound. Because the upper and lower bounds
are going to be almost tight, this will show that our learning algorithm is near optimal --
in other words, the right thing to do.
So here's the upper bound. Again, X and Y is a training set drawn
from some distribution. And L is the loss of theta, the parameter that you're trying to
learn. It could be log loss or hinge loss or something like that. And R is the label
regularizer function that encodes the information.
Then with probability one minus delta we have the following somewhat difficult to parse
upper bound. But luckily the bound on the right-hand side is the sum of three
terms. And each of these three terms, I think, has a nice intuitive interpretation. So let
me walk you through it to help you understand what this bound is really saying.
So the first term is a maximization. And this maximization is balancing between
two objectives. The first part of the first term is the expected loss on the
training set under labeling Q, and this is the penalty, essentially, that the regularizer
assigns to Q. So this maximization is trying to maximize the empirical loss that we
observe subject to this penalty.
And so this first term will tend to be large whenever R has a lot of minima or near
minima, because that gives this
maximization more opportunities to be large, more places where it can be maximized.
So in other words, this term will tend to be large whenever R is ambiguous -- when
many, many labelings are possible or allowed.
The second term is the penalty that is assigned by the regularizer function to the true
labeling of the dataset. And so this term will tend to be large when R is misleading, when
the labeler is lying to you about what the true labeling is.
And the last term is the term that goes to zero as the number of samples goes to infinity,
and this term comes from standard arguments from uniform convergence. Taken
together, what this bound is saying is that as the number of examples goes to infinity, the
loss of any parameter is bounded by the ambiguity and misleadingness of the label
regularizer function that the labeler gave you. Which is sort of an expected kind of result.
Okay. So this is the upper bound.
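Since the slide itself is not visible in the transcript, here is a schematic rendering of the bound consistent with the three-term description; the exact constants and complexity term are on the slide and not recoverable here. With probability at least 1 - delta,

```latex
L(\theta) \;\le\;
\underbrace{\max_{Q}\Big(\widehat{L}(\theta, Q) - R(Q)\Big)}_{\text{ambiguity of } R}
\;+\; \underbrace{R(\mathbf{y})}_{\text{misleadingness of } R}
\;+\; \underbrace{O\Big(\sqrt{\log(1/\delta)/m}\Big)}_{\to\, 0 \text{ as } m \to \infty},
```

where \widehat{L}(\theta, Q) is the empirical loss under labeling distribution Q, \mathbf{y} is the true labeling, and m is the number of examples.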
Now, it turns out that without further assumptions we're not going to be able to prove a
matching lower bound. So what I'm going to describe to you is a set of what I think, and I
hope you will agree, are natural assumptions about the behavior of a labeler. And under
these assumptions we'll be able to prove a tight lower bound. Okay. So
there are basically three assumptions.
So the first assumption is basically a truth telling assumption. So here's the function R
that the labeler gives you. And let this be the true labeling of the data. And the
assumption is that the true labeling is within epsilon of the minimum of the function.
In other words, the labeler is not highly misleading about the
truth. I should perhaps also point out that this function R typically will have many
minima, so it's not enough just to identify the minimum of the function to be able to
recover the labeling.
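In symbols, the truth telling assumption presumably reads

```latex
R(\mathbf{y}) \;\le\; \min_{\mathbf{y}'} R(\mathbf{y}') + \epsilon,
```

that is, the true labeling \mathbf{y} scores within epsilon of the minimum of R.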
>>: I have a question [inaudible] the simple -- so the labeler is handing you both the
regularizer R and a subset of labels.
>> Umar Syed: Just the R.
>>: Just the R.
>> Umar Syed: The only information [inaudible].
>>: But isn't this -- how is this -- earlier you said you were taking a pessimistic view of
[inaudible] and then here it seems maybe I misunderstand you but it seems that you kind
of assume that [inaudible] misleading you [inaudible].
>> Umar Syed: Oh, right. So the pessimistic view -- so that's a very good question. I
would say that the pessimistic view is not so much a consequence of the model but
rather a consequence of the bounds that I'm describing to you. This upper
bound is pessimistic because we have this max over essentially all possible labelings
that are consistent with what he told you, right? And I'm going to describe to
you a scenario where this bound is tight.
>>: So you're saying that worst case in terms of, you know, given the setup with that
particular --
>> Umar Syed: That's right. That's right.
>>: The R itself is not [inaudible].
>> Umar Syed: Right. So I should make a distinction between being misleading and
being ambiguous. Right. So this is saying that the labeler is not misleading. But he can
be as ambiguous as he wants.
>>: I have confusion also on the previous slide. The -- all right. The loss -- this is the
loss of the model assuming Q is the correct labeling.
>> Umar Syed: On the right-hand side, that's correct.
>>: On the left, the first term on the right-hand side.
>> Umar Syed: The first term on the right-hand side.
>>: Okay. Okay.
>>: You write Q as a state. Yeah. Okay. So it's [inaudible].
>> Umar Syed: So draw an example from the training set and then draw its label
according to Q.
Okay. So again, the first assumption is that the labeler is not too misleading. The
second assumption is an assumption about the stability of the labeler. So if I have two
training sets and they are very similar to each other -- and I'll explain what I mean by
similar in a second -- if they're very similar, then the labeler will give me the same
information for both training sets. And by similar, I mean think of those two training sets
as empirical distributions in input space. If those two distributions are close by this
amount that we're going to parameterize by lambda, then that's what we mean by
close. So that's stability.
And the last assumption is kind of an unusual one, but it's necessary. It's
what we call the no coding assumption. And that is that the labeler can't hide
the true labeling in some clever way in the function R. We want to exclude this possibility
because if he hides the true labeling in a clever way, then a very clever algorithm might
be able to recover it, and we'd have no hope of showing that our algorithm is optimal.
So let me just give you an example of a clever hiding of the information. Consider a
dataset where the examples have binary labels, and the labeler is going to erase some of
the labels and not erase others. So if there are M examples, there are two to
the M possible labelings and there are also two to the M possible ways to erase labels.
So there could, in fact, be a one-to-one mapping between labelings and erasure
patterns. And a very clever algorithm could invert that mapping and recover the truth.
Now, obviously this is a highly artificial unrealistic setting and this is exactly the kind of
artificial thing we want to rule out. Because if we don't rule it out, then I can't show that
the algorithm is optimal.
>>: What would be the [inaudible].
>> Umar Syed: So the proper definition is that -- so R assigns some labelings finite
values and other labelings infinite values, right? So let's call things that are finite
allowed and things that are infinite not allowed, right? An infinite penalty means it's not
allowed. So the assumption is that no algorithm can conclude that a labeling that is
allowed is not possible. So if the regularizer assigns a finite value to a
labeling then it's possible. That's the assumption.
>>: So on [inaudible].
>>: Yeah.
>>: [inaudible] where people tend to rate things that are [inaudible].
>> Umar Syed: Yeah. Yeah.
>>: And so that would be kind of the situation that seems to be analogous to a mapping
between -- not a deterministic mapping but nevertheless a mapping between labelings
and [inaudible] mappings, right, because what's happening is essentially erasing the
three-star ratings.
>> Umar Syed: That's true.
>>: And [inaudible].
>> Umar Syed: That's a good point. That's true. That's a good point. I wouldn't argue
with that.
>>: Your lower bound cannot [inaudible].
>> Umar Syed: No, no. It's fine with that. Because -- so what would the R look like for
that setting?
>>: [inaudible].
>>: Well, no, it's more than that, right, because you're saying that the cases that are left
unlabeled are essentially telling me something about --
>>: The actual label.
>>: The actual label.
>>: But it's --
>> Umar Syed: Right. So if it --
>>: [inaudible].
>> Umar Syed: No, no. [laughter]. So if it were the case that anything unlabeled is
definitely not a one or definitely not a five --
>>: [inaudible] right?
>> Umar Syed: Well, I think --
>>: What type of --
>>: So if it's soft it's okay.
>> Umar Syed: Yeah, I think so.
>>: Oh, okay.
>> Umar Syed: So the assumption -- the no coding assumption -- does not involve
probability.
>>: So there can't be a deterministic way to determine [inaudible].
>>: Conversely a [inaudible] R of truth [inaudible] I mean you can still do full labeling,
right?
>> Umar Syed: Absolutely, yes.
>>: So R truth equals zero [inaudible].
>> Umar Syed: That's the supervised setting.
>>: That still works?
>> Umar Syed: That still works. Okay. So under these three assumptions, which I hope
I've convinced you are reasonable, we have the following lower bound. If the number
of examples is at least this many -- here lambda is from the stability assumption -- then
with at least constant probability no algorithm is guaranteed to have generalization error
that's more than this much better than the upper bound. In other words, there's no
algorithm that can guarantee with high probability a generalization error bound that's
more than this much better than the upper bound that I showed you. And again, here
lambda is from the stability assumption, and epsilon is from the truth telling assumption.
>>: That square root, is it a real square root or [inaudible] I'm just curious if it actually
cancels --
>> Umar Syed: It's probably O, you're right.
>>: But does it actually cancel the term? Does it cancel the term in your upper bound?
Is this --
>> Umar Syed: It's the same term.
>>: Yes. So this is saying that you can't be more than max over Q of L theta, you know,
Q.
>> Umar Syed: Oh, no, no.
>>: Q minus epsilon.
>> Umar Syed: No, no, no, no. It's the -- so the gap will be twice. So in other words, the
upper bound has a slack of this much.
>>: Yes.
>> Umar Syed: And the lower bound has sort of a complementary slack below it.
>>: I see.
>> Umar Syed: So the gap is twice this term, yeah.
>>: But it goes.
>> Umar Syed: Yes, it goes to -- asymptotically goes to zero. So asymptotically the gap
is epsilon. Okay. So now that we have our matching upper and lower bounds, we're
ready to drive an algorithm. And the algorithm is really simple. And now that we've
shown that the upper bound is tight, at least under some reasonable assumptions, the
algorithm is just going to be minimized -- find the theta then minimizes the upper bound.
And hopefully we've argued that that's the right thing to do. And we're going to minimize
the upper bound while controlling the norm of theta to help with generalization error.
And so here I've just plugged in the upper bound from the previous slides, discarding
terms that are independent of theta. Oh, sorry. So how could I minimize this objective?
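For reference, after plugging in the upper bound the objective presumably has the form

```latex
\min_{\theta} \; \max_{Q} \Big(\widehat{L}(\theta, Q) - R(Q)\Big) + \alpha \|\theta\|^2,
```

where the alpha term controls the norm of theta, as mentioned above.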
One thing I could do, that perhaps some of you are familiar with, is try
subgradient descent. I want to minimize this function. This function is not differentiable,
but I can find its subgradient at various points. And I can follow the subgradient until I
find the minimum. But something that we found seems to work better is the algorithm I'm
about to describe, which is essentially a two step algorithm
for finding theta star, the theta that minimizes this objective. And this algorithm
assumes that both the loss function and the regularizer function are convex. And all of
the examples of regularizer functions that I've given you so far are, in fact, convex.
And the algorithm is called GAME because there's a game theoretic theorem in the
proof of its convergence. So here's the algorithm. Essentially what we're going to do is
-- excuse me. So we have this minimax objective. We're basically going to solve the
problem inside out. And by solving it inside out I mean I'm going to swap the min and the
max in the objective. Okay? And now I'm going to find not the best theta for the worst
possible labeling but rather the worst labeling for the best theta. In other words, I'm sort
of changing roles. I'm no longer the learning algorithm. I'm like nature trying to mess up
my learning algorithm. And that worst labeling is called Q star.
And the next step is to find the best response. So assuming that this worst labeling is
instead the true labeling, I just find the best theta for this Q star, this worst labeling.
Okay? So I'm kind of solving the problem inside out.
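To make the two steps concrete, here is a minimal, illustrative sketch of the meta-algorithm for a finite set of candidate labelings. The function names, solvers, and step sizes are stand-ins chosen for clarity, not the paper's actual implementation; the talk's real steps use exponentiated gradient and stochastic gradient, described later.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Finite-difference gradient, to keep the sketch self-contained."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def game_meta_algorithm(objective, n_labelings, theta_dim,
                        outer_steps=100, inner_steps=100, lr=0.05):
    """A sketch of the two 'inside-out' steps of the GAME algorithm.

    objective(theta, q) is the regularized minimax payoff: roughly the
    expected loss under labeling distribution q, minus R(q), plus the
    strongly convex term alpha * ||theta||^2. It must be convex in theta
    and concave in q.
    """
    def best_theta(q):
        # Approximately solve min over theta of objective(theta, q) by
        # gradient descent (a stand-in for the SGD step in the talk).
        theta = np.zeros(theta_dim)
        for _ in range(inner_steps):
            theta -= lr * numerical_grad(lambda t: objective(t, q), theta)
        return theta

    # Step 1: swap min and max -- find the worst-case labeling distribution
    # q* = argmax_q min_theta objective(theta, q), using a multiplicative
    # (exponentiated-gradient style) update that keeps q on the simplex.
    q = np.full(n_labelings, 1.0 / n_labelings)
    for _ in range(outer_steps):
        theta = best_theta(q)  # inner minimization at the current q
        q = q * np.exp(lr * numerical_grad(lambda p: objective(theta, p), q))
        q /= q.sum()
    q_star = q

    # Step 2: best response -- treat q* as if it were the true labeling
    # and fit theta to it.
    return best_theta(q_star), q_star
```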
>>: So you do kind of like [inaudible] when you -- between the steps?
>> Umar Syed: No, not exactly. I'll describe what I'm going to do. But think of it as a high
level argument.
So there are a couple of questions. First of all, the first question is, you know, what is
this buying us? Why does turning the problem inside out help in any way?
An even more basic question is: why does this give us the right answer? In other
words, why is the theta star here the same as the theta star for the original objective?
Right. And so let me address that problem first.
So first of all, is it clear what I mean by solving the problem inside out? Okay. And the
reason why this works is that Q star, that worst possible labeling of the data, in fact,
uniquely determines theta star, the parameter that we're after --
>>: As long [inaudible].
>> Umar Syed: Yes. That's right. Well, a bit more on that. Strongly convex.
>>: Strongly convex.
>> Umar Syed: Yeah. So here's exactly what you're talking about. So here's the
objective for the second step of the algorithm. So I found my Q
star, the worst labeling. And indeed theta star is the unique minimum of this objective.
And this function is strongly convex because of the regularizer alpha that I've
added to it. If I had not added that, it might be just convex. It could have flat regions. But
due to strong convexity it has a unique minimum. But we have kind of a problem, which
is that no algorithm is going to find the exact minimum but rather an approximate
minimum. And at least in this picture, the true minimum and the approximate minimum
are quite far apart, because when the function is almost flat, you have this problem.
So what we do, and this is sort of done in more than one place, is that if you just increase
alpha you make the function more strongly convex and then you can be sure that any
approximate minimum is going to be close to the true minimum. And so this is the effect
there.
And so this kind of illustration I hope explains the convergence guarantee we have for
the algorithm, which is that if both steps of the algorithm, this inside-out approach, find
epsilon approximate solutions to the problem, then the theta that's output by the
algorithm is going to be this close to the parameter that you're after. And the precision of
your answer is governed by how approximate the answers were and also this alpha. So
you need that alpha to make the thing more curved so that the approximate answer is
close to the true answer, and then you have to make epsilon small enough to drive
everything close. Go ahead.
>>: You said it promised to do better than the subgradient minimization. Do you mean it
was just faster or do you mean it found a better answer?
>> Umar Syed: It just seemed to be faster. And I can expand a little bit on that. So if
you were to just kind of do just regular subgradient minimization, what you would find
essentially is that -- so this approach does, you know, sort of one step all at once and
then finds the Q all at once and then finds the theta. And the subgradient approach
would kind of alternate between those two things. It would make a step of progress in the
Q direction, a step of progress in the theta direction, and it would just alternate between
those two. And it seems, at least on the examples that we tried, that doing things all at
once got you to the answer faster than doing a step of one first and then a step of the
other.
>>: You were saying that R often has many local minima [inaudible] R is convex.
>> Umar Syed: It has many minima but no non-global local minima.
>>: Oh, you're saying that they're contiguous or something?
>> Umar Syed: Yeah, yeah. So all these [inaudible] that I give you are convex.
>>: I mean so like your previous slide you're showing you like you know I mean as you
change alpha you're going to get closer to the optimum. But only in a sense that is
meaningful in terms of that regularizer. Right? Like I mean that theta, even though your
[inaudible] data is kind of arbitrary, right, has nothing to do with the really decision of the
problem and [inaudible] does that really mean anything? I guess it means something in
terms of the particular expression.
>> Umar Syed: Right, right. Yeah, I mean, so I would say a couple of things. One is that
I would appeal to -- there's a number of sort of online learning algorithms that play the
same trick, which is that their convergence guarantee is in terms of how strongly convex
the objective is. So we have the same issue I guess. So that is one thing I would say.
And the other thing I would say is that if you -- you know, if you have enough time, right?
So there's two terms here. There's alpha and there's also epsilon which is how accurate
each step is. So if you have enough time to get a very -- an order alpha approximate
solution, then this distortion that's imposed by increasing alpha is not going to hurt you.
Do you know what I mean?
>>: So I can think of [inaudible] like algorithms that give you non-convex [inaudible].
>> Umar Syed: Yeah.
>>: Yeah. So this is -- but you could still do epsilon subgradient in those cases, but I
guess there's no theorem that says how accurate you will end up [inaudible].
>> Umar Syed: Right. Right. I mean definitely. And I think that would -- yeah. And that
would still apply here. So I haven't even -- so this theorem applies to the meta algorithm;
how I'm solving each of these steps is not --
>>: [inaudible].
>> Umar Syed: Yeah. Go ahead.
>>: So is the alpha and the epsilon, are they coupled in some way so if you make it
steeper, the epsilon --
>> Umar Syed: No, they're --
>>: [inaudible] epsilon is an input in terms of added to the theorem.
>> Umar Syed: Yes, yes.
>>: But in practice is -- if you were to change the steepness of the R, then the effective
epsilon you had before you changed the alpha would be different?
>> Umar Syed: Yes. Yes.
>>: And so these two things are kind of coupled in a kind of an interesting way?
>> Umar Syed: Definitely. So they're independent parameters, but this theorem is telling
you that they had better be coupled -- they had better be the same order if you want the
performance.
>>: So I guess the question is when you could get them to be useful [inaudible].
>> Umar Syed: Right. Right. So --
>>: One or two layers.
>> Umar Syed: Right. So this epsilon governs how approximate each step of the
algorithm is. And typically that's just a matter of time. So the longer you run it -- I mean,
I'll explain this in a moment, but each one is just an optimization. So the longer you run
the optimization the more accurate it is going to be. Okay.
Okay. So there are these two steps, right? And I said if you solve each step pretty well
you get a good answer. But how do we solve each step? And it's not
clear that we've won anything yet, because I started with a minimax problem and step
one is a maximin problem, so how is that any better?
Well, what's interesting is for the loss functions that we're interested in, specifically log
loss and hinge loss, we can turn this into something that's much more friendly. And what
we do is we dualize, basically. We take the dual of the inner minimization.
And after I take its dual, for log loss I get this maximization. And so this is the entropy of
-- well, let me just say it.
So this is basically very similar to maximizing log likelihood. And you know the
dual of that is maximizing entropy. And so here I'm maximizing entropy subject to
these expectation constraints on the features. But now I'm not matching the empirical
distribution, I'm matching this worst case distribution Q. And this objective has a
nice, sort of well studied form, and we can use existing algorithms, like for
example exponentiated gradient descent, to solve this problem sort of right out of the
box. So that's step one. We dualize the inner minimization and we just use a
standard approach to solving the problem.
And then once I've found the worst case labeling Q star, I just plug it in here,
and now this is just normal maximum likelihood. And I can use whatever I want, for
example stochastic gradient descent, to find the best parameter for this worst case
labeling of the data.
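For step two, here is a minimal sketch of what maximum likelihood against the worst-case soft labels might look like, assuming binary labels, a feature matrix X, and q_star[i] giving the worst-case probability that example i is positive. This is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def fit_to_worst_case(X, q_star, alpha=0.1, epochs=50, lr=0.01, seed=0):
    """Step 2 sketch: fit theta by maximum likelihood, where q_star[i] is
    the worst-case probability that example i has label 1. Logistic
    regression (log loss) with L2 regularization, trained by SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-X[i] @ theta))  # model P(y=1 | x_i)
            # Gradient of the expected log loss under the soft label
            # q_star[i], plus the gradient of the L2 regularizer.
            grad = (p - q_star[i]) * X[i] + 2 * alpha * theta / n
            theta -= lr * grad
    return theta
```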
Okay. So now I'd just like to spend a little time talking about some experiments that
illustrate the features of our approach.
Okay. So we tried our algorithm on some image labeling tasks. We
compared our algorithm on both binary and multi-class classification
against some standard traditional semi-supervised learning approaches, and we did it
using label noise that is both simulated and also some label noise that we got by
submitting the data to labelers on Amazon Mechanical Turk.
So for the first set of experiments we tried it on a binary classification task. We
compared our algorithm to some semi-supervised learning variants of support vector
machines. The first is Laplacian SVM and the second is transductive SVM.
For our GAME algorithm, we used the regularizer that I talked about earlier: for each
example we give a set of possible labels that it might have, plus a Laplacian regularizer.
In all of the experiments that I'm going to report, the algorithm uses log loss for its
objective, but in the results I'll show you accuracy. Yes?
>>: [inaudible].
>> Umar Syed: That's right. Just one of the features.
So the first kind of label noise that we wanted to test against was an unhelpful labeler.
With an unhelpful labeler, what we're trying to simulate is a person who erroneously but
perhaps sincerely thinks that his best effort is labeling those examples that are
exceptions to the general rule. So this is the example that you wanted, right?
So as labels are requested, the examples are labeled in the order of outliers first -- more
precisely, in decreasing order of the number of neighbors who are of a different class.
So this is one of the noise models, outliers first. It's a very confusing kind of noise.
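A minimal sketch of this outliers-first ordering, assuming a k-nearest-neighbor notion of "neighbors" with Euclidean distances (the talk doesn't specify k or the metric):

```python
import numpy as np

def outliers_first_order(X, y, k=10):
    """Order example indices by decreasing number of k nearest neighbors
    that belong to a different class -- outliers are labeled first."""
    n = len(y)
    # Pairwise Euclidean distances; ignore self-distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    disagreements = np.empty(n, dtype=int)
    for i in range(n):
        neighbors = np.argsort(d[i])[:k]
        disagreements[i] = np.sum(y[neighbors] != y[i])
    return np.argsort(-disagreements)  # most disagreeing neighbors first
```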
We tried this on a couple of different image datasets. The first is the Columbia Object
Image Library. This is a dataset of images of household objects. And the second is a
dataset of EEG brain scans. These are both datasets that are part of a standard
semi-supervised learning benchmark.
And to give you an idea of what an outlier in these datasets is, here is the most outlierish
example in the Object Image Library dataset. So these two classes are medicine cartons
and toy cars. And on the right circled in red, although it doesn't look like it, this is actually
a picture of a toy car. It just looks like a medicine carton because of the strange angle of
the image. So this is the first outlier [inaudible].
>>: [inaudible] heavy encode [inaudible] in a car. [inaudible] algorithm ceases the
[inaudible] function, right?
>> Umar Syed: Oh, yeah. So I'm going to reveal the labels
of, say, 10 examples, and that will be a trial. And then I'll reveal the labels of, you know,
15 examples, and that will be a trial. And so on. And so I'll show you how the algorithms
perform.
>>: Algorithm just -- just with different offers?
>> Umar Syed: Different offers, right. And I'll show you how they're performed as more
and more information is revealed. And what I'm telling you now is how is that information
revealed. So -- in the trial that reveals the least amount of information I just show you the
outliers, and the trial that reveals somewhat more information I show you those plus a
few more and plus a few more.
>>: So you reveal a subset how [inaudible] then epsilon of R your best Q? So let's say
in this example you just [inaudible] toy car. Right? So how is R defined. So for that
being toy car.
>> Umar Syed: Yes.
>>: And that [inaudible] half of all possible R, that's the domain, and that's zero and then
the other half is infinite, oh, because you're not lying.
>> Umar Syed: I'm not lying, yeah.
>>: The truth is in that --
>> Umar Syed: Yeah. I always reveal the correct label.
>>: I see.
>>: [inaudible].
>>: Okay.
>> Umar Syed: Right. So this is kind of what the data was like.
And so here -- okay. So here we are on -- this is actually the brain-computer interface
dataset. So on the X axis I have how much of the training set I have revealed or
labeled. And again, this is in order of hardness, so the hardest first and then so on. And
I should point out this is a very challenging kind of noise. So this is why the accuracy
levels are fairly low. But generally speaking the GAME algorithm does better than the
other methods, although as the amount of the training set that is revealed increases their
performance tends to converge.
>>: [inaudible].
>> Umar Syed: I didn't -- not necessarily. I guess I didn't -- I didn't check explicitly but
we didn't enforce that in any way.
>>: [inaudible] do you know what the [inaudible] is on this with the -- what is the fraction
--
>> Umar Syed: It's 50/50.
>>: It is 50/50?
>> Umar Syed: Yes.
>>: Okay.
>>: [inaudible] trials what changes? So like you have variance, right.
>> Umar Syed: Yes. Oh, I mean so for example -- oh, so we are subsampling a fraction
of the data, yeah. So that's where the noise comes from.
>>: So you subsample then rank based on outlier [inaudible] --
>> Umar Syed: That's right. And we have something similar on the image library
dataset. The gap is not as big here. But again I would point out that at the very low
levels of the training set being labeled our algorithm does do somewhat better.
>>: [inaudible].
>> Umar Syed: Excuse me?
>>: [inaudible].
>> Umar Syed: Oh, yeah. You mean the examples that are labeled are just chosen
uniformly.
>>: Yes.
>> Umar Syed: Not in this experiment but in a later one I do that.
>>: [inaudible].
>> Umar Syed: Yeah. Yeah. For sure.
>>: [inaudible] accuracy from a [inaudible] at least 90 percent.
>> Umar Syed: Yes. So [inaudible].
>>: It's twisted. This is [inaudible].
[brief talking over].
>>: [inaudible].
>>: Yeah, that's right. This is almost but not quite lying [inaudible].
>>: If only choose the worst part of that because [inaudible] a much better [inaudible].
>>: It's not the [inaudible] boundary, right?
[brief talking over].
>>: [inaudible] the other side of the boundary.
>>: Yeah. It's errors. He's labeling almost like errors.
>> Umar Syed: So okay. So I do show some results for uniform sampling on a different
dataset, so maybe [inaudible].
>>: In this presentation you mean?
>> Umar Syed: In this presentation, yes, in just a little bit.
So in the second set of experiments we tried it on a multi-class dataset. And here we
compare to some semi-supervised algorithms that try to infer the missing labels
using a kind of maximum likelihood approach.
And here the kind of noise we wanted to test against was a labeler who is reluctant to
label examples that seem to be ambiguous. So here the labeler is going to label
examples on the border between two classes last. As the number of examples whose
labels are requested increases, the labeler labels examples in increasing
order of their distance from the centroid of the class cluster. That's the precise definition
that we're using here. But generally speaking this labeler is avoiding the border regions,
which are the harder examples. So this is a guy who is reluctant to label hard examples.
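A sketch of that ordering, assuming class centroids in the same feature space (again a hypothetical rendering of the stated definition):

```python
import numpy as np

def centroid_order(X, y):
    """Order example indices by increasing distance from their own class
    centroid, so central (easy) examples are labeled first and border
    regions last."""
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    dist = np.array([np.linalg.norm(X[i] - centroids[y[i]])
                     for i in range(len(y))])
    return np.argsort(dist)  # closest to centroid first
```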
And we tried this on a dataset called Faces in the Wild. This is a dataset of facial
photographs of public figures. The features are the eigenfaces.
And just to give you an example of what a border region between two classes looks
like, these are the two classes in our dataset that had the largest border region by our
definition: Arnold Schwarzenegger and -- this is the former German chancellor --
Gerhard Schröder. And they kind of look alike, right? I think they do.
Okay. So here are the results. And again this is a 10-class example. And again we're
reporting accuracy. And I would say over the range of the fraction of the dataset that's
labeled we're showing small but consistent improvement over the existing algorithms.
>>: [inaudible] is more surprising to me. Like I mean it [inaudible] it will be interesting to
[inaudible] results in that one too. Because there you know kind of like you're essentially
giving it the good sort of labels, things that are squarely in the midst of the classes, right?
And take the [inaudible] you hope would kind of grow out in the right sort of way.
>> Umar Syed: Right.
>>: What's your intuition [inaudible] so it's not really such an adversarial solution, right?
>> Umar Syed: Right. Right. Right. Yeah, I think what the game algorithm is basically
doing is hedging, right? If you think of like EM, right? So what the EM approach
basically does is it's trying to infer the missing labels. And EM can get caught in a local
minimum. And what EM does is it doesn't hedge at all. It guesses sort of wildly -- I
mean, I'm [inaudible] but, you know, by getting caught in the local minimum it's
sort of being very optimistic. It's guessing some labels for the examples that are missing
and then finding that, in fact, it can fit those quite well, whereas our algorithm hedges. If
there's uncertainty about --
>>: So you're saying the EM approach is over confident [inaudible] treats the problem
[inaudible].
>> Umar Syed: Exactly. And the way the data is labeled here, the examples that are
missing are exactly the ones that you should be hedging on.
>>: [inaudible].
>> Umar Syed: Yeah, well, you know, so a bunch of them are like synthetic, right? So
like I don't know, at least three or four of them are like synthetic. And then we showed
you two.
>>: [inaudible].
>> Umar Syed: Yeah.
>>: [inaudible].
>> Umar Syed: Which ones [inaudible].
>>: [inaudible].
>> Umar Syed: Okay.
>>: That's very hard [inaudible].
>>: Is that the one that has a lot of classes?
>>: Pardon me?
>>: Does it have a lot of classes?
>>: [inaudible].
>>: But a lot of features.
>>: The [inaudible].
>> Umar Syed: Yeah. Right. So we chose those two datasets out of the eight real ones.
We had some real practical considerations, like too many features. I don't
remember there being -- I don't know. I can't say -- I don't remember the exact
reasons why we selected the ones that we did.
Okay. And finally we wanted to try it on some label noise that comes from real
people and not from simulation. So from our Faces in the Wild dataset, we
selected a set of photographs for two celebrities who have a lot of photos in the
dataset, Laura Bush and Jennifer Aniston. And rather than showing these pictures to the
labelers one by one, as is typically done in a labeling scenario, we wanted to simulate
something more realistic, something more like the motivating examples I gave you in the
beginning of the talk. I'm sorry, did you have a question? Something more like the
motivating examples I gave at the beginning of the talk.
So we actually showed them all the pictures simultaneously in a sort of tile format, which
is here. Like this. And we said label any five images you like, exactly five images. And
so here what we're trying to test is, you know, we want the users to select pictures in the
context of seeing other pictures, and we want them to use their own biases and methods
for selecting pictures to label whatever they like.
>>: Can you tell them what the purpose of it was by giving -- I'm just curious because I
mean you admit that earlier you had set this thing about like well, here's a user who, you
know, thinks they're being helpful while they're getting these really hard ones or
something, so I'm wondering if you tried to motivate them by saying hey you're helping a
learning algorithm so do the right thing.
>> Umar Syed: Right. Right. No, no, no. So this is the entire task.
>>: I see so [inaudible].
>> Umar Syed: All the guidelines are there, right. And --
>>: I'm really curious to see your analysis of how many people just pick the first five.
>> Umar Syed: Oh, yeah. I'm going to show you that in a second.
>>: All right. [laughter].
>> Umar Syed: Not quite that, but something like that.
>>: Okay.
>> Umar Syed: So this is the entire task. I think that you have to give it a title, and I just
said "image labeling easy." That was the title. Easy meaning easy money. I was trying
to attract people. [laughter].
So this is what it looks like. And the first thing to note is that there is indeed strong
bias in how people choose pictures to label. So here I've ordered the pictures
from most labeled to least. And the most labeled picture, I think, is about three times
more likely to be labeled than if the pictures were labeled at random.
>>: [inaudible].
>> Umar Syed: This is not -- this is not presentation order.
>>: No, no, [inaudible].
>> Umar Syed: This -- the presentation order is fixed.
>>: Is it fixed.
>>: But is that picture number the presentation order?
>> Umar Syed: No, it's not.
>>: You've just sorted it.
>> Umar Syed: I just sorted it.
>>: So you know what the [inaudible] is with the --
>> Umar Syed: Yeah, I'll show you.
>>: [inaudible] that picture chosen.
>> Umar Syed: This is just arbitrary. [inaudible]. So this is to demonstrate that there are
biases. And let me pull out two kinds of biases. So the first is that Jennifer
[inaudible] [laughter]. There were an equal number of each kind of picture, but Jennifer
Aniston was much more likely to be labeled. And the second kind of bias is, I think, what
you're getting at, which is location on the webpage. So --
>>: [inaudible]. Stronger than that.
>> Umar Syed: Well it might be influenced by the fact that the submit button is at the
bottom of the page. So we think that people click on a bunch of pictures on the top.
>>: [inaudible] labeled more than five.
>> Umar Syed: Oh, so if someone -- that didn't usually happen, but if someone labeled
more than five or they made a mistake I just discarded the trial. So I'm not interested in
that.
>>: [inaudible].
>>: I mean essentially 50/50. 50 percent is [inaudible] 50 percent not [inaudible].
>> Umar Syed: It seems like you're saying basically we think that people clicked on
some things, scrolled to the bottom, clicked on some things and submit. Go ahead.
>>: Did people mostly label one person or would they do both? Like would they do just
pick out all the pictures of one person [inaudible].
[brief talking over].
>>: I mean it doesn't matter but I'm just curious.
>> Umar Syed: So here's the performance if you run the SVM variants and our
algorithm on this dataset. These are box plots over all the trials. So not only is our
algorithm doing somewhat better on accuracy --
>>: [inaudible] one labeler here.
>> Umar Syed: One labeler, yes. These are box plots. So this is giving you -- the
whiskers are the range, and the top and bottom of the
boxes are the 75th and 25th percentiles of the accuracy. So we're getting somewhat
better accuracy and also a tighter range.
And maybe to address what you were saying earlier: what if this data was just labeled
uniformly at random, rather than in this biased fashion? When we tried just labeling the
data uniformly, the advantage of our algorithm basically goes away.
>>: [inaudible].
>> Umar Syed: Yeah. So we did some cross-validation.
>>: [inaudible].
>>: [inaudible].
>> Umar Syed: I think it was normalized. When you say normalized, you mean -- so we
take the distance and we exponentiate the distance, and then we have this scaling
parameter that kind of controls the dynamic range. That's the kind of Laplacian
[inaudible].
>>: [inaudible] divide by the square root [inaudible].
>>: [inaudible].
>>: Of each node.
[brief talking over].
>>: Do you think that the reduced variance [inaudible] or do you expect that or --
>> Umar Syed: So, I think it's intuitively meaningful, but I have no theoretical results to
explain why. So the reason it's intuitively meaningful is that, you know, for example,
unlike the transductive SVM it's a convex problem. It's maximizing a globally convex
objective -- the transductive SVM can get caught in a local minimum. So that's a partial
explanation.
>>: [inaudible].
>> Umar Syed: Yeah.
>>: [inaudible].
>> Umar Syed: Excuse me?
>>: [inaudible].
[brief talking over].
>> Umar Syed: Okay. So that concludes the semi-supervised portion of the talk. If I
have time, I don't know if I do, I can briefly talk about some of the work I've done on
bandits.
>>: [inaudible] but people may have to leave.
>> Umar Syed: Okay. All right. So in my PhD and also my post-doc I've worked on
some bandit algorithms for sponsored web search. I'll describe a couple of the projects
that I've worked on just very briefly.
The formulation of the problem is a contextual bandit problem. So what is a contextual
bandit problem? Well, it's a decision making problem that proceeds over
several rounds. In each round one is supposed to choose one of several actions. And
depending on the action that you choose, and also an exogenous context, you receive
some payoff. So you can imagine walking into a casino and you're faced with all these
slot machines. You choose one slot machine to play each round -- and then this is where
the analogy falls apart: there's some context, and both the context and the
choice of slot machine you made affect your payoff.
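In code, the interaction protocol looks roughly like this. The policy here is a toy per-context epsilon-greedy learner, used only to make the sketch self-contained; it stands in for whatever contextual bandit algorithm is actually being run.

```python
import random

class EpsilonGreedyPolicy:
    """Toy per-context epsilon-greedy policy, purely for illustration."""
    def __init__(self, actions, epsilon=0.1):
        self.actions, self.epsilon = actions, epsilon
        self.counts, self.sums = {}, {}   # (context, action) -> statistics

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.choice(self.actions)   # explore
        def mean(a):
            n = self.counts.get((context, a), 0)
            return self.sums.get((context, a), 0.0) / n if n else 0.0
        return max(self.actions, key=mean)       # exploit

    def update(self, context, action, payoff):
        key = (context, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        self.sums[key] = self.sums.get(key, 0.0) + payoff

def contextual_bandit(observe_context, payoff, policy, rounds):
    """Generic contextual bandit protocol: each round, observe a context
    (e.g. a query), choose one action (e.g. an ad), and see the payoff
    (e.g. a click) for only the chosen action."""
    total = 0.0
    for _ in range(rounds):
        context = observe_context()
        action = policy.choose(context)
        r = payoff(context, action)
        policy.update(context, action, r)
        total += r
    return total

# Hypothetical usage: two queries, two ads, fixed click-through rates.
rates = {("q1", "ad1"): 0.3, ("q1", "ad2"): 0.1,
         ("q2", "ad1"): 0.05, ("q2", "ad2"): 0.2}
ctx = lambda: random.choice(["q1", "q2"])
pay = lambda c, a: 1.0 if random.random() < rates[(c, a)] else 0.0
print(contextual_bandit(ctx, pay, EpsilonGreedyPolicy(["ad1", "ad2"]), 10000))
```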
So why -- so, you know, where --
>>: [inaudible].
>> Umar Syed: The context is fully observed. And this learning problem was formulated
and motivated by sponsored web search, because it is an excellent model for
sponsored web search. In this model the actions are the
advertisements that you can choose to display to a user coming to a search engine. The
context is the query that the user input to the search engine, as well as anything else on
the search results page. And the expected payoff is the click-through rate: the
probability that a user is going to click on the ad given that that's the query they
searched for and everything else.
Okay. So we've addressed two kinds of challenges in these contextual bandit
problems. The first is that in any realistic setting the number of actions or the number of
contexts is going to be very, very large, because that's the number of ads and the
number of queries, and of course that's enormous. And the second challenge that we
address is that click-through rates are usually not stationary; rather, they can change
over time. So I'll describe a solution for each of these problems briefly.
So the click-through rate of any ad is a function of both the features of the ad and the
features of the query. And so here we have, you know, a search, tickets to Seattle, and
here's an ad, you know, flights to New York. So we want to estimate what is the
click-through rate when this ad is presented in response to this query.
And the basic problem is that we can't just use counts because the number of distinct ads
and queries is enormous. So we have to do something to generalize.
And so our approach -- and I just have one slide explaining our approach -- is that the
functional relationship between the click-through rate and the ad
and the query decomposes in some nice way over the words that are in the ad and that
are in the query.
So we have this graph, and we have, you know, a node in
this graph for the words in the query and also for the words in the ad. And we place an
edge between two words if we think the click-through rate depends jointly on those two
things. So for example the click-through rate is going to depend both on the
fact that there was the word tickets in the query and the word flights in the ad and -- I'm
sorry. Go ahead.
>>: So are you saying this edge is an input, you are to determine where the edges are or
the algorithm [inaudible].
>> Umar Syed: This is input [inaudible].
>>: Okay. Oh, I see.
>> Umar Syed: So, yes, this is all prior information about the problem that is being coded
by us, the algorithm designer.
>>: [inaudible] just say the edge exists.
>> Umar Syed: Yes, that's right. You don't know the weight, you just know the edge
exists.
And likewise the fact that Seattle is in the query and New York City is in the ad has some
effect on the click-through rate -- probably a downward effect. And the fact
that the word cheap is in the ad, this graph is indicating that that has some effect
that's independent of what was in the query.
>>: [inaudible] have multiple edges connecting them and cycles and so on? Do you
assume that the decomposition is always just in terms of edges or -- if you have more
complicated [inaudible].
>> Umar Syed: So here's what I'm really assuming: I assume that the
function decomposes into a bunch of terms, and each term depends on only a few words.
And then if two words appear in a term, I put an edge between them.
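A sketch of that assumed decomposition: a payoff score that is a sum of per-edge terms, where an edge fires when all of its words are present. The edge set is designer-supplied prior knowledge, as stated above; the weights are what must be learned, and the specific numbers here are invented for illustration.

```python
def ctr_estimate(query_words, ad_words, edge_weights):
    """Hypothetical decomposed payoff: the (score underlying the)
    click-through rate is a sum of terms, one per graph edge, where an
    edge fires when all of its words appear in the query or ad.
    edge_weights maps a frozenset of one or two words to a weight."""
    words = set(query_words) | set(ad_words)
    score = 0.0
    for edge, w in edge_weights.items():
        if edge <= words:        # every word on this edge is present
            score += w
    return score

# Example: a positive edge for tickets/flights, a negative edge for
# seattle/new york, and a single-word term for cheap.
weights = {
    frozenset({"tickets", "flights"}): 0.8,
    frozenset({"seattle", "new york"}): -0.5,
    frozenset({"cheap"}): 0.3,
}
print(ctr_estimate(["tickets", "seattle"],
                   ["flights", "new york", "cheap"], weights))
```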
>>: So you can have a full factor [inaudible] where each term would be a factor and
[inaudible] a number of words from either --
>> Umar Syed: That's right.
>>: [inaudible] so here it might as well be a full two sided graph, right? So -- right? You
mean you would like to just be -- you would like to assume that you don't need -- you
need to look at both terms like, you know, cheap flights.
>> Umar Syed: Yes.
>>: But cheap could be connected to any game term and so it could be all the links
[inaudible].
>> Umar Syed: That's right. That's right. Any edge could be in that graph. And then the
regret and the complexity of the algorithm will depend on how sparse the graph is.
>>: [inaudible] assumption is mostly around the fact that there is two sided?
>> Umar Syed: No. So I made it two sided to make it look good, but you could
have edges between two things on the same side. So you're going to have a graph, it
could look like anything you'd like, and then the regret and complexity of the algorithm
will depend on terms that perhaps you guys are familiar with, things like the tree width of
the graph and the degree of the nodes in the graph or --
>>: Can a single thing still be involved as the [inaudible] to the [inaudible] function as
well?
>> Umar Syed: No, they can be.
>>: Oh, they can?
>> Umar Syed: Yes. And so this is exactly the kind of assumption that we make in
graphical models by encoding independence assumptions about a distribution. But here
we're --
>>: [inaudible].
>> Umar Syed: That's right. That's right. But you can -- so you can take your factor
graph and then you can make it a graph by just adding -- you know, adding edges
between everything in the same factor. And we like that representation better because
our bounds are in terms of properties of that graph, not the factor graph.
And so here I'm just repeating what I said a moment ago. We have an algorithm that
exploits this structure of the payoff function, and this algorithm -- it is a
contextual bandit problem -- selects an action to take in every round.
And both the running time and the regret of this algorithm are going to be polynomial as
long as this graph structure is sparse -- more precisely, if it has low tree width and low
degree. Go ahead.
>>: I'm sorry to keep asking all [inaudible] the graph thing. I want to understand if I
understood what they were saying correctly. Can you go back one?
>> Umar Syed: Yeah, you bet.
>>: Can you have a factor like if I think it's important to have cheap flights and New York
City all at the same time or not at all, I can express that in your [inaudible].
>> Umar Syed: You can. You can.
>>: And how would you split that out into --
>> Umar Syed: So that would be a [inaudible].
>>: That would be a [inaudible]. Right. So you're allowed to, okay.
>>: [inaudible].
>> Umar Syed: Okay. And so that's all I was going to say about that. This is obviously
kind of a quick tour.
>>: [inaudible] result. You proved polynomial on tree width; is that right, empirical result
or something --
>> Umar Syed: [inaudible].
>>: Okay.
>> Umar Syed: But to the best of our knowledge, this is the first end-to-end
algorithm for contextual bandit problems where the number of contexts and the number
of actions can be very large. And by end-to-end I mean that we don't make any
assumptions about access to an oracle that can solve some problem that maybe we
can't solve. This is sort of a full solution, and it exploits the structure of the graph.
>>: [inaudible] right, you can have arbitrary properties of either and just define the
[inaudible] any way you want?
>> Umar Syed: Sure. You bet.
>>: Okay.
>> Umar Syed: So the second kind of problem that we wanted to deal with in contextual
bandits is that these click-through rates typically are not fixed over time; they change,
right -- the rate at which people click on things changes over time. And in the study of
bandit algorithms, when you're talking about the payoffs changing over time, there are
usually two kinds of extreme assumptions that people make. One is stationarity:
that these payoffs, these click-through rates, are fixed for the entire duration of the
interaction. And the second is the other extreme, which is that these click-through rates
can change every single round, and they can change even adversarially, right?
And so for us, these assumptions are either too weak or too strong. And in our view it
seemed like a more realistic assumption is something that's kind of in the middle, which
is that for many queries click-through rates can change abruptly but only
rarely. So let me give you an example. Before October 2009 the query
balloon boy, if you were to input it to any search engine, would give you a company
called Balloon Boys, which makes rooftop inflatable advertisements. You've probably
seen these things.
But in October 2009, if you guys were watching the news around that time, there was this
kid in Colorado, I think, who went up into the sky on this sort of homemade
balloon.
>>: [inaudible].
>> Umar Syed: Or no, he didn't. Right. [laughter]. We thought he did.
>>: This is [inaudible] half of the change, right, but it seems like there's a -- like the huge
other factor, which is seasonal effects, and some things where a ton of queries change
slowly but they're all changing every day.
>> Umar Syed: Yeah, you're right. You're right. So this -- this work is focused on the
kinds of queries that change abruptly and suddenly, not the ones that follow a kind of
rhythm. You're right. This is not well suited to that.
>>: You can apply what you did to that or is it -- it's really targeted towards the
[inaudible].
>> Umar Syed: We did. And we found it didn't work so well for the cyclical.
>>: Okay.
>> Umar Syed: It's really designed for [inaudible]. All right. So what's interesting is that
the meaning of this query changed very abruptly, and therefore the associated
click-through rates changed very abruptly. But the change
was associated with all kinds of other signals that were indicating that a change had
happened; for example the volume of the query spiked, the occurrence of the query in
news articles spiked. And so there were lots of indications that something had happened
to the query.
And so our approach -- and this is work I did while I was at Search Labs in Silicon Valley
for one summer -- is that we combined an online prediction algorithm with a bandit
algorithm.
And so the prediction algorithm uses these contextual signals like the volume of the
query, its occurrence in news articles and so forth to predict whether or not the meaning
of the query has recently changed and, therefore, whether it's associated click-through
rates might have dramatically changed.
And then that prediction algorithm outputs a prediction. And then our bandit algorithm
takes that prediction as input and decides whether or not it should modify its behavior
accordingly.
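A schematic of the combination is below. All of the interfaces are hypothetical stand-ins, and the "forget" step is a simplification; the actual way the bandit modifies its behavior in this work is more refined than a blunt reset.

```python
def run_with_change_detection(signals, predict_change, bandit, show_ad, rounds):
    """Couple an online change predictor with a bandit algorithm. Each
    round, read contextual signals (query volume, news occurrences, ...);
    if the predictor says the query's meaning likely changed, tell the
    bandit so it can discount stale estimates and re-explore."""
    for _ in range(rounds):
        s = signals()                  # e.g. volume spikes, news mentions
        if predict_change(s):          # online classifier's prediction
            bandit.forget()            # hypothetical: decay old estimates
        ad = bandit.choose()
        clicked = show_ad(ad)          # observed payoff for the chosen ad
        bandit.update(ad, clicked)
```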
>>: [inaudible] the optimal [inaudible] when looking [inaudible] computer program?
>> Umar Syed: Yes, this is the algorithm that knows the option with the highest
click-through rate, and it just offers the best [inaudible].
>>: But it's even sort of super optimal, it's not even assuming that it's a constant
[inaudible].
>> Umar Syed: Right, right. This is not the best single ad in hindsight or anything, this is
every round.
>>: [inaudible].
>> Umar Syed: Yes. Well, so it knows -- so there's all these options, and they each have
some click-through rate. It knows which is the best one.
>>: No, but on what period [inaudible] if you [inaudible].
>> Umar Syed: Oh, I see. So [inaudible] granularity is [inaudible].
>>: Oh.
>> Umar Syed: So we know that today's [inaudible].
>>: Why wouldn't the best one in hindsight be any different than [inaudible].
>> Umar Syed: Because the best one is changing in this experiment.
>>: But if you're only giving an answer for that day's --
>> Umar Syed: Yeah, yeah, we're only giving it for that day. But I think John was asking
are we [inaudible].
>>: [inaudible] it's usually the best constant algorithm, so you would average the whole
click-through, the whole dataset but that's not [inaudible].
>> Umar Syed: [inaudible].
>>: Okay.
>> Umar Syed: [inaudible]. Anyway, we find that using this online classifier
[inaudible] and then modifying the bandit algorithm to use this information and take
advantage of it can result in some dramatic improvements in the regret.
And that's all I had. Thank you very much.
[applause]