>> Chris Burges: So we're delighted to have Yisong Yue here today with us.
Yisong was my intern in 2007. He was awarded a Microsoft Research fellowship
after that. He's finishing up his PhD with Thorsten Joachims in Cornell. And
today he's going to talk about a new learning framework for information retrieval.
>> Yisong Yue: Thanks, Chris. And thank you all for having me. Today I'll be
talking about new learning frameworks for information retrieval. And it's joint
work with my advisor Thorsten Joachims at Cornell.
So we are living in an increasingly data driven and data dependent society, and as a result, having the ability to efficiently and effectively manage, organize, and leverage this growing sea of information is becoming correspondingly more important over time.
And although when we first think of information systems, many of us think of things like Web Search services, in reality many of the services that we use and interact with today provide, amongst other things, a way for us to manage, leverage, and retrieve information of different sorts.
And although this is by no means an exhaustive list, I think it's clear to see that information systems are useful and can be applied to any domain that needs to manage a growing amount of digital content. And I would argue that that is every domain that is important to society today.
So this is a machine learning talk. And I am a machine learning researcher. And
one of the reasons I'm even here giving a talk is due to the rising popularity in
applying machine learning to designing models for information systems.
The benefits are that machine learning can allow us to optimize models with an enormous number of parameters, ranging from thousands to millions to sometimes even billions of parameters. And these techniques have been successfully employed in developing many of the information systems that we use today. However, the standard techniques and the conventional learning algorithms that people typically use are limited, and here are two ways that I think are pretty important.
First of all, conventional machine learning techniques make restrictive independence assumptions. So, for example, when training a model for a ranking function, you assume that the ranking function will compute the relevance of individual results independently of other results.
The second limitation is that when deploying these machine learning algorithms, they typically require a fairly expensive pre-processing step of gathering a labeled dataset, and this often requires fairly extensive and expensive human annotation. And this inherently limits the scope and scale to which these algorithms can be deployed across many different ranges of applications.
So in this talk, I'll be discussing my research, which addresses both of these
issues. In the first half of this talk, I will be describing a structured prediction
approach for modeling interdependencies. You could think of this as relational
data.
In the second half of this talk, I'll be describing an interactive learning approach, where the idea is that you want the system to learn interactively with the environment, in this case a population of users, using feedback from the environment. So feedback provided by the users.
And these algorithms and methods were designed with information systems in
mind. And I think that information retrieval is an important application area. But I
think you'll see that these algorithms also motivate new applications and are
themselves fundamentally new machine learning models.
All right. So here is a quick three slide introduction to structured prediction by
way of examples. Perhaps the simplest interesting example of structured
prediction is first order sequence labeling or part-of-speech tagging in this case.
So here the idea is, given a sequence of words X, we want to predict the sequence of tags Y, which are the part-of-speech labels.
So, for example, for the sentence the rain wet the cat, we want to predict determiner noun verb determiner noun. And here the dependencies come from the transitions between tags. So we care not only about how often, say, the word rain maps to noun as opposed to verb, but also how often a determiner transitions to a noun in an English sentence.
A more complicated example in natural language processing is, given a sequence of words X, to predict the entire parse tree Y. So here, for the sentence the dog chased the cat, we want to predict this entire parse tree, where the structural constraints come from how often the sentence node decomposes into a noun phrase and verb phrase, that noun phrase decomposes into determiner noun, and so on and so forth.
In information retrieval, given some query X, we want to predict a set of retrieval results Y, typically expressed as a ranking. And here we can think about dependencies between the results. So, for example, we might want to model the redundancy between documents that we're considering retrieving, in order to reduce that redundancy. And that's what I'll be talking about today, diversified retrieval.
Now to motivate diversified retrieval, I want to quickly tell you a really short story
about Misha Bilenko, the curious high school student. So, you see, Misha, he's
heard about this magical field of machine learning, but he doesn't know too much
about what it is, and so he's really curious. And so what he's going to do, he's
going to go on his favorite search engine, type in the query words machine
learning, maybe just read the top five documents and just to get an idea of what
machine learning is all about.
So here are the top five I pulled from my favorite search engine, which shall
remain nameless. And, you know, they are relevant, right? I mean, they're all
individually relevant. But they're all kind of the same, kind of redundant, right?
Because, I mean, we have two results for machine learning textbooks, we have a
result for the machine learning journal, which is just an obnoxious precursor to a
textbook really, we have this AI topics machine learning link, which is very similar
in spirit to the Wikipedia article. So it doesn't really provide a very comprehensive and cohesive view of all the concepts of machine learning on the Internet.
So what if instead, when Misha typed in machine learning, he saw these results instead? Well, look. We have the Wikipedia article link. We have a link to the International Machine Learning Society, which is the premier society for machine learning researchers. We have a link to an industry lab doing machine learning. We have one link to a textbook. And if Misha was interested in looking at video lectures, we have a link to the video lectures page as well.
So I'm not claiming this is the perfect set of five results -- I don't even know what perfect exactly means here -- but it clearly provides a more comprehensive view of all the different subtopics or concepts related to machine learning that you can discover on the Internet.
All right. So there's been a lot of interest in diversified retrieval, ranging from the seminal paper in 1998 by Carbonell and Goldstein onwards. And in all this previous work, the researchers who conducted these studies noted that optimizing for diversity requires modeling inter-document dependencies. And this is, again, impossible with the standard independence assumptions of the models that we typically think about training in machine learning.
And so the solution I'll be presenting today is a coverage based approach. And
I'll present a structured prediction learning algorithm for training this model.
All right. So here is an abstract depiction of the prediction problem. You could think of this entire region as the space of information relevant to some query. These shaded regions are the different documents for this query. The area covered by a region is the relevance of the document, so bigger is more relevant. And the overlap between documents is the inter-document redundancy, the redundant information covered by both documents.
So in the conventional model, which evaluates the relevance of each document independently, we would first select D3, because it's the most relevant. Then we would select D4, even though, conditioned on selecting D3, it is almost completely redundant, because it's the next most relevant independently. And then we would select D1.
In the coverage based solution, we would first select D3. Then, conditioned on selecting D3, we would select the next most relevant document, which would be D1. And conditioned on both D1 and D3, we would then select D5, because it covers the most remaining uncovered relevant information. So this is a coverage based approach.
So the idea here is to view diversified retrieval as a coverage problem. And if we had the quote/unquote perfect representation of information, then we could just deploy the greedy algorithm, which I just described in the previous slide, to compute a good coverage solution.
The challenge here is to discover what a good representation of information is. And that's the learning problem. But the good news is that once we learn a good coverage representation, we can make good predictions on new test examples.
And this is a supervised learning approach, because we'll be requiring manually labeled subtopics as training data. So it doesn't address the second limitation that I talked about at the beginning. But I'll get into that in the second half of this talk.
And I'll be presenting an algorithm with theoretical guarantees, and also a software package that you can play with yourself if you're interested.
Okay. So how do we represent information? Well, there are many different ways. Perhaps the first thing you could think of is just basically all the words in the candidate documents we're considering retrieving. That's sort of the lowest level or rawest form of information. We could get a little bit fancier and think about things like the title words, the anchor text, meta text, and so on and so forth.
We can think about representing information at a perhaps somewhat higher level by thinking about cluster memberships of the documents, how they cluster. So maybe topic models or dimensionality reduction techniques like LSI.
We could also even use external taxonomy memberships like the ODP hierarchy.
In this talk we'll be just focusing on the words or the lowest level raw
representation of information.
Okay. So intuitively speaking, if we selected a set of documents that covered as many distinct words as possible, then we're covering as much information as possible. Now, of course, not all words should be weighted equally. Some are obviously more important than others. So if this is our representation, then we need a good weighting function.
But suppose we had a good weighting function. Then we could formulate the prediction task as selecting K documents which collectively cover as many distinct weighted words as possible. We could again deploy the greedy algorithm, and these types of coverage based problems are known as submodular optimization problems, so the greedy algorithm actually has good performance guarantees.
Again, the learning goal in this specific formulation is to learn a good weighting function.
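As a minimal sketch of that greedy selection, assuming each document is represented as a set of words and that word_weight is some given weighting function (both are illustrative assumptions, not the exact setup from the talk):

```python
def greedy_select(docs, word_weight, k):
    """Greedily pick k documents maximizing weighted distinct-word coverage.

    docs: list of sets of words (one set per candidate document).
    word_weight: function mapping a word to its (non-negative) benefit.
    """
    selected, covered = [], set()
    remaining = dict(enumerate(docs))
    for _ in range(k):
        best_i, best_gain = None, 0.0
        for i, words in remaining.items():
            # Marginal gain: total weight of words this document adds
            # beyond what the selected documents already cover.
            gain = sum(word_weight(w) for w in words - covered)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:
            break  # nothing adds new weighted words
        selected.append(best_i)
        covered |= remaining.pop(best_i)
    return selected
```

Because the coverage objective is monotone submodular, this greedy procedure carries the 1 minus 1 over e approximation guarantee discussed below.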
Okay. So how do we weight words? Well, surely not all words are created
equal, right? The word the is probably almost close to meaningless -- yeah?
>>: I have a question. You were saying that one [inaudible] approximation. So
this means that the number of documents you need to cover all the words would
be ->> Yisong Yue: No, no. So the objective function -- the objective function is the
total number of weighted words you've covered. That's the objective function.
>>: With a fixed K?
>> Yisong Yue: With a fixed K. So the 1 minus 1 over E approximation means the greedy algorithm is competitive, with a 1 minus 1 over E bound, against the optimal set of K documents. Yeah?
>>: So your notion of overlap is overlap on words [inaudible].
>> Yisong Yue: That's how we represent the information. So, yeah.
>>: So what if I [inaudible] take your example and Misha wanted to know all the
textbooks [inaudible]. There's going to be significant overlap. So you're not
taking into account intent, you're assuming that a diverse result [inaudible] is the
best?
>> Yisong Yue: Right. So it's certainly not the case that you want to diversify in
this fashion on all possible queries. What we're exploring here is suppose you
did, right? Suppose you are interested in designing models that retrieve based
on diversity of the set of relevant documents.
>>: [inaudible].
>> Yisong Yue: I'm sorry?
>>: Okay.
>> Yisong Yue: And then suppose we wanted to build models to do that. How do we think about designing these models? And this is one way to do so.
And if Misha had typed in the query machine learning textbooks, you don't want to return the same kind of textbook, right? Maybe you want to diversify on textbooks.
So how do we weight the words? Well, not all words are created equal. So, for example, the word the is probably not that informative for almost all queries. It's also conditional on the query, right? So the word computer is normally fairly informative about a certain subtopic for most queries, like maybe education, but not for the query ACM, because it probably appears in basically all the candidate documents for the query ACM.
So we really need a weighting function that is based on the candidate set of documents. Now, there's been some prior work in this area. It's called essential pages. It was actually done here at Microsoft. What they did was they used a fixed function for computing word benefit, the benefit of covering the word. And it depends on the word frequency in the candidate set.
So for example, if a word appears in 100 percent of the documents in the
candidate set, then it has zero benefit of being covered because it's not useful in
teasing apart the different subtopics for different concepts.
If a word appears very rarely in the candidate set, then it also has low weight, because it's not very representative. You could think of this, if you're familiar with information retrieval, as a local version of TF-IDF. And what I'll be presenting is
basically a way to learn a much more complicated and more general weighting
function.
So here are the features that I'll be using. Let boldface X denote the set of candidate documents, and let V be the variable that denotes an individual word. Then we could use features that describe the frequency of the word V within different aspects of the set of candidate documents.
So, for example, the first feature might be active if the word V appears in at least 10 percent of the documents in X, the second feature might be active if the word V appears in at least 20 percent of the titles of the documents in X, or 15 percent of the anchor text of documents pointing to X, or 25 percent of the meta information in X, and so on and so forth. And in practice, in our experiments, we'll be using thousands of such features.
And after learning a model, which we'll represent as a weight vector W, we can think of the benefit of covering a word V as the product of our model vector W with this feature representation.
Again, I should emphasize that this is local. It's dependent on the candidate set.
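As a rough sketch of what such features might look like -- the field names, the threshold grid, and the document representation below are illustrative assumptions, not the exact feature set from the experiments:

```python
def features(word, candidate_docs, thresholds=(0.10, 0.15, 0.20, 0.25)):
    """Binary features: does `word` appear in at least t percent of each
    field (body, title, anchor) of the candidate set X?"""
    phi = []
    n = max(len(candidate_docs), 1)
    for field in ("body", "title", "anchor"):
        # Fraction of candidate documents whose field contains the word.
        freq = sum(word in doc.get(field, set()) for doc in candidate_docs) / n
        phi.extend(1.0 if freq >= t else 0.0 for t in thresholds)
    return phi

def benefit(word, candidate_docs, w):
    """Learned benefit of covering `word`: inner product of the model
    vector w with the word's (candidate-set-dependent) features."""
    return sum(wi * fi for wi, fi in zip(w, features(word, candidate_docs)))
```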
>>: [inaudible].
>> Yisong Yue: Yes?
>>: So you kind of do it on a two step basis? The first step, you actually retrieve, say, a couple thousand documents.
>> Yisong Yue: So there's ->>: [inaudible].
>> Yisong Yue: That's a good point. So for this machine learning approach, we make the standard machine learning assumptions. You assume that the candidate sets are provided to you according to some distribution. In practice, when we implement these types of algorithms, you probably want to do a two stage approach, where you first select the candidate set and then you apply an algorithm like this one.
So, yeah. Okay. So here's the structured prediction formulation for maximizing coverage. Let boldface X denote the set of candidate documents, let boldface Y denote the prediction, a subset of X of size K, and let V of Y denote the union of all the words that appear in our prediction.
And then we can model the quality of our prediction, given a learned model, using what's called a discriminant function. And I'll walk you through what that means. We're basically summing, over every word that appears in our prediction, the benefit of covering that word.
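A sketch of this discriminant, reusing the hypothetical benefit helper from the previous sketch (documents again assumed to be dicts of word sets):

```python
def discriminant(prediction, candidate_docs, w):
    """Score a predicted set of documents: sum of learned word benefits
    over the DISTINCT words covered (a sum over a set, not a multiset)."""
    covered = set()
    for doc in prediction:
        covered |= doc.get("body", set())   # words covered by the prediction
    return sum(benefit(v, candidate_docs, w) for v in covered)
```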
Okay. So what are some properties of this?
>>: [inaudible].
>> Yisong Yue: So you could do normalization as a feature if you want. That's sort of the standard approach to discounting the length of the document. And we do do that in our experiments. All right. So what are some properties of this formulation?
Well, first of all, it does not reward redundancy, which is the thing that we care about the most. Because if a word appears in two documents in our prediction, it's only counted once. This is a sum over a set, not a multi-set. And that is the structural assumption of the structured prediction model.
So, for example, if we are representing this coverage based prediction problem where the space is represented using words, then if two documents share a word, the benefit is counted only once. Second of all, this is still a submodular optimization problem, so greedy has the same approximation bound. And we'll be using a more sophisticated structure in our experiments. One limitation of this simple formulation is that if a document contains even one occurrence of a word, the word is considered to be covered. And you really want to have graded levels of coverage. Some documents might cover a word better, say, if it appears in the title. And you can expand this feature representation in a stacked form. I'll be happy to talk about this offline with anyone who's interested. But we do use a more sophisticated structure of the same flavor in our experiments to account for the fact that different documents cover words to different degrees. Yeah?
>>: So I missed something. So is X is set of [inaudible] X is the set of all the ->> Yisong Yue: Candidate documents.
>>: Candidate documents.
>> Yisong Yue: Uh-huh.
>>: So can you repeat how you're using this entire set in defining this function? I'm asking because, I mean, you know, in real life you don't really have this; they kind of arrive one by one in this streaming fashion ->> Yisong Yue: They arrive one by one in a streaming fashion?
>>: Yeah. I mean when you do retrieval, you never really get a full set and then
rank it, you kind of have a streaming mechanism where you kind of get them
gradually and you have to calculate their values on the fly. So that's what
typically happens in the real system.
>> Yisong Yue: So you might -- you might sort of let a hundred stream in and
then rerank the top a hundred?
>>: Well, the way these things work now is they just come one by one and you
have to kind of support them on the fly. So pretty much. But I mean you could
do something -- I'm just trying to understand your model.
>> Yisong Yue: Yeah. Okay. I understand. So right. From that perspective, I guess you would have to wait until you've collected a large enough pool of candidate documents in order to apply this model.
>>: And so how would you use the set?
>> Yisong Yue: It's used in defining the feature representations. Because all these features are defined relative to the candidate set of documents. Again, it goes back to this idea that the importance of a word is conditional on the candidate set. For example, the word computer is not as useful for teasing apart subtopics for the query ACM.
So this is also a coverage based problem, so it's also a submodular optimization problem, and greedy has the same approximation bound. And we'll be using a more sophisticated structure in our experiments. The goal here is to learn this model vector W with the intention of making these types of predictions. Right? This is not a binary classification problem, this is a coverage problem, and so we want to train a W with the intention of doing well on this coverage problem.
And I'll present a structural SVM approach for learning this W. And it will have
empirical risk and generalization bounds.
>>: So just [inaudible].
>> Yisong Yue: You can model -- you can model it if you like. In our
experiments we choose not to.
>>: So the query [inaudible] is very implicit in these.
>> Yisong Yue: In this particular model, yes. Although you can think of defining
this feature representation also conditioned on the query explicitly.
Okay, so here's the formulation that we use. It's a structural SVM formulation. Let boldface X denote a structured input, in our case the set of candidate documents. Let boldface Y denote the structured output, in this case the prediction, the subset of size K. Our goal is to minimize the standard SVM objective, which is a tradeoff between model complexity, measured as the square of the two norm, and an upper bound measure of the loss on our structured prediction problem.
Now, this is a quadratic minimization problem, so if it were unconstrained it would have a trivial solution at zero. But there are constraints. And I'll describe what the constraints mean. Let Y of I be the computed optimal solution for the Ith training example. And this is something that we assume we can compute given the supervised labels. Then for every other label, which is in this case suboptimal, we want the benefit function -- so remember, this is the benefit of predicting Y given an X and a W -- we want the benefit of our optimal prediction to be at least as large as the benefit of any suboptimal prediction, plus the structured prediction loss, in this case a subtopic loss that I'll get into later, minus the slack.
And so here the slack variable, in order for these constraints to be satisfied, exactly upper bounds the structured prediction loss. Yeah?
>>: What is X sub I, the Ith [inaudible]?
>> Yisong Yue: In the training set, which I'll describe in more detail later, each training example is a collection of candidate documents and their subtopic labels. So I'll get into that in a few slides.
>>: [inaudible].
>> Yisong Yue: Deltas are all non-negative, that's right.
>>: [inaudible].
>> Yisong Yue: Well, okay, so suppose that our algorithm makes a mistake and this score is higher than this score. Then if you subtract this from both sides, this side is negative, so this must be negative. Which means this has to be larger than this. That's why it's a minus. Okay.
So this is still a convex optimization problem, so it should be pretty simple to optimize. Unfortunately, there is typically an extremely high number of constraints of this form. Usually intractably many. So using the naive approach, it's actually intractable to even enumerate this optimization problem, let alone optimize it. So we're using a cutting plane approach to optimize this objective.
So let the left hand side depict the original optimization problem, where the color gradient denotes the objective; the good region is right here. And let the linear planes denote the constraints.
And so the globally optimal solution, subject to the constraints, is right here. In the cutting plane approach we first start with no constraints. So we just solve this objective unconstrained, and the solution is at zero.
Starting from that solution, we then select the constraint that is the most violated. In this case, it would be this constraint, because the corresponding slack variable is the highest for this constraint. So we add this constraint and then we re-solve the optimization problem with a single constraint. And so we find this solution. And we continue to do so, iteratively finding the most violated constraint, until our solution here is an epsilon approximation to the solution here. And by epsilon I mean that no other constraint is violated by more than some small tolerance epsilon.
And this is guaranteed to terminate in linear time, linear in one over epsilon. And in practice, it's quite fast. And it's guaranteed to be a very good approximation in a way that preserves all the same generalization bounds of a normal SVM.
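Here is a schematic of the cutting plane loop just described. The QP solver and the separation oracle are passed in as callables, since both are stand-ins here rather than a real library API:

```python
def cutting_plane_train(train_set, solve_qp, find_most_violated, violation,
                        dim, epsilon, max_iters=1000):
    """Iteratively add the most violated constraint and re-solve the QP
    until no constraint is violated by more than epsilon."""
    w = [0.0] * dim           # unconstrained solution of the QP is zero
    working_set = []          # constraints added so far
    for _ in range(max_iters):
        added = False
        for x, y_star in train_set:
            y_hat = find_most_violated(x, y_star, w)   # loss-augmented inference
            if violation(x, y_star, y_hat, w) > epsilon:
                working_set.append((x, y_star, y_hat))
                added = True
        if not added:
            return w          # epsilon-approximate solution reached
        w = solve_qp(working_set)   # re-solve over the working set
    return w
```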
Okay. So one key problem here is finding the most violated constraint, right? The naive approach would be to enumerate over all the constraints to find the one that's most violated. And of course we assume that this is intractable.
The good news is that we can formulate finding the most violated constraint in the same coverage formulation as the prediction problem. We can treat it as a coverage problem. So we have a one minus one over E approximation for finding the most violated constraint, and all the theoretical guarantees still hold if we have a constant factor approximation in finding the optimal cutting plane, albeit the guarantees are a little bit weaker. And this performs pretty well in practice, as you'll see.
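The separation oracle itself can be sketched as the same greedy coverage routine run on a loss-augmented objective. Here marginal_gain is a stand-in combining the discriminant gain of adding document i with the gain in subtopic loss:

```python
def greedy_most_violated(num_candidates, k, marginal_gain):
    """Greedy 1 - 1/e approximation to argmax_y F(y | x; w) + loss(y, y_star)."""
    chosen, pool = [], list(range(num_candidates))
    for _ in range(k):
        # Pick the candidate with the largest loss-augmented marginal gain.
        best = max(pool, key=lambda i: marginal_gain(chosen, i))
        chosen.append(best)
        pool.remove(best)
    return chosen
```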
Okay. So here's the training data that we will use in evaluating our model.
>>: Yisong?
>> Yisong Yue: Yes?
>>: That trick, the trick, the coverage problem wasn't it in -- I can't pronounce his
name.
>>: [inaudible].
>>: What he said. Marginal paper. That's your hack on his paper? Because I
don't remember the -- that the original ->> Yisong Yue: That's the -- this is the key technical contribution of the paper.
>>: That's right. But --
>> Yisong Yue: No. It's -- it's submodular. So this is in an ICML 08 paper, and this is sort of the key, I guess, technical contribution of that paper.
Okay. So here's the training data. We're going to use data from the interactive track in TREC. And in this dataset, queries have specific labeled subtopics.
So, for example, for one query, use of robots in the world today, human judges labeled documents as relevant to different subtopics. So, for example, nano robots, space mission robots, underwater robots, and so on and so forth. You could think of this as a manual partitioning of the total information regarding this query. And our goal is to discover a good representation that helps us partition the information of this query.
So here are the results. We're retrieving five documents here. We're comparing against random, which basically selects five documents from the candidate set at random.
>>: How big is the [inaudible]?
>> Yisong Yue: It varies from query to query. On average it's about 40. And -- yeah. So this is missing subtopic error rate, so lower is better. We're comparing against Okapi, which is a standard retrieval model in information retrieval that does not consider diversity. We're comparing against unweighted, which is the baseline for our model, where it just tries to select as many distinct words as possible, unweighted. We're comparing against essential pages, which is the prior work that I mentioned, developed here at Microsoft. And our method is called SVM-div. Only essential pages and our method do better than random, and SVM-div outperforms the rest with [inaudible] significance.
>>: So the training set took the subtopics and [inaudible] maximum coverage
[inaudible]. What was the training [inaudible].
>> Yisong Yue: You just -- you take all the documents that are labeled as relevant to at least one of the subtopics. So there are about 40. And then the optimal label is the set of five documents that minimizes subtopic loss. So the one that covers as many subtopics as possible. And different queries have different subtopics. They're all query dependent.
>>: What happens if you do a query independent TF-IDF [inaudible]?
>> Yisong Yue: I'm sorry?
>>: What happens -- so your thing does a query dependent kind of TF-IDF thing.
What if you do just a global TF-IDF which is query independent as a processing
step?
>> Yisong Yue: That's like Okapi as a preprocessing step?
>>: No, no, you apply your algorithm, but you define these weights with respect to -- I mean, you do a preprocessing step, which is TF-IDF but globally across everything, versus the query dependent aspect of getting this entire candidate set ahead of time. Have you tried that?
>> Yisong Yue: No. If you were to do that --
>>: I mean, that's what you need to compare to in order to claim that the query dependent normalization is gaining you something, right?
>> Yisong Yue: What we're comparing against is essential pages.
>>: But that's a completely different algorithm also, right?
>> Yisong Yue: It's the same -- I mean, the prediction tasks between essential pages and SVM-div are more or less the same. It's the model that's different. So essential pages ->>: I understand. You're doing better.
>> Yisong Yue: Okay.
>>: Is it because you're using an SVM? Is it because it's query dependent normalization? I mean, it could be a number of things. I just was --
>> Yisong Yue: I guess you could try to model the importance of words globally rather than locally and apply the same coverage problem. That could be an interesting baseline. We do not compare against that baseline.
The two baselines -- so these two are not coverage based approaches. The two baselines we did compare against are essential pages and totally unweighted. So we did not compare against weighting with global TF-IDF.
Yes?
>>: So that random fraction [inaudible] seems to imply a lot of subtopics. Do you remember how many subtopics there were on average? It seemed like there would have to be many more than five to get random at .47.
>> Yisong Yue: Again, it varies, because they're all query dependent. But it's about 15 subtopics per query.
>>: Oh, so you might be at the best you can even do even if you had the oracle.
>> Yisong Yue: The best is about .27. The oracle is about .27.
>>: Oh, okay. Okay. Okay.
>> Yisong Yue: So some documents are good. They cover more than one
subtopic.
>>: Okay.
>> Yisong Yue: So okay. If documents were disjoint in the subtopics they covered, then there would be no redundancy in the sense that we care about.
>>: That is true.
>>: So one way you can do that without labeled data is that you can [inaudible] the local IDF and you get the largest coverage of the [inaudible].
>> Yisong Yue: The local?
>>: Yeah, the local IDFs. So if you do that, then what kind of performance --
>> Yisong Yue: That is actually what essential pages does.
>>: I see.
>> Yisong Yue: So essential pages, you know, makes this assumption that the importance of different terms corresponds to this TF-IDF-like function. And this is a discriminative training algorithm that says: we have the same prediction task, but we're going to learn the word benefits with respect to our training labels. Right? So it's a linear model fit to the training labels.
Okay. So that's the approach. More abstractly, what I have described in the first half of this talk is a way of learning coverage representations, right? So suppose we had a training set where we had some sort of gold standard representation of information, in this case the subtopic labels, and we want to predict a good covering that minimizes subtopic loss.
The goal here is to learn an automatic representation such that it does not require the gold standard labels, and then we can maximize coverage on new problem instances.
You could think of this as the inverse of the prediction problem, right? So there's been a lot of work in optimization research on these different structured optimization problems, and maximum coverage is one such instance. So the prediction problem is: given the gold standard formalization of the optimization problem, we can predict a good solution, in this case a good covering.
The inverse problem is that we want to learn an automatic representation that agrees with the gold standard solution on the prediction. So we want to learn an automatic representation such that our coverage solution in this representation agrees with the coverage solution in the gold standard representation. That's the learning problem.
So what are some other kinds of coverage problems? Well, you know, it's pretty endemic in many kinds of recommender systems: recommending schedules, so maybe helping secretaries create schedules; different products for different commercial services; scholarly papers, if you're interested in browsing the different types of research work that's out there on some scholarly dataset; and so on and so forth.
Again, there's an ambiguity in what users want. There's an ambiguity in what I want, even if I could formulate the query perfectly. And there's an ambiguity in how the service interprets your request, because there isn't a perfect formulation.
In a different context, suppose we wanted to place sensors to detect different outbreaks or intrusions in a physical environment. Or suppose we wanted to figure out which blog feeds to monitor in order to collect the most news stories or information cascades as quickly as possible. Again, there's ambiguity in where the events occur and what events we care about. Yeah?
>>: [inaudible] coverage problem?
>> Yisong Yue: I'm sorry?
>>: Is recommendation really a coverage problem? Because if I'm watching movies, I don't really care about watching movies on all possible topics; I have a strong bias toward only the movies that I like. And so it might be a coverage but --
>> Yisong Yue: Suppose I'm looking for Mother's Day gifts. So it's certainly true that diversity is not essential in all possible applications within recommender systems. That's certainly true. So if you know what you want, you know what you want.
And if the system knows what you want, then that's perfect. But suppose I was searching for Mother's Day gifts, right? I have no idea. And so there it might behoove the system to hedge its bets, in which case you could formulate this as a coverage problem.
>>: I have a question [inaudible]. Do you think that diversity and coverage are typically a primary objective in most retrieval problems, or is it more commonly a regularizer, for example, or a constraint, where you would have the primary objective being some sort of [inaudible] of accuracy? And do you think that solutions where you state the problem with diversity as a regularizer or constraint would be very different from where you have it as the primary objective?
>> Yisong Yue: So when you think about implementing these systems in practice, I guess it depends on the properties of the problem, right? I can imagine scenarios where I do a pretty generic query on a research paper dataset, where I'm kind of looking for new ideas, maybe looking for related works to cite for a paper I'm writing, not really sure what I'm looking for. And there, relevance is a bit harder to judge, first of all.
And second of all, lots of papers talk about the same thing. I might only want to cite one paper from this pool of papers. And so there it would behoove a system, when it's making a list of recommendations, to diversify its recommendations. So it depends on how much redundancy there is, and it depends on how clear your notion of relevance is. These are different properties that you need to examine.
But I think there are certainly applications where you do want diversity as a
coverage problem.
>>: Is the primary -- you think it's actually [inaudible] to have it as a primary
objective?
>> Yisong Yue: So okay. There are ways of combining this coverage based problem with a more traditional one dimensional relevance objective. I haven't done any specific work on it, but there are ways to balance the two in a combined optimization problem. So it would be interesting to think about.
>>: Sorry. Another follow-up to that question. So say that we do care about coverage and we do want one document from each topic, but in your example of, you know, citing papers, I want the most authoritative example from this group. So I agree that coverage is important, but given that coverage is important, how do you then kind of --
>> Yisong Yue: Trade off -- I understand your question. You can formulate this problem more generally as a utility function, right? You want to maximize user utility. And the gained utility is higher when you present users with more authoritative results, but the gained utility suffers from diminishing returns as you retrieve more and more redundant results.
>>: Okay.
>> Yisong Yue: So it's still a coverage based problem, but it's a little bit more complicated than the model I presented. Yeah?
>>: Just a comment. I think it -- [inaudible] coverage problem is like a [inaudible]
summarization. So you can [inaudible] you can construct the sentences from the
document, try to [inaudible] --
>>: So especially in this slide, you're telling us that diversity is important. [inaudible] intuitively we kind of agree that having diverse things is good, but I think you can make a much stronger statement. I think the most important thing you said was hedging the bets, right? So essentially define the problem such that your prediction task becomes a risk averse prediction task, much like portfolio selection in computational finance, where, you know, you want to make the most money with the portfolio, but you still want to diversify just to guarantee that you don't lose everything. And there the need to diversify just comes as a natural consequence of your formulation of the model, and maybe that's what we're missing [inaudible]. [laughter].
>> Yisong Yue: That's a [inaudible] way of putting it.
>>: [inaudible] has done that where he increases [inaudible].
[brief talking over].
>> Yisong Yue: All right. So I think we've beaten the first half to death. Let's
move on to the second half on interactive learning. How am I doing on time?
>> Chris Burges: [inaudible] 15 minutes or so until the hour is up, but we have
until 12.
>> Yisong Yue: Okay. Great. All right. So the idea here is that we want to build systems that can learn interactively from feedback from their environment. So, for example, suppose you wanted to build search systems for corporate intranet search, and these companies are very private about their data. They just want a black box system that's just installed on their network, and they just want it to work. So you want the system to adapt to the particulars of the network structure of the internal corporate network, the language models of their documents, the query formulation patterns of their users, the click behavior of their users, and different things of that nature.
The standard techniques people typically employ in machine learning are limited in the sense that they require a fairly labor intensive labeling pre-processing step in order to collect a sufficiently representative amount of training data to apply these machine learning algorithms.
So, for example, you might be asking a human labeler whether or not this document is relevant to this query, for some document query pair. This is expensive and inefficient to obtain, and it doesn't completely generalize to other domains, right? There's definitely a difficulty in generalizing. So a dataset about patents, whether or not these patents are relevant to these types of patent queries, might not generalize that well to designing a search system over a medical research dataset.
So in light of this, a lot of people are starting to look at online learning as a way of
modeling this problem. The idea here is to learn on the fly. And in particular, I'll
be describing an extension of the multi-armed bandit problem.
This is pretty broadly applicable, because many systems interact with the environment on the fly, so they learn and they try to provide good service simultaneously. One of the key technical challenges and theoretical questions in this line of research is how do we analyze performance? And what people usually do is they measure the utility of the strategies chosen by our algorithm as it's learning on the fly versus the utility of the best possible strategy known only in hindsight. And this difference is often called regret.
So there's been a lot of work done on multi-armed bandit problems. I just want to quickly illustrate, with this simple example, some of the difficulties or disconnects in modeling a lot of real world problems using the standard language of the multi-armed bandit framework.
So here we have a refreshment service provider. It's at a party, and it's trying to optimally provide refreshment services to the guests at the party. And it has two strategies in this simple example: to give Pepsi or to give Coke. And its goal, amongst other things, is to figure out whether this population of users on average prefers Pepsi or prefers Coke, and then to exploit that knowledge to optimally satisfy the population of guests.
So in the language of the standard multi-armed bandit framework, here's how the scenario would play out. Our system would go to the first guest, ask that guest to drink some Coke, then ask that guest to rate the Coke from one to five, or zero to one, whatever you prefer, and then update its average Coke rating. It would go to the next guest, ask that guest to drink Pepsi and then rate that Pepsi from one to five, and update its average Pepsi rating. And it would proceed along in this fashion until it has some idea that maybe Pepsi probably has a higher average rating than Coke, and then it would start to exploit this knowledge and start serving the guests Pepsi more often in the future.
So what's wrong with this scenario? Well, two things that I can think of. First of all, it assumes that all the users -- scratch that, all the guests -- are calibrated on the same scale. So if I give a rating of three to Pepsi, it means the same thing as when Misha gives a rating of three to Coke. Probably not the case.
Second of all, it assumes the users are able and willing to provide explicit feedback. So here, imagine that I'm searching on a search service on my company's intranet, and every time I do a search the search service gives me a popup that says, please rate the relevance of this document for this query on a scale of one to five. I'm just not going to do it. It's not worth my time.
So here's an alternative approach that I've been working on. Instead of absolute feedback, we focus on relative feedback. Instead of explicit feedback, we focus on implicit feedback. So this is how the scenario would play out in the language of the framework that I've been working on.
The system goes to the first guest. Let's say it has some fixed volume budget K. It has two cups. It pours some amount of Pepsi in one cup, some amount of Coke in the other cup. It gives both cups to the first guest, and it simply observes which cup the guest drinks more out of. And if the first guest drinks more out of the cup of Coke, then it would make the inference that this guest prefers Coke to Pepsi. It would repeat this process for all subsequent guests. And if the second guest were to drink more from the Pepsi cup than the Coke cup, then the system would make the inference that this guest prefers Pepsi to Coke. And it would proceed along in this fashion with all the guests until it has an estimate that maybe this population prefers Coke more on average than Pepsi.
So this addresses both of the issues that I pointed out. First of all, it's making relative comparisons as opposed to eliciting an absolute reward. So: is A better than B? And this is something that has been found, across a wide range of fields, to be more reliable feedback to gather.
Second of all, it no longer assumes that users are providing explicit feedback.
We're simply making inferences from natural user behavior as they're interacting
with our system.
>>: I know this is a [inaudible] example, but I'm assuming it matches the real stuff. [inaudible].
>> Yisong Yue: Right. So you could imagine -- right. You can imagine that the users can't be bothered to be asked what they would prefer. Or that they would know once they see it, but they don't know beforehand. Yeah?
>>: So in this setting, what's the equivalent of, you know, once it determines that Coke is the winner and it starts giving Coke to every guest -- in this scenario does it just give more Coke and less Pepsi?
>> Yisong Yue: There are a couple of ways. You could just pour Coke in both cups. You could just give one cup. I mean, there are different ways of formulating this scenario. But you could just pour Coke in both cups. Yeah?
>>: What would be the extension to more than two choices? Is that distribution [inaudible].
>> Yisong Yue: No, you just choose. In the two cups? Suppose you have K choices. For each user, you give them two choices that you choose.
>>: [inaudible].
>> Yisong Yue: I'm sorry?
>>: [inaudible] from some distribution?
>> Yisong Yue: Well, you could sample from a distribution. The algorithm decides -- the algorithm could decide by sampling from a distribution. But the algorithm decides which two to give to users.
>>: It doesn't have to be random, right? I mean, you may actually choose to sample with certain --
>> Yisong Yue: Yeah. I mean, you could design an algorithm to make choices based on a distribution, but you don't have to. With K choices in the standard multi-armed bandit setting, you give the user one choice and ask that user to rate that choice. In this setting, you give the user two choices of your choosing and you ask the user to compare the two. Well, you infer which one the user prefers by seeing how they interact with the two choices.
Okay. So the contribution here is a new online learning framework, which we call the dueling bandits problem. And the idea here is that we want the algorithm to learn via making pairwise comparisons. And I'll present a new learning objective, a new notion of regret, and I'll also present an algorithm with theoretical results.
And to my knowledge, this is the first online learning framework tailored towards relative feedback.
So before I present the formulation of the problem, I want to quickly describe a real world comparison oracle for search applications. This method is called team-game interleaving. And the idea here is: suppose I had two retrieval functions, F1 and F2, and I wanted to know which one is preferred by some user, say Thorsten, for a query, say SVM. How this works is we would first precompute the rankings of both retrieval functions, and we would show the user -- we would show Thorsten -- an interleaving of the two rankings. So the results are color coded here, such that the interleaving preserves some notion of rank fairness, and the user does not know that one result came from one retrieval function and the other result came from the other.
And from this interleaved ranking that the user sees, we can then simply treat clicks as votes of preference for one of the two retrieval functions. So if Thorsten were to click more on the red results, which is the case here, then we would make the inference that F1 is preferred to F2. Okay.
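As a hedged sketch of this kind of interleaving -- the draft rule below follows the team-draft style, which may differ in detail from the exact team-game variant used in the talk:

```python
import random

def next_unseen(ranking, seen):
    """First result in `ranking` not yet shown to the user."""
    for doc in ranking:
        if doc not in seen:
            return doc
    return None

def interleave(rank_a, rank_b, depth=10):
    """Interleave two rankings; clicks on a team's results are votes for it."""
    interleaved, team, seen = [], {}, set()
    while len(interleaved) < depth:
        # A coin flip decides which ranking drafts first this round.
        order = ["A", "B"] if random.random() < 0.5 else ["B", "A"]
        progressed = False
        for t in order:
            doc = next_unseen(rank_a if t == "A" else rank_b, seen)
            if doc is not None and len(interleaved) < depth:
                interleaved.append(doc)
                seen.add(doc)
                team[doc] = t
                progressed = True
        if not progressed:
            break  # both rankings exhausted
    return interleaved, team
```

At evaluation time, one counts clicks per team: more clicks on team A's results is a vote that F1 is preferred.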
So here's the dueling bandits problem. Given a space of bandits F, which we could think of as a parameter space of retrieval functions or weight vectors, the algorithm proceeds in a series of time steps, where at each time step it can compare two bandits, or two retrieval functions. So, for example, this can be implemented using the interleaving test in search applications. Each comparison is modeled here, in our case, as being noisy and independent of other comparisons.
The goal is to minimize this notion of regret, and I'll walk you through what that means. So at time step T our algorithm chooses two bandits, F sub T and F prime sub T. And the regret is the sum of the probability that F star, which is the best bandit or the best retrieval function known only in hindsight, beats F sub T, plus the probability that F star beats F prime sub T, minus one. The minus one is for normalization purposes.
So you can interpret this as the probability, or the fraction of users, who at time step T would have preferred the best retrieval function over the ones selected by our algorithm at time step T. So it's a dissatisfaction notion of regret.
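Written out, the regret as spoken (my transcription of the formula on the slide) is:

```latex
R_T \;=\; \sum_{t=1}^{T} \Big[ \, P(f^{*} \succ f_t) \;+\; P(f^{*} \succ f'_t) \;-\; 1 \, \Big]
```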
Okay. So here are some examples to illustrate how this problem plays out in practice. Here I have three examples where we're comparing two bandits in the space of bandits. And the color gradient represents the quality of the bandit. So lighter is better; the best bandits are up here.
In the first example we're comparing two pretty poor bandits. And so for a noisy comparison between these two, we incur pretty high regret, because users would have preferred bandits from up here. And again, we don't know a priori what this regret is; that's by assumption. In the second example we're comparing two mediocre bandits, and for a noisy comparison between these two, we incur a noticeably lower regret.
And in the last example, we're comparing two pretty good bandits. And for this comparison we incur almost zero regret.
And so the goal then is for the algorithm to make a sequence of comparisons such that this regret formulation is minimized over time, up to some time horizon T. And our algorithm does not know a priori what this space looks like or what F star is.
So in our analysis, we make a few modeling assumptions. We assume that each bandit F has an intrinsic value V of F. This value function is never observed directly. And for analysis purposes, we'll assume that V is concave, which implies a unique F star. If V is not concave, you have to deal with local optima.
We assume that comparisons are modeled based on this value function. So, for example, the probability of F beating F prime could be modeled as some transfer function applied to the difference of the values of F and F prime. And we make a smoothness assumption on this probability function.
So a common one is the Lipschitz -- I'm sorry, the logistic transfer function. But
in general, any S shaped curve works. Yeah?
>>: You're assuming that the intrinsic value is the same for all the queries?
>> Yisong Yue: Right. So it's like the average value. So you would have to do some averaging. Okay. So I'll present the algorithm by example. Our algorithm is called dueling bandit gradient descent. It's pretty simple. It has two parameters: an exploration step size delta and an exploitation step size gamma.
It maintains a current estimate of what it thinks is the best point. And then at each time step it chooses another point in this space, within some exploration radius, and it compares the two. In this case, this candidate bandit is worse, so it's likely to lose the comparison. And in this case it did lose the comparison, so our algorithm stays at the current point. In the next time step we're comparing against this point, which is again sampled from our exploration radius. And in this case, the proposed candidate bandit wins the comparison, so we make an update in this direction. Notice that our update step is smaller: our exploitation step is smaller than our exploration step. So our algorithm is conservative, and this falls out of the analysis. And we'll continue proceeding in this fashion, making this sequence of comparisons and updating when our proposed candidate wins the comparison.
Now, these comparisons are random, so sometimes a better candidate bandit will lose the comparison, in which case we stay stationary. And sometimes a worse candidate bandit will win the comparison, in which case we make a step in the quote/unquote wrong direction.
But the idea is that on average we're doing something like gradient descent, and we'll be converging towards the good region in a way that minimizes regret.
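A minimal sketch of that update loop; here duel(w, w_prime) is a stand-in comparison oracle (for example, an interleaving experiment) that returns True when the perturbed candidate wins:

```python
import math
import random

def dbgd(w0, delta, gamma, duel, horizon):
    """Dueling bandit gradient descent sketch: explore within radius delta,
    take a smaller exploitation step of size gamma when the candidate wins."""
    w = list(w0)
    for _ in range(horizon):
        # Sample a uniformly random unit direction u.
        u = [random.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(x * x for x in u)) or 1.0
        u = [x / norm for x in u]
        w_prime = [wi + delta * ui for wi, ui in zip(w, u)]   # exploration
        if duel(w, w_prime):                                  # candidate won
            w = [wi + gamma * ui for wi, ui in zip(w, u)]     # exploitation
    return w
```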
So here's the analysis, a sketch of it on one slide. It's built upon the fact that convex functions satisfy this inequality: for a convex function C, the difference between C of X and the minimum value C of X star is less than or equal to the gradient of C at X times the vector difference of the two points.
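That is, for convex c with minimizer x star:

```latex
c(x) - c(x^{*}) \;\le\; \nabla c(x) \cdot (x - x^{*})
```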
Now, for our formulation, first of all, it's not convex, because it's passed through a logistic transfer function. And second of all, we don't actually know what the gradient is. We need to estimate it.
So this introduces both additive and multiplicative errors to this inequality. And in particular, how bad the error is depends on our exploration step size delta.
The main analytical contribution of this work is a bound on the multiplicative error, which allows us to reason about how aggressively we can explore versus how bad the error in our estimate is.
The good news is that we can do so, and by doing so we can actually set the parameters in a way that achieves sublinear regret with respect to the time horizon T. When an algorithm has sublinear regret, what that means is that the average regret shrinks over time, so in the limit we do as well as knowing the best bandit in hindsight. Yeah?
>>: So it's your objective function [inaudible] you're adding [inaudible].
>> Yisong Yue: Is this convex?
>>: Quasi convex. [inaudible].
>> Yisong Yue: Yes, it's quasi convex. You could read the paper for more details, but it's convex in this region.
>>: I wonder if there's [inaudible] in the performance of the quasi convex --
>> Yisong Yue: Maybe. Actually, I don't know. So, I mean, if the function were completely convex, then you could simply take a noisy estimate of this gradient, and you could show that that would do well. Because it's not completely convex, it introduces error, and that's the contribution of the paper. There might be some approaches in the literature analyzing quasi convex functions. I don't know.
>>: I kind of lost you here. You're saying that you're using this property of
convex functions. So what's the convex function you are applying? What is C?
>> Yisong Yue: C is the sum.
>>: Is what ->> Yisong Yue: Is the sum right here. This probability. This probability function
is C.
>>: And is it convex?
>> Yisong Yue: No, it's not. But ->>: [inaudible].
>> Yisong Yue: It's convex for all bandits that are better than X. So the idea here is that this function, although it's not convex, satisfies this property up to some error. And if you can bound that error, then you can reason about how this function behaves as a function of delta relative to this inequality.
>>: [inaudible] regions of the -- for example, if that -- if your [inaudible].
>> Yisong Yue: No, because the F function is always guaranteed to be at the inflection point. Because relative to -- well, okay, it's a bit complicated. I'll be happy to talk with you offline. You could describe an equivalent formulation where the F function is always guaranteed to be at the inflection point, where it goes between convexity and non-convexity.
Okay. So here are some web simulation results. We took a Web Search dataset provided courtesy of Microsoft, and we did the following simulation. There are 1,000 queries in this dataset. We simulated a user issuing a query by sampling from this dataset at random, and then, for two different retrieval functions, two different rankings, the user would probabilistically choose one over the other based on NDCG. So NDCG is the hidden value of the two rankings, and the user would probabilistically choose one over the other.
And we tried three versions of our algorithm: one which sampled one query before making an update, versus up to 100. In the latter case we were simply averaging, so we have a less noisy estimate before making an update decision. And we compared it against -- so this is NDCG on the Y axis -- we compared it against ranking SVM, which is a pretty standard supervised learning algorithm for this problem.
So we don't do as well as the ranking SVM, because the metaphor here is that the ranking SVM has direct access to the labels, which in this case are the labels inside the users' minds, whereas we only make inferences based on observed user behavior.
So all our labels, in some sense, are provided for free -- all our feedback is essentially provided for free -- and we do reasonably well.
>>: What's the X axis?
>> Yisong Yue: Number of comparisons. So ->>: [inaudible] ranking functions?
>> Yisong Yue: It's a parameter space. So there are 367 features, so it's a continuous parameter space. In some sense it's an infinite number. But we're basically just exploring different points in this parameter space.
Now, if we're interested in only optimizing over a discrete set of, say, K different retrieval functions, then we have stronger regret bounds. In this case, it's information theoretically optimal at log T. The idea here is that we have K different retrieval functions, and you want to make a sequence of comparisons between these K retrieval functions in a way that minimizes regret.
Now, the dueling bandit gradient descent algorithm I presented is pretty simple, and I think that's a strength, because it has reasonable theoretical guarantees and it's easy to extend because it's simple. And so you could think about incorporating domain knowledge, or leveraging the types of structured prediction algorithms that I described in the first half of this talk, in a way that makes sense for the application. Yeah?
>>: I'm just trying to understand your non convex function that you're optimizing. So would it be possible to upper bound this with a convex function that has the same gradient at X? Because if you could do that, then you could apply the algorithm [inaudible] without a gradient algorithm, which just [inaudible] gets the same rates.
>> Yisong Yue: We actually do use some of their results in our results.
>>: You see what I mean? If you could ->> Yisong Yue: I see what you mean. I see what you mean.
>>: Use their algorithm with just one point [inaudible].
>> Yisong Yue: I see what you mean. I think the answer is yes. I'm not sure if you'd do better than what we have. We do use some of the results from [inaudible]. We had to extend the results, because their estimate of the gradient is different than our estimate of the gradient.
>>: [inaudible] in terms of going from the one point, two point [inaudible] and two point, one point, you go back to the gradient -- approximate gradient.
>> Yisong Yue: So I should say that in the [inaudible] approach, they assume that they have knowledge of C, so they observe the value of the probability function, not just a sample of the comparison. We get the same regret bound as they do. It's just that their estimate of the gradient has less variance than ours does. Okay.
So I think that this line of approach is very interesting and potentially very useful. And I think it's also important to think about ways in which we could design ever more practical approaches, and maybe think about ways of evaluating on real information systems from various domains and user communities. And I think this will also shed insight and guide future development.
So I want to quickly wrap up by briefly talking about some of the related research that I've been doing in this regime. In structured prediction, I've also worked on optimizing for multivariate ranking measures, such as average precision. Within interactive learning, I've worked not only on modeling this exploration versus exploitation tradeoff, but also on ways of interpreting user feedback.
So here, in the exploration versus exploitation tradeoff work, we assumed that we have a comparison oracle that will give us unbiased feedback via team-game interleaving. But there are ways in which we can think of maybe eliciting even stronger feedback than what we already have.
All right. So that's the end of my talk. If you're interested in any of the technical details, or in playing around with any of the software, they're available from my website. I invite you to take a look. And thank you for your attention.
[applause].
>> Chris Burges: So any questions? All right.
>>: Just one question about -- in the regret [inaudible] you said that -- you
mentioned a [inaudible] out of your function or [inaudible].
>> Yisong Yue: So right. So okay. Here's our reference bandit, and in this one dimensional case we either compare against this point or this point. And so in expectation it gives you an estimate of the gradient at this point, right? But that has an error.
Also, this point falls outside of the convex region of this function. So that gives you a different error. The estimate of the gradient gives you an additive error. The fact that this point falls outside of the convex region gives you a multiplicative error. Okay.
>> Chris Burges: Thank you.
>> Yisong Yue: Okay. Thank you.