
>> Kuansan Wang: It is my greatest pleasure to welcome Le Zhao back to
Microsoft. Many of you probably overlapped with Le Zhao when he was a
former ISRC intern within our group, working with Cha Lu [phonetic].
But we all remember he's actually from CMU, in the LTI, the Language
Technologies Institute, a very famous institution in the HLT area.
For those of you who are active in the TREC community, you probably
already know him; he's an active contributor in the TREC community,
contributing to the Lemur toolkit. His earlier work spans from
structured document retrieval to, most recently, the term mismatch
problem. In addition to that, since Le is from the LTI, he's also well
known in the HLT community; he has done work in patent retrieval and
biomedical document retrieval. Today he's going to tell us about his
Ph.D. thesis work. So without further ado.
>> Le Zhao: Hello, everybody. I'm really glad to be here again to see
my old friends and to make new friends. And I'm especially glad to see
a reasonable turnout on a Monday morning. So first, before I get
started with the talk, I want to mention that I know there are people
joining through the video link, and there's no way for you to
participate in the question-and-answer session. I'll have my e-mail
here: you can send your questions to my e-mail, and I'll try to get to
them at the end of my talk.
Okay. So this talk is about text retrieval. Let's first see what text
retrieval is. The task: a user, possibly a confused user, generates a
query, which contains certain query terms. The retrieval engine
returns a set of results from a document collection and feeds them
back to the user, and hopefully the user will be happy with the
results. This task, and the retrieval engine, is usually evaluated
using a Cranfield-style evaluation, which abstracts away the user by
retaining only the query that the user issues and only the results
that users are happy with, which are called relevance judgments.
These relevant documents, together with the queries, help us evaluate
the retrieval engine in a relatively objective, automatic way. So
here comes the big question: Where are we? And where are we going?
The current retrieval models date back to the early 1970s; the best
ones are still from the 1990s. These models are based on simple
collection statistics, like TF-IDF. TF is term frequency, which
measures how frequently a term occurs in a result document, and IDF is
inverse document frequency, which measures how rare a term is in the
collection.
So what these retrieval models don't do is any kind of deep
understanding of natural language text. Given this current status,
what does the ideal look like -- what is perfect retrieval like?
Given the query information retrieval and an answer text saying text
search, a perfect retrieval model should be able to judge that text
search implies information retrieval. That is the textual entailment
task -- inferring when one piece of text entails another -- and it's
known to be a difficult natural language processing task.
Also, searchers are frequently frustrated when they are doing
informational searches. So we're still fairly far away from that
perfect land. But what problems have been holding us back all these
years? This work argues that perhaps the following two central and
long-standing problems in retrieval are the culprit. One is the term
mismatch problem, where query terms fail to appear in relevant
documents. It happens because people sometimes describe the same
thing using different vocabulary. This general vocabulary mismatch
problem was studied in the 1980s, when Sue was still working for Bell
Labs; that's how long-standing this problem is.
However, there's still no clear definition of term mismatch in
retrieval. The second problem is query-dependent term importance. If
you are familiar with retrieval models, this is the probability of a
term t occurring given the relevant set of documents -- the
probability of term t occurring in the set of documents relevant to
the query. Traditionally, term weighting is done using IDF, basically
how rare a term is in the collection, which has nothing to do with the
query and nothing to do with relevance. This probability, P(t|R), is
known to be important for retrieval; it's known to improve retrieval a
lot. It appeared very early on but has been studied only sparsely,
and that research provided very few clues about how to accurately
estimate this probability in a query-dependent way.
This work connects these two problems, shows that addressing them can
result in huge gains, and uses a predictive approach to try to solve
both. In this talk I will use the term mismatch problem as the
thread, because it's the more general of the two problems, as I will
show. First, what is term mismatch and why do we care?
In job search, you might be looking for information retrieval jobs on
the market, but a job post could just as well say text search. This
could easily cost you 50 percent or even more of the job
opportunities, even if you are careful when formulating the query.
In legal discovery, you might be looking for bribery or foul play in
corporate documents, but they'll never say that; at most they'll say
grease or payoff, and this will cost you cases. In patent search it
could cost businesses. In medical record retrieval, successfully
finding the record regardless of the vocabulary mismatch could
potentially save lives.
So in the areas where the user cares most, mismatch can hurt most.
People know that mismatch is an important problem, and they have tried
to solve it from different angles: from the document end, from the
query end, and from both ends.
Our approach is different. Suppose you are given a problem -- any
question, any problem. How would you proceed to solve it? What's
your first step? The first step is always to clearly define the
problem -- what the problem is -- and then to see whether it is a real
problem. So in this case I will show you that in theory and in
practice this is a real problem. Then we shall try to understand the
underlying mechanism of how this problem is created and try to solve
it in principled ways. I promise to show you this in this talk, and I
will come back to this slide throughout the talk so that you know I've
fulfilled my promise.
So first, the definition. We define the probability of a term
mismatching relevant documents to be the probability of the term not
occurring in the set of documents relevant to the query. Suppose we
have a collection of documents here: the larger bubble represents the
documents that contain t, and the smaller bubble is the documents
relevant to your query. Mismatch is the proportion of the relevant
documents that do not contain the term t. So if you use the word
retrieval as your search term, then these would be the jobs that you
mismatched out of all the relevant jobs on the market.
First I want to comment that the term mismatch probability is related
to the term recall probability, which is just the complement of
mismatch -- recall is the complement of mismatch. Secondly, this
probability can be calculated very directly if you know the set of
documents relevant to the query. So if you have relevance judgments
for queries, you can accurately estimate this probability: basically,
the number of relevant documents that do not contain the term t,
divided by the size of the relevant set.
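A minimal sketch of that computation in Python, assuming the relevance judgments are available as a set of relevant document IDs and an inverted index maps each term to the set of documents that contain it (the data and names here are illustrative, not from the talk):

```python
def term_recall(term, relevant_docs, docs_containing):
    """P(t | R): the fraction of relevant documents that contain the term."""
    if not relevant_docs:
        return 0.0
    matched = sum(1 for d in relevant_docs if d in docs_containing.get(term, set()))
    return matched / len(relevant_docs)

def term_mismatch(term, relevant_docs, docs_containing):
    """P(t does not occur | R): the complement of term recall."""
    return 1.0 - term_recall(term, relevant_docs, docs_containing)

# Toy relevance judgments for one query:
relevant = {"d1", "d2", "d3", "d4"}
index = {"spills": {"d1", "d2", "d3", "d4"}, "effects": {"d2"}}
print(term_recall("spills", relevant, index))    # 1.0 -> high recall, low mismatch
print(term_mismatch("effects", relevant, index)) # 0.75 -> high mismatch
```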
Now, let's look at some examples. These are some queries and query
terms, and these are the term recall probabilities for those query
terms in the query. For the word spills in oil spills, it occurs in
99 percent of the relevant documents: very high recall, low mismatch.
Why is that the case? Perhaps there are no other ways to describe an
oil spill. Yes, question?
>>: How do you know which documents are relevant? Do you manually
label them?
>> Le Zhao: Yes, we use manually labeled documents, and then we
collect those. Thanks for the clarification. So perhaps there are no
other ways to describe spills, except oil spills, because oil leaks
means something else. Term, in term limitations: the word term
appears in 98 percent of the relevant documents. Perhaps there are
also no other ways to describe the word term here. However, the same
word term in long-term care appears only in 68 percent of the relevant
documents. Why is that the case? Because long-term care can be
described as elderly care or home care, et cetera. It's not really a
necessary term for relevance. The word effects is an abstract term,
and it only appears in 28 percent of the relevant documents, because
effects could be described as improvements, decrease, impact, et
cetera.
Ailments is not only abstract but also a rare term, and that makes the
situation much worse. These queries are from TREC datasets, where the
government provides the users who generate the queries and do the
assessments to produce relevance judgments, and we as participants can
freely participate and evaluate our systems.
So this is a very nice deal. So by now there should be a lightning
strike in your head. We have a simple definition which allows us to
estimate the probability of mismatch from relevant documents and to
analyze mismatch.
This probability, the probability of the term occurring in the
relevant set, occurs in one of the very early retrieval models. If
you assume that term occurrences are binary and conditionally
independent of each other given the relevance information, then the
optimal ranking score is given by this formula.
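The formula itself is on the slide and not captured in the transcript; as a hedged reconstruction from the surrounding description, the classic binary independence ranking score it refers to can be written as

```latex
\operatorname{score}(D, Q) \;=\; \sum_{t \,\in\, Q \cap D}
\log \frac{P(t \mid R)\,\bigl(1 - P(t \mid N)\bigr)}
          {\bigl(1 - P(t \mid R)\bigr)\,P(t \mid N)}
```

where P(t | R) is the term recall probability and P(t | N) is the probability of the term occurring in the nonrelevant set.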
If you are familiar with machine learning, this is just a naive Bayes
model. The retrieval model aggregates relevance scores for the terms
that appear both in the query and in the document, and we are scoring
whether document D is relevant. In this case two conditional
probabilities -- two sets of probabilities -- determine the retrieval
score, and two probabilities determine the optimal term weight here.
One is the probability of the term occurring in the nonrelevant set.
Because the relevant set is usually very small compared to the
collection, this probability can be accurately approximated by the
probability of the term occurring in the collection, and that results
in the traditional inverse document frequency based term weighting:
term weighting based on how rare a term is; if it is a rare term, it
is more important.
The other part, we now know, is determined by the term recall
probability. Now, this is a very basic model. However, more advanced
models use this as the sole part, the only part, that specifies term
weighting -- how important the term is -- and the other advanced
models behave similarly. These advanced models have been used as very
effective features in Web search.
It is important to recognize that this probability, the term recall
probability, appears as the only part in the formula that is about
relevance to the query. The other part has nothing to do with
relevance, because it's a collection statistic; it has nothing to do
with the query.
This full formula has been called the relevance weight or term
relevance, but term recall is the part that really talks about
relevance. In theory, it's as important as IDF, and it's the only
part in a retrieval formula that's about relevance. In practice,
because people know this probability is difficult to estimate -- you
need information from the full set of relevant documents to estimate
it -- people typically just ignore this part and use only IDF based
term weighting. When people do that, it causes the emphasis problem,
where the retrieval model will try to emphasize the high-IDF terms in
the query. So, for example, for the query prognosis of viability of a
political third party in the U.S., prognosis and viability have high
IDF -- they're rare terms -- so they are being emphasized.
If we look at the term recall probabilities, political third party
should be emphasized instead. When the retrieval model assigns the
wrong emphasis to the query terms, there can be top false positives.
This is a ranked result list from an advanced retrieval model, a
language model, and all of these top ten results are false positives,
meaning they're irrelevant documents that happen to contain the rare
terms prognosis and viability but are not about the topic. And it's
important to recognize that this is an emphasis problem instead of a
precision problem, because if you just look at the top results, you
see that prognosis and viability are being used about something else
instead of political third party.
So you might think that if you require prognosis and viability to be
about political third party, it might improve the situation. But in
fact increasing the strictness of the match can only increase mismatch
and make the situation worse. So this is a mismatch problem, not a
precision problem -- recall, not precision. And even Google and Bing
still have top-ten false positives. I should note that Bing is much
better than two years ago, when I first tried this query on Bing, and
Google actually decreased performance a little bit on this query. I
don't know why.
So there are false positives throughout the ranked list, decreasing
precision at all recall levels. I've shown you that this is an
important problem. But how frequently does it occur? How significant
is the problem?
It turns out the 2003 Reliable Information Access workshop gathered
many groups of expert research IR systems and evaluated language
models, BM25, pseudo relevance feedback -- all the standard techniques
still being used now -- and did a failure analysis. Out of the 44
failed topics -- and we're summarizing their results here; they did
not summarize their results this way -- 64 percent failed because of
emphasis, and we now know that term recall based term weighting can
help solve this problem. Another 27 percent is the mismatch problem,
where you need some kind of query expansion to solve the problem, and
we now know that if we know which terms tend to mismatch, we can guide
our expansion toward solving these problem terms. So underlying more
than 90 percent of the failures is our need to predict this term
mismatch probability.
So in practice it explains common failures of retrieval models, and
not only that, but also many other behaviors of retrieval techniques.
For example, when you combine bigrams with unigrams in your query, the
bigrams tend to get a much lower weight than the unigrams. Why is
that the case? Because the bigrams increase mismatch, right? So they
should have a much lower recall probability and a much lower weight
than the unigrams. And personalization, disambiguation, and
structured retrieval, which enforces structural matching between the
query and the documents -- these techniques all increase precision,
and they are shown to be less stable for improving retrieval. Why is
that the case? Perhaps because the problems the queries are suffering
from are mismatch problems, not precision problems.
Okay. I've shown you that it's a frequent problem. Now let's focus
on the emphasis problem. It's a frequent problem, but what about
retrieval performance gain? What kind of gain are we talking about?
For basic models, if you plug the true term recall probabilities into
the retrieval model, you can get a 100 percent gain in retrieval. For
more advanced models, you can still get a 30 to 80 percent gain. So
for a new query without relevance judgments, we need to predict this
probability, but the prediction perhaps doesn't need to be very
accurate to show a performance gain, because there's a huge potential
-- a huge potential gain.
Now on to prediction. How do we predict this probability, this
mysterious probability that people have found very few clues to
predict? We look at the data. First, it varies from 0 to 1, so we
need prediction. Second, the same term in different queries can have
different term recall probabilities, so we need a query-dependent way
to predict this probability. Third, it's different from IDF. These
three trends also occur in a larger-scale analysis. Here I'm plotting
the term recall probabilities; each point is a term -- the term recall
probability of a term -- sorted in descending order. As you can see,
the term recall probability varies from 0 to 1 almost uniformly.
>>: All this comes up in one query?
>> Le Zhao: This is from one TREC dataset, which includes 50 queries.
These are fairly long queries; for shorter queries the term recall
probabilities are biased toward higher recall. And you should be
surprised by this statistic, because on average a query term
mismatches 30 percent of the relevant documents -- and that's a query
term from a short query. So imagine you as a user typing one query
term into your search engine: you're excluding 30 percent of the
relevant documents. And as you type the next term, you are excluding
another 30 percent of the remaining ones. Right? So that gives you
less than 50 percent of the relevant set to even begin with.
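A quick check on that arithmetic (a rough sketch assuming each term misses 30 percent of the relevant documents independently, which is an assumption, not a claim from the talk):

```latex
0.7 \times 0.7 = 0.49 < 0.5, \qquad 0.7^{3} \approx 0.34
```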
>>: Confuses the term recall [inaudible] and you're not doing any
stemming?
>> Le Zhao: Good point. I am doing stemming; yeah, call that
stemming. And, globally, what I'm plotting here is the term recall
probability -- the mean and the variance for the same term occurring
in more than one query. As you can see, it still varies from 0 to 1,
and the spikes are the variance. So for a lot of terms there is some
variance, but the variance is small; it can be large for certain query
terms. And in this plot, the points are the query terms; X is IDF and
Y is the term recall probability.
As you can see, there's a slight correlation, but it's still messy, so
you cannot directly predict this using IDF. But what have prior
approaches done? Prior approaches have tuned this as a constant, or
used IDF as the only feature to predict it. Success was limited and
only over very basic models, and the missing piece is the knowledge
that this probability measures term recall and is related to term
mismatch.
With this knowledge, we can ask ourselves: what causes mismatch? What
might cause mismatch? First, a term -- a concept -- being not central
to the topic. The words related or potentially related are not really
central to this topic; propounded is not really central. So these
terms tend to mismatch. Second, synonyms tend to occur in place of
the original query term, causing mismatch. Third, abstract terms tend
to be replaced by more specific terms in relevant documents, which
causes mismatch.
Given these factors, we can try to design features to model them and
do the prediction. So I've shown you the mechanism of how mismatch
occurs and how mismatch causes problems in retrieval.
In terms of features, what we need to do is identify synonyms of a
query term in a query-dependent way. We made specific choices in our
design of the features to be general: to depend only on the query and
the collection, and not to use external resources. Certainly that's
not the best way; there should be better ways to design these
features. But this is the first set of features that has been shown
to work for this problem.
Let's see how we do that. First, external resources like WordNet,
Wikipedia, and query logs have a coverage problem, and they tend to be
static rather than query dependent, so they're not easy to use; using
them is a research topic in itself. What we did is use a term-term
similarity measure in a concept space to help us identify the
synonyms. It's called local latent semantic indexing because, given
the query, we do an initial retrieval, get the top set of documents
from that retrieval, and apply latent semantic indexing on those top
documents. For example, we can take the top 200 documents, do
dimensionality reduction, and keep only 150 dimensions.
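A minimal sketch of that local LSI step, assuming an initial retrieval has already produced the top documents as raw text; scikit-learn's TruncatedSVD is used here for the dimensionality reduction, and the 200-document / 150-dimension settings follow the numbers just mentioned:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_term_vectors(top_documents, n_dims=150):
    """Build a concept-space vector for every term in the top retrieved documents."""
    vectorizer = TfidfVectorizer()
    doc_term = vectorizer.fit_transform(top_documents)          # docs x terms
    n_comp = min(n_dims, doc_term.shape[0] - 1, doc_term.shape[1] - 1)
    svd = TruncatedSVD(n_components=n_comp).fit(doc_term)
    # components_ is concepts x terms; scale by singular values and transpose
    # to get one concept-space vector per term.
    term_vectors = (svd.components_ * svd.singular_values_[:, None]).T
    return dict(zip(vectorizer.get_feature_names_out(), term_vectors))

def top_supporting_terms(term, term_vectors, k=5):
    """The k terms most similar to the query term by inner product."""
    v = term_vectors[term]
    sims = {t: float(np.dot(v, u)) for t, u in term_vectors.items() if t != term}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```

The self-similarity np.dot(v, v) and the average similarity of the supporting terms can then be used directly as the centrality features described next.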
Here are some examples. This is the query term, and these are the top
similar terms identified by latent semantic indexing, using the inner
product as the similarity measure. As one feature we use the
self-similarity of the term, which measures how well the term
correlates with the latent concept space; since the latent concept
space is related to the query, this measures how well the term relates
to the query. We also use as a feature the average similarity of the
supporting terms, so we're also requiring that not only the term but
also its supporting terms correlate with the topic -- basically, that
the concept is central to the topic. And we also designed a feature
which measures how likely these synonyms are to replace the term in
collection documents: if the synonyms are frequent terms, then it's
very likely that they will appear in documents where the original
query term doesn't appear.
So these are the features.
And we can measure --
>>: Can you go back a slide. I didn't understand number two. What
are supporting terms?
>> Le Zhao: Supporting terms are the top similar terms measured by
latent semantic indexing. So basically we represent each term in a
concept space and we compute inner product similarities.
>>: With just --
>> Le Zhao: Rank. Rank the terms, and these are the top similar terms.
>>: Seem to [inaudible].
>> Le Zhao: Yes. We pick the top five. Yes.
>>: And then, sorry, didn't understand three, the synonyms are the top
supporting terms.
>> Le Zhao: Yes, synonyms are the top supporting terms. And we are
measuring how likely these terms are to occur in collection documents
where this term doesn't occur.
>>: It doesn't occur in.
Okay.
>> Le Zhao: Measure of how likely these terms replace the original
query term in collection documents.
>>: That is a collection of --
>> Le Zhao: Over the entire collection, yes. Not just the top
documents. We can measure how well these features correlate with the
target, term recall. It turns out term centrality has a high
correlation -- higher than that of IDF. And negative also means
helpful: positive or negative, as long as the absolute value is large,
it's helpful. Concept centrality is also very helpful.
Replaceability, understandably, has a negative correlation, and it's
fairly high.
Abstractness. The abstractness feature is based on the observation
that users tend to modify more abstract terms with more concrete
terms. For example, educational modifies the word program in this
query, so program tends to be the more abstract term, and so on.
Effects also tends to be an abstract term. So we can use a dependency
parser to parse the query, and if a query term is being modified by
other query terms, then we say it is more abstract. This is a binary
feature, and it has a correlation of 0.12 with the target, so this is
also very helpful.
>>: And the top documents are retrieved just using IDF?
>> Le Zhao: Yes, yes, using the baseline retrieval model. Whatever
baseline I'm comparing to.
>>: So you are using a [inaudible] -- so which dependency parser?
>> Le Zhao: Stanford.
>>: Stanford.
>> Le Zhao: Yes.
>>: The Stanford parser is not designed for query [inaudible].
>> Le Zhao: No.
>>: So what's the increase [inaudible].
>> Le Zhao: I haven't verified. But for the small set of queries that
I looked at, the parses look very accurate. So the parsers also
behave fairly accurately on short texts. Given these features, we can
model the prediction of term recall as a standard regression problem.
We use training queries with known relevance judgments to train the
model and use another set of queries, without relevance information,
as the test set. Here we use Gaussian kernel support vector
regression as the prediction model.
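A minimal sketch of that regression setup, assuming each query term has already been turned into a feature vector (centrality, replaceability, abstractness, IDF, and so on) with its true term recall from the training queries as the regression target; the feature extraction itself is not shown and the numbers are placeholders:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X_train: one row of term features per (training query, query term) pair.
# y_train: the true term recall P(t | R) computed from relevance judgments.
X_train = np.array([[0.82, 0.45, 0.0, 3.1],
                    [0.31, 0.77, 1.0, 6.4]])
y_train = np.array([0.95, 0.28])

model = make_pipeline(StandardScaler(),
                      SVR(kernel="rbf", C=1.0, epsilon=0.05))  # Gaussian kernel SVR
model.fit(X_train, y_train)

# Predict term recall for unseen query terms, clipped to a valid probability.
X_test = np.array([[0.60, 0.50, 0.0, 4.0]])
predicted_recall = np.clip(model.predict(X_test), 0.0, 1.0)
```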
You can also use other advanced prediction models, like boosted
decision trees or boosted regression trees; they work similarly. In
the experiments we measure two things. First, how accurately we
predict term recall, measured by how close the prediction is to the
truth. Second, retrieval performance, using overall retrieval success
and precision at top ranks.
So, what percentage of the top ten are relevant. Here's one example:
for this query, we get the correct emphasis, although the absolute
values are still not very close. More globally, this is the method
that uses the training set average to predict for the test set, and it
gets an error of 0.3. If the term recall distribution were uniform
from 0 to 1, using the training set average would give an error of
0.333; so it's not completely uniform.
Using IDF alone increases the error a little bit. Using all the
features and tuning the meta-parameters of the features, we can reduce
the error by half. And this shows what happens if we use the
recurring terms -- terms that occur in more than one query in the
training set -- to predict: using the previous occurrence to predict
the next occurrence, we get a fairly low prediction error. But these
two values are not directly comparable, because they're not measured
on the same set; this one is measured on the test set and this one on
the training set.
So it can be predicted. And I want to briefly insert that our method
demands a more general view of the retrieval modeling problem.
Traditionally, retrieval modeling is seen, restrictively, as learning
a document classifier for a given query: classifying whether a
document is relevant to the query or not. The more general view sees
a retrieval model as a meta-classifier which is responsible for many
queries: it takes the query as input and outputs a document
classifier. Given this view, learning a retrieval model is basically
transfer learning in machine learning terms: you are using knowledge
from related tasks -- the training queries -- to learn the new
classifier for the test query, and the features and the model are just
facilitating the transfer.
So this more general view would perhaps lead to more principled
investigations of how to learn retrieval models and also allow us to
apply more advanced transfer learning techniques to retrieval. Okay,
that's the insert. Now we're measuring retrieval performance. If you
are familiar with retrieval models, this is how we insert the
probability into the retrieval models as the term weighting, in a
language model. If you are not familiar with retrieval models, that's
okay: we're just weighting the query terms. We're not doing any kind
of expansion.
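The exact weighting formula is on the slide rather than in the transcript; as a hedged sketch of the general idea, the predicted term recall can be folded into query likelihood scoring as a per-term weight, roughly

```latex
\operatorname{score}(D, Q) \;=\; \sum_{t \in Q} \hat{P}(t \mid R_Q)\,\log P(t \mid D)
```

where \hat{P}(t | R_Q) is the predicted term recall for term t in query Q and P(t | D) is the smoothed document language model.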
This is the performance on six different test sets. Each test set,
from TREC, contains 50 queries, and we use one as a training set and
another as a test set. Across the board, there's a 10 to 25 percent
gain, and most of the gains are significant. That is measuring
overall retrieval performance. For top precision, as predicted by
theory, there's also a 10 to 20 percent gain, although not always
significant, because the measurements are sparse.
So it can be used in retrieval as term weighting to help solve the
emphasis problem, and it leads to significant gains. What about the
mismatch problem? If we can successfully solve the mismatch problem
-- increasing the term recall probability of every query term -- we
can at the same time solve the emphasis problem.
So let's recap.
>>: Let me verify -- you use the union of the two [inaudible].
>> Le Zhao: Sorry?
>>: The two-stage retrieval.
>> Le Zhao: Yes. We need two-stage retrieval. One to generate the
features, one to do the retrieval.
>>: And bringing you to [inaudible] the set of -- the relevant
documents.
>> Le Zhao: Yes.
>>: And generate features and train the model to weight this
[inaudible].
>> Le Zhao: Yes. Two stages.
>>: So do you have a comparing method [inaudible]?
>> Le Zhao: Yes, that's a good point. I do have. But I will come
back to that after the talk.
>>: So this slide you're showing on transfer learning -- can you go
back to that? Sorry, keep going back. So this more general view,
that this is sort of like learning -- are you saying something
different? Most learning models are used with a training set and a
test set; are you saying something different from that?
>> Le Zhao: Good point. By now the dominant model is learning to
rank, which basically learns a global retrieval model from the
training set -- just one retrieval model, basically one classification
model -- and applies that same model to the test queries. But here
we're learning from the training queries and generating a new
classifier for each test query, so the classifier is different, while
learning to rank learns a global model, one classifier. Does that
answer your question?
>>: No. So is the classifier in your case just the product of the
P(t|R) values, or do you actually have an underlying model as well in
addition to those?
>> Le Zhao: The underlying classifier is just BM25 or a language
model -- the traditional models, with that probability inserted into
the model.
>>: So what you're doing is you're learning an improved probability.
>> Le Zhao: Yes, but at the same time, with a different term
weighting, that is a different model -- a different classification
model. It's the same retrieval model, but it's a different
classifier, because you're classifying the documents differently with
different term weights.
>>: The retrieval is too thick.
>>: I understand that. I can ask you more about this later. It seems
more like different feature values to me than different model. But
I'll ask it.
>> Le Zhao: Yeah, yeah. Okay. I can perhaps come back to that later.
So let's recap. Mismatch ranges from 30 to 50 percent on average, and
relevance matching can degrade very quickly for multi-word queries.
One solution is to fix every query term: if we fix every query term by
expanding it with its synonyms, the result is a conjunctive normal
form query.
In this case the keyword query is expanded into a conjunctive normal
form query. It's very expressive and very compact: one conjunctive
normal form query in this case is equivalent to hundreds of
alternative keyword queries. This is used very frequently by lawyers
and librarians; in this case it's a legal track query, which was
created by lawyers.
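For illustration (a made-up example in the spirit of the earlier job-search scenario, not the actual legal track query on the slide), a CNF query conjoins one group of synonyms per concept:

```latex
(\text{information} \lor \text{text}) \;\land\; (\text{retrieval} \lor \text{search}) \;\land\; (\text{job} \lor \text{position} \lor \text{opening})
```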
As you can see, they spend lots of time trying to expand every query
term, so it's a very tedious task. What we propose to do is, given
this term mismatch probability -- given the prediction -- guide the
expansion, letting the user focus on the terms that have problems. So
here placement and children are being expanded while the other terms
are kept untouched. The goal is to expand, say, two terms and still
get 90 percent of the improvement -- that would be great.
So how do we evaluate that? Ideally, we have a user. The user
proposes a keyword query; the keyword query is sent to the diagnosis
system, which diagnoses which terms are the problem terms; the problem
terms are fed back to the user; the user expands the problem terms;
and the query formulation strategy generates the query, which is
submitted to the retrieval engine for evaluation. In this setup we
can plug different diagnosis methods into the diagnosis component and
different query formulation strategies into the formulation component,
and we can compare these different methods.
That's what we intend to do. However, an online user study of such a
complex system needs to control many variables; without millions of
users this kind of online study cannot be carried out. What we end up
doing is a simulation. We have the expert user give us a fully
expanded CNF query beforehand, and we extract the keyword query by
taking the first term out of each conjunct. From there the pipeline
proceeds as before. So that is one simulation, and the user expansion
is also a simulation, where the expansion terms are extracted out of
the CNF query. This simulation is fairly realistic: it uses the full
expansions to evaluate -- to simulate -- partial expansions.
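A minimal sketch of that simulation, with an illustrative CNF query (not one of the actual legal track queries): the keyword query takes the first term of each conjunct, and a partial expansion keeps the full synonym list only for the diagnosed problem terms:

```python
# A CNF query as a list of conjuncts, each conjunct a list of synonyms.
cnf_query = [["placement", "adoption", "foster"],
             ["children", "child", "minors"],
             ["agency"]]

def keyword_query(cnf):
    """Simulated user keyword query: the first term of each conjunct."""
    return [conjunct[0] for conjunct in cnf]

def partial_expansion(cnf, terms_to_expand):
    """Keep the synonym list only for diagnosed problem terms."""
    return [conjunct if conjunct[0] in terms_to_expand else [conjunct[0]]
            for conjunct in cnf]

print(keyword_query(cnf_query))                 # ['placement', 'children', 'agency']
print(partial_expansion(cnf_query, {"placement", "children"}))
```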
>>: I'm missing something.
Why is it bad to expand everything?
>> Le Zhao: It's good to expand everything. I'm just saying, given
our probability, how do we use it? We can use it to guide the
expansion, to save some time for the user.
>>: So I know the user -- the expansion happens automatically, right?
What does the user do?
>> Le Zhao: The expansion doesn't happen automatically. The CNF query
you're seeing was expanded by a lawyer, spending lots of time on one
query.
>>: But you know the synonyms, right. We have dictionaries with
synonyms. I don't understand what's the manual part here?
>> Le Zhao: The synonyms are difficult to get. These are the gold
standard: if you can get the synonyms, you get great retrieval. But
we don't -- we don't have automatic ways to get them. Perhaps Bing
has that.
>>: All right. Okay.
>> Le Zhao: But even Google and Bing have this problem. So that means
if we can manually expand more terms, we can do better.
So that means
>>: Manually generate CNF queries give you the upper -- [inaudible].
>> Le Zhao: Yes. So I'll show the experiments. Here we're using two
datasets: one with CNF queries created by lawyers, one with CNF
queries created by search experts. In terms of diagnosis methods, I'm
plotting on the X axis the number of query terms selected for
expansion, and on the Y axis the relative retrieval performance gain.
The upper bound is expanding fully -- or expanding not quite fully,
but close -- so fully expanding essentially gives the upper bound.
You can observe that these two points use the P(t|R) based diagnosis:
we predict P(t|R), do the diagnosis, do the expansion simulation, and
evaluate. Expanding two terms using P(t|R) based diagnosis gets about
90 percent of the performance gain, while if we only use IDF for
diagnosis, we need to expand three terms. So we're effectively saving
users' time.
>>: Do you know what the very best you could have done with two is?
>> Le Zhao: What do you mean, the very best?
>>: You could have tried all pairs to see [inaudible].
>> Le Zhao: No, I haven't done that. But that's a good point. So
we're using a greedy strategy here. Greedy strategy. Not necessarily
the optimal. That's right.
>>: Sorry. I don't think that's what I meant. I mean -- unless I
misunderstood you. So greedily selecting the two, fine, but you also
have like the four or five expansions, right? So would you be able to
just -- would you be able to try all two and see which one gets the
very highest retrieval scores.
>> Le Zhao: Expanding which query term or do you mean how many query
terms to expand for each query term?
>>: Which query terms to expand.
>> Le Zhao: So no, we haven't tried permutations. We are only using a
greedy approach to expand --
>>: Try all the possible combinations.
>> Le Zhao:
We haven't tried that.
>>: Rank them just by bag of --
>> Le Zhao: Yes.
>>: Because you already have the -- [inaudible].
>> Le Zhao:
Yes.
>>: Oracle.
>> Le Zhao: Yes, that's an oracle experiment. So in terms of
expansion -- forms of expansion -- we're comparing CNF expansion
versus traditional bag-of-words expansion, keyword expansion. As you
can see, CNF does better than keyword expansion.
>>: So what is the [inaudible] expansion would be?
>> Le Zhao: We are using the same set of manual, high-quality
expansion terms from the users. But instead of doing CNF expansion,
we combine this set of expansion terms as a group with the original
query, using a weighted combination. We have also tried weighting the
expansion terms, but that doesn't help much. So we have tried several
different ways of doing the traditional form of expansion.
>> Le Zhao: The traditional way of doing the expansion.
>>: You also assume that the candidate expansions come from the CNF
expansions?
>> Le Zhao: Yes. The candidate terms are from the CNF expansion, yes.
I have also tried using an automatic method -- the standard relevance
model, the standard pseudo relevance feedback method -- to
automatically extract expansion terms, but that's worse than the
manual expansion terms.
So the problem diagnosis can produce simple and effective CNF queries.
I've also worked on other aspects of the problem, such as improving
the efficiency of the prediction. Here we can use a one-pass
prediction -- one-pass retrieval -- with a three to ten times speedup
while still retaining 70 to 90 percent of the performance. That's
based on the observation that for many of the query terms the term
mismatch probability doesn't vary much across queries, so we can use
that to speed up the prediction.
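A minimal sketch of that idea, assuming term recall estimates from past training queries are cached per term and a rough global prior is used as a fallback for unseen terms (the numbers are placeholders):

```python
from statistics import mean

class TermRecallCache:
    """Reuse per-term recall estimates from training queries; fall back to a prior."""
    def __init__(self, training_estimates, default=0.7):
        # training_estimates: {term: [recall values observed in past queries]}
        self.cache = {t: mean(vals) for t, vals in training_estimates.items()}
        self.default = default  # assumed prior for terms never seen in training

    def predict(self, term):
        # One lookup per query term: no features, no second retrieval pass.
        return self.cache.get(term, self.default)

cache = TermRecallCache({"spills": [0.99], "effects": [0.28, 0.35]})
print(cache.predict("effects"))   # average of the cached observations
print(cache.predict("ailments"))  # falls back to the default prior
```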
>>: Does that mean you don't need pseudo relevant documents?
>> Le Zhao: We're not doing pseudo relevance feedback in this case.
So it's one stage.
>>: So you need a cache of each query term that has this probability.
>> Le Zhao: Yes. We need a cache. But luckily the training set --
>>: The model and the only difference is the probability is the
probability of term given the document?
>> Le Zhao:
Sorry.
I don't get your point.
>>: I'll take it off line.
>>: You're predicting from the previous three queries --
>>: Query terms.
>>: Judged relevant documents.
>> Le Zhao: Judged relevant documents from the training set. I've
also worked on --
>>: Coverage would be a problem.
>> Le Zhao:
Yes, coverage could have been a problem.
And, yeah.
>>: I think you [inaudible].
>> Le Zhao: Luckily you don't need much coverage: about 50 percent
term coverage gets about 70 to 90 percent of the gain. If you have
larger training sets, yes, definitely better. We've also worked on
structured retrieval using semantic role based structure. We can
annotate the question -- which terms are the target, which terms are
the agents, the subjects, the objects -- annotate the answer in the
same way, and try to match the structure, not only the keywords. But
this structured retrieval causes mismatch at the structure level,
where, because of a switch of the key verb, the argument zero here
becomes the argument two there, and it's a different target. We use
an undirected graphical model to jointly learn the field-level
translation from the question to the answer, using a training set of
true question-and-answer pairs. We can learn the translation and
predict which are the likely structures given the query, and using the
prediction we can use the [inaudible] Lemur search engine to query
these alternative structures -- it allows us to query this kind of
structure. We get a 20 percent gain in overall retrieval versus using
just the strict question structure alone.
Conclusions. This talk is about two long-standing problems in
retrieval: term mismatch and term weight estimation. We have provided
a definition and an initial analysis of term mismatch; future work
could explore new features and new prediction models that improve the
prediction even further. We have shown the role of the term mismatch
probability in basic retrieval theory and used principled approaches
to address term mismatch. But what about more advanced models, like
learning to rank or transfer learning -- how does term mismatch play
out in those models?
We have used automatic ways to predict term mismatch, based on an
initial modeling of the possible causes of mismatch, and we have
provided an efficient way to predict this probability. Future work
should explore better ways to model these causes, or other causes that
haven't been explored here.
In terms of effectiveness in retrieval, we have used term weighting
and diagnostic expansion. Better techniques are needed, like
automatically expanding the query into CNF form. Better formalisms,
like transfer learning, might help extend this work to more tasks,
like relevance feedback, et cetera.
We have done diagnostic intervention. Diagnostic intervention can
happen at different levels of the retrieval process; we apply the
diagnosis at the term level and only diagnose the term mismatch
problem, and we have shown that this guided expansion can help
retrieval. Future research should explore the diagnosis of specific
types of mismatch problems -- is it because of abstractness or because
of synonyms? It could also explore different problems, not just
mismatch but also precision problems, so that we can guide the many
different precision techniques, such as personalization, to solve the
real problem of the query and improve retrieval.
And even further, we can proactively diagnose the user. We can see
what problem the user is having and suggest searches or results even
before the user types into the search engine. So that's my thesis
work.
I have also worked on lots of other things at CMU: building datasets
like ClueWeb09, which is used by more than 200 research groups
worldwide and in more than seven TREC retrieval evaluation tasks; and
the Lemur toolkit, which is an open source IR toolkit that can do lots
of fancy things.
I've worked on large-scale computing, and I teach a fairly popular
Hadoop tutorial at CMU. In terms of other research, I've worked on
structured retrieval, legal discovery tasks, patent retrieval in the
biomedical and chemical domains, and information retrieval for human
language technology applications like question answering, filtering,
knowledge base extraction from the Web, and information extraction.
And with that, I end my talk, and I'd like to take feedback.
[applause].
>> Kuansan Wang: More time for questions.
>> Le Zhao: I'll just check here. Go ahead.
>>: Just a question for the low level [inaudible].
>> Le Zhao:
[inaudible] are there better?
>>: I'm just -- so you are using [inaudible] level. So there may be
some new level [inaudible] to expand that. So have a more specific
kind of test.
>> Le Zhao: I see. I see.
>>: Okay.
>>: A better patch [inaudible].
>> Le Zhao: Right.
>>: Specific [inaudible].
>> Le Zhao: Right. So James's point is that a better semantic
representation of the text will help solve the mismatch. Yeah. I'll
just check who else might have questions during the talk.
>> Kuansan Wang: If not, let's thank the speaker again.
[applause]