>> Kuansan Wang: All right. Let's get started. It is my greatest pleasure to welcome
Yanen Li from UIUC to come here and visit us. Yanen is a Ph.D. student from the
computer science department under the supervision of Professor ChengXiang Zhai.
We're all good friends.
He has been a great intern, doing internships multiple times with us on query
understanding. He has also contributed in many areas, and he participated in the
first Bing and Microsoft Research Speller Challenge, where he won second place. His
work is so impressive that we couldn't wait for him to get his degree, and we brought
him back to interview.
And today he's going to tell us his Ph.D. work. I'll let you take over. Welcome.
>> Yanen Li: Thank you very much, Kuansan, for the warm welcome. And good afternoon,
everyone. Today I'm very happy to share with you my Ph.D. studies on the topic
of multilevel query understanding.
Okay. Now, let me get started. Nowadays we are entering the big data age. Every
second, we are generating a large amount of data on the Web. And search engines
remain the most effective tools for managing such a huge amount of data.
This magic box is everywhere, from Web document search to product search to people
search.
The reason why search engines remain the dominant tool for managing such a huge
amount of data is that it's actually not the volume that matters. It is the small
amount of relevant data that greatly influences people's decision-making.
And the most effective way of getting this relevant data is search. In the whole search
process, the query is the critical bridge connecting the users on one side and the
information on the Web on the other side.
So understanding a query is a critical task for improving search accuracy as
well as the user's experience. By query understanding I mean understanding not only
what the user says but also what she wants.
For example, for a query how old is Obama, it's much more desirable to directly give the
answer 52 years than just returning a set of Web pages.
However, understanding a Web search query is a long and nontrivial task. Here we
introduce our framework, in which the whole query understanding task is broken into
multiple levels, a multilevel query interpretation and representation.
In this framework, the first level of query understanding is called lexical query
understanding. So the task of this level of query understanding is to try to reduce the gap
between the user's issued query and the ideal query in his mind.
For example, take the query Louis Vuitton store in Las Vegas. People sometimes issue a
more ill-formed version of this query because of spelling errors. So the task here is to
transform the ill-formed query into a well-formed one.
And once we get the well-formed queries, in the next level, the syntactic level of
query understanding, we try to make sense of the structure of the query, which means we
try to break the query into semantic units, called query segments.
Once we can do that, the query can be represented by a bag of concepts, or a bag of
phrases, instead of a bag of words. By doing that, we can support more complex retrieval
models such as concept weighting and phrase-based translation models.
>>: Quick question. Do you find that the segmentation tends to result naturally in
what we would consider linguistic [inaudible] Louis Vuitton, or for particular tasks do
they --
>> Yanen Li: We're in the particular context of having the relevance meaning, right,
well, particularly in retrieval, building better retrieval models, right?
>>: Yeah. And do those tend to correspond to things that we would consider entities and --
>> Yanen Li: Yes, sometimes. Yes, yes, sometimes. Yes, [inaudible]. And then in the next level,
the semantic level of query understanding. So once we get the phrase boundaries of the
query, we also try to get the entities and relations of the query.
For example, here, we know that Louis Vuitton and Las Vegas are both entities. Louis
Vuitton is a famous luxury brand and Las Vegas is a famous city in Nevada.
So once we can get such a rich query representation, we can support many
interesting applications. For example, we can enrich the search result with entity
attributes and related entities. We can also support entity search, where we return a
list of relevant entities instead of a list of relevant pages.
We can also do direct question answering, where we directly give the answers to
users if we know exactly what they want.
So there are a couple of research questions here. First, how can we do it in a general way
so that we can apply our solution to all kinds of queries? Second, how can we do it
with minimal human effort so that we can save money and annotation effort? And, third,
can we do it in real time when the users need it?
And so specifically we are interested in the questions as below. In the lexical query
understanding, we are interested in the problem of query spelling correction. And in the
query syntactic understanding, we are more focused on query segmentation. And in the
query semantic parsing, we focus on a particular problem of entity synonym mining,
which is a critical component of the query semantic parsing.
And notice that these levels of query understanding are somewhat static, meaning that we
have to know the query as a whole in advance. However, we are also interested in dynamic
query understanding, where the user just gives a little hint such as a very short
prefix -- specifically, the query auto-completion task.
So during my Ph.D. studies, I have addressed key questions in each level of the query
understanding framework. For example, for the lexical understanding, we address
modeling multiple types of query spelling errors. On the syntactic level, we address the
query segmentation problem by utilizing the click-through data. On the semantic
level of query understanding, we mine entity synonyms with a clustering framework.
And, finally, in the dynamic query understanding, we model query auto-completion with a
two-dimensional click model.
In this talk, I will very briefly overview the work we have done on the first three
levels of query understanding, and in the rest of my talk I will focus on the dynamic
query understanding part.
Okay. So let me first introduce our work on modeling multiple types of
query spelling errors.
So, as I mentioned, the query spelling correction problem is to try to reduce the gap
between the ill-formed query and the ideal query in user's mind.
Although a lot of effort has been made on this problem, there are still a couple of
research challenges, among which a critical one is how to model the
complex types of errors.
For example, here I show several types of errors. Some happen within a single word; some
cross multiple words. And such errors can happen simultaneously in the same
query. This kind of complex error significantly influences the user's experience.
People have addressed the query spelling correction problem with a standard
hidden Markov model. In that setting, the input is the query and the output is the
most likely sequence of states. However, in our work, we add another layer, the state
type layer, into the HMM framework so that we can actually correct such complex types of
errors.
So I will walk through some examples to illustrate the idea.
For example, for a single word error, which is easy, the HMM process is
like this. We first [inaudible] a state of the in-word transformation type, and then we
emit a misspelled query word. Then we transit to another state and emit another word,
and so on and so forth. You can see that there is a 1-to-1 correspondence between the
states and the query words.
However, in a more complex case we have a splitting error, meaning that, here, homepage
is split into two words. So suppose the state sequence is now at this point, and the
model transits to a state with type merging. In this state, it will try to merge
multiple words and then emit the misspelled series of words, and then go to another
state. In this way we can correct this type of spelling error.
In another case, the concatenation error, multiple words are concatenated
into a single word. In this case, suppose we are now here, and the model transits to a
state with the type splitting. In this state, it will try to split this word into
multiple words and emit the misspelled word.
So you can see that once we get this type sequence, it helps to generate the correct
emissions. However, this state type sequence is latent, meaning that not only can we
not observe the state sequence, we also need to infer the hidden state type sequence.
This is a typical case of structured learning with latent variables. In this work
we adopt Collins' 2002 method with some modifications. I will skip the detailed
mathematical part and instead show some examples.
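To make the decoding idea concrete, here is a minimal sketch, not the speaker's actual implementation: a small dynamic program that, at each query position, also chooses a correction type (in-word transformation, merging, or splitting). The lexicon, the candidate generator, and all the scores are hypothetical stand-ins for the learned feature weights of the latent-variable structured model.

```python
# Minimal sketch of decoding with latent correction types (hypothetical scores).
lexicon = {"homepage", "home", "page", "google"}

def cands(s):
    # hypothetical candidate generator: exact lexicon hits, else keep the word
    hits = [c for c in lexicon if c == s]
    return hits if hits else [s]

def sub_score(observed, candidate):          # in-word transformation: 1 word -> 1 word
    if candidate not in lexicon:
        return -2.0                          # emitting an out-of-vocabulary word is costly
    return 0.0 if observed == candidate else -1.0

def merge_score(w1, w2, candidate):          # merging type: "ho mepage" -> "homepage"
    return -0.5 if (w1 + w2) == candidate else -5.0

def split_score(observed, c1, c2):           # splitting type: one word -> two lexicon words
    ok = observed == (c1 + c2) and c1 in lexicon and c2 in lexicon
    return -0.5 if ok else -5.0

def correct(query_words):
    """Dynamic program over query positions; each step also picks a correction type."""
    n = len(query_words)
    best = {0: (0.0, [])}                    # words consumed -> (score, corrected words)
    def relax(j, s, words):
        if j not in best or s > best[j][0]:
            best[j] = (s, words)
    for i in range(n):
        if i not in best:
            continue
        score, out = best[i]
        w = query_words[i]
        for c in cands(w):                   # type: in-word transformation
            relax(i + 1, score + sub_score(w, c), out + [c])
        if i + 1 < n:                        # type: merging (consume two words, emit one)
            for c in cands(w + query_words[i + 1]):
                relax(i + 2, score + merge_score(w, query_words[i + 1], c), out + [c])
        for k in range(1, len(w)):           # type: splitting (consume one word, emit two)
            relax(i + 1, score + split_score(w, w[:k], w[k:]), out + [w[:k], w[k:]])
    return best[n][1]

print(correct(["ho", "mepage"]))             # -> ['homepage']
```

In the actual model, these hand-set scores are replaced by learned feature weights, and the latent type sequence is inferred jointly with the corrections.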
We compared our method to a state-of-the-art model for query spelling
correction, and the results show that on two sets of data our model
significantly outperforms the baseline.
On the progress of query spelling correction, the industry has also put in a lot
of effort. Back in 2011, MSR and Bing launched the Speller Challenge to improve
Bing's speller. We took part in the challenge and got second place.
Okay. Now let me move to the second part of my talk. Once we get the well-formed
queries, in the syntactic level of query understanding we try to divide the
query into a bag of concepts, a bag of phrases. For this purpose, we have done query
segmentation by exploiting the click-through documents. The task of query
segmentation is clear: it is to insert brackets into the query so that it can be
represented by a bag of concepts.
Our solution is to exploit the click-throughs. The key idea here is this
observation: in order to be a valid segment, a phrase should appear popular both on the
query side and on the clicked document side.
So our solution is to build a generative model that leverages this observation. The
steps for generating a query given a set of documents and the parameters are as follows.
We first generate the query length given the length distribution. Then we select
a query partition B, which is the empty bracketing, based on the bracket distribution.
Then we fill in the actual query words based on the bracketing and the document
language model as well as a segment unigram model. The segment unigram model holds the
parameters we want to estimate, and by maximizing the likelihood of a set of queries we
can iteratively update this model and reach a local optimum.
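To illustrate the intuition, here is a minimal sketch, my own simplification rather than the exact model: it enumerates candidate bracketings of a query and scores each segment by combining a segment unigram model with a hypothetical language-model probability of the segment in the clicked document's title. The names seg_unigram and doc_lm, the statistics, and the interpolation weight alpha are assumptions for illustration.

```python
from itertools import combinations

def segmentations(words):
    """Enumerate all ways to bracket a list of query words."""
    n = len(words)
    for k in range(n):                                   # k break points
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [" ".join(words[bounds[i]:bounds[i + 1]])
                   for i in range(len(bounds) - 1)]

def score(segmentation, seg_unigram, doc_lm, alpha=0.5):
    """A segment should look plausible both as a query concept (segment unigram
    model) and in the clicked document's title (document language model)."""
    s = 1.0
    for seg in segmentation:
        s *= alpha * seg_unigram.get(seg, 1e-6) + (1 - alpha) * doc_lm(seg)
    return s

# toy usage with made-up statistics
seg_unigram = {"bank of america": 0.02, "online": 0.05}
doc_lm = lambda seg: {"bank of america": 0.03, "online": 0.02}.get(seg, 1e-6)
words = "bank of america online".split()
best = max(segmentations(words), key=lambda seg: score(seg, seg_unigram, doc_lm))
print(best)   # -> ['bank of america', 'online']
```

The real model additionally estimates the segment unigram parameters by iterating over many queries and their clicked documents, which this sketch omits.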
>>: Basically what are B and Q here?
>> Yanen Li: B is the clicked document, Q is the actual query. So we try to generate a
query whose segments are consistent with our segment unigram model as well as
the document language model.
>>: [inaudible] click-through data, is a database provided here [inaudible] based on the
title or based on the --
>> Yanen Li: The title.
>>: The title. Not --
>> Yanen Li: Not the actual description. Yeah.
>>: Does talking to language models [inaudible] phrase boundaries?
>> Yanen Li: Yes, it's a binary -- a bigram language model. Okay. Here let me show
you some quantitative results. So we test our model on two datasets. Actually, we can
find some interesting results.
We can actually get event names, movie names, and some entity names as well
as time expressions. So you can see that some segments span multiple words.
We have also conducted a quantitative evaluation, especially compared to the Tan
2008 method, which is a state-of-the-art query segmentation model that only utilizes
information on the query side.
Clearly our model outperforms the baseline, indicating that leveraging the
click-through data can actually help query segmentation.
So --
>>: Could you say something a little bit more about what the data is and where the
segmentation comes from as far as [inaudible]?
>> Yanen Li: Okay. The data actually comes from this paper. It's a standard
dataset for evaluating query segmentation. It contains 500 queries.
>>: 500 queries over what, and who provided the gold standard?
>> Yanen Li: Yeah, the group provided the gold standard; they actually
published the queries and the corresponding segmentations in this paper. Okay.
After getting the query segmentation results, another interesting
question we want to ask is whether better segmentation will lead to better retrieval
models.
In order to do that, we need to handle the segmentation uncertainty. For example,
for this query, Bank of America Online, suppose we get two possible
segmentations with these probabilities. And suppose we have a document; we then
build a new language model with the query segmentation to score the document based on
the query segmentation.
In this model, the procedure for scoring a document is quite intuitive. The score of a
query and a document is a sum over all possible segmentations, where each term consists
of two parts: the probability of the segmentation, and the relevance score of that
segmentation against the document. Summing over all the possible segmentations gives
the final score.
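As a rough sketch of this scoring scheme (my reconstruction from the description; the per-segment relevance function rel is a hypothetical stand-in for the bigram language model that treats a segment as one unit):

```python
def score_document(query_segmentations, document, rel):
    """query_segmentations: list of (segmentation, probability) pairs for one query.
    rel(segment, document): hypothetical per-segment relevance score.
    The final score marginalizes over the segmentation uncertainty."""
    total = 0.0
    for segmentation, prob in query_segmentations:
        seg_score = sum(rel(seg, document) for seg in segmentation)
        total += prob * seg_score
    return total

# the "bank of america online" example, with illustrative probabilities
segs = [(["bank of america", "online"], 0.7),
        (["bank of america online"], 0.3)]
rel = lambda seg, doc: doc.count(seg)             # crude stand-in for the real model
print(score_document(segs, "bank of america online banking login", rel))
```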
And here I skip most of the details of the mathematical modeling, and I will show some -- sure.
>>: Can I ask a question? The previous slide, do you segment the documents the same
way you --
>> Yanen Li: No, we don't segment the document. We try all the possible ways to score
this as one.
>>: What other -- what more possible ways [inaudible]?
>> Yanen Li: No, it's the bigram language model.
>>: Bigram language model treating the query segment as a unigram [inaudible]?
>> Yanen Li: Yes. So treating this as unit and this is another unit.
>>: I see.
>> Yanen Li: So, yeah, we actually did a set of experiments on 12,000 queries, and we
have two observations. One is that our model outperforms BM25, the unigram
language model, as well as the bigram model. And --
>>: And when you do BM25, do you use the segments, or was it no --
>> Yanen Li: No.
>>: Just use words.
>> Yanen Li: Just use words, yeah.
>>: So you consider that a real comparison?
>> Yanen Li: I think it is, because BM25 is a standard retrieval model.
>>: But BM25 says nothing about the [inaudible].
>> Yanen Li: Yes, of course, you can extend to a more advanced version of BM25, for
example, BM25 considering word proximity. But here we didn't do that comparison,
yeah.
The second observation is that when the query length gets bigger, our performance
improvement is actually bigger, which means that when the query length grows,
we have more opportunity to leverage the query segmentation so that we can get a better
result. Yeah.
>>: So you do [inaudible] relevance query [inaudible] to the unigram. So how do you
[inaudible]?
>> Yanen Li: So, yeah, actually we can take this question offline because of the time.
>>: [inaudible].
>> Yanen Li: Yeah.
>>: [inaudible].
>> Yanen Li: This one?
>>: Yeah. So here you assume each document and query are generated from the same
prior distribution? Is that correct?
>> Yanen Li: Actually the document is not generated from that distribution; it is an
independent distribution. The query is based on the bracket distribution as well
as the document language model distribution.
>>: So basically your training data is a query as well as --
>> Yanen Li: Yeah.
>>: So it's already given [inaudible].
>> Yanen Li: Okay, yeah. Let me move to the other part so I can try to answer more of
your questions at the end.
So once we get these concept boundaries, we move to the query semantic understanding. In
this part of the talk I will introduce our work on mining entity synonyms with a
clustering framework, which is a critical task for parsing the semantics of the query.
In query semantic annotation, we try to recognize the entities and relations
[inaudible]. If all the entities were written in their standard forms, the challenge
would actually not be that big, because once we get the phrase boundaries out of the
query segmentation, we can just do a dictionary lookup.
However, a major challenge here is that people usually don't write the standard forms of
the entities in their queries. There are a lot of variations in these alternative surface
forms. For example, for shoe brands as well as movie titles, we observe a lot of such
variant forms of the same entity. Also, in previous work, people do entity
synonym mining by processing a single input at a time, which means they get the
synonyms for one input and then process another one.
However, in our work, we change the learning protocol a bit. We do joint learning over a
set of entities in the same category. By doing that, there are a couple of benefits. The
first benefit is that we can utilize their mutual attraction to do a better job.
For example, for the entity the lord of the rings the return of the king, there are two
close candidates, lotr 2 and lotr 3, so it's easy to pick up such an incorrect synonym.
However, if we add another input, the lord of the rings the two towers, it becomes
easier to get the correct answer because of the mutual attraction.
The other benefit is that we can discover categorical patterns. Categorical
patterns means the left and right contexts of a set of entities; here I show
the left and right contexts in the shoe brand domain. This kind of categorical pattern
is very useful for disambiguation.
For example, we know that diesel fuel is not likely to be a synonym
of diesel the shoe brand, because they have [inaudible] very different
categorical patterns.
>>: But -- sorry, I'm not understanding the disambiguation challenge here. When
somebody types diesel fuel or diesel brand, it's already disambiguated. If they type just
diesel, you have no context to disambiguate. So how do you envision using this for
disambiguation?
>> Yanen Li: Oh, actually I think one candidate is diesel fuel, another is just diesel.
Because our input is a set of entities in the same category, we already know that diesel
is from the shoe brand category. So we don't want diesel fuel as a synonym of diesel the
shoe brand.
And so we try to solve this problem with a clustering framework with weak supervision.
This is our objective function, and there are two major parts to it. The first part is
the overall in-cluster dispersion; we try to make all the objects in one cluster closer
to one another. The other part is the wiki redirects, which we use as a regularization.
The idea is that we want points that have a wiki redirect between them to be closer to
each other. In this framework the [inaudible] function is a weighted combination of
different semantic matches, and the categorical similarity is one of them.
Let me explain a little bit about the clustering procedure. First of all, we initialize
the cluster center with the canonical value of each entity, because we believe that the
synonym center should not deviate far from the canonical value of the entity.
Then we update the candidate cluster assignments with the initial parameters. After
that, we try to adjust the center. The way to adjust the center is quite interesting:
we combine the top K candidates as well as the canonical value to form a new center.
Sometimes, for example, lotr 3 is actually more popular than the canonical
value because it's short, so people tend to type this short form more than the standard
form.
So including the top K candidates increases the robustness of the solution. Then we
update the feature weights by two criteria. One is that we try to make points in
the same cluster closer to each other. The second criterion is that points having a
wiki redirect should also be closer. We iterate this process for several steps, and
finally it converges to a local optimum.
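A minimal sketch of the objective being minimized, under my own simplifications rather than the speaker's exact formulation: a weighted in-cluster dispersion term plus a wiki-redirect penalty used as weak supervision. The array layouts and the penalty form are assumptions; in the actual procedure the cluster assignments, the centers (blended from the canonical value and the top-K candidates), and the feature weights are updated in turn.

```python
import numpy as np

def objective(dist_features, assign, w, redirect_pairs, lam=0.1):
    """dist_features[i, k, :]: per-feature distances between candidate i and center k.
    assign[i]: cluster assigned to candidate i.  w: feature weights.
    redirect_pairs: (i, j) index pairs connected by a wiki redirect (weak supervision)."""
    # part 1: overall in-cluster dispersion under the weighted distance
    dispersion = sum(float(w @ dist_features[i, assign[i]]) for i in range(len(assign)))
    # part 2: candidates linked by a wiki redirect should land in the same cluster
    penalty = sum(1.0 for i, j in redirect_pairs if assign[i] != assign[j])
    return dispersion + lam * penalty
```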
>>: Excuse me. Can you go back to the previous slide? What are the parameters you
can tweak?
>> Yanen Li: The parameter is -- okay. Sure.
>>: So I'm lost obviously.
>> Yanen Li: Yeah. So the D -- the distance D(x_i, z) is a weighted combination of these
features. So we have the feature weights.
>>: Oh, I see.
>> Yanen Li: Right?
>>: Yeah. Okay.
>> Yanen Li: And also, because this is a clustering framework, the R_ik, the candidate
cluster assignments, also need to be updated.
>>: I see. So [inaudible].
>> Yanen Li: So we compare our method to several baselines, including the individual
features as well as Chakrabarti 2012. Actually that work was done at MSR.
They use a combination of multiple features with [inaudible] feature weights.
And we have some observations. First of all, using a single feature is not quite
effective. Secondly, Chakrabarti's work does a much better job than the single features
because they combine different features quite effectively.
Notice that their method has very high precision but relatively low recall. Our
method can achieve a better balance between precision and recall, so we have the best
F1 score.
>>: So you have a wiki redirection time [inaudible].
>> Yanen Li: Yes.
>>: How important is that?
>> Yanen Li: Um...
>>: First, what is the reason you have that? What are you trying to regularize?
>> Yanen Li: Regularize -- so, as I said, the intuition is that entities having the wiki
redirect should be somehow a synonym. But this relation is [inaudible]. Right? So we
treat this as [inaudible] label data.
>>: I see.
>> Yanen Li: But we don't have any supervision. This kind of wiki redirect is
[inaudible] supervision data.
>>: [inaudible] how much your result will change?
>> Yanen Li: It will drop a little bit, but it is still a little better than Chakrabarti 2012.
>>: So therefore your regularization does not make much impact?
>> Yanen Li: Yes, because it's [inaudible] label data. Yeah. You sometimes will pay
twice by learning this [inaudible].
>>: So if you add the wiki redirect [inaudible]?
>> Yanen Li: So actually Chakrabarti et al. add the wiki redirects as a kind of feature.
Yeah. Yeah, it's an interesting question, whether to treat it as a feature rather than
as regularization.
And, okay, so these levels of query understanding are somewhat static. Right? We need to
know the whole query before we can do the query understanding.
However, in many scenarios, people also demand query understanding dynamically,
meaning doing query understanding given only a short prefix, which is what happens in
the process of query auto-completion.
So in the rest of my talk, I will focus on our solution to model the query auto-completion
process. And the query auto-completion is to try to predict the best query representation
given a short prefix.
A user usually goes through a series of keystrokes and interactions with the query
auto-completion engine before she finally lands on a clicked query.
Here by each column I mean a prefix together with its list of suggested queries. Right?
So here we have one, two, three, four columns of data.
Comparing the query auto-completion process to document retrieval,
there are several similarities between the two. First of all, the query in document
retrieval corresponds to the prefix in the QAC process, and the document in
document retrieval corresponds to the query in QAC.
And people usually employ some learning-to-rank method to train the relevance ranking
model.
However, there is a clear distinction between the two. In document retrieval, people
usually have third-party editorial judgments, multilevel third-party judgments.
In QAC, we usually don't have this kind of data. We can only rely on user clicks,
because the query auto-completion process is very personal: different editors would
give very different judgments, so we can only rely on the user clicks.
Most of the previous work focuses on relevance ranking for QAC, and it usually relies
only on the last column of data. There is some work in 2013 that tries to use all
columns of data, all keystrokes of data. However, in that work these columns are all
simulated except for the last one, because the earlier columns are not actually observed.
A natural question is, if we have the real data, can we do better? For this purpose,
we collected a new QAC log. We collected it from real user interactions at Yahoo!, and
this log is high resolution, meaning that we record each keystroke of the whole process,
and we record the cursor position, the suggestion list, the clicked queries, as well as
other user data. Notice that for each keystroke, the time resolution is as fine as
milliseconds.
Once we have this data, there are several uses for it. For one, we can improve
the QAC relevance ranking model. And we can also decipher interesting user behaviors
in the QAC process. Yeah, question.
>>: So are you actually fielding your methods here, or are you just using their current
suggested list?
>> Yanen Li: Uh...
>>: So the rank list that you're considering there for completions, are you actually
fielding your method so that it presents its own rank list, or are you just taking the
current ranking and saying can you rerank it better?
>> Yanen Li: We use the data in Yahoo!'s query [inaudible] engine. We have
[inaudible] different [inaudible] ranking models to see whether we can do better.
>>: So you're actually providing whole different sets of rankings -- you're fielding
different ones there -- or are they reranking the subset that was presented to the user?
What I'm trying to understand is there's a presentation bias. If you never show
something in the first place --
>> Yanen Li: I see --
>>: Because you'll never complete to it, which is why --
>> Yanen Li: Yeah, we stay with the suggestion list; we don't provide new suggestion
queries. So all these ten suggested queries are fixed. We just rerank the order.
>>: Okay. And you do that reranking where the user can actually click on it or post
[inaudible] after you've actually seen the click?
>> Yanen Li: After we actually see the clicks. So after collecting this data, we did
some interesting user behavior analysis. The first one is the vertical position bias
analysis. We did that both on the PC platform and on mobile devices.
Clearly we can see that there is a vertical position bias in the QAC process. Most
of the clicks are concentrated on the top two positions; in fact, they account for
more than 65 percent of clicks. Here we introduce a vertical position bias
assumption, which states that a query at a higher rank tends to attract more clicks
regardless of the relevance of the query to the prefix.
This position bias is similar to the vertical position bias in the document retrieval field.
>>: Wait. What is the task, the QAC?
>> Yanen Li: What?
>>: What is the task here?
>> Yanen Li: QAC, query auto-completion.
>>: So what does the relevance to the prefix mean?
>> Yanen Li: It means at what rank you predict this query to be -- how to rank the query
for the prefix.
>>: All the queries share this M prefix, right?
>> Yanen Li: Yes. So if you can see the query as a document, the prefix is a query so
it's similar to the document retrieval.
>>: Except that all the prefix here are actually the same.
>> Yanen Li: No. So each keystroke we have different prefix. Right?
>>: Right. But the [inaudible] list --
>> Yanen Li: Yeah, yeah, they share the same prefix. So in document retrieval the
scenario is the same, right? A list of documents corresponds to the same query. Yeah,
question. Yeah.
>>: [inaudible] you already have a ranking [inaudible] have a ranking.
>> Yanen Li: Yes.
>>: So suppose the [inaudible] more clicks. Did you do a random [inaudible]?
>> Yanen Li: Yeah. So later in this talk I will show you some [inaudible] in the random
bucket. The random bucket gives an unbiased estimation of the relevance ranking
model, because it [inaudible] reduces the vertical position bias.
>>: No, if you want to [inaudible] random data.
>> Yanen Li: Yeah. Of course, yeah. No, you could also model the vertical
position bias with a click model. Okay? Okay, let me move on.
So the implication of the vertical position bias for relevance ranking is that when you
observe a click at a lower position, you actually want to emphasize the clicks at that
position more in the future.
Another interesting user behavior we analyzed is the horizontal skipping behavior,
which means that users may skip relevant results even when these results are already
displayed.
For example, here the open eye means that the user actually examined the whole
column -- opened this column and started to examine it. The closed eye means that the
user skips the whole column and continues to type. So, for example, for the
Obama healthcare bill query, although the user finally clicks on this query, she skips
it most of the time even though the query is already displayed.
This happens very frequently -- actually, in more than 60 percent of all sessions.
And here we hypothesize a horizontal skipping bias assumption. It states that a query
will not receive any click if the user skips the suggestion list, regardless of the
relevance of the query to the prefix.
>>: I'm not following. Based on your example, it looks like basically all that matters is
precision [inaudible] and the user only has so much attention while typing.
>> Yanen Li: No, so --
>>: [inaudible] what's position one in the click?
>> Yanen Li: So this is one particular example. Of course there are other examples
where users click on two and three. But this example just illustrates a case where the
user doesn't examine the whole list -- when, for example, she types too fast -- so even
when a relevant result is displayed, there is no click observed. That is what the
horizontal skipping behavior states.
>>: So another way to say this is that in at least 60 percent of the sessions you see
that the users do not click on the desired suggestion immediately, as soon as it is
presented.
>> Yanen Li: Um-hmm. Yeah.
>>: So is that something that can be captured with a different discount [inaudible] you're
just arguing that the discount is steeper for this problem than for [inaudible]?
>> Yanen Li: So this kind of behavior, combined with the vertical position bias --
our assumption is that either the query is irrelevant or the query is not examined.
Sometimes it is displayed at a very deep position, or sometimes the user types too fast
so that she just doesn't look at it at all. So in those cases there is no click at all.
>>: [inaudible] user did not look at [inaudible] but I don't want to type too many
[inaudible] for the first one, first column [inaudible] three times to get these results
[inaudible] the last one, I've had people faster and get this ideal one to this first place and
would type that.
>> Yanen Li: Yeah. That relates to the utility of completing the whole task versus the
effort of actually [inaudible] examining and clicking on it. Right? So actually in our
modeling, in the [inaudible] model, we add some bias to capture this.
>>: In your log, do you [inaudible] up/down arrows, that kind of thing? You say you
look every keystroke. Do think include typing an up/down arrow, cap, space?
>> Yanen Li: No.
>>: No?
>> Yanen Li: No. Yeah.
>>: So one user is typing [inaudible]. The list that's presented is static for most of that.
>> Yanen Li: No, it's not static. It's changing.
>>: [inaudible] changing the first two columns are the same, right, on your -- up there.
As they type, most of the letters [inaudible] I suspect most of the columns don't change.
>> Yanen Li: No. Sometimes [inaudible] the top results will change
because they no longer match the prefix.
>>: Yeah. In this case. Would there be some utility to learning Obamacare glitches has
been at top for a few characters, has not been seen, maybe that should lower its likelihood
and push it down. And what if Obamacare healthcare bill is loaded up [inaudible]?
>> Yanen Li: I'm not sure what is your question.
>>: I'm wondering if this motivates some additional information, because if something is
at the top and yet it's not even clear --
>> Yanen Li: Yeah, of course there are -- I think there are several more
interpretations than this skipping behavior bias assumption, because the QAC
process is a very complex process. So we try to model what we argue are the most
important reasons why there is no click.
>>: Let me sort of refine the question which is you're kind of arguing that the user is not
looking at the results based on lack of clicks.
>> Yanen Li: Yeah. Right.
>>: But really they might be looking at the results in only a small portion, they might be
looking every time that there's a significant flash that results have changed and been
re-examined, which then you might be able to estimate by the amount of total change
[inaudible] or the even better refined thing [inaudible] just doing the eye tracking study
and looking at when is the user looking at the results of that rather than guessing based
on -- because it's important to know whether they're looking or not and how deep.
>> Yanen Li: Yeah, yeah, that's a good suggestion. Yeah. Okay. So here I try to argue
the implications of the horizontal skipping bias for relevance ranking.
In order to do that, we did an experiment on the real data. First we trained a
RankSVM model only on the last column, and then we trained it on all columns.
It turns out that training on all columns achieves no better than just training on the
last column. So our hypothesis is that we should train on the examined columns only,
because the examined columns provide more certain negative and positive examples than
the columns that are not examined; there is much more noise in those columns.
So our goal is to build a better relevance model by better modeling the
horizontal skipping behavior and the vertical position bias, because the whole click
model can be decomposed into three components.
Another question we want to ask is whether we can adopt the existing click models
from the document retrieval side, because there are already several click models such as
UBM, DBN, and BSS.
After investigation, our answer is no. For one, the horizontal skipping
behavior is unique to the QAC process, so we expect that these models will not give
very good results on the QAC process.
Second, these models are not content-aware, meaning that they cannot
handle unseen prefix-and-query pairs. However, in the QAC process the fraction of unseen
prefix-and-query pairs is very high: more than 67 percent on the PC platform and 60
percent on mobile phones.
>>: You said UBM, universal --
>> Yanen Li: UBM, yeah. The user browsing model.
>>: Browsing model. Okay. DBN?
>> Yanen Li: DBN, dynamic Bayesian network. Yeah. Okay. Cool.
And that's why we propose a new two-dimensional click model to model the QAC
process.
In this model, the first component is the H model, for the horizontal skipping behavior.
In the H model, H_i equals 1 means the user actually stops and tries to examine
column i, and H_i equals 0 means the user skips it and moves to the next column.
We model the probability that H_i equals 1 using a sigmoid function over a combination
of multiple features. There are a couple of interesting features, such as the typing
speed; the lesson there is that an expert user is not likely to examine many columns.
isWordBoundary is also an important feature, because we expect that people usually
examine the columns at word boundaries. And the current keystroke position also plays a
role, because we expect that people will not examine the column at the beginning of
typing.
The D model tries to model the vertical position bias. D_i equals j means the user has
examined down to depth j. We use a softmax function to model this distribution.
And we use the C model to model the intrinsic relevance between the query and the
prefix. C_ij equals 1 means there is a click at column i and row j.
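Here is a minimal sketch of the three components as I understand them; the feature vectors and weights are hypothetical, and this is an illustration rather than the exact TDCM implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_examine_column(h_features, w_h):
    """H model, P(H_i = 1): the user stops and examines column i.  h_features
    might hold typing speed, isWordBoundary, and keystroke position."""
    return sigmoid(np.dot(w_h, h_features))

def p_depth(depth_features, w_d):
    """D model, P(D_i = j): softmax over how deep the user examines the list."""
    scores = np.array([np.dot(w_d, f) for f in depth_features])
    e = np.exp(scores - scores.max())
    return e / e.sum()

def p_click_given_examined(c_features, w_c):
    """C model, P(C_ij = 1 | examined): intrinsic relevance of query j to the
    prefix of column i (query frequency, user history, geo/time signals, ...)."""
    return sigmoid(np.dot(w_c, c_features))
```

A click at column i, row j then requires all three events, which is what the case walk-through below illustrates.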
So let me walk through some cases where we don't receive a click and where we do.
In the first case, if the user skips here and H_i equals 0, there is no click observed
and she types another character, with probability P(H_i = 0).
In another case, the user actually opens and looks at this column, H_i equals 1, and she
reaches a depth of 2. So even though there is a relevant query here at depth 4, we don't
observe any click because the user hasn't examined down to this query.
In the third case, the user has opened this column and she has also reached the depth
where the query is, so she examines it. However, because the query is irrelevant, we
don't observe a click.
>>: So examine means what?
>> Yanen Li: Examine actually means two events. One is opening the column to examine
it, and the other is examining down to depth J. So the H --
>>: How do you know the depth?
>> Yanen Li: The depth is what we model. We model each depth, right? D_i equals J models
how deep the user goes. Right? And in this case we still don't observe any click.
And finally, only when the user opens the column to examine it and she reaches the depth
of a query that is relevant to the prefix do we finally observe a click. Right?
Which means that only when the user has examined the query and the query is relevant
do we observe a click. Right? And note that the D [inaudible] and the H are both
latent variables.
So in order to estimate the distributions of H and D as well as C, we need to sum over
all the possible values of H and D, and we can estimate them using the EM algorithm. I
will skip the mathematical details and instead explain more about the experiments.
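Putting the cases above together, the probability of observing a click at column i, row j factors as: the user stops to examine the column, examines at least down to row j, and finds the query relevant. A small sketch of that decomposition (my reconstruction; since H and D are latent, the EM procedure sums over them when fitting the parameters):

```python
def p_click(j, p_h, p_depth_i, p_rel):
    """p_h = P(H_i = 1); p_depth_i[d] = P(D_i = d); p_rel = P(C_ij = 1 | examined)."""
    p_reach_j = sum(p_depth_i[d] for d in range(j, len(p_depth_i)))   # P(D_i >= j)
    return p_h * p_reach_j * p_rel
```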
We did an evaluation on three sets of data. The first is on the PC side and the second
is on mobile devices. We also introduce a random bucket test dataset, in which we
shuffle the query list for each prefix, because the random bucket data provides an
unbiased evaluation of the relevance model: it removes the vertical position bias.
So we use the MRR as the major metric and we measure the MRR across all columns.
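As a concrete reading of the metric, here is a sketch based on the description; how a column whose eventual clicked query is not displayed should be counted is my assumption here.

```python
def mrr_at_all(sessions):
    """sessions: list of (columns, clicked_query); columns is the list of ranked
    suggestion lists shown at each keystroke, clicked_query is the session's
    final click.  Reciprocal rank is taken per column and averaged over all
    columns; columns where the clicked query is not displayed count as 0 here."""
    rr = []
    for columns, clicked in sessions:
        for ranked in columns:
            rr.append(1.0 / (ranked.index(clicked) + 1) if clicked in ranked else 0.0)
    return sum(rr) / len(rr) if rr else 0.0
```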
And we include two sets of baselines. The first set is the non-content-aware models.
Right? And the second baseline is the BSS model, which is a content-aware model. Our
model, TDCM, is also a content-aware model because it can handle unseen prefix-and-query
pairs.
For the results we have several observations. First of all, the non-content-aware
models are not very effective at modeling all columns of data, as indicated by
performance that is actually lower than MPC. MPC just looks at the global query counts.
These models are not able to use content-awareness, and a large portion of the
prefix-and-query pairs are unseen.
Our second observation is that the BSS model sometimes does a rather good job because it
is a content-aware model. However, its performance is not stable: sometimes it's better
than the MPC baseline, sometimes it's lower, indicating that it's not effective at
modeling all columns of data.
And finally, by better modeling the unseen data as well as better modeling the whole QAC
process, our model achieves the best results.
Question?
>>: So the MRR@All, this is over all columns of data, that you take MRR for each of
them and --
>> Yanen Li: Yeah, MRR@All means we measure each column's MRR and then take
the average.
>>: Does that give higher weight to longer prefixes that were typed before completing?
>> Yanen Li: This is...
>>: You're going to represent more of that average, right?
>> Yanen Li: Yeah, you're right, but the [inaudible] because we do the same thing for all
methods. So the relative performance will be different.
>>: Right. But we're essentially seeing things weighted towards methods that perform
well --
>> Yanen Li: Yeah, you're right. So at the first one or two columns, the MRR value
would be lower, right?
>>: Right.
>> Yanen Li: Because it's much harder.
>>: And also the methods that don't take into account context might perform better on
the shorter ones where there's less [inaudible].
>> Yanen Li: I think so. To take advantage of all columns, you have to do a good
job even on these early columns, right? Because if you only look at the last column,
it's not likely you can predict very well on the first one or two columns.
Okay? So in order to validate the H model, we also did another experiment. We tried to
leverage the user behavior to enhance other [inaudible] methods. In this experiment,
we trained the RankSVM on all columns, on the last column only, and on the viewed
columns. The viewed columns means the columns that we are quite sure the user has
examined.
It turns out that modeling all columns is actually worse than modeling the last
column, right, if you only use the RankSVM and train [inaudible]. However,
by better modeling the horizontal behavior, the data is cleaner so that it
can provide more valuable information. Right? So the RankSVM does a little
bit better than when training on the last column.
>>: Do you have the sense that training on all the viewed columns gets a better result
because there are more -- essentially more instances of data in the training set --
>> Yanen Li: No, that's not necessarily the case, because training on the first one or
two columns is quite noisy. So it's --
>>: Compared to the last column.
>> Yanen Li: Hmm?
>>: Between training [inaudible] last column versus training on the viewed columns.
Training on the viewed columns essentially has more training data, right?
>> Yanen Li: Yeah. But -- yeah, you're right [inaudible]. The training data should also
be clean. If you provide all noisy data, it will not help you much.
>>: I understand. Do you know how much training data we're talking about for these
experiments?
>> Yanen Li: So we have [inaudible].
>>: [inaudible] training data?
>> Yanen Li: [inaudible] 50,000 to 100,000 pairs.
>>: [inaudible] columns there are [inaudible]?
>> Yanen Li: Yes. We also have the average number of columns for each session; I didn't
provide it here. For the iPhone data, the average is about 9 columns, and for the PC
about 13 to 15.
>>: Okay. So that's the number of -- that's all?
>> Yanen Li: Yeah.
>>: [inaudible] viewed, is that two columns that are viewed per session --
>> Yanen Li: It depends on [inaudible].
>>: And what's average?
>> Yanen Li: I don't have the number here, but I guess it's more than two. Yeah.
>>: What does this average prefix mean on the table?
>> Yanen Li: Average prefix means we do the MRR on all columns and we take
average.
>>: So the MRR for a column where there are no clicks, it is going to be 0?
>> Yanen Li: No. If the clicked query is displayed in that column, we consider it as
[inaudible].
>>: Okay.
>> Yanen Li: Okay. So it's also interesting to look at the learned user behavior through
the feature weights. For example, in the H model, the learned feature weight for the
typing speed is [inaudible] proportional to the probability that H equals 1.
Right? This matches our intuition, because an expert user is not likely to examine the
whole column.
Also, the word boundary as well as the current position are important features, meaning
that people tend to start examining the column at word boundaries as well as at later
columns.
For the D model, we found that the first three positions take most of the
examination probability, which also matches our observation of the vertical position bias.
And, third, for the relevance model, we find that the query history frequency is important.
>>: [inaudible].
>> Yanen Li: No, the MPC [inaudible].
>>: What is that?
>> Yanen Li: You mean the -- the MPC is the global query frequency.
>>: Okay.
>> Yanen Li: And this one is the query count in your past history, meaning that users
often use the QAC as a kind of query storage. And also the GeoSense, the
geolocation-related query counts, as well as the time-related query counts, play a role
in the relevance model. So -- yeah.
>>: [inaudible] features [inaudible] understand the meanings [inaudible].
>> Yanen Li: Understand the meaning of this --
>>: [inaudible] entertainment people will click high and for politics, political
[inaudible] queries [inaudible].
>> Yanen Li: So what's your question exactly?
>>: For given queries, you will have different intent.
>> Yanen Li: You're right. So, yeah, we also have a feature related to the query intent,
for example, the [inaudible] query and the informational query, which I didn't show
here; it's in a later part. So it also plays a role here. In my backup slides, I
actually have a slide showing all kinds of features for each model.
Let me draw a brief conclusion for this part. We first collected the first set
of high-resolution query logs specifically for the query auto-completion process.
After collecting this new dataset, we analyzed two important and interesting user
behaviors, namely the vertical position bias as well as the horizontal skipping bias,
and we've shown that they have great implications for relevance ranking.
We then proposed a novel two-dimensional click model to model these two interesting
user behaviors, and the resulting relevance model actually outperforms the existing
click models.
Okay, so let me come back to our roadmap. In my Ph.D. study I have addressed key
questions in each level of query understanding. Our solutions frequently leverage the
frequently occurring patterns in large amounts of data, so our methods are generally
applicable to all types of queries.
Second, we usually use unsupervised or semi-supervised models, so we require minimal
human effort. Thus, our methods could potentially scale up to millions of queries.
In the future, in the query understanding direction, I would like to explore the
following interesting research directions. First of all, I would like to investigate
other important contexts such as location as well as the temporal context. You know
that those are very important contexts that trigger very interesting research questions.
Secondly, notice that my current research agenda focuses on single queries. However, I
want to expand my research to a scenario that combines multiple queries, which
is the task query scenario.
And last but not least, I also want to explore the interesting field of mobile search.
In mobile search, the query's characteristics can be very different from the PC side,
and there could be many interesting applications for query understanding.
And, finally, I also want to apply my research to a broader field of data mining and
recommender systems as well as biomedical informatics where I have also done some
work.
And, finally, I will stop here and try to take more of your questions.
[applause].
>> Kuansan Wang: We have ten minutes.
>>: For your last work, how is this fundamentally different from Web result ranking?
>> Yanen Li: With what?
>>: The search results with ranking. We have a ranking query auto-completion.
>> Yanen Li: Yeah.
>>: How is this fundamentally different from Web result --
>> Yanen Li: Yeah. The fundamental difference is that QAC is very personal. We don't
have the third-party relevance judgments. Right? So we can only rely on implicit
feedback.
>>: Yeah, for query [inaudible], you click it.
>> Yanen Li: Yeah, but it's not -- it's not the major [inaudible].
>>: I think it would be very good to work on personalization, ranking for
personalization, which is very personal.
>> Yanen Li: Yeah. You're right. Yeah, it's more related to the personalized search.
>>: In your [inaudible] work to improve ranking, you didn't mention -- didn't tell us
exactly how you measure the relevance between the segmentation and [inaudible].
>> Yanen Li: Yes. Yeah, good question. So in that part we have a document language
model; specifically we use a bigram generative [inaudible] model to score the different
segmentations. Different segmentations mean we have different chunks, right, and for
each chunk we use the document language model to score it.
>>: The language model is based on all the documents or --
>> Yanen Li: Based on -- this is a [inaudible] language model, based first on this
document only, and we [inaudible] our documents.
>> Kuansan Wang: Any other questions? All right. Let's thank the speaker.
[applause].
>> Yanen Li: Thanks very much.