>> Kuansan Wang: All right. Let's get started. It is my greatest pleasure to welcome Yanen Li from UIUC to come here and visit us. Yanen is a Ph.D. student in the computer science department under the supervision of Professor ChengXiang Zhai. We're all good friends. He has been a great intern, doing internships multiple times with us on query understanding. He has contributed in many areas, and he also participated in the first Bing and Microsoft Research Speller Challenge, where he won second place. His work is so impressive that we couldn't wait for him to get his degree, and we brought him back for an interview. Today he's going to tell us about his Ph.D. work. I'll let you take over. Welcome. >> Yanen Li: Thank you very much, Kuansan, for the warm welcome. And good afternoon, everyone. Today I'm very happy to share with you my Ph.D. studies on the topic of multilevel query understanding. Okay. Now, let me get started. Nowadays we are entering the big data age. Every second, we are generating a large amount of data on the Web, and search engines remain the most effective tools for managing such a huge amount of data. This magic box is everywhere, from Web document search to product search to people search. The reason why search engines remain the dominant tool for managing such a huge amount of data is that it's actually not the volume that matters; it is the small amount of relevant data that greatly influences people's decision-making. And the most effective way of getting this relevant data is search. In the whole search process, the query is the critical bridge connecting the users on one side and the information on the Web on the other side. So understanding a query is a critical task for improving search accuracy as well as the user's experience. Here by query understanding I mean not only what the user says but also what she wants. For example, for the query how old is Obama, it's much more desirable to directly give the answer 52 years than to just return a set of Web pages. However, understanding a Web search query is a long-standing and non-trivial task. Here we introduce our framework, in which the whole query understanding task is broken into multiple levels of query interpretation and representation. In this framework, the first level of query understanding is called lexical query understanding. The task at this level is to try to reduce the gap between the user's issued query and the ideal query in her mind. For example, for the query Louis Vuitton store in Las Vegas, people sometimes issue a more ill-formed query because of query spelling errors. So this task is to transform the ill-formed query into a well-formed one. Once we get the well-formed queries, in the next level, the syntactic level of query understanding, we try to make sense of the structure of the query, which means we try to break the query at its semantic boundaries into units called query segments. Once we can do that, the query can be represented by a bag of concepts or a bag of phrases instead of a bag of words. By doing that, we can support more complex retrieval models such as concept weighting and phrase-based translation models. >>: Quick question. 
Do you find that the segmentation tends to result naturally in what we would consider linguistic [inaudible] Louis Vuitton, or for particular tasks do they ->> Yanen Li: We're in the particular context of relevance, right, particularly in retrieval, in building better retrieval models, right? >>: Yeah. And do those tend to correspond to things that we would consider entities and ->>: Yes, sometimes. Yes, yes, sometimes. Yes, [inaudible]. And then there is the next level, the semantic level of query understanding. Once we get the phrase boundaries of the query, we also try to get the entities and relations in the query. For example, here, we know that Louis Vuitton and Las Vegas are both entities. Louis Vuitton is a famous luxury brand and Las Vegas is a famous city in Nevada. Once we can get such a query representation, we can support many interesting applications. For example, we can enrich the search result with entity attributes and related entities. We can also support entity search, where we return a list of relevant entities instead of a list of relevant pages. And we can also do direct question answering, where we directly give the answers to users if we know exactly what they want. So there are a couple of research questions here. First, how can we do it in a general way so that we can apply our solution to all kinds of queries? Second, how can we do it with minimal human effort so that we can save money and annotation effort? And, third, can we do it in real time when the users need it? So specifically we are interested in the following questions. In lexical query understanding, we are interested in the problem of query spelling correction. In query syntactic understanding, we are more focused on query segmentation. And in query semantic parsing, we focus on the particular problem of entity synonym mining, which is a critical component of query semantic parsing. Notice that these levels of query understanding are somewhat static, meaning that we have to know the query as a whole in advance. However, we are also interested in dynamic query understanding, where the user just gives a little hint such as a very short prefix -- specifically, the query auto-completion task. So during my Ph.D. studies, I have addressed key questions in each level of the query understanding framework. For the lexical understanding, we address it by modeling multiple types of query spelling errors. On the syntactic level, we address the query segmentation problem by utilizing click-through data. On the semantic level of query understanding, we mine entity synonyms with a clustering framework. And, finally, in dynamic query understanding, we model query auto-completion with a two-dimensional click model. In this talk, I will very briefly overview the work we have done on the first three levels of query understanding, and in the rest of my talk I will focus on the dynamic query understanding part. Okay. So let me first introduce our work on modeling multiple types of query spelling errors. As I mentioned, the query spelling correction problem is to try to reduce the gap between the ill-formed query and the ideal query in the user's mind. 
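As general background (this is the standard noisy-channel view of spelling correction, not a formula taken from the talk), the goal can be written as finding the correction c that maximizes the posterior given the observed query q:

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c \mid q) \;=\; \arg\max_{c}\; P(q \mid c)\, P(c)
```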
So although a lot of effort has been made on this problem, there are still a couple of research challenges, among which a critical one is how to model the complex types of errors. For example, here I show several types of errors. Some just happen within a single word; some cross multiple words. And such errors could happen simultaneously in the same query. This kind of complex error significantly hurts the user's experience. People have addressed the query spelling correction problem with a standard hidden Markov model. In this setting, the input is the query and the output is the most likely sequence of states. However, in our work, we add another layer, a state-type layer, into the HMM framework so that we can actually correct such complex types of errors. I will walk through some examples to illustrate the idea. For a single-word error, which is the easy case, the HMM process is like this: we first move to a state of the in-word transformation type, and then we emit the misspelled query word. Then we transit to another state and emit another word, and so on and so forth. You can see that there is a 1-to-1 correspondence between the state and the query word. However, in a more complex case we have a splitting error, which means that, for example, the word homepage is split into two words. Suppose the state sequence is now at this point, and the model transits to a state with type merging. In this state, it will merge multiple words, emit the misspelled sequence of words, and then go to another state. In this way we can correct this type of spelling error. In another case, the concatenation error, multiple words are concatenated into a single word. In this case, suppose we are now here, and the model transits to a state with type splitting. In this state, it will split the word into multiple words and emit the misspelled word. So you can see that once we get this type sequence, it helps to generate the correct emissions. However, this type sequence is latent, meaning that not only do we need to infer the hidden state sequence, we also need to infer the hidden state-type sequence. This is a typical case of structured learning with latent variables, and in this work we adopt Collins' 2002 method with some modifications. I will skip the detailed mathematical part and instead show some results. We compared our method to a state-of-the-art model for query spelling correction, and the results show that on two sets of data our model significantly outperforms the baseline. For the progress of query spelling correction, industry has actually put in a lot of effort. Back in 2011, MSR and Bing launched a Speller Challenge to improve Bing's speller. We took part in the challenge and got second place.
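To make the state-type idea more concrete, here is a small sketch in Python. It is only an illustration under toy assumptions (a four-word lexicon and similarity-only scoring), not the discriminatively trained, perceptron-style model described in the talk:

```python
# A minimal, illustrative sketch of decoding with a state-"type" layer: at each
# step the decoder may apply an in-word correction, merge two query tokens into
# one word (to undo a splitting error), or split one token into two words (to
# undo a concatenation error). This is NOT the model from the talk: it scores
# hypotheses only by string similarity against a toy lexicon, whereas the real
# system learns weights over latent state and state-type sequences.
import math
from difflib import SequenceMatcher

VOCAB = ["homepage", "best", "buy", "web"]          # toy lexicon

def sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def expand(tokens, i):
    """Yield (state_type, tokens_consumed, corrected_words) hypotheses."""
    for w in VOCAB:                                  # in-word: 1 token -> 1 word
        yield "in-word", 1, (w,)
    if i + 1 < len(tokens):                          # merge: 2 tokens -> 1 word
        for w in VOCAB:
            yield "merge", 2, (w,)
    for w1 in VOCAB:                                 # split: 1 token -> 2 words
        for w2 in VOCAB:
            yield "split", 1, (w1, w2)

def best_correction(tokens, i=0):
    """Exhaustively search for the best-scoring correction of tokens[i:]."""
    if i == len(tokens):
        return 0.0, (), ()
    best = None
    for typ, consumed, words in expand(tokens, i):
        span = "".join(tokens[i:i + consumed])
        score = math.log(sim(span, "".join(words)) + 1e-9)
        rest_score, rest_words, rest_types = best_correction(tokens, i + consumed)
        cand = (score + rest_score, words + rest_words, (typ,) + rest_types)
        if best is None or cand[0] > best[0]:
            best = cand
    return best

print(best_correction(["home", "page"]))   # -> ('homepage',) via a 'merge' state
print(best_correction(["bestbuy"]))        # -> ('best', 'buy') via a 'split' state
```

The point is only that allowing merge and split operations in the hypothesis space lets a single decoder recover from splitting and concatenation errors as well as in-word typos.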
Okay. Now let me move to the second part of my talk. Once we get the well-formed queries, at the syntactic level of query understanding we try to divide the query into a bag of concepts, a bag of phrases. For this purpose, we have done query segmentation by exploiting the click-through documents. The task of query segmentation is clear: it is to insert brackets into the query so that it can be represented by a bag of concepts. Our solution is to try to exploit the click-throughs. The key idea here is this observation: in order to be a valid segment, it should appear frequently both on the query side and on the clicked-document side. Our method is a generative model built to leverage this observation. The steps for generating a query given a clicked document and the parameters are as follows. We first generate the query length from the length distribution. Then we select a query partition B, which is the empty bracketing, based on the bracket distribution. Then we fill in the actual query words based on the bracketing and the document language model as well as our segment unigram model. The segment unigram model contains the parameters we want to estimate, and by maximizing the likelihood of a set of queries we can iteratively update this model and reach a local optimum. >>: Basically what's B and Q here? >> Yanen Li: B is the clicked document, Q is the actual query. So we try to generate a query whose segments are consistent with our segment unigram model as well as the document language model. >>: [inaudible] click-through data, is the data provided here [inaudible] based on the title or based on the ->> Yanen Li: The title. >>: The title. Not ->> Yanen Li: Not the actual description. Yeah. >>: Does the document language model [inaudible] phrase boundaries? >> Yanen Li: Yes, it's a bigram language model. Okay. Here let me show you some qualitative results. We tested our model on two datasets, and we can find some interesting results. We can actually get event names, movie names, some entity names, as well as time expressions. You can see that some segments span multiple words. We have also conducted a quantitative evaluation, especially compared to the Tan 2008 method, which is a state-of-the-art query segmentation model that only utilizes information on the query side. Clearly our model outperforms the baseline, indicating that leveraging the click-through data could actually help query segmentation. >>: Could you say something a little bit more about what the data is and where the segmentation comes from as far as [inaudible]? >> Yanen Li: Okay. The data actually comes from this paper. It's a standard dataset for evaluating query segmentation. It contains 500 queries. >>: 500 queries over what, and who provided the gold standard? >> Yanen Li: Yeah, the group behind the paper provided the gold standard; they actually published the queries and the corresponding segmentations in the paper.
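As a toy illustration of the generative story just described (sample a length, sample a bracketing, fill in the segments), here is a hedged Python sketch; the distributions and vocabulary are invented, and the constraint from the clicked document's language model is omitted for brevity:

```python
# Toy sketch of the generative process: length -> bracketing -> segments.
# In the real model the segment unigram model is the quantity being estimated
# (by maximizing the likelihood of observed query / clicked-document pairs),
# and the fill-in step is also conditioned on the document language model.
import random

LENGTH_DIST = {2: 0.3, 3: 0.3, 4: 0.4}                       # P(query length)
BRACKETINGS = {                                               # P(segment sizes | length)
    2: {(2,): 0.6, (1, 1): 0.4},
    3: {(3,): 0.5, (2, 1): 0.3, (1, 1, 1): 0.2},
    4: {(3, 1): 0.5, (2, 2): 0.3, (1, 1, 1, 1): 0.2},
}
SEGMENT_MODEL = {                                             # toy segment unigram model, by size
    1: {"online": 0.5, "hours": 0.5},
    2: {"las vegas": 0.6, "new york": 0.4},
    3: {"bank of america": 0.7, "louis vuitton store": 0.3},
}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_query():
    length = sample(LENGTH_DIST)
    bracketing = sample(BRACKETINGS[length])                  # e.g. (3, 1)
    segments = [sample(SEGMENT_MODEL[size]) for size in bracketing]
    return segments

print(generate_query())   # e.g. ['bank of america', 'online']
```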
Okay. So after getting the query segmentation results, another interesting question we want to ask is whether better segmentation will lead to better search models. In order to do that, we need to handle the segmentation uncertainty. For example, for the query Bank of America Online, suppose we can get two possible segmentations with these probabilities. And suppose we have a document; we then build a new language model with the query segmentation to score the document. The procedure for scoring a document is quite intuitive. The score of a query and a document consists of a sum of two parts. The first part is the probability of the segmentation. The second part is the relevance score between this segmentation and the document. We sum over all the possible segmentations to get the final score. Here I skip most of the details of the mathematical modeling, and I will show some -- sure. >>: Can I ask a question? The previous slide, do you segment the documents the same way you ->> Yanen Li: No, we don't segment the document. We try all the possible ways to score this as one. >>: What other -- what more possible ways [inaudible]? >> Yanen Li: No, it's the bigram language model. >>: Bigram language model treating the query segment as a unigram [inaudible]? >> Yanen Li: Yes. So treating this as one unit and this as another unit. >>: I see. >> Yanen Li: So, yeah, we actually did a set of experiments on 12,000 queries, and we have two observations. One is that our model outperforms BM25, the unigram language model, as well as the bigram model. >>: And when you do BM25, do you use the segments, or was it no ->> Yanen Li: No. >>: Just use words. >> Yanen Li: Just use words, yeah. >>: So you consider that's a real comparison? >> Yanen Li: I think it is, because BM25 is a standard retrieval model. >>: But BM25 says nothing about the [inaudible]. >> Yanen Li: Yes, of course, you can extend to a more advanced version of BM25, for example BM25 considering word proximity. But here we didn't do this comparison, yeah. The second observation is that when the query length gets bigger, our performance improvement also gets bigger, which means that when the query gets longer, we have more opportunity to leverage the query segmentation so that we can get a better result. Yeah. >>: So you do [inaudible] relevance query [inaudible] to the unigram. So how do you [inaudible]? >>: So, yeah, actually we can take this question offline because of the time. >>: [inaudible]. >> Yanen Li: Yeah. >>: [inaudible]. >> Yanen Li: This one? >>: Yeah. So here you assume each document and query are generated from the same prior distribution? Is that correct? >> Yanen Li: Actually the document is not generated based on another distribution; it is given independently. The query is generated based on the bracket distribution as well as the document language model distribution. >>: So basically your training data is a query as well as ->> Yanen Li: Yeah. >>: So it's already given [inaudible]. >> Yanen Li: Okay, yeah.
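A hedged sketch of the scoring rule just discussed -- summing, over candidate segmentations, the segmentation probability times a relevance score for that segmentation against the document. The probabilities, the example document, and the crude smoothed-count relevance function are all placeholders, not the exact language-model formulation from the talk:

```python
# Score a document under segmentation uncertainty:
#   score(Q, D) = sum over segmentations S of P(S) * relevance(S, D).
import math

def relevance(segmentation, doc_text, mu=0.1):
    """Toy per-segment score: smoothed count of the segment string in the document."""
    score = 0.0
    for seg in segmentation:
        count = doc_text.count(seg)
        score += math.log(count + mu)        # smoothing so unseen segments are not -inf
    return score

def score(query_segmentations, doc_text):
    """query_segmentations: list of (probability, tuple_of_segments)."""
    return sum(p * relevance(segs, doc_text) for p, segs in query_segmentations)

segmentations = [
    (0.7, ("bank of america", "online")),
    (0.3, ("bank", "of america online")),
]
doc = "bank of america online banking sign in to view your accounts"
print(score(segmentations, doc))
```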
Let me move to the other part so I can try to answer more of your questions at the end. So once we get these concept boundaries, we move to the semantic level of query understanding. In this part of the talk I will introduce our work on mining entity synonyms with a clustering framework, which is a critical task for parsing the semantics of the query. In query semantic annotation, we try to recognize the entities and relations [inaudible]. If all the entities are written in their standard form, the challenge is actually not that big, because once we get the phrase boundaries out of query segmentation, we can just do a dictionary lookup. However, a major challenge here is that people usually don't write the standard form of the entities in their queries. There are a lot of variations, alternative surface forms. For example, for shoe brands as well as movie titles, we observe a lot of such variant forms of the same entity. Also, in previous work, people do entity synonym mining in a way that processes a single input at a time, which means they get the synonyms for one input and then process another one. However, in our work we change the learning protocol a bit: we do joint learning over a set of entities in the same category. There are a couple of benefits to doing that. The first benefit is that we can utilize the mutual exclusion among the entities to do a better job. For example, for the entity the lord of the rings the return of the king, there are two close candidates: one is lotr 2, the other is lotr 3. So it's easy to get such an incorrect synonym. However, if we add another input, the lord of the rings the two towers, it becomes easier to get the correct answer because of this mutual exclusion. The other benefit is that we can actually discover categorical patterns. Categorical patterns means the left and right contexts shared by a set of entities. Here I show the left and right contexts in the shoe brand domain. This kind of categorical pattern is very useful for disambiguation, because, for example, we know that diesel fuel is not likely to be a synonym of Diesel as the shoe brand, because they have very different categorical patterns. >>: But -- sorry, I'm not understanding the disambiguation challenge here. When somebody types diesel fuel or diesel brand, it's already disambiguated. If they type just diesel, you have no context to disambiguate. So how do you envision using this for disambiguation? >> Yanen Li: Oh, actually I think one candidate is diesel fuel, another is just diesel. Because our input is a set of entities in the same category, we already know that diesel is from the shoe brand category. So we don't want diesel fuel as a synonym of Diesel the shoe brand. So we try to solve this problem with a clustering framework with weak supervision. This is our objective function. There are two major parts to it. The first part is the overall in-cluster dispersion; we try to make all the objects in one cluster closer to each other. The other part is the wiki redirects, which we use as a regularization. The idea is that we want points that have a wiki redirect between them to be closer to each other. And here the distance function is a weighted combination of different semantic similarity features; the categorical similarity is one of them. Let me explain a little bit about the clustering procedure. First of all, we initialize the cluster centers with the canonical value of each entity, because we believe that the synonym center should not deviate far from the canonical value of the entity. Then we update the candidate cluster assignments using the initial parameters. After that, we adjust the centers. The way we adjust a center is quite interesting: we combine the top K candidates as well as the canonical value to form a new center. Sometimes, for example, lotr 3 is actually more popular than the canonical value because it's short, so people tend to type this short form more than the standard form. So by including the top K candidates, we increase the robustness of the solution. Then we update the feature weights by two criteria: one is that we try to make points in the same cluster closer to each other, and the second is that points having a wiki redirect should also be closer. We iterate these steps, and finally the procedure converges to a local optimum.
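Here is a simplified, hedged sketch of that iteration in Python. The string-similarity measure, the data, and the stopping rule are stand-ins, and the feature-weight update and wiki-redirect supervision are omitted; it only illustrates the assign / adjust-center loop:

```python
# Simplified clustering loop: centers start at the canonical names, candidates
# are assigned to the nearest center, and each center is then adjusted toward a
# mix of the canonical name and the top-K assigned candidates. The real system
# additionally re-estimates feature weights each round using the in-cluster
# dispersion and wiki-redirect regularization.
from difflib import SequenceMatcher

def sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

def cluster_synonyms(canonical_names, candidates, top_k=2, iters=5):
    centers = {name: [name] for name in canonical_names}   # center = representative strings
    for _ in range(iters):
        # assignment step: each candidate goes to its closest center
        assign = {name: [] for name in canonical_names}
        for cand in candidates:
            best = max(canonical_names,
                       key=lambda n: max(sim(cand, rep) for rep in centers[n]))
            assign[best].append(cand)
        # center adjustment: canonical name plus the top-K closest candidates
        for name in canonical_names:
            ranked = sorted(assign[name], key=lambda c: -sim(c, name))
            centers[name] = [name] + ranked[:top_k]
    return assign

entities = ["the lord of the rings the return of the king",
            "the lord of the rings the two towers"]
cands = ["lotr 3", "lotr 2", "return of the king", "the two towers movie"]
print(cluster_synonyms(entities, cands))
```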
>>: Excuse me. Can you go back to the previous slide? What are the parameters you can tweak? >> Yanen Li: The parameter is -- okay. Sure. >>: So I'm lost obviously. >> Yanen Li: Yeah. So the d(x_i, z_k) here is a weighted combination of these features, so we have the feature weights. >>: Oh, I see. >> Yanen Li: Right? >>: Yeah. Okay. >> Yanen Li: And also, because this is a clustering framework, the r_ik, the candidate cluster assignments, also need to be updated. >>: I see. So [inaudible]. >> Yanen Li: So we compare our method to several baselines, including the individual features as well as Chakrabarti 2012. That work was actually done at MSR. They use a combination of multiple features, using [inaudible] feature weights. And we have some observations. First of all, we see that using a single feature is not quite effective. Secondly, Chakrabarti's work can do a much better job than the single features because they combine different features quite effectively. Notice that their method has very high precision but relatively low recall, and our method can achieve a better balance between precision and recall. So we have the best F1 score. >>: So you have a wiki redirection term [inaudible]. >> Yanen Li: Yes. >>: How important is that? >> Yanen Li: Um... >>: First, what is the reason you have that? What are you trying to regularize? >> Yanen Li: Regularize -- so, as I said, the intuition is that entities having a wiki redirect should somehow be synonyms. But this relation is [inaudible]. Right? So we treat this as [inaudible] label data. >>: I see. >> Yanen Li: But we don't have any supervision. This kind of wiki redirect is [inaudible] supervision data. >>: [inaudible] how much your result will change? >> Yanen Li: It will drop a little bit, but it is still a little better than Chakrabarti 2012. >>: So therefore your regularization does not make much impact? >> Yanen Li: Yes, because it's [inaudible] label data. Yeah. You sometimes will pay twice by learning this [inaudible]. >>: So if you add the wiki redirect [inaudible]? >> Yanen Li: So actually Chakrabarti, they add the wiki as a kind of feature. Yeah. Yeah, it's an interesting question -- to treat it as a feature rather than as the regularization. Okay, so this level of query understanding is somewhat static, right? We need to know the whole query before we can do the query understanding. However, in many scenarios, people also demand query understanding dynamically, meaning we try to do query understanding given only a short prefix, which is the case in query auto-completion. So in the rest of my talk, I will focus on our solution for modeling the query auto-completion process. Query auto-completion tries to predict the best query completion given a short prefix. A user usually goes through a series of keystrokes and interactions with the query auto-completion engine before she finally lands on a clicked query. Here by each column I mean a prefix together with its list of suggested queries. So here we have one, two, three, four columns of data. Comparing the query auto-completion process to document retrieval, there are several similarities between the two. First, the query in document retrieval corresponds to the prefix in the QAC process, and the document in document retrieval corresponds to the query in QAC. 
And people usually employ some learning-to-rank method to train the relevance ranking model. However, there is a clear distinction between the two. In document retrieval, people usually have third-party editorial judgments, multilevel third-party judgments. In QAC, we usually don't have this kind of data; we can only rely on the user clicks, because the query auto-completion process is very personal. Different editors would give very different judgments, so we can only rely on the user clicks. Most of the previous work focuses on relevance ranking for QAC, and that work usually relies on the last-column data only. There is some work in 2013 that tried to use all columns of data, all keystrokes of data. However, in that work these columns are all simulated except for the last one, because we don't actually observe those columns. A natural question is, if we have the real data, can we do better? For this purpose, we actually collected a new QAC log. We collected it from real user interactions at Yahoo!, and this log is of high resolution, meaning that we record each keystroke of the whole process, and we record the current prefix, the suggestion list, and the clicked queries, as well as other user data. And notice that for each keystroke, the time resolution is as high as milliseconds. Once we have this data, there are several uses for it. For one, we can improve the QAC relevance ranking model. We can also decipher interesting user behaviors in the QAC process. Yeah, question. >>: So are you actually fielding your methods here, or are you just using their current suggested list? >> Yanen Li: Uh... >>: So the ranked list that you're considering there for completions, are you actually fielding your method that presents its own ranked list, or are you just taking the current ranking and saying can you rerank it better? >> Yanen Li: We use the data in Yahoo!'s query [inaudible] engine. We have [inaudible] different [inaudible] ranking model to try to know whether we can do better. >>: So you're actually providing whole different sets of rankings, you're finding different ones there, whether their reranking subset was presented to the user. What I'm trying to understand is there's a presentation bias. If you never show them something in the first place ->> Yanen Li: I see ->>: Because you'll never complete to it, which is why ->> Yanen Li: Yeah, we stay with the suggestion list; we don't provide new suggested queries. So all these ten suggested queries are fixed. We just re-rank the order. >>: Okay. And you do that reranking where the user can actually click on it, or post [inaudible] after you've actually seen the click? >> Yanen Li: After we actually see the clicks. So after collecting this data, we actually did some interesting user behavior analysis. The first one is the vertical position bias analysis. We did that both on the PC platform as well as on mobile devices. Clearly we can see that there is a vertical position bias in the QAC process. Most of the clicks are concentrated on the top two positions; in fact, they occupy more than 65 percent of clicks. Here we introduce a vertical position bias assumption, which states that a query at a higher rank tends to attract more clicks regardless of the relevance of the query to the prefix. This position bias is similar to the vertical position bias in the document retrieval field. 
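As an illustration of this kind of analysis (not the actual log schema or numbers), a minimal Python sketch that computes the share of clicks per rank and the fraction landing in the top two positions from a list of clicked ranks:

```python
# Toy vertical-position-bias analysis: given clicked ranks observed in a QAC
# log, compute the click share at each rank and the top-2 share. The sample
# ranks below are invented for illustration.
from collections import Counter

clicked_ranks = [1, 1, 2, 1, 3, 1, 2, 1, 2, 5, 1, 2, 1, 10, 1]   # toy sample

def click_share_by_rank(ranks):
    counts = Counter(ranks)
    total = len(ranks)
    return {rank: counts[rank] / total for rank in sorted(counts)}

share = click_share_by_rank(clicked_ranks)
print(share)
print("top-2 share:", sum(v for r, v in share.items() if r <= 2))
```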
>>: Wait. What is the test, the QAC? >> Yanen Li: What? >>: What is the test here? >> Yanen Li: QAC, query auto-completion. >>: So what does the relevance to prefix mean? >> Yanen Li: Means what the rank you predict this query to be, how to rank the query to the prefix. >>: All the queries share this M prefix, right? >> Yanen Li: Yes. So if you can see the query as a document, the prefix is a query so it's similar to the document retrieval. >>: Except that all the prefix here are actually the same. >> Yanen Li: No. So each keystroke we have different prefix. Right? >>: Right. But the [inaudible] list ->> Yanen Li: Yeah, yeah, share the same prefix. So in the document retrieval, the scenario is the same, right? A list of document correspond to the same query. Yeah, question. Yeah. >>: [inaudible] you already have a ranking [inaudible] have a ranking. >> Yanen Li: Yes. >>: So suppose the [inaudible] more clicks. Did you do a random [inaudible]? >> Yanen Li: Yeah. So later in this talk I will show you some [inaudible] in the random bucket. So in the random bucket is an unbiased estimation of the relevance ranking model. Because it [inaudible] reduce the vertical position bias. >>: No, if you want to [inaudible] random data. >> Yanen Li: Yeah. Of course, yeah. No, you could also -- so model there is a vertical position bias by a click model. Okay? Okay, let me move on. So the implication of the vertical position bias on relevance ranking is that so when you observe a click on a lower precision, so you actually want to emphasize the clicks on that position more in the future. And very interesting use of behavior we analyze is the horizontal skipping behavior. So which means that the users try to skip relevant results even when these results are already displayed. For example here, the result, the open eye means that the user actually examine the whole column. Actually open this column and start to examine. The closed eye means that the user skip -- skips the whole column and continue to type. So, for example, here the Obama healthcare bill, although the user finally clicks on the query, she skips most of the time even if this query is already displayed. So it happens very frequently. Actually, it happens more than 60 percent in all sessions. And we -- here hypothesis horizontal skipping bias assumption. It states that a query will not receive any click if the user skips the suggestion list regardless of relevance of the query to the prefix. >>: I'm not following. Based on your example, it looks like basically all that matters is precision [inaudible] and the user only has so much attention while typing. >> Yanen Li: No, so ->>: [inaudible] what's position one in the click? >> Yanen Li: So this is one particular example. Of course there are other examples user click on two and three. But here these examples just illustrate a case that so when the user don't examine their whole list, when, for example, she talk too fast, even when relevant result displayed, there is no click observed. So which it states the horizontal skipping behavior. >>: So another way to say this is at least 60 percent of the usage you see that the users do not click on the desired suggested list immediately as soon as they are printed. >> Yanen Li: Um-hmm. Yeah. >>: So is that something that can be captured with a different discount [inaudible] you're just arguing that the discount is steeper for this problem than for [inaudible]? 
>> Yanen Li: So actually this kind of behavior combined with the vertical position bias, actually we argue that -- so the assumption is that even the query is irrelevant or the query is not examined. Sometimes it display at a very deep position or sometimes the user talk too fast so that she just doesn't look at it at all. So in this case there is no click at all. >>: [inaudible] user did not look at [inaudible] but I don't want to type too many [inaudible] for the first one, first column [inaudible] three times to get these results [inaudible] the last one, I've had people faster and get this ideal one to this first place and would type that. >> Yanen Li: Yeah. That relays to the utility of the -- completing the whole task to the effort of actually [inaudible] examine a click on it. Right? So actually in our modeling, in the [inaudible] model, we add some bias to illustrate this. >>: In your log, do you [inaudible] up/down arrows, that kind of thing? You say you look every keystroke. Do think include typing an up/down arrow, cap, space? >> Yanen Li: No. >>: No? >> Yanen Li: No. Yeah. >>: So one user is typing [inaudible]. The list that's presented is static for most of that. >> Yanen Li: No, it's not static. It's changing. >>: [inaudible] changing the first two columns are the same, right, on your -- up there. As they type, most of the letters [inaudible] I suspect most of the columns don't change. >> Yanen Li: No. Sometimes they actually [inaudible] the top results will change because it's not any more matching the prefix. >>: Yeah. In this case. Would there be some utility to learning Obamacare glitches has been at top for a few characters, has not been seen, maybe that should lower its likelihood and push it down. And what if Obamacare healthcare bill is loaded up [inaudible]? >> Yanen Li: I'm not sure what is your question. >>: I'm wondering if this motivates some additional information because if something is at the top and yet it's not even clear ->> Yanen Li: Yeah, of course there are -- there are -- I think there are several more interpretations than the -- this skipping behavior bias assumption. Because the QAC process a very complex process. So we try to model the -- argue the most important reasons why there is no clicks. >>: Let me sort of refine the question which is you're kind of arguing that the user is not looking at the results based on lack of clicks. >> Yanen Li: Yeah. Right. >>: But really they might be looking at the results in only a small portion, they might be looking every time that there's a significant flash that results have changed and been re-examined, which then you might be able to estimate by the amount of total change [inaudible] or the even better refined thing [inaudible] just doing the eye tracking study and looking at when is the user looking at the results of that rather than guessing based on -- because it's important to know whether they're looking or not and how deep. >> Yanen Li: Yeah, yeah, that's a good suggestion. Yeah. Okay. So here I try to argue the implications of the horizontal skipping bias to the relevance ranking. So in order to do that, we did an experiment on the real data. So first we train a RankSVM model on -- only on the last column. And then we trained it by all columns. So it turns out that trained on all columns so achieves -- actually achieved not better than just training on last columns. So our hypothesis is that we can actually train on the examine columns, train on the examine columns only. 
Because the examined columns will provide more certain negative or positive examples than the columns that are not examined; there is much more noise in those columns. So our goal is to build a better relevance model by better modeling the horizontal skipping behavior and the vertical position bias behavior, because the whole click model can be decomposed into three components. Another question we want to ask is whether we can adopt the existing click models from the document retrieval side, because there are already several click models such as UBM, DBN, and BSS. After investigation, our answer is no. For one, the horizontal skipping behavior is unique to the QAC process, right? So we expect that these models will not have very good results on the QAC process. And, second, these kinds of models are not content-aware, meaning that they cannot handle unseen prefix-and-query pairs. However, in the QAC process, the proportion of unseen prefix-and-query pairs is very high, more than 67 percent on the PC platform and 60 percent on mobile phones. >>: You said UBM, universal ->> Yanen Li: UBM, yeah. The user browsing model. >>: Browsing model. Okay. DBN? >> Yanen Li: DBN, dynamic Bayesian network. Yeah. Okay. Cool. And that's why we propose a new two-dimensional click model to model the QAC process. In this model, the first component is the H model, for the horizontal skipping behavior. In the H model, H_i equals 1 means the user actually stops and tries to examine this column i, and H_i equals 0 means the user skips it and moves on to the next column. We model the distribution of H_i equals 1 using a sigmoid function over a combination of multiple features. There are a couple of interesting features, such as the typing speed. The intuition there is that an expert user is not likely to examine many columns. The isWordBoundary feature is also important, because we expect that people usually examine the columns at word boundaries. And the current position also plays a role here, because we expect that people will not examine the column at the beginning of typing. The D model tries to model the vertical position bias. D_i equals j means the user has examined down to depth j. We use a softmax function to model this distribution. And we use the C model to model the intrinsic relevance between the query and the prefix. C_ij equals 1 means there is a click at column i and row j. So let me walk through some cases where we don't receive a click and where we do receive a click. In the first case, the user stops here and H_i equals 0; there is no click observed, and she types another character, with the probability of H_i equals 0. In another case, the user actually opens and looks at this column, H_i equals 1, and she examines to a depth of 2. So even though there is a relevant query here at depth 4, we don't observe any click, because the user hasn't examined down to this query. In the third case, the user has opened this column and she has also reached the depth where there is a query, so she examines it. However, because the query is irrelevant, we don't observe a click. >>: So examine means what? >> Yanen Li: Examine actually refers to two events. One is opening the column to examine down it, and the other is examining down to level j. 
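Pulling the three components together, here is a hedged sketch of how a click probability could be assembled from them; the feature names, weights, and the crude relevance score are placeholders, and the real model learns its parameters with EM over the latent H and D variables, as discussed next:

```python
# Sketch of combining the H (horizontal skip), D (examination depth), and
# C (relevance) components into a click probability for column i, row j:
#   P(click at i, j) ~= P(H_i = 1) * P(examined depth >= j) * P(C_ij = 1).
# All parameters below are made up; this is not the trained TDCM model.
import math

def p_examine_column(features, w=(-1.5, 1.0, 0.1), b=-0.5):
    """H model: sigmoid over (typing_speed, is_word_boundary, position)."""
    z = b + sum(wi * xi for wi, xi in zip(w, features))
    return 1.0 / (1.0 + math.exp(-z))

def p_depth_at_least(j, depth_weights=(2.0, 1.5, 1.0, 0.5, 0.2)):
    """D model: softmax over the maximum examined depth, then a tail sum."""
    exps = [math.exp(s) for s in depth_weights]
    total = sum(exps)
    return sum(exps[d] for d in range(j - 1, len(exps))) / total

def p_relevant(prefix, query):
    """C model placeholder: crude prefix-match score instead of a learned relevance model."""
    return 0.9 if query.startswith(prefix) else 0.1

def p_click(prefix, query, row, typing_speed, is_word_boundary, position):
    h = p_examine_column((typing_speed, is_word_boundary, position))
    d = p_depth_at_least(row)
    c = p_relevant(prefix, query)
    return h * d * c

print(p_click("obama h", "obama healthcare bill", row=2,
              typing_speed=3.0, is_word_boundary=1.0, position=7))
```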
So the H ->>: How do you know the depth? >> Yanen Li: The depth that we modeled. We model each depth. Right? The DI equals to J models how deep the users goes to. Right? And in this case we still don't observe any click. And finally so only when the query -- the user opened the column to examine and then she reach to the depth to a query that is relevant to the prefix. Then we finally observe a click. Right? So which means that only when the user has examined the query and the query is relevant we observe a click. Right? And so we know that the DI [inaudible] and the H are both latent variables. So in order to estimate the distribution of H and D as well as C, we need to sum over all the possible values of H and D. And we can estimate using E-M algorithm. So we'll skip the mathematical details and instead I will explain more on the experiment part. So we did an evaluation on three sets of data. First is on the PC site and the other is on the mobile devices. And we also introduce our random bucket test dataset in which we shuffle the query list for each prefix. Because the random bucket data can provide the unbiased evaluation of the relevance model because it will reduce the vertical position bias. So we use the MRR as the major metric and we measure the MRR across all columns. And we include two sets of baselines. And the first set is the non content aware models. Right? And the second set -- the second set -- the second baseline the is BSS model which is a content-aware model. So our model, TDCM, is actually kind of a content-aware model because it can handle unseen prefix and query pairs. And so for the results we have several observations. First of all, the non content-aware models is somehow not effective to model the -- all columns of data. So indicated by the performance that is actually lower than the MPC. The MPC is just looking at the query count, global query counts. Because they are not able to model the content-awareness. And because a large portion of the prefix and query pairs are unseen. And our second observation is that actually the BSS model sometimes it do a rather good job because it is a content-aware model. However, so its performance is not stable. And sometimes it's better than the MPC baseline; sometimes it's lower. Indicating that it's not effective to model all columns of data. And finally so our model by better modeling of the unseen data as well as better modeling of the whole QAC process, thus, our model achieves the best results. Question? >>: So the MRR@All, this is over all columns of data that you take MRR for each of them and ->> Yanen Li: Yeah, MRR@All means we measure each column's MRR and then take the average. >>: Does that give higher weight to longer prefixes that were typed before completing? >> Yanen Li: This is... >>: You're going to represent more of that average, right? >> Yanen Li: Yeah, you're right, but the [inaudible] because we do the same thing for all methods. So the relative performance will be different. >>: Right. But we're essentially seeing things weighted towards methods that perform well ->> Yanen Li: Yeah, you're right. So at the first one or two columns, the MRR value would be final, right? >>: Right. >> Yanen Li: Because it's much harder. >>: And also the methods that don't take into account context might perform better on the shorter ones where there's less [inaudible]. >> Yanen Li: I think so for taking advantage all columns better, so if you even do a good job on these early columns, right? 
Because if you only look at the last column, so it's not likely you can predict very good on the first one and two columns. Okay? So in order to validate the H model, we also did another experiment. We tried to leverage the user behavior to enhance other [inaudible] methods. So in this experiment, we trained the RankSVM by all columns, last column only, and viewed columns. The viewed column means the columns that we are quite sure the users has examined. And it turns out that so by modeling all columns is actually worse than modeling the last column. Right? If you only use the RankSVM and trained [inaudible]. And, however, by better modeling the horizontal behavior, so actually the data is more clean so that it could provide more valuable information. Right? So that the RankSVM achieve a little bit better than training on last column. >>: Do you have the sense that training on all the viewed columns gets us better result because there are more -- essentially more instances of data in the training set ->> Yanen Li: No, it's not necessarily the case because by training on the first one or two columns, so it's quite noisy. So it's ->>: Compared to the last column. >> Yanen Li: Hmm? >>: Between training [inaudible] last column versus training on the viewed columns. Training on the viewed columns essentially has more training data, right? >> Yanen Li: Yeah. But -- yeah, you're right [inaudible]. The training data should be also clean. If you provided all noisy data, so it will not help you much. >>: I understand. Do you know how much training data we're talking about for these experiments? >> Yanen Li: So we have [inaudible]. >>: [inaudible] training data? >> Yanen Li: [inaudible] 50,000 to 100,000 pairs. >>: [inaudible] columns there are [inaudible]? >> Yanen Li: Yes. We also have the average columns for each section. So I didn't provide here for. For the iPhone data, the average column is about 9 and for the PC about 13 to 15. >>: Okay. So that's the number of -- that's all? >> Yanen Li: Yeah. >>: [inaudible] viewed, is that two columns that are viewed per session ->> Yanen Li: It depends on [inaudible]. >>: And what's average? >> Yanen Li: I don't have the number here, but I guess it's more than two. Yeah. >>: What does this average prefix mean on the table? >> Yanen Li: Average prefix means we do the MRR on all columns and we take average. >>: So the MRR for a column where there are no clicks, it is going to be 0? >> Yanen Li: No. So if the click query is this display on that, we consider it as [inaudible]. >>: Okay. >> Yanen Li: Okay. So it's also interesting to see the learned user behavior by looking at the feature weights. For example, for the H model, the learned feature weight for the typing speed is actually [inaudible] proportional to the probability of H equals to 1. Right? Which matches our intuition, because an expert user is not likely to examine the whole column. Also, the word boundary as well as the current position is important features. Meaning that so people actually examine, start to examine the column at their word boundaries as well as at the later columns. And on the D model, so we found that the first three positions occupied the most examination probability which also matches our observation for the vertical position bias. And, third, for the relevance model, we find that the query history frequency is important. >>: [inaudible]. >> Yanen Li: No, the MPC [inaudible]. >>: What is that? >> Yanen Li: You mean the -- the MPC is the global query frequency. >>: Okay. 
>> Yanen Li: And this one is the query count in your past history, meaning that users actually often use QAC as a query storage. And also the GeoSense, the geolocation-related query counts, as well as the time-related query counts play a role in the relevance model. So -- yeah. >>: [inaudible] features [inaudible] understand the meanings [inaudible]. >> Yanen Li: Understand the meaning of this ->>: [inaudible] entertainment people will click high and for politics, political [inaudible] queries [inaudible]. >> Yanen Li: So what's your question exactly? >>: For given queries, you will have different intent. >> Yanen Li: You're right. So, yeah, we also have a feature related to the query intent, for example the navigational query versus the informational query, which I didn't show here; it's actually in a later part. So it also plays a role here. In my backup slides, I actually have a slide showing all kinds of features for each model. And let me draw a brief conclusion for this part. We first collected the first high-resolution query log specifically for the query auto-completion process. After collecting this new dataset, we analyzed two important and interesting user behaviors, namely the vertical position bias as well as the horizontal skipping bias. We've shown that they have great implications for relevance ranking. We then proposed a novel two-dimensional click model to model these two interesting user behaviors, and the resulting relevance model actually outperforms the existing click models. Okay, so let me come back to our roadmap. In my Ph.D. study I have addressed the key questions in each level of query understanding. Our solutions try to leverage frequently occurring patterns in large amounts of data, so that our methods are generally applicable to all types of queries. Second, we usually use unsupervised or semi-supervised models, so that we require minimal human effort. Thus, our methods could potentially scale up to millions of queries. In the future, in the query understanding direction, I would like to explore the following interesting research directions. First of all, I would like to investigate other important contexts such as location as well as temporal context. Those are very important contexts, and they trigger very interesting research questions. Secondly, notice that my current research agenda focuses on single queries. However, I want to expand my research to a scenario that combines multiple queries, which is the search task scenario. And last but not least, I want to explore the interesting field of mobile search. In mobile search, the query characteristics could be very different from the PC side, and there could be many interesting applications for query understanding. Finally, I also want to apply my research to the broader fields of data mining and recommender systems as well as biomedical informatics, where I have also done some work. And with that, I will stop here and try to take more of your questions. [applause]. >> Kuansan Wang: We have ten minutes. >>: For your last work, how is this fundamentally different from Web result ranking? >> Yanen Li: With what? >>: The search result ranking. We have a ranking in query auto-completion. >> Yanen Li: Yeah. >>: How is this fundamentally different from Web result ->> Yanen Li: Yeah. 
The fundamental difference is that QAC is very personal. We don't have third-party relevance judgments, right? So we can only rely on implicit feedback. >>: Yeah, for query [inaudible], you click it. >> Yanen Li: Yeah, but it's not -- it's not the major [inaudible]. >>: I think it would be very good to work on personalization, ranking for personalization, which is very personal. >> Yanen Li: Yeah. You're right. Yeah, it's more related to personalized search. >>: In your [inaudible] work to improve ranking, you didn't mention -- didn't tell us exactly how you measure the relevance between the segmentation and [inaudible]. >> Yanen Li: Yes. Yeah, good question. So in that part we have a document language model; specifically we use a bigram generative [inaudible] model to score the different segmentations. Different segmentations mean we have different chunks, right, and for each chunk we use the document language model to score it. >>: The language model is based on all the documents or ->> Yanen Li: Based on -- this is [inaudible] language model, first based on this document only, and we [inaudible] our documents. >> Kuansan Wang: Any other questions? All right. Let's thank the speaker. [applause]. >> Yanen Li: Thanks very much.