>> Kuansan Wang: It is my greatest pleasure to welcome Le Zhao back to Microsoft. Many of you probably overlapped with Le Zhao when he was a former ISRC intern in our group working with Cha Lu [phonetic]. But we all remember he's actually from CMU, in the LTI, the Language Technologies Institute, a very famous institution in the HLT area. And for those of you who are active in the TREC community, you probably already know him; he's an active contributor in the TREC community, contributing to the Lemur toolkit. His earlier work spans from structured document retrieval to, most recently, the term mismatch problem. In addition to that, since Le is from the LTI, he's also well known in the HLT community. He has done work in patent retrieval and biomedical document retrieval, and today he's going to tell us about his Ph.D. thesis work. So without further ado.

>> Le Zhao: Hello, everybody. I'm really glad to be here again, to see my old friends and to make new friends. And I'm especially glad to see a reasonable turnout on a Monday morning. First, before I get going with the talk, I want to mention that I know there are people joining through the video link, and there's no way for you to participate in the question-and-answer session. I'll have my e-mail address here; you can send your questions to my e-mail, and I'll try to get to them at the end of my talk.

Okay. So this talk is about text retrieval, and let's first see what text retrieval is. The task is that the user, the confused user, generates a query, which contains certain query terms. The retrieval engine returns a set of results from a document collection and feeds them back to the user, so hopefully the user will be happy with the results. This task and the retrieval engine are usually evaluated using a Cranfield-style evaluation, which abstracts away the user by retaining only the query that the user issues and only the results that users are happy about, which are called relevance judgments. These relevant documents, together with the queries, help us evaluate the retrieval engine in a relatively objective, automatic way.

So here comes the big question: Where are we? And where are we going? The current retrieval models date back to the early 1970s; the best ones are still from the 1990s. These current models are based on simple collection statistics, like TF-IDF. TF is term frequency, which measures how frequently a term occurs in a result document, and IDF is inverse document frequency, which measures how rare a term is in the collection. What these retrieval models don't do is any kind of deep understanding of natural language text.

So given this current status, what does the ideal look like? What is perfect retrieval like? Given the query "information retrieval" and an answer text that says "text search," a perfect retrieval model should be able to judge that text search implies information retrieval. That is the textual entailment task, inferring when one piece of text entails another, and it's known to be a difficult natural language processing task. Searchers are also frequently frustrated when they are doing informational searches. So we're still fairly far away from the perfect land. But what problems have been holding us back all these years? This work argues that perhaps the following two central and long-standing problems in retrieval are the culprit. One is the term mismatch problem, where query terms fail to appear in relevant documents.
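Before going further, and just to make the "simple collection statistics" mentioned above concrete for readers following along, here is a minimal TF-IDF scoring sketch. It is purely illustrative: the function names, data layout, and the exact IDF formula are my own choices, not formulas from the talk (BM25 and language models add length normalization and smoothing on top of these statistics).

```python
# Minimal illustrative TF-IDF scorer. A simplified variant, not the exact
# formula of any system discussed in the talk.
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score one document for a query using raw TF and a simple log-IDF."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        if tf[t] == 0:
            continue                      # term mismatch: contributes nothing
        idf = math.log(num_docs / (1 + doc_freq.get(t, 0)))
        score += tf[t] * idf              # frequent in doc, rare in collection
    return score

# Toy usage with invented documents.
docs = [["oil", "spill", "cleanup"], ["text", "search", "engine"],
        ["stock", "market", "news"]]
df = Counter(t for d in docs for t in set(d))   # document frequency per term
print(tf_idf_score(["oil", "spill"], docs[0], df, len(docs)))
```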
Term mismatch happens because people describe the same thing using different vocabulary. This general vocabulary mismatch problem was studied in the 1980s, when Sue was still working for Bell Labs. That's how long-standing this problem is. However, there's still no clear definition of term mismatch in retrieval. The second problem is query-dependent term importance. If you are familiar with retrieval models, this is the probability of a term t occurring given the set of documents relevant to the query. Traditionally, term importance is assessed using IDF, basically how rare a term is in the collection, which has nothing to do with the query and nothing to do with relevance. This probability, P(t|R), is known to be important for retrieval; it's known to improve retrieval a lot. It appeared very early on, but it has been studied only sparsely, and that research provided very few clues about how to accurately estimate this probability in a query-dependent way. This work connects these two problems, shows that solving them can result in huge gains, and uses a predictive approach to try to solve both. In this talk I will use the term mismatch problem as the thread, because it's the more general of the two problems, as I will show.

First, what is term mismatch and why do we care? In job search, you might be looking for information retrieval jobs on the market, but a job post could just as well say text search. This could easily cost you potentially 50 percent or even more of the job opportunities, even if you are careful when formulating your query. In legal discovery, you might be looking for bribery or foul play in corporate documents, but they'll never say that; at most they'll say grease or payoff, and this will cost you cases. In patent search it could cost businesses. In medical record retrieval, successfully finding the record regardless of the vocabulary mismatch could potentially save lives. So in the areas where the user cares most, mismatch can hit hardest. People know that mismatch is an important problem, and they have tried to solve it from different angles: from the document end, from the query end, and from both ends. Our approach is different.

So suppose you are given a problem, any question, any problem. How would you proceed to solve it? What's your first step? The first step is always to clearly define the problem, and then to see whether it is a real problem. In this case I will show you that in theory and in practice it is a real problem. Then we try to understand the underlying mechanism of how the problem is created and try to solve it in principled ways. I promise to show you this in this talk, and I will come back to this slide throughout the talk, so that you know I've fulfilled my promise.

So first, the definition. We define the mismatch probability of a term to be the probability of the term not occurring in a document, given that the document is relevant to the query. Suppose we have a collection of documents here; the larger bubble represents the documents that contain t, and the smaller bubble is the documents relevant to your query. Mismatch is the proportion of the relevant documents that do not contain the term t. So if you use the word retrieval as your search term, these will be the jobs that you missed out of all the relevant jobs on the market.
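Since this definition is directly computable from relevance judgments, here is a minimal sketch of the estimate. The data layout (a list of term sets, one per judged-relevant document of a single query) is an assumption chosen for clarity, not the TREC judgment file format.

```python
# Estimate term recall P(t | R) and mismatch 1 - P(t | R) from relevance
# judgments for one query. `relevant_docs` is a list of term sets, one per
# judged-relevant document (an assumed layout for illustration).

def term_recall(term, relevant_docs):
    """Fraction of relevant documents that contain `term`."""
    if not relevant_docs:
        return 0.0
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs)

def term_mismatch(term, relevant_docs):
    """P(term does not occur | document is relevant) = 1 - term recall."""
    return 1.0 - term_recall(term, relevant_docs)

# Toy usage: 3 of 4 relevant documents say "spill"; one says "leak" instead.
rel = [{"oil", "spill"}, {"oil", "spill", "tanker"},
       {"oil", "spill", "cleanup"}, {"oil", "leak"}]
print(term_recall("spill", rel))    # 0.75
print(term_mismatch("spill", rel))  # 0.25
```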
So first I want to comment that the term mismatch probability is related to the term recall probability: recall is just the complement of mismatch. Secondly, this probability can be calculated very directly if you know the set of documents relevant to a query. So if you have relevance judgments for queries, you can accurately estimate it: basically, how many relevant documents do not contain the term t, divided by the size of the relevant set.

Now, let's look at some examples. These are some queries and query terms, and these are the term recall probabilities for these query terms in these queries. The word spills in oil spills occurs in 99 percent of the relevant documents. It has very high recall, low mismatch. Why is that the case? Perhaps there are no other ways to describe an oil spill. Yes, question?

>>: How do you know which documents are relevant? Do you manually label them?

>> Le Zhao: Yes, we use manually labeled documents, and we collect those. Thanks for the clarification. So perhaps there are no other ways to describe spills, except oil spills, because oil leaks means something else. In the query about term limitations, the word term appears in 98 percent of the relevant documents; perhaps there are also no other ways to describe the word term here. However, the same word term in long-term care appears in only 68 percent of the relevant documents. Why is that the case? Because long-term care can be described as elderly care or home care, et cetera. It's not really a necessary term for relevance. The word effects is an abstract term, and it appears in only 28 percent of the relevant documents, because effects could be described as improvements, decrease, impact, et cetera. Ailments is not only abstract but also a rare term, and that makes the situation much worse. These queries are from TREC datasets, where the government provides users who generate the queries and do the assessments to produce relevance judgments, and we as participants can freely participate and evaluate our systems. So this is a very nice deal.

So by now there should be a lightning strike in your head. We have a simple definition which allows us to estimate the probability of mismatch from relevant documents and to analyze mismatch. This probability, the probability of a term occurring in the relevant set, occurs in one of the very early retrieval models. If you assume that term occurrences are binary and conditionally independent of each other given relevance information, then the optimal ranking score is given by this formula. If you are familiar with machine learning, this is just a naive Bayes model. The retrieval model aggregates relevance scores for the terms that appear both in the query and in the document, and we are scoring whether document D is relevant. In this case, two sets of conditional probabilities determine the retrieval score, and two probabilities determine the optimal term weight. One is the probability of the term occurring in the nonrelevant set. Because the relevant set is usually very small compared to the collection, this probability can be accurately approximated by the probability of the term occurring in the collection, and it results in the traditional inverse document frequency based term weighting: term weighting based on how rare a term is; if it is a rare term, it is more important.
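The slide formula itself is not captured in the transcript. For reference, the standard binary independence (Robertson and Spärck Jones) form of that ranking score, which I believe is what is being described here, can be written as follows, with p_t = P(t|R) the term recall probability and s_t = P(t|non-R) the probability of the term occurring in the nonrelevant set; this is a reconstruction, not a copy of the slide.

```latex
\mathrm{score}(D, Q) \;=\; \sum_{t \,\in\, Q \cap D}
  \log \frac{p_t\,(1 - s_t)}{s_t\,(1 - p_t)}
\;=\; \sum_{t \,\in\, Q \cap D}
  \Bigg[\;
    \underbrace{\log \frac{p_t}{1 - p_t}}_{\text{term recall part}}
    \;+\;
    \underbrace{\log \frac{1 - s_t}{s_t}}_{\approx\ \text{IDF part}}
  \Bigg]
```

When s_t is approximated by the collection statistic (document frequency over collection size), the second term is approximately IDF, which is the part contrasted with the term recall part in what follows.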
The other part, we now know, is determined by the term recall probability. Now, this is a very basic model. However, more advanced models use this as the sole part that specifies term weighting, how important a term is, and those other advanced models behave similarly. These advanced models have been used as very effective features in Web search. It is important to recognize that this probability, the term recall probability, is the only part of the formula that is about the query and about relevance. The IDF part has nothing to do with relevance, because it's a collection statistic, and it has nothing to do with the query. The full formula has been called the relevance weight or term relevance, but the term recall part is what really talks about relevance. In theory, it's as important as IDF, and it's the only part of a retrieval formula that's about relevance. In practice, because people know this probability is difficult to estimate (you need information from the full set of relevant documents in order to estimate it), people typically just ignore this part and use only IDF-based term weighting. When people do that, it causes the emphasis problem, where the retrieval model will try to emphasize the high-IDF terms in the query. For example, for the query prognosis of viability of political third party in the U.S., prognosis and viability have high IDF; they're rare terms, so they are being emphasized. If we look at the term recall probabilities, political third party should be emphasized instead. When the retrieval model assigns the wrong emphasis to the query terms, there can be top false positives. This is a ranked result list from an advanced retrieval model, a language model, and all of the top ten results are false positives, meaning they're irrelevant documents that happen to contain the rare terms prognosis and viability but are not about the topic. It's important to recognize that this is an emphasis problem instead of a precision problem, because if you just look at the top results, you see that prognosis and viability are being used about something else instead of political third party. So you might think that if you require prognosis and viability to be about political third party, it might improve the situation. But in fact, increasing the strictness of the match can only increase mismatch and can make the situation worse. So this is a mismatch problem, not a precision problem. Recall, not precision. Even Google and Bing still have top ten false positives. I should note that Bing is much better than two years ago when I first tried this query on Bing, and Google actually decreased performance a little bit on this query; I don't know why. So there are false positives throughout the ranked list, decreasing precision at all recall levels. I've shown you that this is an important problem, but how frequently does it occur? How significant is the problem? It turns out that the 2003 RIA, Reliable Information Access, workshop gathered many groups of experts and evaluated research IR systems, including language models, BM25, and pseudo relevance feedback, all the standard techniques still being used now, and did a failure analysis of 44 failed topics. We're summarizing their results here; they did not summarize them this way.
Out of these failures, 64 percent are due to emphasis, and we now know that term recall based term weighting can help solve this problem. Another 27 percent are the mismatch problem, where you need some kind of query expansion, and we now know that if we know which terms tend to mismatch, we can guide our expansion toward fixing those problem terms. So underlying more than 90 percent of the failures is the need to predict this term mismatch probability. In practice, it explains common failures of retrieval models, and not only that, but also many other behaviors of retrieval techniques. For example, when you combine bigrams with unigrams in your query, the bigrams tend to get a much lower weight than the unigrams. Why is that the case? Because bigrams increase mismatch, right? So they should have a much lower recall probability and a much lower weight than the unigrams. And personalization, disambiguation, and structured retrieval, which enforces structural matching between the query and the documents: these techniques all increase precision and are shown to be less stable for improving retrieval. Why is that the case? Perhaps the problems those queries are suffering from are mismatch problems, not precision problems.

Okay. I've shown you that it's a frequent problem. Now let's focus on the emphasis problem. It's a frequent problem, but what about the retrieval performance gain? What kind of gain are we talking about? For basic models, if you apply the true term recall probabilities in the retrieval model, you can get a 100 percent gain in retrieval. In more advanced models, you can still get a 30 to 80 percent gain. So for a new query without relevance judgments we need to predict this probability, but the prediction perhaps doesn't need to be very accurate to show a performance gain, because the potential gain is huge.

Now on to prediction. How do we predict this probability, this mysterious probability for which people have found very few clues? We look at the data. First, it varies from 0 to 1, so we need prediction. Second, the same term in different queries can have different term recall probabilities, so we need a query-dependent way to predict it. Third, it's different from IDF. These three trends also show up in larger-scale analysis. Here I'm plotting the term recall probabilities; each point is the term recall probability of one term, sorted in descending order. As you can see, the term recall probability varies from 0 to 1 almost uniformly.

>>: All this comes up in one query?

>> Le Zhao: This is from one TREC dataset, which includes 50 queries. These are fairly long queries; for shorter queries there's a bias toward higher recall. And you should be surprised by this statistic, because on average a query term mismatches 30 percent of the relevant documents, and that's a query term from a short query. So imagine that, as a user, you type one query term into your search engine: you're excluding 30 percent of the relevant documents. As you type the next term, you are excluding another 30 percent of the remaining ones. That gives you less than 50 percent of the relevant set to even begin with.

>>: Confuses the term recall [inaudible], and you're not doing any stemming?

>> Le Zhao: Good point. I'm doing stemming, yes, so this is recall after stemming.
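To make that back-of-the-envelope argument explicit, under the simplifying and purely illustrative assumption that query terms match relevant documents independently, a conjunctive match on n terms covers roughly the product of the individual recall probabilities:

```latex
P(\text{all } n \text{ terms match} \mid R)
  \;\approx\; \prod_{i=1}^{n} P(t_i \mid R)
  \;=\; 0.7 \times 0.7 \;=\; 0.49 \;<\; 0.5
  \qquad \text{for } n = 2,\; P(t_i \mid R) = 0.7 .
```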
And globally, plotted here are the mean and the variance of the term recall probability for the same term occurring in more than one query. As you can see, it still varies from 0 to 1, and the spikes are the variances. For a lot of terms there is some variance, but the variance is small; it can be large for certain query terms. And plotted here, the points are query terms; the X axis is IDF and the Y axis is the term recall probability. As you can see, there's a slight correlation, but it's still messy, so you cannot directly predict this probability using IDF. But what have prior approaches done? Prior approaches have tuned this as a constant, or used IDF as the only feature for prediction, and their success has been limited to very basic models. The missing piece is the knowledge that this probability measures term recall and is related to term mismatch. With this knowledge, we can ask ourselves: what causes mismatch? What might cause mismatch? First, a term, a concept, being not central to the topic. Words like related or potentially are not really central to this topic; propounded is not really central. So these terms tend to mismatch. Second, synonyms tend to occur in place of the original query term, causing mismatch. Third, abstract terms tend to be replaced by more specific terms in relevant documents, causing mismatch. Given these factors, we can try to design features to model them and to do prediction. So I've shown you the mechanism of how mismatch occurs and how mismatch causes problems in retrieval.

In terms of features, what we need to do is identify synonyms of a query term in a query-dependent way. We made specific choices in our feature design to be general: the features depend only on the query and the collection, not on external resources. Certainly that's not the only way, and there should be better ways to design these features, but this is the first set of features that has been shown to work for this problem. Let's see how we do that. First, external resources like WordNet, Wikipedia, and query logs have a coverage problem; they tend to be static, not query dependent, so they're not easy to use, and using them is a research topic in itself. What we did instead is use a term-term similarity measure in a concept space to help us identify the synonyms. It's called local latent semantic indexing because, given the query, we do an initial retrieval, get the top set of documents from that retrieval, and apply latent semantic indexing on those top documents. For example, we can take the top 200 documents, do dimensionality reduction, and keep only 150 dimensions. Here are some examples: this is the query term, and these are the top similar terms identified by latent semantic indexing. We use the self-similarity of the term as one feature, with inner product as the similarity measure. That measures how well the term correlates with the latent concept space, and since the latent concept space is related to the query, it measures how well the term relates to the query. We also use as a feature the average similarity of the supporting terms, so basically we're requiring that not only the term but also its supporting terms correlate with the topic, that the concept is central to the topic. And we also designed a feature which measures how likely these synonyms are to replace the term in collection documents, right? (A rough sketch of these features follows below.)
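Since local LSI is described only procedurally here, the following is a rough, hypothetical sketch of how such features could be computed with off-the-shelf tools. The function name, the TF-IDF weighting of the term-document matrix, and the exact term representation in the reduced space are my assumptions, not details taken from the thesis; the dimensions (top 200 documents, 150 concepts, 5 supporting terms) follow the numbers mentioned in the talk.

```python
# Sketch of "local LSI" features: run latent semantic indexing over the
# top-retrieved documents for a query, then measure how strongly a query
# term and its nearest neighbours live in that query-specific concept space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def local_lsi_features(query_term, top_docs, n_dims=150, n_support=5):
    """top_docs: list of raw document texts from the initial retrieval.
    Returns (centrality, supporting-term strength, candidate synonyms)."""
    vec = TfidfVectorizer()
    X = vec.fit_transform(top_docs)              # docs x terms
    n_dims = min(n_dims, min(X.shape) - 1)       # keep SVD rank feasible
    svd = TruncatedSVD(n_components=n_dims)
    svd.fit(X)
    term_vecs = svd.components_.T                # terms x concepts
    vocab = vec.vocabulary_                      # lowercased term -> column
    if query_term not in vocab:
        return None
    q = term_vecs[vocab[query_term]]
    sims = term_vecs @ q                         # inner-product similarities
    centrality = float(sims[vocab[query_term]])  # self-similarity feature
    order = np.argsort(-sims)                    # most similar terms first
    support = [i for i in order if i != vocab[query_term]][:n_support]
    support_strength = float(np.mean(sims[support]))
    terms = vec.get_feature_names_out()
    synonyms = [terms[i] for i in support]       # candidate replacements
    # The replaceability feature described next would additionally count,
    # over the whole collection, how often these candidates occur in
    # documents that do NOT contain the original query term.
    return centrality, support_strength, synonyms
```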
Coming back to the replaceability feature: if the synonyms are frequent terms, then it's very likely that they will appear in documents where the original query term doesn't appear. So these are the features. And we can measure --

>>: Can you go back a slide? I didn't understand number two. What are supporting terms?

>> Le Zhao: Supporting terms are the top similar terms as measured by latent semantic indexing. So basically we represent each term in a concept space and we compute inner product similarities.

>>: With just --

>> Le Zhao: Rank. We rank the terms, and these are the top similar terms.

>>: Seem to [inaudible].

>> Le Zhao: Yes. We pick the top five. Yes.

>>: And then, sorry, I didn't understand three; the synonyms are the top supporting terms?

>> Le Zhao: Yes, the synonyms are the top supporting terms, and we are measuring how likely these terms are to occur in collection documents that this term doesn't occur in.

>>: That it doesn't occur in. Okay.

>> Le Zhao: A measure of how likely these terms are to replace the original query term in collection documents.

>>: That is over a collection of --

>> Le Zhao: Over the entire collection, yes, not just the top documents. We can measure how well the features correlate with the target term recall. It turns out term centrality has a high correlation with the target, noticeably higher than IDF's. A negative correlation also means helpful; positive or negative, as long as the absolute value is large, the feature is helpful. Concept centrality is also very helpful. Replaceability, understandably, has a negative correlation, and it's fairly high. Abstractness: the abstractness feature is based on the observation that users tend to modify more abstract terms with more concrete terms. For example, educational modifies the word program in this query, so program tends to be the more abstract term, and so on; effects also tends to be an abstract term. So we can use a dependency parser to parse the query, and if a query term is being modified by other query terms, then we say it is more abstract. This is a binary feature, and it has a correlation of 0.12 with the target, so it's also very helpful.

>>: And the top documents are retrieved just using IDF?

>> Le Zhao: Yes, yes, using the baseline retrieval model, whatever baseline I'm comparing to.

>>: So you are using a [inaudible], so which dependency parser?

>> Le Zhao: Stanford.

>>: Stanford.

>> Le Zhao: Yes.

>>: The Stanford parser is not designed for queries [inaudible].

>> Le Zhao: No.

>>: So what's the accuracy [inaudible]?

>> Le Zhao: I haven't verified it, but for the small set of queries I looked at, the parses look very accurate. So the parser behaves fairly accurately on short texts. Given these features, we can model the prediction of term recall as a standard regression problem. We use training queries with known relevance to train the model, and another set of queries without relevance information as the test set. Here we use Gaussian kernel support vector regression as the prediction model; you can also use other advanced prediction models like boosted decision trees or boosted regression trees, and they work similarly. In the experiments we are measuring two things: first, how accurately we are predicting term recall, in terms of how close the prediction is to the truth; second, retrieval performance, using overall retrieval success and precision at top ranks, that is, what percentage of the top ten results are relevant. So here's one example. For this query, we are getting the correct emphasis, although the absolute values are still not very close.
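As a concrete, and again hypothetical, illustration of the regression setup just described, here is how one could train an RBF-kernel support vector regressor on per-term features from training queries with known relevance and predict term recall for the terms of an unseen query. The feature values, array shapes, and hyperparameters are assumptions for illustration; only the overall setup follows the talk.

```python
# Term-recall prediction step: RBF-kernel support vector regression over
# per-term features (e.g., IDF, centrality, replaceability, abstractness).
import numpy as np
from sklearn.svm import SVR

# X_train: one row per (query, term) pair from training queries,
#          columns = feature values, e.g., from the previous sketches.
# y_train: true term recall P(t | R) computed from relevance judgments.
X_train = np.array([[5.2, 0.81, 0.10, 0], [2.1, 0.35, 0.60, 1],
                    [7.0, 0.90, 0.05, 0], [3.3, 0.40, 0.50, 1]])
y_train = np.array([0.95, 0.40, 0.99, 0.55])

model = SVR(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)

# X_test: features for the terms of a new query without judgments.
X_test = np.array([[6.1, 0.85, 0.08, 0], [2.5, 0.30, 0.55, 1]])
pred = np.clip(model.predict(X_test), 0.0, 1.0)   # keep inside [0, 1]
print(pred)   # predicted term recall, later used as query term weights
```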
More globally, this is the method that uses the training set average to predict for the test set, and it gets an error of 0.3. Suppose the term recall distribution were uniform from 0 to 1; then using the training set average should give an error of 0.333. So it's not completely uniform. Using IDF alone increases the error a little bit. Using all the features and tuning the meta parameters of the features, we can reduce the error by half. And this shows what happens if we use the recurring terms, the terms that occur in more than one query in the training set, to predict: using a previous occurrence to predict the next occurrence, we can get a fairly low prediction error. But these two values are not directly comparable, because they're not measured on the same set; this one is measured on the test set and this one on the training set. So the probability can be predicted.

I want to briefly insert that our method demands a more general view of the retrieval modeling problem. Traditionally, retrieval modeling is seen narrowly as learning a document classifier for a given query, to classify whether a document is relevant to the query or not. The more general view sees a retrieval model as a meta classifier which is responsible for many queries: it takes the query as input and outputs a document classifier. Given this view, learning a retrieval model is basically transfer learning in machine learning terms: you are using knowledge from related tasks, the training queries, to learn the new classifier for the test query. The features and the model are just facilitating the transfer. This more general view would perhaps lead to more principled investigations of how to learn retrieval models and also allow us to apply more advanced transfer learning techniques in retrieval. Okay, that's the aside.

We're measuring retrieval performance now. If you are familiar with retrieval models, this is how we insert the probability into the retrieval model as the term weighting, in a language model. If you are not familiar with retrieval models, it's okay: we're just weighting the query terms; we're not doing any kind of expansion. This is the performance on six different test sets. Each test set from TREC contains 50 queries, and we're using one as a training set and another as a test set. Across the board there's a 10 to 25 percent gain, and most of the gains are significant. This is measuring overall retrieval performance. In top precision, as predicted by theory, there's also a 10 to 20 percent gain, although not always significant, because the measurements are sparse. So the prediction can be used in retrieval as term weighting to help solve the emphasis problem, and it leads to significant gains. What about the mismatch problem? If we can successfully solve the mismatch problem, increasing the term recall probability of every query term, we can at the same time solve the emphasis problem. So let's recap.

>>: Let me verify, you use the union of the two [inaudible]?

>> Le Zhao: Sorry?

>>: The two stage retrieval.

>> Le Zhao: Yes. We need two-stage retrieval: one stage to do the retrieval, one to generate the features.

>>: And bringing you to [inaudible] the set of -- the relevant documents.

>> Le Zhao: Yes.

>>: And generate the features and bring in the model to weight this [inaudible].

>> Le Zhao: Yes. Two stages.

>>: So do you have a comparing method [inaudible]?

>> Le Zhao: Yes, that's a good point. I do have, but I will come back to that after the talk.
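For readers who want the weighting written out: one natural reading of "just weighting the query terms" in a query likelihood language model is the weighted log-likelihood below, with the predicted term recall as the per-term weight and a Dirichlet-smoothed document model. This is a reconstruction for illustration; the thesis may use a different normalization or operator (for example, a weighted query expressed through an Indri-style #weight operator).

```latex
\mathrm{score}(Q, D) \;=\; \sum_{t \in Q} \hat{P}(t \mid R_Q)\,
   \log P(t \mid \theta_D),
\qquad
P(t \mid \theta_D) \;=\;
   \frac{\mathrm{tf}(t, D) \;+\; \mu\, P(t \mid C)}{|D| + \mu},
```

where \(\hat{P}(t \mid R_Q)\) is the predicted term recall for term t in query Q, \(P(t \mid C)\) is the collection language model, and \(\mu\) is the Dirichlet smoothing parameter.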
>>: So this slide you're showing on the transfer learning, can you go back to that? Sorry, keep going back. So this more general view, that this is sort of like a learning -- are you saying something different? Most learning models are used with a training set and a test set; are you saying something different from that?

>> Le Zhao: Good point. By now the dominant model is learning to rank, which basically learns a global retrieval model from the training set. So just one retrieval model, basically one classification model, and it applies the same model to the test queries. But here we're learning from the training queries and generating a new classifier, so the classifier is different for each query, while learning to rank learns one global classifier. Does that answer your question?

>>: No. So the classifier in your case is the product of the P(t|R)s, or do you actually have an underlying model as well in addition to those?

>> Le Zhao: The underlying classifier is just BM25 or a language model, the traditional models with that probability inserted into the model.

>>: So what you're doing is you're learning an improved probability.

>> Le Zhao: Yes, but at the same time, with different term weighting, that is a different model, a different classification model. It's the same retrieval model but a different classifier, because you're classifying the documents differently with different term weights.

>>: The retrieval [inaudible].

>>: I understand that. I can ask you more about this later. It seems more like different feature values to me than a different model. But I'll ask about it later.

>> Le Zhao: Yeah, yeah, okay. I can perhaps come back to that later. So let's recap. Mismatch ranges from 30 to 50 percent on average, so relevance matching can degrade very quickly for multi-word queries. One solution is to fix every query term. If we fix every query term by expanding it with its synonyms, the result is a conjunctive normal form query. In this case, this keyword query is being expanded into a conjunctive normal form query (sketched concretely below). It's very expressive and very compact: one conjunctive normal form query in this case is equivalent to hundreds of alternative keyword queries. This form is used very frequently by lawyers and librarians. The query here is a legal track query created by lawyers, and as you can see, they spent lots of time trying to expand every query term, so it's a very tedious task. What we propose to do is, given the term mismatch probability, given the prediction, guide the expansion so that the user can focus on the terms that have problems. Here, placement and children are being expanded, and the other terms are kept untouched. If you only have to expand, say, two terms and still get 90 percent of the improvement, that would be great. So how do we evaluate that? Ideally we'd have a user: the user proposes a keyword query, the keyword query is sent to the diagnosis system, which diagnoses the problem terms, and the problem terms are fed back to the user. The user expands the problem terms, the query formulation strategy generates the query, we submit it to the retrieval engine, and we evaluate. In this setup we can plug different diagnosis methods into the diagnosis component and different query formulation strategies into the query formulation component, and we can compare these different methods.
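To make the CNF form concrete, here is a tiny, hypothetical illustration (the query and synonyms are invented; the actual lawyer-built queries are far larger): a CNF query is an AND of OR-groups, one group per original query term and its synonyms, and a document matches only if every group is satisfied.

```python
# Tiny illustration of conjunctive normal form (CNF) query matching.
# The query terms and synonyms below are invented for illustration only.
cnf_query = [
    {"placement", "adoption", "custody", "foster"},   # expanded problem term
    {"children", "child", "juvenile", "minor"},       # expanded problem term
    {"court"},                                        # left unexpanded
]

def matches_cnf(doc_terms, cnf):
    """A document matches if every OR-group contributes at least one term."""
    return all(group & doc_terms for group in cnf)

doc = {"juvenile", "custody", "court", "ruling"}
print(matches_cnf(doc, cnf_query))   # True: each group is covered
```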
That ideal setup is what we intend to do. However, an online user study of such a complex system needs to control many variables, and without millions of users this kind of online study cannot be carried out. What we end up doing is a simulation. We have the expert user give us a fully expanded CNF query beforehand, and we extract the keyword query by taking the first term of each conjunct; then the pipeline proceeds as before. So that's one part of the simulation. The user expansion step is also simulated, by extracting the expansion terms from the CNF query. So this simulation is fairly realistic: it uses full expansions to simulate and evaluate partial expansions.

>>: I'm missing something. Why is it bad to expand everything?

>> Le Zhao: It's good to expand everything. I'm just asking, given our probability, how do we use it? We can use it to guide the expansion, to save some time for the user.

>>: So the expansion happens automatically, right? What does the user do?

>> Le Zhao: The expansion doesn't happen automatically. The CNF query you're seeing was expanded by a lawyer, spending lots of time on one query.

>>: But you know the synonyms, right? We have dictionaries with synonyms. I don't understand, what's the manual part here?

>> Le Zhao: The synonyms are difficult to get. These are the gold standard; if you can get the synonyms, you get great retrieval. But we don't have automatic ways to get them. Perhaps Bing has that.

>>: All right. Okay.

>> Le Zhao: But even Google and Bing have this problem. So that means if we can manually expand more terms, we can do better.

>>: Manually generated CNF queries give you the upper -- [inaudible].

>> Le Zhao: Yes. So I'll show the experiments. Here we're using two datasets, one with CNF queries created by lawyers and one with queries created by search experts. In terms of diagnosis methods, I'm plotting on the X axis the number of query terms selected for expansion, and on the Y axis the relative retrieval performance gain. The upper bound is expanding fully, or not quite fully but close; full expansion roughly gives the upper bound. You can observe that these two points use the P(t|R)-based diagnosis: predict P(t|R), diagnose, run the expansion simulation, and evaluate. Expanding two terms using P(t|R)-based diagnosis gets about 90 percent of the performance gain, while if we only use IDF for diagnosis we need to expand three terms. So we're effectively saving users' time.

>>: Do you know what the very best you could have done is?

>> Le Zhao: What do you mean, the very best?

>>: You could have tried all pairs to see [inaudible].

>> Le Zhao: No, I haven't done that, but that's a good point. We're using a greedy strategy here, not necessarily the optimal one. That's right.

>>: Sorry, I don't think that's what I meant, unless I misunderstood you. So greedily selecting the two, fine, but you also have, like, the four or five expansions, right? So would you be able to try all pairs and see which one gets the very highest retrieval score?

>> Le Zhao: Expanding which query terms, or do you mean how many terms to expand for each query term?

>>: Which query terms to expand.

>> Le Zhao: No, we haven't tried permutations. We are only using a greedy approach to expand --

>>: Try all the possible combinations.

>> Le Zhao: We haven't tried that.
>>: Rank them just by bag of --

>> Le Zhao: Yes.

>>: Because you already have the -- [inaudible].

>> Le Zhao: Yes.

>>: Oracle.

>> Le Zhao: Yes, that would be an oracle experiment. So in terms of the form of expansion, we're comparing CNF expansion versus traditional bag-of-words, keyword expansion. As you can see, CNF does better than keyword expansion.

>>: So what is the [inaudible] expansion would be?

>> Le Zhao: We are using the same set of manual, high quality expansion terms from the users. But instead of doing CNF expansion, we combine that set of expansion terms as a group with the original query, using a weighted combination. We have also tried weighting the expansion terms, but that doesn't help much. So we have tried several different ways to do the traditional kind of expansion.

>>: The traditional way of expansion. You also assume that the candidate expansion terms come from the CNF expansions?

>> Le Zhao: Yes, the candidate terms are from the CNF expansion. I have also tried an automatic method, the standard relevance model, a standard pseudo relevance feedback method, to automatically extract expansion terms, but that's worse than the manual expansion terms. So we show that the problem diagnosis can produce simple and effective CNF queries.

I've also worked on other aspects of the problem, such as improving the efficiency of the prediction. Here we can use one-pass prediction, one-pass retrieval, with a three to ten times speedup while still retaining 70 to 90 percent of the performance gain. That's based on the observation that for many query terms the term mismatch probability doesn't vary much across queries, so we can use that to speed up the prediction.

>>: Does that mean you don't need pseudo relevant documents?

>> Le Zhao: We're not doing pseudo relevance feedback in this case. So it's one stage.

>>: So you need a cache of each query term that has this probability.

>> Le Zhao: Yes, we need a cache. But luckily the training set --

>>: The model and the only difference is the probability is the probability of the term given the document?

>> Le Zhao: Sorry, I don't get your point.

>>: I'll take it offline.

>>: You're predicting from the previous three queries --

>>: Query terms.

>>: Judged relevant documents.

>> Le Zhao: Judged relevant documents from the training set. I've also worked on --

>>: Coverage would be a problem.

>> Le Zhao: Yes, coverage could have been a problem, and, yeah.

>>: I think you [inaudible].

>> Le Zhao: Luckily you don't need much coverage: about 50 percent term coverage gets about 70 to 90 percent of the gain. If you have larger training sets, yes, definitely better. We've also worked on structured retrieval using semantic role based structures. We can annotate the question: which terms are the target, which are the agents, the subjects, the objects. We can annotate the answer in the same way and try to match the structure, not only the keywords. But structured retrieval causes mismatch at the structure level, where, because of a switch of the key verb here, the argument zero becomes the argument two, and it's a different target. We use an undirected graphical model to jointly learn the field-level translation from the question to the answer, using a training set of true question-and-answer pairs.
We can learn the translation and predict which structures are likely given the query, and using the prediction we can use the [inaudible] Lemur search engine to query these alternative structures; it allows us to query this kind of structure. We can get a 20 percent gain in overall retrieval versus using just the strict question structure alone.

Conclusions. This talk is about two long-standing problems in retrieval: term mismatch and term weight estimation. We have provided a definition and an initial analysis of term mismatch. Future work would explore new features and new prediction models that could improve the prediction even further. We have shown the role of the term mismatch probability in basic retrieval theory and used principled approaches to address term mismatch. But what about more advanced models like learning to rank and transfer learning? How does term mismatch play out in those models? We have used automatic ways to predict term mismatch, based on an initial modeling of the possible causes of mismatch, and we have provided an efficient way to predict this probability. Future work should explore better ways to model these causes, or other causes that haven't been explored here. In terms of effectiveness in retrieval, we have used term weighting and diagnosed expansion; better techniques are needed, like automatically expanding the query into CNF form, and better formalisms like transfer learning might help extend this work to more tasks, like relevance feedback, et cetera. We have looked at diagnostic intervention. Diagnostic intervention can happen at different levels of the retrieval process; we're applying the diagnosis at the term level and only diagnosing the term mismatch problem, and we have shown that this guided expansion can help retrieval. Future research should explore the diagnosis of specific types of mismatch problems: is it because of abstractness or synonyms? It could also explore different problems, not just mismatch but also precision problems, so that we can guide many different precision techniques, like personalization, to solve the real problem of a query and improve it. Even further, we can proactively diagnose the user: we can see what problem the user is having and suggest searches or results even before the user types into the search engine.

So that's my thesis work. I have also worked on lots of other things at CMU, building datasets like ClueWeb09, which is being used by more than 200 research groups worldwide and in more than seven TREC retrieval evaluation tasks. I've worked on the Lemur toolkit, an open source IR toolkit which can do lots of fancy things. I've worked on large scale computing, and I teach a fairly popular Hadoop tutorial at CMU. In terms of other research, I've worked on structured retrieval, on legal discovery tasks, on patent retrieval in the biomedical and chemical domains, and on information retrieval for human language technology applications like question answering, filtering, knowledge base extraction from the Web, and information extraction. And with that, I end my talk, and I'd like to take feedback.

[applause]

>> Kuansan Wang: More time for questions.

>> Le Zhao: I'll just check here. Go ahead.

>>: Just a question for the low level [inaudible].

>> Le Zhao: [inaudible] are there better?

>>: I'm just -- so you are using [inaudible] level. So there may be some new level [inaudible] to expand that. So you'd have a more specific kind of test.
>> Le Zhao: I see. I see.

>>: Okay.

>>: A better match [inaudible].

>> Le Zhao: Right.

>>: Specific [inaudible].

>> Le Zhao: Right. So James's point is that a better semantic representation of the text would help solve mismatch. Yeah. I'll just check who else might have had questions during the talk.

>> Kuansan Wang: If not, let's thank the speaker again.

[applause]