>> Li-wei He: It's my pleasure to welcome Qiaozhu Mei from UIUC. He's a Ph.D. candidate in the department of computer science. His research interests include text mining and information retrieval, especially contextual analysis with probabilistic language models and topic models. He has published extensively in these areas, and in both KDD 2006 and 2007 he received the best student paper runner-up award. He's also one of the recipients of a Yahoo! fellowship. Okay.

>> Qiaozhu Mei: Okay. Thanks for the introduction and thanks everybody for coming to my talk. I'm Qiaozhu Mei from the University of Illinois. Today I'm going to talk about a new paradigm of text mining called contextual text mining, with its applications to Web search and information retrieval.

So text data is huge. If we look at the text information on the Web, we can always observe this pattern. [inaudible] research articles, and Wikipedia has six million articles. The research has around 150,000 posts every day, and blog [inaudible] takes five times as many as that. Twitter has around 3 million messages coming in every day, and Yahoo! Groups has 10 billion messages. And when we talk about commercial search engines, how many Web pages are we collecting? A new search engine could be collecting over 100 billion Web pages. So text is huge. We can compare the text data on the Web to a huge snowy mountain which is still growing every day, and we can compare the text mining process to looking for gold in that mountain. Then the question is where you should start and where you can go. Life is hard; you don't have any clues.

So it's time to bring in the context information of text to help. By context we mean the [inaudible] of text. If we look at this particular article from Twitter, we can see how many types of context information there are in such a short article. We have information about the author, the author's location, the author's occupation, and the time at which the article was written. We have the source from which the article was submitted, and we have the language of the article. All of these are pretty easy to extract from the data; we call them simple context. There is another type of context which is much harder to get from the data: you need to digest the text and figure out whether the text expresses a positive opinion or a negative opinion. And there is yet another type of context which is more complex than all of these, such as the social networks in the text.

So when we look at the rich text information on the Web, we always have rich context information as well. PPOP [phonetic] has papers from 73 years, from 400,000 authors, and from 4,000 sources, meaning computer science conferences and journals. Wikipedia has 8 million contributors and more than 100 languages. The research has 5 million users and 500 million URLs. And if we look at other types of text information, we can always find rich context information as well. So this is a huge opportunity: if we can make use of the rich context information in text, can we do better text mining? The question is, with rich context information, can we really do better than what we are doing right now in text mining? If we still picture the text information as this huge snowy mountain, the question is what context is doing here.
We see that context actually provides the guidance for us to do better text mining. So instead of just this snowy mountain, we now have the coordinates of the mountain, we have a map of the mountain, we have a GPS system, and we have highways which take us everywhere in the mountain. In one word, when we have context information, we have a guide which leads us to the gold mine.

In this slide and the few slides after it, I will introduce different applications of text mining with context information, to show what we can really do with context in text mining. In this example, the text information is from query logs and the context information we use is the user information, so we can achieve personalized search. The basic idea is this: if I input the query "MSR" to a commercial search engine, I'm actually looking for Microsoft Research. But if you look at the top results, there's no Microsoft Research. You have a lot of things like [inaudible] research, like medical simulation, like mountain safety research, like Metropolitan Street Racer, but there is no Microsoft Research. However, if you know the user, if you know the user is me, you could give me a much better answer by providing Microsoft Research. So this is how the context information about the user can help, by achieving personalized search.

For another example, if the text information is from customer reviews and the context information is the brand of the laptops, we can extract comparative product summaries like this. We can tell what people are saying about different aspects of different brands of laptops. With such a comparative summary, we can make smarter decisions about which brand of laptop to buy. For another example, if the text information is from scientific literature and the context information we use is time, we can extract trends of research topics in the literature. We can tell what's hot in the sigma [phonetic] literature, so if you want to publish something in next year's sigma conference, here are some shortcuts. For another example, if the text information is from blog articles and the context information we use is time and location, we can track the diffusion of topics over time and space. We can tell how the discussion about a topic spreads over time and locations. For another example, if the text information is still from blog articles and we are modeling implicit context like sentiment, we can generate [inaudible] opinion summaries. [inaudible] we can tell what is good and what is bad for the movie of [inaudible] and for the book of [inaudible] Code. We can also compare these summaries, and we can even track the dynamics of these opinions. And for another example, if the text information is from scientific publications and the context is social networks, we can extract topical communities from the literature. We can tell who works together and on what topic. We can extract communities like the machine learning community, the data mining community, and the IR community from the literature and the social network.

So we have introduced many different applications of text mining with context information. The general research question to ask is whether we can find a general solution to all these problems. If we can, life becomes good, because we can solve all these problems in a unified way, and we can even derive solutions for new types of text mining problems.
In this talk I will introduce a general solution to all these problems, which is called contextual text mining. Contextual text mining is a new paradigm of text mining which treats context information as a first-class citizen. As the outline of this talk, I will first introduce generative models of text as the general methodology for text mining. Then I will introduce how we can incorporate context information into such generative models, by modeling simple context, modeling implicit context like sentiment, and modeling complex context like social networks. We have quite a few publications on these topics, but I will choose two applications of contextual text mining to show how effective the general solution can be and how it can be applied in Web search and in information retrieval.

Let's first look at generative models of text. In many text mining and machine learning problems, we usually assume that the text is generated from some hidden model in a word-by-word manner. In other words, we assume that there is a magic black box which produces the text word by word. This is called the generation process of the text. What we want to do is to reverse this process: based on the observation of the text data we have, we want to infer what's going on in this magic black box. We want to estimate the parameters of these variables. Different assumptions lead to different generative models. It could be as simple as the [inaudible] model, where from the top-ranked words you can guess what the topic in this magic black box is about. It could also be as complex as a mixture of many, many such unigram models.

A particular assumption about the generation of text is that text is generated from a mixture of topics. By topic we mean the subject of a discourse. We can usually assume there are topics like data mining, machine learning, Web search, and databases in the computer science literature, so we can assume that there are K such topics in a text collection. If we look at a particular research paper, it could belong to more than one topic. So from the document point of view, a topic is essentially a soft cluster of documents. And from the word point of view, every topic corresponds to a multinomial distribution, [inaudible] model, over words. From the top-ranked words, you can tell the semantics, or the meaning, of each topic.

Okay. The generative process can usually be explained by so-called probabilistic topic models, including the [inaudible] model and the LDA model. The basic idea is like this. Assume that there are two topics in the data collection, each of which corresponds to a multinomial distribution over words, and we want to generate every word in a document. What do we do? We first choose a topic according to some topic distribution of the document, and once we have selected the topic, we choose a word from the corresponding word distribution of that topic. To generate another word, we do this process all over again: we first choose a topic according to the topic distribution, and once we have chosen the topic, we choose a word from the corresponding word distribution. If we repeat this process again and again, we generate all the observations of the text data. And what we want to do is to infer the topic distribution and the word distributions for every topic from the data we observed. We usually do this by maximizing the data likelihood.
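Before turning to estimation, here is a minimal sketch of the two-step generation just described, in Python. The topics, words, and probabilities are invented purely for illustration and are not from the talk.

```python
import random

# Two toy topics, each a multinomial (word distribution), plus a
# per-document topic distribution. All numbers are made up.
topics = {
    "retrieval": {"query": 0.4, "ranking": 0.3, "document": 0.3},
    "learning":  {"model": 0.5, "training": 0.3, "gradient": 0.2},
}
doc_topic_dist = {"retrieval": 0.7, "learning": 0.3}  # p(topic | document)

def sample(dist):
    """Draw one item from a {item: probability} dictionary."""
    r, total = random.random(), 0.0
    for item, p in dist.items():
        total += p
        if r <= total:
            return item
    return item  # guard against floating-point rounding

def generate_document(length=10):
    words = []
    for _ in range(length):
        z = sample(doc_topic_dist)        # 1) choose a topic
        words.append(sample(topics[z]))   # 2) choose a word from that topic
    return words

print(generate_document())
```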
We want to find the parameters that maximize the [inaudible] of the data generated by the model. For models like [inaudible] we can usually use the standard [inaudible] expectation maximization algorithm. For more complex models, such as LDA, we can use other inference methods like Gibbs sampling, [inaudible] inference, or expectation propagation. The basic idea of [inaudible] is quite simple. Remember that we want to estimate all these parameters. If we already knew the affiliations of the data and the topics, if we already knew which word is generated by which topic, the [inaudible] would be easy: you just count the frequency of the words in the same topic and normalize them. But we don't have these [inaudible]. So instead what we want to do is to start from some random initialization of these parameters, and then make a guess about the affiliations of the words and topics. Once we have this guess, we can estimate the parameters based on the guess; we can update these parameters. Then we iterate this process: we make a new guess based on the updated parameters, and based on the new guess, we estimate the parameters again. When this process converges, we have all the parameters estimated. In practice we can usually add some pseudocounts to the observations. By pseudocounts we mean words for which we already know the topic affiliation. By adding pseudocounts, we can allow the user to guide the system to generate topics according to his prior knowledge.

So we have established that text is generated by a mixture of topics, but topics are not enough to capture the generation process of text, because topics themselves are affected significantly by the context. [inaudible] probably the oldest definition of context is the situation in which the text was written. Indeed, if we compare the topics in science nowadays to the topics in science 500 years ago, are we still working on the same topics? Probably not. If we were still working on the same topics after 500 years, I myself wouldn't choose such a career, because it's too hard. A computer scientist could write an article or submit a query about tree, root, and prune, and a gardener could also do so. Do they mean the same topic? Probably not; they probably mean different topics. In Europe, if you write an article reporting on a soccer match, you probably use the word football. What about in the States? Would you use the word football to describe a soccer game? Probably not. All of these are context information, and they tell us that text is generated according to the context.

So what we need to do is to incorporate the context information into those topic models. I will introduce a very general methodology to incorporate context such as simple context, like time and location; implicit context, like sentiment; and complex context, like social networks. The basic idea is this: remember that we have a black box which generates the text in a word-by-word manner. Now we want to contextualize those black boxes. Instead of one black box, we introduce different copies of the black box, each of which corresponds to a context. So we have a box for the year 1998, another box for the year 2008, a box for the United States, and another box for China. To generate the text, what we need to figure out is how to select among these copies of black boxes and how to model the structure of this context.
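Looping back to the estimation procedure described a moment ago (guess the topic affiliations of the words, re-estimate the parameters, iterate), here is a minimal PLSA-style EM sketch on toy data. It is a generic illustration of the E-step/M-step idea under my own simplifications, not the exact model or code from the talk; the same machinery extends to the contextualized copies of the black box discussed below.

```python
import random
from collections import defaultdict

docs = [["apple", "fruit", "apple", "juice"],
        ["court", "judge", "apple", "court"]]
K = 2
vocab = sorted({w for d in docs for w in d})

random.seed(0)
# Random initialization of p(w|z) and p(z|d), then normalize.
normalize = lambda dist: {k: v / sum(dist.values()) for k, v in dist.items()}
p_w_z = [normalize({w: random.random() for w in vocab}) for _ in range(K)]
p_z_d = [normalize({z: random.random() for z in range(K)}) for _ in docs]

for _ in range(50):                       # EM iterations
    # E-step: guess the topic affiliation p(z | d, w) of every word.
    post = defaultdict(dict)
    for d, doc in enumerate(docs):
        for w in set(doc):
            scores = [p_z_d[d][z] * p_w_z[z][w] for z in range(K)]
            s = sum(scores)
            post[d][w] = [x / s for x in scores]
    # M-step: re-estimate parameters by counting, weighted by the guess.
    new_w_z = [defaultdict(float) for _ in range(K)]
    for d, doc in enumerate(docs):
        counts = defaultdict(int)
        for w in doc:
            counts[w] += 1
        totals = [0.0] * K
        for w, c in counts.items():
            for z in range(K):
                new_w_z[z][w] += c * post[d][w][z]
                totals[z] += c * post[d][w][z]
        p_z_d[d] = {z: t / sum(totals) for z, t in enumerate(totals)}
    p_w_z = [normalize(t) for t in new_w_z]

for z in range(K):
    print(sorted(p_w_z[z].items(), key=lambda kv: -kv[1])[:3])
```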
In the inference process, what we need to figure out is how to estimate the contextual model in each black box, and, by comparing the black boxes for different contexts, how to reveal contextual patterns. For example, we can estimate two models, one for the year 2008 and one for the year 1998. If we compare the word distributions, we can see that although they are both talking about Harry Potter, they were talking about the book Harry Potter ten years ago and they are talking about the movie Harry Potter recently, because in the year 1998 there wasn't a movie about Harry Potter. This kind of comparison is very interesting as a contextual pattern.

Let's begin with simple context in text. By simple context we mean those situations that are pretty easy to extract, like time, location, the author of the article, the source, or any [inaudible] situations such as whether the query has the word price, whether the query has [inaudible], or whether the query follows another query. All of these are very easy and testable situations; we call them simple context. The basic idea is this: instead of just one topic model, we contextualize the topic models. We make different copies of the topic model for every context. To generate a word in a document, we first choose a context according to some context distribution. Once we have chosen the context, the rest is easy: we just generate the word based on the corresponding topic model. To generate another word, we do this again: we choose the context first, and based on the context, we use the corresponding copy of the topic model to generate the word. If we look at these distributions, the topic distributions and word distributions, they are all conditioned on the context information. And by comparing these conditional distributions across different contexts, we can reveal very different contextual topic patterns.

I will only show one example, which we have already seen before. This contextual topic pattern is called a topic lifecycle. We get it by comparing the distribution of a topic given different times, so the context we use here is the time information, and we can extract hot topics in the sigma [phonetic] literature. This kind of result is very interesting for beginning researchers or for graduate students who want to keep track of what's going on in the literature and pick a hot topic to work on.

Let's look at a more complex case: can we model implicit context in topic models? By implicit context, we mean situations that require us to digest the text to extract. For example, sentiment. There are also other examples of implicit context, like the intent of the user: whether he wants to buy something, whether he wants to advertise something; or whether the content of the document is trackable, or whether the content of the document makes a high impact. The basic idea is that we need to infer these situations from the data, hopefully with some prior knowledge from the user. Using a similar graph, we can show how we incorporate implicit context into topic models. Remember that we don't know the affiliation of every topic with every context, so what we do is use some distributions of context over words to infer this connection. If we don't have this context-word distribution, it is exactly the same as the model we used for the simple context. But now that we have this distribution of context over words, we can use it to infer the connection between the documents and the context.
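Before getting to where those context-word distributions come from (taken up next), here is a rough sketch of the context-conditioned generation described above: choose a context, then generate from that context's copy of the topic model. The contexts, topics, and probabilities are invented for illustration only.

```python
import random

# One copy of the topic model per context (here, year), mirroring the
# "contextualized black boxes" idea. All numbers are made up.
contexts = {
    1998: {"harry potter": {"book": 0.6, "novel": 0.4}},
    2008: {"harry potter": {"movie": 0.5, "film": 0.3, "book": 0.2}},
}
context_dist = {1998: 0.5, 2008: 0.5}    # p(context | document)

def sample(dist):
    r, total = random.random(), 0.0
    for item, p in dist.items():
        total += p
        if r <= total:
            return item
    return item

def generate_word(topic="harry potter"):
    c = sample(context_dist)              # 1) choose a context
    return c, sample(contexts[c][topic])  # 2) use that context's topic model

print([generate_word() for _ in range(5)])
```

Comparing the estimated word distributions across the two contexts is exactly the kind of contextual pattern (book versus movie) mentioned above.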
We can usually get these distributions from some training data, from some guidance from the user, or from some prior knowledge such as [inaudible] structure. The basic idea is to add this distribution as a prior for the topic model. So instead of maximizing the data likelihood, we are now maximizing the posterior. In practice we usually handle this by adding pseudocounts to the [inaudible]. This methodology is also very powerful. By modeling implicit context such as sentiment, we can extract summarizations of specific opinions. For example, we can tell, when people are talking about the movie Da Vinci Code and they have a positive opinion, what they say, and when people say negative things about the book Da Vinci Code, what they say. In practice, we allow the user to input keywords like these to guide the system to generate the topics he expects to see. But if he doesn't input any keywords, that's totally okay; we will then fully listen to the data. We will still generate meaningful topics, which may or may not be similar to these keywords. So we provide flexibility between two extremes: allowing the user to guide the system completely, or fully listening to the data.

Another example is modeling the complex context of text. By complex context, we mean the structures of the context. Since there is so much context information, there is usually a natural structure among these contexts. For example, the time context follows a linear order. If we look at locations, every state has its neighboring states, and every state is connected to other states by the interstate highways. If we look at users, there is usually a tree structure of users, and more generally we can usually find a social network of people. By modeling the complex context in text, we can reveal novel contextual patterns. We can regularize the contextual models so that those magic black boxes won't go wild. We can also alleviate the data sparseness problem: if we don't have enough data for one context, we can get help from the other contexts in the same structure.

The basic idea is this. We can add a regularization term based on the contextual structure to the topic model. If we don't model the contextual structure, it's essentially the model we used to handle the simple context. But once we have the contextual structure, we can incorporate many important intuitions into the topic models. For example, if contexts A and B are closely related in this structure, we can say that the model for context A and the model for context B should be very similar to each other. And formally, we can add these intuitions as regularization terms to the objective function. So instead of maximizing the data likelihood, we are now maximizing the regularized data likelihood. There can be many interesting instantiations of this intuition. For example, [inaudible] in the same building [inaudible]: if I'm looking for pizza and you also like to have pizza, we may end up meeting at the same Pizza Hut across the street. Collaborating researchers work on similar things; that's why we're all sitting here to listen to this talk instead of some [inaudible] talks. And topics in sigma [phonetic] are like the topics in [inaudible], because the two conferences are closely related to each other.

>>: Is there a paper on that one?

>> Qiaozhu Mei: Yes. This one is actually one of the papers. Yeah.
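To make the regularized likelihood idea above slightly more concrete, here is a schematic form of the objective in my own notation; it is a sketch of the general shape, not necessarily the exact formulation used in the talk.

```latex
% Data log-likelihood minus a penalty defined over the context structure:
% E is the set of edges in the structure (e.g., the social network),
% w_{uv} is the strength of the tie between contexts u and v, and
% \Delta measures how different their contextual models are.
\[
  O(C;\Lambda) \;=\; \log p(C \mid \Lambda)
  \;-\; \lambda \sum_{(u,v)\in E} w_{uv}\, \Delta\!\left(\theta_u, \theta_v\right)
\]
```

Maximizing this trades off fitting the data against keeping the models of closely related contexts similar, which is the intuition stated above.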
So as an example, if we leverage the context of social networks, we can extract topical communities from the scientific literature. For example, we can extract topics like information retrieval, data mining, machine learning, and the Web. We can also discover communities like the machine learning community, the data mining community, and the information retrieval community. And we can see that people in the same topic or community also collaborate a lot with each other.

So I have pretty much introduced the general framework of contextual text mining and how we incorporate contexts into generative models of text by modeling simple context, implicit context, and complex context. We have a lot of work on individual applications of this framework, but I want to choose two applications to show how effective this general methodology can be in Web search and information retrieval.

>>: Can I ask just one question?

>> Qiaozhu Mei: Sure.

>>: How do you -- on the regularization term, how do you specify what it means to say two models should be like each other? Is there a function that says model 1 is a function of -- these are both functions of a common model, or what?

>> Qiaozhu Mei: I will cover that in one of the examples. We can introduce a graph-based regularization to control the difference between the two models for different contexts. I will introduce [inaudible] in an application you will see later.

So the first application I chose is personalized search in the Web search domain. We can see how the context information about the user can help with search. Remember this ambiguous query, MSR, which could easily be Microsoft Research or could be Mountain Safety Research. What we want to do is to disambiguate this query based on the user's history. But we probably don't have enough data for everyone, so we don't want to break everything down to individual users. Instead we want to back off to classes of users. As a proof of concept, we define the context as classes of users defined by [inaudible]. It's arguable whether [inaudible] is the best choice here; we could potentially do better by using contexts like demographic variables, or we could back off to the users who click like me, which means we could do collaborative search based on friendship.

This shows why personalized search is an instantiation of contextual text mining. The text information we're looking at is just query logs, so every record has three entries: the IP address of the user, the query that the user submitted, and the URL that the user clicked on for this query. The context information we model is the users, or groups of users, defined by IP addresses. The smallest context corresponds to the individual user identified by all four bytes of the IP address. Then we define a larger context containing the users who share the first three bytes of the IP address, an even larger context containing users who share the first two bytes, then the first one byte, and then everyone in the world as the largest context. The generation model of text we want to look at is the distribution of URLs given the query. If we can estimate a good model of this, we can probably predict whether the user will click on this URL or that URL based on the query he submits. And by incorporating the user information, we want to estimate a contextualized model of the probability of a URL given the query and the user.
The goal is to estimate a better distribution of URL given the query and the user so that we can better predict the future: we can better predict what URL the user is going to click on. But wait a moment. What do we mean by a better distribution? We introduce the entropy of the distribution as the evaluation metric for the goodness of the distribution. The entropy of the URL distribution models the difficulty of encoding information from the distribution, so we can consider the entropy of the distribution as a metric for the size of the search space, or for the difficulty of the task if you want to predict the next click. Entropy is a powerful tool for [inaudible] opportunities, from which we can tell how hard Web search is and how much personalization could help. To measure whether we can use the history to predict the future, we can use cross entropy: we estimate the model from the history and compute the entropy based on the future observations. So we can use cross entropy to measure whether our model can better predict the future.

As some intuitive examples, by the definition of entropy, we can divide queries into easy queries and hard queries. For instance, MSR is a hard query. If you look at the distribution of the clicks, there are so many answers and every answer is almost equally likely to be clicked on, so it's pretty hard to decide which one you want to put at the top. So it's a hard query. Another query, Google, is an easy query, because the distribution of the clicks has low entropy. There aren't so many answers, and it's almost certain which answer gets almost all the probability, so it's very easy to decide which one to put at the top. By incorporating the user context, we want to turn a hard query into an easy query.

And here are some findings. We estimated the entropy from a very large query log database from Live Search. We estimated the entropy of the variables query, URL, and IP address one at a time, the joint entropy of the variables two at a time, and then three at a time. Then we can estimate the difficulty of traditional search. What is that? It's modeled by the conditional entropy of URL given the query, which is only 2.8 bits, which means that we can usually find the results we want in the top ten pages. And what about personalized search? We can estimate the difficulty of personalized search as the conditional entropy of URL given query and IP address, which is only 1.2 bits. And this is huge. This means that personalization could potentially cut the entropy in half, which brings a large opportunity to improve Web search. We can also look at the story in query suggestion [inaudible] traditional search: if we don't know anything about the user, how well can we predict the next query of the user? That corresponds to the entropy of the query, which is 21 bits -- pretty difficult. But what if we know the user, what if we know the IP of the user? We can reduce the conditional entropy to just 5 bits. Once again personalization cuts the entropy, and this time by much more than half. Again, this means a huge opportunity to improve Web search. Of course, this only tells us the potential benefit we could bring with a model that only God knows, but we can always estimate a model from the history. We introduce a model of URL given the user and query by interpolating five different language models. So we have five black boxes.
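As a toy illustration of the entropy analysis above (the quoted numbers such as 2.8 bits come from a large Live Search log, not from anything like this sketch), here is how conditional entropy and cross entropy could be computed from click records. All records and the 0.5 pseudo-count for unseen URLs are my own illustrative choices. The five component models themselves are described next.

```python
import math
from collections import Counter

# Toy click records: (query, clicked URL).
clicks = [("msr", "microsoft.com/research"), ("msr", "msrracing.com"),
          ("msr", "microsoft.com/research"), ("google", "google.com"),
          ("google", "google.com"), ("google", "google.com")]

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# H(URL | query): entropy of each query's click distribution, weighted by
# how often the query occurs. Low values mean an "easy" query.
by_query = {}
for q, url in clicks:
    by_query.setdefault(q, Counter())[url] += 1
n = len(clicks)
h_url_given_query = sum(sum(c.values()) / n * entropy(c)
                        for c in by_query.values())
print(f"H(URL|query) = {h_url_given_query:.2f} bits")

def cross_entropy(history_counts, future_counts):
    """How well a distribution estimated from history predicts the future
    (0.5 pseudo-count for URLs unseen in the history)."""
    h_total, f_total = sum(history_counts.values()), sum(future_counts.values())
    return -sum(c / f_total * math.log2(history_counts.get(u, 0.5) / h_total)
                for u, c in future_counts.items())

history = Counter(u for _, u in clicks[:4])   # toy split of the log
future = Counter(u for _, u in clicks[4:])
print(f"cross-entropy(history -> future) = {cross_entropy(history, future):.2f} bits")
```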
One black box corresponds to the model estimated for the individual user, another corresponds to the user group sharing the first three bytes of the IP address, another to the first two bytes, another to the first one byte, and the last one to everyone in the world. Then we interpolate these models to get the distribution of URL given the user and query. If we only use this component, we are doing full personalization, so that every user has a different model, but we may run into the problem of data sparseness. Whereas if we only use this component, we are not doing personalization at all, so all users share the same model, and we miss the opportunity to personalize search. What we want is something in the middle. We want to do personalization, but also back off to users who click like me; we don't want to break everything down to individual users. And we can estimate the lambda parameters, the weights of every component, using the EM algorithm.

So we can see that a little bit of personalization is actually better than too much personalization, and it's also better than too little. If we only use the full four bytes of the IP address, we run into the problem of sparse data. And if we don't use the IP address at all, or if we rely on a context that is too large, we miss the opportunity of personalization. We can also use cross entropy to evaluate how well we can use the history to predict the future. In this plot, we have the cross entropy of the future given the history, based on no personalization, based on complete personalization, and based on personalization with backoff. The results differ because an IP address in the future may or may not have been seen in the history. As we can see from this chart, if we know every byte of the IP address in the history, which means we have enough data for everyone, then indeed complete personalization is the best. However, if we don't have enough data for everyone, if we don't observe some parts of the IP address, complete personalization is not as good as personalization with backoff. And we can see that in all cases personalization with backoff outperforms no personalization. In this example, if we know at least two bytes of the IP address in the history, we can almost cut the entropy in half. So this is not something only God knows; this is, in practice, what we can do with the history to predict the future. Yes.

>>: So when you're using cross entropy, you're penalizing [inaudible] entire distribution of the future.

>> Qiaozhu Mei: Yes.

>>: If you -- looking at things like ranking [inaudible] you probably care more about some subsets of the future as opposed to the entire distribution? Because there you'll get disproportional effects from the more rare parts of it and so on. Have you looked at possibly doing [inaudible] for different layers that just care about the ranking-type tasks, where you care about just some subset?

>> Qiaozhu Mei: That's a very interesting suggestion. We haven't looked at that. We haven't looked at targeted query types or classes of queries. But we do look at other context information, which you can see later. It would be interesting to take a subset of the queries or a subset of the future clicks and estimate whether the entropy is larger or lower. It's definitely interesting; we haven't done that.

So the question is whether we can do better by using other types of context information, because it's really arguable whether the IP address is the best choice as the [inaudible] variable for the user.
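Returning to the backoff interpolation described at the start of this passage, here is a minimal sketch of mixing the five IP-prefix component models. The component probabilities and the fixed lambda weights are illustrative stand-ins; in the talk the weights are estimated with EM, and missing components here simply contribute nothing.

```python
def prefix_contexts(ip):
    """Contexts for a user: full IP, 3-byte, 2-byte, 1-byte prefix, global."""
    parts = ip.split(".")
    return [".".join(parts[:k]) for k in (4, 3, 2, 1)] + ["*"]

def p_url(url, query, ip, component_models, lambdas):
    """p(url | query, user) = sum_i lambda_i * p_i(url | query, context_i)."""
    total = 0.0
    for lam, ctx in zip(lambdas, prefix_contexts(ip)):
        model = component_models.get((ctx, query), {})
        total += lam * model.get(url, 0.0)
    return total

# Toy component models: (context, query) -> {url: probability}.
component_models = {
    ("131.107.1.1", "msr"): {"microsoft.com/research": 1.0},
    ("131.107", "msr"): {"microsoft.com/research": 0.8, "msrracing.com": 0.2},
    ("*", "msr"): {"msrracing.com": 0.6, "microsoft.com/research": 0.4},
}
lambdas = [0.3, 0.25, 0.2, 0.15, 0.1]  # one weight per backoff level (fixed here)

print(p_url("microsoft.com/research", "msr", "131.107.1.1",
            component_models, lambdas))
```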
Can we do better than IP addresses? Can we use other contextual variables like user ID, query type, click rate, or the intent of the query? Can we use other variables like demographic variables: age, gender, income? Can we use variables like time of day or day of week? We have done some very preliminary research on using other types of context information, such as day of week. We can see from this chart, comparing business days with weekends, that queries on business days have more clicks, but they are also easier queries. This means it makes sense to distinguish queries on business days from queries on weekends. We also looked at the context of hour of day, and we can see from this plot that the harder queries come in around 9 p.m., which is around TV time. So this is also a very interesting pattern, and it's potentially useful to distinguish queries by hour of the day. This is just a preliminary result, which shows that there's still huge potential to do better than IP addresses by incorporating other context information.

The second application I want to introduce is smoothing language models in information retrieval. The basic idea of language-modeling-based IR is like this. From a document you can estimate a document language model, which is essentially a multinomial distribution over words, and from a query you can also estimate a query language model. Then you can rank the documents based on the similarity between these two language models; for example, we can use negative [inaudible] divergence. But the problem is that every document only has very limited information, so the distribution we estimate by maximizing likelihood is usually not trustworthy, and it can cause some serious problems. So instead we want to use a smoothed version of the document language model, which is more robust and more accurate. Different language-modeling-based approaches vary in how they estimate the language model, which boils down to how they smooth the language model of the document. If they have a better strategy for smoothing the language model, they can usually yield better retrieval performance. A particular strategy in the literature is to interpolate the maximum likelihood estimate with some other language model estimated from the collection. There are two goals: one goal is to [inaudible] probability to unseen words, and the other goal is to estimate a more accurate distribution from sparse data. People have done a lot of work to satisfy this goal, but it's still not clear what we mean by an accurate distribution from sparse data.

Let's see that smoothing language models is also an instantiation of contextual text mining, where the text we are looking at is the indexed collection and the contexts are the documents themselves. We want to look at a contextualized generation model, which is [inaudible] words in the document. But what's not clear here is what kind of structure of this context we should use. The goal is to estimate a smoothed version of the language model so that we can get better retrieval performance, and we want to regularize the maximum likelihood estimate based on the structure of the context, where every context is a document. There is quite a bit of previous work on smoothing language models with a collection language model.
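As a minimal sketch of the language-modeling retrieval setup just described, here is query-likelihood ranking with simple collection-based (Jelinek-Mercer) smoothing, which is the kind of interpolation discussed next. The documents, query, and the 0.5 mixing weight are illustrative only.

```python
import math
from collections import Counter

docs = {"d1": "smoothing language models for retrieval".split(),
        "d2": "topic models for text mining".split()}
collection = Counter(w for d in docs.values() for w in d)
coll_total = sum(collection.values())

def smoothed_p(w, doc, lam=0.5):
    """p(w|d) = (1 - lam) * maximum likelihood estimate + lam * collection model."""
    tf = Counter(doc)
    return (1 - lam) * tf[w] / len(doc) + lam * collection[w] / coll_total

def score(query, doc):
    """Query likelihood; rank-equivalent to negative KL divergence when the
    query model is the empirical query word distribution."""
    return sum(math.log(smoothed_p(w, doc)) for w in query.split())

query = "smoothing retrieval"
print(sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))
```

Without the collection component, any document missing a query word would get probability zero and score negative infinity, which is one of the serious problems smoothing is meant to fix.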
The basic idea is to interpolate the maximum likelihood estimate with some reference model, usually estimated from the whole collection. So we have a maximum likelihood estimate of the language model, we have a reference [inaudible] model, and we somehow interpolate these two models to get the smoothed language model. There are many heuristics proposed to estimate such a reference language model. The simplest case is to collapse every document in the collection into one big document and estimate a collection language model from that big document. Then people proposed a better way: first cluster the documents in the collection and estimate a language model for each cluster, then interpolate the maximum likelihood estimate with the language model of the [inaudible] the document belongs to. But it's still not clear whether all the documents in the same cluster are equally close to the document itself. Yes.

>>: [inaudible] between the first model and the [inaudible] measure seems like we're just getting the most probability from the time [inaudible] the least [inaudible].

>> Qiaozhu Mei: Yes. Yeah. But it's under the language modeling approach, so it's not using TF-IDF, although people have some interpretations connecting the language-modeling-based approach to the TF-IDF models. But this is in the camp of language modeling approaches.

>>: What is the language model? You mean [inaudible]?

>> Qiaozhu Mei: Yes. A unigram.

>>: Oh, unigram. Right. So in that case the unigram, the [inaudible] of the unigram is just TF, right?

>> Qiaozhu Mei: It's just TF. But there's no IDF there. So by adding in this smoothing term, you're actually incorporating some component related to IDF. We're not looking at bigram models or n-gram models, because people have found that in traditional ad hoc IR, the unigram matching model works pretty well.

So I have introduced the other method, which clusters the document collection first and then uses the cluster language model to smooth the language model of the document. There's yet another heuristic: look at the nearest neighbors of the document, estimate a language model based on the nearest neighbors of the target document, and then interpolate that model with the maximum [inaudible] estimate. All these different heuristics give better results than not smoothing, or than smoothing [inaudible] with the background. But there are also problems with these existing methods. For example, if we smooth with the global background, we're ignoring the collection structure; we're not leveraging the structure of the documents. If we smooth with document clusters, we're ignoring the local structure inside clusters, and we're not sure whether all the documents in the same cluster contribute equally to the target document. On the other hand, if we smooth using the nearest neighbor documents, we're ignoring the global structure. In each case we haven't really leveraged the full power of the structure of the context. And although there are so many different heuristics and different ways of interpolation, there's still no clear objective function for optimization. You don't know what these methods are optimizing, and there's no guidance on how to further improve the existing methods.
So the research [inaudible] here includes: what is the right structure of the context we should use, what are the criteria for a good smoothing method, and what do we actually mean by an accurate language model here? We also want to answer what [inaudible] optimizing by smoothing those document language models, and whether there could be a general optimization framework rather than all these heuristics.

So we introduce a novel and general view of smoothing based on a graph structure of the documents. Suppose we have a graph of documents as the structure, either from some link structure or from some similarity computation over the documents. What we want to do is to project this graph onto a hyperplane. Then what is a language model? The [inaudible] of a word given a document actually makes a point above this hyperplane. So if we look at the language models of different documents, they make surfaces on top of this hyperplane, on top of this graph structure. Of course, if we only rely on maximum likelihood estimates, this surface could be very rough, because we don't have enough data in every document; the data is sparse. What we want to get by smoothing is a smoothed surface on top of the graph structure, so that smoothed language models are equivalent to smoothed surfaces on top of this graph structure. This gives an interpretation of what smoothing is actually doing for language models.

Then we have two intuitions. The first intuition is that we want to stay [inaudible] to the maximum likelihood estimates; we don't want these two surfaces to differ too much. The other intuition is to find smooth surfaces on top of this graph; we want a smooth version of this rough surface. And interestingly, this covers many existing heuristic methods as special cases, in terms of what kind of graph they use for smoothing the language models. The general case is to find smooth surfaces on top of this general graph structure of documents. The [inaudible] using nearest neighbors is actually [inaudible] on top of this graph; this is the local graph of the target document, which only contains the nearest neighbors. What about smoothing with document clusters? This is actually equivalent to finding smooth surfaces on top of this forest, where we introduce a pseudo-document for every cluster and then connect every document in the cluster with the pseudo-document. And finally, what is the connection to smoothing the language model using the global structure? We can see that it's equivalent to finding smooth surfaces on top of this [inaudible] graph, where we make a pseudo-document for all the documents in the collection and then connect this pseudo-node with all the documents in the collection. So our framework covers all the existing heuristics as special cases.

And based on these intuitions, we can formally define an objective function for smoothing language models. Remember that the maximum likelihood estimates make a rough surface, and what we want is a smoothed surface. We first introduce a weight for every vertex of the graph, such as the degree of the vertex or some other metric for the document. We then introduce a weight on every edge between two documents, such as the similarity of the two documents.
Then we can formally define one component to control the fidelity of the smoothed language models to the maximum likelihood estimates. Similarly, we can define another component which controls the smoothness of the language models over the surface: this part is essentially the difference between the language models of two connected vertices, two documents, [inaudible] on this graph. Then we can introduce the objective function, which [inaudible] two parts: this part corresponds to the intuition that we want to keep fidelity to the maximum likelihood estimates, and this part controls the smoothness regularization on top of the graph structure. By finding the tradeoff between these two components, we satisfy the two intuitions at the same time.

We also propose an algorithm to solve this language model smoothing problem on top of the document graph. We first construct a [inaudible] neighbor graph of documents, defining the weight of an edge as the cosine similarity of the two documents, and then compute the importance weight of every document as its degree in the graph. Then we apply this updating formula for the language model, and we apply the updating formula repeatedly, iterating until it reaches convergence. Once we have a converged language model, we add an additional [inaudible] smoothing, because we still want to avoid zero [inaudible] in the language models, and finally we get a smoothed language model.

So you may ask, what is this updating formula really doing? Is there an intuitive interpretation of this updating formula? We can actually interpret this updating formula with a random walk. We can rewrite the updating formula in a format which corresponds to computing the absorption [inaudible] of this kind of random walk over documents. We have transition [inaudible] from a document to a positive state, from a document to a negative state, and from one document to another document. Then this updating formula is essentially computing the absorption [inaudible] from a document into the positive state. Intuitively, if you don't know whether you want to write a word in a document, what you do is a random walk on this document Markov chain, [inaudible] as your neighbors do, and write down the word if you eventually reach the positive state. So this is an intuitive interpretation of the updating formula.

We evaluated the algorithm derived from this general framework using standard TREC collections and standard TREC queries. We have four collections which contain up to 500,000 documents. We compared this algorithm with the state-of-the-art algorithm which uses the whole collection as the reference model to estimate the smoothed language model, and with another method using clusters as the document structure to smooth the language model. We evaluated our method using mean average precision (MAP), defined on the ranked list of documents, and we can see that our algorithm outperforms both baselines significantly.
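Pulling the algorithmic pieces above together, here is a rough sketch of graph-based smoothing: iterate so that each document's language model stays close to its maximum likelihood estimate while agreeing with its neighbors on a similarity graph, then apply a final smoothing against the collection to avoid zeros. The documents, edge weights, the specific update rule, and the final Dirichlet-style step are my own simplified stand-ins; the exact formula and weights in the talk may differ.

```python
from collections import Counter

docs = {"d1": "graph based smoothing of language models".split(),
        "d2": "smoothing language models on document graphs".split(),
        "d3": "topic models for text mining".split()}
# Edge weights, e.g. cosine similarities of the documents (made up here).
edges = {("d1", "d2"): 0.8, ("d1", "d3"): 0.1, ("d2", "d3"): 0.1}

def neighbors(d):
    for (u, v), w in edges.items():
        if d == u:
            yield v, w
        elif d == v:
            yield u, w

vocab = sorted({w for t in docs.values() for w in t})
p_ml = {d: {w: Counter(t)[w] / len(t) for w in vocab} for d, t in docs.items()}
p = {d: dict(dist) for d, dist in p_ml.items()}   # start from ML estimates

alpha = 0.5                                       # smoothness vs. fidelity
for _ in range(20):                               # iterate toward convergence
    new_p = {}
    for d in docs:
        nbrs = list(neighbors(d))
        total_w = sum(w for _, w in nbrs)
        new_p[d] = {w: (1 - alpha) * p_ml[d][w]
                       + alpha * sum(wt * p[n][w] for n, wt in nbrs) / total_w
                    for w in vocab}
    p = new_p

# Final smoothing against the collection model to avoid zero probabilities
# (a simplified stand-in for the additional smoothing step mentioned above).
coll = Counter(w for t in docs.values() for w in t)
mu, coll_total = 1.0, sum(coll.values())
p_final = {d: {w: (p[d][w] * len(docs[d]) + mu * coll[w] / coll_total)
                  / (len(docs[d]) + mu) for w in vocab} for d in docs}
print(sorted(p_final["d3"].items(), key=lambda kv: -kv[1])[:5])
```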
So as a summary of this talk, I have introduced a new paradigm of text mining, contextual text mining, which treats context information as a first-class citizen. I've also introduced a general methodology for contextual text mining by incorporating context information into generative models of text. Then I introduced two applications of contextual text mining, in Web search and information retrieval, to show that the general framework is very effective in solving real-world problems. The takeaway message here is that with rich context information in text mining, we have guidance; it is much easier to find the gold in the big mountain.

This is the roadmap of my work. I have been concentrating on contextual text mining and on information retrieval and Web search. I've pretty much covered the work on adding context information into probabilistic topic models. I also have other work which does not correspond to contextual text mining but specifically to information retrieval and Web search, which includes the [inaudible] model for language-model-based information retrieval, generating impact-based summaries for scientific literature, and making use of the query-URL [inaudible] graph to generate query suggestions. In my future work, first I'm interested in continuing the work on contextual text mining, by working through the [inaudible] framework, because there are still many computational challenges, and we still don't know what a good model is for the structure of contexts, such as other contextual structures. I also want to work towards applications by designing task support systems for different types of users, such as Web users, business users, and others. More importantly, I want to leverage the power of contextual text mining to enhance Web search by providing so-called contextualized search, including personalized search, intent-based search, and other paradigms of Web search based on different context information. And then I'm also interested in integrative analysis of heterogeneous data. For example, we have Web 2.0 data, we have data from the users, we have data from the search logs, we have data from [inaudible] logs. Can we really integrate these data to find information that is very hard to find from any particular dataset? Okay. And I want to stop here for questions. Thanks.

[applause]

>>: You talked about context in text, and I note that a lot of text is generated by humans for other humans to consume, so most of the text you're talking about [inaudible] is probably from [inaudible] sources. So I always notice the word context [inaudible], and traditionally in linguistics I think the biggest context is actually the context within the text itself, in addition to the things that you're talking about. So I'm very curious: you're searching for context in the IR, data mining, and machine learning communities, but why haven't you touched on the linguistics community and natural language processing?

>> Qiaozhu Mei: So I think the context defined in linguistics is usually the so-called linguistic context. We actually borrow probably the oldest definition of context, which is the [inaudible] in which the text was written, which includes the linguistic context and other contexts like the cognitive context and the context of the environment.

>>: Those are two high levels. Can you come down a little bit, I mean, just like a -- for example, have you thought about the syntactic structure, or semantic structure, all that stuff?

>> Qiaozhu Mei: Yes. We have done some work on that. We actually leverage the local context, for example, to generate annotations for frequent patterns or to generate labels for topic models.
But for the kind of problems I have introduced, we find a good fit between these problems and these kinds of contexts. And for that kind of context, I don't know; I'm not sure whether it would help as much as this context.

>>: [inaudible] is that going beyond [inaudible] but then you alluded that somehow things beyond unigrams don't work?

>> Qiaozhu Mei: The generally accepted fact in the IR community is that n-gram models don't work much more effectively than unigram models, because when you go to n-grams, you also make the data much sparser. So it becomes even more challenging to estimate an [inaudible] or accurate language model. So in empirical experiments, people didn't really observe a much larger boost in retrieval performance.

>>: [inaudible] you need a [inaudible]?

>> Qiaozhu Mei: [inaudible] yes, you know, but it's harder to do that. And it doesn't really bring in additional benefits. Yes.

>>: So I'd like to take a shot at this: I think it depends a lot on the task. If the task is to predict the next word, then it's obvious that bigrams are much better than unigrams. But if the task is to predict relevance to a query, then it's not as clear. However, I think if you were to, say, bring in another data source, like let's say what the users are interested in versus the authors, so the users' interest would be, say, clicks, and the authors' interest would be documents, and if you wanted to know relevance to a query, then the combination of users' interests and authors' interests is very meaningful, but the bigrams are less meaningful. So what we have here is a framework to say, for a particular task, we can predict the entropy of an output variable Y, which could be the next word or the relevance to a query, from a bunch of inputs X, and we can make precise the question of which features are useful for which task. And then you can address the question with content, instead of getting hot under the collar about whether this feature is more useful than that feature for this task or that task.

>>: I totally understand that. But he's also talking about it in the context of smoothing and document clustering.

>>: Well, he's talking about smoothing for a particular task, like say relevance to a query.

>>: Well, in the beginning, I think I heard about the topic --

>> Qiaozhu Mei: Yes. Topics --

>>: [inaudible] the way the hierarchy defines topics. It's more like text classification.

>> Qiaozhu Mei: Right.

>>: And for that task I think bigrams is not so obvious.

>>: So obvious?

>>: It isn't so obvious that bigrams are that powerful for that. For guessing the next word it's very obvious that bigrams help.

>>: Have you tried?

>> Qiaozhu Mei: People have tried in the literature, using bigram language models for topic models and bigram language models for retrieval, but the results, in terms of relevance and retrieval performance, don't really help much.

>>: Whereas there are quite a few studies that show for predicting the next word bigrams [inaudible].

>> Qiaozhu Mei: Yes. But that is not the traditional task of IR, because we [inaudible] ranking of documents based on relevance to the query. So I think it also depends on the context of your setting.

>>: Well, the context theory is to really find out what context means, right?

>> Qiaozhu Mei: No, the question here is how to [inaudible] the performance of retrieval or Web search. So I think it's [inaudible] question.
>>: [inaudible] Web search, I found many of the pages are very [inaudible] like opposed to Twitter [inaudible] we found that actually -- I myself found that I [inaudible] computer, I myself [inaudible] difficulty understanding what the [inaudible] is talking about. I wonder whether the context can help [inaudible] to understand a very short page.

>> Qiaozhu Mei: That depends on what you mean by understanding the text. So it's still task dependent. If your goal is to provide a better understanding of the natural language, I don't know; we didn't really look at that problem. But if your goal is to find relevant information, for example, if you have a query and you want to find the relevant Twitter articles, the context information is definitely very helpful.

>>: [inaudible] we don't want to [inaudible].

>> Qiaozhu Mei: Yes.

>>: And suppose that we want [inaudible] and I want to put an article into the ODP categories; however, [inaudible] the context information is really hard [inaudible]. Do you have any study on how context [inaudible]?

>> Qiaozhu Mei: We haven't tried that exact setup. But definitely the context information can help with your task, because it enriches your feature space. If you can find training data with context information, then you can build a much more powerful model which leverages the context information in the Twitter articles. You can also make use of the context information to make connections between the individual articles in Twitter, so that you can put in intuitions like: if this article belongs to this category, then articles similar to this one should also belong to this category. By adding in these kinds of intuitions defined on context information, I think the task could definitely benefit from the context analysis. But the [inaudible] is still task dependent; I can't guarantee that context information is useful for all kinds of tasks. Right? Yes.

>>: First I want to comment: from your presentation, [inaudible] I didn't see any points that prohibit you from applying this to more general [inaudible], for example, when you ran [inaudible] use the [inaudible].

>> Qiaozhu Mei: Exactly. Exactly.

>>: Okay. Now, another question is about the aspect finding. Have you had any methods that can automatically estimate how many aspects there are in a document [inaudible]? For example, for the laptop [inaudible] you put out three aspects. Is that something you predefined, or automatically [inaudible] from the [inaudible]?

>> Qiaozhu Mei: Yes. There is some initial work done on that. If you're familiar with the body of work on topic modeling, you probably know the work by [inaudible] about the Chinese [inaudible] process. They provide some methods to automatically estimate the number of topics in the text. I don't want to comment on how well the method works, but what we do is to put this task in the framework of user-guided analysis. We allow the user to provide some guidance to the topic model. If he knows what type of topics he wants to look at, he can give that as guidance to the system. So, for example, if he knows that these topics are too high level, he can drill down the analysis by asking for a larger number of topics, and he can also input some keywords to say what kind of topics he's looking for. So we allow the user to provide guidance to the system instead of just estimating the number of topics automatically.
Because it's still not clear what a topic is really defined as. You can think of Web search as a topic if you're looking at the level of different research areas, but you can also think of query log analysis as a topic if you want to drill down to a lower level. So it's really unclear how many topics there are in the text if you don't have any prior knowledge from the user.

>>: [inaudible] the question is, suppose you have two [inaudible], one is about Dell, one is about [inaudible], and there's no guarantee [inaudible], so how do you align them together?

>> Qiaozhu Mei: Yes. In our work we define a larger context which contains both of the [inaudible], so we align the topics by leveraging the overlap of these contexts. So in this case we have three contexts: one context for Dell, another context for -- what did you say? [inaudible] -- and another context which contains both of the brands. So by leveraging the overlap of contexts, we can match them.

>>: So in [inaudible] also use something like a [inaudible] to minimize -- which is essentially [inaudible] so they want to minimize that to estimate the optimal number of [inaudible] so you're also like optimizing the likelihood of the [inaudible] do you see the similarity between your work and theirs?

>> Qiaozhu Mei: That's a particular way people estimate the number of topics in text. But, as I said, it's still unclear what you mean by the granularity of a topic. You can always find some metric that could help you define the number of topics, but whether those topics would be useful in practice is unclear. In our work, the maximum likelihood objective functions actually don't take any training data, so this probably answers your question. The [inaudible] study assumes that we have some holdout data and a topic model estimated on some training data; they want to maximize the likelihood of the holdout data based on the training data. So it's a different setup. Yes.

>>: So overall there are two fundamental types of context that you're dealing with: one is explicit, where you have things like age, gender, links, and so on, and the other is implicit or latent, where you're trying to infer things like topics, et cetera. And so there's a big concern, which I think the previous question touched on: where you're trying to exploit implicit context, you're creating effectively a whole separate problem which is independent from the main task you're solving. And it seems like there's a big danger there, in the sense -- I mean, from a learning-theoretic standpoint, there's a big danger that when you're trying to solve the main problem by creating a subproblem, you're making it more complicated. So, I mean, certainly there have been cases where it worked, but, in general, from your experience, do you think going forward there will be work on better models for implicit context that will give us more tools, or is it just better tools for dealing with explicit context that will capture everything there is, so that we don't need this extra layer of implicit context?

>> Qiaozhu Mei: That's an interesting question, and my answer is as follows. We don't want to motivate our research just to get a fancier model or to create a subproblem that requires more computational effort. What we are looking at is whether the context information is really useful.
So your comment is that some implicit context may or may not be useful, right? But I can give you two examples where implicit context is very useful in real-world tasks. One is sentiment. When you compare products, you don't really want to just track the number of reviews that people have written for a product; you still want to figure out whether they have positive opinions or negative opinions. In previous work we usually don't distinguish that: we see a spike in the discussion in blog articles [inaudible] and infer from that a spike in book sales. But if you don't distinguish the positive and negative opinions, you can't really do that. What if all the discussions are negative? You don't want to buy the book based on the negative discussions. Another example is the intent behind a Web search. You want to figure out whether the user has the intent of buying something, or wants to do research on some topic, or has some other intent. This kind of context we can't easily get from the text either; you still have to infer it from the data, you have to digest the data first to get the context. So I think these two examples of implicit context show how this category can be very useful in practice. It's not just an exercise in fancier models.

>>: [inaudible] actually I'll take issue with what you just said. So you're implying that there's this binary split of positive versus negative sentiment.

>> Qiaozhu Mei: Yes.

>>: And then there is this sort of fixed hierarchy of intents. Whereas in reality you could say, well, what if there is a much more complicated structure of opinions, where beyond positive and negative sentiment people will say, well, these are the good things in this aspect, these are the bad things in these aspects, so this goes back to the whole aspect issue. And then the question is, again, possibly by collapsing these into just a binary representation, well, maybe you're making your final task better, maybe worse. But, again, there is a concern that by placing those assumptions on the structure of the implicit context you may be losing information, or gaining it if that's all that matters for your final task and you're better off collapsing it. But I think there's a lot of room for future work. I think that's actually a nice thing about this: it does open up this whole set of issues of how to deal with it, what the context is. But it's a big concern, a big danger, to just assume that we can place a structure on an implicit context and that's what we're dealing with, without taking into account the whole structure -- there's a question of how many topics there are, how many types of opinions there are, and so on. So that's why.

>> Qiaozhu Mei: Yes, I totally agree with you. We are taking a big step to leverage the context information in text to make our lives better, but I'm not saying that we have completely reached that goal. By modeling the simple context, which is probably more mature at this stage, and also developing the techniques to model the implicit context, which is probably not at so mature a stage, we are taking a step towards that goal. Yeah. Thanks.

>> Li-wei He: Okay. Thank you.

[applause]