>> Li-wei He: It's my pleasure to welcome Qiaozhu Mei from UIUC. He's a Ph.D.
candidate in the department of computer science. His research interests include text mining
and information retrieval, especially context analysis with probabilistic language models
and topic models.
He has published extensively in these areas, and at both KDD 2006 and 2007 he actually
received the best student paper runner-up award. And he's also one of the recipients of
the Yahoo! fellowship. Okay.
>> Qiaozhu Mei: Okay. Thanks for the introduction and thanks everybody for coming to
my talk. I'm Qiaozhu Mei with the University of Illinois. And today I'm going to talk about
a new paradigm of text mining called contextual text mining with its applications to Web
search and information retrieval.
So text data is huge. If we look at the information on the Web, the text information, we
can always observe this pattern.
[inaudible] research articles and Wikipedia has six million articles. The research has
around 150,000 posts every day and blog [inaudible] takes five times as many as that.
Twitter has around 3 million messages coming in every day, and Yahoo! Groups has 10
billion messages.
When we talk about commercial search engines, how many Web pages are we collecting,
right? The new search engine that they were collecting over 100 billion messages, Web
pages. So text is huge. We can actually compare the text data on the Web with this huge
snowy mountain which is still growing every day, and we can also compare the text
mining process to the process of looking for gold in such a mountain. Then the
question is where should you start and where can you go. Life is hard, you don't have
any clues.
So it's time to bring in context information in text to help. By context we mean
[inaudible] of text. If we look at this particular article from Twitter, we can see how
many type of context information are there in such a short article.
We have the information about the author, we have the author's location, we have
author's occupation, and we have the time at which the article was written. We have the
source from which the article was submitted, and we have the language information of
this article. So all these situations are pretty easy to be extracted from the data. We call
them simple context.
There is another type of context which is much harder to get from the data. You need to
digest the text and figure out whether text is about positive opinion or negative opinion.
There is yet another type of context which is more complex than all of these such as
social networks in the text. Right?
So when we look at the rich text information on the Web, we can always find rich context
information as well. PPOP [phonetic] has papers from 73 years from 400,000 authors
and from 4,000 sources, which means computer science conferences and journals.
Wikipedia has 8 million contributors and more than 100 languages.
The research has 5 million users and 500 million URLs, and if we look at other types of
text information, we can always find rich context information as well.
So this is a huge opportunity. If we can make use of the rich context information in text,
can we do better text mining?
Then the question is with rich context information can we really do better than what we
are doing right now in text mining. If we still compare the text information to this huge
snowy mountain, the question is what is context doing here. We see that context is
actually the guidance for us to do better text mining, so instead of just this snowy
mountain, we now have the coordinates of this mountain. We now have the map of this
mountain, we now have a GPS system and we have the highways which take us to
everywhere in this mountain.
So in one word, when we have the context information, we have the guide which leads us
to the gold mine.
So in this slide and a few slides thereafter, I will introduce different applications of text
mining with context information to show what we can really do with context in text
mining.
In this example, if the text information is from the query logs and the context information
we used is the user information, we can achieve personalized search, where the basic idea
is that this -- if I input a query "MSR" to a commercial search engine, I'm actually
looking for Microsoft Research. But if you look at the top results, there's no Microsoft
Research. You have a lot of things like [inaudible] research, like medical simulation, like
mountain safety research, like metropolitan street racer, but there is no Microsoft
Research.
However, if you know the user, if you know the user is me, you should give me a much
better answer by providing me with Microsoft Research. So here's how the context
information of the user can help us by achieving personalized search.
For another example, if the text information is from customer reviews and the context
information is the brand of the laptops, we can extract comparative product summaries
like this. We can tell what people are saying about different aspects of different brands
of laptops. So with such a comparative summary, we can make smarter decisions on
which brand of laptop we should buy.
For another example, if the text information is from scientific literature and the context
information we use is the time, we can extract trends of research topics in literature. We
can tell what's hot in sigma [phonetic] literature, so if you want to publish something in
next year's sigma conference, here are some shortcuts.
For another example, if the text information is from blog articles and the context
information we use is the time and location, we can track the diffusion of topics in time
and space. So we can tell how the discussion about the topic spreads over time and
space.
Another example, if the text information is still from blog articles and we are modeling
implicit context like sentiments, we can generate the [inaudible] opinion summary
[inaudible] we can tell what is good and what is bad, for the movie of [inaudible] and for
the book of [inaudible] code. We can also compare these summaries and we can even
track the dynamics of these opinions.
And for another example, if the text information is from scientific publications and the
context is social networks, we can extract topical communities from the literature. We
can tell who works together and on what topic. We can extract communities like the
machine learning community, the data mining community, and the IR community from
the literature and the social network.
So we have introduced so many different applications of text mining with context
information. Then the general research question to ask is whether we can find a general
solution for all these problems. If we can, life becomes good because we can solve all
these problems in a unified way, we can even derive solutions for new types of text
mining problems.
And in this talk I will introduce a general solution for all these problems, which is called
contextual text mining. Contextual text mining is a new paradigm of text mining which
treats context information as a first-class citizen.
So as the outline of this talk, I will first introduce generative models of text as the general
methodology for text mining, then I will introduce how we can incorporate context
information into such generative models by modeling simple context, modeling implicit
context, like sentiments, and modeling complex context like social networks.
We have quite a few publications on these topics, but I will choose two applications of
contextual text mining to show how effective the general solution could be and how it
could be applied in Web search and in information retrieval.
Let's first look at the generative models of text. In many text mining and machine
learning problems, we usually assume that the text is generated from some hidden model
in a word-by-word manner. In other words, we assume that there is a magic black box
which produces the text in a word-by-word manner. And this is called the generation
process of the text.
What we want to do is to reverse this process. Based on the observation of text data we
have, we want to infer, we want to figure out what's going on in this magic black box.
We want to estimate the parameters of these variables.
So different assumptions lead to different generative models. It could be as simple as the
[inaudible] model axis, from the top-ranked words you can guess what the topic in this
magic black box is about. It could also be as complex as a mixture of many, many such
unigram models.
A particular assumption about the generation of text is that text is generated from a mixture
of topics. By topic we mean the subject of a discourse. So we can usually assume that
there are topics like data mining, machine learning, Web search and database in the computer
science literature, so we can assume that there are K such topics in a text collection. If
we look at a particular research paper, it could belong to more than one topic.
So from the document point of view, a topic is essentially a soft cluster of documents. And
from the word point of view, every topic corresponds to a multinomial distribution [inaudible]
model of words. From the top-ranked words, you can tell the semantics, or the meaning,
of each topic.
Okay. And the generative process can usually be explained by the so-called probabilistic
topic models, including the [inaudible] model and the LDA model.
The basic idea is like this: Assuming that there are two topics in the data collection, each
of which corresponds to the multinomial distribution of words, and we want to generate
every word in the document. What do we do? We will first choose the topic according
to some topic distribution of the document, and once we have selected the topic, we will
choose the word from the corresponding word distribution of this topic.
To generate another word, we'll do this process all over again. We will first choose the
topic according to the topic distribution, and once we have chosen the topic, we will
choose the word from the corresponding word distribution. If we repeat this process
again and again, we will generate all the observations of text data. And what we want to
do is to infer the topic distribution and the word distributions for every
topic from the data we observed.
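
A minimal sketch of the two-topic generation process described above. The topic names,
vocabularies, and probabilities are made-up toy values, not taken from the talk; only the
sampling order (first a topic, then a word from that topic) follows the description.

```python
import random

# Hypothetical toy parameters: two topics, each a multinomial over words.
word_dist = {
    "data_mining": {"mining": 0.5, "pattern": 0.3, "data": 0.2},
    "retrieval": {"query": 0.5, "ranking": 0.3, "data": 0.2},
}
# Hypothetical topic distribution of one document.
topic_dist = {"data_mining": 0.7, "retrieval": 0.3}


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_document(n_words, topic_dist, word_dist, rng):
    """Generate words one at a time: pick a topic, then a word from it."""
    words = []
    for _ in range(n_words):
        topic = sample(topic_dist, rng)
        words.append(sample(word_dist[topic], rng))
    return words


rng = random.Random(0)
doc = generate_document(10, topic_dist, word_dist, rng)
print(doc)
```

Inference runs this process in reverse: given only the generated words, estimate
`topic_dist` and `word_dist`.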
We usually do this by maximizing the data likelihood. We want to find the
parameters that maximize the [inaudible] of the data generated by the model.
For models like [inaudible] we can usually use the standard [inaudible] expectation
maximization algorithm. For more complex models, such as LDA, we can use other
inference methods like Gibbs sampling, like [inaudible] inference, like expectation
propagation.
The basic idea of [inaudible] is quite simple. So remember that we want to estimate all
these parameters. If we already know the affiliations of the data and the topic, if we
already know which word is generated by which topic, the [inaudible] is easy, you just
count the frequency of the words in the same topic and normalize them. Right?
But we don't have these [inaudible]. So instead what we want to do is to start from
some random initialization of these parameters, and then we make a guess about the
affiliations of the words and topics. Once we have this guess, we can estimate the
parameters based on the guess. We can update these parameters. And then we want to
iterate this process. We will make a new guess based on the updated parameters. And
based on the new guess, we will estimate the parameters again. And when this process
converges, we have all the parameters estimated.
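
The guess-and-update loop above can be sketched as EM for a topic mixture. This is an
illustrative sketch under assumptions, not the speaker's code: it assumes a PLSA-style
model with parameters p(topic | document) and p(word | topic), and the function name
`plsa_em`, the toy count matrix, and the fixed iteration count are all hypothetical.

```python
import random


def normalize(values):
    """Scale non-negative values so they sum to one (zeros stay zeros)."""
    total = sum(values) or 1.0
    return [v / total for v in values]


def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """EM for a PLSA-style model; counts[d][w] is the count of word w in doc d."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random initialization of p(topic | doc) and p(word | topic).
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iters):
        doc_topic = [[0.0] * n_topics for _ in range(n_docs)]
        topic_word = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: guess which topic generated word w in document d.
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    doc_topic[d][z] += counts[d][w] * post[z]
                    topic_word[z][w] += counts[d][w] * post[z]
        # M-step: count and normalize, using the guessed affiliations.
        p_z_d = [normalize(row) for row in doc_topic]
        p_w_z = [normalize(row) for row in topic_word]
    return p_z_d, p_w_z


# Toy collection: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 5, 4], [0, 0, 4, 5]]
p_z_d, p_w_z = plsa_em(counts, n_topics=2)
print(p_z_d)
print(p_w_z)
```

EM only finds a local maximum of the likelihood, so different seeds can give different
(for example, relabeled) topics.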
In practice we can usually add in some pseudocounts to the observation. By
pseudocounts we mean the words for which we already know the topic affiliation.
By adding the pseudocounts, we can usually allow the user to guide the system to
generate the topics that -- according to his prior knowledge.
So we have introduced that text is generated by a mixture of topics, but topics are not
enough to capture the generation process of text because topics themselves are affected
significantly by the context. [inaudible] probably the oldest definition of context as the
situations in which the text was written. Indeed, if we look at the topics in science
nowadays and topics in science 500 years ago, are we still working on the same topics?
Probably not. If we were still working on the same topics after 500 years, I myself
wouldn't choose such a career because it's too hard.
A computer scientist could write an article or could submit a query about tree, root, and
prune, and a gardener could also do so. Do they mean the same topic or a
different topic? Probably different topics.
In Europe if you write a report about a soccer game, you probably use the word
football. What about in the States? Would you use the word football to describe a soccer
game? Probably not. So all these are context information and it's telling us that text is
generated according to the context.
So what we need to do is to incorporate the context information in those topic models. I
will introduce a very general methodology to incorporate context such as simple
context, like time and location, implicit context, like sentiments, and complex context,
like social networks.
The basic idea is like this: Remember that we have a black box which generates the text
in a word-by-word manner. Now we want to contextualize those black boxes. Instead of
one black box we introduce different copies of the black boxes, each of which
corresponds to a context. So we have a box for the year 1998. We have another box for
year 2008. We have a box for United States, and we have another box for China. To
generate the text, what we want to figure out is how to select the black boxes from these
copies and how to model the structure of this context.
In the inference process, what we need to figure out is how we estimate the context, the
model in each black box, and by comparing the black boxes for different context, how
can we reveal contextual patterns.
For example, we can estimate two models for the year 2008 and for year 1998. If we
compare the word distributions, we can see that although they are both talking about Harry
Potter, they are talking about the book Harry Potter ten years ago and they are talking
about the movie Harry Potter recently. Because in the year 1998, there wasn't a movie
about Harry Potter. This kind of comparison is very interesting as contextual patterns.
Let's begin with simple context in text. By simple context we mean those situations that
are pretty easy to extract, like time, like location, like the author of the article, like
the source, or like any [inaudible] situations such as whether the query has the word price,
whether the query has [inaudible] and whether the query is following another query. All
these are very easy and testable situations. We call them simple context.
The basic idea is this: So instead of just one topic model, we contextualize the topic
models. We make different copies of the topic models for every context. To generate a
word in a document, we will first choose the context according to some context
distribution. And once we have chosen the context, the rest is easy. We just generate the
word based on the corresponding topic model.
To generate another word, we will do this again. We will choose the context first, and
based on the context, we use the corresponding copy of topic model to generate the word.
If we look at these distributions, these topic distributions and word distributions, they are
all conditioned on the context information. And by comparing those conditional
distributions based on different contexts, we can reveal very different contextual topic
patterns.
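
A minimal sketch of the contextualized generation step described above: one copy of the
topic model per context, and each word is generated by first choosing a context. The two
year contexts echo the Harry Potter example from the talk, but every probability, name,
and data structure here is a hypothetical toy value.

```python
import random

# Hypothetical: one copy of the topic model per context (here, a year).
models = {
    "1998": {
        "topics": {"harry_potter": 1.0},
        "words": {"harry_potter": {"book": 0.7, "rowling": 0.3}},
    },
    "2008": {
        "topics": {"harry_potter": 1.0},
        "words": {"harry_potter": {"movie": 0.6, "book": 0.2, "rowling": 0.2}},
    },
}
# Hypothetical context distribution.
context_dist = {"1998": 0.5, "2008": 0.5}


def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_word(context_dist, models, rng):
    """Choose a context, then use that context's copy of the topic model."""
    context = sample(context_dist, rng)
    model = models[context]
    topic = sample(model["topics"], rng)
    word = sample(model["words"][topic], rng)
    return context, topic, word


rng = random.Random(0)
samples = [generate_word(context_dist, models, rng) for _ in range(200)]
print(samples[:5])
```

Comparing `models["1998"]["words"]` with `models["2008"]["words"]` is the kind of
comparison of conditional distributions that reveals a contextual pattern.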
I will only show one example. We have already seen this before. The contextual topic
pattern is called topic lifecycle. We got that by comparing the distribution of topic given
different time, and the context here we use is the time information, so we can extract hot
topics in sigma literature. This kind of result is very interesting for beginning researchers
or for graduate students if they want to keep track of what's going on in the literature
and to pick up a hot topic to work on.
Let's look at a more complex case. Can we model implicit context in topic models? By
implicit context, we mean the situations that require us to digest the text to extract. For
example, sentiments. And there are also other examples of implicit context, like the intent
of the user, whether he wants to buy something, whether he wants to advertise something.
Or whether the content of the document is trackable or whether the content of the
document makes a high impact.
The basic idea is that we need to infer these situations from the data, hopefully with some
prior knowledge from the user.
Using a similar graph, we can show how we incorporate implicit context in the topic
model. Remember that we don't know the affiliations of every topic to every context, so
what we do is to use some distributions of context over words to infer this connection.
If we don't have this context-word distribution, it's exactly the same as the model we
use for the implicit context -- for the simple context, I'm sorry. But now that we have this
distribution of context over words, we can use it to infer the connection of the
documents and the context.
We can usually get these distributions from some training data or from some guidance
from the user or from some prior knowledge such as [inaudible] structure. The basic idea is
to add in this distribution as the prior for this topic model. So instead of maximizing the
data likelihood, we are now maximizing the posterior. And in practice we usually handle
this by adding in pseudocounts to the [inaudible].
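
One common way to realize this, assuming a conjugate Dirichlet prior on each topic's
word distribution, is to add pseudocounts in the M-step of EM. The symbols below are
assumptions introduced for illustration, not notation from the talk: c(w,d) is the count
of word w in document d, p(z_{d,w}=j) is the current guess that w in d came from topic j,
p(w | \tilde{\theta}_j) is the prior word distribution for topic j, and \mu is the
strength of the prior.

```latex
p^{(n+1)}(w \mid \theta_j) =
  \frac{\sum_{d} c(w,d)\, p(z_{d,w}=j) \;+\; \mu\, p(w \mid \tilde{\theta}_j)}
       {\sum_{w'} \sum_{d} c(w',d)\, p(z_{d,w'}=j) \;+\; \mu}
```

With \mu = 0 this reduces to the maximum likelihood update; a larger \mu pulls topic j
toward the prior.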
This methodology is also very powerful. By modeling implicit context such as
sentiments, we can usually extract summaries of specific opinions. For example, we
can tell when people are talking about the movie Da Vinci Code and they have positive
opinion, what will they say, and when people say negative about the book Da Vinci Code,
what will they say.
In practice, we allow the user to input keywords like this to guide the system to generate
the topics he expects to see. But if he doesn't input any keyword, it's totally okay. We
will then fully listen to the data. We will still generate meaningful topics, but they may or
may not be similar to these keywords. So we provide the flexibility between two
extremes to allow the user to guide the system completely or to fully listen to the data.
Another example is to model the complex context of text. By complex context, we mean
the structures of the context. Since there's so much context information, there's usually a
natural structure among these contexts. For example, the time context follows a linear
order, right? If we look at the locations, every state has its neighboring states, and every
state is connected to other states based on the interstate highways. If we look at the user,
there's usually a tree structure of users, and more generally we can usually find the social
network of people.
By modeling the complex context in text, we can reveal novel contextual patterns. We
can regularize the contextual models so that those magic black boxes won't go wild. We
can also alleviate the data sparseness problem. So if we have -- don't have enough
data for one context, we can usually get help from the other contexts in the same structure.
The basic idea is this. We can add the [inaudible] term based on the contextual structure
in the topic model. If we don't model the contextual structure, it's essentially the model
we use to handle the simple context. But once we have the contextual structure, we can
incorporate many important intuitions about the topic models. For example, if contexts
A and B are closely related in this structure, we can say that the model for context A and
the model for context B should be very similar to each other.
And formally we can add those intuitions as a regularization term to this
objective function. So instead of maximizing the data likelihood, we are now
maximizing the regularized data likelihood. And there can be many interesting
instantiations of such an intuition. For example, [inaudible] in the same building
[inaudible] if I'm looking for pizza and you also like to have pizza, we may end up
meeting at the same Pizza Hut across the street.
Collaborating researchers work on similar things. That's why we're all sitting here to
listen to this talk instead of just some [inaudible] talks. And topics in sigma are like the
topics in [inaudible] because the two conferences are closely related to each other.
>>: Is there a paper on that one?
>> Qiaozhu Mei: Yes. This one is actually one of the papers. Yeah. So as an example,
if we leverage the context of social networks, we can extract topical communities from the
scientific literature. For example, we can extract topics like information retrieval, data
mining, machine learning and Web. We can also discover communities like the machine
learning community, data mining community, and information retrieval community. So
we can see that people in the same topic or community also collaborate a lot with each
other.
So I have pretty much introduced the general framework of contextual text mining and
how we incorporate contexts in generative models of text by modeling simple context,
implicit context, and complex context. We have a lot of work on individual applications
of this framework, but I want to choose two applications to show how effective this
general methodology can be in Web search and information retrieval.
>>: Can I ask just one question?
>> Qiaozhu Mei: Sure.
>>: How do you -- on the regularization term, how do you specify that what it means to
say two models should be like each other, is there a function that says model 1 is a
function of -- these are both functions of a common model or what?
>> Qiaozhu Mei: I will introduce in one of the examples, as we can see that we can
introduce the graph-based regularization to control the difference between the two
models for different context. I will introduce in [inaudible] in this application you can
see later.
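
One possible form of such a graph-based regularizer, given here as an assumption rather
than the exact objective used in the talk: let L(C) be the log-likelihood of the
collection C, let E be the edges of the context graph G with weights w(u,v), and let
p(\theta_j | u) be the weight of topic j in the model for context u.

```latex
O(C, G) = (1-\lambda)\, L(C)
  \;-\; \frac{\lambda}{2} \sum_{(u,v) \in E} w(u,v)
        \sum_{j=1}^{k} \bigl( p(\theta_j \mid u) - p(\theta_j \mid v) \bigr)^2
```

Maximizing O trades data fit against smoothness over the graph: the penalty is small when
closely connected contexts have similar models, and \lambda = 0 recovers the unregularized
likelihood.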
So the first application I chose is personalized search in the Web search domain. We can
see how the context information of the user can help with search.
Remember this ambiguous query, MSR, which could easily be Microsoft Research or
could be mountain safety research, right? What we want to do is to disambiguate this
query based on the user's prior clicks. But we probably don't have enough data for everyone,
so we don't want to break everything down to individual users. Instead we want to back
off to classes of users.
As a proof of concept, we introduce the context as classes of users defined on [inaudible].
It's arguable whether [inaudible] is the best choice here, but we can potentially do better
by using context like demographic variables, right, or we can back off to the users who
click like me, which means that we can do collaborative search based on the friendship.
So this shows why personalized search is an instantiation of contextual text mining. The
text information we're looking at is just query logs, so every record has three entries: the IP
address of the user, the query that the user submitted, and the URL that the user
clicked on based on this query.
The context information we model is actually the users or groups of users defined by IP
addresses. So the smallest context corresponds to the individual user identified by the
four bytes of the IP address. The next larger context contains the users who
share the first three bytes of the IP address. Then an even larger context contains users
sharing the first two bytes of the IP address, then the first one byte, and then everyone in
the world as the largest context. And the generative model of text we want to look at is the
distribution of URL given the query. Imagine that we can estimate a good model of this,
we can probably predict whether the user will click on this URL or on that URL based on
the query he submits.
And by incorporating the user information, we want to estimate the contextualized model
of the probability of URL given the query and the user. The goal is to estimate a better
distribution of URL given the query and the user so that we can better predict the future. We
can better predict what URL the user is going to click on.
But wait a moment. What do we mean by a better distribution? We introduce the
entropy of the distribution as the evaluation metric for the goodness of the distribution.
So if we look at entropy of the URL distribution, it actually models the difficulty of
encoding information from a distribution, so we can consider entropy of the distribution
as the metric for the size of search space or for the difficulty of the task if you want to
predict the next one.
Entropy is a powerful tool for [inaudible] opportunities from which we can tell how
hard Web search could be and how much personalization could help. To tell whether
we can use the history to predict the future, we can use the cross entropy -- we estimate
the model from history and we measure entropy based on this model, but on the
future observations. So we can use cross entropy to measure whether our model can
better predict the future.
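
A short sketch of the two metrics, entropy and cross entropy, measured in bits. The click
distributions and URL names are hypothetical toy values, and the `floor` parameter is an
assumed guard against log(0) for outcomes the history-based model never saw.

```python
import math


def entropy(p):
    """Shannon entropy, in bits, of a {outcome: probability} distribution."""
    return sum(-q * math.log2(q) for q in p.values() if q > 0)


def cross_entropy(future, model, floor=1e-9):
    """Average bits needed to encode outcomes from `future` using `model`.

    `floor` is an assumed guard that avoids log(0) for outcomes the model
    never saw in the history.
    """
    return sum(
        -q * math.log2(max(model.get(x, 0.0), floor))
        for x, q in future.items()
        if q > 0
    )


# Hypothetical click distributions for two queries.
hard_query = {"url_a": 0.25, "url_b": 0.25, "url_c": 0.25, "url_d": 0.25}
easy_query = {"url_a": 1.0}
print(entropy(hard_query), entropy(easy_query))

# A model estimated from history, evaluated on the future clicks.
history_model = {"url_a": 0.5, "url_b": 0.5}
print(cross_entropy(hard_query, history_model))
```

Cross entropy is never smaller than the entropy of the future distribution itself, so the
gap between the two shows how much the history-based model is missing.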
As some intuitive examples, by the definition of entropy, we can divide the queries into easy
queries and hard queries. For instance, MSR is a hard query. If you look at the
distribution of the clicks, there are so many answers and every answer is almost equally
likely to be clicked on. So it's pretty hard to decide which one you want to put at the top.
So it's a hard query.
For another query, Google is an easy query because the distribution of the clicks has
low entropy. There aren't so many answers and one answer gets almost
all the probability, so it's very easy to decide which one you want to put at the top. And
by incorporating the user context, we want to make a hard query into an easy query.
And here are some findings. We estimate the entropy from a very large query log
database from Live Search. We estimate the entropy of the variables query, URL and IP
address one at a time, the joint entropy of the variables two at a time and then three at a
time. Then we can estimate the difficulty of traditional search. What is that? It's actually
modeled by the joint -- by the conditional entropy of URL given the query, which is only
2.8 bits, which means that we can usually find the results we want in the top ten pages.
And what about personalized search? We can also estimate the difficulty of personalized
search as the conditional entropy of URL given the query and IP address, which is only 1.2
bits. And this is huge. This means that personalization could potentially cut the
entropy in half. This brings in a large opportunity to improve Web search.
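
The conditional entropies above follow from joint entropies through the chain rule,
H(URL | Q) = H(URL, Q) - H(Q). Below is a minimal sketch of that calculation on empirical
counts. The four-record log, the field names, and the URLs are hypothetical; the numbers
it produces are unrelated to the 2.8 and 1.2 bits reported in the talk.

```python
import math
from collections import Counter


def entropy_of_counts(counter):
    """Empirical entropy, in bits, of the outcomes tallied in `counter`."""
    total = sum(counter.values())
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())


def conditional_entropy(records, target, given):
    """H(target | given) = H(target, given) - H(given), from empirical counts."""
    joint = Counter(tuple(r[k] for k in given + target) for r in records)
    marginal = Counter(tuple(r[k] for k in given) for r in records)
    return entropy_of_counts(joint) - entropy_of_counts(marginal)


# Hypothetical toy log: two users issue the same query but click different URLs.
log = [
    {"ip": "1.1.1.1", "query": "msr", "url": "microsoft-research.example"},
    {"ip": "1.1.1.1", "query": "msr", "url": "microsoft-research.example"},
    {"ip": "2.2.2.2", "query": "msr", "url": "mountain-safety.example"},
    {"ip": "2.2.2.2", "query": "msr", "url": "mountain-safety.example"},
]

h_traditional = conditional_entropy(log, ("url",), ("query",))
h_personalized = conditional_entropy(log, ("url",), ("query", "ip"))
print(h_traditional, h_personalized)
```

In this toy log, knowing the IP address removes all remaining uncertainty about the
click. On a real log, such plug-in estimates depend heavily on how much data each context
has.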
We can also look at the story in query searches [inaudible] traditional searches, if we
don't know any suggestion, how well can we predict the next query of the user? It
corresponds to the entropy of the query, which is 21 bits, which is pretty difficult. But what if
we know the user, what if we know the IP of the user? We can reduce the conditional
entropy to just five bits. Once again personalization cuts the entropy in half, and this
time twice. Again, this means a huge opportunity to improve Web search.
Of course this only tells us the potential benefit we can bring in if we have a model that
only Gus [phonetic] knows, but we can always estimate a model from the history. We
introduce the model of URL given the user and query by combining five different
language models. So we have five black boxes. One black box corresponds to the model
estimated for the individual user, another one corresponds to the user group which shares
the first three bytes of the IP address, another one for the two bytes, another one for the
one byte, and then another for everyone in the world. Then we want to combine these
models to get this distribution of URL given the user and query.
If we only use this component, we are doing full personalization so that every user has
a different model. But we may run into the problem of data sparseness. Whereas if we
only use this component, we are not doing personalization at all. So all users share the
same model. We miss the opportunity to personalize search.
What we want is something in the middle. We want to do personalization, but also back
off to users who click like me. We don't want to break anything down -- everything down
to individual users.
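
A minimal sketch of personalization with backoff, assuming the five models are combined
by linear interpolation over IP prefixes. The function names, the toy log, and the weights
in `lambdas` are hypothetical; in this sketch a context with no data contributes nothing,
so the result is a ranking score rather than a normalized probability.

```python
from collections import Counter, defaultdict


def ip_prefixes(ip):
    """Contexts from most to least specific: 4, 3, 2, 1, and 0 bytes of the IP."""
    parts = ip.split(".")
    return [".".join(parts[:n]) for n in (4, 3, 2, 1, 0)]


def train(log):
    """Count clicks per (prefix, query) so each context gets its own model."""
    counts = defaultdict(Counter)
    for ip, query, url in log:
        for prefix in ip_prefixes(ip):
            counts[(prefix, query)][url] += 1
    return counts


def p_url(url, query, ip, counts, lambdas):
    """Interpolate the five maximum likelihood models with weights `lambdas`."""
    prob = 0.0
    for lam, prefix in zip(lambdas, ip_prefixes(ip)):
        clicks = counts.get((prefix, query))
        if clicks:  # contexts without data contribute nothing
            prob += lam * clicks[url] / sum(clicks.values())
    return prob


# Hypothetical toy log of (ip, query, clicked url).
log = [
    ("1.2.3.4", "msr", "A"),
    ("1.2.3.9", "msr", "A"),
    ("9.9.9.9", "msr", "B"),
    ("9.9.9.8", "msr", "B"),
    ("9.9.9.7", "msr", "B"),
]
counts = train(log)
lambdas = (0.3, 0.2, 0.2, 0.1, 0.2)  # assumed weights, most specific first

# A user never seen before, but from a known three-byte neighborhood.
p_a = p_url("A", "msr", "1.2.3.5", counts, lambdas)
p_b = p_url("B", "msr", "1.2.3.5", counts, lambdas)
print(p_a, p_b)
```

The global model alone would rank B first (three clicks against two), while backing off
to the shared prefixes ranks A first for this user.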
And we can estimate the parameters lambda, the weights of every component, using the
EM algorithm. So we can see that a little bit of personalization is actually better than too
much personalization. It's also better than too little. If we only use the four bytes of the IP
address, we actually run into a problem of sparse data. And if we use the -- don't use the
IP address at all or if we rely on a context which is too large, we miss the opportunity of
personalization.
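
A sketch of how EM can fit the interpolation weights when the component models are held
fixed. The function name and the toy numbers are hypothetical, and the sketch assumes the
weights are fit on held-out clicks: fitting them on the same data used to train the
components would favor the most specific model.

```python
def estimate_lambdas(component_probs, n_iters=100):
    """EM for mixture weights with fixed components.

    component_probs[n][i] is the probability that component i assigns to the
    n-th held-out click.
    """
    k = len(component_probs[0])
    lambdas = [1.0 / k] * k
    for _ in range(n_iters):
        totals = [0.0] * k
        for row in component_probs:
            mix = sum(lam * p for lam, p in zip(lambdas, row))
            if mix == 0.0:
                continue  # no component explains this click; skip it
            # E-step: responsibility of each component for this click.
            for i in range(k):
                totals[i] += lambdas[i] * row[i] / mix
        norm = sum(totals)
        if norm == 0.0:
            break
        # M-step: new weights are the normalized responsibilities.
        lambdas = [t / norm for t in totals]
    return lambdas


# Hypothetical held-out clicks scored by two components
# (say, a user-level model and a global model).
held_out = [[0.9, 0.1], [0.8, 0.2], [0.0, 0.5]]
lambdas = estimate_lambdas(held_out)
print(lambdas)
```

The third click is unseen by the first component, which is what keeps the second weight
away from zero: this is the sparse-data effect that backoff is meant to handle.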
We can also use cross entropy to evaluate how well we can use the history to predict the
future. So in this plot, we have the cross entropy of the future given the history, based on
no personalization, based on complete personalization, and based on personalization with
backoff. It's different because the IP address in the future may or may not be seen in the
history. As we can see from this chart, if we know every byte of the IP address in the history,
which means we have enough data for everyone, indeed complete personalization is the
best. However, if we don't have enough data for everyone, if we don't observe some parts
of the IP address, complete personalization is not as good as personalization with
backoff. And we can see that in all cases personalization with backoff outperforms no
personalization.
In this example, if we know at least two bytes of the IP address in the history, we can
almost cut the entropy in half. So this is not what only Gus knows, this is in
practice what we can do with the history to predict the future. Yes.
>>: So when you're using cross entropy, you're penalizing [inaudible] entire distribution
of the future.
>> Qiaozhu Mei: Yes.
>>: If you -- looking at things like ranking [inaudible] you probably care more about
some subsets of the future as opposed to the entire distribution? Because there you'll get
disproportional effects from more rare parts of it and so on. Have you looked at possibly
doing [inaudible] for different layers that just care about the ranking-type tasks where
you care about just some subset?
>> Qiaozhu Mei: That's a very interesting suggestion. We haven't looked at that. We
haven't looked at targeted query type for -- or classes of queries. But we do look at other
context information which you can see later.
It's interesting to get a subset of the queries or get a subset of the future clicks to estimate
whether the entropy is larger or lower. It's definitely interesting. We haven't done that.
So the question is whether we can do better by trying other types of context information.
Because it's really arguable whether IP address is the best choice as the [inaudible] variable
for the user.
Can we do better than IP addresses? Can we use other contextual variables like user ID,
right, like the query type, like the click rate, like the intent of the query? Can we use
other variables like demographic variables, age, gender, income? Can we use other variables
like time of day or day of week?
We have done some very preliminary research on using other types of context information
such as day of week. We can see from this chart that by comparing business days with
weekends, on business days you have more clicks. The queries on business days have more
clicks. But they're also easier queries. Which means it makes sense to distinguish the
queries on business days and weekends.
And we also look at the context information of hours of day, and we can see from this
plot that the harder queries come in around 9 p.m., which is around the TV time. So this
is also very interesting. We can see this pattern. So it's potentially useful to distinguish
the queries by hour of the day.
So this is just a preliminary result, which shows that there's still a huge potential to do
better than IP address, to incorporate other context information.
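A minimal sketch of how one could check whether a context variable such as day type is
worth conditioning on: compare the entropy of clicks with the entropy of clicks given the
context. The log records, the function names and the choice of bits as the unit are all
assumptions made for illustration; the numbers are not from the talk.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)

def conditional_entropy(records):
    """H(click | context) for (context, clicked_url) pairs.

    Each context's entropy is weighted by how often that context occurs.
    """
    by_context = defaultdict(Counter)
    for context, url in records:
        by_context[context][url] += 1
    total = len(records)
    return sum(
        (sum(urls.values()) / total) * entropy(urls)
        for urls in by_context.values()
    )

# Hypothetical toy log of (day type, clicked URL) pairs.
log = [
    ("weekday", "a.com"), ("weekday", "a.com"), ("weekday", "a.com"), ("weekday", "a.com"),
    ("weekend", "b.com"), ("weekend", "c.com"), ("weekend", "b.com"), ("weekend", "c.com"),
]

h_clicks = entropy(Counter(url for _, url in log))  # ignoring the context
h_given_day = conditional_entropy(log)              # conditioning on day type
```

In this toy log the weekday clicks are fully predictable and the weekend clicks are not,
so conditioning on day type lowers the entropy. A context variable that leaves the entropy
unchanged would carry no information about the clicks.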
So the second application I want to introduce is smoothing language models for information
retrieval. The basic idea of language modeling-based IR is like this. Suppose from a
document you can estimate a document language model, which is essentially a
multinomial distribution of words, and from a query, you can also estimate a query
language model. Then you can rank the documents based on the similarity between these
two language models.
For example, we can use negative [inaudible] divergence. But the problem is that every
document only has very limited information, so the distribution we estimate by
maximum likelihood is usually not trustworthy. It could cause some serious problems. So
instead we want to use a smoothed version of the document language model, which is more
robust and more accurate. Different language modeling-based approaches vary in
how they estimate the language model, and that boils down to how they smooth the language
model of the document.
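A minimal sketch of the ranking idea and of why unsmoothed estimates cause trouble. It
assumes the divergence meant here is the Kullback-Leibler (KL) divergence, which the
recording does not confirm, and uses whitespace tokenization on made-up documents. The
function names are hypothetical.

```python
import math
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def neg_kl_divergence(query_lm, doc_lm):
    """Score = -KL(query || document); higher means more similar.

    Returns -inf when the document model gives zero probability to a query word,
    which is exactly the failure that smoothing is meant to prevent.
    """
    score = 0.0
    for w, p_q in query_lm.items():
        p_d = doc_lm.get(w, 0.0)
        if p_d == 0.0:
            return float("-inf")
        score -= p_q * math.log(p_q / p_d)
    return score

query = mle_unigram("text mining".split())
doc_a = mle_unigram("text mining with context text".split())
doc_b = mle_unigram("mining gold in the mountain".split())

score_a = neg_kl_divergence(query, doc_a)  # finite: both query words occur in doc_a
score_b = neg_kl_divergence(query, doc_b)  # -inf: "text" never occurs in doc_b
```

With unsmoothed estimates a single missing query word removes a document from
consideration no matter how well the rest of it matches.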
If they have a better strategy to smooth the language model, they can usually yield better
retrieval performance. A particular strategy in literature is to incorporate the maximum
likelihood estimates with some other language model estimated from the collection, to
integrate them.
There are two goals. One goal is to [inaudible] nonzero probability to unseen words, and
another goal is to estimate a more accurate distribution from sparse data. People have
done a lot of work to satisfy this goal, but it's still not clear what we mean by an accurate
distribution from the sparse data.
Let's see that smoothing language models is also an instantiation of contextual text
mining, where the text we are looking at is the indexed collection and the contexts are the
documents themselves. And we want to look at a contextualized generation model,
which is [inaudible] words in the document.
But what's not clear here is what kind of structure of this context we should use. And the
goal is to estimate the smoothed version of the language model so that we can get better
retrieval performance, and we want to regularize the maximum likelihood estimates
based on the structure of the context here -- every context in the document.
There's quite a bit of previous work on smoothing language models based on the collection
language model. The basic idea is to incorporate the maximum likelihood estimates with
some reference model, usually estimated from the whole collection. So we have the
maximum likelihood estimates of the language model, we have a reference [inaudible]
model, and we have to somehow incorporate these two models and get the smoothed
language model.
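A minimal sketch of combining a document's maximum likelihood estimate with a reference
model estimated from the whole collection. Linear interpolation with a fixed weight is
one common choice and is an assumption here, since the talk does not say which formula
is used. The documents, the weight `lam` and the function names are illustrative.

```python
from collections import Counter

def mle_unigram(tokens):
    """Maximum likelihood unigram model: relative frequency of each word."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def interpolate(doc_lm, ref_lm, lam):
    """p(w|d) = (1 - lam) * p_ml(w|d) + lam * p_ref(w), over the joint vocabulary."""
    vocab = set(doc_lm) | set(ref_lm)
    return {w: (1 - lam) * doc_lm.get(w, 0.0) + lam * ref_lm.get(w, 0.0) for w in vocab}

# Hypothetical three-document collection.
docs = ["text mining text".split(), "web search".split(), "text search".split()]

# Reference model: treat the whole collection as one big document.
collection_lm = mle_unigram([token for doc in docs for token in doc])

smoothed = interpolate(mle_unigram(docs[0]), collection_lm, lam=0.5)
```

After interpolation the first document assigns a small nonzero probability to "web",
a word it never contains, while the result remains a proper distribution.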
And there are many heuristics proposed to estimate such a reference language model. The
simplest case is to put every document in the document collection into
the same big document and estimate a collection language model from the big document.
Then people proposed a better way: first cluster the documents in the collection and
estimate a language model for each cluster, then incorporate the maximum
likelihood estimates with the language model for the [inaudible] the document belongs
to. But it's still not clear whether all the documents in the same cluster are close to the
document itself. Yes.
>>: [inaudible] between the first model and the [inaudible] measure seems like we're just
getting the most probability from the time [inaudible] the least [inaudible].
>> Qiaozhu Mei: Yes. Yeah. But it's under the language modeling approach, so it's not
using TFIDF -- although people have some interpretation about the connection of the
language modeling-based approach to the TFIDF models. But this is under the camp of
language modeling approach.
>>: What is the language model? You mean [inaudible]?
>> Qiaozhu Mei: Yes. A unigram.
>>: Oh, unigram. Right. So in that case the unigram, the [inaudible] of unigram is just
TF, right?
>> Qiaozhu Mei: It's just TF. But there's no IDF there. So by adding in this smoothing
term, you're actually incorporating some component related to the IDF. We're not
looking at like bigram models or n-gram models, because people have found that in
traditional ad hoc IR, unigram matching model works pretty well.
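The link between the smoothing term and IDF can be made concrete with a standard
rewriting of the query likelihood. The sketch below assumes linear interpolation with
weight \lambda on a collection model p(w | C), where c(w, d) is the count of w in d. The
talk does not fix a particular smoothing formula, so this is an illustration rather than
the speaker's own derivation.

```latex
\begin{align*}
p_\lambda(w \mid d) &= (1-\lambda)\, p_{ml}(w \mid d) + \lambda\, p(w \mid C) \\
\log p(q \mid d) &= \sum_{w \in q} \log p_\lambda(w \mid d) \\
  &= \sum_{w \in q,\; c(w,d) > 0}
       \log\!\left( 1 + \frac{(1-\lambda)\, p_{ml}(w \mid d)}{\lambda\, p(w \mid C)} \right)
     + \sum_{w \in q} \log\bigl( \lambda\, p(w \mid C) \bigr)
\end{align*}
```

The last sum does not depend on the document, so it does not affect the ranking. The
first sum grows with p_{ml}(w | d), which plays the role of TF, and shrinks as p(w | C)
grows, which penalizes words that are common in the collection in the same way IDF does.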
So I have introduced the other method, which is based on clustering the document
collection first and then using the cluster language model to smooth the language model
of the document.
There's yet another heuristic: to look at the nearest neighbors of the document, to estimate a
language model based on the nearest neighbors of the target document, and then incorporate that
model with the maximum [inaudible] estimates. Right? So there are so many different
heuristics, and all of them get better results than not smoothing or smoothing [inaudible] with
the background.
But there are also problems with these existing methods. For example, if we smooth with the
global background, we're actually ignoring the collection structure. We're not leveraging the
structure of the documents, right? If we smooth with document clusters, we're actually
ignoring the local structure inside clusters. So we're not sure whether all the documents in
the same cluster contribute equally to the target document.
And on the other hand, if we smooth using the nearest neighbor documents, we're ignoring
the global structure. Once again, we haven't really leveraged the full power of the
structure of the context.
So although there are so many different heuristics and different heuristics of
interpolation, there's still no clear objective function for optimization. You don't know
what these methods are optimizing, and there's no guidance on how to further improve
the existing methods.
So the research [inaudible] here include: what is the right cover structure, what is the
right structure of the context we should use, and what are the criteria for a good smoothing
method, what do we actually mean by an accurate language model here.
We also want to answer what [inaudible] optimizing by smoothing those document
language models, and could there be a general optimization framework rather than all
these heuristics.
So we introduce a novel and general view of smoothing based on a graph structure of
the documents. Suppose we have a graph of documents as the structure, which could come from
some link structure or from some similarity computation of the documents. So what
we want to do is to project such a graph on a hyperplane, and then what is the language
The [inaudible] of the word given the document actually makes a point on top of this
hyperplane. So if we look at the language models of different documents, they actually
make surfaces on top of this hyperplane, on top of this graph structure. Of course if we
only rely on maximum likelihood estimates, this surface could be very rough because we
don't have enough data in every document. The data is sparse.
So what we want to get by smoothing is some smoothed surface on top of the graph
structure so that the smoothed language model is equivalent to smoothed surfaces on top
of this graph structure. So this actually gives the interpretation of what smoothing is
actually doing for language models. Then we can see two heuristics or two intuitions.
The first intuition is that we want the [inaudible] to the maximum likelihood estimates; we
don't want these two surfaces to vary too much. Another intuition is to find smooth
surfaces on top of this graph. So we want a smooth version of this rough surface.
And we can see that, interestingly, this covers many existing heuristic methods as special
cases, depending on what kind of graph they're using for smoothing language models. The
general case is to find the smooth surfaces on top of this general graph structure of
documents. Right? The [inaudible] one using nearest neighbors, we're actually [inaudible]
on top of this graph. This is actually the local graph of the target document, which only
contains the nearest neighbors.
What about smoothing with document clusters? Okay. This is actually equivalent to
finding smooth surfaces on top of this forest, where we introduce a
pseudodocument for every cluster, and then we connect every document in this cluster
with this pseudodocument. And finally, what is the connection with smoothing the
language model using the global structure? We can see that it's equivalent to finding the
smooth surfaces on top of this [inaudible] graph where we make a pseudodocument for
all the documents in the collection and then connect this pseudonode with all the
documents in the collection. So it shows that our framework actually covers all the
existing heuristics as special cases.
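A minimal sketch of the three special-case graphs just described, written as edge lists.
The pseudodocument names and the function names are hypothetical; only the shapes of the
graphs (local neighborhood, forest of cluster hubs, single collection hub) follow the talk.

```python
def neighbor_graph(target, neighbors):
    """Nearest-neighbor smoothing: a local graph around one target document."""
    return [(target, n) for n in neighbors]

def cluster_forest(clusters):
    """Cluster-based smoothing: one pseudodocument per cluster, linked to its members."""
    edges = []
    for name, members in clusters.items():
        hub = f"__cluster_{name}__"  # hypothetical pseudodocument id
        edges.extend((hub, doc) for doc in members)
    return edges

def star_graph(doc_ids, hub="__collection__"):
    """Global background smoothing: one pseudodocument linked to every document."""
    return [(hub, doc) for doc in doc_ids]

local_edges = neighbor_graph("d1", ["d2", "d3"])
forest_edges = cluster_forest({"a": ["d1", "d2"], "b": ["d3"]})
star_edges = star_graph(["d1", "d2", "d3"])
```

Seen this way, the heuristics differ only in which edges they allow, which is why a single
objective defined on a general document graph can cover all of them.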
And based on the intuitions we can formally define the objective function of smoothing for
language models. Remember that the maximum likelihood estimates make the rough
surface, and what we want is a smoothed surface. So we first introduce the weight of
every vertex on top of this graph, such as the degree of the vertex or some other
metric for the document. We then introduce the weight on every edge between two
documents, such as the similarity of two documents. Then we can formally define this
component to control the fidelity of the smoothed version of the language models to the
maximum likelihood estimates.
Similarly, we can define another component which controls the smoothness of the
language models over the surface, so this part is actually the difference of the language
model between two connected vertices or two documents and the [inaudible] on this
graph. Then we can introduce the objective function, which can [inaudible] into two
parts. This part corresponds to the intuition that we want to keep fidelity to the
maximum likelihood estimates, and this part controls the smoothness regularization on
top of this graph structure. So by finding the tradeoff between these two components, we
are satisfying the two intuitions at the same time.
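A minimal sketch of an objective with the two components described above, for a single
word. The slide's formula is not in the transcript, so the form used here is an
assumption: squared differences for both terms, and a tradeoff weight `lam`, giving
(1 - lam) * sum_u w(u) * (f_u - f_ml_u)^2 + lam * sum_(u,v) w(u,v) * (f_u - f_v)^2.
All names and numbers are illustrative.

```python
def smoothing_objective(f, f_ml, vertex_weight, edges, lam):
    """Fidelity-plus-smoothness objective for one word (lower is better).

    f, f_ml:       dicts doc -> smoothed / maximum likelihood probability of the word
    vertex_weight: dict doc -> importance weight of the document
    edges:         dict (u, v) -> edge weight, each undirected edge listed once
    lam:           tradeoff in [0, 1] between fidelity and smoothness
    """
    fidelity = sum(vertex_weight[u] * (f[u] - f_ml[u]) ** 2 for u in f)
    smoothness = sum(w * (f[u] - f[v]) ** 2 for (u, v), w in edges.items())
    return (1 - lam) * fidelity + lam * smoothness

# Hypothetical chain d1 - d2 - d3 with unit edge weights; vertex weight = degree.
edges = {("d1", "d2"): 1.0, ("d2", "d3"): 1.0}
degree = {"d1": 1.0, "d2": 2.0, "d3": 1.0}
f_ml = {"d1": 0.0, "d2": 0.3, "d3": 0.0}

# Keeping the rough maximum likelihood surface: perfect fidelity, poor smoothness.
obj_rough = smoothing_objective(f_ml, f_ml, degree, edges, lam=0.5)

# A completely flat surface: perfect smoothness, poor fidelity.
flat = {"d1": 0.1, "d2": 0.1, "d3": 0.1}
obj_flat = smoothing_objective(flat, f_ml, degree, edges, lam=0.5)
```

Neither extreme is optimal in general. The minimizer sits between the rough surface and
the flat one, and `lam` controls how far it moves toward the neighbors.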
We also propose an algorithm to solve this language model smoothing problem on top of
the document graph. So we want to first construct a [inaudible] neighbor graph of
documents, by defining the weight of the edge as the cosine similarity of two documents,
and then we compute the importance weight of every document as the degree of the
document in the graph, right? Then we want to apply this updating formula for the
language model, and we want to go over this updating formula multiple times, to iterate
it so that it will reach convergence.
And once we have a converged language model, we then add an additional
[inaudible] smoothing, because we still want to avoid the zero [inaudible] in the language
models, and finally we get a smoothed language model.
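A minimal sketch of the steps above in plain Python: build a nearest-neighbor graph with
cosine edge weights, use the degree as the document weight, and iterate an update until
it settles. The updating formula itself is on a slide and not in the transcript, so the
one used here, f_u <- (1 - lam) * f_ml_u + lam * sum_v (w_uv / deg_u) * f_v, is an
assumption: it is the fixed point of a squared-difference objective with degree vertex
weights. The final extra smoothing step is left out. All names and data are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_graph(vectors, k):
    """Symmetric k-nearest-neighbor graph; weights[u][v] is the cosine similarity."""
    n = len(vectors)
    weights = [dict() for _ in range(n)]
    for u in range(n):
        sims = [(cosine(vectors[u], vectors[v]), v) for v in range(n) if v != u]
        for sim, v in sorted(sims, reverse=True)[:k]:
            if sim > 0:
                weights[u][v] = sim
                weights[v][u] = sim
    return weights

def smooth_on_graph(p_ml, weights, lam=0.5, iters=50):
    """Iterate the assumed update for one word's probability across all documents."""
    f = list(p_ml)
    for _ in range(iters):
        new_f = []
        for u, nbrs in enumerate(weights):
            deg = sum(nbrs.values())  # importance weight of document u
            # Degree-normalized neighbor average; isolated documents keep their value.
            avg = sum(w * f[v] for v, w in nbrs.items()) / deg if deg else f[u]
            new_f.append((1 - lam) * p_ml[u] + lam * avg)
        f = new_f
    return f

docs = [
    "graph smoothing language model".split(),
    "language model smoothing".split(),
    "web search log".split(),
]
vectors = [Counter(d) for d in docs]
graph = knn_graph(vectors, k=1)

# Maximum likelihood probability of the word "graph" in each document.
p_ml = [d.count("graph") / len(d) for d in docs]
smoothed = smooth_on_graph(p_ml, graph, lam=0.5)
```

The second document never mentions "graph" but is similar to the first, so it receives
some probability for that word, while the unrelated third document stays at zero. Because
the neighbor weights are normalized by degree, applying the same update to every word
keeps each document's distribution summing to one.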
So you may ask what is this updating formula really doing, right? Is there an intuitive
interpretation of this updating formula? We can see that we can actually interpret this
updating formula with a random walk, so we can rewrite this updating formula in this
format, which corresponds to computing the absorption [inaudible] of this kind of random
walk on documents. So we transition [inaudible] from one document to a positive
state, from one document to a negative state, and from one document to another
Then this updating formula is essentially computing the absorption [inaudible] from one
document to the positive state. So intuitively, if you don't know whether you want to
write a word in a document, what you want to do is to do the random walk on this
document Markov chain, [inaudible] as your neighbors do, and write down the word if you
eventually reach this positive state. So this is actually an intuitive interpretation of this
updating formula.
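A minimal simulation of this random walk reading, under assumed transition
probabilities that are not given in the transcript: at document u the walker stops with
probability 1 - lam and is then absorbed in the positive state with probability f_ml_u
(otherwise in the other absorbing state), and with probability lam it moves to a
neighboring document. The two-document chain and all names are hypothetical.

```python
import random

def absorbed_positive(start, p_ml, transitions, lam, rng):
    """Run one walk; return True if it ends in the positive absorbing state."""
    u = start
    while True:
        if rng.random() >= lam:            # stop walking with probability 1 - lam
            return rng.random() < p_ml[u]  # "write the word" with probability p_ml[u]
        neighbors, probs = transitions[u]
        u = rng.choices(neighbors, weights=probs)[0]

# Two connected documents; only document 0 contains the word (p_ml = 0.25).
p_ml = [0.25, 0.0]
transitions = {0: ([1], [1.0]), 1: ([0], [1.0])}
lam = 0.5

rng = random.Random(0)  # fixed seed so the estimate is reproducible
walks = 20000
estimate = sum(absorbed_positive(0, p_ml, transitions, lam, rng) for _ in range(walks)) / walks

# Solving a_0 = (1 - lam) * 0.25 + lam * a_1 and a_1 = lam * a_0 by hand gives a_0 = 1/6.
exact = 1.0 / 6.0
```

The absorption probabilities satisfy the same linear equations as the fixed point of an
interpolation-style update, so the simulated estimate should land close to the value
obtained by solving those equations directly.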
And we evaluate our algorithm extracted from this general framework using standard
TREC collections and standard TREC queries. So we have four collections which
contain up to 500,000 documents.
And we compare this algorithm with the state-of-the-art algorithm which uses the whole
collection as the reference model to estimate the smoothed language model, and another
method using clusters as the document structure to smooth the language model. We
evaluate our method using mean average precision (MAP) defined on the ranked list of
documents, so we can see that our algorithm actually outperforms both algorithms
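A minimal sketch of the mean average precision (MAP) metric used in this evaluation,
in its standard uninterpolated form. The ranked lists and relevance judgments are made
up for illustration and are not results from the talk.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d2", "d1"], {"d1"}),                    # AP = 1/2
]
map_score = mean_average_precision(runs)
```

Because precision is sampled at every relevant document, MAP rewards placing relevant
documents near the top of the list, not just retrieving them somewhere.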
So as a summary of this talk, I have introduced a new paradigm of text mining, which
is contextual text mining. It's a new paradigm of text mining which treats context
information as a first-class citizen. I've also introduced a general methodology for
contextual text mining by incorporating context information into generative models of
Then I have introduced two applications of contextual text mining on Web search and
information retrieval to show the general framework is very effective in solving
real-world problems.
The takeaway message here is that with rich context information in text mining, we have
guidance. It is much easier to find the goat from the big
And this is the roadmap of my work. I have been concentrating on contextual text mining
and information retrieval and Web search. I've pretty much touched on this work by
adding context information into probabilistic topic models. I also have other work
which does not correspond to contextual text mining, but specifically to information
retrieval and Web search, which includes the [inaudible] model for language model-based
information retrieval, generating impact-based summarization for scientific literature,
and making use of query URL by [inaudible] graph to generate query suggestions.
And in the future of my work, first I'm interested in continuing my work on contextual
text mining, by working through the [inaudible] framework because there are still many
computational challenges, and we still don't know what is a good model for the
structure of contexts such as other contextual structures.
And I also want to work towards the applications by designing the task support systems
to different type of users, such as the Web users, such as business users, such as other
And more importantly I want to leverage the power of contextual text mining to enhance
Web search by providing the so-called contextualized search including personalized
search or intent-based search or other paradigm of Web search based on different context
And then I'm also interested in integrative analysis of heterogeneous data. For example,
we have Web 2.0 data, we have data from the user, we have data from the search log, we
have data from [inaudible] logs. So can we really integrate this data and find
information that is very hard to find from a particular dataset?
Okay. And I want to stop here for questions. Thanks.
>>: You talk about context in text. And I note that a lot of text is generated by humans
for not a human to consume, so most of the text you're talking about [inaudible] is
probably from [inaudible] sources. So I always notice the word context [inaudible] and
traditionally in the language I think the biggest context is actually the context within the
text themselves in addition to the things that you're talking about.
So I'm very curious as to, while you're searching for the context in the IR, data mining and
machine learning communities, why haven't you touched on the linguistics community,
natural language processing?
>> Qiaozhu Mei: So I think the context defined in linguistics is usually based on the
so-called linguistic context. We actually borrow probably the oldest definition of
context, which is the [inaudible] at which the text was written, which includes the
linguistic context and other context like the cognitive context, like the context in the
>>: Those are too high level. Can you come down a little bit, I mean, just like a -- for
example, have you seen about the syntactic structure, that -- or semantic structure, all that
>> Qiaozhu Mei: Yes. We have done some work on that. We actually leverage the local
context, for example, to generate the annotations of frequent patterns or to generate the
label for topic models. But for this kind of problem I have introduced, such contexts we
find a good -- we find a good fit of these problems with these kinds of contexts. And for
that kind of context, I don't know, I'm not sure whether or not it will help as much as this
>>: [inaudible] is that going beyond [inaudible] but then you alluded that somehow
things beyond unigram doesn't work?
>> Qiaozhu Mei: The basic accepted fact in the IR community is that n-gram models
don't work much more effectively than unigram models, which is because when you go
to n-grams, you also make the data much sparser. So it makes it even more
challenging to estimate the [inaudible] or accurate language model. So in the empirical
experiments, people didn't really observe a much larger boost in the retrieval performance.
>>: [inaudible] you need a [inaudible]?
>> Qiaozhu Mei: [inaudible] yes, you know, but it's harder to do that. And it doesn't
really bring in additional benefits. Yes.
>>: So I'd like to take a shot, which is I think it depends a lot on the task. So if the task is
to predict the next word, then it's obvious that bigrams are much better than unigrams.
But if the task is to predict relevance to a query, then it's not as clear.
However, I think if you were to, say, bring in another data source, like let's say what the
users are interested in versus the authors, so the users' interest would be, say, clicks, and
authors' interest would be documents, and if you wanted to know relevance to a query,
then the combination of users' interests and authors' interests are very meaningful, but the
bigrams are less meaningful.
So what we have here in a framework here is to say for a particular task, we can predict
the entropy of an output variable Y, which is like what the next part is, a relevance to a
query, from a bunch of inputs, X, and we can make precise the question of which features
are useful for which task. And then you can address a question with content instead of
getting hot under the collar about whether this feature is useful or not, more useful than
that feature for this task or that task.
>>: I totally understand that. But he's also talking about in the context of smoothing the
document clustering.
>>: Well, he's talking about smoothing for a particular task, like say relevance to a
>>: Well, in the beginning, I think I heard about the topic --
>> Qiaozhu Mei: Yes. Topics --
>>: [inaudible] the way the hierarchy define topics. It's more like text classification.
>> Qiaozhu Mei: Right.
>>: And for that task I think bigrams is not so obvious.
>>: So obvious?
>>: It isn't so obvious bigrams aren't that powerful for that. Guessing the next word is
very obvious bigrams.
>>: Have you tried?
>> Qiaozhu Mei: People have tried in the literature using bigram language models for
topic models and using bigram language models for retrieval, but the results are that in terms
of relevance, in terms of retrieval performance, it doesn't really help much.
>>: Whereas there are quite a few studies that show for predicting the next word bigrams
>> Qiaozhu Mei: Yes. But this is not the traditional task of IR, because we [inaudible]
ranking of documents based on the relevance to the query. So I think it also depends on
the context of your styling.
>>: Well, the context theory is to really find out what context means, right?
>> Qiaozhu Mei: No, the context query is how to [inaudible] performance of retrieval or
Web search. So I think it's [inaudible] question.
>>: [inaudible] Web search, I found many of the pages are very [inaudible] like opposed
to Twitter [inaudible] we found the count actually -- I myself, I found that I did not
[inaudible] computer, I myself [inaudible] difficulty understanding what the [inaudible] is
talking about. I wonder whether the context can help [inaudible] to understand a very
short page.
>> Qiaozhu Mei: That depends on what you mean by understanding the text. So it's still
-- it's still task dependent. If your goal is to provide a better understanding of the natural
language, I don't know. We didn't really look at that problem. But if your goal is to find
relevant information, for example, if you have a query and you want to find the relevant
Twitter articles, the context information is definitely very helpful.
>>: [inaudible] we don't want to [inaudible].
>> Qiaozhu Mei: Yes.
>>: And suppose that we want [inaudible] and I want to put an article into the ODP
categories, however, [inaudible] the context information is really hard [inaudible]. Do
you have any study on how contact [inaudible]?
>> Qiaozhu Mei: We haven't tried in the exact setup. But definitely the context
information can help for your task because it enriches your feature space. If you can find
training data with context information, then you can build a much more powerful model
which leverages the context information in the Twitter articles.
And you can also make use of the context information to make connections between the
individual articles in Twitter so that you can push in some intuitions that if this article
belongs to this category, then the article similar to this one should also belong to this
category. By adding in these kinds of intuitions defined on context information, I think
the task could definitely benefit from the context analysis.
But then the [inaudible] is still task dependent. I can't guarantee that context information
is useful for all kinds of tasks. Right? Yes.
>>: First I want to comment, so from your presentation is [inaudible] I didn't see any
points that prohibit you from applying this to more general [inaudible], for example,
when you ran [inaudible] use the [inaudible].
>> Qiaozhu Mei: Exactly. Exactly.
>>: Okay. Now, another question is about the aspect finding. So have you had any
methods that can automatically estimate how many aspects there are in a document
[inaudible], for example, for the laptop [inaudible] you put out three aspects. Is that something
you predefine or automatically [inaudible] from the [inaudible]?
>> Qiaozhu Mei: Yes. There is some initial work done on that. If you're familiar with
the body of work on topic modeling, you probably know the work by [inaudible] about
the Chinese [inaudible] process. They provide some methods to automatically estimate
the number of topics in the text. I don't want to make a comment on whether, how, or
where the method could work, but I think what we do is to put this task in the framework
of user-guided studies. So we allow the user to provide some guidance to the topic
model. If he knows what type of topic he wants to look at, he can give such
guidance to the system.
So, for example, if he knows that, okay, these topics are too high level, he can probably
drill down the analysis to provide a larger number of topics, and he can also input some
keywords to say what kind of topic he's looking at. So we allow the user to provide
guidance to the system instead of just estimating the number of topics automatically.
Because it's still not clear what the topic is really -- is really defined as.
You can think about like Web search as the topic, right, if you're looking at the level of
different research areas. But you can also think about query login as the topic if you want
to drill down to the lower level. So it's really unclear how many topics there are in the
text if you don't have any prior knowledge from the user in this case.
>>: [inaudible] the question is suppose you have two [inaudible] one is about Dell, one is
about [inaudible] and there's no guarantee [inaudible] so how to align them together?
>> Qiaozhu Mei: Yes. In our work we define the context, a larger context which
contains both of the [inaudible], so we align the topics by leveraging the overlap of these
contexts. So in this case we have three contexts. One context for Dell, another context
for -- what did you say [inaudible] and another context that contains both of the brands. So
by leveraging this overlapping of contexts, we can match them.
>>: So in [inaudible] also use something like a [inaudible] to minimize -- which is
essentially [inaudible] so they want to minimize that to estimate the optimal number of
[inaudible] so you're also like optimizing the likelihood of the [inaudible] do you see the
similarity between your work and theirs?
>> Qiaozhu Mei: That's a particular way that people estimate the number of topics in
text. But, as I said, it's still unclear what you mean by the granularity of the topic. So
you can always find some metrics that could help you define the number of topics, but
then the question of whether the topics would be useful in practice is unclear.
So our work, the maximum likelihood -- the objective functions in this work actually
don't take any training data. So this probably answers your question. The [inaudible]
study is based on having some holdout data, and they have the topic model
estimated on some training data. They want to, you know, maximize the likelihood of
the holdout data based on the training data. So it's a different setup. Yes.
>>: So overall like there's two fundamental types of context that you're dealing with:
one is explicit where you have things like age, gender, links and so on, and another one is
implicit or latent, where you're trying to infer things like topics, et cetera, and so there's a
big concern that I think the previous question touched on is where you're trying to exploit
implicit context, you're creating effectively a whole separate problem, which is
independent from the main task you're solving. And it seems like there's a big danger
there in the sense -- I mean, from a learning theoretic standpoint, there's a big danger of
when you're trying to sort of solve the main problem by creating a subproblem,
you're making it, like, more complicated.
So, I mean, certainly there's been cases where it worked, but, in general, from your
experience, do you think sort of going forward there will be work on sort of better models
for implicit context that will give us more tools, or is it just better tools for dealing with
explicit context that will just capture everything there is and we don't need this extra layer
of implicit context.
>> Qiaozhu Mei: That's an interesting question, and the answer I think is as follows. We
don't want to motivate our research to really get fancier model or to create a subproblem
that is -- that requires more computational challenge. What we are looking at is whether
the context information is really useful.
So your comment is that some implicit context may or may not be useful, right? But I
could give you two examples that implicit context will be very useful in real-world tasks.
One is the sentiment. When you compare products, you don't really want to just track the
number of reviews that people have written for a product. You still want to figure out
whether they have positive opinions or negative opinions. Right.
In previous work we usually don't distinguish that. We see a spike of the discussion of
the book sales in the [inaudible] that there would be books -- spike of the discussion of
blog articles to infer that, which you can infer the spike of sales from that. But if you
don't distinguish the positive or negative opinions, you can -- you can't really do that.
What if all the discussions are negative? You don't want to buy the book based on the
negative discussions.
Another example is the intent of the Web search. You want to figure out whether the
user has the intent of buying some stuff or wants to do research on a small topic
or whatever other intents. This kind of context, we can't easily get that from the text
either. You still have to infer them from the data, you have to digest the data first to get
the context. So I think these two examples of implicit context can show you how this
category can be very useful in practice. So it's not just the practice for fancier models.
>>: [inaudible] actually I'll take issue with what you just said. So you're implying that
there's this binomial split of positive versus negative sentiment.
>> Qiaozhu Mei: Yes.
>>: And then there is this sort of fixed hierarchy of intents as opposed to in reality you
could say, well, what if there is a much more complicated structure of both opinions that
sort of where you can think of positive and negative sentiment as well as the sentiment
where people will say, well, these are the good things in this aspect, these are the bad
things in these aspects, so this goes back to the whole aspect issues.
And then the question is, again, possibly by collapsing these two into just a binary
representation, well, maybe you're making your final task better, maybe worse, but, again,
there is a concern there that by placing those assumptions on the structure of the implicit
context you may be losing information or gaining it if that's all that matters for your final
task and it's sort of you're better off collapsing it.
But I think there's a lot of room for future work. I think that's actually a nice thing about
this is it does open up this whole closed issues of how to deal with it, what's the context.
But I think it's -- it's a big concern there, it's a big danger to just assume that we can just
place a structure on an implicit context and that's what we're dealing with without taking
into account the whole structure is sort of -- there's a question of how many topics there
are, how many types of opinions there are and so on. So that's why.
>> Qiaozhu Mei: Yes, I totally agree with you. We are taking a big step in leveraging
the context information in text to make our life better. But I'm not saying that we are at
the complete stage of that goal.
By modeling the simple context, which is probably more mature at this stage, and also the
techniques to model the implicit context, which is probably not so much [inaudible]
stage, we are taking a step towards that goal. Yeah. Thanks.
