Topic Modelling: Beyond Bag of Words

By Hanna M. Wallach
ICML 2006
Presented by Eric Wang, April 25th 2008
• Introduction / Motivation
• Bigram language models (MacKay & Peto, 1995)
• N-gram topic models (LDA, Blei et al.)
• Bigram Topic Model
• Results
• Conclusion
• The generative models of text considered here fall into two major categories
– Bigram language models: Generate each word conditioned on the previous word.
– N-gram topic models (here, bag-of-words topic models such as LDA): Generate words based on latent topics inferred from word or document correlations.
• N-gram topic models are independent of word order, while bigram models consider pairs of words, with the leading word defining a context.
• Is word order important? Consider the following example
– “The department chair couches offers.”
– “The chair department offers couches.”
• To a bag-of-words topic model such as LDA, the two sentences are identical because they contain the same words, but to a reader they are not the same. Word order therefore carries a great deal of semantic information. A bigram model would see the two sentences as different.
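The difference between the two representations can be checked directly on the example sentences. The following is a minimal sketch; the helper names `unigram_counts` and `bigram_counts` are illustrative and do not come from the paper.

```python
from collections import Counter


def unigram_counts(sentence):
    """Bag-of-words representation: word counts with order discarded."""
    return Counter(sentence.lower().split())


def bigram_counts(sentence):
    """Counts of adjacent (leading word, trailing word) pairs."""
    tokens = sentence.lower().split()
    return Counter(zip(tokens, tokens[1:]))


s1 = "the department chair couches offers"
s2 = "the chair department offers couches"

# The two sentences share the same bag of words ...
same_bag = unigram_counts(s1) == unigram_counts(s2)
# ... but their word pairs differ, so a bigram model can tell them apart.
same_bigrams = bigram_counts(s1) == bigram_counts(s2)

print(same_bag, same_bigrams)  # prints: True False
```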
Bigram language models: Hierarchical Dirichlet Language Model
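• Bigram language models are specified by the conditional distribution
  $P(w_t = i \mid w_{t-1} = j) = \phi_{i|j}$
• The matrix $\Phi$ can be thought of as a transition probability matrix
• Given a corpus $\mathbf{w}$, the likelihood function is
  $P(\mathbf{w} \mid \Phi) = \prod_j \prod_i \phi_{i|j}^{N_{i|j}}$   (1)
• And the prior on $\Phi$ is
  $P(\Phi \mid \beta\mathbf{m}) = \prod_j \mathrm{Dirichlet}(\phi_j \mid \beta\mathbf{m})$   (2)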
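• Combining (1) and (2), and integrating out the transition matrix $\Phi$ yields the evidence of $\mathbf{w}$ conditioned on the hyperparameters $\beta\mathbf{m}$
  $P(\mathbf{w} \mid \beta\mathbf{m}) = \prod_j \frac{\prod_i \Gamma(N_{i|j} + \beta m_i)}{\Gamma(N_j + \beta)} \, \frac{\Gamma(\beta)}{\prod_i \Gamma(\beta m_i)}$   (3)
• We can also obtain the predictive distribution
  $P(i \mid j, \mathbf{w}, \beta\mathbf{m}) = \frac{N_{i|j} + \beta m_i}{N_j + \beta}$   (4)
• Where $N_{i|j}$ is the number of times word i follows word j in the corpus, and $N_j$ is the number of times word j appears in the corpus. We say that word j, being the leading word of the two-word pair, sets a “context”, which is analogous to factors and topics in other models.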
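The predictive distribution of the hierarchical Dirichlet language model has the form (N_{i|j} + β m_i) / (N_j + β). The following is a minimal sketch of it, assuming a uniform mean vector m and β = 1 by default rather than the optimized hyperparameters of MacKay and Peto. The function name `bigram_predictive` is hypothetical, and N_j is counted as the number of times j occurs as a leading word so that the probabilities sum to one.

```python
from collections import Counter


def bigram_predictive(tokens, vocab, beta=1.0, m=None):
    """Return p(i, j) = (N_{i|j} + beta * m_i) / (N_j + beta).

    Assumption: m defaults to a uniform distribution over the vocabulary.
    """
    if m is None:
        m = {word: 1.0 / len(vocab) for word in vocab}
    # N_{i|j}: number of times word i follows word j, keyed as (j, i).
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # N_j: number of times word j occurs as the leading word of a pair.
    context_counts = Counter(tokens[:-1])

    def p(i, j):
        return (pair_counts[(j, i)] + beta * m[i]) / (context_counts[j] + beta)

    return p


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
p = bigram_predictive(tokens, vocab)

# Seen pairs receive more mass than unseen ones, and unseen pairs are
# smoothed towards the prior mean m rather than assigned zero.
print(p("cat", "the"), p("sat", "the"))
```

With a larger β the estimate moves towards the prior mean m, and with a smaller β it moves towards the raw relative frequencies.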
• MacKay and Peto showed that the optimal hyperparameters $\beta\mathbf{m}$ are found using the empirical Bayes method, that is, by maximizing the evidence (3).
N-gram topic models: Latent Dirichlet Allocation
• Latent Dirichlet Allocation does not consider word order.
• The matrices $\Phi$ and $\Theta$ govern the word emission conditioned on topic, and the topic emission conditioned on document, respectively
  $P(w_t = i \mid z_t = k) = \phi_{i|k}$,   $P(z_t = k \mid d_t = d) = \theta_{k|d}$
• Where t is the word position within the corpus, i is the word index in the dictionary, k is the topic index, and d is the document index.
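• Therefore, the joint probability of the corpus $\mathbf{w}$ and the set of latent topics $\mathbf{z}$ is
  $P(\mathbf{w}, \mathbf{z} \mid \Phi, \Theta) = \prod_k \prod_i \phi_{i|k}^{N_{i|k}} \prod_d \prod_k \theta_{k|d}^{N_{k|d}}$   (5)
• Where $N_{i|k}$ is the number of times topic k has generated word i, and $N_{k|d}$ is the number of times topic k was generated in document d.
• We place Dirichlet priors on $\Phi$ and $\Theta$
  $P(\Phi \mid \beta\mathbf{m}) = \prod_k \mathrm{Dirichlet}(\phi_k \mid \beta\mathbf{m})$   (6)
  $P(\Theta \mid \alpha\mathbf{n}) = \prod_d \mathrm{Dirichlet}(\theta_d \mid \alpha\mathbf{n})$   (7)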
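• Combining (5), (6) and (7), and integrating out $\Phi$ and $\Theta$ yields the evidence
  $P(\mathbf{w} \mid \alpha\mathbf{n}, \beta\mathbf{m}) = \sum_{\mathbf{z}} \prod_k \frac{\prod_i \Gamma(N_{i|k} + \beta m_i)}{\Gamma(N_k + \beta)} \, \frac{\Gamma(\beta)}{\prod_i \Gamma(\beta m_i)} \prod_d \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\Gamma(N_d + \alpha)} \, \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha n_k)}$   (8)
– $N_{i|k}$: number of times topic k generates word i
– $N_{k|d}$: number of times topic k appears in document d
– $N_k$: number of times topic k appears in $\mathbf{z}$
– $N_d$: number of words in document d
• However, the sum over $\mathbf{z}$ in (8) is intractable, so approximation methods (MCMC, VB) are used to get around this issue.
• Assuming optimal hyperparameters $\alpha\mathbf{n}$ and $\beta\mathbf{m}$, the approximate predictive distributions for word i given topic k, and for topic k given document d, are
  $P(i \mid k, \mathbf{w}, \mathbf{z}, \beta\mathbf{m}) = \frac{N_{i|k} + \beta m_i}{N_k + \beta}$,   $P(k \mid d, \mathbf{w}, \mathbf{z}, \alpha\mathbf{n}) = \frac{N_{k|d} + \alpha n_k}{N_d + \alpha}$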
Bigram Topic Model
• We would like to create a model which considers both topics
(like LDA) and word order (like bigram language models).
• We accomplish this by using a simple extension of LDA.
• We specify a new conditional distribution for word generation
  $P(w_t = i \mid w_{t-1} = j, z_t = k) = \phi_{i|j,k}$
• These parameters form a matrix $\Phi$ that can be viewed as a stack of “planes”, one per topic, where each plane can be thought of as the characteristic transition matrix for that topic. j, as before, defines the context or leading word of a word pair; i is the word index of the trailing word; k is the topic plane index.
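• Topic generation is identical to LDA: $P(z_t = k \mid d_t = d) = \theta_{k|d}$
• We place a Dirichlet prior on the topic generation parameters
  $P(\Theta \mid \alpha\mathbf{n}) = \prod_d \mathrm{Dirichlet}(\theta_d \mid \alpha\mathbf{n})$
• Then the joint probability of the corpus $\mathbf{w}$ and a single set of latent topics $\mathbf{z}$ is
  $P(\mathbf{w}, \mathbf{z} \mid \Phi, \Theta) = \prod_k \prod_j \prod_i \phi_{i|j,k}^{N_{i|j,k}} \prod_d \prod_k \theta_{k|d}^{N_{k|d}}$
  where $N_{i|j,k}$ is the number of times topic k has generated word i immediately after word j.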
• In both the Hierarchical Dirichlet Language Model and LDA, the prior over $\Phi$ (either the context matrix or the topic matrix) is coupled in the sense that the hyperparameter vector $\beta\mathbf{m}$ is shared between all possible contexts or topics.
• However, in this model, because word generation depends on both topic k and context j, there are two possible priors on $\Phi$
1) Global sharing
  $P(\Phi \mid \beta\mathbf{m}) = \prod_j \prod_k \mathrm{Dirichlet}(\phi_{j,k} \mid \beta\mathbf{m})$
Here, a single set of hyperparameters is shared across all contexts in all topics. This leads to a simpler formulation.
2) Topic level sharing
  $P(\Phi \mid \{\beta_k\mathbf{m}_k\}) = \prod_j \prod_k \mathrm{Dirichlet}(\phi_{j,k} \mid \beta_k\mathbf{m}_k)$
More intuitively, we allow each topic k to have its own set of hyperparameters, shared only by the contexts in that topic.
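• We are now in a position to describe the generative process of the Bigram Topic Model
– For each topic k and each context word j, draw $\phi_{j,k}$ from the prior on $\Phi$ (Prior 1 or Prior 2)
– For each document d, draw $\theta_d$ from $\mathrm{Dirichlet}(\alpha\mathbf{n})$
– For each position t in document d, draw a topic $z_t$ from $\theta_d$, then draw a word $w_t$ from $\phi_{w_{t-1}, z_t}$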
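The generative process of the Bigram Topic Model can be sketched in a few lines: a word distribution is drawn for every (context, topic) pair, a topic distribution is drawn for every document, and each word is then drawn given the previous word and a freshly drawn topic. This sketch assumes Prior 1 (one global hyperparameter vector), and the arbitrary choice of the first context word in each document is an assumption of the sketch. The function names are hypothetical.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [x / total for x in draws]


def sample_index(probs, rng):
    """Draw an index from a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(W, T, doc_lengths, alpha_n, beta_m, seed=0):
    """Generate (words, topics) per document from the bigram topic model.

    W: vocabulary size, T: number of topics.
    alpha_n: list of T values alpha * n_k; beta_m: list of W values beta * m_i.
    """
    rng = random.Random(seed)
    # phi[k][j] is the distribution over the next word given context j, topic k.
    phi = [[sample_dirichlet(beta_m, rng) for _ in range(W)] for _ in range(T)]
    docs = []
    for length in doc_lengths:
        theta = sample_dirichlet(alpha_n, rng)  # topic distribution of this document
        prev = rng.randrange(W)  # assumption: arbitrary initial context word
        words, topics = [], []
        for _ in range(length):
            k = sample_index(theta, rng)         # z_t ~ theta_d
            i = sample_index(phi[k][prev], rng)  # w_t ~ phi_{w_{t-1}, z_t}
            words.append(i)
            topics.append(k)
            prev = i
        docs.append((words, topics))
    return docs


corpus = generate_corpus(W=5, T=3, doc_lengths=[10, 7],
                         alpha_n=[0.5] * 3, beta_m=[0.2] * 5, seed=1)
```

Switching to Prior 2 would only change the construction of `phi`, where each topic k would use its own concentration vector instead of the shared `beta_m`.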
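• Combining (12), (13) and Prior 1 (the joint probability, the prior on $\Theta$, and the global prior on $\Phi$) and integrating out $\Phi$ and $\Theta$, we arrive at the evidence
  $P(\mathbf{w} \mid \alpha\mathbf{n}, \beta\mathbf{m}) = \sum_{\mathbf{z}} \prod_j \prod_k \frac{\prod_i \Gamma(N_{i|j,k} + \beta m_i)}{\Gamma(N_{j,k} + \beta)} \, \frac{\Gamma(\beta)}{\prod_i \Gamma(\beta m_i)} \prod_d \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\Gamma(N_d + \alpha)} \, \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha n_k)}$
• Alternatively, combining (12), (13), and Prior 2 and integrating out $\Phi$ and $\Theta$ gives
  $P(\mathbf{w} \mid \alpha\mathbf{n}, \{\beta_k\mathbf{m}_k\}) = \sum_{\mathbf{z}} \prod_j \prod_k \frac{\prod_i \Gamma(N_{i|j,k} + \beta_k m_{i|k})}{\Gamma(N_{j,k} + \beta_k)} \, \frac{\Gamma(\beta_k)}{\prod_i \Gamma(\beta_k m_{i|k})} \prod_d \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\Gamma(N_d + \alpha)} \, \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha n_k)}$
• Again, the summation over $\mathbf{z}$ is intractable, so as before, we utilize approximation methods.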
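• Given optimum hyperparameters, the predictive distributions over words given the previous word j and topic k are
Prior 1:  $P(i \mid j, k, \mathbf{w}, \mathbf{z}, \beta\mathbf{m}) = \frac{N_{i|j,k} + \beta m_i}{N_{j,k} + \beta}$
Prior 2:  $P(i \mid j, k, \mathbf{w}, \mathbf{z}, \{\beta_k\mathbf{m}_k\}) = \frac{N_{i|j,k} + \beta_k m_{i|k}}{N_{j,k} + \beta_k}$
• The predictive distribution of the topic k given document d is
  $P(k \mid d, \mathbf{w}, \mathbf{z}, \alpha\mathbf{n}) = \frac{N_{k|d} + \alpha n_k}{N_d + \alpha}$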
Inference of Hyperparameters
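• A Gibbs EM algorithm is employed to find the optimal $\alpha\mathbf{n}$ and either $\beta\mathbf{m}$ (Prior 1) or $\{\beta_k\mathbf{m}_k\}$ (Prior 2)
• We summarize the EM algorithm below
– E-step: draw samples of $\mathbf{z}$ given $\mathbf{w}$ and the current hyperparameters using Gibbs sampling
– M-step: update the hyperparameters to maximize the log evidence averaged over the samples, then repeat until convergence
• The Gibbs sampler draws each $z_t$ from its conditional distribution given all other topic assignments, with counts computed excluding position t
Prior 1:  $P(z_t = k \mid \mathbf{z}_{-t}, \mathbf{w}) \propto \frac{N_{w_t|w_{t-1},k} + \beta m_{w_t}}{N_{w_{t-1},k} + \beta} \cdot \frac{N_{k|d_t} + \alpha n_k}{N_{d_t} + \alpha}$
Prior 2:  the same expression with $\beta m_{w_t}$ replaced by $\beta_k m_{w_t|k}$ and $\beta$ replaced by $\beta_k$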
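The Gibbs sampling half of the Gibbs EM algorithm can be illustrated with a single sweep over the topic assignments. The following is a minimal sketch under Prior 1, assuming each topic is resampled in proportion to (N_{i|j,k} + β m_i) / (N_{j,k} + β) × (N_{k|d} + α n_k), with counts that exclude the current position. The hyperparameter update (M-step) is omitted, the start-of-document context symbol is an assumption of the sketch, and the function name `gibbs_sweep` is hypothetical.

```python
import random
from collections import defaultdict


def gibbs_sweep(docs, z, T, alpha_n, beta_m, rng):
    """Resample every topic assignment once, in place (Prior 1).

    docs: list of documents, each a list of word ids in range(len(beta_m)).
    z: topic assignments with the same shape as docs.
    """
    W = len(beta_m)
    beta = sum(beta_m)
    START = W  # assumption: extra symbol used as context for the first word
    n_ijk = defaultdict(int)  # N_{i|j,k}, keyed as (i, j, k)
    n_jk = defaultdict(int)   # N_{j,k}, keyed as (j, k)
    n_kd = [[0] * T for _ in docs]  # N_{k|d}

    # Build the count tables from the current assignments.
    for d, words in enumerate(docs):
        prev = START
        for t, i in enumerate(words):
            k = z[d][t]
            n_ijk[(i, prev, k)] += 1
            n_jk[(prev, k)] += 1
            n_kd[d][k] += 1
            prev = i

    for d, words in enumerate(docs):
        prev = START
        for t, i in enumerate(words):
            # Remove position t from the counts.
            k_old = z[d][t]
            n_ijk[(i, prev, k_old)] -= 1
            n_jk[(prev, k_old)] -= 1
            n_kd[d][k_old] -= 1
            # Unnormalized conditional; the document-length denominator
            # is the same for every k and is therefore dropped.
            weights = [
                (n_ijk[(i, prev, k)] + beta_m[i]) / (n_jk[(prev, k)] + beta)
                * (n_kd[d][k] + alpha_n[k])
                for k in range(T)
            ]
            k_new = rng.choices(range(T), weights=weights, k=1)[0]
            # Add position t back with its new topic.
            z[d][t] = k_new
            n_ijk[(i, prev, k_new)] += 1
            n_jk[(prev, k_new)] += 1
            n_kd[d][k_new] += 1
            prev = i
    return z


docs = [[0, 1, 2, 1, 0], [2, 2, 1]]
z = [[0] * len(doc) for doc in docs]
z = gibbs_sweep(docs, z, T=2, alpha_n=[0.5, 0.5], beta_m=[0.1, 0.1, 0.1],
                rng=random.Random(0))
```

Rebuilding the count tables on every sweep keeps the sketch short; a practical sampler would keep them between sweeps.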
• Two 150-document datasets were used.
– 150 random abstracts from the Psychological Abstracts dataset (100 training, 50 test).
• 1374 word dictionary
– 150 random postings from the 20 Newsgroups dataset (100 training, 50 test).
• 2281 word dictionary
• All words occurring only once were removed, along with punctuation and numbers.
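[Figure: Information rate (bits per word) as a function of the number of topics, with results from the Psychological Abstracts dataset on the left and the 20 Newsgroups dataset on the right.]
Information rate is computed as $R = -\log_2 P(\mathbf{w}) / N$, where $P(\mathbf{w})$ is the probability the trained model assigns to the test data $\mathbf{w}$ and N is the number of words in the test data.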
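As a small worked example of the evaluation measure, the sketch below assumes the information rate is the negative base-2 log probability of the test words divided by the number of test words. The per-word probabilities are made-up illustrative values, not results from the paper, and the function name `information_rate` is hypothetical.

```python
import math


def information_rate(word_probs):
    """Bits per word: -(1/N) * sum over test words of log2 p(word)."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)


# Illustrative predictive probabilities for a three-word test sequence.
rate = information_rate([0.5, 0.25, 0.125])
print(rate)  # prints: 2.0
```

A lower information rate means the model finds the test data less surprising, so it indicates better predictive performance.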
Future Work / Conclusion
• Another possible prior over $\Phi$ would be similar to Prior 2, but it would impose sharing of hyperparameters among contexts rather than topics
– That is, among all word pairs which share the same leading word.
• It is not entirely clear whether this approach would result in any improvement. Further, the computational complexity of this approach is much greater than that of Prior 2.
• The bigram topic model shows improved predictive performance (a lower information rate) compared to both the bigram language model and LDA, and is an encouraging direction of research.
• It is much more feasible to consider word level models when
word order is not ignored.