Topic Modelling: Beyond Bag of Words
By Hanna M. Wallach, ICML 2006
Presented by Eric Wang, April 25th 2008

Outline
• Introduction / Motivation
• Bigram language models (MacKay & Peto, 1995)
• Unigram topic models (LDA, Blei et al.)
• Bigram Topic Model
• Results
• Conclusion

Introduction
• Generative models of text fall into two major categories:
– Bigram language models: generate each word conditioned on the word that precedes it.
– Unigram (bag-of-words) topic models: generate words from latent topics inferred from word and document correlations.
• Unigram topic models are independent of word order, while bigram models consider pairs of words, with the leading word defining a “context”.
• Is word order important? Consider the following example:
– “The department chair couches offers.”
– “The chair department offers couches.”
• To a unigram model the two sentences are identical, but to a reader they are not the same. A great deal of semantic information must therefore reside in word order. A bigram model sees the two sentences as different.

Bigram language models: Hierarchical Dirichlet Language Model
• Bigram language models are specified by the conditional distribution $P(w_t = i \mid w_{t-1} = j) = \phi_{i|j}$.
• The matrix $\Phi$ can be thought of as a transition probability matrix.
• Given a corpus $\mathbf{w}$, the likelihood function is
  $P(\mathbf{w} \mid \Phi) = \prod_j \prod_i \phi_{i|j}^{N_{i|j}}$  (1)
• The prior on $\Phi$ is
  $P(\Phi \mid \beta\mathbf{m}) = \prod_j \mathrm{Dirichlet}(\phi_j \mid \beta\mathbf{m})$  (2)
• Combining (1) and (2) and integrating out the transition matrix $\Phi$ yields the evidence of $\mathbf{w}$ conditioned on the hyperparameters $\beta\mathbf{m}$:
  $P(\mathbf{w} \mid \beta\mathbf{m}) = \prod_j \frac{\prod_i \Gamma(N_{i|j} + \beta m_i)}{\Gamma(N_j + \beta)} \, \frac{\Gamma(\beta)}{\prod_i \Gamma(\beta m_i)}$  (3)
• We can also obtain the predictive distribution
  $P(i \mid j, \mathbf{w}, \beta\mathbf{m}) = \frac{N_{i|j} + \beta m_i}{N_j + \beta}$  (4)
• Here $N_{i|j}$ is the number of times word i follows word j in the corpus, and $N_j$ is the number of times word j appears in the corpus. We say that word j, being the leading word of the two-word pair, sets a “context”, which is analogous to factors and topics in other models.
• MacKay and Peto showed that the optimal $\beta\mathbf{m}$ is found using the empirical Bayes method, that is, by maximizing the evidence (3).
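The smoothed bigram predictive distribution described above, (observed bigram count + β·m_i) / (context count + β), can be illustrated with a small sketch. The toy sentence, the uniform base measure `m`, and the value of `beta` are assumptions chosen only for illustration; they are not fitted hyperparameters. The context count here counts word j as a leading word, so that the probabilities over the next word sum to one.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count bigrams and contexts in a token list.

    pair_counts[(j, i)] = number of times word i immediately follows word j.
    context_counts[j]   = number of times word j occurs as a leading word.
    """
    pair_counts = Counter(zip(tokens[:-1], tokens[1:]))
    context_counts = Counter(tokens[:-1])
    return pair_counts, context_counts

def predictive(i, j, pair_counts, context_counts, beta, m):
    """Smoothed estimate of P(next word = i | previous word = j)."""
    return (pair_counts[(j, i)] + beta * m[i]) / (context_counts[j] + beta)

# Toy corpus and hyperparameters (assumed values, for illustration only).
tokens = "the chair offers the chair couches".split()
vocab = sorted(set(tokens))
m = {w: 1.0 / len(vocab) for w in vocab}  # uniform base measure (assumption)
beta = 2.0                                # concentration (assumption)

pair_counts, context_counts = bigram_counts(tokens)
p_chair_given_the = predictive("chair", "the", pair_counts, context_counts, beta, m)
total = sum(predictive(w, "the", pair_counts, context_counts, beta, m) for w in vocab)
```

With these toy values, "chair" follows "the" in both of its occurrences, so most of the probability mass goes to "chair", while the β·m_i term keeps unseen continuations from receiving zero probability.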
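
Unigram topic models: Latent Dirichlet Allocation
• Latent Dirichlet Allocation does not consider word order.
• The matrices $\Phi$ and $\Theta$ govern the word emission conditioned on topic, and the topic emission conditioned on document, respectively:
  $P(w_t = i \mid z_t = k) = \phi_{i|k}$,  $P(z_t = k \mid d_t = d) = \theta_{k|d}$
• Here t is the word index within the corpus, i is the word index in the dictionary, k is the topic index, and d is the document index.
• Therefore, the joint probability of the corpus $\mathbf{w}$ and the set of latent topics $\mathbf{z}$ is
  $P(\mathbf{w}, \mathbf{z} \mid \Phi, \Theta) = \prod_k \prod_i \phi_{i|k}^{N_{i|k}} \prod_d \prod_k \theta_{k|d}^{N_{k|d}}$  (5)
• Here $N_{i|k}$ is the number of times topic k has generated word i, and $N_{k|d}$ is the number of times topic k was generated in document d.
• We place Dirichlet priors on $\Phi$ and $\Theta$:
  $P(\Phi \mid \beta\mathbf{m}) = \prod_k \mathrm{Dirichlet}(\phi_k \mid \beta\mathbf{m})$  (6)
  $P(\Theta \mid \alpha\mathbf{n}) = \prod_d \mathrm{Dirichlet}(\theta_d \mid \alpha\mathbf{n})$  (7)
• Combining (5), (6) and (7), and integrating out $\Phi$ and $\Theta$, gives the evidence
  $P(\mathbf{w} \mid \alpha\mathbf{n}, \beta\mathbf{m}) = \sum_{\mathbf{z}} \prod_k \frac{\prod_i \Gamma(N_{i|k} + \beta m_i)}{\Gamma(N_k + \beta)} \, \frac{\Gamma(\beta)}{\prod_i \Gamma(\beta m_i)} \prod_d \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\Gamma(N_d + \alpha)} \, \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha n_k)}$  (8)
– $N_{i|k}$: number of times topic k generates word i
– $N_{k|d}$: number of times topic k appears in document d
– $N_k$: number of times topic k appears in $\mathbf{z}$
– $N_d$: number of words in document d
• However, the sum over $\mathbf{z}$ in (8) is intractable, so approximation methods (MCMC, VB) are used to get around this issue.
• Assuming optimal hyperparameters $\alpha\mathbf{n}$ and $\beta\mathbf{m}$, the approximate predictive distributions for word i given topic k, and for topic k given document d, are
  $P(i \mid k, \mathbf{w}, \mathbf{z}, \beta\mathbf{m}) = \frac{N_{i|k} + \beta m_i}{N_k + \beta}$  (9)
  $P(k \mid d, \mathbf{w}, \mathbf{z}, \alpha\mathbf{n}) = \frac{N_{k|d} + \alpha n_k}{N_d + \alpha}$  (10)

Bigram Topic Model
• We would like to create a model which considers both topics (like LDA) and word order (like bigram language models).
• We accomplish this by using a simple extension of LDA.
• We specify a new conditional distribution for word generation:
  $P(w_t = i \mid w_{t-1} = j, z_t = k) = \phi_{i|j,k}$  (11)
• These parameters form a three-dimensional array $\Phi$ in which each “plane” can be thought of as the characteristic transition matrix for one topic. As before, j defines the context, or leading word, of a word pair, and i is the word index of the trailing word.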
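One way to picture the topic "planes" is as a nested array indexed by topic, then leading word, then trailing word, where every row is a probability distribution over the next word. The sketch below is only an illustration under assumed sizes and hyperparameters (`T`, `W`, and `beta_m` are hypothetical), with each row drawn from a Dirichlet distribution via normalized Gamma draws.

```python
import random

def dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def make_phi(num_topics, vocab_size, beta_m, rng):
    """Build phi[k][j][i] = P(w_t = i | w_{t-1} = j, z_t = k).

    Each phi[k] is one topic "plane": a vocab_size x vocab_size
    transition matrix whose rows each sum to one.
    """
    return [[dirichlet(beta_m, rng) for _ in range(vocab_size)]
            for _ in range(num_topics)]

rng = random.Random(0)
T, W = 2, 4              # hypothetical number of topics and vocabulary size
beta_m = [0.5] * W       # hypothetical hyperparameters shared by every row
phi = make_phi(T, W, beta_m, rng)

# Sample the trailing word given leading word j and topic k.
j, k = 1, 0
next_word = rng.choices(range(W), weights=phi[k][j])[0]
```

Using the same `beta_m` for every row corresponds to sharing one set of hyperparameters across all contexts and topics; giving each topic its own vector would only change the argument passed for each plane.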
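[Figure: the parameter array drawn as a stack of topic “planes”; k is the topic plane index, j indexes the leading word and i the trailing word.]
• Topic generation is identical to LDA: $P(z_t = k \mid d_t = d) = \theta_{k|d}$.
• We place a Dirichlet prior on the topic generation parameters $\Theta$:
  $P(\Theta \mid \alpha\mathbf{n}) = \prod_d \mathrm{Dirichlet}(\theta_d \mid \alpha\mathbf{n})$  (12)
• Then the joint probability of the corpus $\mathbf{w}$ and a single set of latent topics $\mathbf{z}$ is
  $P(\mathbf{w}, \mathbf{z} \mid \Phi, \Theta) = \prod_k \prod_j \prod_i \phi_{i|j,k}^{N_{i|j,k}} \prod_d \prod_k \theta_{k|d}^{N_{k|d}}$  (13)
• Here $N_{i|j,k}$ is the number of times word i has been generated by topic k when preceded by word j, and $N_{k|d}$ is the number of times topic k was generated in document d.
• In both the Hierarchical Dirichlet Language Model and LDA, the priors over the rows of $\Phi$ (the context matrix or the topic matrix, respectively) are coupled, in the sense that the hyperparameter vector $\beta\mathbf{m}$ is shared between all possible contexts or topics.
• In this model, however, $\Phi$ depends on both topic k and context j, so there are two possible priors on $\Phi$:
1) Global sharing
  $P(\Phi \mid \beta\mathbf{m}) = \prod_j \prod_k \mathrm{Dirichlet}(\phi_{j,k} \mid \beta\mathbf{m})$
  Here, a single set of hyperparameters is shared across all contexts in all topics. This leads to a simpler formulation.
2) Topic level sharing
  $P(\Phi \mid \{\beta_k\mathbf{m}_k\}) = \prod_j \prod_k \mathrm{Dirichlet}(\phi_{j,k} \mid \beta_k\mathbf{m}_k)$
  More intuitively, we allow each topic k to have its own set of hyperparameters, shared only by the contexts in that topic.
• We are now in a position to describe the generative process of the Bigram Topic Model:
1) For each topic k and each context word j, draw $\phi_{j,k}$ from the chosen prior (Prior 1 or Prior 2).
2) For each document d, draw $\theta_d$ from $\mathrm{Dirichlet}(\theta_d \mid \alpha\mathbf{n})$. Then, for each position t in document d, draw a topic $z_t$ from $\theta_d$ and draw a word $w_t$ from $\phi_{w_{t-1}, z_t}$.
• Combining (12), (13) and Prior 1, and integrating out $\Phi$ and $\Theta$, we arrive at the evidence
  $P(\mathbf{w} \mid \alpha\mathbf{n}, \beta\mathbf{m}) = \sum_{\mathbf{z}} \prod_j \prod_k \frac{\prod_i \Gamma(N_{i|j,k} + \beta m_i)}{\Gamma(N_{j,k} + \beta)} \, \frac{\Gamma(\beta)}{\prod_i \Gamma(\beta m_i)} \prod_d \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\Gamma(N_d + \alpha)} \, \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha n_k)}$  (14)
• Alternatively, combining (12), (13) and Prior 2, and integrating out $\Phi$ and $\Theta$, gives
  $P(\mathbf{w} \mid \alpha\mathbf{n}, \{\beta_k\mathbf{m}_k\}) = \sum_{\mathbf{z}} \prod_j \prod_k \frac{\prod_i \Gamma(N_{i|j,k} + \beta_k m_{i|k})}{\Gamma(N_{j,k} + \beta_k)} \, \frac{\Gamma(\beta_k)}{\prod_i \Gamma(\beta_k m_{i|k})} \prod_d \frac{\prod_k \Gamma(N_{k|d} + \alpha n_k)}{\Gamma(N_d + \alpha)} \, \frac{\Gamma(\alpha)}{\prod_k \Gamma(\alpha n_k)}$  (15)
• Here $N_{j,k} = \sum_i N_{i|j,k}$ and $N_d$ is the number of words in document d.
• Again, the summation over $\mathbf{z}$ is intractable, so as before we utilize approximations.
• Given optimal hyperparameters, the predictive distributions over words i given the previous word j and topic k are
  Prior 1:  $P(i \mid j, k, \mathbf{w}, \mathbf{z}, \beta\mathbf{m}) = \frac{N_{i|j,k} + \beta m_i}{N_{j,k} + \beta}$  (16)
  Prior 2:  $P(i \mid j, k, \mathbf{w}, \mathbf{z}, \beta_k\mathbf{m}_k) = \frac{N_{i|j,k} + \beta_k m_{i|k}}{N_{j,k} + \beta_k}$  (17)
• The predictive distribution of topic k given document d is
  $P(k \mid d, \mathbf{w}, \mathbf{z}, \alpha\mathbf{n}) = \frac{N_{k|d} + \alpha n_k}{N_d + \alpha}$  (18)

Inference of Hyperparameters
• A Gibbs EM algorithm is employed to find the optimal hyperparameters: $\alpha\mathbf{n}$ and either $\beta\mathbf{m}$ (Prior 1) or $\{\beta_k\mathbf{m}_k\}$ (Prior 2).
• We summarize the EM algorithm below:
– E-step: with the current hyperparameters held fixed, draw samples of the topic assignments $\mathbf{z}$ using a Gibbs sampler. With counts taken excluding position t, the sampling distributions are
  Prior 1:  $P(z_t = k \mid \mathbf{z}_{\setminus t}, \mathbf{w}) \propto \frac{N_{w_t|w_{t-1},k} + \beta m_{w_t}}{N_{w_{t-1},k} + \beta} \cdot \frac{N_{k|d_t} + \alpha n_k}{N_{d_t} + \alpha}$
  Prior 2:  $P(z_t = k \mid \mathbf{z}_{\setminus t}, \mathbf{w}) \propto \frac{N_{w_t|w_{t-1},k} + \beta_k m_{w_t|k}}{N_{w_{t-1},k} + \beta_k} \cdot \frac{N_{k|d_t} + \alpha n_k}{N_{d_t} + \alpha}$
– M-step: update the hyperparameters to maximize the average of $\log P(\mathbf{w}, \mathbf{z} \mid \text{hyperparameters})$ over the samples, where the hyperparameters are $\{\alpha\mathbf{n}, \beta\mathbf{m}\}$ or $\{\alpha\mathbf{n}, \{\beta_k\mathbf{m}_k\}\}$.

Results
• Two 150-document datasets were used:
– 150 random abstracts from the Psychological Abstracts dataset (100 training, 50 test).
  • 1374-word dictionary
– 150 random postings from the 20 Newsgroups dataset (100 training, 50 test).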
  • 2281-word dictionary
• All words occurring only once were removed, along with punctuation and numbers.
[Figure: information rate (bits per word) as a function of the number of topics, with results from the Psychological Abstracts dataset on the left and the 20 Newsgroups dataset on the right.]
• The information rate of the test data $\mathbf{w}$ given the training data $\mathbf{w}_{\text{train}}$ is computed as
  $R = -\frac{\log_2 P(\mathbf{w} \mid \mathbf{w}_{\text{train}})}{N}$
  where N is the number of words in the test data.

Future Work / Conclusion
• Another possible prior over $\Phi$ would be similar to Prior 2, but it would impose sharing of hyperparameters among contexts rather than among topics.
– That is, the hyperparameters would be shared by all word pairs which have the same leading word.
• It is not entirely clear whether this approach would result in any improvement. Further, its computational complexity is much greater than that of Prior 2.
• The bigram topic model shows improved performance compared to both the bigram language model and LDA, and is an encouraging direction of research.
• It is much more feasible to consider word-level models when word order is not ignored.
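The information rate used in the Results is the average negative base-2 log probability per test word. A minimal sketch follows, assuming the per-token predictive probabilities have already been produced by some model; the function name `information_rate` and the probabilities below are made up for illustration and are not results from the paper.

```python
import math

def information_rate(token_probs):
    """Average negative log2 probability per token, in bits per word."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Hypothetical predictive probabilities for four test tokens.
probs = [0.5, 0.25, 0.125, 0.125]
rate = information_rate(probs)  # (1 + 2 + 3 + 3) / 4 bits per word
```

Lower values mean the model finds the held-out text less surprising. A model that spreads probability uniformly over a W-word dictionary would score log2(W) bits per word, which gives a useful reference point.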