Finding Scientific Topics
August 26 - 31, 2011

Topic Modeling

1. A document is a probabilistic mixture of topics.
2. A topic is a probability distribution over words.
3. Words are assumed known and the number of words is fixed.

If there are T topics, the probability of a word w is

$$P(w) = \sum_{j=1}^{T} P(w \mid z = j)\, P(z = j)$$

Here, {w} denote words and {z} denote topics. The conditional probability P(w | z) indicates which words are important to a topic. For a particular document, P(z), the distribution over topics, determines how these topics are mixed together in forming the document.

An example of “soft classification”: a document is not assigned only one topic (a single class). For a document, P(z) gives an indication of which topics should be assigned to it.

How to think about and visualize P(z) and P(w | z)?
(Figure labels: words, topics.)

What do we want to know? What do we want to compute from the input data?

Inputs:
a. A document or a collection of documents, given as the list of words appearing in it, {w1, …, wn} (repetition allowed, perhaps with unimportant words deleted, such as the articles ‘the’ and ‘an’ and the prepositions ‘on’ and ‘of’).
b. The number of topics, T.

We want to compute P(z) and P(w | z) for each topic z. There is one P(z) (or D of them, where D is the number of documents) and there are T distributions P(w | z = j).

What form should P(z) and P(w | z) take?
Multinomial distributions (http://en.wikipedia.org/wiki/Multinomial_distribution). What is this?
Each P(z) is a non-negative vector with T components (with sum = 1).
Each P(w | z) is a non-negative vector with W components (with sum = 1).

One possible solution: estimate P(z) and P(w | z) directly by maximizing the likelihood of the observed words.
Question: how many variables are here?
Problem with this approach: local maxima and slow convergence.

Bayesian Approach

Estimate phi and theta indirectly via the following generative model, where theta stands for P(z) and phi^(j) for P(w | z = j):

$$\theta \sim \mathrm{Dirichlet}(\alpha), \qquad \phi^{(j)} \sim \mathrm{Dirichlet}(\beta), \qquad z_i \mid \theta \sim \mathrm{Discrete}(\theta), \qquad w_i \mid z_i, \phi \sim \mathrm{Discrete}(\phi^{(z_i)})$$

Dirichlet distribution: http://en.wikipedia.org/wiki/Dirichlet_distribution

What does this generative model say? It gives the way in which the observed data are thought to be generated. Where is the prior?

Idea: use the generative model to explain the input data.
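The generative model can be made concrete with a short simulation. The sketch below is only an illustration under stated assumptions: it takes the usual form of the model (theta drawn from a symmetric Dirichlet with parameter alpha, each topic phi drawn from a symmetric Dirichlet with parameter beta, then a topic and a word drawn for every position), uses a single document, and codes words and topics as integers. The names sample_dirichlet and generate_document, and the dictionary size V, are hypothetical and do not come from the slides.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one point on the simplex by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, T, V, alpha, beta, rng):
    """Generate one document of n_words word ids from the assumed model."""
    # theta plays the role of P(z): one weight per topic.
    theta = sample_dirichlet([alpha] * T, rng)
    # phi[j] plays the role of P(w | z = j): one weight per dictionary word.
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(T)]
    # Draw a topic for every position, then a word from that topic.
    topics = rng.choices(range(T), weights=theta, k=n_words)
    words = [rng.choices(range(V), weights=phi[z], k=1)[0] for z in topics]
    return theta, phi, topics, words

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, phi, topics, words = generate_document(
    n_words=10, T=3, V=3, alpha=1.0, beta=1.0, rng=rng
)
print("theta:", theta)
print("topics:", topics)
print("words:", words)
```

Inference runs in the opposite direction: only the word list is observed, and theta, phi and the topic assignments have to be recovered from it.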
Alpha and beta are the hyperparameters.

The goal is to evaluate the posterior distribution

$$P(z \mid w) = \frac{P(w, z)}{\sum_{z} P(w, z)}$$

This is difficult because the denominator cannot be computed. (Know very well what the notations Z and W stand for.)

However, for the numerator P(w, z) = P(w | z) P(z), we do have

$$P(z_1, \dots, z_W \mid \theta) = P(z_1 \mid \theta) \cdots P(z_W \mid \theta)$$

(assuming conditional independence of the zi given theta). Integrating theta out against its Dirichlet prior gives Equation 3 (with D = 1, one document). Equation 2 can be obtained similarly.

But what can we do with P(z | w)? Recall that our goal is to estimate theta (the topic proportions) and phi (the topics).

Suppose we know the true topic assignments (z1, …, zW). Then theta can be estimated as

theta_i = (number of words assigned topic i) / (number of total words, W)

How about phi?

More About Dirichlet Distribution

The Dirichlet distribution is defined on the standard simplex of T-component vectors (theta_1, …, theta_T) with theta_i >= 0 and theta_1 + … + theta_T = 1.

The density function for the Dirichlet distribution with parameters alpha_1, …, alpha_T > 0 is

$$p(\theta \mid \alpha_1, \dots, \alpha_T) = \frac{\Gamma\left(\sum_{i=1}^{T} \alpha_i\right)}{\prod_{i=1}^{T} \Gamma(\alpha_i)} \prod_{i=1}^{T} \theta_i^{\alpha_i - 1}$$

When alpha_i is close to zero, probability concentrates near theta_i = 0. On the other hand, when alpha_i is large, probability moves away from theta_i = 0. An example with T = 3: (figure).

The expected value and variance of each component theta_i are given by the formulas

$$E[\theta_i] = \frac{\alpha_i}{\alpha_0}, \qquad \mathrm{Var}[\theta_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}, \qquad \alpha_0 = \sum_{k=1}^{T} \alpha_k$$

Of course, we don’t know the true topic assignments, but only the distribution P(Z | W). We need to know P(theta | Z, W). By Bayes’ rule, we have

$$P(\theta \mid Z, W) \propto P(Z \mid \theta)\, P(\theta \mid \alpha) \propto \prod_{i=1}^{T} \theta_i^{n_i + \alpha_i - 1}$$

where n_i is the number of words assigned topic i. Therefore, P(theta | Z, W) is another Dirichlet distribution, with parameters alpha_i + n_i.

For a given Z, what should the estimated theta be? Taking the expected value of this Dirichlet distribution,

$$\hat{\theta}_i = \frac{n_i + \alpha_i}{W + \sum_{k=1}^{T} \alpha_k}$$

This gives Equation 6 (and 7 similarly).
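The two estimates of theta discussed above can be compared numerically. The sketch below assumes a single document, a symmetric prior (every alpha_i equal to alpha), and that the estimate intended for a given Z is the mean of the posterior Dirichlet distribution. The assignment vector z is a made-up input for illustration, and the function names are hypothetical.

```python
from collections import Counter

def empirical_theta(z, T):
    """theta_i = (words assigned topic i) / (total words), with no prior."""
    counts = Counter(z)
    return [counts[j] / len(z) for j in range(1, T + 1)]

def posterior_mean_theta(z, T, alpha):
    """Mean of the posterior Dirichlet(alpha + n_1, ..., alpha + n_T)."""
    counts = Counter(z)
    return [(counts[j] + alpha) / (len(z) + T * alpha) for j in range(1, T + 1)]

def dirichlet_mean_var(alphas):
    """Component-wise mean and variance of a Dirichlet distribution."""
    a0 = sum(alphas)
    mean = [a / a0 for a in alphas]
    var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alphas]
    return mean, var

# Hypothetical topic assignments for a 10-word document with T = 3 topics.
z = [1, 1, 2, 2, 3, 1, 2, 3, 1, 1]
print(empirical_theta(z, T=3))
print(posterior_mean_theta(z, T=3, alpha=1.0))
# The posterior mean is the mean of a Dirichlet with parameters alpha + n_i.
print(dirichlet_mean_var([1.0 + 5, 1.0 + 3, 1.0 + 2])[0])
```

With alpha = 1 the prior acts like one extra pseudo-word per topic, so the posterior mean is pulled slightly toward the uniform distribution; the pull weakens as the number of words grows.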
The point, of course, is that we don’t know the exact topic assignment (z1, …, zW), but only its distribution P(z | w).

For a probability distribution P(x), the expectation of a function f can be estimated as

$$E[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(y_i)$$

where y1, …, yN are samples from P(x). For example, we can use this formula to estimate the mean and variance of a distribution from its samples. More samples make the right-hand side a more accurate estimate.

Since the posterior distribution P(z | w) cannot be computed directly, we use Markov Chain Monte Carlo (MCMC) to simulate it. What this means is that we want to draw samples from the distribution P(z | w); for each sample generated from P(z | w), we have one estimate of theta and phi according to Equations 6 and 7.

Simulating P(z | w) using MCMC (Markov Chain Monte Carlo)

Much more on this later…

The steps are
1. Initialize the topic assignments (z1, …, zW) at s = 0.
2. Do (say, three thousand iterations): for each i = 1, …, W, change the current assignment for zi according to the probability

$$P(z_i = j \mid z_{-i}, w) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}$$

   where, not counting the current assignment of zi, n^{(w_i)}_{-i,j} is the number of times word wi is assigned topic j, n^{(.)}_{-i,j} is the total number of words assigned topic j, n^{(d_i)}_{-i,j} is the number of words in the document assigned topic j, and n^{(d_i)}_{-i,.} is the total number of words in the document.
   One cycle through all i gives a new topic assignment (z1, …, zW) at s = s + 1.
   What does the formula say?
3. Generate samples.

Example: Suppose there are 3 topics (T = 3), there are 3 words (A, B, C) in the dictionary, and the word list has 10 words (W = 10). Take alpha = beta = 1.

The word list is {A, A, C, B, C, A, C, A, B, B}, with the initial topic assignment {1, 1, 2, 2, 3, 1, 2, 3, 1, 1}.

How do we apply the formula? The first word is A; in the word list, A has been assigned to topics 1, 1, 1, and 3.

P(z1 | Z-1, W) = (0.5159, 0.1848, 0.3003)

Suppose the value sampled for z1 is 3. The assignment becomes {3, 1, 2, 2, 3, 1, 2, 3, 1, 1}.

Next, compute P(z2 | Z-2, W) and sample a new z2 value.

What are the effects of alpha and beta? (prior and large W)

Summary

The goal is to infer theta (Dirichlet) from the data, with theta itself a distribution. The Dirichlet distribution is a prior distribution on theta (a Bayesian approach). Therefore, it is a distribution on the space of distributions. No particular form of theta is assumed (nonparametric).
The base probability space is finite and discrete: X = {1, …, T}. Things become much more complicated when X is no longer discrete, for example when X is the set of real numbers. We need more sophisticated mathematical language, and that will be the goal for the next two to three weeks.
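The sampling steps described earlier (initialize the assignments, cycle through every position i, resample zi from its conditional distribution) can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the lecture, and it rests on several assumptions: a single document, symmetric alpha and beta, counts that exclude the position being resampled, and a word-term denominator that uses the dictionary size V (3 in the A, B, C example), as in the Griffiths and Steyvers formulation. The function names are hypothetical, and only 100 sweeps are run to keep the example quick. Under these conventions the probabilities computed for the first word need not coincide with the figures quoted in the example.

```python
import random

def conditional(i, words, z, T, V, alpha, beta):
    """Normalized P(z_i = j | all other assignments, words) for j = 1..T."""
    weights = []
    for j in range(1, T + 1):
        # Times the word at position i is assigned topic j, excluding position i.
        n_wj = sum(1 for k, (w, t) in enumerate(zip(words, z))
                   if k != i and w == words[i] and t == j)
        # Total words assigned topic j, excluding position i.
        n_j = sum(1 for k, t in enumerate(z) if k != i and t == j)
        word_term = (n_wj + beta) / (n_j + V * beta)
        # With one document, the per-document topic count equals n_j.
        topic_term = (n_j + alpha) / (len(z) - 1 + T * alpha)
        weights.append(word_term * topic_term)
    total = sum(weights)
    return [x / total for x in weights]

def gibbs_sweep(words, z, T, V, alpha, beta, rng):
    """One cycle through all positions, resampling each z_i in place."""
    for i in range(len(words)):
        probs = conditional(i, words, z, T, V, alpha, beta)
        z[i] = rng.choices(range(1, T + 1), weights=probs, k=1)[0]
    return z

words = list("AACBCACABB")              # word list from the example
z = [1, 1, 2, 2, 3, 1, 2, 3, 1, 1]      # initial topic assignment
rng = random.Random(0)                  # fixed seed so the run is repeatable

p_first = conditional(0, words, z, T=3, V=3, alpha=1.0, beta=1.0)
print("P(z1 | rest):", p_first)

for sweep in range(100):                # the slides suggest a few thousand
    gibbs_sweep(words, z, T=3, V=3, alpha=1.0, beta=1.0, rng=rng)
print("assignment after 100 sweeps:", z)
```

Each assignment vector produced by a sweep is one (correlated) sample from P(z | w). Averaging the per-sample estimates of theta and phi over many sweeps, after discarding the early ones, gives the Monte Carlo estimate.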