LSI, pLSI, LDA and inference methods
Guillaume Obozinski
INRIA - Ecole Normale Supérieure - Paris
RussIR summer school, Yaroslavl, August 6-10th, 2012

Latent Semantic Indexing

Latent Semantic Indexing (LSI) (Deerwester et al., 1990)

Idea: words that co-occur frequently in documents should be similar.

Let $x_1^{(i)}$ and $x_2^{(i)}$ count respectively the number of occurrences of the words "physician" and "doctor" in the i-th document.

[Figure: the document count vectors $x^{(i)}$ plotted in the $(e_1, e_2)$ plane, with the first principal direction $u^{(1)}$ aligned with the cloud of points.]

The directions of covariance, or principal directions, are obtained from the singular value decomposition of $X \in \mathbb{R}^{d \times N}$:
$$X = U S V^\top, \qquad \text{with } U^\top U = I_d,\ V^\top V = I_N,$$
and $S \in \mathbb{R}^{d \times N}$ a matrix with non-zero elements only on the diagonal: the singular values of $X$, positive and sorted in decreasing order.

LSI: computation of the document representation

$U = [\,u^{(1)} \cdots u^{(d)}\,]$: the principal directions.

Let $U_K \in \mathbb{R}^{d \times K}$ and $V_K \in \mathbb{R}^{N \times K}$ be the matrices retaining the first $K$ columns, and $S_K \in \mathbb{R}^{K \times K}$ the top-left $K \times K$ corner of $S$.

The projection of $x^{(i)}$ onto the subspace spanned by $U_K$ yields the latent representation
$$\tilde{x}^{(i)} = U_K^\top x^{(i)}.$$

Remarks
- $U_K^\top X = U_K^\top U_K S_K V_K^\top = S_K V_K^\top$
- $u^{(k)}$ is somehow like a topic, and $\tilde{x}^{(i)}$ is the vector of coefficients of the decomposition of a document on the $K$ "topics".
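As a concrete complement to the slides, here is a minimal numpy sketch of the LSI computation just described; the toy term-document matrix and the value of K are made up for the example.

```python
import numpy as np

def lsi(X, K):
    """Latent Semantic Indexing: project the documents (columns of the
    d x N term-document matrix X) on the K first left singular vectors."""
    # Thin SVD: X = U S V^T, singular values sorted in decreasing order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_K = U[:, :K]            # d x K: the K principal directions ("topics")
    X_tilde = U_K.T @ X       # K x N: latent representation of each document
    # Sanity check of the remark above: U_K^T X equals S_K V_K^T.
    assert np.allclose(X_tilde, np.diag(s[:K]) @ Vt[:K, :])
    return U_K, X_tilde

# Toy example: d = 5 words, N = 4 documents, K = 2 latent dimensions.
rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(5, 4)).astype(float)
U_K, X_tilde = lsi(X, K=2)
```

Each column of `X_tilde` is the K-dimensional latent representation of one document.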
The similarity between two documents can now be measured by
$$\cos\big(\angle(\tilde{x}^{(i)}, \tilde{x}^{(j)})\big) = \frac{\tilde{x}^{(i)} \cdot \tilde{x}^{(j)}}{\|\tilde{x}^{(i)}\|\,\|\tilde{x}^{(j)}\|}.$$

LSI vs PCA

LSI is almost identical to Principal Component Analysis (PCA), proposed by Karl Pearson in 1901.
- Like PCA, LSI aims at finding the directions of high correlation between words, called the principal directions.
- Like PCA, it retains the projection of the data on a number K of these principal directions; these projections are called the principal components.

Differences between LSI and PCA
- LSI does not center the data (for no specific reason).
- LSI is typically combined with TF-IDF.

Limitations and shortcomings of LSI

The generative model of the data underlying PCA is a Gaussian cloud, which does not match the structure of the data. In particular, LSI ignores
- that the data are counts, frequencies or tf-idf scores;
- that the data are positive ($u^{(k)}$ typically has negative coefficients).
Moreover, the singular value decomposition is expensive to compute.

Topic models and matrix factorization

- $X \in \mathbb{R}^{d \times M}$, with columns $x_i$ corresponding to documents;
- $B$, the matrix whose columns correspond to different topics;
- $\Theta$, the matrix of decomposition coefficients, with columns $\theta_i$ each associated to one document and encoding its "topic content":
$$X = B\,\Theta.$$
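To make the factorization picture concrete, here is a tiny numpy illustration (not from the slides) of the $X \approx B\Theta$ viewpoint, with made-up dimensions; pLSI and LDA, introduced next, give this factorization a probabilistic meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, M = 8, 3, 5                               # dictionary size, topics, documents
B = rng.dirichlet(np.ones(d), size=K).T         # d x K: each column is a topic (distribution over words)
Theta = rng.dirichlet(np.ones(K), size=M).T     # K x M: each column is a document's topic content
X_hat = B @ Theta                               # d x M: expected word frequencies of each document
assert np.allclose(X_hat.sum(axis=0), 1.0)      # each column of B Theta is again a distribution over words
```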
Probabilistic LSI

Probabilistic Latent Semantic Indexing (Hofmann, 2001)

[Figure: the latent space of a topic model consists of topics, which are distributions over words, and of a distribution over these topics for each document. (a) Three topics from a fifty-topic LDA model trained on New York Times articles: TOPIC 1: computer, technology, system, service, site, phone, internet, machine; TOPIC 2: sell, sale, store, product, business, advertising, market, consumer; TOPIC 3: play, film, movie, theater, production, star, director, stage. (b) A simplex depicting the distribution over these topics for seven documents, each document's title placed at its position in the simplex.]

Obtain a more expressive model by allowing several topics per document, in various proportions, so that each word $w_{in}$ gets its own topic $z_{in}$ drawn from a multinomial distribution $d_i$ unique to the i-th document.

Graphical model (plates over the $N_i$ words and the $M$ documents):
- $d_i$: topic proportions in document i,
- $z_{in} \sim \mathcal{M}(1, d_i)$,
- $(w_{in} \mid z_{ink} = 1) \sim \mathcal{M}\big(1, (b_{1k}, \ldots, b_{dk})\big)$.

EM algorithm for pLSI

Denote $j^*_{in}$ the index in the dictionary of the word appearing in document i as the n-th word.

Expectation step:
$$q_{ink}^{(t)} = p\big(z_{ink} = 1 \mid w_{in};\, d_i^{(t-1)}, B^{(t-1)}\big) = \frac{d_{ik}^{(t-1)}\, b_{j^*_{in} k}^{(t-1)}}{\sum_{k'=1}^{K} d_{ik'}^{(t-1)}\, b_{j^*_{in} k'}^{(t-1)}}.$$

Maximization step:
$$d_{ik}^{(t)} = \frac{\sum_{n=1}^{N^{(i)}} q_{ink}^{(t)}}{\sum_{n=1}^{N^{(i)}} \sum_{k'=1}^{K} q_{ink'}^{(t)}} = \frac{\tilde{N}_k^{(i)}}{N^{(i)}}
\qquad \text{and} \qquad
b_{jk}^{(t)} = \frac{\sum_{i=1}^{M} \sum_{n=1}^{N^{(i)}} q_{ink}^{(t)}\, w_{inj}}{\sum_{i=1}^{M} \sum_{n=1}^{N^{(i)}} q_{ink}^{(t)}}.$$
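Below is a hedged numpy sketch of this EM algorithm; it works on a d x M matrix of word counts rather than on the per-token notation of the slides, and the random initialization and number of iterations are arbitrary choices.

```python
import numpy as np

def plsi_em(X, K, n_iter=100, seed=0):
    """EM for pLSI on a d x M matrix of word counts X, with K topics.
    Returns B (d x K topics) and D (K x M per-document topic proportions)."""
    d, M = X.shape
    rng = np.random.default_rng(seed)
    B = rng.random((d, K)); B /= B.sum(axis=0, keepdims=True)   # p(word | topic)
    D = rng.random((K, M)); D /= D.sum(axis=0, keepdims=True)   # p(topic | document)
    for _ in range(n_iter):
        # E-step: responsibilities Q[j, k, i] = p(topic k | word j, document i)
        Q = B[:, :, None] * D[None, :, :]                       # d x K x M
        Q /= Q.sum(axis=1, keepdims=True) + 1e-12
        # M-step: expected (word, topic) and (topic, document) counts
        C = X[:, None, :] * Q                                   # d x K x M
        B = C.sum(axis=2); B /= B.sum(axis=0, keepdims=True) + 1e-12
        D = C.sum(axis=0); D /= D.sum(axis=0, keepdims=True) + 1e-12
    return B, D
```

Working with counts rather than individual word occurrences is equivalent here, since $q_{ink}$ only depends on the word identity $j^*_{in}$.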
Issues with pLSI

Too many parameters → overfitting! Solutions? Not clear. Some alternative approaches:
- Frequentist approach: regularize + optimize → Dictionary Learning
  $$\min_{\theta_i}\ -\log p(x_i \mid \theta_i) + \lambda\, \Omega(\theta_i)$$
- Bayesian approach: prior + integrate → Latent Dirichlet Allocation
  $$p(\theta_i \mid x_i, \alpha) \propto p(x_i \mid \theta_i)\, p(\theta_i \mid \alpha)$$
- "Frequentist + Bayesian": integrate + optimize
  $$\max_{\alpha} \prod_{i=1}^{M} \int p(x_i \mid \theta_i)\, p(\theta_i \mid \alpha)\, d\theta_i$$
  ... called the Empirical Bayes approach, or Type II Maximum Likelihood.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (Blei et al., 2003)

Graphical model: $\alpha \to \theta_i \to z_{in} \to w_{in} \leftarrow B$, with plates over the $N_i$ words and the $M$ documents.
- K topics, $\alpha = (\alpha_1, \ldots, \alpha_K)$ parameter vector;
- $\theta_i = (\theta_{1i}, \ldots, \theta_{Ki}) \sim \mathrm{Dir}(\alpha)$: topic proportions;
- $z_{in}$: topic indicator vector for the n-th word of the i-th document, $z_{in} = (z_{in1}, \ldots, z_{inK})^\top \in \{0,1\}^K$, with
  $$z_{in} \sim \mathcal{M}\big(1, (\theta_{1i}, \ldots, \theta_{Ki})\big), \qquad p(z_{in} \mid \theta_i) = \prod_{k=1}^{K} \theta_{ki}^{z_{ink}};$$
- $w_{in} \mid \{z_{ink} = 1\} \sim \mathcal{M}\big(1, (b_{1k}, \ldots, b_{dk})\big)$, i.e. $p(w_{inj} = 1 \mid z_{ink} = 1) = b_{jk}$.
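To make the generative process concrete, here is a small numpy sketch (not from the slides) that samples documents from the LDA model just defined; the dictionary size, topic matrix and hyperparameters are made up.

```python
import numpy as np

def sample_lda_corpus(M, N, B, alpha, seed=0):
    """Draw M documents of N words each from the LDA generative model.
    B is the d x K topic matrix (columns sum to one), alpha the K-dim
    Dirichlet parameter; words are returned as dictionary indices."""
    rng = np.random.default_rng(seed)
    d, K = B.shape
    docs, thetas = [], []
    for _ in range(M):
        theta = rng.dirichlet(alpha)                               # theta_i ~ Dir(alpha)
        z = rng.choice(K, size=N, p=theta)                         # z_in ~ M(1, theta_i)
        words = np.array([rng.choice(d, p=B[:, k]) for k in z])    # w_in | z_in = k ~ M(1, b_k)
        docs.append(words); thetas.append(theta)
    return docs, thetas

# Tiny example: 3 topics over a 10-word dictionary.
rng = np.random.default_rng(1)
B = rng.dirichlet(np.ones(10), size=3).T          # columns b_k sum to one
docs, thetas = sample_lda_corpus(M=5, N=20, B=B, alpha=np.full(3, 0.5))
```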
LDA likelihood

$$p\big((w_{in}, z_{in})_{1 \le n \le N_i} \mid \theta_i\big) = \prod_{n=1}^{N_i} p(w_{in}, z_{in} \mid \theta_i) = \prod_{n=1}^{N_i} \prod_{j=1}^{d} \prod_{k=1}^{K} (b_{jk}\theta_{ki})^{w_{inj} z_{ink}},$$

$$p\big((w_{in})_{1 \le n \le N_i} \mid \theta_i\big) = \prod_{n=1}^{N_i} \sum_{z_{in}} p(w_{in}, z_{in} \mid \theta_i) = \prod_{n=1}^{N_i} \prod_{j=1}^{d} \Big(\sum_{k=1}^{K} b_{jk}\theta_{ki}\Big)^{w_{inj}},$$

so that
$$w_{in} \mid \theta_i \overset{\text{i.i.d.}}{\sim} \mathcal{M}(1, B\theta_i) \qquad \text{or} \qquad x_i \mid \theta_i \overset{\text{i.i.d.}}{\sim} \mathcal{M}(N_i, B\theta_i).$$

LDA as multinomial factor analysis

Eliminating the $z$'s from the model yields a conceptually simpler model in which $\theta_i$ can be interpreted as latent factors, as in factor analysis. Graphical model: $\alpha \to \theta_i \to x_i \leftarrow B$, with a plate over the $M$ documents.
- Topic proportions for document i: $\theta_i \in \mathbb{R}^K$, $\theta_i \sim \mathrm{Dir}(\alpha)$.
- Empirical word counts for document i: $x_i \in \mathbb{R}^d$, $x_i \sim \mathcal{M}(N_i, B\theta_i)$.

LDA with smoothing of the dictionary

Issue with new words: they will have probability 0 if B is optimized over the training data. → Need to smooth B, e.g. via Laplace smoothing. In the graphical model, this amounts to putting a Dirichlet prior with parameter $\eta$ on the columns of B, both in the word-level and in the count-level version of the model.

Learning with LDA

How do we learn with LDA?
- How do we learn, for each topic, its word distribution $b_k$?
- How do we learn, for each document, its topic composition $\theta_i$?
- How do we assign to each word of a document its topic $z_{in}$?
These quantities are treated in a Bayesian fashion, so the natural Bayesian answers are the posteriors
$$p(B \mid \mathbf{W}), \qquad p(\theta_i \mid \mathbf{W}), \qquad p(z_{in} \mid \mathbf{W}),$$
or, if point estimates are needed,
$$\mathbb{E}(B \mid \mathbf{W}), \qquad \mathbb{E}(\theta_i \mid \mathbf{W}), \qquad \mathbb{E}(z_{in} \mid \mathbf{W}).$$

Monte Carlo

Principle of Monte Carlo integration: let Z be a random variable. To compute $\mathbb{E}[f(Z)]$, we can sample i.i.d. $Z^{(1)}, \ldots, Z^{(B)} \sim Z$ and use the approximation
$$\mathbb{E}[f(Z)] \approx \frac{1}{B} \sum_{b=1}^{B} f\big(Z^{(b)}\big).$$
Problem: in most situations, sampling exactly from the distribution of Z is too hard, so this direct approach is impossible.
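A minimal numpy illustration of this Monte Carlo principle, on a toy example where the expectation is known in closed form:

```python
import numpy as np

# Monte Carlo estimate of E[f(Z)] for Z ~ N(0, 1) and f(z) = z**2 (true value 1).
rng = np.random.default_rng(0)
B = 100_000
Z = rng.standard_normal(B)         # i.i.d. samples Z^(1), ..., Z^(B)
estimate = np.mean(Z ** 2)         # (1/B) * sum_b f(Z^(b))
print(estimate)                    # close to 1, with error of order 1/sqrt(B)
```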
Markov Chain Monte Carlo (MCMC)

If we cannot sample exactly from the distribution of Z, i.e. from some $q(z) = P(Z = z)$ (or from the density $q(z)$ of the r.v. Z), we can instead create a sequence of random variables that approaches the correct distribution.

Principle of MCMC: construct a chain of random variables $Z^{(b,1)}, \ldots, Z^{(b,T)}$ with
$$Z^{(b,t)} \sim p_t\big(z^{(b,t)} \mid Z^{(b,t-1)} = z^{(b,t-1)}\big) \qquad \text{such that} \qquad Z^{(b,T)} \xrightarrow[T \to \infty]{\mathcal{D}} Z.$$
We can then approximate
$$\mathbb{E}[f(Z)] \approx \frac{1}{B} \sum_{b=1}^{B} f\big(Z^{(b,T)}\big).$$

MCMC in practice

Run a single chain:
$$\mathbb{E}[f(Z)] \approx \frac{1}{T} \sum_{t=1}^{T} f\big(Z^{(T_0 + k \cdot t)}\big),$$
where
- $T_0$ is the burn-in time;
- $k$ is the thinning factor:
  → useful to take $k > 1$ only if almost i.i.d. samples are required;
  → to compute an expectation in which the correlation between $Z^{(t)}$ and $Z^{(t-1)}$ would not interfere, take $k = 1$.

Main difficulties:
- the mixing time of the chain can be very large;
- assessing whether the chain has mixed or not is a hard problem;
⇒ proper approximation only with T very large;
⇒ MCMC can be quite slow or just never converge, and you will not necessarily know it.
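The burn-in and thinning recipe can be illustrated on a chain whose stationary distribution is known; the AR(1) chain below merely stands in for an MCMC chain (it is an illustration, not an LDA sampler).

```python
import numpy as np

# AR(1) chain with stationary law N(0, 1); estimate E[Z^2] = 1 with burn-in and thinning.
rng = np.random.default_rng(0)
rho = 0.95                                  # strong correlation -> slow mixing
n_steps, T0, k = 100_000, 1_000, 10         # chain length, burn-in time, thinning factor
z = 10.0                                    # deliberately bad starting point
chain = np.empty(n_steps)
for t in range(n_steps):
    z = rho * z + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    chain[t] = z

samples = chain[T0::k]                      # Z^(T0), Z^(T0 + k), Z^(T0 + 2k), ...
print(np.mean(samples ** 2))                # estimate of E[Z^2] = 1
```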
Gibbs sampling

A nice special case of MCMC.

Principle of Gibbs sampling: for each node i in turn, sample the node conditionally on the other nodes, i.e.
$$\text{sample } Z_i^{(t)} \sim p\big(z_i \mid Z_{-i} = z_{-i}^{(t-1)}\big).$$

Markov blanket

Definition: let V be the set of nodes of the graph. The Markov blanket of node i is the minimal set of nodes S (not containing i) such that
$$p(Z_i \mid Z_S) = p(Z_i \mid Z_{-i}), \qquad \text{or equivalently} \qquad Z_i \perp\!\!\!\perp Z_{V \setminus (S \cup \{i\})} \mid Z_S.$$

d-separation

Theorem: let A, B and C be three disjoint sets of nodes. The property $X_A \perp\!\!\!\perp X_B \mid X_C$ holds if and only if all paths connecting A to B are blocked, which means that they contain at least one blocking node. Node j is a blocking node if there is no "v-structure" at j and j is in C, or if there is a "v-structure" at j and neither j nor any of its descendants in the graph is in C.

[Figure: a small directed graph on nodes a, b, c, e, f illustrating blocked and unblocked paths.]

Markov blanket in a directed graphical model

[Figure: the Markov blanket of a node $x_i$ in a directed graphical model: its parents, its children and its children's other parents.]

Markov blankets in LDA?

In the smoothed LDA model ($\alpha \to \theta_i \to z_{in} \to w_{in} \leftarrow B \leftarrow \eta$), the Markov blankets are:
- $\theta_i \to (z_{in})_{n=1,\ldots,N_i}$
- $z_{in} \to w_{in}$, $\theta_i$ and $B$
- $B \to (w_{in}, z_{in})_{n=1,\ldots,N_i,\ i=1,\ldots,M}$
Gibbs sampling for LDA with a single document

$$p(w, z, \theta, B; \alpha, \eta) = \Big[\prod_{n=1}^{N} p(w_n \mid z_n, B)\, p(z_n \mid \theta)\Big]\, p(\theta \mid \alpha) \prod_{k} p(b_k \mid \eta)
\;\propto\; \Big[\prod_{n=1}^{N} \prod_{j,k} (b_{jk}\theta_k)^{w_{nj} z_{nk}}\Big] \prod_{k} \theta_k^{\alpha_k - 1} \prod_{j,k} b_{jk}^{\eta_j - 1}.$$

We thus have the conditionals:
$$(z_n \mid w_n, \theta) \sim \mathcal{M}(1, \tilde{p}_n) \qquad \text{with} \qquad \tilde{p}_{nk} = \frac{b_{j(n),k}\,\theta_k}{\sum_{k'} b_{j(n),k'}\,\theta_{k'}},$$
$$(\theta \mid (z_n, w_n)_n, \alpha) \sim \mathrm{Dir}(\tilde{\alpha}) \qquad \text{with} \qquad \tilde{\alpha}_k = \alpha_k + N_k, \quad N_k = \sum_{n=1}^{N} z_{nk},$$
$$(b_k \mid (z_n, w_n)_n, \eta) \sim \mathrm{Dir}(\tilde{\eta}) \qquad \text{with} \qquad \tilde{\eta}_j = \eta_j + \sum_{n=1}^{N} w_{nj} z_{nk}.$$
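A hedged numpy sketch of a Gibbs sampler built from these three conditionals, for a single document and with vector-valued α (length K) and η (length d); it illustrates the updates above and is not the author's implementation.

```python
import numpy as np

def lda_gibbs_single_doc(word_ids, K, alpha, eta, d, n_sweeps=2000, seed=0):
    """Gibbs sampler for single-document LDA. word_ids[n] = j(n) is the
    dictionary index of the n-th word; d is the dictionary size."""
    rng = np.random.default_rng(seed)
    word_ids = np.asarray(word_ids)
    N = len(word_ids)
    theta = rng.dirichlet(alpha)                       # initial topic proportions
    B = rng.dirichlet(eta, size=K).T                   # d x K, columns b_k
    z = np.zeros(N, dtype=int)
    for _ in range(n_sweeps):
        # z_n | w_n, theta, B ~ M(1, p_n) with p_nk proportional to b_{j(n),k} theta_k
        p = B[word_ids, :] * theta
        p /= p.sum(axis=1, keepdims=True)
        for n in range(N):
            z[n] = rng.choice(K, p=p[n])
        # theta | z, alpha ~ Dir(alpha_k + N_k) with N_k = #{n : z_n = k}
        Nk = np.bincount(z, minlength=K)
        theta = rng.dirichlet(alpha + Nk)
        # b_k | z, w, eta ~ Dir(eta_j + sum_n w_nj z_nk)
        for k in range(K):
            counts = np.bincount(word_ids[z == k], minlength=d)
            B[:, k] = rng.dirichlet(eta + counts)
    return z, theta, B
```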
LDA Results (Blei et al., 2003)

[Figure: four example topics learned by LDA, with their most probable words.
"Arts": NEW, FILM, SHOW, MUSIC, MOVIE, PLAY, MUSICAL, BEST, ACTOR, FIRST, YORK, OPERA, THEATER, ACTRESS, LOVE;
"Budgets": MILLION, TAX, PROGRAM, BUDGET, BILLION, FEDERAL, YEAR, SPENDING, NEW, STATE, PLAN, MONEY, PROGRAMS, GOVERNMENT, CONGRESS;
"Children": CHILDREN, WOMEN, PEOPLE, CHILD, YEARS, FAMILIES, WORK, PARENTS, SAYS, FAMILY, WELFARE, MEN, PERCENT, CARE, LIFE;
"Education": SCHOOL, STUDENTS, SCHOOLS, EDUCATION, TEACHERS, HIGH, PUBLIC, TEACHER, BENNETT, MANIGAT, NAMPHY, STATE, PRESIDENT, ELEMENTARY, HAITI;
together with an example document ("The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. 'Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants...'") whose words are color-coded by their assigned topics.]

Reading Tea Leaves: word precision (Boyd-Graber et al., 2009)

[Figure 3 of Boyd-Graber et al. (2009): model precision on the word intrusion task, i.e. precision of the identification of word outliers by humans, for CTM, LDA and pLSI with 50, 100 and 150 topics on the New York Times and Wikipedia corpora. Higher is better. Although CTM generally achieves a better predictive likelihood than the other models (Table 1), the topics it infers fare worst when evaluated against human judgments.]

Reading Tea Leaves: topic precision (Boyd-Graber et al., 2009)

[Figure 6 of Boyd-Graber et al. (2009): topic log odds on the topic intrusion task, i.e. precision of the identification of topic outliers by humans, for CTM, LDA and pLSI with 50, 100 and 150 topics on the New York Times and Wikipedia corpora. Higher is better. Again, although CTM generally achieves a better predictive likelihood, its topics fare worst when evaluated against human judgments.]
Reading Tea Leaves: log-likelihood on held-out data (Boyd-Graber et al., 2009)

Held-out log-likelihoods of several models, including LDA, pLSI and CTM (CTM = correlated topic model). Table 1 of Boyd-Graber et al. (2009) reports two predictive metrics, predictive log-likelihood / predictive rank; consistent with values reported in the literature, CTM generally performs best, followed by LDA, then pLSI.

Corpus, #topics        LDA                 CTM                 pLSI
NYT, 50            -7.3214 / 784.38    -7.3335 / 788.58    -7.3384 / 796.43
NYT, 100           -7.2761 / 778.24    -7.2647 / 762.16    -7.2834 / 785.05
NYT, 150           -7.2477 / 777.32    -7.2467 / 755.55    -7.2382 / 770.36
Wikipedia, 50      -7.5257 / 961.86    -7.5332 / 936.58    -7.5378 / 975.88
Wikipedia, 100     -7.4629 / 935.53    -7.4385 / 880.30    -7.4748 / 951.78
Wikipedia, 150     -7.4266 / 929.76    -7.3872 / 852.46    -7.4355 / 945.29

Variational inference for LDA

Principle of Variational Inference

Problem: it is hard to compute
$$p(B, \theta_i, z_{in} \mid \mathbf{W}), \qquad \mathbb{E}(B \mid \mathbf{W}), \qquad \mathbb{E}(\theta_i \mid \mathbf{W}), \qquad \mathbb{E}(z_{in} \mid \mathbf{W}).$$

Idea of Variational Inference: find a distribution q which is as close as possible to $p(\cdot \mid \mathbf{W})$ and for which it is not too hard to compute $\mathbb{E}_q(B)$, $\mathbb{E}_q(\theta_i)$, $\mathbb{E}_q(z_{in})$.

Usual approach:
1. Choose a simple parametric family $\mathcal{Q}$ for q.
2. Solve the variational formulation
   $$\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q \,\|\, p(\cdot \mid \mathbf{W})\big).$$
3. Compute the desired expectations $\mathbb{E}_q(B)$, $\mathbb{E}_q(\theta_i)$, $\mathbb{E}_q(z_{in})$.
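As a toy illustration of steps 1-3 (not from the slides), one can approximate a fixed 2 x 2 "posterior" by a fully factorized q using the classical mean-field coordinate updates; the specific numbers are made up.

```python
import numpy as np

# Target distribution p(z1, z2) over two binary variables, with correlation.
p = np.array([[0.40, 0.10],
              [0.10, 0.40]])

q1 = np.array([0.6, 0.4])              # arbitrary initialization of q1(z1)
q2 = np.array([0.3, 0.7])              # arbitrary initialization of q2(z2)

def kl(q, p):
    """KL divergence KL(q || p) between two discrete distributions."""
    return float(np.sum(q * np.log(q / p)))

for _ in range(50):
    # Mean-field coordinate updates minimizing KL(q1 q2 || p):
    # q1(z1) proportional to exp( E_{q2}[ log p(z1, z2) ] ), and symmetrically for q2.
    q1 = np.exp(np.log(p) @ q2); q1 /= q1.sum()
    q2 = np.exp(q1 @ np.log(p)); q2 /= q2.sum()

q = np.outer(q1, q2)
print("KL(q || p) =", kl(q, p))        # small but nonzero: a factorized q cannot capture the correlation
```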
Variational Inference for LDA (Blei et al., 2003)

Assume B is a parameter, assume there is a single document, and focus on the inference of θ and $(z_n)_n$. Choose q in a factorized form (mean-field approximation):
$$q\big(\theta, (z_n)_n\big) = q_\theta(\theta) \prod_{n=1}^{N} q_{z_n}(z_n), \qquad \text{with} \qquad
q_\theta(\theta) = \frac{\Gamma\big(\sum_k \gamma_k\big)}{\prod_k \Gamma(\gamma_k)} \prod_k \theta_k^{\gamma_k - 1}
\quad \text{and} \quad
q_{z_n}(z_n) = \prod_k \phi_{nk}^{z_{nk}}.$$

Then
$$\mathrm{KL}\big(q \,\|\, p(\cdot \mid \mathbf{W})\big) = \mathbb{E}_q\Big[\log \frac{q(\theta, (z_n)_n)}{p(\theta, (z_n)_n \mid \mathbf{W})}\Big]
= \mathbb{E}_q\Big[\log q_\theta(\theta) + \sum_n \log q_{z_n}(z_n) - \log p(\theta \mid \alpha) - \sum_n \big(\log p(z_n \mid \theta) + \log p(w_n \mid z_n, B)\big)\Big] + \log p\big((w_n)_n\big).$$
Variational Inference for LDA II

Up to the additive constant $\log p\big((w_n)_n\big)$, the quantity to minimize is therefore
$$\mathbb{E}_q\Big[\log q_\theta(\theta) - \log p(\theta \mid \alpha) + \sum_n \big(\log q_{z_n}(z_n) - \log p(z_n \mid \theta) - \log p(w_n \mid z_n, B)\big)\Big].$$

Term by term:
$$\mathbb{E}_q[\log q_\theta(\theta)] = \mathbb{E}_q\Big[\log \Gamma\big(\textstyle\sum_k \gamma_k\big) - \sum_k \log \Gamma(\gamma_k) + \sum_k (\gamma_k - 1)\log\theta_k\Big]
= \log \Gamma\big(\textstyle\sum_k \gamma_k\big) - \sum_k \log \Gamma(\gamma_k) + \sum_k (\gamma_k - 1)\,\mathbb{E}_q[\log\theta_k],$$
$$\mathbb{E}_q[\log p(\theta \mid \alpha)] = \mathbb{E}_q\Big[\sum_k (\alpha_k - 1)\log\theta_k\Big] + \text{cst} = \sum_k (\alpha_k - 1)\,\mathbb{E}_q[\log\theta_k] + \text{cst},$$
$$\mathbb{E}_q[\log q_{z_n}(z_n) - \log p(z_n \mid \theta)] = \mathbb{E}_q\Big[\sum_k z_{nk}\log\phi_{nk} - z_{nk}\log\theta_k\Big] = \sum_k \mathbb{E}_q[z_{nk}]\big(\log\phi_{nk} - \mathbb{E}_q[\log\theta_k]\big),$$
$$\mathbb{E}_q[\log p(w_n \mid z_n, B)] = \mathbb{E}_q\Big[\sum_{j,k} z_{nk}\, w_{nj}\log b_{jk}\Big] = \sum_{j,k} \mathbb{E}_q[z_{nk}]\, w_{nj}\log b_{jk}.$$
VI for LDA: computing the expectations

The expectation of the logarithm of a Dirichlet r.v. can be computed exactly with the digamma function Ψ:
$$\mathbb{E}_q[\log\theta_k] = \Psi(\gamma_k) - \Psi\Big(\sum_{k'} \gamma_{k'}\Big), \qquad \text{with} \qquad \Psi(x) := \frac{\partial \log \Gamma(x)}{\partial x}.$$
We obviously have $\mathbb{E}_q[z_{nk}] = \phi_{nk}$.

The problem $\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q \,\|\, p(\cdot \mid \mathbf{W})\big)$ is therefore equivalent to $\min_{\gamma, (\phi_n)_n} D\big(\gamma, (\phi_n)_n\big)$ with
$$D\big(\gamma, (\phi_n)_n\big) = \log \Gamma\Big(\sum_k \gamma_k\Big) - \sum_k \log \Gamma(\gamma_k) + \sum_{n,k} \phi_{nk}\log\phi_{nk} - \sum_{n,k} \phi_{nk} \sum_j w_{nj}\log b_{jk} - \sum_k \Big(\alpha_k + \sum_n \phi_{nk} - \gamma_k\Big)\Big(\Psi(\gamma_k) - \Psi\big(\textstyle\sum_{k'}\gamma_{k'}\big)\Big).$$

VI for LDA: solving for the variational updates

Introducing a Lagrangian to account for the constraints $\sum_{k=1}^{K} \phi_{nk} = 1$:
$$\mathcal{L}\big(\gamma, (\phi_n)_n\big) = D\big(\gamma, (\phi_n)_n\big) + \sum_{n=1}^{N} \lambda_n\Big(1 - \sum_{k=1}^{K} \phi_{nk}\Big).$$

Computing the gradient of the Lagrangian:
$$\frac{\partial \mathcal{L}}{\partial \gamma_k} = -\Big(\alpha_k + \sum_n \phi_{nk} - \gamma_k\Big)\Big(\Psi'(\gamma_k) - \Psi'\big(\textstyle\sum_{k'}\gamma_{k'}\big)\Big),$$
$$\frac{\partial \mathcal{L}}{\partial \phi_{nk}} = \log\phi_{nk} + 1 - \sum_j w_{nj}\log b_{jk} - \Big(\Psi(\gamma_k) - \Psi\big(\textstyle\sum_{k'}\gamma_{k'}\big)\Big) - \lambda_n.$$

Partial minimizations in γ and $\phi_{nk}$ are therefore respectively solved by
$$\gamma_k = \alpha_k + \sum_n \phi_{nk} \qquad \text{and} \qquad \phi_{nk} \propto b_{j(n),k}\, \exp\Big(\Psi(\gamma_k) - \Psi\big(\textstyle\sum_{k'}\gamma_{k'}\big)\Big),$$
where j(n) is the one and only j such that $w_{nj} = 1$.
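These two updates can be turned directly into code; the numpy sketch below (essentially the pseudocode of the next slide) assumes a single document, a known topic matrix B, and words given by their dictionary indices.

```python
import numpy as np
from scipy.special import digamma

def lda_variational_inference(word_ids, B, alpha, n_iter=100):
    """Coordinate-ascent mean-field updates for a single document, following
    the two update equations above (a sketch, not the author's code).
    word_ids[n] = j(n) is the dictionary index of the n-th word, B is the
    d x K topic matrix and alpha the K-dimensional Dirichlet parameter."""
    word_ids = np.asarray(word_ids)
    N, K = len(word_ids), len(alpha)
    phi = np.full((N, K), 1.0 / K)               # uniform initialization of the phi_n
    gamma = alpha + float(N) / K                 # a common initialization of gamma
    for _ in range(n_iter):
        # gamma_k <- alpha_k + sum_n phi_nk
        gamma = alpha + phi.sum(axis=0)
        # phi_nk proportional to b_{j(n),k} * exp( Psi(gamma_k) - Psi(sum_k' gamma_k') )
        phi = B[word_ids, :] * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)
    return gamma, phi

# The posterior expectations are then approximated by
#   E[theta_k | W] ~ gamma_k / sum_k' gamma_k'   and   E[z_n | W] ~ phi_n.
```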
Variational Algorithm

Algorithm 1: Variational inference for LDA
Require: W, α, γ_init, (φ_{n,init})_n
  while not converged do
      γ_k ← α_k + Σ_n φ_nk
      for n = 1..N do
          for k = 1..K do
              φ_nk ← b_{j(n),k} exp(Ψ(γ_k) − Ψ(Σ_{k'} γ_{k'}))
          end for
          φ_n ← φ_n / Σ_k φ_nk
      end for
  end while
  return γ, (φ_n)_n

With the quantities computed we can approximate
$$\mathbb{E}[\theta_k \mid \mathbf{W}] \approx \frac{\gamma_k}{\sum_{k'} \gamma_{k'}} \qquad \text{and} \qquad \mathbb{E}[z_n \mid \mathbf{W}] \approx \phi_n.$$

Polylingual Topic Model (Mimno et al., 2009)

Generalization of LDA to documents available simultaneously in several languages, such as Wikipedia articles, which are not literal translations of one another but share the same topics. Graphical model: a single vector of topic proportions $\theta_i \sim \mathrm{Dir}(\alpha)$ is shared by the L language versions of document i; for each language l, the topics $z_{in}^{(l)}$ are drawn from $\theta_i$ and the words $w_{in}^{(l)}$ from a language-specific topic matrix $B^{(l)}$.

References

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., and Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196.

Mimno, D., Wallach, H., Naradowsky, J., Smith, D., and McCallum, A. (2009). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880–889. Association for Computational Linguistics.