Scalable Inference for Logistic-Normal Topic Models
Jianfei Chen, Jun Zhu, Zi Wang, Xun Zheng, Bo Zhang
chenjf10@mails.tsinghua.edu.cn, dcszj@mail.tsinghua.edu.cn, wangzi10@mails.tsinghua.edu.cn, xunzheng@cs.cmu.edu, dcszb@mail.tsinghua.edu.cn

Introduction

Problem: Discovering latent semantic structures from data efficiently, in particular inference for non-conjugate logistic-normal topic models.

Topic model background:
• Latent Dirichlet Allocation (LDA). For each document d (1 ≤ d ≤ D):
  1. draw the topic mixing proportion \theta_d \sim \mathrm{Dir}(\alpha);
  2. for each word n (1 ≤ n ≤ N_d): (a) draw topic z_{dn} \sim \mathrm{Mult}(\theta_d); (b) draw word w_{dn} \sim \mathrm{Mult}(\Phi_{z_{dn}}), with \Phi_k \sim \mathrm{Dir}(\beta).
• Logistic-normal topic models, e.g., CTM [1], DTM [6], and the discrete infinite logistic normal [5].

Existing inference methods for logistic-normal topic models:
• variational inference, based on mean-field assumptions;
• limitations: high computational cost, inaccuracy, and inapplicability to real-world large-scale data.

Contributions:
• a partially collapsed Gibbs sampling algorithm;
• sampling of the logistic-normal parameters with a data augmentation technique;
• an efficient parallel implementation;
• quantitative analysis of perplexity and training time.

Logistic-Normal Topic Models

CTM generating process [1]:
1. Draw the vector \eta_d \sim N(\mu, \Sigma) and compute the topic mixing proportion \theta_d:
   \theta_d^k = e^{\eta_d^k} / \sum_{j=1}^{K} e^{\eta_d^j}
2. For each word n (1 ≤ n ≤ N_d): (a) draw topic z_{dn} \sim \mathrm{Mult}(\theta_d); (b) draw word w_{dn} \sim \mathrm{Mult}(\Phi_{z_{dn}}), with \Phi_k \sim \mathrm{Dir}(\beta).

Fully-Bayesian models place a normal-inverse-Wishart prior on the global variables:
   \Sigma \mid \kappa, W \sim \mathrm{IW}(\Sigma; \kappa, W), \quad \mu \mid \Sigma, \mu_0, \rho \sim N(\mu; \mu_0, \Sigma/\rho).

Posterior distribution: p(\eta, Z, \Phi \mid W) \propto p_0(\eta, Z, \Phi) \, p(W \mid Z, \Phi).
• Non-conjugacy between the normal prior and the logistic likelihood.
• Variational approximate methods rely on strict factorization assumptions.

Gibbs Sampling with Data Augmentation

Integrate out \Phi and run Gibbs sampling on the marginalized distribution:
   p(\eta, Z \mid W) \propto p(W \mid Z) \prod_{d=1}^{D} \Big[ N(\eta_d \mid \mu, \Sigma) \prod_{n=1}^{N_d} \frac{e^{\eta_d^{z_{dn}}}}{\sum_{j=1}^{K} e^{\eta_d^{j}}} \Big],
where p(W \mid Z) = \prod_{k=1}^{K} \delta(C_k + \beta)/\delta(\beta), \delta(\beta) = \prod_i \Gamma(\beta_i) / \Gamma(\sum_i \beta_i) is the partition function of the Dirichlet distribution, and C_k^t is the number of times topic k is assigned to term t over the whole corpus.

• Sampling topic assignments:
   p(z_{dn} = k \mid Z_{\neg dn}, w_{dn}, W_{\neg dn}, \eta) \propto e^{\eta_d^k} \, \frac{C_{k,\neg n}^{w_{dn}} + \beta_{w_{dn}}}{\sum_{j=1}^{V} C_{k,\neg n}^{j} + \sum_{j=1}^{V} \beta_j}.

• Sampling logistic-normal parameters with data augmentation. With \rho_d^k = \eta_d^k - \zeta_d^k and \zeta_d^k = \log \sum_{j \neq k} e^{\eta_d^j}, the likelihood of \eta_d^k given \eta_d^{\neg k} is
   \ell(\eta_d^k \mid \eta_d^{\neg k}) = \prod_{n=1}^{N_d} \Big( \frac{e^{\rho_d^k}}{1 + e^{\rho_d^k}} \Big)^{z_{dn}^k} \Big( \frac{1}{1 + e^{\rho_d^k}} \Big)^{1 - z_{dn}^k} = \frac{(e^{\rho_d^k})^{C_d^k}}{(1 + e^{\rho_d^k})^{N_d}} = \frac{e^{\kappa_d^k \rho_d^k}}{2^{N_d}} \int_0^{\infty} e^{-\lambda_d^k (\rho_d^k)^2 / 2} \, p(\lambda_d^k \mid N_d, 0) \, d\lambda_d^k,
with \kappa_d^k = C_d^k - N_d/2 and p(\lambda_d^k \mid N_d, 0) the Polya-Gamma distribution [2] PG(N_d, 0). The augmented conditionals are
   For \eta_d^k:   p(\eta_d^k \mid \eta_d^{\neg k}, Z, W, \lambda_d^k) \propto \exp\big( \kappa_d^k \rho_d^k - \lambda_d^k (\rho_d^k)^2 / 2 \big) \, N(\eta_d^k \mid \mu_d^k, \sigma_k^2) = N(\gamma_d^k, (\tau_d^k)^2),
   For \lambda_d^k: p(\lambda_d^k \mid Z, W, \eta) \propto \exp\big( -\lambda_d^k (\rho_d^k)^2 / 2 \big) \, p(\lambda_d^k \mid N_d, 0) = \mathrm{PG}(\lambda_d^k; N_d, \rho_d^k),
   where the latter can be drawn exactly as the sum of N_d samples from PG(1, \rho_d^k).
• Sampling the global variables:
   p(\mu, \Sigma \mid \eta, Z, W) \propto p_0(\mu, \Sigma) \prod_{d=1}^{D} N(\eta_d \mid \mu, \Sigma) = \mathcal{NIW}(\mu', \rho', \kappa', W').

Fast Approximate Sampling

Exact PG(n, \rho) draws cost O(n) (one PG(1, \rho) draw per word), which is expensive for long documents. We use a fast approximate sampling method to draw PG(n, \rho) samples, reducing the time complexity from O(n) to O(1): draw z \sim \mathrm{PG}(z; m, \rho) for a small, fixed m and use a transformed value f(z) as the approximate sample (see the sketch below).
[Figure: frequency of f(z) with z \sim \mathrm{PG}(m, \rho), and frequency of samples from \eta_d^k \sim p(\eta_d^k \mid \eta_d^{\neg k}, Z, W), for m = 1, 2, 4, 8 and m = n (exact); example setting \rho_d^k = -4.19, C_d^k = 19, N_d = 1155, \mu_d^k = 0.40, \sigma_d^2 = 0.31, \zeta = 5.35.]
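The per-document updates above reduce to one Polya-Gamma and one Gaussian draw per (document, topic) pair. The sketch below is a minimal Python/numpy illustration of these updates for a single document, not the authors' implementation: the function names (sample_pg, sample_pg_fast, update_eta_lambda), the truncated-series PG sampler, the n/m rescaling used as f(z), and the assumption that the conditional prior means mu_d and variances sigma2 are supplied by the caller are all choices made for this example.

import numpy as np

def sample_pg(b, c, rng, trunc=200):
    # Approximate draw from PG(b, c) via the infinite-series representation of
    # Polson, Scott & Windle [2]:
    #   PG(b, c) = (1 / (2*pi^2)) * sum_{k>=1} g_k / ((k - 1/2)^2 + c^2 / (4*pi^2)),
    # with g_k ~ Gamma(b, 1). Truncating the series at `trunc` terms is itself an approximation.
    k = np.arange(1, trunc + 1)
    g = rng.gamma(shape=b, scale=1.0, size=trunc)
    return g.dot(1.0 / ((k - 0.5) ** 2 + c ** 2 / (4.0 * np.pi ** 2))) / (2.0 * np.pi ** 2)

def sample_pg_fast(n, c, rng, m=1):
    # O(1) stand-in for an exact PG(n, c) draw (a sum of n PG(1, c) draws):
    # draw z ~ PG(m, c) for a small fixed m and rescale. The n/m rescaling is one
    # plausible choice of the transform f(z); the poster does not spell f out.
    return (n / m) * sample_pg(m, c, rng)

def update_eta_lambda(eta_d, C_d, N_d, mu_d, sigma2, rng, m=1):
    # One sweep of the augmented per-document updates for document d.
    #   eta_d  : (K,) current logistic-normal parameters eta_d^k
    #   C_d    : (K,) topic counts C_d^k for this document
    #   N_d    : number of words in the document
    #   mu_d   : (K,) conditional prior means mu_d^k of eta_d^k given eta_d^{-k}
    #   sigma2 : (K,) conditional prior variances sigma_k^2
    K = eta_d.shape[0]
    for k in range(K):
        # rho_d^k = eta_d^k - zeta_d^k, with zeta_d^k = log sum_{j != k} exp(eta_d^j)
        zeta = np.log(np.sum(np.exp(np.delete(eta_d, k))))
        rho = eta_d[k] - zeta
        # lambda_d^k ~ PG(N_d, rho_d^k), drawn with the fast approximate sampler
        lam = sample_pg_fast(N_d, rho, rng, m=m)
        # eta_d^k | rest ~ N(gamma_d^k, (tau_d^k)^2), obtained by completing the square in
        # exp(kappa*rho - lam*rho^2/2) * N(eta_d^k | mu_d^k, sigma_k^2)
        kappa = C_d[k] - N_d / 2.0
        tau2 = 1.0 / (1.0 / sigma2[k] + lam)
        gamma = tau2 * (mu_d[k] / sigma2[k] + kappa + lam * zeta)
        eta_d[k] = rng.normal(gamma, np.sqrt(tau2))
    return eta_d

Setting m = N_d (so the rescaling factor is 1) recovers an exact PG(N_d, \rho_d^k) draw up to the series truncation; the figure above compares m = 1, 2, 4, 8 against this exact case.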
Parallel Implementation

No communication is needed for inferring \eta_d and \lambda_d; the global variables \mu and \Sigma are broadcast to every machine after each iteration. For each iteration:
• For each document d: for each word n (1 ≤ n ≤ N_d): draw z_{dn} \sim p(z_{dn} = k \mid Z_{\neg dn}, W, \eta);
• For each document d: for each topic k: draw \eta_d^k \sim N(\gamma_d^k, (\tau_d^k)^2) and \lambda_d^k \sim \mathrm{PG}(\lambda_d^k; N_d, \rho_d^k);
• Draw \mu, \Sigma \sim \mathcal{NIW}(\mu', \rho', \kappa', W').
The implementation is built upon a state-of-the-art distributed sampler [3].

Experiments

Measurement: held-out perplexity (a computational sketch is given at the end of this section),
   \mathrm{Perp}(\mathcal{M}) = \Big( \prod_{d=1}^{D} \prod_{i=L+1}^{N_d} p(w_{di} \mid \mathcal{M}, w_{d,1:L}) \Big)^{-1 / \sum_{d=1}^{D} (N_d - L)},
where \mathcal{M} denotes the model parameters and w_{d,1:L} is the observed part of document d.

Data sets: NIPS paper abstracts, 20Newsgroups, NYTimes (New York Times), and Wikipedia corpora.

• Efficiency: vCTM (the variational implementation) becomes impractical when the data size reaches 285K documents, while gCTM (our Gibbs sampler) handles larger data sets with considerable speed.

Training time of vCTM and gCTM (M = 40):
   data set | D    | K    | vCTM   | gCTM
   NIPS     | 1.2K | 100  | 1.9 hr | 8.9 min
   20NG     | 11K  | 200  | 16 hr  | 9 min
   NYTimes  | 285K | 400  | N/A*   | 0.5 hr
   Wiki     | 6M   | 1000 | N/A*   | 17 hr
   *not finished within 1 week.

[Figures: perplexity and training time versus K, comparing vCTM, gCTM (M=1, P=1), gCTM (M=1, P=12), gCTM (M=40, P=480), and Y!LDA (M=40, P=480).]

• Sensitivity analysis on hyper-parameters.
[Figures: perplexity and training time of gCTM with S = 1, 2, 4, 8, 16, 32 versus vCTM, for K = 50 and K = 100.]

• Scalability analysis:
  - M = 8, 16, 24, 32, 40 machines, with each machine processing 150K documents;
  - Wikipedia data set with K = 500, as a practical problem;
  - as data and machines are added in the same proportion, the training time stays almost constant, i.e., parallel gCTM enjoys nice scalability.
[Figure: training time versus #docs (1.2M to 6M) for fixed M = 8, linearly scaled M, and the ideal case.]
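For reference, Perp(M) above is simply the inverse geometric mean of the per-word predictive probabilities of the held-out words. Below is a minimal sketch, assuming the predictive probabilities p(w_{di} | M, w_{d,1:L}) have already been computed by a trained model; the function name and input layout are illustrative only, not part of the authors' code.

import numpy as np

def heldout_perplexity(pred_probs):
    # pred_probs[d] is an array of predictive probabilities p(w_di | M, w_{d,1:L})
    # for the held-out words i = L+1, ..., N_d of document d.
    # Returns Perp(M) = (prod_d prod_i p(w_di | ...))^(-1 / sum_d (N_d - L)).
    total_log = sum(float(np.sum(np.log(p))) for p in pred_probs)
    n_heldout = sum(len(p) for p in pred_probs)
    return float(np.exp(-total_log / n_heldout))

# Example: two documents with 3 and 2 held-out words respectively.
# heldout_perplexity([np.array([0.01, 0.02, 0.005]), np.array([0.03, 0.01])])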
Conclusion

• We present a scalable Gibbs sampling algorithm for logistic-normal topic models.
• Experimental results show a significant improvement in time efficiency over existing variational methods, with slightly better perplexity.
• The algorithm enjoys excellent scalability, suggesting the ability to extract large structures from massive data.

Selected References

[1] D. Blei and J. Lafferty. Correlated topic models. In Advances in Neural Information Processing Systems (NIPS), 2006.
[2] N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. arXiv:1205.0310v2, 2013.
[3] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. Smola. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM), 2012.
[4] D. Newman, A. Asuncion, P. Smyth, and M. Welling. Distributed algorithms for topic models. Journal of Machine Learning Research, (10):1801-1828, 2009.
[5] J. Paisley, C. Wang, and D. Blei. The discrete infinite logistic normal distribution for mixed-membership modeling. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[6] D. Blei and J. Lafferty. Dynamic topic models. In International Conference on Machine Learning (ICML), 2006.
[7] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.