Decoupling Sparsity and Smoothness in the Discrete Hierarchical Dirichlet Process
Chong Wang and David M. Blei, NIPS 2009
Discussion led by Chunping Wang, ECE, Duke University, March 26, 2010

Outline
• Motivations
• LDA and HDP-LDA
• Sparse Topic Models
• Inference Using Collapsed Gibbs Sampling
• Experiments
• Conclusions

Motivations
• Topic modeling under the "bag of words" assumption.
• An extension of the HDP-LDA model.
• In the LDA and HDP-LDA models, the topics are drawn from an exchangeable Dirichlet distribution with a scale parameter γ. As γ approaches zero, topics become
  o sparse: most probability mass sits on only a few terms;
  o less smooth: the empirical word counts dominate.
• Goal: decouple sparsity and smoothness so that the two properties can be achieved at the same time.
• How: introduce a Bernoulli variable for each term and each topic.

LDA and HDP-LDA
LDA (graphical model with plates over K topics, D documents, N words; u is a symmetric base measure with scale γ):
  topic k:        β_k ~ Dir(u)
  document d:     θ_d ~ Dir(α)
  word i:         z_di ~ Mult(θ_d),  w_di ~ Mult(β_{z_di})
HDP-LDA, the nonparametric form of LDA, with the number of topics unbounded:
  topic k:        β_k ~ Dir(u)
  topic weights:  α ~ GEM(λ)
  document d:     θ_d ~ DP(α_0, α)
  word i:         z_di ~ Mult(θ_d),  w_di ~ Mult(β_{z_di})

Sparse Topic Models
The size of the vocabulary is V.
  β_k ~ Dir(u):      defined on the (V-1)-simplex.
  β_k ~ Dir(γ b_k):  defined on a sub-simplex specified by b_k (only terms with b_kv = 1 receive mass).
b_k is a V-length binary vector composed of V Bernoulli variables,
  b_kv ~ Bern(π_k),  π_k ~ Beta(r, s),  one selection proportion for each topic.
Sparsity: the pattern of ones in b_k, controlled by π_k.
Smoothness: enforced through γ over the terms with non-zero b_kv.
The two properties are thus decoupled.

Inference Using Collapsed Gibbs Sampling
• As in HDP-LDA, the topic proportions θ and the topic distributions β are integrated out.
• The direct-assignment method based on the Chinese restaurant franchise (CRF) is used for the topic assignments z, the stick weights α, and an augmented variable, the table counts m.

Notation:
  n_dk:  # of customers (words) in restaurant d (document) eating dish k (topic)
  m_dk:  # of tables in restaurant d serving dish k
  n_d., m_d., m_.k, m_..:  marginal counts, with dots denoting summed-out indices
  K, u:  the current # of topics and the new-topic index, respectively
  n_k^(v):  # of times that term v has been assigned to topic k
  n_k^(.):  # of times that any term has been assigned to topic k
  f_k^{w_di}(w_di = v) = p(w_di = v | {w_{d'i'}, z_{d'i'} : z_{d'i'} = k, (d', i') ≠ (d, i)}):
      the conditional density of w_di under topic k given all data except w_di.

Recall the direct-assignment sampling method for HDP-LDA.
Sampling the topic assignments:
  p(z_di = k | z_-di, m, α) ∝ (n_{dk,-di} + α_0 α_k) f_k^{w_di}(w_di)   if k is a previously used topic,
                            ∝ α_0 α_u f_u^{w_di}(w_di)                  if k = u (a new topic).
If a new topic k_new is sampled, then draw ν ~ Beta(1, λ) and let α_{k_new} = ν α_u and α_u^{new} = (1 - ν) α_u.
Sampling the stick lengths:
  (α_1, ..., α_K, α_u) | m ~ Dir(m_.1, ..., m_.K, λ).
Sampling the table counts:
  p(m_dk = m | z, m_-dk, α) ∝ [Γ(α_0 α_k) / Γ(α_0 α_k + n_dk)] s(n_dk, m) (α_0 α_k)^m,
where s(n, m) denotes the unsigned Stirling numbers of the first kind.

The conditional word likelihoods differ between the two models:
  for HDP-LDA:                  f_k^{w_di}(w_di = v) ∝ n_k^{(v),-di} + γ;
  for the sparse TM, given b_k: f_k^{w_di}(w_di = v | b_k) ∝ (n_k^{(v),-di} + γ) b_kv.
Sampling with b_k held fixed would be straightforward; instead, the authors integrate out b_k for faster convergence. (A small code sketch of these Gibbs weights, with b_k held fixed, follows below.)
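To make the direct-assignment update concrete, here is a minimal Python/numpy sketch of the unnormalized Gibbs weights for z_di under the two conditional word likelihoods above. It is an illustration rather than the authors' code: the array names (n_dk, n_kv), the use of 1/V as the new-topic predictive, and the treatment of the sparse TM with b_k held fixed (rather than integrated out, as the paper actually does) are all simplifying assumptions.

```python
import numpy as np

def gibbs_weights_z(d, v, n_dk, n_kv, alpha, alpha0, gamma, b=None):
    """Unnormalized p(z_di = k | rest) for the direct-assignment sampler.

    Assumes the word at position (d, i) has already been decremented from
    the count arrays:
      n_dk   : (D, K)  document-topic word counts
      n_kv   : (K, V)  topic-term word counts
      alpha  : (K+1,)  global stick weights; alpha[-1] is the new-topic mass alpha_u
      alpha0 : float   document-level DP concentration
      gamma  : float   symmetric Dirichlet smoothing over terms
      b      : (K, V)  optional binary selectors (sparse TM with b_k held fixed)
    """
    K, V = n_kv.shape
    if b is None:
        # HDP-LDA: f_k(w_di = v) = (n_k^(v) + gamma) / (n_k^(.) + V * gamma)
        f = (n_kv[:, v] + gamma) / (n_kv.sum(axis=1) + V * gamma)
    else:
        # sparse TM given b_k: mass proportional to (n_k^(v) + gamma) * b_kv,
        # normalized over the selected terms only
        denom = (n_kv * b).sum(axis=1) + gamma * b.sum(axis=1)
        f = (n_kv[:, v] + gamma) * b[:, v] / np.maximum(denom, 1e-12)
    weights = np.empty(K + 1)
    weights[:K] = (n_dk[d] + alpha0 * alpha[:K]) * f   # previously used topics
    weights[K] = alpha0 * alpha[K] / V                 # new topic (prior predictive)
    return weights
```

A draw would then be k = np.random.choice(K + 1, p=w / w.sum()); sampling the last index corresponds to instantiating a new topic and splitting the remaining stick mass α_u with ν ~ Beta(1, λ), as described above.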
With b_k integrated out, the conditional word likelihood becomes
  f_k^{w_di}(w_di = v | π_k) = Σ_{b_k} ∫ p(w_di = v | β_k) p(β_k, b_k | {w_{d'i'}, z_{d'i'} : z_{d'i'} = k, (d', i') ≠ (d, i)}, π_k) dβ_k.
Since there are 2^V possible values of b_k in total, evaluating this sum is the central computational challenge for the sparse TM.

Define B_k, the set of vocabulary terms that currently have word assignments in topic k. The marginalized conditional can be expressed in terms of B_k and X = Σ_{v ∈ B_k} b_kv, and it depends on the selector proportion π_k.

Sampling the Bernoulli parameter π_k (using b_k as an auxiliary variable):
  define the set of terms with an "on" selector, {v : b_kv = 1};
  o sample b_k conditioned on π_k;
  o sample π_k conditioned on b_k.
Sampling the hyper-parameters:
  o λ, α_0: Gamma(1,1) priors;
  o γ: Metropolis-Hastings with a symmetric Gaussian proposal.
Estimate the topic distributions β from any single sample of z and b: sparsity comes from the selectors b, and smoothness (through γ) is applied only over the selected terms. (A short sketch of this estimate, together with the topic-complexity measure, is given at the end of these notes.)

Experiments
Four datasets:
  arXiv: online research abstracts, D = 2500, V = 2873
  Nematode Biology: research abstracts, D = 2500, V = 2944
  NIPS: NIPS articles from 1988-1999, V = 5005; 20% of the words of each paper are used
  Conf. abstracts: abstracts from CIKM, ICML, KDD, NIPS, SIGIR and WWW, 2005-2008, V = 3733
Two predictive quantities are reported: held-out perplexity and model complexity, where the topic complexity is complexity_k = |B_k|, the number of vocabulary terms used by topic k.

[Results figures: better perplexity and simpler models for the sparse TM; with larger γ: smoother topics, fewer topics, a similar number of terms.]

With a small smoothing parameter γ (< 0.01), the (HDP-)LDA topic estimate
  β̂_kv = (n_k^(v) + γ) / (n_k^(.) + Vγ)
is dominated by the empirical counts:
  o lack of smoothness;
  o more topics are needed to explain all the different patterns of empirical word counts;
  o infrequent words populate "noise" topics.

Conclusions
• A new topic model in the HDP-LDA framework, based on the "bag of words" assumption.
• Main contributions:
  o decoupling the control of sparsity and smoothness by introducing binary selectors for the term assignments in each topic;
  o developing a collapsed Gibbs sampler in the HDP-LDA framework.
• Held-out performance is better than that of HDP-LDA.
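As a companion to the β-estimation step in the inference section and the small-γ discussion in the experiments, here is a brief numpy sketch contrasting the two topic estimates and computing the topic-complexity measure |B_k|. The array shapes and the exact form of the sparse-TM estimate (smoothing restricted to the selected terms of one sample of b) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def beta_hat_hdp_lda(n_kv, gamma):
    """(HDP-)LDA estimate over the full vocabulary:
    beta_kv = (n_k^(v) + gamma) / (n_k^(.) + V * gamma).
    With gamma < 0.01 the empirical counts dominate, i.e. little smoothing."""
    V = n_kv.shape[1]
    return (n_kv + gamma) / (n_kv.sum(axis=1, keepdims=True) + V * gamma)

def beta_hat_sparse_tm(n_kv, b, gamma):
    """Sparse-TM estimate from a single sample of z and b: sparsity comes from
    the selectors b, and the smoothing gamma is spread only over selected terms."""
    denom = (n_kv * b).sum(axis=1, keepdims=True) + gamma * b.sum(axis=1, keepdims=True)
    return (n_kv + gamma) * b / np.maximum(denom, 1e-12)

def topic_complexity(n_kv):
    """complexity_k = |B_k|: the number of distinct terms with word assignments
    in topic k."""
    return (n_kv > 0).sum(axis=1)
```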