Decoupling Sparsity and Smoothness in the
Discrete Hierarchical Dirichlet Process
Chong Wang and David M. Blei
NIPS 2009
Discussion led by Chunping Wang
ECE, Duke University
March 26, 2010
Outline
• Motivations
• LDA and HDP-LDA
• Sparse Topic Models
• Inference Using Collapsed Gibbs sampling
• Experiments
• Conclusions
1/16
Motivations
• Topic modeling with the “bag of words” assumption
• An extension of the HDP-LDA model
• In the LDA and the HDP-LDA models, the topics are drawn from an
exchangeable Dirichlet distribution with a scale parameter λ. As λ
approaches zero, topics will be
o sparse: most probability mass on only a few terms
o less smooth: the empirical counts dominate
• Goal: to decouple sparsity and smoothness so that these two properties can
be achieved at the same time.
• How: a Bernoulli variable for each term and each topic is introduced.
2/16
LDA and HDP-LDA

LDA
topic k: β_k ~ Dir(u)
document d: θ_d ~ Dir(α)
word i: z_di ~ Mult(θ_d), w_di ~ Mult(β_{z_di})
[Graphical model: base measure u → β_k (plate K); α → θ_d → z_di → w_di (plates D and N)]

HDP-LDA
Nonparametric form of LDA, with the number of topics unbounded
topic k: β_k ~ Dir(u)
global topic weights: α ~ GEM(γ)
document d: θ_d ~ DP(α₀, α)
word i: z_di ~ Mult(θ_d), w_di ~ Mult(β_{z_di})
[Graphical model: same structure, with the topic weights and θ_d drawn from the stick-breaking/DP construction]
(A code sketch of the LDA generative process follows this slide.)
3/16
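To make the generative processes above concrete, here is a minimal Python sketch of finite LDA; the values of K, V, D, N_d and the hyperparameters are illustrative choices, not values from the paper, and the HDP-LDA extension is only indicated in a closing comment.

import numpy as np

# Minimal sketch of the LDA generative process from this slide. K, V, D, N_d
# and the hyperparameter values are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
K, V, D, N_d = 5, 50, 10, 20           # topics, vocabulary size, documents, words per document
u, alpha = 0.5, 1.0                    # symmetric Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, u), size=K)          # topic k: beta_k ~ Dir(u)
docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))       # document d: theta_d ~ Dir(alpha)
    z_d = rng.choice(K, size=N_d, p=theta_d)         # word i: z_di ~ Mult(theta_d)
    w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])   # w_di ~ Mult(beta_{z_di})
    docs.append((z_d, w_d))

# HDP-LDA replaces the fixed K: global weights alpha ~ GEM(gamma) via stick-breaking
# and theta_d ~ DP(alpha_0, alpha), so the number of topics is unbounded.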
Sparse Topic Models
The size of the vocabulary is V.
β_k ~ Dir(u): defined on the (V−1)-simplex.
β_k ~ Dir(λ b_k): defined on a sub-simplex specified by b_k,
a V-length binary vector composed of V Bernoulli variables:
b_kv ~ Bern(π_k), with π_k ~ Beta(r, s) (one selection proportion for each topic).
Sparsity: the pattern of ones in b_k, controlled by π_k.
Smoothness: enforced over the terms with non-zero b_kv through λ.
Decoupled!
(A code sketch of drawing one sparse topic follows this slide.)
4/16
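A matching sketch of drawing a single sparse topic, following the generative steps on this slide; V, r, s and lam are illustrative values, and the guard against an all-zero selector vector is my own practical convenience, not part of the slide.

import numpy as np

# Sketch of drawing one sparse topic. r, s, lam and V are illustrative values.
rng = np.random.default_rng(1)
V, r, s, lam = 50, 1.0, 10.0, 1.0

pi_k = rng.beta(r, s)                          # selection proportion for this topic
b_k = rng.binomial(1, pi_k, size=V)            # b_kv ~ Bern(pi_k): which terms are "on"
if b_k.sum() == 0:                             # practical guard: keep at least one term selected
    b_k[rng.integers(V)] = 1

beta_k = np.zeros(V)
on = np.flatnonzero(b_k)
beta_k[on] = rng.dirichlet(np.full(on.size, lam))   # Dirichlet on the sub-simplex of selected terms

# Sparsity comes from the pattern of ones in b_k (controlled by pi_k);
# smoothness is enforced only over the selected terms, through lam.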
Sparse Topic Models
5/16
Inference Using Collapsed Gibbs sampling
As in the HDP-LDA:
 Topic proportions θ and topic distributions β are integrated out.
 The direct-assignment method based on the Chinese restaurant franchise (CRF) is used to sample z, α, and an augmented variable, the table counts m.
6/16
Inference Using Collapsed Gibbs sampling
Notation:
 n_dk: # of customers (words) in restaurant d (document) eating dish k (topic)
 m_dk: # of tables in restaurant d serving dish k
 n_d., m_d., m_.k, m_..: marginal counts, denoted with dots
 K, u: the current # of topics and the index of a new topic, respectively
 n_k^(v): # of times that term v has been assigned to topic k
 n_k^(.): # of times that any term has been assigned to topic k
 f_k^{-w_di}(w_di = v) = p(w_di = v | {w_d'i', z_d'i' : z_d'i' = k, d'i' ≠ di}, λ):
the conditional density of w_di under topic k, given all data except w_di
7/16
Inference Using Collapsed Gibbs sampling
Recall the direct-assignment sampling method for the HDP-LDA:
 Sampling topic assignments:
p(z_di = k | z_{-di}, m, α) ∝ (n_dk^{-di} + α₀ α_k) · f_k^{-w_di}(w_di)   if k is previously used
p(z_di = k | z_{-di}, m, α) ∝ α₀ α_u · f_u^{-w_di}(w_di)                  if k = u (a new topic)
If a new topic k_new is sampled, then sample b ~ Beta(1, γ), set α_{k_new} = b·α_u and α_u ← (1 − b)·α_u, and let K ← K + 1.
 Sampling stick lengths: α | m ~ Dir(m_.1, …, m_.K, γ)
 Sampling table counts:
p(m_dk = m | z, m_{-dk}, α) ∝ [Γ(α₀ α_k) / Γ(α₀ α_k + n_dk)] · s(n_dk, m) · (α₀ α_k)^m
where s(·, ·) denotes the unsigned Stirling numbers of the first kind.
(A code sketch of these updates follows this slide.)
8/16
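Below is a hedged Python sketch of the direct-assignment updates recalled on this slide, under my own assumptions about the data structures: dense count arrays n_dk (document–topic), n_kv (topic–term) and n_k (topic totals), global weights alpha with leftover stick mass alpha_u, concentrations alpha0 and gamma, and Dirichlet smoothing lam. None of these names come from the paper.

import numpy as np

# Hedged sketch of one direct-assignment Gibbs update for z_di in HDP-LDA and of
# the table-count update, following the equations on this slide. All names are
# my own; this is not the authors' code.
def resample_z_di(d, v, old_k, n_dk, n_kv, n_k, alpha, alpha_u,
                  alpha0, gamma, lam, V, rng):
    """Resample the topic of word (d, i), whose term is v and old topic is old_k."""
    n_dk[d, old_k] -= 1; n_kv[old_k, v] -= 1; n_k[old_k] -= 1   # remove the word

    K = len(alpha)
    f = (n_kv[:, v] + lam) / (n_k + V * lam)        # f_k^{-w_di}(w_di = v) for used topics
    p_used = (n_dk[d, :] + alpha0 * alpha) * f      # k previously used
    p_new = alpha0 * alpha_u / V                    # k = u: f is the uniform base density
    probs = np.append(p_used, p_new)
    k = int(rng.choice(K + 1, p=probs / probs.sum()))

    if k == K:                                      # a new topic was sampled
        b = rng.beta(1.0, gamma)                    # split the leftover stick mass
        alpha = np.append(alpha, b * alpha_u)
        alpha_u = (1.0 - b) * alpha_u
        n_dk = np.hstack([n_dk, np.zeros((n_dk.shape[0], 1), int)])
        n_kv = np.vstack([n_kv, np.zeros((1, V), int)])
        n_k = np.append(n_k, 0)

    n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1
    return k, n_dk, n_kv, n_k, alpha, alpha_u

def sample_table_count(n_dk_count, alpha0_alpha_k, rng):
    """Draw m_dk by simulating Chinese-restaurant seating, which is equivalent to
    the Stirling-number formula on this slide (up to its normalizing constant)."""
    return int(sum(rng.random() < alpha0_alpha_k / (alpha0_alpha_k + i)
                   for i in range(n_dk_count)))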
Inference Using Collapsed Gibbs sampling
Sampling topic assignments takes the same form as above; the two models differ in the conditional word density f:
for HDP-LDA:       f_k^{-w_di}(w_di = v) ∝ n_{k,-di}^(v) + λ
for the sparse TM: f_k^{-w_di}(w_di = v | b_k) ∝ (n_{k,-di}^(v) + λ) · b_kv
Conditioning on b_k would be straightforward; instead, the authors integrate out b_k for faster convergence:
f_k^{-w_di}(w_di = v | π_k) = Σ_{b_k} ∫ p(w_di = v | β_k) · p(β_k, b_k | {w_d'i', z_d'i' : z_d'i' = k, d'i' ≠ di}, π_k) dβ_k
Since there are 2^V possible b_k in total, this sum is the central computational challenge for the sparse TM.
(A small sketch of the two conditional densities follows this slide.)
8/16
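A small sketch contrasting the two conditional word densities above, with b_k instantiated (the straightforward case); the marginalized version that the authors actually use is not reproduced here. Array and parameter names are illustrative.

import numpy as np

# The two (unnormalized) conditional word densities from this slide, with b_k kept
# instantiated. n_kv_minus holds the counts n_{k,-di}^(v) with the current word removed.
def f_hdp_lda(v, n_kv_minus, lam):
    """HDP-LDA: proportional to n_{k,-di}^(v) + lam."""
    return n_kv_minus[v] + lam

def f_sparse_tm_given_b(v, n_kv_minus, b_k, lam):
    """Sparse TM, conditioned on the selectors: proportional to (n_{k,-di}^(v) + lam) * b_kv."""
    return (n_kv_minus[v] + lam) * b_k[v]

# Marginalizing b_k out instead (as the authors do for faster convergence) would,
# if done naively, require summing over all 2**V selector configurations:
# the computational challenge noted on the slide.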
Inference Using Collapsed Gibbs sampling
Define B_k: the set of terms in the vocabulary that currently have word assignments in topic k. Any b_k with non-zero probability must have b_kv = 1 for every v ∈ B_k, i.e., ∏_{v ∈ B_k} b_kv = 1.
This conditional probability depends on the selector proportions.
9/16
Inference Using Collapsed Gibbs sampling
10/16
Inference Using Collapsed Gibbs sampling
 Sampling the Bernoulli parameter π_k (using b_k as an auxiliary variable): define the set of terms with an “on” b_kv, then
o sample b_k conditioned on π_k;
o sample π_k conditioned on b_k.
 Sampling hyper-parameters:
o γ, α₀: with Gamma(1, 1) priors
o λ: Metropolis-Hastings using a symmetric Gaussian proposal
 Estimate the topic distributions β from any single sample of z and b: the estimate captures sparsity and smoothness on the selected terms.
(A sketch of the π_k and λ updates follows this slide.)
11/16
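A hedged sketch of two of the updates listed above. The exact conditional used to sample b_k given π_k and the word assignments is not reproduced (terms that already have assignments in topic k must remain selected); shown are the conjugate update for π_k and a generic random-walk Metropolis-Hastings step for λ, with names of my own.

import numpy as np

# Conjugate update for pi_k and a generic MH step for lam. Names are illustrative.
def sample_pi_k(b_k, r, s, rng):
    """pi_k | b_k ~ Beta(r + #on terms, s + #off terms), by Beta-Bernoulli conjugacy."""
    on = int(b_k.sum())
    return rng.beta(r + on, s + (b_k.size - on))

def mh_step_lam(lam, log_post, rng, step=0.1):
    """One Metropolis-Hastings move for lam with a symmetric Gaussian proposal;
    log_post(lam) is the log posterior of lam, assumed supplied by the caller."""
    prop = lam + step * rng.normal()
    if prop <= 0:                                   # lam must remain positive
        return lam
    accept = np.log(rng.random()) < log_post(prop) - log_post(lam)
    return prop if accept else lam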
Experiments
Four datasets:
 arXiv: online research abstracts, D = 2500, V = 2873
 Nematode Biology: research abstracts, D = 2500, V = 2944
 NIPS: NIPS articles from 1988–1999, V = 5005; 20% of the words of each paper are used
 Conf. abstracts: abstracts from CIKM, ICML, KDD, NIPS, SIGIR and WWW, 2005–2008, V = 3733
Two predictive quantities: held-out perplexity and topic complexity, where complexity_k = |B_k|, the number of terms used by topic k.
(A short sketch of both quantities follows this slide.)
12/16
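For concreteness, a short sketch of the two reported quantities, under my own assumption about how the held-out log likelihood is supplied; the paper's exact held-out evaluation scheme is not reproduced here.

import numpy as np

# Perplexity from a total held-out log likelihood, and topic complexity from the
# topic-term count matrix. Both signatures are my own simplifications.
def perplexity(heldout_log_lik, n_heldout_words):
    """exp( -(held-out log likelihood) / (number of held-out words) )."""
    return float(np.exp(-heldout_log_lik / n_heldout_words))

def topic_complexity(n_kv):
    """complexity_k = |B_k|: the number of terms with word assignments in topic k."""
    return (n_kv > 0).sum(axis=1)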
Experiments
Plot annotations: better perplexity, simpler models.
Larger λ: smoother; fewer topics; similar # of terms.
13/16
Experiments
14/16
Experiments
When λ is small (< 0.01), the topic estimate
β̂_kv = (n_k^(v) + λ) / (n_k^(.) + V·λ)
lacks smoothness: more topics are needed to explain all the patterns of empirical word counts, and infrequent words populate “noise” topics.
(A small numeric illustration follows this slide.)
15/16
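A small numeric illustration of the estimate above, with made-up counts: a tiny λ leaves almost no probability mass for terms that have no counts in the topic, which is the lack of smoothness the slide points to.

import numpy as np

# Toy counts for one topic over a vocabulary of V = 1000 terms.
V = 1000
n_kv = np.zeros(V)
n_kv[:5] = [40, 30, 20, 7, 3]          # only five terms have word assignments

for lam in (0.005, 0.5):
    beta_hat = (n_kv + lam) / (n_kv.sum() + V * lam)
    print(f"lam={lam}: total mass on the 995 unseen terms = {beta_hat[5:].sum():.3f}")
# lam=0.005 gives the unseen terms about 0.047 of the mass (almost no smoothing);
# lam=0.5 gives them about 0.829.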
Conclusions
 A new topic model in the HDP-LDA framework,
based on the “bag of words” assumption;
 Main contributions:
• Decoupling the control of sparsity and smoothness
by introducing binary selectors for term assignments
in each topic;
• Developing a collapsed Gibbs sampler in the HDP-LDA framework.
 Held-out performance is better than that of the HDP-LDA.
16/16