Integrating Topics and Syntax
– Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Han Liu
Department of Computer Science
University of Illinois at Urbana-Champaign
hanliu@ncsa.uiuc.edu
April 12, 2005
Outline
• Motivations – syntactic vs. semantic modeling
• Formalization – notations and terminology
• Generative Models – pLSI; Latent Dirichlet Allocation
• Composite Models – HMMs + LDA
• Inference – MCMC (Metropolis; Gibbs sampling)
• Experiments – performance and evaluations
• Summary – Bayesian hierarchical models
• Discussions
Motivations
• Statistical language modeling
- Syntactic dependencies → short-range dependencies
- Semantic dependencies → long-range dependencies
• Current models only consider one aspect
- Hidden Markov Models (HMMs) : syntactic modeling
- Latent Dirichlet Allocation (LDA) : semantic modeling
- Probabilistic Latent Semantic Indexing (pLSI) : semantic modeling
A model which could capture both kinds of dependencies
may be more useful!
Problem Formalization
• Word
- A word is an item from a vocabulary indexed by {1, …, V}, represented as a unit-basis vector: the vth word is a V-vector w whose vth element is 1 and whose other elements are 0 (see the sketch after this list).
• Document
- A document is a sequence of N words denoted by w = {w1,
w2 , … , wN}, where wi is the ith word in the sequence.
• Corpus
- A corpus is a collection of M documents, denoted by D =
{w1, w2 , … , wM}
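A minimal Python sketch of these representations (the vocabulary size and word indices below are arbitrary examples, not from the paper):

import numpy as np

V = 5                                   # vocabulary size
def one_hot(v):
    """Unit-basis vector for the vth word (1-indexed, as above)."""
    w = np.zeros(V, dtype=int)
    w[v - 1] = 1
    return w

document = [one_hot(v) for v in (3, 1, 3, 5)]   # a document: a sequence of N words
corpus = [document]                             # a corpus: a collection of M documents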
Latent Semantic Structure
[Diagram: latent structure ℓ → words w]

Distribution over words:
P(w) = \sum_{\ell} P(w, \ell)

Inferring latent structure:
P(\ell \mid w) = \frac{P(w \mid \ell)\, P(\ell)}{P(w)}

Prediction:
P(w_{n+1} \mid w) = \ldots
Probabilistic Generative Models
• Probabilistic Latent Semantic Indexing (pLSI)
- Hofmann (1999), ACM SIGIR
- Probabilistic semantic model
• Latent Dirichlet Allocation (LDA)
- Blei, Ng, & Jordan (2003) J. of Machine Learning Res.
- Probabilistic semantic model
• Hidden Markov Models (HMMs)
- Baum & Petrie (1966), Ann. Math. Stat.
- Probabilistic syntactic model
Dirichlet vs. Multinomial Distributions
• Dirichlet Distribution (conjugate prior)

  p(\theta) = \frac{\Gamma(\sum_{i=1}^{k} \alpha_i)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}, \qquad \sum_{i=1}^{k} \theta_i = 1

• Multinomial Distribution

  p(x) = \frac{(\sum_{i=1}^{k} x_i)!}{\prod_{i=1}^{k} x_i!}\, \theta_1^{x_1} \cdots \theta_k^{x_k}, \qquad \sum_{i=1}^{k} \theta_i = 1
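A small numerical illustration of the Dirichlet-multinomial conjugacy behind these two distributions (the sketch below is my own, in Python; the dimension and hyperparameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
k, alpha, n = 4, 0.5, 100

theta = rng.dirichlet(alpha * np.ones(k))      # theta ~ Dir(alpha), symmetric prior
counts = rng.multinomial(n, theta)             # x ~ Multinomial(n, theta)

# Conjugacy: the posterior over theta is again Dirichlet, with updated parameters.
posterior_params = alpha * np.ones(k) + counts
posterior_mean = posterior_params / posterior_params.sum()
print(theta, counts, posterior_mean)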
Probabilistic LSI : Graphical Model
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)

[Graphical model: plates over D documents and N_d words; d → z → w. The topic z is a latent variable, p(z | d) models the distribution over topics for document d, and each word is generated from its sampled topic.]
Probabilistic LSI- Parameter Estimation
• The log-likelihood of Probabilistic LSI
• EM algorithm
- E-step
- M-step
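For reference, the standard form of the pLSI log-likelihood and EM updates (following Hofmann, 1999), with n(d, w) the count of word w in document d:

\mathcal{L} = \sum_{d} \sum_{w} n(d, w) \log \Big[ p(d) \sum_{z} p(w \mid z)\, p(z \mid d) \Big]

E-step: \quad p(z \mid d, w) = \frac{p(w \mid z)\, p(z \mid d)}{\sum_{z'} p(w \mid z')\, p(z' \mid d)}

M-step: \quad p(w \mid z) \propto \sum_{d} n(d, w)\, p(z \mid d, w), \qquad p(z \mid d) \propto \sum_{w} n(d, w)\, p(z \mid d, w)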
LDA : Graphical Model
[Graphical model: α → θ → z → w ← φ^(z) ← β, with plates over T topics, N_d words, and D documents. For each document, sample a distribution over topics θ; for each word, sample a topic z, then sample the word w from that topic's distribution φ^(z).]
Latent Dirichlet Allocation
• A variant of LDA developed by Griffiths (2003)
- choose N | ξ ~ Poisson(ξ)
- sample θ | α ~ Dir(α)
- sample φ | β ~ Dir(β)
- sample z | θ ~ Multinomial(θ)
- sample w | z, φ^(z) ~ Multinomial(φ^(z))
• Model Inference
- all the Dirichlet priors are assumed to be symmetric
- instead of variational inference and empirical Bayes parameter estimation, Gibbs sampling is adopted
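A runnable Python sketch of this generative process; the number of topics, vocabulary size, and hyperparameter values below are illustrative choices, not the paper's settings:

import numpy as np

rng = np.random.default_rng(1)
T, V = 5, 20              # number of topics, vocabulary size
alpha, beta, xi = 0.1, 0.01, 50

phi = rng.dirichlet(beta * np.ones(V), size=T)   # one word distribution per topic

def generate_document():
    N = rng.poisson(xi)                          # document length
    theta = rng.dirichlet(alpha * np.ones(T))    # topic proportions for this document
    words = []
    for _ in range(N):
        z = rng.choice(T, p=theta)               # sample a topic
        w = rng.choice(V, p=phi[z])              # sample a word from that topic
        words.append(w)
    return words

doc = generate_document()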
The Composite Model
• An intuitive representation
[Figure: a word sequence w1 … w4 with topic assignments z1 … z4 drawn from the document's topic distribution θ, and syntactic states s1 … s4 forming a Markov chain. The semantic state generates words from LDA; the syntactic states generate words from the HMM.]
Composite Model : Graphical Model
[Graphical model: per-document topic proportions θ^(d) with Dirichlet prior α; topic-word distributions φ^(z) with prior β (plate over T topics); class-word distributions φ^(c) with prior γ (plate over C classes); class-transition distributions π^(c) with prior δ; plates over N_d words and M documents. Each word w_i depends on its topic z_i and its class c_i.]
Composite Model
• All the Dirichlet priors are assumed to be symmetric
- choose N | ξ ~ Poisson(ξ)
- sample θ^(d) | α ~ Dir(α)
- sample φ^(z) | β ~ Dir(β)
- sample φ^(c) | γ ~ Dir(γ)
- sample π^(c) | δ ~ Dir(δ)
- sample z_i | θ^(d) ~ Multinomial(θ^(d))
- sample c_i | π^(c_{i-1}) ~ Multinomial(π^(c_{i-1}))
- sample w_i | z_i, φ^(z_i) ~ Multinomial(φ^(z_i)) if c_i = 1
- sample w_i | c_i, φ^(c_i) ~ Multinomial(φ^(c_i)) otherwise
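A Python sketch of this generative process, with an HMM over classes and class 1 reserved for the semantic (LDA) component as on the slide; all names, the start state, and the hyperparameter values are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(2)
T, C, V = 5, 3, 20                  # topics, syntactic classes, vocabulary size
alpha, beta, gamma, delta, xi = 0.1, 0.01, 0.01, 0.1, 50

phi_z = rng.dirichlet(beta * np.ones(V), size=T)    # topic -> word distributions
phi_c = rng.dirichlet(gamma * np.ones(V), size=C)   # class -> word distributions
pi = rng.dirichlet(delta * np.ones(C), size=C)      # class transition matrix

def generate_document():
    N = rng.poisson(xi)
    theta = rng.dirichlet(alpha * np.ones(T))        # per-document topic proportions
    words, c_prev = [], 0                            # assume class 0 as the start state
    for _ in range(N):
        z = rng.choice(T, p=theta)                   # topic for this position
        c = rng.choice(C, p=pi[c_prev])              # next syntactic class
        if c == 1:                                   # semantic class: emit from LDA
            w = rng.choice(V, p=phi_z[z])
        else:                                        # syntactic class: emit from HMM
            w = rng.choice(V, p=phi_c[c])
        words.append(w)
        c_prev = c
    return words

doc = generate_document()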
The Composite Model: Generative process
Bayesian Inference
• The EM algorithm can be applied to the composite model
- treating θ, φ^(z), φ^(c), and π^(c) as parameters
- with log P(w | θ, φ^(z), φ^(c), π^(c)) as the likelihood
- too many parameters and too slow convergence
- the Dirichlet priors are necessary assumptions!
• Markov Chain Monte Carlo (MCMC)
- instead of explicitly representing θ, φ^(z), φ^(c), and π^(c), we consider the posterior distribution over the assignments of words to topics or classes, P(z | w) and P(c | w)
Markov Chain Monte Carlo
• Sampling from the posterior distribution according to a Markov chain
- an ergodic (irreducible and aperiodic) Markov chain converges to a unique equilibrium distribution π(x)
- try to sample the parameters according to a Markov chain whose equilibrium distribution π(x) is exactly the posterior distribution
• The key task is to construct a suitable transition kernel T(x, x′)
Metropolis-Hastings Algorithm
• Sampling by constructing a reversible Markov
chain
- a reversible Markov chain guarantees that π(x) is its equilibrium distribution
- the simultaneous Metropolis-Hastings algorithm is similar in spirit to rejection sampling
Metropolis-Hastings Algorithm (cont.)
• Algorithm
loop
  sample x′ from Q(x(t), x′);
  a = min{1, [π(x′) Q(x′, x(t))] / [π(x(t)) Q(x(t), x′)]};
  r = U(0, 1);
  if a < r then reject: x(t+1) = x(t); else accept: x(t+1) = x′;
end
- Metropolis-Hastings intuition: a proposed move to a point x* of higher density is always accepted (r = 1.0), while a move to a point of lower density is accepted with probability r = π(x*)/π(x(t))
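A runnable Python sketch of this loop with a symmetric random-walk proposal (so the Q ratio cancels); the target density and step size below are illustrative choices, not settings from the paper:

import numpy as np

def metropolis_hastings(p_tilde, x0, n_steps=10_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        x_prop = x + step * rng.normal()                # symmetric proposal
        a = min(1.0, p_tilde(x_prop) / p_tilde(x))      # acceptance probability
        if rng.uniform() < a:                           # accept with probability a,
            x = x_prop                                  # otherwise keep x(t+1) = x(t)
        samples.append(x)
    return np.array(samples)

# Example: sample from an (unnormalized) standard normal density.
samples = metropolis_hastings(lambda x: np.exp(-0.5 * x**2), x0=0.0)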
Metropolis-Hastings Algorithm
• Why it works
Single-site Updating algorithm
Gibbs Sampling
• A special case of the single-site updating Metropolis algorithm: proposing from the full conditional distribution makes the acceptance probability exactly 1
Gibbs Sampling for Composite Model
θ, φ, and π are all integrated out of the corresponding terms; the hyperparameters are sampled with the single-site Metropolis-Hastings algorithm
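For reference, the standard collapsed Gibbs update for the LDA component after integrating out θ and φ (Griffiths & Steyvers); the composite model's update additionally conditions on the class assignments c, which is omitted here:

P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n^{(w_i)}_{-i,\,t} + \beta}{n^{(\cdot)}_{-i,\,t} + V\beta} \cdot \frac{n^{(d_i)}_{-i,\,t} + \alpha}{n^{(d_i)}_{-i,\,\cdot} + T\alpha}

where n^{(w_i)}_{-i,t} counts how often word w_i is assigned to topic t and n^{(d_i)}_{-i,t} counts topic t in document d_i, both excluding position i.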
Experiments
• Corpora
- Brown corpus: 500 documents, 1,137,466 word tokens
- TASA corpus: 37,651 documents, 12,190,931 word tokens
- NIPS corpus: 1,713 documents, 4,312,614 word tokens
- W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
• Experimental Design
- one class is reserved for the sentence start/end markers {., ?, !}
- T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMM)
- 4,000 iterations, with a burn-in of 2,000 and a lag of 100
- 1st-, 2nd-, and 3rd-order Markov chains are considered
Identifying function and content words
Comparative study on NIPS corpus
(T=100 & C = 50)
Identifying function and content words
(NIPS)
Marginal probabilities
• Bayesian model comparison
- P(w | M) is calculated using the harmonic mean of the likelihoods over the 2,000 iterations after burn-in
- used to evaluate the Bayes factors
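A sketch of the harmonic-mean estimator of P(w | M), computed in log space for numerical stability (the per-sample log-likelihoods below are synthetic placeholders, not values from the experiments):

import numpy as np
from scipy.special import logsumexp

def log_marginal_harmonic_mean(log_likelihoods):
    """log P(w|M) approximated as -log( mean( 1 / L_s ) ) over posterior samples s."""
    log_liks = np.asarray(log_likelihoods)
    n = len(log_liks)
    # log( (1/n) * sum(exp(-log_liks)) ) = logsumexp(-log_liks) - log(n)
    return -(logsumexp(-log_liks) - np.log(n))

# Example with 2,000 fake per-sample log-likelihoods.
fake_log_liks = np.random.default_rng(0).normal(-1e4, 50, size=2000)
print(log_marginal_harmonic_mean(fake_log_liks))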
Part of Speech Tagging
• Assessed performance on the Brown corpus
- one tag set consisted of all Brown tags (297)
- the other set collapsed the Brown tags into 10 designations
- the 20th sample was used, evaluated by the Adjusted Rand Index
- compared with DC on the 1,000 most frequent words, using 19 clusters
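A minimal example of the Adjusted Rand Index evaluation using scikit-learn; the two label arrays are toy placeholders, not the paper's tags or induced classes:

from sklearn.metrics import adjusted_rand_score

gold_tags = [0, 0, 1, 1, 2, 2, 2]        # e.g., collapsed Brown tags per word
induced_classes = [1, 1, 0, 0, 2, 2, 0]  # e.g., classes from the 20th Gibbs sample
print(adjusted_rand_score(gold_tags, induced_classes))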
Document Classification
• Evaluated with a Naïve Bayes classifier
- the 500 Brown documents are classified into 15 groups
- the topic vectors produced by LDA and the composite model are used to train the Naïve Bayes classifier
- 10-fold cross-validation is used to evaluate the 20th sample
• Results (baseline accuracy: 0.09)
- trained on Brown: LDA (0.51); 1st-order composite model (0.45)
- Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
- Explanation: only about 20% of the words are allocated to the semantic component, too few to find correlations!
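A sketch of this classification setup: train a Naïve Bayes classifier on per-document topic vectors and score it with 10-fold cross-validation (the random topic-count matrix and labels are placeholders, not the Brown data):

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_docs, n_topics, n_groups = 500, 200, 15
topic_vectors = rng.integers(0, 20, size=(n_docs, n_topics))  # topic-assignment counts per document
labels = rng.integers(0, n_groups, size=n_docs)               # document group labels

scores = cross_val_score(MultinomialNB(), topic_vectors, labels, cv=10)
print(scores.mean())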
Summary
• Bayesian hierarchical models are natural for text modeling
• Simultaneously learning syntactic classes and semantic topics is possible through the combination of basic modules
• Discovering syntactic and semantic building blocks forms the basis of more sophisticated representations
• Similar ideas could be generalized to other areas
Discussions
• Gibbs sampling vs. the EM algorithm?
• Hierarchical models reduce the number of parameters, but what about model complexity?
• Equal priors for Bayesian model comparison?
• Is there really any effect of the four hyperparameters?
• Probabilistic LSI makes no normality assumption, while probabilistic PCA assumes a normal distribution!
• EM is sensitive to local maxima; why does the Bayesian approach get around this?
• Is the document classification experiment a good evaluation?
• Majority vote for tagging?