LSI, pLSI, LDA and inference methods

Guillaume Obozinski
INRIA - Ecole Normale Supérieure - Paris
RussIR summer school
Yaroslavl, August 6-10th 2012
Latent Semantic Indexing
Latent Semantic Indexing (LSI)
(Deerwester et al., 1990)

Idea: words that co-occur frequently in documents should be similar.

Let x1(i) and x2(i) count respectively the number of occurrences of the words physician and doctor in the i-th document.

[Figure: scatter plot of the documents in the plane (x1, x2) with canonical axes e1, e2, two document vectors x(1), x(2), and the first principal direction u(1).]

The directions of covariance, or principal directions, are obtained using the singular value decomposition of X ∈ Rd×N:
    X = U S V⊤,   with U⊤U = Id, V⊤V = IN
and S ∈ Rd×N a matrix with non-zero elements only on the diagonal: the singular values of X, positive and sorted in decreasing order.
LSI: computation of the document representation

U = [ u(1) · · · u(d) ] : the columns u(k) are the principal directions.

Let UK ∈ Rd×K, VK ∈ RN×K be the matrices retaining the first K columns and SK ∈ RK×K the top left K × K corner of S.

The projection of x(i) on the subspace spanned by UK yields the latent representation:
    x̃(i) = UK⊤ x(i).

Remarks
UK⊤ X = UK⊤ UK SK VK⊤ = SK VK⊤
u(k) is somehow like a topic and x̃(i) is the vector of coefficients of the decomposition of a document on the K “topics”.
The similarity between two documents can now be measured by
    cos(∠(x̃(i), x̃(j))) = x̃(i) · x̃(j) / (‖x̃(i)‖ ‖x̃(j)‖)
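As a concrete illustration, a minimal numpy sketch of these steps on a toy count matrix (the matrix X and the choice K = 2 are made up for the example; a real pipeline would typically build X from TF-IDF scores):

import numpy as np

def lsi(X, K):
    """Truncated SVD of the d x N term-document matrix X.

    Returns U_K (the first K principal directions, one column per "topic")
    and the latent representations x_tilde(i) = U_K^T x(i), one column per document.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^T, s sorted decreasingly
    U_K = U[:, :K]
    X_tilde = U_K.T @ X                                # equals S_K V_K^T
    return U_K, X_tilde

def cosine(a, b):
    """Cosine of the angle between two latent document vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# toy usage: 6 words, 8 documents, compare documents 0 and 1 in a 2-dimensional latent space
X = np.random.default_rng(0).poisson(1.0, size=(6, 8)).astype(float)
U_K, X_tilde = lsi(X, K=2)
print(cosine(X_tilde[:, 0], X_tilde[:, 1]))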
LSI vs PCA

LSI is almost identical to Principal Component Analysis (PCA), proposed by Karl Pearson in 1901.
Like PCA, LSI aims at finding the directions of high correlation between words, called the principal directions.
Like PCA, it retains the projection of the data on a number K of these principal directions; these projections are called the principal components.

Difference between LSI and PCA
LSI does not center the data (for no specific reason).
LSI is typically combined with TF-IDF weighting.
Limitations and shortcomings of LSI

The generative model of the data underlying PCA is a Gaussian cloud, which does not match the structure of the data.
In particular, LSI ignores:
that the data are counts, frequencies or tf-idf scores;
that the data are positive (uk typically has negative coefficients).
The singular value decomposition is expensive to compute.
Topic models and matrix factorization

X ∈ Rd×M with columns xi corresponding to documents
B the matrix whose columns correspond to the different topics
Θ the matrix of decomposition coefficients, with columns θi each associated to one document and encoding its “topic content”.
    X = B Θ
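To make the factorization picture concrete, a small sketch with made-up dimensions; each document column of X is modelled as Bθi, a mixture of the topic columns of B:

import numpy as np

d, M, K = 1000, 50, 10                        # vocabulary size, documents, topics (arbitrary)
rng = np.random.default_rng(0)
B = rng.dirichlet(np.ones(d), size=K).T       # d x K: column k is the word distribution of topic k
Theta = rng.dirichlet(np.ones(K), size=M).T   # K x M: column i is the topic content of document i
X_model = B @ Theta                           # d x M: column i is the modelled word distribution of document i
assert X_model.shape == (d, M)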
Probabilistic LSI
Probabilistic Latent Semantic Indexing (Hofmann, 2001)

[Figure: (a) three example topics shown as word lists (computer, technology, system, service, site, phone, internet, machine; sell, sale, store, product, business, advertising, market, consumer; play, film, movie, theater, production, star, director, stage) and (b) seven New York Times headlines placed on the simplex of their topic proportions. Caption: the latent space of a topic model consists of topics, which are distributions over words, and a distribution over these topics for each document; on the left are three topics from a fifty-topic LDA model trained on articles from the New York Times, on the right a simplex depicting the distribution over topics associated with seven documents.]
Probabilistic Latent Semantic Indexing (Hofmann, 2001)

Obtain a more expressive model by allowing several topics per document in various proportions, so that each word win gets its own topic zin drawn from the multinomial distribution di unique to the i-th document.

[Graphical model: di → zin → win ← B, with a plate over the Ni words of document i and a plate over the M documents]

di : topic proportions in document i
zin ∼ M(1, di)
(win | zink = 1) ∼ M(1, (b1k , . . . , bdk ))
EM algorithm for pLSI

Denote j*in the index in the dictionary of the word appearing in document i as the nth word.

Expectation step
    qink(t) = p(zink = 1 | win ; di(t−1), B(t−1)) = dik(t−1) bj*in,k(t−1) / ∑k'=1..K dik'(t−1) bj*in,k'(t−1)

Maximization step
    dik(t) = ∑n=1..N(i) qink(t) / ∑n=1..N(i) ∑k'=1..K qink'(t) = Ñk(i) / N(i)
and
    bjk(t) = ∑i=1..M ∑n=1..N(i) qink(t) winj / ∑i=1..M ∑n=1..N(i) qink(t)
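A compact numpy sketch of these updates, working with a document-word count matrix instead of individual word positions (equivalent here, since qink(t) only depends on the identity of the word); the matrix sizes and the random initialization are illustrative assumptions:

import numpy as np

def plsi_em(N_counts, K, n_iter=100, seed=0):
    """EM for pLSI on an M x d document-word count matrix N_counts.

    Returns D (M x K, rows are per-document topic proportions d_i)
    and B (d x K, columns are per-topic word distributions b_k).
    """
    rng = np.random.default_rng(seed)
    M, d = N_counts.shape
    D = rng.dirichlet(np.ones(K), size=M)       # M x K
    B = rng.dirichlet(np.ones(d), size=K).T     # d x K
    for _ in range(n_iter):
        # E-step: responsibilities q[i, j, k] ∝ D[i, k] * B[j, k]
        q = D[:, None, :] * B[None, :, :]       # M x d x K
        q /= q.sum(axis=2, keepdims=True) + 1e-12
        # M-step: reweight by the observed counts and renormalize
        Nq = N_counts[:, :, None] * q           # expected counts per (document, word, topic)
        D = Nq.sum(axis=1)
        D /= D.sum(axis=1, keepdims=True)       # d_ik = sum_n q_ink / N_i
        B = Nq.sum(axis=0)                      # d x K
        B /= B.sum(axis=0, keepdims=True)       # b_jk = sum_i,n q_ink w_inj / sum_i,n q_ink
    return D, B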
Issues with pLSI?

Too many parameters → overfitting! (not so clear)

Solutions or alternative approaches

Frequentist approach: regularize + optimize → Dictionary Learning
    minθi − log p(xi |θi ) + λ Ω(θi )

Bayesian approach: prior + integrate → Latent Dirichlet Allocation
    p(θi |xi , α) ∝ p(xi |θi ) p(θi |α)

“Frequentist + Bayesian” → integrate + optimize
    maxα ∏i=1..M ∫ p(xi |θi ) p(θi |α) dθi

... called the Empirical Bayes approach or Type II Maximum Likelihood.
Latent Dirichlet Allocation
Latent Dirichlet Allocation
(Blei et al., 2003)

[Graphical model: α → θi → zin → win ← B, with a plate over the Ni words of document i and a plate over the M documents]

K topics
α = (α1 , . . . , αK ) parameter vector
θi = (θ1i , . . . , θKi ) ∼ Dir(α) topic proportions
zin topic indicator vector for the nth word of the i-th document:
    zin = (zin1 , . . . , zinK )⊤ ∈ {0, 1}K
    zin ∼ M(1, (θ1i , . . . , θKi ))
    p(zin |θi ) = ∏k=1..K θki^zink
win | {zink = 1} ∼ M(1, (b1k , . . . , bdk ))
    p(winj = 1 | zink = 1) = bjk
LDA likelihood

p((win , zin )1≤n≤Ni | θi ) = ∏n=1..Ni p(win , zin | θi )
                           = ∏n=1..Ni ∏j=1..d ∏k=1..K (bjk θki )^(winj zink)

p((win )1≤n≤Ni | θi ) = ∏n=1..Ni ∑zin p(win , zin | θi )
                      = ∏n=1..Ni ∏j=1..d ( ∑k=1..K bjk θki )^winj ,

so that
    win | θi  i.i.d. ∼ M(1, Bθi )    or    xi | θi ∼ M(Ni , Bθi ).
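Once the z's are marginalized out, sampling a synthetic document under the model takes one line per step; a sketch where α, B and the document length are arbitrary values chosen for the example:

import numpy as np

rng = np.random.default_rng(0)
d, K, N_i = 20, 3, 50                         # vocabulary size, topics, words in the document
alpha = np.full(K, 0.5)                       # Dirichlet parameter (assumed)
B = rng.dirichlet(np.ones(d), size=K).T       # d x K topic-word matrix

theta_i = rng.dirichlet(alpha)                # topic proportions of the document: theta_i ~ Dir(alpha)
x_i = rng.multinomial(N_i, B @ theta_i)       # word counts: x_i | theta_i ~ M(N_i, B theta_i)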
LDA as Multinomial Factor Analysis

Eliminating the zs from the model yields a conceptually simpler model, in which θi can be interpreted as latent factors, as in factor analysis.

[Graphical model: α → θi → xi ← B, plate over the M documents]

Topic proportions for document i: θi ∈ RK , θi ∼ Dir(α)
Empirical word counts for document i: xi ∈ Rd , xi ∼ M(Ni , Bθi )
LDA with smoothing of the dictionary

Issue with new words: they will have probability 0 if B is optimized over the training data.
→ Need to smooth B, e.g. via Laplacian smoothing.

[Graphical models: the word-level model α → θi → zin → win ← B ← η and the collapsed model α → θi → xi ← B ← η, i.e. B now gets a prior with parameter η]
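A sketch of what such smoothing amounts to when B is estimated from word-topic counts; the scalar pseudo-count η here is an assumption of the example, and plays the role of the Dirichlet prior p(bk |η) used in the Gibbs sampler later:

import numpy as np

def smooth_topic_word_matrix(word_topic_counts, eta=0.01):
    """word_topic_counts: d x K matrix of word-topic co-occurrence counts.

    Adding the pseudo-count eta gives every word, including words unseen
    in the training data, a non-zero probability under each topic.
    """
    smoothed = word_topic_counts + eta
    return smoothed / smoothed.sum(axis=0, keepdims=True)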
Learning with LDA
How do we learn with LDA?

How do we learn for each topic its word distribution bk ?
How do we learn for each document its topic composition θi ?
How do we assign to each word of a document its topic zin ?

These quantities are treated in a Bayesian fashion, so the natural Bayesian answers are
    p(B|W),   p(θi |W),   p(zin |W)
or
    E(B|W),   E(θi |W),   E(zin |W)
if point estimates are needed.
Monte Carlo

Principle of Monte Carlo integration
Let Z be a random variable. To compute E[f (Z )], we can sample
    Z (1) , . . . , Z (B)  i.i.d. ∼ Z
and use the approximation
    E[f (Z )] ≈ (1/B) ∑b=1..B f (Z (b) ).

Problem: in most situations, sampling exactly from the distribution of Z is too hard, so this direct approach is impossible.
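A tiny sanity check of the principle, estimating E[Z²] for Z ∼ N(0, 1) (true value 1); the sample size is arbitrary:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)   # Z^(1), ..., Z^(B) i.i.d. ~ N(0, 1)
print(np.mean(samples ** 2))         # Monte Carlo estimate of E[Z^2], close to 1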
Markov Chain Monte Carlo (MCMC)

If we cannot sample exactly from the distribution of Z , i.e. from some q(z) = P(Z = z) or from the density q(z) of the r.v. Z , we can create a sequence of random variables that approaches the correct distribution.

Principle of MCMC
Construct a chain of random variables
    Z (b,1) , . . . , Z (b,T )    with    Z (b,t) ∼ pt (z | Z (b,t−1) = z (b,t−1) )
such that
    Z (b,T ) → Z in distribution as T → ∞.
We can then approximate:
    E[f (Z )] ≈ (1/B) ∑b=1..B f (Z (b,T ) ).
MCMC in practice

Run a single chain:
    E[f (Z )] ≈ (1/T ) ∑t=1..T f (Z (T0 +k·t) )
T0 is the burn-in time
k is the thinning factor
→ Useful to take k > 1 only if almost i.i.d. samples are required.
→ To compute an expectation in which the correlation between Z (t) and Z (t−1) would not interfere, take k = 1.

Main difficulties:
the mixing time of the chain can be very large
assessing whether the chain has mixed or not is a hard problem
⇒ proper approximation only with T very large.
⇒ MCMC can be quite slow or just never converge, and you will not necessarily know it.
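In code, burn-in and thinning are just an indexing convention on the stored chain; a sketch assuming the successive states Z(1), . . . , Z(Ttotal) have been collected in an array or list named chain:

import numpy as np

def mcmc_estimate(chain, f, burn_in=1000, thin=1):
    """Average f over the chain, discarding the first burn_in states
    and then keeping one state out of every `thin`."""
    kept = chain[burn_in::thin]
    return np.mean([f(z) for z in kept])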
Gibbs sampling

A nice special case of MCMC.

Principle of Gibbs sampling
For each node i in turn, sample the node conditionally on the other nodes, i.e. sample
    Zi(t) ∼ p(zi | Z−i = z−i(t−1) )

[Graphical model of smoothed LDA: α → θi → zin → win ← B ← η]

Markov Blanket
Definition: Let V be the set of nodes of the graph. The Markov blanket of node i is the minimal set of nodes S (not containing i) such that
    p(Zi | ZS ) = p(Zi | Z−i )
or equivalently
    Zi ⊥⊥ ZV \(S∪{i}) | ZS
d-separation

Theorem
Let A, B and C be three disjoint sets of nodes. The property XA ⊥⊥ XB | XC holds if and only if all paths connecting A to B are blocked, which means that they contain at least one blocking node. Node j is a blocking node
if there is no “v-structure” at j and j is in C , or
if there is a “v-structure” at j and neither j nor any of its descendants in the graph is in C .

[Figure: example directed graph on nodes a, b, c, e, f]
Markov Blanket in a Directed Graphical Model

[Figure: the Markov blanket of a node xi in a directed graphical model]
Markov Blankets in LDA?

[Graphical model: α → θi → zin → win ← B ← η]

Markov blankets:
θi → (zin )n=1...Ni
zin → win , θi and B
B → (win , zin )n=1...Ni , i=1...M
Gibbs sampling for LDA with a single document

p(w, z, θ, B; α, η) = [ ∏n=1..N p(wn |zn , B) p(zn |θ) ] p(θ|α) ∏k p(bk |η)
                    ∝ [ ∏n=1..N ∏j,k (bjk θk )^(wnj znk) ] ∏k θk^(αk −1) ∏j,k bjk^(ηj −1)

We thus have:
(zn | wn , θ) ∼ M(1, p̃n )         with p̃nk = bj(n),k θk / ∑k' bj(n),k' θk'
(θ | (zn , wn )n , α) ∼ Dir(α̃)    with α̃k = αk + Nk ,  Nk = ∑n=1..N znk
(bk | (zn , wn )n , η) ∼ Dir(η̃)   with η̃j = ηj + ∑n=1..N wnj znk
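A direct numpy transcription of these three conditionals for one document; the word indices, hyperparameters and number of sweeps are arbitrary, and this sketch simply keeps the last sample (no burn-in or thinning):

import numpy as np

def gibbs_lda_single_doc(words, K, d, alpha, eta, n_iter=1000, seed=0):
    """Gibbs sampler for the single-document LDA model above.

    words : sequence of dictionary indices j(1), ..., j(N).
    alpha : length-K Dirichlet parameter, eta : length-d Dirichlet parameter.
    """
    rng = np.random.default_rng(seed)
    words = np.asarray(words)
    N = len(words)
    theta = rng.dirichlet(alpha)
    B = rng.dirichlet(eta, size=K).T            # d x K, column k is topic k
    z = np.zeros(N, dtype=int)
    for _ in range(n_iter):
        # z_n | w_n, theta, B :  p(z_n = k) ∝ b_{j(n),k} * theta_k
        for n, j in enumerate(words):
            p = B[j] * theta
            z[n] = rng.choice(K, p=p / p.sum())
        # theta | z, alpha  ~  Dir(alpha_k + N_k)
        theta = rng.dirichlet(alpha + np.bincount(z, minlength=K))
        # b_k | z, w, eta  ~  Dir(eta_j + sum_n w_nj z_nk)
        for k in range(K):
            counts = np.bincount(words[z == k], minlength=d)
            B[:, k] = rng.dirichlet(eta + counts)
    return theta, B, z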
LDA Results
(Blei et al., 2003)

Top words of four example topics (labelled “Arts”, “Budgets”, “Children”, “Education”):
“Arts”: NEW, FILM, SHOW, MUSIC, MOVIE, PLAY, MUSICAL, BEST, ACTOR, FIRST, YORK, OPERA, THEATER, ACTRESS, LOVE
“Budgets”: MILLION, TAX, PROGRAM, BUDGET, BILLION, FEDERAL, YEAR, SPENDING, NEW, STATE, PLAN, MONEY, PROGRAMS, GOVERNMENT, CONGRESS
“Children”: CHILDREN, WOMEN, PEOPLE, CHILD, YEARS, FAMILIES, WORK, PARENTS, SAYS, FAMILY, WELFARE, MEN, PERCENT, CARE, LIFE
“Education”: SCHOOL, STUDENTS, SCHOOLS, EDUCATION, TEACHERS, HIGH, PUBLIC, TEACHER, BENNETT, MANIGAT, NAMPHY, STATE, PRESIDENT, ELEMENTARY, HAITI

Example document: “The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. ‘Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants an act ...”
Reading Tea leaves: word precision
(Boyd-Graber et al., 2009)

[Figure: model precision of the word intrusion task for CTM, LDA and pLSI with 50, 100 and 150 topics, on the New York Times and Wikipedia corpora; higher is better. Caption: surprisingly, although CTM generally achieves a better predictive likelihood than the other models (Table 1), the topics it infers fare worst when evaluated against human judgments.]

Precision of the identification of word outliers, by humans and for different models.
Reading Tea leaves: topic precision
(Boyd-Graber et al., 2009)

[Figure: topic log odds of the topic intrusion task for CTM, LDA and pLSI with 50, 100 and 150 topics, on the New York Times and Wikipedia corpora; higher is better. Caption: although CTM generally achieves a better predictive likelihood than the other models (Table 1), the topics it infers fare worst when evaluated against human judgments. The topic log odds are defined as the log ratio of the probability mass assigned to the true intruder (the one generated by the model) to the probability mass assigned to the intruder selected by the subject.]

Precision of the identification of topic outliers, by humans and for different models.
Reading Tea leaves: log-likelihood on held out data
(Boyd-Graber et al., 2009)

Table 1: two predictive metrics, predictive log likelihood / predictive rank. Consistent with values reported in the literature, CTM generally performs the best, followed by LDA, then pLSI.

Corpus (topics):         LDA                 CTM                 pLSI
New York Times (50):     -7.3214 / 784.38    -7.3335 / 788.58    -7.3384 / 796.43
New York Times (100):    -7.2761 / 778.24    -7.2647 / 762.16    -7.2834 / 785.05
New York Times (150):    -7.2477 / 777.32    -7.2467 / 755.55    -7.2382 / 770.36
Wikipedia (50):          -7.5257 / 961.86    -7.5332 / 936.58    -7.5378 / 975.88
Wikipedia (100):         -7.4629 / 935.53    -7.4385 / 880.30    -7.4748 / 951.78
Wikipedia (150):         -7.4266 / 929.76    -7.3872 / 852.46    -7.4355 / 945.29

Log-likelihoods of several models including LDA, pLSI and CTM (CTM = correlated topic model).
Variational inference for LDA
Principle of Variational Inference

Problem: it is hard to compute
    p(B, θi , zin |W),   E(B|W),   E(θi |W),   E(zin |W).

Idea of Variational Inference: find a distribution q which is
as close as possible to p(·|W), and
for which it is not too hard to compute Eq (B), Eq (θi ), Eq (zin ).

Usual approach:
1. Choose a simple parametric family Q for q.
2. Solve the variational formulation  minq∈Q KL( q ‖ p(·|W) ).
3. Compute the desired expectations: Eq (B), Eq (θi ), Eq (zin ).
Variational Inference for LDA
(Blei et al., 2003)

Assume B is a parameter, assume there is a single document, and focus on the inference on θ and (zn )n . Choose q in a factorized form (mean field approximation)
    q(θ, (zn )n ) = qθ (θ) ∏n=1..N qzn (zn )
with
    qθ (θ) = Γ (∑k γk ) / ∏k Γ (γk ) · ∏k θk^(γk −1)    and    qzn (zn ) = ∏k φnk^znk .

KL( q ‖ p(·|W) ) = Eq [ log q(θ, (zn )n ) / p(θ, (zn )n | W) ]
                 = Eq [ log qθ (θ) + ∑n log qzn (zn )
                        − log p(θ|α) − ∑n ( log p(zn |θ) + log p(wn |zn , B) ) ] + log p((wn )n )
Variational Inference for LDA II

We need E[ log qθ (θ) − log p(θ|α) + ∑n ( log qzn (zn ) − log p(zn |θ) − log p(wn |zn , B) ) ].

Eq [log qθ (θ)] = Eq [ log Γ (∑k γk ) − ∑k log Γ (γk ) + ∑k (γk −1) log(θk ) ]
               = log Γ (∑k γk ) − ∑k log Γ (γk ) + ∑k (γk −1) Eq [log(θk )]

Eq [log p(θ|α)] = ∑k (αk − 1) Eq [log(θk )] + cst

Eq [log qzn (zn ) − log p(zn |θ)] = Eq [ ∑k znk log(φnk ) − znk log(θk ) ]
                                 = ∑k Eq [znk ] ( log(φnk ) − Eq [log(θk )] )

Eq [log p(wn |zn , B)] = Eq [ ∑j,k znk wnj log(bjk ) ] = ∑j,k Eq [znk ] wnj log(bjk )
VI for LDA: Computing the expectations

The expectation of the logarithm of a Dirichlet r.v. can be computed exactly with the digamma function Ψ:
    Eq [log(θk )] = Ψ(γk ) − Ψ(∑k γk ),    with    Ψ(x) := ∂/∂x log Γ (x).

We obviously have Eq [znk ] = φnk .

The problem minq∈Q KL( q ‖ p(·|W) ) is therefore equivalent to minγ,(φn )n D(γ, (φn )n ) with
    D(γ, (φn )n ) = log Γ (∑k γk ) − ∑k log Γ (γk ) + ∑n,k φnk log(φnk )
                    − ∑n,k φnk ∑j wnj log(bjk ) − ∑k (αk + ∑n φnk − γk ) ( Ψ(γk ) − Ψ(∑k γk ) )
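A quick numerical check of the digamma identity, comparing it to a Monte Carlo estimate over Dirichlet samples (the value of γ and the sample size are arbitrary):

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
gamma = np.array([2.0, 5.0, 1.5])
samples = rng.dirichlet(gamma, size=200_000)
print(np.log(samples).mean(axis=0))           # Monte Carlo estimate of E_q[log theta_k]
print(digamma(gamma) - digamma(gamma.sum()))  # exact value: Psi(gamma_k) - Psi(sum_k gamma_k)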
VI for LDA: Solving for the variational updates

Introducing a Lagrangian to account for the constraints ∑k=1..K φnk = 1:
    L(γ, (φn )n ) = D(γ, (φn )n ) + ∑n=1..N λn ( 1 − ∑k=1..K φnk )

Computing the gradient of the Lagrangian:
    ∂L/∂γk = −(αk + ∑n φnk − γk ) ( Ψ′(γk ) − Ψ′(∑k γk ) )
    ∂L/∂φnk = log(φnk ) + 1 − ∑j wnj log(bjk ) − ( Ψ(γk ) − Ψ(∑k γk ) ) − λn

Partial minimizations in γ and φnk are therefore respectively solved by
    γk = αk + ∑n φnk    and    φnk ∝ bj(n),k exp( Ψ(γk ) − Ψ(∑k γk ) ),
where j(n) is the one and only j such that wnj = 1.
Variational Algorithm

Algorithm 1 Variational inference for LDA
Require: W, α, γinit , (φn,init )n
 1: while not converged do
 2:   γk ← αk + ∑n φnk
 3:   for n = 1..N do
 4:     for k = 1..K do
 5:       φnk ← bj(n),k exp( Ψ(γk ) − Ψ(∑k γk ) )
 6:     end for
 7:     φn ← φn / ∑k φnk
 8:   end for
 9: end while
10: return γ, (φn )n

With the quantities computed we can approximate:
    E[θk |W] ≈ γk / ∑k' γk'    and    E[zn |W] ≈ φn
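A numpy/scipy sketch of Algorithm 1 for a single document, taking B as a fixed parameter; the stopping criterion is replaced by a fixed number of iterations and all dimensions are assumptions of the example:

import numpy as np
from scipy.special import digamma

def lda_variational_inference(words, B, alpha, n_iter=100):
    """Mean-field coordinate ascent of Algorithm 1 for one document.

    words : dictionary indices j(1), ..., j(N);  B : d x K topic-word matrix.
    Returns gamma (variational Dirichlet parameter for theta) and
    phi (N x K variational topic assignment probabilities).
    """
    words = np.asarray(words)
    alpha = np.asarray(alpha, dtype=float)
    N, K = len(words), len(alpha)
    phi = np.full((N, K), 1.0 / K)
    gamma = alpha + N / K
    for _ in range(n_iter):
        gamma = alpha + phi.sum(axis=0)                             # gamma_k <- alpha_k + sum_n phi_nk
        phi = B[words] * np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi /= phi.sum(axis=1, keepdims=True)                       # normalize each phi_n over topics
    return gamma, phi

# point estimates: E[theta_k | W] ≈ gamma_k / sum_k' gamma_k'  and  E[z_n | W] ≈ phi_n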
Polylingual Topic Model
(Mimno et al., 2009)

Generalization of LDA to documents available simultaneously in several languages, such as Wikipedia articles, which are not literal translations of one another but share the same topics.

[Graphical model: a single α and shared topic proportions θi per document; for each language l = 1, . . . , L its own topic indicators zin(l), words win(l) and topic-word matrix B(l), with plates over the Ni(l) words of each language and over the M documents]
References

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., and Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196.

Mimno, D., Wallach, H., Naradowsky, J., Smith, D., and McCallum, A. (2009). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 880–889. Association for Computational Linguistics.