Online Technical Appendix
Hansen, McMahon, and Prat (2015)
This appendix details various technical aspects of the estimation of Latent Dirichlet Allocation for the paper "Transparency and Deliberation within the FOMC: a Computational Linguistics Approach". It discusses vocabulary selection, gives background on the Dirichlet distribution, defines LDA and derives the Gibbs sampling equations,[1] and explains how we form aggregate topic distributions at the section level. All source code is available at https://github.com/sekhansen/text-mining-tutorial, and a worked example of the analysis is available at http://nbviewer.ipython.org/url/www.econ.upf.edu/~shansen/tutorial_notebook.ipynb.
1  Document Pre-processing and Vocabulary Selection
The first step in document processing is to tokenize each FOMC statement. This breaks statements apart based on white space and punctuation.[2] So that contractions like 'aren't' are not broken into fragments such as 'aren' and 't', we first replace all contractions with their equivalent words, so that 'aren't' becomes 'are' 'not'.
Then, to construct the documents that we input into LDA, we apply the following steps:
1. Remove all tokens not containing solely alphabetic characters. This strips out
punctuation and numbers.
2. Remove all tokens of length 1. This strips out algebraic symbols, copyright signs,
currency symbols, etc.
3. Convert all tokens to lowercase.
4. Remove stop words, or extremely common words that appear in many documents.
Our list is rather conservative, and contains all English pronouns, auxiliary verbs,
and articles.[3]
5. Stem the remaining tokens to reduce them to a common linguistic root. We use the default Porter Stemmer implemented in Python's Natural Language Toolkit. For example, 'preferences', 'preference', and 'prefers' all become 'prefer'. The output of the Porter Stemmer need not be an English word; for example, the stem of 'inflation' is 'inflat'. Hence, the words are now most appropriately called "tokens".

[1] Heinrich (2009) contains additional material.

[2] For tokenization and some other language processing tasks outlined below, we used the Natural Language Toolkit developed for Python and described in Bird, Klein, and Loper (2009).

[3] The list is available from http://snowball.tartarus.org/algorithms/english/stop.txt. The removal of length-1 tokens already eliminates the pronoun 'I' and the article 'a'. The other length-1 English word is the exclamation 'O'. We converted 'US' to 'United States' so that it was not stripped out with the pronoun 'us'.
As explained in the text, we then made a custom list of two- and three-word sequences
to replace with a single token.
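The following sketch illustrates steps 1-5 and the phrase replacement in Python with NLTK. It is a minimal illustration, not the paper's code: the contraction map, stop-word list, and phrase list are placeholders, and word_tokenize requires NLTK's 'punkt' data.

```python
# Sketch of the pre-processing pipeline (steps 1-5); the lists below are
# illustrative placeholders, not the exact lists used in the paper.
import re
from nltk.tokenize import word_tokenize   # requires NLTK 'punkt' data
from nltk.stem import PorterStemmer

CONTRACTIONS = {"aren't": "are not", "isn't": "is not", "won't": "will not"}
STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "he", "she", "it", "we"}
PHRASES = {("interest", "rate"): "interest_rate"}   # example multi-word token

stemmer = PorterStemmer()

def preprocess(statement):
    # Protect 'US' and expand contractions before tokenizing (see footnote [3]).
    text = re.sub(r"\bUS\b", "United States", statement)
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)

    tokens = word_tokenize(text)                         # split on whitespace/punctuation
    tokens = [t for t in tokens if t.isalpha()]          # step 1: alphabetic tokens only
    tokens = [t for t in tokens if len(t) > 1]           # step 2: drop length-1 tokens
    tokens = [t.lower() for t in tokens]                 # step 3: lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]   # step 4: remove stop words
    tokens = [stemmer.stem(t) for t in tokens]           # step 5: Porter stemming

    # Replace listed multi-word sequences with a single token (bigrams shown here).
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in PHRASES:
            merged.append(PHRASES[(tokens[i], tokens[i + 1])])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(preprocess("Inflation expectations aren't well anchored in the US economy."))
```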
Finally, we rank the 13,888 remaining tokens in terms of their contribution to discriminating among documents. The ranking punishes words both when they are infrequent and when they appear in many documents.[4] Plotting the ranking indicates a natural cutoff, which we use to select V = 8,615 words for the topic model. Words below this cutoff are removed from the dataset.

After this processing, 2,715,586 total tokens remain. Some statements are empty, so we remove them from the dataset, leaving D = 46,169 total documents for input into the Gibbs sampler.

[4] More specifically, we computed each word's term-frequency inverse-document-frequency (tf-idf) score. The formula for token $v$ is
$$\text{tf-idf}_v = \log\left[1 + N_v\right] \times \log\left(\frac{D}{D_v}\right),$$
where $N_v$ is the number of times $v$ appears in the dataset and $D_v$ is the number of documents in which $v$ appears.
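A minimal sketch of the tf-idf ranking described above, using the formula in footnote [4]; the toy corpus and the cutoff value are illustrative, not the ones used in the paper.

```python
# Rank tokens by tf-idf (footnote [4]) and keep those above a cutoff.
# `docs` is a list of token lists from the pre-processing step; the
# cutoff below is illustrative, not the value chosen from the plot.
import math
from collections import Counter

def rank_vocabulary(docs):
    D = len(docs)
    term_counts = Counter()     # N_v: total occurrences of v
    doc_counts = Counter()      # D_v: number of documents containing v
    for doc in docs:
        term_counts.update(doc)
        doc_counts.update(set(doc))
    scores = {
        v: math.log(1 + term_counts[v]) * math.log(D / doc_counts[v])
        for v in term_counts
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = [["inflat", "rate", "growth"], ["inflat", "employ"], ["growth", "employ", "risk"]]
ranking = rank_vocabulary(docs)
vocab = {v for v, score in ranking if score > 0.1}   # illustrative cutoff
print(ranking, vocab)
```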
2  Properties of the Dirichlet distribution
A Dirichlet random variable is characterized in terms of a parameter vector $\alpha = (\alpha_1, \ldots, \alpha_K)$ and has a probability density function given by
$$\text{Dir}(\theta \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},$$
where
$$B(\alpha) \equiv \frac{\prod_{k=1}^{K} \Gamma(\alpha_k)}{\Gamma(\alpha_0)} \quad \text{and} \quad \alpha_0 \equiv \sum_{k=1}^{K} \alpha_k$$
is a generalization of the beta function.
Now suppose we have count data of the form $X = (x_1, \ldots, x_n)$, where each $x_i$ is a single draw from a categorical distribution with parameter $\theta = (\theta_1, \ldots, \theta_K)$; that is, $\Pr[x_i = k \mid \theta] = \theta_k$. The probability of observing a particular $X$ is then
$$\Pr[X \mid \theta] = \prod_{k=1}^{K} \theta_k^{N_k},$$
where $N_k$ is the number of observations drawn in category $k$. If we place a Dirichlet prior on the $\theta$ parameters, then the posterior satisfies
$$\Pr[\theta \mid X, \alpha] \propto \Pr[X \mid \theta] \Pr[\theta] \propto \prod_{k=1}^{K} \theta_k^{N_k} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} = \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1},$$
which after normalization gives
$$\Pr[\theta \mid X, \alpha] = \frac{1}{B(\alpha_1 + N_1, \alpha_2 + N_2, \ldots, \alpha_K + N_K)} \prod_{k=1}^{K} \theta_k^{N_k + \alpha_k - 1},$$
a Dirichlet with parameters $(\alpha_1 + N_1, \ldots, \alpha_K + N_K)$. In other words, the Dirichlet is conjugate to the categorical distribution.
Given conjugacy, we can derive a closed-form expression for the model evidence $\Pr[X \mid \alpha]$, which must satisfy
$$\Pr[\theta \mid X, \alpha] = \frac{\Pr[\theta \mid \alpha] \Pr[X \mid \theta, \alpha]}{\Pr[X \mid \alpha]}.$$
We have given expressions for all of these probabilities except $\Pr[X \mid \alpha]$ above. Note that $\Pr[X \mid \theta, \alpha] = \Pr[X \mid \theta]$, and simple substitution gives
$$\Pr[X \mid \alpha] = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma\left(N + \sum_k \alpha_k\right)} \prod_k \frac{\Gamma(N_k + \alpha_k)}{\Gamma(\alpha_k)}, \tag{1}$$
where $N = \sum_k N_k$.
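The following sketch (with made-up counts and hyperparameters) illustrates the conjugate update and evaluates the log of the model evidence in (1) with log-gamma functions; it is an illustration, not code from the paper.

```python
# Numerical check of Dirichlet-categorical conjugacy and of the model
# evidence in equation (1). All numbers below are made up for illustration.
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 2.0, 0.5])   # Dirichlet prior parameters
counts = np.array([4, 1, 3])        # N_k: observed category counts
N = counts.sum()

# Posterior is Dirichlet with parameters alpha_k + N_k.
posterior = alpha + counts

# log Pr[X | alpha] from equation (1), written with log-gamma functions.
log_evidence = (
    gammaln(alpha.sum()) - gammaln(N + alpha.sum())
    + np.sum(gammaln(counts + alpha) - gammaln(alpha))
)

def log_B(a):
    # Log of the multivariate beta function B(a).
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

# Equivalent check: log Pr[X | alpha] = log B(alpha + N) - log B(alpha).
assert np.isclose(log_evidence, log_B(posterior) - log_B(alpha))
print(posterior, log_evidence)
```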
3  LDA: Model and Estimation
The data generating process of Latent Dirichlet Allocation is the following:
1. Draw βk independently for k = 1, . . . , K from Dirichlet(η).
2. Draw θd independently for d = 1, . . . , D from Dirichlet(α).
3. Each word wd,n in document d is generated from a two-step process:
(a) Draw topic assignment zd,n from θd .
(b) Draw wd,n from βzd,n .
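As an illustration of this generative process, the following sketch simulates a small corpus with numpy; the values of K, V, the document lengths, and the hyperparameters are arbitrary, not those used in the paper.

```python
# Simulate the LDA data generating process with illustrative values.
import numpy as np

rng = np.random.default_rng(0)
K, V, D = 3, 10, 5                    # topics, vocabulary size, documents
eta, alpha = 0.1, 0.5                 # symmetric hyperparameters
doc_lengths = [8, 12, 6, 10, 9]

beta = rng.dirichlet(np.full(V, eta), size=K)      # step 1: topic-word distributions
theta = rng.dirichlet(np.full(K, alpha), size=D)   # step 2: document-topic distributions

corpus = []
for d in range(D):
    words = []
    for n in range(doc_lengths[d]):
        z = rng.choice(K, p=theta[d])              # step 3(a): draw topic assignment
        w = rng.choice(V, p=beta[z])               # step 3(b): draw word from that topic
        words.append(w)
    corpus.append(words)

print(corpus[0])
```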
The observed data is $W = (w_1, \ldots, w_D)$, while the unobserved data is $Z = (z_1, \ldots, z_D)$. Given the statistical structure of LDA, the joint likelihood factors as follows:
$$\Pr[W, Z \mid \alpha, \eta] = \Pr[W \mid Z, \eta] \Pr[Z \mid \alpha].$$
Expressions for each factor are easy to derive from results in the previous section. Let $m^d_k$ be the number of words from topic $k$ in document $d$. By (1), the probability of observing this in one document is
$$\frac{\Gamma(K\alpha)}{\Gamma(N_d + K\alpha)} \prod_k \frac{\Gamma(m^d_k + \alpha)}{\Gamma(\alpha)},$$
where $N_d = \sum_k m^d_k$ is the total number of words in document $d$. So, the probability of observing the topic assignments across all documents is
$$\Pr[Z \mid \alpha] = \prod_d \frac{\Gamma(K\alpha)}{\Gamma(N_d + K\alpha)} \prod_k \frac{\Gamma(m^d_k + \alpha)}{\Gamma(\alpha)} = \left[\frac{\Gamma(K\alpha)}{\Gamma^K(\alpha)}\right]^D \prod_d \frac{\prod_k \Gamma(m^d_k + \alpha)}{\Gamma(N_d + K\alpha)}.$$
By similar calculations we obtain
$$\Pr[W \mid Z, \eta] = \left[\frac{\Gamma(V\eta)}{\Gamma^V(\eta)}\right]^K \prod_k \frac{\prod_v \Gamma(m^k_v + \eta)}{\Gamma\left(\sum_v m^k_v + V\eta\right)},$$
where $m^k_v$ is the number of times token $v$ is assigned to topic $k$.
In this way, we have written the probability of the data $(W, Z)$ in terms of word and topic assignment counts, and eliminated the $\theta_d$ and $\beta_k$ parameters. This expression facilitates the derivation of a collapsed Gibbs sampler that is substantially more efficient than one constructed from the uncollapsed likelihood.
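As a sketch of how this collapsed expression can be evaluated, the function below computes log Pr[W, Z | α, η] from the two count matrices with log-gamma functions; the counts, hyperparameters, and function name are illustrative, not from the paper's code.

```python
# Evaluate log Pr[W, Z | alpha, eta] from the count matrices, following the
# collapsed expressions above. Counts below are illustrative.
import numpy as np
from scipy.special import gammaln

def log_joint(doc_topic, topic_word, alpha, eta):
    # doc_topic[d, k] = m_k^d, topic_word[k, v] = m_v^k
    D, K = doc_topic.shape
    V = topic_word.shape[1]
    n_d = doc_topic.sum(axis=1)                        # N_d: words per document

    # log Pr[Z | alpha]
    log_pz = (D * (gammaln(K * alpha) - K * gammaln(alpha))
              + np.sum(gammaln(doc_topic + alpha))
              - np.sum(gammaln(n_d + K * alpha)))
    # log Pr[W | Z, eta]
    log_pw = (K * (gammaln(V * eta) - V * gammaln(eta))
              + np.sum(gammaln(topic_word + eta))
              - np.sum(gammaln(topic_word.sum(axis=1) + V * eta)))
    return log_pz + log_pw

doc_topic = np.array([[3, 1], [0, 4]])                 # 2 documents, 2 topics
topic_word = np.array([[2, 1, 0], [1, 2, 2]])          # 2 topics, 3 vocabulary terms
print(log_joint(doc_topic, topic_word, alpha=0.5, eta=0.1))
```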
3.1  Full conditional distribution
To construct a collapsed Gibbs sampler, we need to compute the probability that $z_{d,n} = k$ given the other topic assignments $Z_{-(d,n)}$ and the words $W$. By Bayes' rule, we have
$$\Pr\left[z_{d,n} = k \mid Z_{-(d,n)}, W, \alpha, \eta\right] = \frac{\Pr\left[z_{d,n} = k, Z_{-(d,n)}, W \mid \alpha, \eta\right]}{\Pr\left[Z_{-(d,n)}, W \mid \alpha, \eta\right]} = \frac{\Pr\left[W \mid z_{d,n} = k, Z_{-(d,n)}, \eta\right] \Pr\left[z_{d,n} = k, Z_{-(d,n)} \mid \alpha\right]}{\Pr\left[W \mid Z_{-(d,n)}, \eta\right] \Pr\left[Z_{-(d,n)} \mid \alpha\right]}$$
$$\propto \frac{\Pr\left[W \mid z_{d,n} = k, Z_{-(d,n)}, \eta\right] \Pr\left[z_{d,n} = k, Z_{-(d,n)} \mid \alpha\right]}{\Pr\left[W_{-(d,n)} \mid Z_{-(d,n)}, \eta\right] \Pr\left[Z_{-(d,n)} \mid \alpha\right]}.$$
The final step follows from $\Pr\left[W \mid Z_{-(d,n)}, \eta\right] = \Pr[w_{d,n}] \Pr\left[W_{-(d,n)} \mid Z_{-(d,n)}, \eta\right]$, which holds because $z_{d,n}$ (which generates $w_{d,n}$) is drawn independently of $Z_{-(d,n)}$; the omitted factor $\Pr[w_{d,n}]$ does not depend on $k$.
From this expression, one can directly compute terms using the equations above. We can rewrite $\Pr\left[z_{d,n} = k, Z_{-(d,n)} \mid \alpha\right]$ in the following way, equivalent to the expression above:
$$\left[\prod_{d' \neq d} \frac{\Gamma(K\alpha)}{\Gamma(N_{d'} + K\alpha)} \prod_{k'} \frac{\Gamma(m^{d'}_{k'} + \alpha)}{\Gamma(\alpha)}\right] \left[\frac{\Gamma(K\alpha)}{\Gamma(N_d + K\alpha)} \frac{\Gamma(m^d_k + \alpha)}{\Gamma(\alpha)} \prod_{k' \neq k} \frac{\Gamma(m^d_{k'} + \alpha)}{\Gamma(\alpha)}\right]$$
$$= \left[\prod_{d' \neq d} \frac{\Gamma(K\alpha)}{\Gamma(N_{d'} + K\alpha)} \prod_{k'} \frac{\Gamma(m^{d'}_{k'} + \alpha)}{\Gamma(\alpha)}\right] \left[\frac{\Gamma(K\alpha)}{\Gamma(N_d + K\alpha)} \frac{\Gamma(m^d_{k,-n} + 1 + \alpha)}{\Gamma(\alpha)} \prod_{k' \neq k} \frac{\Gamma(m^d_{k'} + \alpha)}{\Gamma(\alpha)}\right],$$
where $m^d_{k,-n}$ is the number of slots in document $d$ assigned to topic $k$ excluding the $n$th slot. One can similarly write $\Pr\left[Z_{-(d,n)} \mid \alpha\right]$ as
$$\left[\prod_{d' \neq d} \frac{\Gamma(K\alpha)}{\Gamma(N_{d'} + K\alpha)} \prod_{k'} \frac{\Gamma(m^{d'}_{k'} + \alpha)}{\Gamma(\alpha)}\right] \left[\frac{\Gamma(K\alpha)}{\Gamma(N_d - 1 + K\alpha)} \frac{\Gamma(m^d_{k,-n} + \alpha)}{\Gamma(\alpha)} \prod_{k' \neq k} \frac{\Gamma(m^d_{k'} + \alpha)}{\Gamma(\alpha)}\right].$$
Using the fact that $\Gamma(x+1)/\Gamma(x) = x$, we obtain
$$\frac{\Pr\left[z_{d,n} = k, Z_{-(d,n)} \mid \alpha\right]}{\Pr\left[Z_{-(d,n)} \mid \alpha\right]} = \frac{m^d_{k,-n} + \alpha}{N_d - 1 + K\alpha}.$$
Similarly, one can write $\Pr\left[W \mid z_{d,n} = k, Z_{-(d,n)}, \eta\right]$ as
$$\left[\prod_{k' \neq k} \frac{\Gamma(V\eta)}{\Gamma\left(\sum_v m^{k'}_v + V\eta\right)} \prod_v \frac{\Gamma(m^{k'}_v + \eta)}{\Gamma(\eta)}\right] \left[\frac{\Gamma(V\eta)}{\Gamma\left(\sum_v m^k_v + V\eta\right)} \frac{\Gamma(m^k_v + \eta)}{\Gamma(\eta)} \prod_{v' \neq v} \frac{\Gamma(m^k_{v'} + \eta)}{\Gamma(\eta)}\right]$$
$$= \left[\prod_{k' \neq k} \frac{\Gamma(V\eta)}{\Gamma\left(\sum_v m^{k'}_v + V\eta\right)} \prod_v \frac{\Gamma(m^{k'}_v + \eta)}{\Gamma(\eta)}\right] \left[\frac{\Gamma(V\eta)}{\Gamma\left(\sum_v m^k_v + V\eta\right)} \frac{\Gamma(m^k_{v,-(d,n)} + 1 + \eta)}{\Gamma(\eta)} \prod_{v' \neq v} \frac{\Gamma(m^k_{v'} + \eta)}{\Gamma(\eta)}\right],$$
where $v$ is implicitly the token in the $n$th slot of document $d$. Moreover, $\Pr\left[W \mid Z_{-(d,n)}, \eta\right]$ becomes
$$\left[\prod_{k' \neq k} \frac{\Gamma(V\eta)}{\Gamma\left(\sum_v m^{k'}_v + V\eta\right)} \prod_v \frac{\Gamma(m^{k'}_v + \eta)}{\Gamma(\eta)}\right] \left[\frac{\Gamma(V\eta)}{\Gamma\left(\sum_v m^k_{v,-(d,n)} + V\eta\right)} \frac{\Gamma(m^k_{v,-(d,n)} + \eta)}{\Gamma(\eta)} \prod_{v' \neq v} \frac{\Gamma(m^k_{v'} + \eta)}{\Gamma(\eta)}\right].$$
We know that $\sum_v m^k_{v,-(d,n)} + 1 = \sum_v m^k_v$: because the $n$th slot of document $d$ with token $v$ has been assigned topic $k$, if we exclude it from the dataset, then the total number of times token $v$ is assigned to topic $k$ is one less. So
$$\frac{\Pr\left[W \mid z_{d,n} = k, Z_{-(d,n)}, \eta\right]}{\Pr\left[W_{-(d,n)} \mid Z_{-(d,n)}, \eta\right]} = \frac{m^k_{v,-(d,n)} + \eta}{\sum_v m^k_{v,-(d,n)} + V\eta}.$$
We conclude that
$$\Pr\left[z_{d,n} = k \mid Z_{-(d,n)}, W, \alpha, \eta\right] \propto \frac{m^k_{v,-(d,n)} + \eta}{\sum_v m^k_{v,-(d,n)} + V\eta} \left(m^d_{k,-n} + \alpha\right).$$
Note that we can exclude the $N_d - 1 + K\alpha$ term since it does not vary with $k$.
3.2  Gibbs sampling algorithm
From the conditional distribution computed above, we implement the following algorithm, which is informally described in Section 5.3 of the main text.

1. Randomly allocate to each token in the corpus a topic assignment drawn uniformly from $\{1, \ldots, K\}$.

2. For each token, sequentially draw a new topic assignment via multinomial sampling, where
$$\Pr\left[z_{d,n} = k \mid Z_{-(d,n)}, W, \alpha, \eta\right] \propto \frac{m^k_{v,-(d,n)} + \eta}{\sum_v m^k_{v,-(d,n)} + V\eta} \left(m^d_{k,-n} + \alpha\right).$$

3. Repeat step 2 4,000 times as a burn-in phase.

4. Repeat step 2 4,000 more times, and store every 50th sample.
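A compact sketch of steps 1-2 (the initialization and one block of sampling sweeps); the toy corpus, K, and hyperparameters are illustrative placeholders, and the burn-in and thinning schedule of steps 3-4 is omitted.

```python
# Minimal collapsed Gibbs sweeps for LDA (steps 1-2 above); the corpus,
# K, and hyperparameters below are illustrative placeholders.
import numpy as np

def gibbs_sweeps(docs, K, V, alpha, eta, n_sweeps, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initial topic assignments.
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    doc_topic = np.zeros((len(docs), K))     # m_k^d
    topic_word = np.zeros((K, V))            # m_v^k
    topic_total = np.zeros(K)                # sum_v m_v^k
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = z[d][n]
            doc_topic[d, k] += 1
            topic_word[k, v] += 1
            topic_total[k] += 1

    # Step 2: sequentially re-sample each assignment from the full conditional.
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k_old = z[d][n]
                doc_topic[d, k_old] -= 1     # remove current assignment (the "-(d,n)" counts)
                topic_word[k_old, v] -= 1
                topic_total[k_old] -= 1

                p = ((topic_word[:, v] + eta) / (topic_total + V * eta)
                     * (doc_topic[d] + alpha))
                k_new = rng.choice(K, p=p / p.sum())

                z[d][n] = k_new              # add the new assignment back
                doc_topic[d, k_new] += 1
                topic_word[k_new, v] += 1
                topic_total[k_new] += 1
    return z, doc_topic, topic_word

docs = [[0, 1, 2, 1], [2, 3, 3, 0], [1, 1, 4]]   # word indices in a toy vocabulary
z, doc_topic, topic_word = gibbs_sweeps(docs, K=2, V=5, alpha=0.5, eta=0.1, n_sweeps=100)
print(doc_topic)
```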
3.3  Predictive distributions
The collapsed Gibbs sampler gives an estimate of each word’s topic assignment, but
not the parameters θd and βk since these are collapsed out of the likelihood function. In
order to describe a document’s topic distribution and the topics themselves, the following
predictive distributions are used.
$$\hat{\beta}^v_k = \frac{m^k_v + \eta}{\sum_{v=1}^{V} \left(m^k_v + \eta\right)} \tag{2}$$
and
$$\hat{\theta}^d_k = \frac{m^d_k + \alpha}{\sum_{k=1}^{K} \left(m^d_k + \alpha\right)}. \tag{3}$$
The derivation of the predictive distribution for a Dirichlet is standard; see, for example, Murphy (2012), Section 3.4.4.
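A small sketch computing equations (2) and (3) from stored count matrices; the counts here are illustrative (in practice they come from the Gibbs sampler).

```python
# Predictive distributions (2) and (3) from stored count matrices
# (illustrative counts; in practice these come from the Gibbs sampler).
import numpy as np

def predictive_distributions(doc_topic, topic_word, alpha, eta):
    beta_hat = (topic_word + eta) / (topic_word + eta).sum(axis=1, keepdims=True)     # eq. (2)
    theta_hat = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)  # eq. (3)
    return beta_hat, theta_hat

doc_topic = np.array([[3., 1.], [0., 4.]])           # m_k^d
topic_word = np.array([[2., 1., 0.], [1., 2., 2.]])  # m_v^k
beta_hat, theta_hat = predictive_distributions(doc_topic, topic_word, alpha=0.5, eta=0.1)
print(beta_hat.round(3), theta_hat.round(3))
```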
3.4  Convergence
As with all Markov Chain Monte Carlo methods, the realized value of any one chain depends on the random starting values, so determining whether a chain has converged is important. To address these concerns, for each specification of the model we run five different chains starting from five different initial seeds. Along each chain we then compute the value of the model's perplexity at regular intervals. Perplexity is a common measure of fit in the natural language processing literature. The formula is
$$\exp\left(-\frac{\sum_{d=1}^{D} \sum_{v=1}^{V} n_{d,v} \log\left(\sum_{k=1}^{K} \hat{\theta}^d_k \hat{\beta}^v_k\right)}{\sum_{d=1}^{D} N_d}\right),$$
where $n_{d,v}$ is the number of times word $v$ occurs in document $d$.
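The sketch below evaluates this perplexity formula for a toy document-term matrix and illustrative predictive distributions; it is not the paper's code.

```python
# Perplexity of a corpus under the estimated model (formula above);
# the arrays here are illustrative placeholders.
import numpy as np

def perplexity(doc_word_counts, theta_hat, beta_hat):
    # doc_word_counts[d, v] = n_{d,v}; theta_hat is D x K; beta_hat is K x V.
    log_probs = np.log(theta_hat @ beta_hat)        # log of sum_k theta_k^d * beta_k^v
    total_log_lik = np.sum(doc_word_counts * log_probs)
    total_words = doc_word_counts.sum()             # sum_d N_d
    return np.exp(-total_log_lik / total_words)

doc_word_counts = np.array([[2, 1, 0], [0, 1, 3]])
theta_hat = np.array([[0.7, 0.3], [0.2, 0.8]])
beta_hat = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])
print(perplexity(doc_word_counts, theta_hat, beta_hat))
```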
Table 1: Perplexity scores for five chains (50-topic model)

Iteration    Chain 1        Chain 2        Chain 3        Chain 4        Chain 5
4000         911.60384      912.3436       911.71599      912.03563      912.93617
5000         911.37954      912.23545      911.77959      911.54668      912.89446
6000         911.2509       911.73059      911.44483      912.09184      912.18517
7000         911.27972      911.80829      911.14402      911.58896      911.8613
8000         911.34918      912.19474      910.91158      911.74574      911.8404
Mean         911.372636     912.062534     911.399202     911.80177      912.3435
SD           0.139199655    0.274409956    0.370818234    0.251162034    0.539793824
Table 1 reports values of perplexity along five chains drawn for the 50-topic model at various iterations.[5] Several features are worth noting. First, from the 4,000th iteration onwards the perplexity values are quite stable, indicating that the chains have converged. Second, the differences in perplexity across chains are marginal, indicating that the estimates are not especially sensitive to starting values. Third, chain 1 performs marginally better in terms of both its average perplexity and its standard deviation. We therefore choose chain 1 for our analysis.

[5] We only report values at a limited number of iterations here to save space. The same analysis done over finer intervals yields very similar results.
4  Estimating aggregate document distributions
As explained in the text, we are more interested in the topic distributions at the meeting-speaker-section level than at the individual statement level. Denote by $\theta^{i,t,s}$ the topic distribution of the aggregate document. Let $w_{i,t,s,n}$ be the $n$th word in the document, $z_{i,t,s,n}$ its topic assignment, $v_{i,t,s,n}$ its token index, $m^{i,t,s}_k$ the number of words in the document assigned to topic $k$, and $m^{i,t,s}_{k,-n}$ the number of words besides the $n$th word assigned to topic $k$. To re-sample the distribution $\theta^{i,t,s}$, for each iteration $j \in \{4050, 4100, \ldots, 8000\}$ of the Gibbs sampler:
1. Form $m^{i,t,s}_k$ from the topic assignments of all the words that compose the aggregate document $(i,t,s)$ in the Gibbs sampling output.

2. Drop $w_{i,t,s,n}$ from the sample and form the count $m^{i,t,s}_{k,-n}$.

3. Assign a new topic to word $w_{i,t,s,n}$ by sampling from
$$\Pr\left[z_{i,t,s,n} = k \mid z_{-(i,t,s,n)}, w_{i,t,s}\right] \propto \hat{\beta}^{v_{i,t,s,n}}_k \left(m^{i,t,s}_{k,-n} + \alpha\right), \tag{4}$$
where $z_{-(i,t,s,n)}$ is the vector of topic assignments in document $(i,t,s)$ excluding word $n$ and $w_{i,t,s}$ is the vector of words in the document.

4. Proceed sequentially through all words.

5. Repeat 20 times.
We then obtain the aggregate document predictive distribution
$$\hat{\theta}^{i,t,s}_k = \frac{m^{i,t,s}_k + \alpha}{\sum_{k=1}^{K} \left(m^{i,t,s}_k + \alpha\right)}. \tag{5}$$
This is identical to the regular Gibbs sampling procedure except for two differences.
First, the topics are kept fixed at the values of their predictive distributions rather than
estimated. Second, because topics do not need to be estimated, many fewer iterations
are needed to sample topic assignments for each document.
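A sketch of the re-sampling loop for a single aggregate document, holding the topics fixed at their predictive values; the inputs, the function name, and the single-iteration setup are illustrative rather than the paper's implementation, which repeats this for each stored iteration j.

```python
# Re-sample topic assignments for one aggregate document (i,t,s), holding
# the topics fixed at beta_hat, as in steps 1-5 above. Inputs are illustrative.
import numpy as np

def resample_aggregate(words, z_init, beta_hat, alpha, n_sweeps=20, seed=0):
    rng = np.random.default_rng(seed)
    K = beta_hat.shape[0]
    z = np.array(z_init)
    topic_counts = np.bincount(z, minlength=K).astype(float)   # step 1: m_k^{i,t,s}

    for _ in range(n_sweeps):                      # step 5: repeat 20 times
        for n, v in enumerate(words):              # step 4: proceed through all words
            topic_counts[z[n]] -= 1                # step 2: drop word n
            p = beta_hat[:, v] * (topic_counts + alpha)   # eq. (4)
            k_new = rng.choice(K, p=p / p.sum())   # step 3: re-assign its topic
            z[n] = k_new
            topic_counts[k_new] += 1

    # Aggregate predictive distribution, eq. (5).
    return (topic_counts + alpha) / (topic_counts.sum() + K * alpha)

beta_hat = np.array([[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]])   # fixed topics (K=2, V=3)
words = [0, 2, 2, 1, 0]                                   # token indices in the document
z_init = [0, 1, 1, 0, 0]                                  # assignments from the main sampler
print(resample_aggregate(words, z_init, beta_hat, alpha=0.5))
```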
References
Bird, S., E. Klein, and E. Loper (2009): Natural Language Processing with Python.
O’Reilly Media.
Heinrich, G. (2009): “Parameter estimation for text analysis,” Technical report, vsonix
GmbH and University of Leipzig.
Murphy, K. P. (2012): Machine Learning: A Probabilistic Perspective, Adaptive Computation and Machine Learning. MIT Press.