Online Technical Appendix

Hansen, McMahon, and Prat (2015)

This appendix details various technical aspects of the estimation of Latent Dirichlet Allocation (LDA) for the paper "Transparency and Deliberation within the FOMC: a Computational Linguistics Approach". It discusses vocabulary selection, gives background on the Dirichlet distribution, defines LDA and derives its Gibbs sampling equations,[1] and explains how we form aggregate topic distributions at the section level. All source code is available at https://github.com/sekhansen/text-mining-tutorial, and an example of implementing the analysis is worked through at http://nbviewer.ipython.org/url/www.econ.upf.edu/~shansen/tutorial_notebook.ipynb.

1 Document Pre-processing and Vocabulary Selection

The first step in document processing is to tokenize each FOMC statement. This breaks statements apart based on white space and punctuation.[2] So that contractions like 'aren't' are not broken into 'aren' and ''t', we first replace all contractions with their equivalent words, so that 'aren't' becomes 'are' 'not'. Then, to construct the documents that we input into LDA, we apply the following steps:

1. Remove all tokens not containing solely alphabetic characters. This strips out punctuation and numbers.

2. Remove all tokens of length 1. This strips out algebraic symbols, copyright signs, currency symbols, etc.

3. Convert all tokens to lowercase.

4. Remove stop words, i.e. extremely common words that appear in many documents. Our list is rather conservative, and contains all English pronouns, auxiliary verbs, and articles.[3]

[1] Heinrich (2009) contains additional material.
[2] For tokenization and some other language processing tasks outlined below, we used the Natural Language Toolkit developed for Python and described in Bird, Klein, and Loper (2009).
[3] The list is available from http://snowball.tartarus.org/algorithms/english/stop.txt. The removal of length-1 tokens already eliminates the pronoun 'I' and the article 'a'.
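Cleaning steps 1-4 can be sketched in a few lines of Python. The paper itself uses NLTK (Bird, Klein, and Loper 2009) and the full Snowball stop-word list; the regex tokenizer and the tiny contraction and stop-word tables below are illustrative stand-ins only.

```python
import re

# Hypothetical stand-ins for the full contraction table and the
# Snowball stop-word list used in the paper.
CONTRACTIONS = {"aren't": "are not", "isn't": "is not", "won't": "will not"}
STOP_WORDS = {"a", "an", "the", "are", "is", "not", "to", "will", "be", "of"}

def preprocess(statement):
    """Clean one FOMC statement into a list of tokens (steps 1-4)."""
    statement = statement.lower()  # step 3 (applied first for matching)
    # Replace contractions with their equivalent words before tokenizing,
    # so "aren't" becomes "are not" rather than "aren" + "'t".
    for contraction, expansion in CONTRACTIONS.items():
        statement = statement.replace(contraction, expansion)
    # Tokenize on white space and punctuation; keeping only alphabetic
    # runs also implements step 1 (no numbers or punctuation survive).
    tokens = re.findall(r"[a-z]+", statement)
    # Step 2: drop length-1 tokens; step 4: drop stop words.
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]

print(preprocess("Inflation isn't expected to exceed 2% this year."))
```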
The other length-1 English word is the exclamation 'O'. We converted 'US' to 'United States' so that it was not stripped out with the pronoun 'us'.

5. Stem the remaining tokens to bring them to a common linguistic root. We use the default Porter stemmer implemented in Python's Natural Language Toolkit. For example, 'preferences', 'preference', and 'prefers' all become 'prefer'. The output of the Porter stemmer need not be an English word; for example, the stem of 'inflation' is 'inflat'. Hence, the words are now most appropriately called "tokens".

As explained in the text, we then made a custom list of two- and three-word sequences to replace with a single token.

Finally, we rank the 13,888 remaining tokens in terms of their contribution to discriminating among documents. The ranking punishes words both when they are infrequent and when they appear in many documents.[4] Plotting the ranking indicates a natural cutoff, which we use to select V = 8,615 words for the topic model. Words below this cutoff are removed from the dataset. After this processing, 2,715,586 total tokens remain. Some statements are empty, so we remove them from the dataset, leaving D = 46,169 total documents for input into the Gibbs sampler.

2 Properties of the Dirichlet distribution

A Dirichlet random variable is characterized in terms of a parameter vector α = (α_1, ..., α_K) and has a probability density function given by

  Dir(θ | α) = (1 / B(α)) ∏_{k=1}^K θ_k^{α_k − 1}

where

  B(α) ≡ [∏_{k=1}^K Γ(α_k)] / Γ(α_0)  and  α_0 ≡ ∑_{k=1}^K α_k

is a generalization of the beta function. Now suppose we have count data of the form X = (x_1, ..., x_n), where each x_i is a single draw from a categorical distribution with parameter θ = (θ_1, ..., θ_K); that is,

[4] More specifically, we computed each word's term-frequency, inverse-document-frequency—or tf-idf—score.
The formula for token v is

  tf-idf_v = log(1 + N_v) × log(D / D_v)

where N_v is the number of times v appears in the dataset and D_v is the number of documents in which v appears.

  Pr[x_i = k | θ] = θ_k.

The probability of observing a particular X is then

  Pr[X | θ] = ∏_{k=1}^K θ_k^{N_k}

where N_k is the number of observations drawn in category k. If we place a Dirichlet prior on the θ parameters, then the posterior satisfies

  Pr[θ | X, α] ∝ Pr[X | θ] Pr[θ] ∝ ∏_{k=1}^K θ_k^{N_k} ∏_{k=1}^K θ_k^{α_k − 1} = ∏_{k=1}^K θ_k^{N_k + α_k − 1},

which after normalization gives

  Pr[θ | X, α] = [1 / B(α_1 + N_1, α_2 + N_2, ..., α_K + N_K)] ∏_{k=1}^K θ_k^{N_k + α_k − 1},

a Dirichlet with parameters (α_1 + N_1, ..., α_K + N_K). In other words, the Dirichlet is conjugate to the categorical distribution.

Given conjugacy, we can derive a closed-form expression for the model evidence Pr[X | α], which must satisfy

  Pr[θ | X, α] = Pr[θ | α] Pr[X | θ, α] / Pr[X | α].

We have given expressions for all these probabilities except Pr[X | α] above—note that Pr[X | θ, α] = Pr[X | θ]—and simple substitution gives

  Pr[X | α] = [Γ(∑_k α_k) / Γ(N + ∑_k α_k)] ∏_k [Γ(N_k + α_k) / Γ(α_k)]    (1)

where N = ∑_k N_k.

3 LDA: Model and Estimation

The data generating process of Latent Dirichlet Allocation is the following:

1. Draw β_k independently for k = 1, ..., K from Dirichlet(η).

2. Draw θ_d independently for d = 1, ..., D from Dirichlet(α).

3. Each word w_{d,n} in document d is generated from a two-step process:

   (a) Draw topic assignment z_{d,n} from θ_d.

   (b) Draw w_{d,n} from β_{z_{d,n}}.

The observed data is W = (w_1, ..., w_D), while the unobserved data is Z = (z_1, ..., z_D). Given the statistical structure of LDA, the joint likelihood factors as follows: Pr[W, Z | α, η] = Pr[W | Z, η] Pr[Z | α]. Expressions for each factor are easy to derive from results in the previous section. Let m_dk be the number of words from topic k in document d.
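The generative process in steps 1-3 is straightforward to simulate, which is a useful check on one's understanding of the model. The sketch below uses numpy; the dimensions K, V, D, the document length, and the hyperparameter values are illustrative choices, not those estimated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N_d = 3, 20, 5, 50    # topics, vocabulary, documents, words per doc
eta, alpha = 0.1, 0.5          # symmetric Dirichlet hyperparameters

beta = rng.dirichlet(np.full(V, eta), size=K)     # step 1: topics beta_k (K x V)
theta = rng.dirichlet(np.full(K, alpha), size=D)  # step 2: shares theta_d (D x K)

W, Z = [], []
for d in range(D):
    z = rng.choice(K, size=N_d, p=theta[d])       # step 3(a): z_{d,n} ~ theta_d
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # step 3(b): w_{d,n} ~ beta_{z_{d,n}}
    Z.append(z)
    W.append(w)

# m_dk, the number of words from topic k in document d, for the first document:
print(np.bincount(Z[0], minlength=K))
```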
By (1), the probability of observing the topic assignments in one document is

  Γ(Kα)/Γ(N_d + Kα) × ∏_k [Γ(m_dk + α) / Γ(α)],

where N_d = ∑_k m_dk is the total number of words in document d. So the probability of observing the topic assignments across all documents is

  Pr[Z | α] = ∏_d { Γ(Kα)/Γ(N_d + Kα) × ∏_k [Γ(m_dk + α) / Γ(α)] } = [Γ(Kα)/Γ(α)^K]^D ∏_d { [∏_k Γ(m_dk + α)] / Γ(N_d + Kα) }.

By similar calculations we obtain

  Pr[W | Z, η] = [Γ(Vη)/Γ(η)^V]^K ∏_k { [∏_v Γ(m_kv + η)] / Γ(∑_v m_kv + Vη) },

where m_kv is the number of times token v is assigned to topic k. In this way, we have written the probability of the data (W, Z) in terms of word and topic assignment counts, and eliminated the θ_d and β_k parameters. This expression facilitates the derivation of a collapsed Gibbs sampler that is substantially more efficient than one constructed from the uncollapsed likelihood.

3.1 Full conditional distribution

To construct a collapsed Gibbs sampler, we need to compute the probability that z_{d,n} = k given the other topic assignments Z_{−(d,n)} and words W. By Bayes' rule, we have

  Pr[z_{d,n} = k | Z_{−(d,n)}, W, α, η]
    = Pr[z_{d,n} = k, Z_{−(d,n)}, W | α, η] / Pr[Z_{−(d,n)}, W | α, η]
    = { Pr[W | z_{d,n} = k, Z_{−(d,n)}, η] Pr[z_{d,n} = k, Z_{−(d,n)} | α] } / { Pr[W | Z_{−(d,n)}, η] Pr[Z_{−(d,n)} | α] }
    ∝ { Pr[W | z_{d,n} = k, Z_{−(d,n)}, η] Pr[z_{d,n} = k, Z_{−(d,n)} | α] } / { Pr[W_{−(d,n)} | Z_{−(d,n)}, η] Pr[Z_{−(d,n)} | α] }.

The final step follows from Pr[W | Z_{−(d,n)}, η] = Pr[w_{d,n}] Pr[W_{−(d,n)} | Z_{−(d,n)}, η], given that z_{d,n}—which generates w_{d,n}—is drawn independently of Z_{−(d,n)}; the factor Pr[w_{d,n}] does not depend on k.

From this expression, one can directly compute the terms from the equations above. We can write Pr[z_{d,n} = k, Z_{−(d,n)} | α] as

  { ∏_{d′≠d} Γ(Kα)/Γ(N_{d′} + Kα) × ∏_{k′} [Γ(m_{d′k′} + α) / Γ(α)] } × { Γ(Kα)/Γ(N_d + Kα) × ∏_{k′≠k} [Γ(m_{dk′} + α) / Γ(α)] × Γ(m_{dk,−n} + 1 + α)/Γ(α) },

where m_{dk,−n} is the number of slots in document d assigned to topic k excluding the nth slot.
One can similarly write Pr[Z_{−(d,n)} | α] as

  { ∏_{d′≠d} Γ(Kα)/Γ(N_{d′} + Kα) × ∏_{k′} [Γ(m_{d′k′} + α) / Γ(α)] } × { Γ(Kα)/Γ(N_d − 1 + Kα) × ∏_{k′≠k} [Γ(m_{dk′} + α) / Γ(α)] × Γ(m_{dk,−n} + α)/Γ(α) }.

Using the fact that Γ(x + 1)/Γ(x) = x, we obtain

  Pr[z_{d,n} = k, Z_{−(d,n)} | α] / Pr[Z_{−(d,n)} | α] = (m_{dk,−n} + α) / (N_d − 1 + Kα).

Similarly, one can write Pr[W | z_{d,n} = k, Z_{−(d,n)}, η] as

  { ∏_{k′≠k} Γ(Vη)/Γ(∑_v m_{k′v} + Vη) × ∏_v [Γ(m_{k′v} + η) / Γ(η)] } × { Γ(Vη)/Γ(∑_{v′} m_{kv′} + Vη) × ∏_{v′≠v} [Γ(m_{kv′} + η) / Γ(η)] × Γ(m_{kv,−(d,n)} + 1 + η)/Γ(η) },

where v is implicitly the token in the nth slot of document d. Moreover, Pr[W_{−(d,n)} | Z_{−(d,n)}, η] becomes

  { ∏_{k′≠k} Γ(Vη)/Γ(∑_v m_{k′v} + Vη) × ∏_v [Γ(m_{k′v} + η) / Γ(η)] } × { Γ(Vη)/Γ(∑_{v′} m_{kv′,−(d,n)} + Vη) × ∏_{v′≠v} [Γ(m_{kv′} + η) / Γ(η)] × Γ(m_{kv,−(d,n)} + η)/Γ(η) }.

We know that ∑_{v′} m_{kv′,−(d,n)} + 1 = ∑_{v′} m_{kv′}: because the nth slot of document d with token v has been assigned topic k, excluding it from the dataset reduces by one the total number of times token v is assigned to topic k. So

  Pr[W | z_{d,n} = k, Z_{−(d,n)}, η] / Pr[W_{−(d,n)} | Z_{−(d,n)}, η] = (m_{kv,−(d,n)} + η) / (∑_{v′} m_{kv′,−(d,n)} + Vη).

We conclude that

  Pr[z_{d,n} = k | Z_{−(d,n)}, W, α, η] ∝ (m_{dk,−n} + α) (m_{kv,−(d,n)} + η) / (∑_{v′} m_{kv′,−(d,n)} + Vη).

Note that we can exclude the N_d − 1 + Kα term since it does not vary with k.

3.2 Gibbs sampling algorithm

From the conditional distribution computed above, we implement the following algorithm, informally described in section 5.3 of the main text:

1. Randomly allocate to each token in the corpus a topic assignment drawn uniformly from {1, ..., K}.

2. For each token, sequentially draw a new topic assignment via multinomial sampling, where

  Pr[z_{d,n} = k | Z_{−(d,n)}, W, α, η] ∝ (m_{dk,−n} + α) (m_{kv,−(d,n)} + η) / (∑_{v′} m_{kv′,−(d,n)} + Vη).

3. Repeat step 2 4,000 times as a burn-in phase.

4. Repeat step 2 4,000 more times, and store every 50th sample.
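The algorithm above can be sketched compactly in Python. This is a minimal illustration, not the paper's implementation (for which see the linked repository): the corpus is a toy one, and the burn-in and thinning schedule is collapsed into a short loop.

```python
import numpy as np

def collapsed_gibbs(docs, K, V, alpha, eta, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA (steps 1-2 above, shortened)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))   # m_dk: words in document d assigned to topic k
    nkv = np.zeros((K, V))   # m_kv: times token v assigned to topic k
    nk = np.zeros(K)         # sum_v m_kv for each topic
    # Step 1: random initial topic assignments.
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for n, v in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    for _ in range(n_iter):
        # Step 2: resample each assignment from the full conditional,
        # excluding the current slot from the counts first.
        for d, doc in enumerate(docs):
            for n, v in enumerate(doc):
                k = z[d][n]
                ndk[d, k] -= 1; nkv[k, v] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkv[:, v] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkv[k, v] += 1; nk[k] += 1
    return ndk, nkv

# Toy corpus of token ids in {0, ..., 5}.
docs = [[0, 1, 0, 2, 1], [3, 4, 5, 4, 3], [0, 2, 1, 0], [5, 3, 4, 5]]
ndk, nkv = collapsed_gibbs(docs, K=2, V=6, alpha=0.5, eta=0.1)
print(ndk.sum(), nkv.sum())  # both equal the total token count
```

The count matrices returned here are exactly the m_dk and m_kv that enter the predictive distributions of the next subsection.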
3.3 Predictive distributions

The collapsed Gibbs sampler gives an estimate of each word's topic assignment, but not the parameters θ_d and β_k, since these are collapsed out of the likelihood function. In order to describe a document's topic distribution and the topics themselves, the following predictive distributions are used:

  β̂_kv = (m_kv + η) / ∑_{v′=1}^V (m_kv′ + η)    (2)

and

  θ̂_dk = (m_dk + α) / ∑_{k′=1}^K (m_dk′ + α).    (3)

The derivation of the predictive distribution for a Dirichlet is standard; see, for example, Murphy (2012), section 3.4.4.

3.4 Convergence

As with all Markov chain Monte Carlo methods, the realized value of any one chain depends on the random starting values, and determining whether a chain has converged is important. To address these concerns, for each specification of the model we run five different chains starting from five different initial seeds. Along each chain we then compute the value of the model's perplexity at regular intervals. Perplexity is a common measure of fit in the natural language processing literature. The formula is

  exp( − [∑_{d=1}^D ∑_{v=1}^V n_{d,v} log(∑_{k=1}^K θ̂_dk β̂_kv)] / ∑_{d=1}^D N_d ),

where n_{d,v} is the number of times word v occurs in document d.

Table 1: Perplexity scores for five chains (50-topic model)

           Iter. 4000   Iter. 5000   Iter. 6000   Iter. 7000   Iter. 8000   Mean         SD
  Chain 1  911.60384    911.37954    911.2509     911.27972    911.34918    911.372636   0.139199655
  Chain 2  912.3436     912.23545    911.73059    911.80829    912.19474    912.062534   0.274409956
  Chain 3  911.71599    911.77959    911.44483    911.14402    910.91158    911.399202   0.370818234
  Chain 4  912.03563    911.54668    912.09184    911.58896    911.74574    911.80177    0.251162034
  Chain 5  912.93617    912.89446    912.18517    911.8613     911.8404     912.3435     0.539793824

Table 1 reports values of perplexity along the five chains drawn for the 50-topic model at various iterations.[5] Several features are worth noting. First, from the 4,000th iteration onwards the perplexity values are quite stable, indicating that the chains have converged.
Second, the differences in perplexity across chains are marginal, indicating that the estimates are not especially sensitive to starting values. Third, chain 1 performs marginally better in terms of both its average perplexity and its standard deviation. We therefore choose chain 1 for our analysis.

[5] We only report values at a limited number of iterations here for space. The same analysis done over finer intervals yields very similar results.

4 Estimating aggregate document distributions

As explained in the text, we are more interested in the topic distributions at the meeting-speaker-section level than at the individual statement level. Denote by θ^{i,t,s} the topic distribution of the aggregate document. Let w_{i,t,s,n} be the nth word in the document, z_{i,t,s,n} its topic assignment, v_{i,t,s,n} its token index, m_k^{i,t,s} the number of words in the document assigned to topic k, and m_{k,−n}^{i,t,s} the number of words besides the nth word assigned to topic k. To re-sample the distribution θ^{i,t,s}, for each iteration j ∈ {4050, 4100, ..., 8000} of the Gibbs sampler:

1. Form m_k^{i,t,s} from the topic assignments of all the words that compose the aggregate document (i, t, s) in the Gibbs sampling.

2. Drop w_{i,t,s,n} from the sample and form the count m_{k,−n}^{i,t,s}.

3. Assign a new topic to word w_{i,t,s,n} by sampling from

  Pr[z_{i,t,s,n} = k | z_{−(i,t,s,n)}, w_{i,t,s}] ∝ β̂_k^{v_{i,t,s,n}} (m_{k,−n}^{i,t,s} + α),    (4)

where z_{−(i,t,s,n)} is the vector of topic assignments in document (i, t, s) excluding word n and w_{i,t,s} is the vector of words in the document.

4. Proceed sequentially through all words.

5. Repeat 20 times.

We then obtain the aggregate document predictive distribution

  θ̂_k^{i,t,s} = (m_k^{i,t,s} + α) / ∑_{k′=1}^K (m_{k′}^{i,t,s} + α).    (5)

This is identical to the regular Gibbs sampling procedure except for two differences. First, the topics are kept fixed at the values of their predictive distributions rather than estimated.
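As an illustration of this procedure, steps 1-5 can be sketched as follows; the toy topic matrix β̂, the short document, and all dimensions are hypothetical inputs chosen only for illustration.

```python
import numpy as np

def resample_theta(doc, beta_hat, alpha, n_pass=20, seed=0):
    """Re-sample topic assignments for one aggregate document with the
    topics beta_hat held fixed, then form the predictive theta (eq. 5)."""
    rng = np.random.default_rng(seed)
    K = beta_hat.shape[0]
    z = rng.integers(K, size=len(doc))           # initial assignments
    mk = np.bincount(z, minlength=K).astype(float)  # step 1: counts m_k
    for _ in range(n_pass):                      # step 5: repeat 20 times
        for n, v in enumerate(doc):              # step 4: sweep all words
            mk[z[n]] -= 1                        # step 2: drop word n
            p = beta_hat[:, v] * (mk + alpha)    # step 3: eq. (4)
            z[n] = rng.choice(K, p=p / p.sum())
            mk[z[n]] += 1
    return (mk + alpha) / (mk + alpha).sum()     # eq. (5)

# Hypothetical fixed topics over V = 4 tokens and a short aggregate document.
beta_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.1, 0.1, 0.4, 0.4]])
theta_hat = resample_theta([0, 0, 2, 3, 0], beta_hat, alpha=0.5)
print(theta_hat)
```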
Second, because topics do not need to be estimated, many fewer iterations are needed to sample topic assignments for each document.

References

Bird, S., E. Klein, and E. Loper (2009): Natural Language Processing with Python. O'Reilly Media.

Heinrich, G. (2009): "Parameter Estimation for Text Analysis," Technical report, vsonix GmbH and University of Leipzig.

Murphy, K. P. (2012): Machine Learning: A Probabilistic Perspective, Adaptive Computation and Machine Learning. MIT Press.