Technical Appendix
This appendix provides an overview of further details related to model choices and fitting
topic models, as well as comparisons with other statistical approaches to text modeling.
Comparison of Topic Models and Alternative Quantitative Text Models
To help clarify the unique attributes of topic models (also known as Latent Dirichlet
allocation [LDA]; Blei et al., 2003), we briefly consider the motivations behind their
development. Given a word-document matrix (WDM), one approach to identifying underlying
dimensions of the data is to use principal components analysis (PCA). PCA uses the covariance
matrix from the WDM to derive a set of ordered and orthogonal dimensions characterizing the
variability in the WDM. While PCA is a general and useful dimensionality reduction technique,
it is not one that has traditionally been used for text analysis. However, the basis of PCA is
related to a greatly influential approach to modeling text, Latent Semantic Analysis (LSA;
Landauer & Dumais, 1997). LSA employs a matrix-factorization technique known as singular
value decomposition (SVD). SVD is a linear-algebra method closely related to eigenvector
decomposition (it generalizes eigendecomposition to rectangular matrices), and can be used to
reduce the dimensionality of data by selecting the subset of dimensions that captures most of the
variance in the observed data. PCA
and SVD share many similarities, but whereas PCA is applied to a covariance matrix of the
WDM, SVD is applied directly to the WDM. In LSA, the SVD of the WDM is used to
construct a “semantic space” (i.e., a set of weights that map the observed data to a reduced set of
underlying dimensions and broadly similar to factor loadings in factor analysis). Individual terms
and documents can then be mapped into this reduced dimensionality semantic space, which in
turn can be used for making comparisons of semantic similarity.
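To make the SVD step concrete, the following sketch (in Python with numpy; the toy word-document matrix and the choice of two dimensions are purely illustrative, not drawn from any corpus discussed here) maps documents into a reduced semantic space and compares two documents by cosine similarity:

```python
import numpy as np

# Toy word-document matrix (V words x D documents); counts are hypothetical.
wdm = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 0, 2],
    [0, 3, 1, 3],
    [1, 1, 2, 0],
], dtype=float)

# Truncated SVD: keep only the k largest singular values/vectors.
k = 2
U, s, Vt = np.linalg.svd(wdm, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents mapped into the k-dimensional "semantic space" (D x k).
doc_coords = (np.diag(s_k) @ Vt_k).T

# Semantic similarity between two documents = cosine in the reduced space.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(doc_coords[0], doc_coords[1]))
```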
However, there are shortcomings to the LSA method (for an in-depth discussion, see
Griffiths, Steyvers, & Tenenbaum, 2007). One specific example is that LSA struggles to capture
what is known as polysemy—in which the same word can have multiple distinct semantic
meanings—such as the word play (i.e., play an instrument, play at the park, or going to a play at
the theater). Methods that map documents and words into a multi-dimensional space, such as
LSA, only have a single point-representation for a given word and are unable to properly capture
polysemy. Perhaps a more relevant issue for present purposes is that the dimensions of the
semantic space are not easily interpreted. Specifically, there is no inherent meaning in the
dimensions of LSA’s semantic space; much like factor analysis, original factor loadings are
typically hard to interpret, and hence the common practice is to apply some type of factor
rotation to enhance interpretability. However, factor rotation does not always yield interpretable
factors (e.g., the “bars” data from Griffiths & Steyvers, 2004).
The shortcomings of LSA were addressed by two models: a) Probabilistic Latent
Semantic Indexing (pLSI; Hofmann, 1999), and b) LDA (Blei et al., 2003), which reformulated
the problem of unsupervised modeling of text in terms of probability theory. pLSI modeled each
document as a mixture of probability distributions over words, and Blei et al. extended pLSI to
make it fully Bayesian by assigning Dirichlet priors on both the document mixture proportions
and the topic-word distributions, yielding LDA. Since the introduction of LDA, a large number
of variants on this model have been introduced, and these are now often more generally referred
to as “topic models”.
The key difference between LSA and topic models is that, in topic models, documents are
represented as an additive mixture of probability distributions over words (i.e., topics), in which
each topic typically captures an interpretable dimension of semantic meaning. Specifically, the
high-probability words within each topic learned by LDA tend to be closely semantically related.
The interpretability of topic models confers a significant advantage over previous
approaches based on classical matrix-factorization techniques for dimensionality reduction, such
as LSA. Given this brief history of text modeling, we now describe LDA in more detail, while
refraining from cumbersome statistical derivations.
Formal Description of LDA
As previously indicated, LDA is an unsupervised machine learning method for finding a
set of topics that can be used to summarize a collection of documents. Given the input of the
WDM, the model estimates 1) a set of topics that captures underlying semantic subjects in the
corpus, and 2) a representation of each document in terms of the set of topics. Specifically, each
topic is modeled as a probability distribution over words, and each document is modeled as a
probability distribution (i.e., a mixture) of these topics, where the topics with high probability for
a given document capture the semantic content that is most prevalent within the document.
For clarity, we summarize here the notational conventions used in the Appendix. Uppercase letters are used to indicate fixed integer values in our model: V = the size of the vocabulary,
i.e., the number of unique word-types in the dataset, after pre-processing such as removal of
stop-words; D = the number of documents in the corpus; T = the number of topics (as set by the
experimenter / data analyst). Bold-formatting is used for complete vectors of parameters,
whereas regular type is used to refer to a parameter value of a specific element within the vector.
For example, z is the vector of latent (unobserved) indicator variables for all observed words w
in the dataset, where zi indicates the topic-assignment for the ith observed word-token 𝑤𝑖 (and
therefore, can take the values corresponding to one of the topics — 𝑧𝑖 ∈ {1, 2, … , 𝑇}). Finally,
upper-case Greek letters refer to full matrices of parameters (e.g., Φ for the probabilities of
words for each topic and Θ for the probabilities of topics for each document). The probability
distributions for a specific topic 𝜙𝑡 or a specific document 𝜃𝑑 will use the lower-case form of the
Greek letter and be indexed by a subscript.
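For readers who prefer code to notation, the following sketch (Python/numpy; all sizes are illustrative placeholders, not values from the couple therapy corpus) fixes these conventions as array shapes, which the later sketches in this appendix reuse:

```python
import numpy as np

# Notational conventions mapped onto array shapes (values are illustrative).
V, D, T = 5000, 268, 100

Phi = np.zeros((V, T))    # Phi[w, t] = P(word w | topic t); columns sum to 1
Theta = np.zeros((D, T))  # Theta[d, t] = P(topic t | document d); rows sum to 1

n_tokens = 250_000        # hypothetical number of observed word tokens
z = np.zeros(n_tokens, dtype=int)  # z[i] = topic assignment of token i
# (0-based here, i.e., z[i] in {0, ..., T-1}, versus {1, ..., T} in the text)
```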
Using this notation, the model can be described as follows: Each topic t (where 𝑡 ∈
{1 … 𝑇}) is represented by a multinomial distribution over the set of V unique word-types in the
corpus. The distribution for topic t is given by parameter 𝜙𝑡, which is a V-length vector of real-valued numbers which sum to one. Thus, the set of T topics is represented by a V by T matrix,
Φ, in which all columns sum to one, and where 𝜙𝑤,𝑡 is the probability that topic t will generate
word-type w. The number of topics that the model estimates, T, is a hyper-parameter that is
specified by the data analyst (methods for choosing the value of T are described below). Each
document d (where 𝑑 ∈ {1 … 𝐷}) is represented by a
multinomial distribution over the set of T topics. The distribution for document d is given by
parameter 𝜃𝑑, which is a T-length vector of real-valued numbers which sum to one. Thus, the set
of D document distributions is represented by a D by T matrix, Θ, in which all rows sum to one, and where 𝜃𝑑,𝑡 is
the probability of topic t given document d. Finally, to make the model fully Bayesian, Dirichlet
priors are placed on 𝜙𝑡, the distribution over words for each topic, as well as 𝜃𝑑, the
distribution over topics for each document. Intuitively, these prior distributions serve as
smoothing parameters, and add a small pseudo-count to the matrices Φ and Θ (which is very
useful, since these matrices typically are very sparse, and would otherwise be comprised mostly
of zeroes).
Because LDA is a probabilistic generative model, it describes a step-wise process by
which the model generates the observed data (i.e., the set of documents and the specific words
observed for each document), and reviewing these steps can provide an intuitive feel for how the
model works. The generative process for LDA is as follows:
1. For each topic t, sample a multinomial distribution 𝝓𝑡 over the V word-types from
a Dirichlet prior distribution with hyper-parameter 𝛽.
2. For each document d, sample a multinomial distribution 𝜽𝑑 over the T topics from
a Dirichlet prior distribution with hyper-parameter 𝛼.
3. For each document d, generate 𝑤𝑑,𝑖, the observed word in the ith slot, as follows:
a. Sample a topic 𝑧𝑖 according to the probability of the topic given the
document (these probabilities are given by the vector, 𝜽𝑑 ). Note that 𝑧𝑖 is
an indicator variable that indicates the topic for word 𝑤𝑑,𝑖 and takes a
value from 1 … 𝑇.
b. Sample word 𝑤𝑑,𝑖 from topic 𝑧𝑖 according to the probability of word w in
that topic (these probabilities are given by the vector 𝝓𝑧𝑖).
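A minimal simulation of this generative process, assuming toy values for V, D, T, the document length, and the hyperparameters, might look as follows (Python/numpy; for convenience the rows of Phi here are the 𝜙𝑡 vectors, i.e., the transpose of the V by T matrix Φ in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, T = 50, 10, 3           # vocabulary size, documents, topics (toy values)
alpha, beta = 50.0 / T, 0.01  # Dirichlet hyperparameters
doc_len = 100                 # words per document (fixed here for simplicity)

# Step 1: for each topic t, sample phi_t ~ Dirichlet(beta).
Phi = rng.dirichlet(np.full(V, beta), size=T)     # T x V; rows sum to 1

# Step 2: for each document d, sample theta_d ~ Dirichlet(alpha).
Theta = rng.dirichlet(np.full(T, alpha), size=D)  # D x T; rows sum to 1

docs = []
for d in range(D):
    words = []
    for i in range(doc_len):
        # Step 3a: sample a topic assignment z_i from theta_d.
        z_i = rng.choice(T, p=Theta[d])
        # Step 3b: sample the observed word w_{d,i} from phi_{z_i}.
        words.append(rng.choice(V, p=Phi[z_i]))
    docs.append(words)
```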
The LDA model is illustrated in Figure A1 using graphical model plate notation (Jordan,
2004). Graphical plate notation provides a figurative representation for understanding
probabilistic generative models because it captures the relationships between all of the model
variables and the observed data. Each variable is represented as a node in the figure, where the
shaded nodes correspond to observed data, and the unshaded nodes correspond to unobserved
model parameters. The boxes, or “plates”, indicate that a process is repeated multiple times. For
example, the figure shows that – for all documents – a document’s probability distribution 𝜽𝑑
over topics is sampled from a Dirichlet distribution with hyperparameter 𝛼. Then, for all words
in a document, one first samples a topic (where the z parameter is an indicator for this topic), and
then the word (w), which is observed, is sampled according to the probability distribution 𝝓z for
the corresponding topic.
Figure A1: Graphical model for LDA, in plate notation. [Figure not reproduced here. The shaded node w denotes the observed words; the unshaded nodes z, 𝜽, and 𝝓 denote the latent variables; 𝛼 and 𝛽 are the Dirichlet hyperparameters; the plates indicate repetition over topics, documents, and words within documents.]
The output of the LDA model consists of (1) the vector z, containing the topic
assignments for each word in the corpus, (2) the matrix Φ of topic-word probabilities, and (3) the
matrix Θ of document-topic probabilities. Prior to considering how the output can be used and
interpreted, we review how the model is estimated, including parameters that must be set.
Model Inference and Parameter Settings
Topic Model Inference
A common inference method for topic models is known as collapsed Gibbs sampling,
which was used in the current analyses and was first described in Griffiths & Steyvers (2004). A
brief summary of this method is provided here (for additional mathematical details and
derivations, see Griffiths & Steyvers, 2004).
Learning the parameters of the model involves inferring the set of T distributions 𝝓𝑡=1…𝑇
of topics over words, and the D distributions 𝜽𝒅=𝟏…𝑫 of documents over topics. In collapsed
Gibbs sampling, all of the topic-word and document-topic distributions are learned by
sequentially updating the vector z of latent assignments of all words in the corpus to individual
topics (the actual distributions 𝜽𝑑 and 𝝓𝑡 do not need to be directly updated, because they have
been “collapsed”, i.e., integrated out of the equations).
The vector of z assignments is initialized randomly (by assigning each word wi to a
random topic by setting zi to a random value between 1 and T). The z-assignments are then
updated in a random order using the following Gibbs update equation:
$$P(z_i = t \mid \mathbf{z}_{-i}, w_i, d, \alpha, \beta) \propto \frac{N^{w,t}_{-i} + \beta}{\sum_{w'=1}^{V} N^{w',t}_{-i} + V\beta} \cdot \frac{N^{d,t}_{-i} + \alpha}{\sum_{t'=1}^{T} N^{d,t'}_{-i} + T\alpha}$$

where $N^{w,t}_{-i}$ is the current count of the number of times word w has been assigned to
topic t, and $N^{d,t}_{-i}$ is the count of the number of times words in document d have been
assigned to topic t. The subscript $-i$ indicates that these counts are computed after the current
word $w_i$ (the word for which we are sampling a new topic assignment $z_i$) has been removed
from them. The
hyperparameters for the Dirichlet prior distributions, 𝛼 and 𝛽, as well as the number of topics, T,
are parameters that must be specified by the researcher.
In words, the above equation states that we assign word w in document d to a topic t in
proportion to the current estimates of (1) the probability of that word given topic t (the first
term on the right-hand side of the equation) and (2) the probability of topic t given document d
(the second term on the right-hand side). Each
iteration of the Gibbs sampler updates each of the zi assignments once.
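A minimal sketch of the collapsed Gibbs sampler, assuming the toy document format used above, is given below. For simplicity it sweeps the tokens in a fixed order rather than the random order described in the text, and it drops the document-length denominator of the second term, which is constant across topics and therefore does not affect the proportionality:

```python
import numpy as np

def gibbs_lda(docs, V, T, alpha, beta, n_iter=500, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word ids in {0, ..., V-1}."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    Nwt = np.zeros((V, T))   # N^{w,t}: word-topic counts
    Ndt = np.zeros((D, T))   # N^{d,t}: document-topic counts
    Nt = np.zeros(T)         # column sums of Nwt (total words per topic)

    # Random initialization of the z assignments, then build the counts.
    z = [rng.integers(T, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            Nwt[w, t] += 1; Ndt[d, t] += 1; Nt[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment (the "-i" counts).
                Nwt[w, t] -= 1; Ndt[d, t] -= 1; Nt[t] -= 1
                # Gibbs update: word-given-topic term times topic-given-document
                # term (the document-length denominator is constant in t).
                p = (Nwt[w] + beta) / (Nt + V * beta) * (Ndt[d] + alpha)
                p /= p.sum()
                t = rng.choice(T, p=p)
                # Record the new assignment and add it back into the counts.
                z[d][i] = t
                Nwt[w, t] += 1; Ndt[d, t] += 1; Nt[t] += 1
    return z, Nwt, Ndt
```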
Once the Gibbs sampler has been allowed to sufficiently “burn-in” (by running the
sampler for a number of iterations until the distributions are stable), one can directly estimate the
parameters of interest, 𝜽𝑑 and 𝝓𝑡 as follows:
$$\hat{\phi}_{w,t} = \frac{N^{w,t} + \beta}{\sum_{w'=1}^{V} N^{w',t} + V\beta} \qquad \hat{\theta}_{d,t} = \frac{N^{d,t} + \alpha}{\sum_{t'=1}^{T} N^{d,t'} + T\alpha}$$
which are the same estimates based on the current counts of z-assignments that are used in the
Gibbs update equation and have the same interpretation. Note that the only difference between
these terms and those in the Gibbs update equation is that the “-i” notation has not been used.
This is because the Gibbs update involves conditioning zi on all of the z assignments except the
one that is currently being updated. Here, however, we are computing our final posterior
estimates of the parameters based on all assignments.
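In code, these point estimates are a few lines over the final count matrices (a sketch assuming the Nwt and Ndt count matrices returned by the hypothetical gibbs_lda function above):

```python
import numpy as np

def point_estimates(Nwt, Ndt, alpha, beta):
    """Posterior point estimates from the final count matrices (no "-i"
    correction: all assignments are included)."""
    V = Nwt.shape[0]
    T = Ndt.shape[1]
    Phi_hat = (Nwt + beta) / (Nwt.sum(axis=0, keepdims=True) + V * beta)
    Theta_hat = (Ndt + alpha) / (Ndt.sum(axis=1, keepdims=True) + T * alpha)
    return Phi_hat, Theta_hat  # Phi_hat columns and Theta_hat rows sum to 1
```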
Setting the Hyperparameters 𝛼 and 𝛽
In addition to the parameters that are estimated by the model (described above), the
hyperparameters for the Dirichlet prior distributions, 𝛼 and 𝛽, as well as the number of topics, T,
must be specified by the data analyst. There is a significant body of work that has looked at how
to optimally choose the number of topics and set hyperparameters, and that has also considered the
influence of these settings on the fitted model (see, e.g., the Topic Modeling Bibliography at
http://www.cs.princeton.edu/~mimno/topics.html).
Choosing hyperparameters is potentially complex because, ultimately, all hyperparameters
interact in their influence on the fitted LDA model. Nonetheless, there are some heuristics for
understanding the hyperparameters' influence on LDA output. For the Dirichlet priors:
• The 𝛼 hyperparameter serves as a smoothing parameter on 𝜃. As 𝛼 becomes smaller,
documents' distributions over topics 𝜽𝑑 will become more peaked (i.e., the words within
each document will tend to be assigned to a smaller subset of the topics).
• The 𝛽 hyperparameter is a smoothing parameter on the topic-word distributions 𝝓𝑡. As 𝛽
changes, this will influence the granularity of the topics (Griffiths & Steyvers, 2004); smaller
values of 𝛽 will lead to more fine-grained topics, and larger values will lead to coarser, more
general topics.
A common practice for setting these values was described in Griffiths & Steyvers (2004): Given
a specific setting for the number of topics, T, setting 𝛼 = 50/𝑇 and 𝛽 = .01 generally provides
good topics (see Wallach, Mimno, & McCallum, 2009 for an alternative approach in which the
hyperparameters are optimized).
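In code, the heuristic amounts to the following (the commented line reuses the hypothetical gibbs_lda sketch from above):

```python
T = 100            # number of topics, chosen by the analyst
alpha = 50.0 / T   # smaller alpha -> more peaked document-topic distributions
beta = 0.01        # smaller beta  -> finer-grained topics
# e.g., z, Nwt, Ndt = gibbs_lda(docs, V, T, alpha, beta)
```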
Determining the Number of Topics (T)
Given the above heuristic for the settings of 𝛼 and 𝛽, the only hyperparameter left to
choose is T, the number of topics. It is useful to bear in mind that in choosing the number of
topics the data analyst is not in search of the “true” number of topics but rather a useful number
of topics for a given purpose. Moreover, different research questions with the same corpus
might very well use different numbers of topics. There are two general strategies for choosing T:
1) interpretability of topics, and 2) cross-validation using either an internal or external criterion.
Commonly used values of T are 50, 100, 200, and 400. One option for choosing
among these settings is simply to run the model with each of these values, visually inspect the
resulting set of topics (by looking at the distribution of topics over words), and choose the set of
topics that is most interpretable. Unlike other methods of dimensionality reduction, the
interpretability of the topics under different settings of T is fairly steady, due in part to the robust
nature of Bayesian methods. However, two things will happen as T increases: at some
point (depending on the nature and diversity of the data), there will be topic redundancy (i.e.,
multiple topics will capture similar dimensions of the data) and topic idiosyncrasy (i.e.,
individual topics are dedicated to very small portions of the overall corpus). For example, in the
couple therapy corpus, there are 134 couples. Thus, at higher numbers of topics (e.g., 200 or
400), there were clearly topics summarizing semantic content for individual couples. In most
applications to these data we would be interested in common linguistic content across couples (e.g.,
how does negative emotion language change during therapy?). Hence, practically, a smaller
number of topics was warranted, and the number of topics in the present application was
chosen by varying the total number of topics and examining the interpretability of the resulting topics.
An alternative approach to determining the number of topics is to use cross-validation. A
small proportion (e.g., 10%) of the data from each document can be withheld from the model
fitting process, and then the accuracy of the model in predicting the withheld portion can be
assessed. Models can then be fit iteratively, varying the number of topics (or hyper-parameters),
and the number of topics can be chosen from the model with the best prediction. When there is
no external criterion, the text data themselves can serve this purpose: it is possible to use an
internal criterion such as perplexity, which is a measure of how well
the model can predict missing words in a document (e.g., see Griffiths & Steyvers, 2004).
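One simplified way to compute perplexity, assuming that each row of Theta_hat was estimated from the retained (e.g., 90%) portion of the corresponding document, is sketched below:

```python
import numpy as np

def perplexity(heldout_docs, Phi_hat, Theta_hat):
    """Perplexity of held-out words. heldout_docs[d] lists the withheld word
    ids for document d; Phi_hat is V x T and Theta_hat is D x T."""
    log_lik, n = 0.0, 0
    for d, words in enumerate(heldout_docs):
        for w in words:
            # p(w | d) marginalizes over topics: sum_t Theta[d,t] * Phi[w,t].
            log_lik += np.log(Theta_hat[d] @ Phi_hat[w])
            n += 1
    return np.exp(-log_lik / n)  # lower perplexity = better prediction
```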
Alternatively, if there is an external criterion (i.e., external to the text), then a similar approach
can be used to optimize the total number of topics in predicting the external criterion. Consider
the example of behavioral codes in the present data where we used cross-validation to estimate
prediction accuracy in a regression with topics. Topic models were fit on a subsample of the
complete data (e.g., a subset of the sessions or a subset of the couples). We then varied the
number of topics, and the topic output was used in a logistic regression model to predict the
behavioral codes in the withheld sessions or couples. The model accuracy for the withheld data
could then be used to determine the number of topics. However, again, we emphasize that this
approach, while certainly objective, can lead to different numbers of topics for different
outcomes (e.g., across different behavioral codes). Hence, in the topic modeling literature thus
far, it is quite common to see applications of topic models in which the number of topics was
chosen based on the interpretability of the topics.
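The external-criterion procedure might be sketched as follows; fit_lda and infer_theta are hypothetical placeholders for the Gibbs-sampling steps sketched earlier, and the train/test splits and behavioral codes are likewise assumed inputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed defined: V, train_docs, test_docs, codes_train, codes_test.
best_T, best_acc = None, -np.inf
for T in (25, 50, 100, 200):
    # fit_lda / infer_theta are hypothetical wrappers around the sampler above.
    Phi_hat, Theta_train = fit_lda(train_docs, V, T, alpha=50.0 / T, beta=0.01)
    Theta_test = infer_theta(test_docs, Phi_hat)  # fold held-out docs in
    clf = LogisticRegression(max_iter=1000).fit(Theta_train, codes_train)
    acc = accuracy_score(codes_test, clf.predict(Theta_test))
    if acc > best_acc:
        best_T, best_acc = T, acc
```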
Defining Documents
One additional model setting, which has not been a typical concern in topic modeling, is
the question of how to represent “documents”. Most corpora used in topic modeling consist of
self-contained news or journal articles, and thus have a natural document representation.
However, since the couple therapy corpus consists of transcripts with couples having dialogues
or 3-person conversations with therapists, over multiple sessions and communication
assessments, there is no single natural representation of documents, and defining documents
required an explicit decision.
For the work presented in this paper, we chose to treat the text from an individual
speaker, within each individual therapy session or communication assessment transcript, as a
unique “document”. The reason for this choice was that it put documents in one-to-one alignment with
the behavioral coding systems, which were assigned to each spouse within each therapy session
or communication assessment. Although other meta-data (e.g., couple outcomes) were at
different levels of granularity, it would be straightforward to simply average over the
“documents” to get a topic-based representation at the level of couples. Note that because the topic
model assigns each individual word to a topic, one can naturally move from a more coarse-grained
document representation to a finer-grained one (e.g., by treating each unique couple as a
document, and then using the word assignments z for individual speakers within individual
sessions to generate document representations at different levels of granularity).
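A sketch of this re-aggregation, assuming a hypothetical token table that pairs each word's topic assignment with couple, session, and speaker identifiers, is given below:

```python
import numpy as np
from collections import defaultdict

T = 3  # toy number of topics
# Hypothetical token table: one entry per word token, carrying its topic
# assignment z_i plus couple, session, and speaker identifiers.
tokens = [
    (1, 1, "wife", 0), (1, 1, "husband", 2), (1, 2, "wife", 0),
    (2, 1, "husband", 1), (2, 1, "wife", 1),
]

def topic_proportions(tokens, T, key):
    """Aggregate word-topic assignments into one topic distribution per unit;
    key picks the unit of analysis (speaker-within-session, couple, etc.)."""
    counts = defaultdict(lambda: np.zeros(T))
    for couple, session, speaker, z_i in tokens:
        counts[key(couple, session, speaker)][z_i] += 1
    return {unit: c / c.sum() for unit, c in counts.items()}

# Fine-grained: each speaker within each session is a "document".
per_speaker = topic_proportions(tokens, T, key=lambda c, s, sp: (c, s, sp))
# Coarse-grained: pool all sessions and both spouses for each couple.
per_couple = topic_proportions(tokens, T, key=lambda c, s, sp: c)
```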
Utilizing the LDA Output
After the data are processed and transformed into a WDM, and the model is fit, the outputs of
LDA (z, Φ, and Θ) provide various lenses through which to understand the underlying semantic representation
of the corpus. Throughout the paper, we have illustrated some of these applications. Here we
will briefly summarize the LDA output and how its various components were used throughout
the paper.
The distribution of topics over words, Φ, provides an interpretable set of dimensions (i.e.,
topics) that can be used to illustrate the latent semantic dimensions of a text corpus. In Figure 1
of the paper, we showed a number of the topics that were discovered by LDA (by showing, in
order, the 15 most likely words under each topic's probability distribution, 𝝓𝑡). These
high probability words most strongly define a given topic, and throughout the rest of the paper,
the high-probability words (rather than simply the topic-numbers) were used to label each topic,
which allows for easy interpretation of various dimensions (e.g., to see what dimensions of data
are predictive of specific behavioral codes).
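Extracting such labels from Φ is straightforward; the sketch below assumes a fitted Phi_hat matrix and a vocab list mapping word ids to strings:

```python
import numpy as np

def top_words(Phi_hat, vocab, n=15):
    """Label each topic with its n most probable word-types.
    Phi_hat is the V x T matrix of word-given-topic probabilities;
    vocab maps word ids (rows of Phi_hat) to word strings."""
    labels = []
    for t in range(Phi_hat.shape[1]):
        top_ids = np.argsort(Phi_hat[:, t])[::-1][:n]
        labels.append([vocab[w] for w in top_ids])
    return labels
```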
The distribution of documents over topics Θ gives an ordering of which topics are most
frequent under each document. In Figure 5, we averaged across the husband and wife documents
within each therapy session to compute a vector 𝜽𝑑 capturing the frequency of each topic within
a particular couple’s speech over time. By looking at how this 𝜽𝑑 distribution evolved over the
course of therapy, we are able to view both the time-course of the therapy intervention and
any effect this has on the couple’s speech over the course of therapy. We also utilize the 𝝓𝑡
distributions to label the topics in Figure 5, so that the evolution of the couple’s speech over
therapy is readily interpretable. Furthermore, if we wish to drill down to look at the original
text of the data, we can use the vector z of topic assignments to augment the transcripts and see
which words in each transcript are assigned to which topics.
In addition, the Θ matrix was used for the prediction of behavioral codes. This document
by topic matrix describes how prevalent each topic is in each document, and hence, it can be
used to create a topic-based matrix of covariates for predicting behavioral codes or other
outcomes associated with couples or sessions. These can then be used, as in Figure 6, to predict
the behavioral codes for new documents (i.e., for data that was not used during training in the
cross-validation procedure). Additionally, as in Figure 7, the regression coefficients on each
topic that are learned by the regression model can then be used to further explore how each topic
relates to the different behavioral codes (by ordering the topics according to the regression-model weights).
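Continuing the hypothetical prediction sketch above, ordering topics by their learned weights is a one-liner:

```python
import numpy as np

# clf is the fitted LogisticRegression from the hypothetical sketch above;
# for a binary code, clf.coef_[0] holds one weight per topic.
order = np.argsort(clf.coef_[0])[::-1]  # topics from most positive weight
# e.g., pair with the top_words() labels: [labels[t] for t in order[:10]]
```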
Appendix References
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine
Learning Research, 3, 993–1022.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National
Academy of Sciences, 101(Suppl. 1), 5228-5235. doi:10.1073/pnas.0307752101
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic
representation. Psychological Review, 114, 211-244.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 50-57).
Jordan, M. I. (2004). Graphical models. Statistical Science, 19, 140-155. doi:10.1214/088342304000000026
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic
analysis theory of acquisition, induction, and representation of knowledge. Psychological
Review, 104, 211-240. doi:10.1037/0033-295X.104.2.211
Wallach, H., Mimno, D., & McCallum, A. (2009). Rethinking LDA: Why priors matter.
In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems.