ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback
Yuheng Hu†  Ajita John‡  Fei Wang§  Subbarao Kambhampati†
† Department of Computer Science, Arizona State University, Tempe, AZ 85281
‡ Collaborative Applications Research, Avaya Labs, Basking Ridge, NJ 07920
§ IBM T. J. Watson Research Lab, Hawthorne, NY 10532
† {yuhenghu, rao}@asu.edu  ‡ ajita@avaya.com  § feiwang03@gmail.com
Abstract
During broadcast events such as the Superbowl, the U.S.
Presidential and Primary debates, etc., Twitter has become
the de facto platform for crowds to share perspectives and
commentaries about them. Given an event and an associated
large-scale collection of tweets, there are two fundamental research problems that have been receiving increasing attention
in recent years. One is to extract the topics covered by the
event and the tweets; the other is to segment the event. So
far these problems have been viewed separately and studied
in isolation. In this work, we argue that these problems are
in fact inter-dependent and should be addressed together. We
develop a joint Bayesian model that performs topic modeling
and event segmentation in one unified framework. We evaluate the proposed model both quantitatively and qualitatively
on two large-scale tweet datasets associated with two events
from different domains to show that it improves significantly
over baseline models.
1 Introduction
During public broadcast events such as the Superbowl, the
U.S. Presidential and Primary debates, the last episode of
a TV drama series, etc., Twitter has become the de facto
platform for crowds to share perspectives and commentaries
about these events. Given an event and an associated large-scale collection of tweets, we face two fundamental problems in analyzing and understanding them: extracting the topics covered in the event and tweets, and segmenting the event into topically coherent segments. Tackling the
two problems is critical to applications like computational
advertising, community detection, journalistic investigation,
storytelling, playback of events, etc. While both topical
modeling and event segmentation have received considerable attention in recent years, they have been mainly viewed
as separate problems and studied in isolation. For example,
there have been significant efforts on developing Bayesian
models to discover the patterns that reflect the underlying topics from documents (Blei, Ng, and Jordan 2003;
Griffiths et al. 2004; Wang and McCallum 2006; Titov and
McDonald 2008). Similarly, there is also a rich body of
work devoted to segmentation of events/discourses/meetings
via heuristics, machine learning, etc. (Hearst 1993; Boykin and Merlino 2000; Galley et al. 2003; Dielmann and Renals 2004).

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Directly applying these current solutions to analyze the event and its associated tweets, however, has a major drawback: they treat the event and tweets independently, thus ignoring the topical influences of the event on its associated
tweets. In reality they are obviously inter-dependent. For example, in practice, when tweets are generated by the crowds
to express their interests in the event, their content is essentially influenced by the topics covered in the event in some
way. Based on such dependencies, i.e., topical influences,
a person can respond to the event in a variety of ways. For
example, she may choose to comment directly on a specific
topic in the event which is of concern and/or interest to her.
So, her tweets would be deeply influenced by that specific
topic. In another situation, she could also comment broadly
about the event. Therefore, the tweets would be less influenced by the specific topics but more by the general topics
of the event.
In this paper, we are interested in jointly modeling the
topics of the event and its associated tweets, as well as segmenting the event in one unified model. Our work is motivated by the observation that the topical influences from
the event on its associated tweets indicate not only the topics mentioned in the event but also
the content/topics in tweets and the tweeting behaviors of
the crowd. Moreover, by accounting for such influences on
tweets, we can obtain a richer context about the evolution of
topics and the topical boundaries in the event which is critical to the event segmentation, as mentioned in (Shamma,
Kennedy, and Churchill 2009).
We build our joint model based on Latent Dirichlet Allocation (LDA), a Bayesian model proven to be effective
for topic modeling. In our model, an event may consist
of many paragraphs, each of which discusses a particular
set of topics. These topics evolve over the timeline of the
event. We assume that whether the topic mixture of a paragraph changes from the one in its preceding paragraph follows a binomial distribution parameterized by the similarity
between their topic distributions. With some probability, the
two paragraphs are merged to form a segment; otherwise, a
new segment is created. Additionally, we assume the event
(in fact the segments) can impose topical influences on the
associated tweets. Under such influences, the words in the
tweets can belong to two distinct types of topics: general
topics, which are high-level and constant across the entire
event, and specific topics, which are detailed and relate to
specific segments of the event. We define a tweet in which
most words belong to general topics as a “general tweet”,
indicating a weak topical influence from the event, whereas
a tweet with more words about the specific topics is defined
as a “specific tweet”, indicating a strong topical influence
from one segment of the event. Similar to the event segmentation, whether the event has strong or weak influence
on tweets depends on a binomial distribution. To learn our
model, we derive inference and estimate parameters using
Gibbs sampling. In the update equations, we can observe
how the tweets help regularize the topic modeling process
via topical influences and vice versa. To test our model, we
apply it to two large-scale tweet datasets associated with two
events from different domains (a) President Obama’s Middle
East speech on May 19, 2011 and (b) the Republican Primary debate on September 7, 2011. We examine the results
both quantitatively and qualitatively to demonstrate that our
model improves significantly over baseline models.
2 Related Work
Topic modeling methods, such as Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003) have achieved great success in discovering underlying topics from text documents.
Recently, there has been increasing interest in developing more sophisticated topic modeling schemes. One line
of such research is to extend topic models on networked
documents, e.g., research publications, blogs etc. PHITS
(Hofmann 2001) models the documents and their interconnectivity based on topic-specific distributions. Further
extensions include (Dietz, Bickel, and Scheffer 2007), Link-PLSA-LDA (Nallapati et al. 2008) and RTM (Chang and Blei 2009). In addition, some works consider the dynamics of topics, such as the dynamic topic model (Blei and Lafferty 2006). Also, recent efforts apply topic modeling to
social media such as (Ramage, Dumais, and Liebling 2010;
Hu and Liu 2012).
In parallel, there is a rich body of work on automatic topic
segmentation of events/texts/meetings. Many approaches
have been developed. For example, (Hearst 1993) uses a
measure of lexical cohesion between adjoining paragraphs
for segmenting texts. LCSeg (Galley et al. 2003) uses a
similar approach on both text and meeting transcripts and
gains better performance than that achieved by applying
text/monologue-based techniques. In addition to lexical approaches, machine learning methods have also been considered. (Beeferman, Berger, and Lafferty 1999) combines
a variety of features such as statistical language modeling,
cue phrases, discourse information to segment broadcast
news. Recent advances have used generative models such
as (Purver et al. 2006).
The focus of most of the above work is either to model
topics in documents (where documents are assumed to be
homogeneous, e.g., research papers) or to segment the events
alone. However, they do not provide insights into how to
characterize one source of text (tweets) in response to another (the event). A distinct difference in our work is that the event and the associated tweets are heterogeneous: the topics
in a tweet may be sampled from different types of topic mixtures (general or specific). Additionally, the topic mixtures
in an event evolve over its timeline. While the current paper
focuses on the technical development and evaluation of the
ET-LDA framework, our companion papers (Hu, John, and
Seligmann 2011; Hu et al. 2012) elaborate on the motivations for joint analysis and alignment of events and tweets.
3 Joint Modeling of Event and Twitter Feeds
In this section, we show how to represent the event and its
Twitter feeds by a hierarchical Bayesian model based on Latent Dirichlet Allocation (LDA), so that the topic modeling
of the event/tweets and event segmentation can be achieved.
Table 1 lists the notation used in this paper.
Table 1: Mathematical Notation

  S        a set of paragraphs in the event's transcript
  Ns       the number of words in paragraph s
  T        a set of tweets associated with the event
  Mt       the number of words in tweet t
  θ(s)     topic mixture of the specific topics from a paragraph s of the event
  ψ(t)     topic mixture of the general topics from the tweets corpus
  δ(s)     parameter for choosing to draw topics in paragraph s from θ(s) or θ(s−1)
  c(s)     indicates whether the topic of a paragraph is drawn from the current or previous segment's topics
  λ(t)     parameter for choosing to draw topics in t from θ or ψ
  c(t)     indicates whether the topic of a tweet is drawn from specific or general topics
  s(t)     a referred segment, to which a specific topic in a tweet is associated
  ws, wt   words in the event's transcript and tweets, respectively
  zs, zt   topic assignments of words in the event and tweets, respectively
  α, β     Dirichlet/beta parameters of the multinomial/Bernoulli distributions

3.1 Model
Our proposed model, called joint Event and Tweets LDA (ET-LDA), aims to model (1) the event's topics and their
evolution (event segmentation), as well as (2) the associated
tweets' topics and the crowd's tweeting behaviors. Therefore, the model has two major components, each capturing one perspective of our target. The conceptual model of ET-LDA is shown in Fig. 1a and its graphical model representation in Fig. 1b. Both parts have an LDA-like structure and are connected by a link that captures the topical influences from the event on its Twitter feeds.
More specifically, in the event part, we assume that an
event is formed by discrete sequentially-ordered segments,
each of which discusses a particular set of topics. A segment consists of one or many coherent paragraphs available from the transcript of the event.1

Figure 1: The graphical model representation of ET-LDA. (a) Conceptual model: general topics remain constant through the transcript, while each segment has specific topics; a specific tweet draws words mostly from specific topics and thus refers to particular segments; a general tweet draws words mostly from general topics; a referred segment could be a segment in the past, the current segment, or a segment in the future. (b) Plate model of ET-LDA.

Each paragraph s is associated with a particular distribution of topics θ(s). To model
the topic evolutions in the event, we apply the Markov assumption on θ(s) : with some probability, θ(s) is the same as
the distribution of topics of previous paragraph s − 1, measured by the delta function δ(θ(s−1) , θ(s) ); otherwise, a new
distribution of topics θ(s) is sampled for s, chosen from a
Dirichlet(αθ ). This pattern of dependency is produced by
associating a binary variable c(s) with each paragraph, indicating whether its topic is the same as that of the previous
paragraph or different. If the topic remains the same, these
paragraphs are merged to form one segment. This variable
is associated with a binomial distribution δ (s) parameterized
by a symmetric beta prior αδ .
In the tweets part, we assume that a tweet consists of
words which can belong to two distinct types of topics: general topics, which are high-level and constant across the entire event, and specific topics, which are detailed and relate
to the segments of the event. As a result, the distribution
of general topics is fixed for a tweet. However, the distribution of specific topics keeps varying with respect to the
development of the event. We define a tweet in which most
words belong to general topics as a general tweet, indicating
a weak topical influence from the event. In contrast, a tweet
with more words about the specific topics is defined as a
specific tweet, indicating a strong topical influence from one
segment of the event. In other words, a specific tweet refers
to a segment of the event. Similar to the event segmentation, each word in a tweet is associated with a distribution of
topics. It can be either sampled from a mixture of specific
topics θ(s) or a mixture of general topics ψ (t) over K topics
depending on a binary variable c(t) sampled from a binomial
distribution λ(t) . In the first case, θ(s) is from a referring
segment s of the event, where s is chosen according to a categorical distribution s(t) . Unlike δ (s) , λ(t) is controlled by
an asymmetric beta prior parameterized by the preference parameters αλγ (for specific topics) and αλψ (for general topics). An important property of the categorical distribution
s(t) is that it allows choosing any segment in the event. This reflects the fact that a person may compose a tweet on topics discussed in a segment that (1) was in the past, (2) is currently occurring, or (3) will occur after the tweet is posted (usually when she expects certain topics to be discussed in the event).

1 For many publicly televised events, transcripts are readily published by news services such as the NY Times. Paragraph outlines in the transcripts are usually determined through human interpretation and may not necessarily correspond to topic changes in the event.
To summarize, we have the following generative process:
Procedure: Generative process of ET-LDA

foreach paragraph s ∈ S do
    draw a segment choice indicator c(s) ∼ Bernoulli(δ(s))
    if c(s) = 1 then
        draw a topic mixture θ(s) ∼ Dirichlet(αθ)
    else
        draw the topic mixture θ(s) ∼ δ(θ(s−1), θ(s))  (i.e., reuse θ(s−1))
    foreach word wsi ∈ s do
        draw a topic zsi ∼ Multinomial(θ(s))
        draw a word wsi ∼ φzsi

foreach tweet t ∈ T do
    foreach word wti ∈ t do
        draw a topic switching indicator c(t) ∼ Bernoulli(λ(t))
        if c(t) = 1 then
            draw a topic mixture ψ(t) ∼ Dirichlet(αψ)
            draw a general topic zti ∼ Multinomial(ψ(t))
        else
            draw a referred paragraph s ∼ Categorical(γ(t))
            draw a specific topic zti ∼ Multinomial(θ(s))
        draw a word from its associated topic: wti ∼ φzti
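For concreteness, the generative process above can be simulated as in the following toy sketch (all sizes and hyperparameters are arbitrary choices of ours, and for brevity ψ(t) is drawn once per tweet rather than per word):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, S_par, T_twe = 4, 50, 6, 10          # toy sizes: topics, vocab, paragraphs, tweets
a_theta = a_psi = a_gamma = 0.1
a_delta, a_lam = 0.1, (0.5, 0.5)           # beta priors for delta(s) and lambda(t)
phi = rng.dirichlet([0.01] * V, size=K)    # topic-word distributions phi_k

# Event side: each paragraph either reuses the previous topic mixture
# (merging into the same segment) or starts a new segment with a fresh mixture.
thetas, seg_id = [], []
for s in range(S_par):
    delta_s = rng.beta(a_delta, a_delta)
    if s == 0 or rng.random() < delta_s:           # c(s) ~ Bernoulli(delta(s))
        thetas.append(rng.dirichlet([a_theta] * K))
        seg_id.append((seg_id[-1] + 1) if seg_id else 0)
    else:
        thetas.append(thetas[-1])                  # theta(s) = theta(s-1)
        seg_id.append(seg_id[-1])
    for z in rng.choice(K, size=20, p=thetas[s]):  # z_si ~ Mult(theta(s))
        w = rng.choice(V, p=phi[z])                # w_si ~ phi_{z_si}

# Tweet side: each word is either general (drawn from psi(t)) or specific
# (drawn from a referred segment's theta), per c(t) ~ Bernoulli(lambda(t)).
for t in range(T_twe):
    lam_t = rng.beta(*a_lam)
    psi_t = rng.dirichlet([a_psi] * K)
    gamma_t = rng.dirichlet([a_gamma] * S_par)     # categorical over paragraphs
    for _ in range(8):
        if rng.random() < lam_t:                   # general word
            z = rng.choice(K, p=psi_t)
        else:                                      # specific word referring to a segment
            s_ref = rng.choice(S_par, p=gamma_t)
            z = rng.choice(K, p=thetas[s_ref])
        w = rng.choice(V, p=phi[z])
```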
With the model hyperparameters α and β, the joint distribution of the observed and hidden variables ws, wt, zs, zt, cs, ct, and st can be written as below:

$$\begin{aligned}
&P(\mathbf{w}_s, \mathbf{w}_t, \mathbf{z}_s, \mathbf{z}_t, \mathbf{c}_s, \mathbf{c}_t, \mathbf{s}_t \mid \alpha_\delta, \alpha_\theta, \alpha_\gamma, \alpha_\lambda, \alpha_\psi, \beta) = \\
&\int \!\cdots\! \int P(\mathbf{w}_s \mid \mathbf{z}_s, \phi)\, P(\mathbf{w}_t \mid \mathbf{z}_t, \phi)\, P(\phi \mid \beta)\, P(\mathbf{s}_t \mid \gamma^{(t)})\, P(\gamma^{(t)} \mid \alpha_\gamma) \\
&\quad \times\, P(\mathbf{z}_s \mid \theta^{(s)})\, P(\mathbf{z}_t \mid \theta^{(s)}, \mathbf{s}_t, c_t = 0)\, P(\theta^{(s)} \mid \alpha_\theta, \mathbf{c}_s)\, P(\mathbf{c}_s \mid \delta^{(s)})\, P(\delta^{(s)} \mid \alpha_\delta) \\
&\quad \times\, P(\mathbf{z}_t \mid \psi^{(t)}, c_t = 1)\, P(\psi^{(t)} \mid \alpha_\psi)\, P(\mathbf{c}_t \mid \lambda^{(t)})\, P(\lambda^{(t)} \mid \alpha_{\lambda\gamma}, \alpha_{\lambda\psi}) \\
&\quad d\gamma^{(t)}\, d\theta^{(s)}\, d\delta^{(s)}\, d\lambda^{(t)}\, d\psi^{(t)}\, d\phi
\end{aligned} \quad (1)$$

3.2 Inference in the Model via Gibbs Sampling
The computation of the posterior distribution of the hidden
variables zs , zt , cs , ct and st is intractable for the ET-LDA
model because of the coupling between α, β. Therefore,
in this paper, we utilize an approximate method, the collapsed Gibbs sampling algorithm (Griffiths et al. 2004), for parameter estimation. Note that Gibbs sampling allows the learning
of a model by iteratively updating each latent variable given
the remaining variables.
To begin with, we need to compute the conditional probability $P(\mathbf{z}_t, \mathbf{z}_s, \mathbf{c}_s, \mathbf{c}_t, \mathbf{s}_t \mid \mathbf{z}'_t, \mathbf{z}'_s, \mathbf{w}_s, \mathbf{w}_t, \mathbf{c}'_s, \mathbf{c}'_t, \mathbf{s}'_t)$, where $\mathbf{z}'_t, \mathbf{z}'_s, \mathbf{c}'_s, \mathbf{c}'_t, \mathbf{s}'_t$ are the vectors of assignments of topics, segment indicators, topic switching indicators, and segment choice indicators for all words in the collection except for the one at position i in a tweet or in the event's transcript. According to Bayes' rule, we can compute this conditional probability in terms of the joint probability distribution of the latent and observed variables shown in Eq. 1. Next, to make the sampling procedure clearer, we factorize this joint probability as:
$$P(\mathbf{w}_s, \mathbf{w}_t, \mathbf{z}_s, \mathbf{z}_t, \mathbf{c}_s, \mathbf{c}_t, \mathbf{s}_t) = P(\mathbf{w}_s, \mathbf{w}_t \mid \mathbf{z}_s, \mathbf{z}_t)\, P(\mathbf{z}_s, \mathbf{z}_t \mid \mathbf{c}_s, \mathbf{c}_t, \mathbf{s}_t)\, P(\mathbf{c}_s)\, P(\mathbf{c}_t)\, P(\mathbf{s}_t) \quad (2)$$

By integrating out the parameter φ we can obtain the first term in Eq. 2:

$$P(\mathbf{w}_s, \mathbf{w}_t \mid \mathbf{z}_s, \mathbf{z}_t) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^W} \right)^{\!K} \prod_{k=1}^{K} \frac{\prod_{w=1}^{W} \Gamma(n^k_{sw} + n^k_{tw} + \beta)}{\Gamma(n^k_{s(\cdot)} + n^k_{t(\cdot)} + W\beta)} \quad (3)$$

where W is the size of the vocabulary, $n^k_{sw}$ and $n^k_{tw}$ are the numbers of times topic k is assigned to word w in the event and in the tweets, and $n^k_{s(\cdot)}$ and $n^k_{t(\cdot)}$ are the total numbers of words in the event and tweets assigned to topic k. Γ(·) is the gamma function.

To evaluate the second term in Eq. 2, we need to consider the value of a tweet's topic switching indicator $c_t$ because it determines whether a word's topic $z_t$ is general, i.e., sampled from $\psi^{(t)}$ (when $c_t = 1$), or specific, i.e., sampled from $\theta^{(s)}$ (when $c_t = 0$). Based on the model structure in Fig. 1b, we factor the second term as $P(\mathbf{z}_s, \mathbf{z}_t \mid \mathbf{c}_s, \mathbf{c}_t, \mathbf{s}_t) = P(\mathbf{z}_s)\, P(\mathbf{z}_t \mid \mathbf{c}_s, c_t = 0, \mathbf{s}_t)\, P(\mathbf{z}_t \mid c_t = 1, \mathbf{s}_t)$ and compute each of these factors individually. When $c_t = 0$, by integrating out $\theta^{(s)}$ and canceling the factors that do not depend on this value, we obtain:

$$P(\mathbf{z}_s)\, P(\mathbf{z}_t \mid \mathbf{c}_s, c_t = 0, \mathbf{s}_t) = \left( \frac{\Gamma(K\alpha_\theta)}{\Gamma(\alpha_\theta)^K} \right)^{\!S} \prod_{i=1}^{S} \frac{\prod_{k=1}^{K} \Gamma(n^{S_i}_k + nt^{S_i}_k + \alpha_\theta)}{\Gamma(n^{S_i}_{(\cdot)} + nt^{S_i}_{(\cdot)} + K\alpha_\theta)} \quad (4)$$

where S is the set of segments of the event, $n^{S_i}_k$ is the number of times topic k appears in segment $S_i$, and $nt^{S_i}_k$ is the number of times topic k appears in tweets that refer to the content of segment $S_i$. Similarly, when $c_t = 1$, by integrating out $\psi^{(t)}$ and canceling the factors that do not depend on this value, we have:

$$P(\mathbf{z}_t \mid c_t = 1, \mathbf{s}_t) = \left( \frac{\Gamma(K\alpha_\psi)}{\Gamma(\alpha_\psi)^K} \right)^{\!T} \prod_{i=1}^{T} \frac{\prod_{k=1}^{K} \Gamma(n^i_k + \alpha_\psi)}{\Gamma(n^i_{(\cdot)} + K\alpha_\psi)} \quad (5)$$

in which T is the total number of tweets and $n^i_k$ is the number of times topic k is assigned to the words of tweet i that are under the general topics.

Next, we evaluate the third term in Eq. 2. By integrating out $\delta^{(s)}$ we compute:

$$P(\mathbf{c}_s) = \frac{\Gamma(2\alpha_\delta)}{\Gamma(\alpha_\delta)^2} \cdot \frac{\Gamma(S_{s0} + \alpha_\delta)\, \Gamma(S_{s1} + \alpha_\delta)}{\Gamma(S + 2\alpha_\delta)} \quad (6)$$

where S is the total number of paragraphs in the event and $S_{s1}$ is the number of segments (the number of times the topic of a paragraph differs from that of its preceding paragraph, i.e., $c_s = 1$); $S_{s0}$ counts the remaining paragraphs ($c_s = 0$).

Similarly, for the fourth term in Eq. 2, we integrate out $\lambda^{(t)}$ and get:

$$P(\mathbf{c}_t) = \prod_{t \in T} \frac{\Gamma(\alpha_{\lambda\gamma} + \alpha_{\lambda\psi})}{\Gamma(\alpha_{\lambda\gamma})\, \Gamma(\alpha_{\lambda\psi})} \cdot \frac{\Gamma(M_{t0} + \alpha_{\lambda\gamma})\, \Gamma(M_{t1} + \alpha_{\lambda\psi})}{\Gamma(M_t + \alpha_{\lambda\gamma} + \alpha_{\lambda\psi})} \quad (7)$$

where $M_t$ is the total number of words in tweet t, $M_{t0}$ is the number of words in t under the specific topics, and $M_{t1}$ is the number of words in t under the general topics.

Last, we need to derive the fifth term. Again, by integrating out $\gamma^{(t)}$ we have:

$$P(\mathbf{s}_t) = \left( \frac{\Gamma(S\alpha_\gamma)}{\Gamma(\alpha_\gamma)^S} \right)^{\!T} \prod_{i=1}^{T} \frac{\prod_{s=1}^{S} \Gamma(n^i_s + \alpha_\gamma)}{\Gamma(n^i_{(\cdot)} + S\alpha_\gamma)} \quad (8)$$

where $n^i_s$ is the number of times paragraph s (in fact its associated segment) is referred to by tweet i.

Now, the conditional probability can be obtained by multiplying and canceling terms in Eqs. 3–8. We show the core case ($c_s = 0$) here, while the other case ($c_s = 1$) is omitted due to the space limit. When $c_t = 0$:

$$\begin{aligned}
&P(z_{ti}, z_{si}, \mathbf{w}_s, \mathbf{w}_t, c_s = 0, c_t = 0, \mathbf{s}_t \mid \mathbf{z}'_t, \mathbf{z}'_s, \mathbf{w}_s, \mathbf{w}_t, \mathbf{c}'_s, \mathbf{c}'_t, \mathbf{s}'_t) = \\
&\frac{n^k_{sw} + n^k_{tw} + \beta - 1}{n^k_{s(\cdot)} + n^k_{t(\cdot)} + W\beta - 1} \times \frac{n^{S_i}_k + nt^{S_i}_k + \alpha_\theta - 1}{n^{S_i}_{(\cdot)} + nt^{S_i}_{(\cdot)} + K\alpha_\theta - 1} \times \frac{n^i_s + \alpha_\gamma - 1}{n^i_{(\cdot)} + S\alpha_\gamma - 1} \\
&\quad \times \frac{M_{t0} + \alpha_{\lambda\gamma} - 1}{M_t + \alpha_{\lambda\gamma} + \alpha_{\lambda\psi} - 1} \times \frac{S_{s0} + \alpha_\delta - 1}{S + 2\alpha_\delta - 1}
\end{aligned} \quad (9)$$

and when $c_t = 1$ we have the conditional probability:

$$\begin{aligned}
&P(z_{ti}, z_{si}, \mathbf{w}_s, \mathbf{w}_t, c_s = 0, c_t = 1, \mathbf{s}_t \mid \mathbf{z}'_t, \mathbf{z}'_s, \mathbf{w}_s, \mathbf{w}_t, \mathbf{c}'_s, \mathbf{c}'_t, \mathbf{s}'_t) = \\
&\frac{n^k_{sw} + n^k_{tw} + \beta - 1}{n^k_{s(\cdot)} + n^k_{t(\cdot)} + W\beta - 1} \times \frac{n^i_k + \alpha_\psi - 1}{n^i_{(\cdot)} + K\alpha_\psi - 1} \times \frac{n^i_s + \alpha_\gamma - 1}{n^i_{(\cdot)} + S\alpha_\gamma - 1} \\
&\quad \times \frac{M_{t1} + \alpha_{\lambda\psi} - 1}{M_t + \alpha_{\lambda\gamma} + \alpha_{\lambda\psi} - 1} \times \frac{S_{s0} + \alpha_\delta - 1}{S + 2\alpha_\delta - 1}
\end{aligned} \quad (10)$$
In both of these expressions, counts are computed without taking into account the assignment of the word under consideration ($w_{si}$ or $w_{ti}$). After algebraic manipulation of Eqs. 9 and 10, we can easily derive Gibbs update equations for the variables $\mathbf{z}_t, \mathbf{z}_s, \mathbf{c}_s, \mathbf{c}_t$, and $\mathbf{s}_t$, which are omitted here.2 Sampling with these equations is fast, and in practice convergence can be achieved in time similar to that needed by LDA implementations.
2 For each variable, in order to derive its update equation, one can first pick the factors that depend on it in Eq. 2 and then select the corresponding factors in Eq. 9 or Eq. 10 under the condition that $c_s = 0$. For $c_s = 1$, the derivation is the same.
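To illustrate what one such update looks like, the sketch below jointly resamples the switching indicator, topic, and referred segment for a single tweet word from the count ratios in Eqs. 9 and 10. The array names and bookkeeping are our own simplification, not the paper's implementation, and the word being resampled is assumed to be already excluded from all counts:

```python
import numpy as np

def gibbs_step_tweet_word(w, t, counts, h, rng):
    """One collapsed-Gibbs step for word w of tweet t: sample either a
    general topic (c_t = 1) or a (referred segment, specific topic) pair
    (c_t = 0) with probability proportional to Eq. 9/10-style count ratios."""
    K, V = counts["nw"].shape          # nw[k, v]: topic-word counts (event + tweets)
    S = counts["nseg"].shape[0]        # nseg[s, k]: topic counts of segment s (+ referring tweets)

    # word likelihood term, shared by both cases
    word = (counts["nw"][:, w] + h["beta"]) / (counts["nw"].sum(1) + V * h["beta"])

    # lambda(t) factors: tweet t's preference for general vs. specific words
    lam_den = counts["Mt0"][t] + counts["Mt1"][t] + h["a_lgamma"] + h["a_lpsi"]
    gen_pref = (counts["Mt1"][t] + h["a_lpsi"]) / lam_den
    spec_pref = (counts["Mt0"][t] + h["a_lgamma"]) / lam_den

    # c_t = 1: general topic from the tweet's mixture psi(t)
    psi = (counts["npsi"][t] + h["a_psi"]) / (counts["npsi"][t].sum() + K * h["a_psi"])
    p_gen = gen_pref * psi * word                                      # shape (K,)

    # c_t = 0: pick a segment via gamma(t), then a topic from theta(s)
    seg = (counts["nref"][t] + h["a_gamma"]) / (counts["nref"][t].sum() + S * h["a_gamma"])
    theta = (counts["nseg"] + h["a_theta"]) / (
        counts["nseg"].sum(1, keepdims=True) + K * h["a_theta"])
    p_spec = spec_pref * seg[:, None] * theta * word                   # shape (S, K)

    p = np.concatenate([p_gen, p_spec.ravel()])
    idx = rng.choice(p.size, p=p / p.sum())
    if idx < K:
        return 1, None, int(idx)       # general word: (c_t, segment, topic)
    s, k = divmod(idx - K, K)
    return 0, int(s), int(k)           # specific word referring to segment s
```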
Table 2: Top words in topics from MESpeech for ET-LDA and LDA

ET-LDA (top specific topic of each sampled segment of the event)
  S1 "Foreign policy": Hillary Clinton State Dept Sec Leaders Rights Transition MiddleEast Democracy
  S2 "Terrorism": Bin Laden Mass Murderer Dignity Power Blood Dictator Message peaceful
  S4 "Aid Egypt": Aid Trade Reform Greatest Resource Billion Debt Egypt Support Help Good News

ET-LDA (top general topics from the tweets collection)
  "Arab spring": Arabia Bahrain Mosques Stepped Mespeech Syrian Leader Government Religion
  "Obama": Obama Economics Failed President Job Tough Critique Jews Policies Weakness
  "Israel & Palestine": Israel Palestine Borders State Negotiations Lines Hamas Permanent Occupation

LDA on event (3 out of a total of 20 topics)
  "MiddleEast/Arab": Young People Deny month Country Region Democracy Women violence Cairo Iraq
  "Security/Terrorism": Many America Home Transition State Peace Security Conflict Hate Blood Al Qaeda
  "Israel Palestine issues": Country Palestinian security Israel Know Between Leader resolve Issue Boarder

LDA on tweets (3 out of a total of 20 topics)
  "Arab Spring": Obama Town Assad Month Syria Libya Countries Leave Dictators must Jews
  "Security/Peace": Iran Bin Laden Dead Oil Region War Murder Iraq Risk Nuclear Peace Army
  "Obama": Wonderful Obama Job Approval GOP Middle East Mespeech Talking Economics

4 Experiments
In this section, we examine the effectiveness of our proposed
joint model against other baselines. Three main tasks are undertaken to evaluate the ET-LDA: (1) the topics extracted
from the whole corpus (tweets and transcripts of events)
are compared with those separately extracted from the LDA
model, (2) the capability of predicting topical influences of
the events on unseen tweets in the test set is compared with
LDA, and (3) the quality of event segmentation is compared
with LCSeg – a popular HMM-based segmenting tool in the
literature (Galley et al. 2003).
Data Sets and Experimental Setup: We use two large-scale tweet datasets associated with two events from different domains: (1) President Obama's Middle East speech on
May 19, 2011 and (2) the Republican Primary debate on
Sept 7, 2011. The first tweet dataset consists of 25,921
tweets tagged with “#MESpeech” and the second dataset
consists of 121,256 tweets tagged with “#ReaganDebate”.
Both tweet datasets were crawled via the Twitter API using these two hashtags (which were officially posted by
the White House and NBC News, respectively, before the
event). In the rest of this paper, we use the hashtags to refer
to these events. Furthermore, we split both tweet datasets
into 80-20 training and test sets.
We obtained the transcripts of both events from the New
York Times,3 where MESpeech has 73 paragraphs and ReaganDebate has 230 paragraphs. We applied preprocessing to both tweets and transcripts by removing non-English
tweets, retweets, punctuation and stopwords and stemming
all terms. Further, it is known that topic modeling methods
(including LDA) behave badly when applied to short documents (Hu et al. 2009). To remedy this, we follow the
scheme in (Sahami and Heilman 2006) to augment a tweet’s
context. First, we treat a tweet as a query and send it to a
search engine. After generating a set of top-n query snippets
d1 , ..., dn , we compute the TF-IDF term vector vi for each
di . Finally, we pick the top-m terms from vi and concatenate
them to t to form an expanded tweet. In the experiments, we used the Google custom search engine for retrieving snippets and set n = 5 and m = 10. Such augmentation was applied to both tweet datasets.

3 http://www.nytimes.com/2011/05/20/world/middleeast/20prexy-text.html and http://www.nytimes.com/2011/09/08/us/politics/08republican-debate-text.html
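This expansion scheme can be sketched as below, assuming a hypothetical search(query, n) wrapper around the search engine call; the exact TF-IDF weighting and smoothing choices here are our own simplification:

```python
import math
from collections import Counter

def expand_tweet(tweet, search, n=5, m=10):
    """Expand a short tweet with context terms: retrieve the top-n snippets
    for the tweet, compute TF-IDF term scores over them, and append the
    top-m terms. `search(query, n)` is a hypothetical stand-in for the
    search engine call."""
    snippets = [s.lower().split() for s in search(tweet, n)]
    # document frequency of each term across the snippets
    df = Counter(term for snip in snippets for term in set(snip))
    scores = Counter()
    for snip in snippets:
        tf = Counter(snip)
        for term, f in tf.items():
            # keep each term's best tf-idf score across snippets
            scores[term] = max(scores[term],
                               f / len(snip) * math.log(len(snippets) / df[term]))
    top = [term for term, _ in scores.most_common(m)]
    return tweet + " " + " ".join(top)
```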
In the experiments, we used the Gibbs sampling algorithm
for training ET-LDA on the tweets dataset with the transcript
of associated event. The sampler was run for 1000 iterations
for both datasets. Coarse parameter tuning for the prior distributions was performed. We varied the number of topics
K in ET-LDA and chose the one which maximizes the log-likelihood P(ws, wt | K), a standard approach in Bayesian
statistics (Griffiths and Steyvers 2004). As a result, we set
K = 20. In addition, we set model hyperparameters αδ =
0.1, αθ = 0.1, αγ = 0.1, αλγ = αλψ = 0.5, αψ = 0.1, and
β = 0.01.
Topic Modeling: We first study the performance of ET-LDA on topic modeling for the two events and their tweets
against the baseline LDA (which was trained on the event
transcripts and tweet datasets separately with K = 20). Tables 2 and 3 present the top words, i.e., the highest-probability words from topics, for (i) the top specific topics discovered from a sample of 3 (out of 7 for MESpeech, or out
of 14 for ReaganDebate) segments of the events, and (ii)
for a sample of the general topics from the tweets collection
using the ET-LDA model. The results of segmentation are
shown next. For comparison, both tables also list the top
words for the topics discovered from (iii) the event and (iv)
tweets individually using LDA. Note that all of the topics
have been manually labeled for identification (e.g. “Arab
Spring”, “Immigration”) to reflect our interpretation of their
meaning from their top words.
It is clear that all specific and general topics from the
ET-LDA model are very reasonable from a reading of the
transcripts. Furthermore, we observe that the specific topics are sensitive to the event’s context and keep evolving
as the event progresses. On the other hand, general topics
and their top words capture the overall themes of the event
well. But unlike specific topics, these are repeatedly used
across the entire event by the crowd in their tweets, in particular when expressing their views on the themes of the event
(e.g., “Arab spring”, “Immigration”) or even on some popular issues that are neither related nor explicitly discussed in
the event (e.g., “Obama” in MESpeech, “Conservative” in
ReaganDebate).
The results of LDA seem less reasonable by comparison.
Although LDA may extract general topics like "Israel Palestine issues" just as ET-LDA does (since these topics remain constant throughout the document), LDA cannot extract specific topics for the event. In fact, "Israel Palestine issues" shows the advantage of ET-LDA: it is the top general topic for the entire tweet collection (and is highly relevant to and influenced by the event), whereas LDA fails to identify it (its top topic is "Obama", which is less relevant). The data showed that people tweeted about this issue a lot. Besides, some top words for LDA topics are not closely related to the event. This lack of correspondence is more pronounced for LDA when it is applied to the tweet datasets, e.g., "GOP Job Approval" in the topic "Obama" of the tweets corpus by LDA. This is mainly because ET-LDA successfully characterizes the topical influences from the event on its Twitter feeds so that the content/topics of tweets are regularized, whereas LDA ignores these influences and thus gives less reasonable results.

Table 3: Top words in topics from ReaganDebate for ET-LDA and LDA

ET-LDA (top specific topic from each sampled segment)
  S2 "Job market": Job Payroll Economy Crisis Market Commitment Income Tax Wages Pledges Cured
  S3 "Health care": Obamacare Wrong Health Care Regulations Bush People Law Everyone Stage Treats
  S10 "Social sec.": Perry Social Security Benefits Ponzi Scheme Republican Annual Circumstances

ET-LDA (top general topics from the tweets collection)
  "Conservative": Ron Paul President Real Blessed GOP Tea Party Purpose Government Conservative Support
  "Obama": President Job Obamacare Health Care Critique Policies Democrats Low Rate U.S country
  "Immigration": Legislative Legal Immigration Law Solution Fence Economics Committed Debate Taxpayer

LDA on event (out of 20 topics)
  "Social Sec.": Constitution Law Government Wrong Federal Question Funding Monstrous Financially
  "Regulations": Fed Funding Expenditures Devastating Economy Policies Hurt Admin. Democratic President
  "Health care": Plan Romney Cheaper Free Debate Mandate Individual Obamacare Question Better

LDA on tweets (out of 20 topics)
  "Social Sec.": Perry Ridiculous Crazy Ponzi Exaggerated Provocative Tcot Blasts Investment Reformed
  "Obama": Warfare Job Creation Obamacare Approval Poll Troop Withdraw Working Country Iraq Vote
  "Economics": Monster Stupid Bloody Congress Obama Jobs Apple Lost One Sided Blasting Hopeless

Figure 2: Segmentation of the Event. (a) Segmentation of MESpeech: segment boundaries along the event timeline (12:14PM–1:10PM). (b) Segmentation of ReaganDebate: segment boundaries along the event timeline (8:00PM–10:00PM).

Prediction Performance: Next, we study the prediction performance of ET-LDA. Specifically, we are interested in predicting the topical influences from the event on the unseen tweets in our test set (20% of the total tweets). We first run the Gibbs sampling algorithm described in the previous section on the training set of each event/tweet dataset. Then we extend the sampler state with samples from the test set. For comparison, we adopt LDA as our baseline approach. However, since LDA treats the event and tweets individually, we measure the topical influences by the distance between the topic mixtures of the unseen tweets and those of the segments of the event (as determined by ET-LDA in advance). This distance is measured by the Jensen-Shannon divergence.

Figure 3: Predictive performance of ET-LDA compared with the LDA model on 5 randomly sampled segments. (a) MESpeech; (b) ReaganDebate. (y-axis: normalized Likert scale; x-axis: segments 1–5.)
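The Jensen-Shannon divergence between two topic mixtures, used above to score the LDA baseline, can be computed as follows (a straightforward sketch; the function and variable names are ours):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2) between two topic mixtures,
    i.e., discrete distributions over the same K topics."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # midpoint distribution

    def kl(a, b):  # KL(a || b), skipping zero-probability terms
        return float(sum(x * np.log2(x / y) for x, y in zip(a, b) if x > 0))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms the divergence is bounded in [0, 1], which makes it convenient as a normalized distance between a tweet's topic mixture and a segment's.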
To evaluate the “goodness” of prediction results by our
proposed model, we asked 30 graduate students from the
engineering school of our university (who were selected as
they follow the news closely and tweet at least three times
per week) to manually label the quality and strength of the
predicted topical influences from events on the unseen tweet
datasets on a 1-to-5 Likert scale. We then normalized and averaged these ratings. In
Fig. 3a and 3b, we present the results of the two methods on
5 randomly sampled segments.
In light of the observed differences in Fig. 3a and 3b, we
study the statistical significance of ET-LDA with respect to
LDA. We perform paired t-tests with a significance level of α = 0.05 (p = 0.0161 and p = 0.0029 for
MESpeech and ReaganDebate, respectively). This reveals
that the improvement in prediction performance of ET-LDA
is statistically significant.
Event Segmentation: Finally, we study the quality and effectiveness of ET-LDA's segmentation of the two events based on their transcripts. The results of the event segmentation (obtained using K = 20 in ET-LDA) are shown in Fig. 2a and 2b. To evaluate our model, we compare its results against those of LCSeg, a popular HMM-based tool (trained with a 15-state HMM), using the Pk measure (Beeferman, Berger, and Lafferty 1999). This measure is the probability that a randomly chosen pair of words from the event is incorrectly separated by a hypothesized segment boundary; a lower Pk therefore indicates better agreement with the human-annotated segmentation, i.e., better performance. In practice, we first asked four graduate students in our department to annotate the segments of the events based on their transcripts (two annotators per event) and then asked another graduate student to judge, for each event, which human annotation is better. We treat the better annotation of each event as the reference segmentation and compute the Pk value. The results of the two methods are shown in Table 4.
Table 4: Comparison of segmentation results (Pk) on two events

               ET-LDA   LCSeg
MESpeech       0.295    0.361
ReaganDebate   0.31     0.397
The results show that our model significantly outperforms LCSeg: because LCSeg cannot merge paragraphs with similar topic mixtures, it places too many segmentation boundaries (i.e., it over-segments), resulting in poor performance.
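The Pk measure described above can be sketched as follows. This is a minimal implementation for illustration; the window-size convention and the per-word segment-label encoding are assumptions, not details from the paper:

```python
def pk(reference, hypothesis, k=None):
    """Pk measure (Beeferman, Berger, and Lafferty 1999): the probability
    that two positions k words apart are inconsistently classified as
    same-segment vs. different-segment by the two segmentations.

    Segmentations are encoded as one segment label per word position,
    e.g. [0, 0, 0, 1, 1] -- a hypothetical encoding for this sketch.
    """
    n = len(reference)
    assert len(hypothesis) == n, "segmentations must cover the same text"
    if k is None:
        # common convention: half the mean reference segment length
        k = max(1, round(n / len(set(reference)) / 2))
    errors = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        errors += same_ref != same_hyp  # disagreement inside the window
    return errors / (n - k)
```

Note that Pk penalizes both over-segmentation (hypothesized boundaries splitting windows the reference keeps together) and under-segmentation (missed boundaries), which is why LCSeg's over-segmented output scores poorly.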
5 Conclusion
In this paper, we have described ET-LDA, a joint statistical model that characterizes topical influences between an event and its associated Twitter feeds (tweets). Our model enables topic modeling of the event and tweets and segmentation of the event in one unified framework. We evaluated ET-LDA both quantitatively and qualitatively through three tasks. The experimental results show that our model improves significantly over the baseline methods.
We believe this paper presents the first step towards understanding the complex interactions between events and social media feedback. Beyond the transcripts of publicly televised events used in this paper, ET-LDA can also handle other forms of text sources that describe an event. For example, one can explore how the public responds to an event (and how this differs from journalists' responses in the media) by applying our model to news articles and the social media feedback about that event. We also believe that this paper reveals a perspective that is useful for extracting further dimensions such as sentiment and polarity. For example, one can examine how the crowd's mood is affected by the event based on the topical influences.
Acknowledgements: ET-LDA was initially formulated and
prototyped at Avaya Labs during a summer internship
there by Yuheng Hu, who thanks Dorée Duncan Seligmann for helpful discussions. Hu and Kambhampati’s research at ASU is supported in part by the ONR grant
N000140910032.
References
Beeferman, D.; Berger, A.; and Lafferty, J. 1999. Statistical
models for text segmentation. Machine learning.
Blei, D., and Lafferty, J. 2006. Dynamic topic models. In
ICML. ACM.
Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research.
Boykin, S., and Merlino, A. 2000. Machine learning of event
segmentation for news on demand. Communications of the
ACM.
Chang, J., and Blei, D. 2009. Relational topic models for
document networks. In Artificial Intelligence and Statistics.
Dielmann, A., and Renals, S. 2004. Dynamic Bayesian networks for meeting structuring. In ICASSP. IEEE.
Dietz, L.; Bickel, S.; and Scheffer, T. 2007. Unsupervised
prediction of citation influences. In ICML. ACM.
Galley, M.; McKeown, K.; Fosler-Lussier, E.; and Jing, H.
2003. Discourse segmentation of multi-party conversation.
In ACL. Association for Computational Linguistics.
Griffiths, T., and Steyvers, M. 2004. Finding scientific topics.
PNAS.
Griffiths, T. L.; Steyvers, M.; Blei, D. M.; and Tenenbaum,
J. B. 2004. Integrating topics and syntax. In NIPS.
Hearst, M. 1993. TextTiling: A quantitative approach to discourse segmentation. Sequoia.
Cohn, D., and Hofmann, T. 2001. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS. The MIT Press.
Hu, X., and Liu, H. 2012. Text analytics in social media.
Mining Text Data.
Hu, X.; Sun, N.; Zhang, C.; and Chua, T. 2009. Exploiting internal and external semantics for the clustering of short
texts using world knowledge. In CIKM. ACM.
Hu, Y.; John, A.; Seligmann, D.; and Wang, F. 2012. What were the tweets about? Topical associations between public events and Twitter feeds. In Proceedings of ICWSM. AAAI.
Hu, Y.; John, A.; and Seligmann, D. 2011. Event analytics
via social media. In Proceedings of the 2011 ACM workshop
on Social and behavioural networked media access. ACM.
Nallapati, R.; Ahmed, A.; Xing, E.; and Cohen, W. 2008.
Joint latent topic models for text and citations. In KDD.
ACM.
Purver, M.; Griffiths, T.; Körding, K.; and Tenenbaum, J.
2006. Unsupervised topic modelling for multi-party spoken
discourse. In ACL. Association for Computational Linguistics.
Ramage, D.; Dumais, S.; and Liebling, D. 2010. Characterizing microblogs with topic models. In ICWSM. The AAAI
Press.
Sahami, M., and Heilman, T. 2006. A web-based kernel
function for measuring the similarity of short text snippets.
In WWW. ACM.
Shamma, D.; Kennedy, L.; and Churchill, E. 2009. Tweet
the debates: understanding community annotation of uncollected sources. In Proceedings of the first SIGMM workshop
on Social media. ACM.
Titov, I., and McDonald, R. 2008. Modeling online reviews
with multi-grain topic models. In WWW. ACM.
Wang, X., and McCallum, A. 2006. Topics over time: a non-Markov continuous-time model of topical trends. In KDD. ACM.