Detecting future social unrest in unprocessed Twitter data
“Emerging Phenomena and Big Data”
Ryan Compton*
Craig Lee
Tsai-Ching Lu
Lalindra De Silva
Michael Macy
The University of Utah
Salt Lake City, UT
Cornell University
Ithaca, NY
HRL Laboratories
Malibu, California
rfcompton@hrl.com
Abstract—We have implemented a social media data mining
system capable of forecasting events related to Latin American
social unrest. Our method directly extracts a small number of
tweets from publicly-available data on twitter.com, condenses
similar tweets into coherent forecasts, and assembles a detailed
and easily-interpretable audit trail which allows end users to
quickly collect information about an upcoming event.
Early detection of civil unrest events is valuable for several
industrial and government applications. For example, if a port
is likely to shut down due to a riot, shipping companies may
opt to redirect freight in order to prevent unexpected losses. If
a massive protest is planned to happen in front of an embassy,
governments may elect to postpone diplomatic visits in order to
ensure the safety of their politicians. The value of civil unrest
forecasting has recently caught the attention of researchers
from a wide variety of disciplines [4] [5] [6] [7].
Our system functions by continually applying multiple textual
and geographic filters to a large volume of data streaming from
twitter.com via the public API as well as a commercial data
feed. To be specific, we search the entirety of twitter.com for
a few carefully chosen keywords, search within those tweets
for mentions of future dates, filter again using various logistic
regression classifiers, and finally assign a location to an event by
geocoding retweeters. Geocoding is done using our previously developed
in-house geocoding service which, at the time of this
writing, can infer the home location for over 62M twitter.com
users [1]. Additionally, we identify demographics likely interested
in an upcoming event by searching retweeters’ recent posts for
demographic-specific keywords.
We use early detection rather than prediction. Our method
is based on direct extraction of relevant tweets instead of a
physical model describing a large-scale theory of population
behavior (e.g. [6] [7]). Our forecasts are based on a small
number of highly important tweets and are independent of
“trends” observed in the aggregate of all tweets.
To be specific, our detection algorithm can be broken
down into the application of several filters which we use to
continually monitor streaming data from twitter.com (cf. alg. 1).
I. INTRODUCTION
Widespread adoption of social media has made it possible
for any individual to rapidly communicate with an audience of
thousands [2]. Unlike traditional media, where several difficult
time-consuming steps must be carried out prior to publication,
information in social media becomes publicly available within
a few seconds of its creation.
Algorithm 1: Future event detection
Input: millions of today’s tweets on twitter.com
Output: a few dozen tweets relevant to upcoming
events
t1 = tweets with text containing specific keywords
t2 = tweets in t1 with text containing future dates
t3 = tweets in t2 which have passed through a text
classifier
t4 = tweets in t3 whose retweeters localize within
nations of interest or whose text contains mentions of
specific locations
return t4
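As a rough illustration, the filter cascade of alg. 1 can be sketched in Python as follows. The `Tweet` record, the keyword, month, and nation sets, and the four predicate functions are hypothetical stand-ins introduced for this example; the date and classifier checks in particular are placeholders rather than the filters used by the system.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Tweet:
    text: str
    # Country codes of geocoded retweeters (hypothetical field).
    retweeter_countries: List[str] = field(default_factory=list)


Filter = Callable[[Tweet], bool]

# Illustrative values only; these are not the lists used in the paper.
KEYWORDS = {"marcha", "protesta", "huelga"}
MONTHS = {"enero", "febrero", "marzo"}
NATIONS_OF_INTEREST = {"MX", "AR", "BR"}


def has_keyword(t: Tweet) -> bool:
    return any(tok in KEYWORDS for tok in t.text.lower().split())


def has_future_date(t: Tweet) -> bool:
    # Crude placeholder: only checks that a month name is present.
    return any(tok in MONTHS for tok in t.text.lower().split())


def passes_classifier(t: Tweet) -> bool:
    # Placeholder for the text classifier; accepts everything.
    return True


def is_localized(t: Tweet) -> bool:
    return any(c in NATIONS_OF_INTEREST for c in t.retweeter_countries)


def run_filters(tweets: Iterable[Tweet], filters: List[Filter]) -> List[Tweet]:
    """Apply each filter in order; a tweet survives only if it passes all."""
    surviving = list(tweets)
    for keep in filters:
        surviving = [t for t in surviving if keep(t)]
    return surviving


if __name__ == "__main__":
    stream = [
        Tweet("Gran marcha el 10 de enero", ["MX", "MX"]),
        Tweet("marcha de ayer", ["MX"]),
    ]
    stages = [has_keyword, has_future_date, passes_classifier, is_localized]
    for tweet in run_filters(stream, stages):
        print(tweet.text)
```

Ordering the cheap keyword check first mirrors the structure of alg. 1, where each stage only sees the tweets that survived the previous one.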
In this work we utilize social media’s accelerated publication speed to report on events prior to their occurrence, while
they are still in their planning stages. Very recently, the speed
of publication on twitter.com has motivated its use as a tool
for the organization of future civil unrest events [3]. This fact,
combined with the public availability of twitter.com data, has
motivated our use of it as well. We will demonstrate how we
filter through this large and dynamic supply of information
for messages exhibiting several independent and informative
features.
Supported by the Intelligence Advanced Research Projects Activity
(IARPA) via Department of Interior National Business Center (DoI / NBC)
contract number D12PC00285. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any
copyright annotation thereon. The views and conclusions contained herein are
those of the authors and should not be interpreted as necessarily representing
the official policies or endorsements, either expressed or implied, of IARPA,
DoI/NBC, or the U.S. Government.
978-1-4673-6213-9/13/$31.00 ©2013 IEEE
The tweets output by alg. 1 are often few in number
and rich in information about upcoming Latin American civil
unrest events. We believe that any user of our system will be
able to easily interpret each day’s results with minimal effort.
ISI 2013, June 4-7, 2013, Seattle, Washington, USA
This paper is organized as follows: Section II describes each
step of alg. 1 in detail. Section III showcases our user interface
and reports on the system’s past performance.
Finally, Section IV presents concluding remarks and discusses
future work.
Once we have extracted dates from the text, we require
that the mentioned dates occur after the tweet’s post time, since we
assume that tweets mentioning past dates are unlikely to be
indicative of future events.
II. METHOD
Our goal is to generate forecasts of the form:
Applying the future date filter reduces the number of
tweets substantially. A quick in-house experiment
on 144167 tweets containing unrest keywords, collected from
the twitter.com API on 2013-03-01, found that only 1512 of
these tweets also contained future dates.
{population, event_type, date,
location, probability}
where “population” describes the demographic of the event
participants (e.g. education, labor, agriculture), “event_type”
gives further detail about the reason for the event (e.g. employment, housing, economic policies), “date” is the date on which the event
is forecast to occur, “location” is the city where we expect
the event to occur, and “probability” is how likely it is that
the event will actually happen.
For each tweet passing this filter we tentatively issue a
forecast for the mentioned date.
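The future date filter can be sketched in Python as below, under several assumptions. Only full Spanish month names are listed, whereas the text also mentions abbreviations and Portuguese. Day numbers from 1 to 31 are accepted, which is an inclusive reading of the bound stated in the text. The function names and the punctuation-stripping step are introduced for this example only.

```python
from datetime import date
from typing import Optional, Tuple

# Assumption: full Spanish month names only.
MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4, "mayo": 5,
    "junio": 6, "julio": 7, "agosto": 8, "septiembre": 9,
    "octubre": 10, "noviembre": 11, "diciembre": 12,
}
MAX_GAP = 3  # day and month must lie within three whitespace-separated tokens
PUNCT = ".,;:!?¡¿\"'()“”"


def find_month_day(text: str) -> Optional[Tuple[int, int]]:
    """Return (month, day) for the first month name with a nearby day number."""
    tokens = [tok.strip(PUNCT) for tok in text.lower().split()]
    for i, tok in enumerate(tokens):
        if tok not in MONTHS:
            continue
        lo, hi = max(0, i - MAX_GAP), min(len(tokens), i + MAX_GAP + 1)
        for j in range(lo, hi):
            cand = tokens[j]
            if cand.isdecimal() and 1 <= int(cand) <= 31:
                return MONTHS[tok], int(cand)
    return None


def resolve_year(month: int, day: int, posted: date) -> Optional[date]:
    """Pick the year that minimizes the gap between the date and the post time."""
    candidates = []
    for year in (posted.year - 1, posted.year, posted.year + 1):
        try:
            candidates.append(date(year, month, day))
        except ValueError:  # e.g. 29 February in a non-leap year
            continue
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs((d - posted).days))


def future_date(text: str, posted: date) -> Optional[date]:
    """Return the mentioned date if it falls after the post time, else None."""
    found = find_month_day(text)
    if found is None:
        return None
    resolved = resolve_year(found[0], found[1], posted)
    if resolved is None or resolved <= posted:
        return None
    return resolved
```

For a tweet mentioning “10 de enero” posted on 2012-12-29, the candidate years 2011, 2012, and 2013 are compared, and 2013-01-10 is selected because it is only 12 days from the post time.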
C. Logistic regression classification
Co-mentions of keywords with future dates do not guarantee that a particular tweet is indeed about civil unrest. We
have developed two classifiers to classify tweets based on
their relevance to a civil unrest event. Our first classifier is a
standard logistic regression classifier trained on tweets. The
features for the classifier were unigrams and bigrams that
surpassed a frequency threshold of 3 in the training data.
The training data consisted of 3000 tweets annotated for
their relevance to a civil unrest event by three annotators
recruited through Amazon Mechanical Turk (pairwise inter-annotator
agreement ranged from 0.68 to 0.74).
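The feature construction for the first classifier might look like the following Python sketch. It assumes whitespace tokenization, binary presence features, and that “surpassed a frequency threshold of 3” means an n-gram must occur more than three times in the training data; none of these details are stated in the text. The function names are introduced for this example, and the resulting vectors could be passed to any standard logistic regression implementation.

```python
from collections import Counter
from typing import Dict, Iterable, List


def ngrams(text: str) -> List[str]:
    """Return the unigrams and bigrams of a whitespace-tokenized tweet."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def build_vocabulary(texts: Iterable[str], threshold: int = 3) -> Dict[str, int]:
    """Map each n-gram occurring more than `threshold` times to a feature index."""
    counts = Counter(g for text in texts for g in ngrams(text))
    kept = sorted(g for g, c in counts.items() if c > threshold)
    return {g: i for i, g in enumerate(kept)}


def vectorize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Binary presence vector over the vocabulary (an assumed encoding)."""
    vec = [0] * len(vocab)
    for g in ngrams(text):
        if g in vocab:
            vec[vocab[g]] = 1
    return vec
```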
We obtain twitter.com data in two ways: purchasing a
commercial realtime data feed consisting of a random sample
of 10% of all tweets [8] and continually querying twitter.com’s
rate-limited public API [9].
A. Keyword searches
The first filter a tweet must pass is a simple check for mentions of Latin American civil unrest keywords. The advantage
of this filter is that it is possible to apply it to the entirety of
twitter.com by using the public search API. Latin American
civil unrest keywords tend to be esoteric enough that we can
often obtain complete coverage for several keywords each day.
Our second classifier makes use of recent work we have
done establishing that tweets from organizations are roughly
three times more likely to be civil unrest-related than similar
tweets from individuals [10]. To exploit this,
we designed an auxiliary classifier that assigns the source
user of a tweet to one of two categories: organizations and
individuals. This classifier is an ensemble
for user type identification based on heuristics, an
n-gram classifier, and a linguistic classifier. The heuristics were
designed to capture two strong cues that are characteristic
of organization tweets: 1) they almost always contain a
URL, and 2) they are rarely replies
(tweets beginning with @user mentions). The n-gram classifier
is based on unigrams and bigrams, and the linguistic classifier
captures several types of linguistic features that are characteristic of tweets in either category. The outputs of these three components
are then combined linearly by
another logistic regression classifier to determine the user type
of any given tweet. After we have identified the posting user
as an individual or an organization using this classifier, we adjust the
forecast probability accordingly, using the likelihoods to derive the posterior probability of a tweet being civil
unrest-related given its user type.
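One way such an adjustment could be carried out is a direct application of Bayes' rule, sketched below in Python. The text does not give the exact formula, so this is an assumption: the text classifier's output is treated as the prior, and the two likelihoods, P(user type | unrest) and P(user type | not unrest), are hypothetical inputs that would have to be estimated from labeled data.

```python
def posterior_given_user_type(prior: float,
                              lik_unrest: float,
                              lik_other: float) -> float:
    """Bayes' rule for P(unrest | user type).

    prior      -- P(unrest) before seeing the user type (assumed to come
                  from the text classifier)
    lik_unrest -- P(user type | unrest), hypothetical estimate
    lik_other  -- P(user type | not unrest), hypothetical estimate
    """
    numerator = prior * lik_unrest
    denominator = numerator + (1.0 - prior) * lik_other
    if denominator == 0.0:
        raise ValueError("user type has zero probability under both classes")
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative numbers only: the user type is three times as likely
    # among unrest tweets as among other tweets.
    print(posterior_given_user_type(prior=0.5, lik_unrest=0.6, lik_other=0.2))
```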
We have manually identified a collection of 26 keywords
which we believe are of substantial importance and query the
API to obtain full coverage of these keywords.
A set of 335 words was identified by a domain expert for
use in our model. Due to twitter.com rate-limiting, searching
the public API for all occurrences of all 335 keywords is not
possible, so searches for these 335 keywords are constrained
to our commercial 10% data feed.
B. Future date searches
Simple checks for keyword mentions are poor indicators
of tweet content. A quick experiment has shown that, in
both English and Spanish, only about 20% of tweets that
contain a civil unrest keyword are indeed about civil unrest.
Furthermore, it is unclear how to forecast an event date from
only tweets with certain keywords. We thus apply a second
filter, one for mentions of future dates, to the tweets containing
protest keywords.
Our future date filter searches first for month names and
abbreviations in Spanish and Portuguese and second for numbers less than 31 within three whitespace-separated tokens of
each other. Thus, an example matching date pattern would be
“10 de enero”. Four-digit years are rare in tweets; to
determine the year of the mentioned date we use the year
that minimizes the number of days between the mentioned
date and the tweet’s post time. In our example, if a tweet
mentions “10 de enero” on 2012-12-29 we assume the user is
talking about 2013-01-10, as 2013-01-10 is closer in time to
2012-12-29 than 2012-01-10 is.
D. Event geocoding
Identification of event locations is central to the goal of
this project. We infer the location of an upcoming event by
searching the tweet for mentions of cities or monuments from
a manually compiled list. In the event that no locations are
mentioned in text, we assign a location to an event based on
the location of retweeters.
For each tweet passing the logistic regression filter, we
query the twitter.com API for user IDs of all the tweet’s
retweeters. User IDs are then fed into our geocoder (outlined
below) and filtered based on whether or not they reside in Latin
America. We note here that this filter is remarkably difficult to
pass. Of the 1512 tweets collected in the previous step in our
example, only 36 passed the Latin America geocoding filter.
We assume that civil unrest events tend to occur in population centers. Our geocoded users are reverse-geocoded into
the nearest city with population greater than 500000. The most
commonly occurring city is used as the location of our event.
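The city assignment step can be illustrated with the Python sketch below. The small gazetteer holds illustrative values rather than authoritative data, the haversine formula is used here as a simpler stand-in for a geodesic distance, and the function names are introduced for this example.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical gazetteer: (name, latitude, longitude, population).
# Coordinates and populations are rough illustrative values.
CITIES = [
    ("Mexico City", 19.43, -99.13, 8_900_000),
    ("Guadalajara", 20.67, -103.35, 1_500_000),
    ("Small Town", 19.41, -99.11, 20_000),
]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def nearest_large_city(lat: float, lon: float, min_pop: int = 500_000) -> Optional[str]:
    """Reverse-geocode a point to the nearest city with population above min_pop."""
    large = [c for c in CITIES if c[3] > min_pop]
    if not large:
        return None
    return min(large, key=lambda c: haversine_km(lat, lon, c[1], c[2]))[0]


def event_city(user_locations: List[Tuple[float, float]]) -> Optional[str]:
    """Return the most commonly occurring large city among geocoded users."""
    votes = Counter()
    for lat, lon in user_locations:
        city = nearest_large_city(lat, lon)
        if city is not None:
            votes[city] += 1
    return votes.most_common(1)[0][0] if votes else None
```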
User geocoding: We identify retweeter locations with our
previously developed in-house geocoder [1]. At the time of
this writing, our geocoder is able to estimate home locations
of 62M twitter.com users with a median error of 11.1km. The
distinguishing feature of our geocoder is its ability to infer
a user’s location based on the locations of that user’s online
social ties. This is accomplished via a technique similar to “label propagation” [11] applied to the twitter.com bidirectional
@mention network. Here, “bidirectional” means that edges in
our network are formed when user i has @mentioned user j
and user j has @mentioned user i at least once.
Briefly, our user geocoder works as follows: We begin
by extracting home locations for users based on the number
of times they have tweeted with location services turned on.
When our 10% sample of twitter.com contains 3 or more
tweets from a user within a 15km radius we use the geometric
median of those tagged tweets to establish the user’s home
location. This provides us with home locations for 3780887
users. Experiments with self-reported locations in profile information and time zone pruning showed negligible benefits
[1]. Denote the bidirectional @mention network by N and
define a geometric median function φ : N → N by
\[
\phi(n) =
\begin{cases}
n & \text{if } n \text{ is gps-known} \\
\operatorname*{arg\,min}_{x} \sum_{n_i \text{ adj. to } n} d(n_i, x) & \text{else}
\end{cases}
\tag{1}
\]
where each node n ∈ N stores information about the
latitude/longitude for the user associated to that node. The
distance function is computed with Vincenty’s formulas [12].
The optimization in eq. (1) is known as the l1 multivariate
median or geometric median [13]. Our geocoding algorithm
can now be stated concisely.
Fig. 1: Example forecast. A march related to Petroleos Mexicanos (Pemex) is planned for March 18 in Mexico City. Our
system detected the event on March 5th. The interactive map
provides end-users with links to retweeter accounts.
E. Demographics and event code assignment
We condense duplicate forecasts for the same date/location
into one forecast by averaging their probabilities.
Our domain expert has provided us with lists of terms relevant to several demographics and event types in Latin America.
To assign a demographic to each forecast we collect the tweet
histories of every retweeter of every tweet associated with a
forecast and search our list of terms. The most commonly
occurring classes of terms are used to assign our forecast’s
demographic and event code.
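The two bookkeeping steps of this subsection, averaging duplicate forecasts and picking the most common class of terms, can be sketched in Python as follows. The term lists, the tuple layout of a forecast, and the function names are hypothetical; the actual lists were supplied by a domain expert and are not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical term lists, for illustration only.
DEMOGRAPHIC_TERMS: Dict[str, Set[str]] = {
    "education": {"estudiantes", "universidad"},
    "labor": {"sindicato", "huelga"},
}


def condense(forecasts: Iterable[Tuple[str, str, float]]) -> Dict[Tuple[str, str], float]:
    """Average the probabilities of forecasts sharing the same (date, city)."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for day, city, prob in forecasts:
        grouped[(day, city)].append(prob)
    return {key: sum(probs) / len(probs) for key, probs in grouped.items()}


def assign_class(histories: Iterable[str],
                 term_lists: Dict[str, Set[str]]) -> Optional[str]:
    """Return the class whose terms occur most often in retweeter histories."""
    counts: Counter = Counter()
    for tweet in histories:
        tokens = set(tweet.lower().split())
        for label, terms in term_lists.items():
            counts[label] += len(tokens & terms)
    if not counts or max(counts.values()) == 0:
        return None
    return counts.most_common(1)[0][0]
```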
Algorithm 2: Geocoding
Initialize: N , gps-known ground truth user locations
while not converged do
foreach n ∈ N do
n ← φ(n)
end
end
return N
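A small Python sketch of alg. 2 is given below with two simplifications, both assumptions made for brevity. First, the geometric median in eq. (1) is approximated by a medoid, i.e. the neighbour location that minimizes the summed distance to the other neighbour locations. Second, distances use the haversine formula instead of Vincenty's formulas. The graph layout and function names are introduced for this example.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    p1, p2 = math.radians(a[0]), math.radians(b[0])
    dphi = p2 - p1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def medoid(points: List[Point]) -> Point:
    """Discrete stand-in for the geometric median of eq. (1)."""
    return min(points, key=lambda x: sum(haversine_km(p, x) for p in points))


def propagate(graph: Dict[str, List[str]],
              gps_known: Dict[str, Point],
              max_iters: int = 10) -> Dict[str, Point]:
    """Spread locations from gps-known users across the mention network."""
    located = dict(gps_known)
    for _ in range(max_iters):
        updated = dict(located)
        for node, neighbours in graph.items():
            if node in gps_known:
                continue  # phi(n) = n for gps-known nodes
            pts = [located[m] for m in neighbours if m in located]
            if pts:
                updated[node] = medoid(pts)
        if updated == located:  # no estimate changed: treat as converged
            break
        located = updated
    return located
```

Each pass reads only the estimates from the previous pass, so the result does not depend on the order in which nodes are visited.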
III. RESULTS
Successful end-user interpretation is important. By approaching this problem from the viewpoint of early detection
rather than prediction we can more easily provide an audit trail
which can be understood with minimal effort. For each forecast
generated we provide the tweets used, the retweeter locations,
the keywords matched, and links to all retweeter accounts (cf.
fig. 1).
Empirically, alg. 2 converges, providing us with location
estimates for 62289295 twitter.com users.
Our system has been in place since 2012-12-17 (cf. fig. 2).
In that time the rate at which forecasts are generated has been
steadily increasing as we continue to update our algorithms.
Incorporation of the twitter.com API for complete coverage of a
small set of keywords happened in late February 2013, and
the number of forecasts generated each day can be seen to
increase sharply around this time. Aside from the geocoder, it
is possible to implement our system without commercial data
feeds, i.e. our results are reproducible from publicly-available
data.
The number of events forecast to occur on each day is
plotted in fig. 3. There is no obvious pattern. In the future,
after substantially more forecasts have been generated, it may
be possible to use this time series as part of a prediction-based
approach.
Assessing the performance of our system is relatively
straightforward given the audit trails. Manual examination of
283 forecasts generated by this system revealed 157 forecasts
which are indeed about upcoming Latin American civil unrest
events and 126 forecasts related to sporting events, other public
functions, or simple chatter.
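The counts reported in this paper translate into simple rates, worked through in the short Python snippet below. All numbers are taken from the text; the variable names and the framing as “precision” and “pass rate” are introduced here for illustration.

```python
# Counts reported in the text.
keyword_tweets = 144167   # tweets containing unrest keywords (2013-03-01)
dated_tweets = 1512       # of those, tweets that also mention a future date
geocoded_tweets = 36      # of those, tweets passing the Latin America filter
true_forecasts = 157      # manually confirmed civil unrest forecasts
false_forecasts = 126     # sporting events, other functions, or chatter

total_forecasts = true_forecasts + false_forecasts   # 283
precision = true_forecasts / total_forecasts         # about 0.55
date_pass_rate = dated_tweets / keyword_tweets       # about 0.010
geo_pass_rate = geocoded_tweets / dated_tweets       # about 0.024

print(f"precision of manual sample: {precision:.3f}")
print(f"future date filter pass rate: {date_pass_rate:.4f}")
print(f"geocoding filter pass rate: {geo_pass_rate:.4f}")
```

In other words, roughly 55% of the manually examined forecasts concerned genuine upcoming civil unrest events, while the date and geocoding filters each pass only a few percent of their input.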
Fig. 2: Cumulative sum of the number of forecasts generated
since 2012-12-17. The addition of twitter.com API searches
happened in late February and led to a notable increase in the
number of forecasts generated each day.
IV. CONCLUSION
Twitter.com has become a powerful tool for the organization of mass gatherings of all types. However, the sheer volume
of twitter.com data makes it difficult to automatically identify
new and valuable information in real time. In this work we
have provided a straightforward approach for the detection
of upcoming civil unrest events in Latin America based on
successive textual and geographic filters.
Traditional news media is often assumed to be perfectly
accurate and can therefore only report on events once they
have occurred. The fact that it is now possible to relax the
assumption of perfect accuracy and report on events before
their occurrence is remarkable and continued work on this
project is already in progress.
Immediate future work includes more advanced tweet
classification using larger training sets, our @mention network,
and dictionary-based approaches. We also plan to analyze the
links shared in tweets for further information on upcoming
events.
Fig. 3: Number of events forecast to happen on each day. There
is a significant spike on February 14th consisting mostly of false
positives. A quick look at the audit trails reveals that these
false positives were due to twitter.com users’ propensity to
“protest” February 14th.
TABLE I: Number of forecasts generated for each country.
Mexico is highly active on twitter.com and receives the most
coverage from our system.

nation         number of events forecast
Argentina      50
Brazil         19
Chile          39
Colombia       53
Ecuador        15
El Salvador    9
Mexico         156
Paraguay       6
Uruguay        1
Venezuela      82

REFERENCES
[1] D. Jurgens, “Inferring location in online communities based on social
relationships,” HRL technical report, 2013.
[2] H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social
network or a news media?” WWW, 2010.
[3] J. Skinner, “Social media and revolution: The arab spring and the
occupy movement as seen through three information studies paradigms,”
2011.
[4] E. Stepanova, “The role of information communication technologies in
the arab spring,” PONARS Eurasia Policy Memo No. 159, 2011.
[5] F. Chen, J. Arredondo, R. P. Khandpur, C.-T. Lu, D. Mares, D. Gupta,
and N. Ramakrishnan, “Spatial surrogates to forecast social mobilization
and civil unrests,” Position Paper in CCC Workshop on “From GPS and
Virtual Globes to Spatial Computing-2012”, 2012.
[6] D. Braha, “Global civil unrest: Contagion, self-organization, and prediction,” PLoS One, 2012.
[7] N. Johnson, S. Carran, J. Botner, K. Fontaine, N. Laxague, P. Nuetzel,
J. Turnley, and B. Tivnan, “Pattern in escalations in insurgent and
terrorist activity,” Science, 2011.
[8] http://gnip.com/twitter/decahose/.
[9] https://dev.twitter.com/.
[10] L. D. Silva and E. Riloff, “Exploiting the textual content of tweets for
user type classification,” ICWSM (submitted), 2013.
[11] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled
data with label propagation,” Technical Report CMU-CALD-02-107,
Carnegie Mellon University, 2002.
[12] http://www.geodesy.org.
[13] Y. Vardi and C.-H. Zhang, “The multivariate l1-median and associated
data depth,” Proceedings of the National Academy of Sciences, vol. 97,
no. 4, pp. 1423–1426, 2000.