Detecting future social unrest in unprocessed Twitter data

“Emerging Phenomena and Big Data”

Ryan Compton*, Craig Lee, Tsai-Ching Lu
HRL Laboratories, Malibu, California
rfcompton@hrl.com

Lalindra De Silva
The University of Utah, Salt Lake City, UT

Michael Macy
Cornell University, Ithaca, NY

ISI 2013, June 4-7, 2013, Seattle, Washington, USA
978-1-4673-6213-9/13/$31.00 ©2013 IEEE

Abstract: We have implemented a social media data mining system capable of forecasting events related to Latin American social unrest. Our method directly extracts a small number of tweets from publicly-available data on twitter.com, condenses similar tweets into coherent forecasts, and assembles a detailed and easily-interpretable audit trail which allows end users to quickly collect information about an upcoming event. Our system functions by continually applying multiple textual and geographic filters to a large volume of data streaming from twitter.com via the public API as well as a commercial data feed. To be specific, we search the entirety of twitter.com for a few carefully chosen keywords, search within those tweets for mentions of future dates, filter again using various logistic regression classifiers, and finally assign a location to an event by geocoding retweeters. Geocoding is done using our previously developed in-house geocoding service which, at the time of this writing, can infer the home location for over 62M twitter.com users [1]. Additionally, we identify demographics likely interested in an upcoming event by searching retweeters' recent posts for demographic-specific keywords.

I. INTRODUCTION

Widespread adoption of social media has made it possible for any individual to rapidly communicate with an audience of thousands [2]. Unlike traditional media, where several difficult, time-consuming steps must be carried out prior to publication, information in social media becomes publicly available within a few seconds of its creation.

In this work we utilize social media's accelerated publication speed to report on events prior to their occurrence, while they are still in their planning stages. Very recently, the speed of publication on twitter.com has motivated its use as a tool for the organization of future civil unrest events [3]. This fact, combined with the public availability of twitter.com data, has motivated our use of it as well. We will demonstrate how we filter through this large and dynamic supply of information for messages exhibiting several independent and informative features.

Early detection of civil unrest events is valuable for several industrial and government applications. For example, if a port is likely to shut down due to a riot, shipping companies may opt to redirect freight in order to prevent unexpected losses. If a massive protest is planned to happen in front of an embassy, governments may elect to postpone diplomatic visits in order to ensure the safety of their politicians. The value of civil unrest forecasting has recently caught the attention of researchers from a wide variety of disciplines [4] [5] [6] [7].

We use early detection rather than prediction. Our method is based on direct extraction of relevant tweets instead of a physical model describing a large-scale theory of population behavior (e.g. [6] [7]). Our forecasts are based on a small number of highly important tweets and are independent of “trends” observed in the aggregate of all tweets. To be specific, our detection algorithm can be broken down into the application of several filters which we use to continually monitor streaming data from twitter.com, cf. alg. 1.

Algorithm 1: Future event detection
  Input: millions of today's tweets on twitter.com
  Output: a few dozen tweets relevant to upcoming events
  t1 = tweets with text containing specific keywords
  t2 = tweets in t1 with text containing future dates
  t3 = tweets in t2 which have passed through a text classifier
  t4 = tweets in t3 whose retweeters localize within nations of interest or whose text contains mentions of specific locations
  return t4

The tweets output by alg. 1 are often few in number and rich in information about upcoming Latin American civil unrest events. We believe that any user of our system will be able to interpret each day's results with minimal effort.

This paper is organized as follows: section II describes each step of alg. 1 in detail. Section III showcases our user interface and has information about the system's past performance. Finally, section IV discusses future work and concluding remarks.

Supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC00285. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

II. METHOD

Our goal is to generate forecasts of the form:

{population, event_type, date, location, probability}

where “population” describes the demographic of the event participants (e.g. education, labor, agriculture), “event_type” gives further detail about the reason for the event (e.g. employment, housing, economic policies), “date” is the date the event is forecast to occur on, “location” is the city where we expect the event to occur, and “probability” is how likely it is that the event will actually happen.

We obtain twitter.com data in two ways: purchasing a commercial realtime data feed consisting of a random sample of 10% of all tweets [8] and continually querying twitter.com's rate-limited public API [9].

A. Keyword searches

The first filter a tweet must pass is a simple check for mentions of Latin American civil unrest keywords. The advantage of this filter is that it is possible to apply it to the entirety of twitter.com by using the public search API. Latin American civil unrest keywords tend to be esoteric enough that we can often obtain complete coverage for several keywords each day.

We have manually identified a collection of 26 keywords which we believe are of substantial importance and query the API to obtain full coverage of these keywords. A larger set of 335 words was identified by a domain expert for use in our model. Due to twitter.com rate-limiting, searching the public API for all occurrences of all 335 keywords is not possible, so searches for these 335 keywords are constrained to our commercial 10% data feed.

B. Future date searches

Simple checks for keyword mentions are poor indicators of tweet content. A quick experiment has shown that, in both English and Spanish, only about 20% of tweets that contain a civil unrest keyword are indeed about civil unrest. Furthermore, it is unclear how to forecast an event date from only tweets with certain keywords. We thus apply a second filter, one for mentions of future dates, to the tweets containing protest keywords.

Our future date filter searches first for month names and abbreviations in Spanish and Portuguese and second for numbers less than 31 within three whitespace-separated tokens of the month name. Thus, an example matching date pattern would be “10 de enero”. Four-digit years are rare in tweets, so to determine the year of the mentioned date we use the year which minimizes the number of days between the mentioned date and the tweet's post time. In our example, if a tweet mentions “10 de enero” on 2012-12-29 we assume the user is talking about 2013-01-10, as 2013-01-10 is closer in time to 2012-12-29 than 2012-01-10 is.

Once we have extracted dates from the text, we require that the mentioned dates occur after the tweet's post time. We assume that tweets mentioning past dates are unlikely to be indicative of future events.

When the future date filter is applied the number of tweets is reduced substantially. A quick in-house experiment on 144167 tweets containing unrest keywords, collected from the twitter.com API on 2013-03-01, found that only 1512 of these tweets (roughly 1%) also contained future dates.

For each tweet passing this filter we tentatively issue a forecast for the mentioned date.

C. Logistic regression classification

Co-mention of keywords with future dates does not guarantee that a particular tweet is indeed about civil unrest. We have developed two classifiers to classify tweets based on their relevance to a civil unrest event.

Our first classifier is a standard logistic regression classifier trained on tweets. The features for the classifier were unigrams and bigrams that surpassed a frequency threshold of 3 in the training data. The training data was acquired through Amazon Mechanical Turk: three annotators labeled 3000 tweets for their relevance to a civil unrest event (pairwise inter-annotator agreement ranged from 0.68 to 0.74).

Our second classifier makes use of recent work we have done establishing that tweets from organizations are roughly three times more likely to be civil unrest-related than similar tweets from individuals [10]. In order to exploit this, we designed an auxiliary classifier that classifies the source user type of a tweet into two categories: organizations and individuals. For this classifier, we make use of an ensemble framework for user type identification based on heuristics, an ngram classifier, and a linguistic classifier. The heuristics were designed to capture two strong cues that are characteristic of organization tweets: 1) they almost always contain a URL and 2) they are rarely replies (tweets beginning with @user mentions). The ngram classifier was based on unigrams and bigrams, and the linguistic classifier captures several types of linguistic features that are characteristic of tweets in either category. These three components of the ensemble are then combined linearly using another logistic regression classifier to determine the user type of any given tweet. After we have identified the posting user as an individual or an organization using this classifier, we adjust the forecast probability accordingly, incorporating the likelihoods to derive the posterior probability of a tweet being civil unrest-related given its user type.

D. Event geocoding

Identification of event locations is central to the goal of this project. We infer the location of an upcoming event by searching the tweet for mentions of cities or monuments from a manually compiled list. In the event that no locations are mentioned in the text, we assign a location to an event based on the location of retweeters.

For each tweet passing the logistic regression filter, we query the twitter.com API for user IDs of all the tweet's retweeters. User IDs are then fed into our geocoder (outlined below) and filtered based on whether or not they reside in Latin America. We note here that this filter is remarkably difficult to pass. Of the 1512 tweets collected in the previous step in our example, only 36 passed the Latin America geocoding filter.

We assume that civil unrest events tend to occur in population centers. Our geocoded users are reverse-geocoded into the nearest city with population greater than 500000. The most commonly occurring city is used as the location of our event.
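As a rough illustration of the city assignment just described, the sketch below snaps each geocoded retweeter to the nearest large city and takes the most frequent city as the event location. It is a minimal sketch under stated assumptions, not the system's implementation: the `CITIES` table, its coordinates and populations, the function names, and the use of the haversine distance are all hypothetical choices made so that the example runs; only the 500000 population threshold and the "most common city" rule come from the text.

```python
import math
from collections import Counter

# Hypothetical city table: (name, latitude, longitude, population).
# Coordinates are approximate and populations are rough placeholders,
# included only so that the example runs.
CITIES = [
    ("Mexico City", 19.43, -99.13, 8_000_000),
    ("Guadalajara", 20.66, -103.35, 1_400_000),
    ("Caracas", 10.48, -66.90, 2_000_000),
    ("Oaxaca", 17.07, -96.73, 250_000),  # below the population threshold
]

MIN_POPULATION = 500_000  # threshold stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (an assumed stand-in for a geodesic distance)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def nearest_large_city(lat, lon, cities=CITIES):
    """Name of the closest city whose population exceeds the threshold."""
    large = [c for c in cities if c[3] > MIN_POPULATION]
    return min(large, key=lambda c: haversine_km(lat, lon, c[1], c[2]))[0]


def event_city(retweeter_coords, cities=CITIES):
    """Most common nearest large city among retweeters, or None if there are none."""
    votes = Counter(nearest_large_city(lat, lon, cities) for lat, lon in retweeter_coords)
    return votes.most_common(1)[0][0] if votes else None


# Three retweeters: one in Mexico City, one in Oaxaca, one in Guadalajara.
retweeters = [(19.40, -99.10), (17.07, -96.73), (20.70, -103.30)]
print(event_city(retweeters))
```

Note that the retweeter located in Oaxaca is counted toward a larger city, because Oaxaca falls below the population threshold in this placeholder table; this mirrors the assumption that unrest events tend to occur in population centers.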
User geocoding: We identify retweeter locations with our previously developed in-house geocoder [1]. At the time of this writing, our geocoder is able to estimate home locations of 62M twitter.com users with a median error of 11.1km.

The distinguishing feature of our geocoder is its ability to infer a user's location based on the locations of that user's online social ties. This is accomplished via a technique similar to “label propagation” [11] applied to the twitter.com bidirectional @mention network. Here, “bidirectional” means that an edge in our network is formed when user i has @mentioned user j and user j has @mentioned user i at least once.

Briefly, our user geocoder works as follows: We begin by extracting home locations for users based on the number of times they have tweeted with location services turned on. When our 10% sample of twitter.com contains 3 or more tweets from a user within a 15km radius we use the geometric median of those tagged tweets to establish the user's home location. This provides us with home locations for 3780887 users. Experiments with self-reported locations in profile information and time zone pruning showed negligible benefits [1].

Denote the bidirectional @mention network by N and define a geometric median function φ : N → N by

\phi(n) =
\begin{cases}
  n & \text{if } n \text{ is gps-known} \\
  \operatorname{argmin}_{x} \sum_{n_i \text{ adj. to } n} d(n_i, x) & \text{else}
\end{cases}
\qquad (1)

where each node n ∈ N stores information about the latitude/longitude for the user associated to that node. The distance function d is computed with Vincenty's formulas [12]. The optimization in eq. (1) is known as the l1 multivariate median or geometric median [13]. Our geocoding algorithm can now be stated concisely.

Algorithm 2: Geocoding
  Initialize: N, gps-known ground truth user locations
  while not converged do
    foreach n ∈ N do
      n ← φ(n)
    end
  end
  return N

Empirically, alg. 2 converges, providing us with location estimates for 62289295 twitter.com users.

E. Demographics and event code assignment

We condense duplicate forecasts for the same date/location into one forecast by averaging their probabilities.

Our domain expert has provided us with lists of terms relevant to several demographics and event types in Latin America. To assign a demographic to each forecast we collect the tweet histories of every retweeter of every tweet associated with a forecast and search them for the terms in our lists. The most commonly occurring classes of terms are used to assign our forecast's demographic and event code.

III. RESULTS

Successful end-user interpretation is important. By approaching this problem from the viewpoint of early detection rather than prediction we can more easily provide an audit trail which can be understood with minimal effort. For each forecast generated we provide the tweets used, the retweeter locations, the keywords matched, and links to all retweeter accounts (cf. fig. 1).

Fig. 1: Example forecast. A march related to Petroleos Mexicanos (Pemex) is planned for March 18 in Mexico City. Our system detected the event on March 5th. The interactive map provides end-users with links to retweeter accounts.

Our system has been in place since 2012-12-17 (cf. fig. 2). In that time the rate at which forecasts are generated has been steadily increasing as we continue to update our algorithms. Incorporation of the twitter API for complete coverage of a small set of keywords happened in late February 2013, and the number of forecasts generated each day can be seen to increase sharply around this time.

Aside from the geocoder, it is possible to implement our system without commercial data feeds, i.e. our results are reproducible from publicly-available data.

The number of events forecast to occur on each day is plotted in fig. 3. There is no obvious pattern. In the future, after substantially more forecasts have been generated, it may be possible to use this time series as part of a prediction-based approach.
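For concreteness, eq. (1) and alg. 2 can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions that differ from the geocoder described in [1]: the haversine distance stands in for Vincenty's formulas, the geometric median is approximated by a medoid (the minimizer is restricted to the neighbors' own locations), updates are synchronous rather than in place, and the function names and the `max_iters` cap are hypothetical.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between (lat, lon) pairs (assumed stand-in for Vincenty)."""
    p1, p2 = math.radians(a[0]), math.radians(b[0])
    dphi = p2 - p1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def medoid(points):
    """Approximate geometric median: the input point with the smallest summed distance."""
    return min(points, key=lambda x: sum(haversine_km(p, x) for p in points))


def propagate(edges, gps_known, max_iters=10):
    """Spread locations over a bidirectional @mention network.

    edges: dict mapping each user to the set of users they share an edge with.
    gps_known: dict mapping users to ground-truth (lat, lon); never overwritten.
    """
    loc = dict(gps_known)
    for _ in range(max_iters):
        updated = dict(loc)
        for user, neighbors in edges.items():
            if user in gps_known:
                continue  # phi(n) = n for gps-known nodes
            pts = [loc[v] for v in neighbors if v in loc]
            if pts:
                updated[user] = medoid(pts)
        if updated == loc:  # no estimate changed: converged
            break
        loc = updated
    return loc


# Toy network: u is tied to three gps-known users, v is tied only to u.
a, b, c = (19.43, -99.13), (19.43, -99.20), (10.48, -66.90)
edges = {"a": {"u"}, "b": {"u"}, "c": {"u"}, "u": {"a", "b", "c", "v"}, "v": {"u"}}
estimates = propagate(edges, {"a": a, "b": b, "c": c})
print(estimates["u"], estimates["v"])
```

In the toy network, user u is placed with the two nearby contacts rather than the distant one, and user v, who has no gps-known contacts, inherits an estimate from u on the following pass. This is the sense in which locations spread along social ties.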
Assessing the performance of our system is relatively straightforward given the audit trails. Manual examination of 283 forecasts generated by this system revealed 157 forecasts (roughly 55%) which are indeed about upcoming Latin American civil unrest events and 126 forecasts related to sporting events, other public functions, or simple chatter.

Fig. 2: Cumulative sum of the number of forecasts generated since 2012-12-17. The addition of twitter.com API searches happened in late February and led to a notable increase in the number of forecasts generated each day.

Fig. 3: Number of events forecast to happen on each day. There is a significant spike on February 14th, which consists mostly of false positives. A quick look at the audit trails reveals that these false positives were due to twitter.com users' propensity to “protest” February 14th.

TABLE I: Number of forecasts generated for each country. Mexico is highly active on twitter.com and receives the most coverage from our system.

Nation         Number of events forecast
Argentina      50
Brazil         19
Chile          39
Colombia       53
Ecuador        15
El Salvador    9
Mexico         156
Paraguay       6
Uruguay        1
Venezuela      82

IV. CONCLUSION

Twitter.com has become a powerful tool for the organization of mass gatherings of all types. However, the sheer volume of twitter.com makes it difficult to automatically identify new and valuable information in real time. In this work we have provided a straightforward approach for the detection of upcoming civil unrest events in Latin America based on successive textual and geographic filters.

Traditional news media is often assumed to be perfectly accurate and can therefore only report on events once they have occurred. The fact that it is now possible to relax the assumption of perfect accuracy and report on events before their occurrence is remarkable, and continued work on this project is already in progress. Immediate future work includes more advanced tweet classification using larger training sets, our @mention network, and dictionary-based approaches. We also plan to analyze the links shared in tweets for further information on upcoming events.

REFERENCES

[1] D. Jurgens, “Inferring location in online communities based on social relationships,” HRL technical report, 2013.
[2] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” WWW, 2010.
[3] J. Skinner, “Social media and revolution: The Arab Spring and the Occupy movement as seen through three information studies paradigms,” 2011.
[4] E. Stepanova, “The role of information communication technologies in the Arab Spring,” PONARS Eurasia Policy Memo No. 159, 2011.
[5] F. Chen, J. Arredondo, R. P. Khandpur, C.-T. Lu, D. Mares, D. Gupta, and N. Ramakrishnan, “Spatial surrogates to forecast social mobilization and civil unrests,” Position Paper in CCC Workshop on “From GPS and Virtual Globes to Spatial Computing-2012”, 2012.
[6] D. Braha, “Global civil unrest: Contagion, self-organization, and prediction,” PLoS One, 2012.
[7] N. Johnson, S. Carran, J. Botner, K. Fontaine, N. Laxague, P. Nuetzel, J. Turnley, and B. Tivnan, “Pattern in escalations in insurgent and terrorist activity,” Science, 2011.
[8] http://gnip.com/twitter/decahose/.
[9] https://dev.twitter.com/.
[10] L. D. Silva and E. Riloff, “Exploiting the textual content of tweets for user type classification,” ICWSM (submitted), 2013.
[11] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
[12] http://www.geodesy.org.
[13] Y. Vardi and C.-H. Zhang, “The multivariate l1-median and associated data depth,” Proceedings of the National Academy of Sciences, vol. 97, no. 4, pp. 1423–1426, 2000.