Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering

Georgiana Ifrim, Bichen Shi, Igor Brigadir
Insight Centre for Data Analytics
University College Dublin

SNOW Data Challenge
April 8, 2014

Outline
• Background
• Method Proposed
• Method Details
• Results
• Future Work

Background
Social media outlets (e.g., Twitter) play an increasing role in the cycle of news production
Journalists use Twitter for news selection and presentation
Twitter:
• An endless, real-time, global stream of news
• Large scale and very noisy (redundant, messy content)
Challenge: Extract (close to real-time) newsworthy topics/events/stories from the Twitter stream, in a format usable by news professionals (e.g., topic-timestamp, topic-headline, topic-tags, tweet-ids, photo-urls)

Challenge
From this:
#Obama #follow #followme #followforfollow #followme #follower #followers #alwaysfollowback #followbackalways #teamfollowback
I VOTED !!! #OBAMA http://instagram.com/p/RsoNuMgLkr/
@BarackObama #TeaamObama !!!! 🇺🇸
Om 12u zou de eerste uitslag binnen zijn. Nu nog steeds niks. Dit trek ik niet. Wekker over 4u en we kijken dan wel. #obama #forward
My President is Black #Obama2012 #Retweet
voted at school! @barackobama I love you! #Forward2012 🇺🇸 http://instagr.am/p/RtONyRC9Tm/
Romney Romney Romney Romney Romney Romney Romney Romney!!!!!!!!!!!!!!!!!!!!! 🇺🇸🇺🇸🇺🇸 🇺🇸 #RomneyRyan2012

Challenge
To this:
• Obama wins Vermont
• Romney wins Kentucky
• Bernie Sanders wins Senate seat in Vermont
• Romney wins Indiana

Method Proposed
1. Aggressive Data Filtering (to remove noise, to scale)
2. Hierarchical Clustering of Tweets + Dendrogram Cutting (to obtain clusters without needing to know the number of clusters a priori)
3. Ranking of Clusters (to favor news-like topics)
4. Extracting Topic-Headlines (usable information)
5. Re-clustering Topic-Headlines (to remove topic fragmentation)
6. Extracting Final Topics (as presented to the user)

Method Details

Software
• Collecting Twitter streams
  - SNOW Challenge Code (based on Twitter4J API)
• All other development (https://github.com/heerme/twitter-topics)
  - Python 2.7 + libraries: scipy, numpy, sklearn, nltk, json
  - Tweet-NLP: CMUTweetTagger (trained on tweets, entity detection)
  - Efficient clustering: fastcluster (C++ lib, interface to Python/R)

Data Collection
• US Presidential Elections 2012
  - Collected from tweet ids (06-11-2012, 23:30 to 07-11-2012, 06:51)
  - 1,084,200 raw tweets text (English + non-English, 252 MB)
• Syria, Ukraine, Bitcoin 2014
  - Collected from keywords + user ids (25-02-2014, 17:30 to 26-02-2014, 18:15)
  - 1,088,593 raw tweets JSON (English + non-English, 4.3 GB)
  - 943,175 English tweets JSON (3.8 GB)
  - 943,175 tweets text (extract subset of fields from JSON object, 240 MB)
  - Replace re-tweet text with original tweet text

Data Pre-processing
• Tweet filtering
  - Clean tweet text. Remove: URLs, user mentions, hashtags, punctuation, digits
  - Tokenize remaining text
  - Rebuild tweet by appending: user mentions (@) + hashtags (#) + text tokens
  - Remove tweets based on structure (remove if too many @, # or too few text tokens)
• Term filtering
  - Keep only bi-grams + tri-grams occurring in at least a percentage of tweets in time window (e.g., min(10, n_tweets_in_window * 0.0025))
• Tweet-Term Matrix (binary)
  - Remove out-of-vocabulary tweets and very short tweets (fewer than 5 tokens)
  - Retains about 20% of the original raw tweet stream (in each time window)

Hierarchical Tweet Clustering
• Computing tweet pairwise distance
  - Scale and normalize tweet-term matrix
  - Cosine as distance metric (Euclidean gives similar results); sklearn + scipy
• Computing hierarchical clustering => dendrogram
  - fastcluster C++ library (interface to R and Python)
• Dendrogram cutting
  - Cut at 0.5 distance threshold (better libraries based on topology of dendrogram available in R, e.g., Dynamic Tree Cut: only specify min number of examples in each cluster)
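
The term filtering and clustering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names (ngrams, cluster_window), whitespace tokenization, the average linkage method, and passing the document-frequency threshold directly as min_df are all assumptions made for the example. It uses scipy's linkage in place of fastcluster.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def ngrams(tokens, sizes=(2, 3)):
    """Return the set of bi-grams and tri-grams in a token list."""
    return {" ".join(tokens[i:i + n])
            for n in sizes
            for i in range(len(tokens) - n + 1)}


def cluster_window(tweets, min_df=2, min_tokens=5, cut=0.5):
    """Cluster the tweets of one time window; returns (kept_indices, labels)."""
    # Assumption: tweets are already cleaned, so whitespace splitting suffices.
    tokenized = [t.lower().split() for t in tweets]
    grams = [ngrams(toks) for toks in tokenized]

    # Term filtering: keep n-grams that occur in at least min_df tweets.
    df = Counter(g for gs in grams for g in gs)
    vocab = sorted(g for g, count in df.items() if count >= min_df)
    index = {g: j for j, g in enumerate(vocab)}

    # Drop very short tweets and tweets with no in-vocabulary term.
    kept = [i for i, (toks, gs) in enumerate(zip(tokenized, grams))
            if len(toks) >= min_tokens and any(g in index for g in gs)]
    if len(kept) < 2:
        return kept, np.ones(len(kept), dtype=int)

    # Binary tweet-term matrix: one row per kept tweet, one column per n-gram.
    X = np.zeros((len(kept), len(vocab)))
    for row, i in enumerate(kept):
        for g in grams[i]:
            if g in index:
                X[row, index[g]] = 1.0

    # Cosine distance ignores row length, so no explicit row normalization here.
    # Assumption: average linkage; the slides do not name the linkage method.
    Z = linkage(pdist(X, metric="cosine"), method="average")
    labels = fcluster(Z, t=cut, criterion="distance")  # cut at the threshold
    return kept, labels
```

Calling cluster_window on the cleaned tweets of one time window returns the indices of the tweets that survive filtering and one cluster label per surviving tweet. For large windows, fastcluster's linkage function is designed as a drop-in replacement for the scipy call.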
One cluster = one potential topic

Hierarchical Tweet Clustering
• Ranking clusters
  - Retain only clusters with at least 10 tweets (size constraint)
  - Score each cluster based on:
      Compute cluster-centroid (vector of terms)
      Get maximum term-score (over all centroid terms)
      Term score: entity_score * burstiness_score
      Assign the highest term score as cluster score
      Normalize cluster score by cluster size
  - Entity score = 2.5 (identify entity-terms with CMUTweetTagger)
  - Burstiness score = df-idf_t, with t=4 (prior work on BNgram)
  - Interesting extensions to cluster score: article_score, tweet_importance based on trustworthiness or clout of users issuing tweet

Hierarchical Tweet (Re)Clustering
• Selecting topic-headlines
  - Take top-20 ranked clusters as potential topics
  - Select first (time-wise) tweet in each cluster as topic-headline
• Re-cluster headlines
  - Hierarchical clustering of headlines
  - Score each headline-cluster by its highest-scoring headline
  - Rank headline-clusters, take top-10
• Final topics
  - Select first (published) headline in each cluster, present raw tweet (less URL) to user
  - Gather all distinct keywords of headlines in headline-cluster to create topic-tags
  - Tweet ids for topic: the ids of corresponding headlines. If headlines do not cluster, only one tweet id

Results
Top-10 topics, first time window in US stream (07-11-2012 00:00 – 00:10)
1. WASHINGTON (AP) - Obama wins Vermont; Romney wins Kentucky. #Election2012
2. Not a shocker NBC reporting #Romney wins Indiana & Kentucky #Obama wins Vermont
3. RT @SkyNewsBreak: Sky News projection: Romney wins Kentucky. #election2012
4. AP RACE CALL: Democrat Peter Shumlin wins governor race in Vermont. #Election2012
5. CNN Virginia exit poll: Obama 49%, Romney 49% #election2012
6. Mitt Romney Losing in Massachusetts a state that he governed. Why vote for him when his own people don't want him? #Obama2012
7. Twitter is gonna be live and popping when Obama wins! #Obama2012
8. INDIANA RESULTS: Romney projected winner (via @NBC) #election2012
9. If Obama wins I'm going to celebrate... If Romney wins I'm going to watch Sesame Street one last time #Obama2012
10. #election2012 important that Romney won INdependents in Virginia by 11 pts. With parties about even, winning Inds is key

Results
Top-10 topics, first time window in Syria stream (25-02-2014 18:00 – 18:15)
1. The new, full Godzilla trailer has roared online
2. At half-time Borussia Dortmund lead Zenit St Petersburg 2-0.
3. Ukraine Currency Hits Record Low Amid Uncertainty: Ukrainian currency, the hryvnia, hits all-time low against ...
4. Ooh, my back! Why workers' aches pains are hurting the UK economy
5. Uganda: how campaigners are preparing to counter the anti-gay bill
6. JPost photographer snaps what must be the most inadvertantly hilarious political picture of the decade
7. Fans gather outside Ghostbusters firehouse in N.Y.C. to pay tribute to Harold Ramis
8. Man survives a shooting because the Bible in his top pocket stopped two bullets
9. Ukraine's toppling craze reaches even legendary Russian commander, who fought Napoleon
10. Newcastle City Hall. Impressive booking first from bottom on the left...
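
The cluster scoring and headline selection described under "Ranking clusters" and "Selecting topic-headlines" can be sketched as below. This is a hedged illustration rather than the released implementation. The df-idf_t formula is written in the form we recall from the BNgram work (current document frequency against the mean over the previous t windows) and should be checked against that paper. The function names, the data layout, the weight of 1.0 for non-entity terms, and normalizing by dividing by cluster size are assumptions made for the example.

```python
import math


def df_idf(df_now, df_history):
    """Burstiness of a term: its current document frequency against the mean
    document frequency over the previous t windows (t = len(df_history))."""
    t = max(len(df_history), 1)
    return (df_now + 1.0) / (math.log(sum(df_history) / t + 1.0) + 1.0)


def cluster_score(centroid_terms, size, df_now, df_history, entities,
                  entity_boost=2.5):
    """Highest term score over the centroid terms, divided by cluster size."""
    best = 0.0
    for term in centroid_terms:
        # Assumption: non-entity terms get a neutral weight of 1.0.
        boost = entity_boost if term in entities else 1.0
        burst = df_idf(df_now.get(term, 0), df_history.get(term, []))
        best = max(best, boost * burst)
    return best / size


def rank_clusters(clusters, df_now, df_history, entities,
                  min_size=10, top_k=20):
    """clusters: dicts with 'terms' (centroid terms) and 'tweets', a list of
    (timestamp, text) pairs. Returns (score, headline) pairs, best first."""
    ranked = []
    for c in clusters:
        if len(c["tweets"]) < min_size:
            continue  # size constraint: drop small clusters
        score = cluster_score(c["terms"], len(c["tweets"]),
                              df_now, df_history, entities)
        headline = min(c["tweets"])[1]  # earliest tweet becomes the headline
        ranked.append((score, headline))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]
```

Here df_now maps each term to its tweet count in the current window, df_history maps each term to its counts in the previous windows (four of them for t=4), and entities is the set of terms tagged as entities. The resulting headlines would then be re-clustered to merge fragmented topics.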
Discussion
• Parameter choices
  - Filtering parameters dependent on window size (number of tweets in window)
• Unigrams vs N-grams (N>1)
  - Bi-grams + N-grams good for content + scalability
• Cluster ranking
  - (Normalized) df-idf_t seems a good choice, but cluster-score may benefit from using tweet importance (based on user importance)
• Topic precision (~80%, based on googling topic-headlines)
  - On average about 8-9 out of 10 headlines are published news
• Efficiency aspect
  - System takes about 0.5 min per 15 min slot (scales well for larger time slots)

Conclusion
• Encouraging results in using Twitter stream as a news aggregator (truly global)
• Both sides now: media outlets (CNN, BBC, Reuters, AP) and regular people post updates on (breaking) stories
• We need a good topic-benchmark to refine techniques (e.g., comprehensive set of ground truth topics)

Future Work
• Improve retrieval of newsworthy stories
  - E.g., ‘This is what happens when you put two pit bulls in a photo booth’ vs ‘Ukraine currency hits record low amid uncertainty’
  - May depend on type of stories we are after (BBC vs Sun)
  - Tweet/user importance filtering may help
  - News streamed in same time frame may help (vocabulary selection)
• Fragmentation due to breaking news stories
  - Same story discussed from different angles:
      Lee Rigby murders: Michael Adebolajo given whole-life jail term
      Lee Rigby murder sentence expected shortly. Pictured: the scene outside the Old Bailey in London
      Judge Mr Justice Sweeney says behaviour of Lee #Rigby's killers was "sickening and pitiless"
  - Combination of tweet and term clustering may help (e.g., cluster headlines in term rather than tweet space)

Thank You!
Open source code: https://github.com/heerme/twitter-topics

Different Newspapers in UK
In the TV comedy series Yes Minister, fictional Prime Minister Jim Hacker explains to his staff the readership of the main newspapers:
“The Daily Mirror is read by people who think they run the country, The Guardian is read by people who think they ought to run the country, The Times is read by people who actually do run the country, The Daily Mail is read by the wives of the people who run the country, The Financial Times is read by people who own the country, The Morning Star is read by people who think the country ought to be run by another country, and The Daily Telegraph is read by people who think it is.”
Source: Wikipedia