presentation - SNOW Workshop

Event Detection in Twitter using Aggressive
Filtering and Hierarchical Tweet Clustering
Georgiana Ifrim, Bichen Shi, Igor Brigadir
Insight Centre for Data Analytics
University College Dublin
SNOW Data Challenge
April 8, 2014
Outline
Background
Method Proposed
Method Details
Results
Future Work
Background
Social media outlets (e.g., Twitter) play an increasing role in
the cycle of news production
Journalists use Twitter for news selection and presentation
Twitter:
• An endless, real-time, global stream of news
• Large scale and very noisy (redundant, messy content)
Challenge: Extract (close to real-time) newsworthy
topics/events/stories from the Twitter stream, in a format usable by
news professionals (e.g., topic-timestamp, topic-headline, topic-tags, tweet-ids, photo-urls)
Challenge
From this:
#Obama #follow #followme #followforfollow #followme #follower #followers
#alwaysfollowback #followbackalways #teamfollowback
I VOTED !!! #OBAMA http://instagram.com/p/RsoNuMgLkr/
@BarackObama #TeaamObama !!!! ✊🎋🎉🎊🇺🇸
Om 12u zou de eerste uitslag binnen zijn. Nu nog steeds niks. Dit trek ik niet. Wekker over
4u en we kijken dan wel. #obama #forward
My President is Black ★★★★★▄▄▄▄▄▄▄▄▄▄ ★★★★★▄▄▄▄▄▄▄▄▄▄
★★★★★▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄
▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ #Obama2012 #Retweet
voted at school! @barackobama I love you! #Forward2012 🇺🇸💙
http://instagr.am/p/RtONyRC9Tm/
Romney Romney Romney Romney Romney Romney Romney Romney!!!!!!!!!!!!!!!!!!!!!🇺🇸🇺🇸🇺🇸
🇺🇸 #RomneyRyan2012
Challenge
To this:
Obama wins Vermont
Romney wins Kentucky
Bernie Sanders wins Senate seat in Vermont
Romney wins Indiana
Method Proposed
1. Aggressive Data Filtering (to remove noise, to scale)
2. Hierarchical Clustering of Tweets + Dendrogram Cutting (to
obtain clusters without need of knowing #clusters a-priori)
3. Ranking of Clusters (to favor news-like topics)
4. Extracting Topic-Headlines (usable information)
5. Re-clustering Topic-Headlines (to remove topic fragmentation)
6. Extracting Final Topics (as presented to the user)
Method Details
Software
Collecting Twitter streams
SNOW Challenge Code (based on Twitter4J API)
All other development (https://github.com/heerme/twitter-topics)
Python 2.7 + libraries: scipy, numpy, sklearn, nltk, json
Tweet-NLP: CMUTweetTagger (trained on tweets, entity detection)
Efficient clustering: fastcluster (C++ lib, interface to Python/R)
Data Collection
US Presidential Elections 2012
Collected from tweet ids (06-11-2012, 23:30 to 07-11-2012 06:51)
1,084,200 raw tweets text (English + non-English, 252 MByte)
Syria, Ukraine, Bitcoin 2014
Collected from keywords + user ids (25-02-2014, 17:30 to 26-02-2014,
18:15)
1,088,593 raw tweets JSON (English + non-English, 4.3 GByte)
943,175 English tweets JSON (3.8 GByte)
943,175 tweets text (extract subset of fields from JSON object,
240 MByte)
Replace re-tweet text with original tweet text
Data Pre-processing
Tweet filtering
Clean tweet-text. Remove: urls, user mentions, hashtags, punctuation, digits
Tokenize the remaining text
Rebuild tweet by appending: user mentions (@) + hashtags (#) + text tokens
Remove tweets based on structure (remove if too many @, # or too few text tokens)
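
The tweet-filtering steps above can be sketched as follows. This is a minimal illustration, not the code from the twitter-topics repository: the function name `normalize_tweet`, the regular expressions, and the three thresholds (`max_mentions`, `max_hashtags`, `min_tokens`) are assumptions chosen for the example, since the slides do not give the exact values.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# ASCII punctuation and digits are replaced by spaces before tokenizing.
STRIP_TABLE = str.maketrans({c: " " for c in string.punctuation + string.digits})


def normalize_tweet(text, max_mentions=3, max_hashtags=3, min_tokens=3):
    """Return the rebuilt tweet, or None if its structure looks like noise.

    The thresholds are hypothetical defaults, not values from the slides.
    """
    no_urls = URL_RE.sub(" ", text)
    mentions = [m.lower() for m in MENTION_RE.findall(no_urls)]
    hashtags = [h.lower() for h in HASHTAG_RE.findall(no_urls)]
    # Strip mentions and hashtags, then punctuation and digits, then tokenize.
    body = HASHTAG_RE.sub(" ", MENTION_RE.sub(" ", no_urls))
    tokens = body.translate(STRIP_TABLE).lower().split()
    # Structural filter: too many @ or #, or too few text tokens.
    if len(mentions) > max_mentions or len(hashtags) > max_hashtags:
        return None
    if len(tokens) < min_tokens:
        return None
    # Rebuild: user mentions + hashtags + text tokens.
    return " ".join(mentions + hashtags + tokens)
```

With these assumed thresholds, the hashtag-only and near-empty tweets from the "From this" slide are dropped, while a news-like tweet survives in rebuilt form.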
Term filtering
Keep only bi-grams + tri-grams occurring in at least a percentage of tweets in time
window (e.g., min(10, n_tweets_in_window * 0.0025))
Tweet-Term Matrix (binary)
Remove out-of-vocabulary tweets and very short tweets (with fewer than 5 tokens)
Retains about 20% of the original raw tweet stream (in each time window)
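
A rough sketch of the term filtering and the binary tweet-term matrix, in plain Python. The names `ngrams` and `build_binary_matrix` are hypothetical, and `min_df` is left as a parameter: the slide gives min(10, n_tweets_in_window * 0.0025) as an example threshold, and the caller would supply whatever value is used in practice. Treating "tokens" as text tokens (rather than n-gram terms) for the length filter is also an assumption.

```python
from collections import Counter


def ngrams(tokens, sizes=(2, 3)):
    """All contiguous bi-grams and tri-grams of a token list, as strings."""
    return [" ".join(tokens[i:i + n])
            for n in sizes
            for i in range(len(tokens) - n + 1)]


def build_binary_matrix(tweets, min_df, min_tokens=5):
    """Build a binary tweet-term matrix for one time window.

    tweets: list of token lists.
    min_df: minimum number of tweets a term must occur in to be kept.
    Returns (vocabulary, kept_indices, rows), where rows[i] is a 0/1 list
    over the vocabulary for the tweet at position kept_indices[i].
    """
    term_sets = [set(ngrams(t)) for t in tweets]
    # Document frequency: number of tweets each n-gram occurs in.
    df = Counter(term for terms in term_sets for term in terms)
    vocabulary = sorted(term for term, count in df.items() if count >= min_df)
    kept, rows = [], []
    for idx, (tokens, terms) in enumerate(zip(tweets, term_sets)):
        row = [1 if term in terms else 0 for term in vocabulary]
        # Drop very short tweets and tweets with no in-vocabulary term.
        if len(tokens) >= min_tokens and any(row):
            kept.append(idx)
            rows.append(row)
    return vocabulary, kept, rows
```

In practice a sparse representation (for example the one sklearn produces) would be preferable for windows with many tweets; dense lists are used here only to keep the sketch dependency-free.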
Hierarchical Tweet Clustering
Computing tweet pairwise-distance
Scale and normalize tweet-term matrix
Cosine distance as the metric (Euclidean gives similar results); sklearn + scipy
Computing hierarchical clustering => dendrogram
fastcluster C++ library (interface to R and Python)
Dendrogram cutting
Cut at 0.5 distance threshold (R offers better alternatives that use the
topology of the dendrogram, e.g., Dynamic Tree Cut, which only requires the
minimum number of examples in each cluster)
One cluster = one potential topic
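
The clustering and dendrogram-cutting steps can be illustrated with scipy alone. This is a sketch under assumptions: the function name `cluster_tweets` is hypothetical, row-wise L2 normalization stands in for "scale and normalize", and average linkage is a guess, since the slides do not name the linkage method. The fastcluster package provides a `linkage` function with a compatible interface that could be swapped in for speed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_tweets(matrix, threshold=0.5, method="average"):
    """Return one integer cluster label per row of a tweet-term matrix."""
    X = np.asarray(matrix, dtype=float)
    # L2-normalize rows. All-zero rows must be removed beforehand,
    # because cosine distance is undefined for them.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Condensed vector of pairwise cosine distances.
    distances = pdist(X, metric="cosine")
    # Agglomerative clustering; the result encodes the dendrogram.
    dendrogram = linkage(distances, method=method)
    # Cut the dendrogram: merges above `threshold` are not applied.
    return fcluster(dendrogram, t=threshold, criterion="distance")
```

Because the cut is made at a distance threshold, the number of clusters does not have to be chosen in advance; it follows from the data in each window.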
Hierarchical Tweet Clustering
Ranking clusters
Retain only clusters with at least 10 tweets (size constraint)
Score each cluster based on:
Compute cluster-centroid (vector of terms)
Get maximum term-score (over all centroid terms)
Term score: entity_score * burstiness_score
Assign the highest term score as cluster score
Normalize cluster score by cluster size
Entity score = 2.5 for entity-terms (identified with CMUTweetTagger)
Burstiness score = df-idf_t, with t=4 (prior work on Bngram)
Interesting extensions to cluster score: article_score, tweet_importance
based on trustworthiness or clout of users issuing tweet
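
The scoring recipe above might look like this in code. Several details are assumptions rather than facts from the slides: the df-idf_t formula is written as recalled from the Bngram work (current document frequency divided by the log-damped average over the previous t windows), non-entity terms get a boost of 1.0, centroid weights are used only to decide which terms are present, and "normalize by cluster size" is read as division. The names `df_idf` and `cluster_score` are hypothetical.

```python
import math


def df_idf(df_now, df_previous):
    """Burstiness of a term in the current window.

    df_now: document frequency in the current window.
    df_previous: document frequencies in the previous windows;
    its length plays the role of t (t=4 on the slide).
    """
    t = len(df_previous)
    return (df_now + 1) / (math.log(sum(df_previous) / t + 1) + 1)


def cluster_score(centroid, df_now, df_history, entities, size, entity_boost=2.5):
    """Highest term score over the centroid terms, divided by cluster size.

    centroid: dict term -> centroid weight (terms with weight 0 are ignored).
    df_now: dict term -> document frequency in the current window.
    df_history: dict term -> list of document frequencies in previous windows.
    entities: set of terms tagged as entities.
    """
    best = 0.0
    for term, weight in centroid.items():
        if weight <= 0:
            continue
        # Boost of 1.0 for non-entity terms is an assumption.
        boost = entity_boost if term in entities else 1.0
        history = df_history.get(term, [0, 0, 0, 0])
        best = max(best, boost * df_idf(df_now.get(term, 0), history))
    return best / size
```

A term that appears suddenly (no occurrences in earlier windows) scores much higher than one that is equally frequent now but was already common before, which is what favors breaking news over background chatter.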
Hierarchical Tweet (Re)Clustering
Selecting topic-headlines
Take top-20 ranked clusters as potential topics
Select first (time-wise) tweet in each cluster as topic-headline
Re-cluster headlines
Hierarchical clustering of headlines
Score headline-clusters using the highest-scoring headline
Rank headline-clusters, take top-10
Final topics
Select first (published) headline in each cluster, present raw tweet (less
url) to user
Gather all distinct keywords of headlines in headline-cluster to create
topic-tags
Tweet ids for topic: the ids of corresponding headlines. If headlines do
not cluster, only one tweet id
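
A small sketch of headline selection and topic-tag gathering. The data layout (tuples of timestamp, tweet id, text), the function names `select_headlines` and `topic_tags`, and the use of whitespace-separated lowercase words as "keywords" are assumptions for illustration; the re-clustering of headlines itself would reuse the same hierarchical clustering as for tweets.

```python
def select_headlines(clusters, scores, top_k=20):
    """Pick the earliest tweet of each of the top_k highest-scoring clusters.

    clusters: dict cluster_id -> list of (timestamp, tweet_id, text).
    scores: dict cluster_id -> cluster score.
    """
    ranked = sorted(clusters, key=lambda cid: scores[cid], reverse=True)[:top_k]
    # min() on tuples compares the timestamp first, so it returns the earliest tweet.
    return [min(clusters[cid]) for cid in ranked]


def topic_tags(headline_texts):
    """Distinct keywords over the headlines of one headline-cluster,
    in first-seen order."""
    seen = []
    for text in headline_texts:
        for word in text.lower().split():
            if word not in seen:
                seen.append(word)
    return seen
```

Choosing the first tweet in time as the headline tends to surface the original report rather than later reactions to it.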
Results
Top-10 topics first time window in US stream (07-11-2012 00:00 – 00:10)
1. WASHINGTON (AP) - Obama wins Vermont; Romney wins Kentucky. #Election2012
2. Not a shocker NBC reporting #Romney wins Indiana & Kentucky #Obama wins Vermont
3. RT @SkyNewsBreak: Sky News projection: Romney wins Kentucky. #election2012
4. AP RACE CALL: Democrat Peter Shumlin wins governor race in Vermont. #Election2012
5. CNN Virginia exit poll: Obama 49%, Romney 49% #election2012
6. Mitt Romney Losing in Massachusetts a state that he governed. Why vote for him when
his own people don't want him? #Obama2012
7. Twitter is gonna be live and popping when Obama wins! #Obama2012
8. INDIANA RESULTS: Romney projected winner (via @NBC) #election2012
9. If Obama wins I'm going to celebrate... If Romney wins I'm going to watch Sesame Street
one last time #Obama2012
10. #election2012 important that Romney won INdependents in Virginia by 11 pts. With
parties about even, winning Inds is key
Results
Top-10 topics first time window in Syria stream (25-02-2014 18:00 – 18:15)
1. The new, full Godzilla trailer has roared online
2. At half-time Borussia Dortmund lead Zenit St Petersburg 2-0.
3. Ukraine Currency Hits Record Low Amid Uncertainty: Ukrainian currency, the hryvnia,
hits all-time low against ...
4. Ooh, my back! Why workers' aches pains are hurting the UK economy
5. Uganda: how campaigners are preparing to counter the anti-gay bill
6. JPost photographer snaps what must be the most inadvertantly hilarious political picture
of the decade
7. Fans gather outside Ghostbusters firehouse in N.Y.C. to pay tribute to Harold Ramis
8. Man survives a shooting because the Bible in his top pocket stopped two bullets
9. Ukraine's toppling craze reaches even legendary Russian commander, who fought
Napoleon
10. Newcastle City Hall. Impressive booking first from bottom on the left...
Discussion
Parameter choices
Filtering parameters depend on window size (number of tweets in window)
Unigrams vs N-grams (N>1)
Bi-grams + tri-grams good for content + scalability
Cluster ranking
(Normalized) Df-idf_t seems a good choice, but cluster-score may benefit
from using tweet importance (based on user importance)
Topic Precision (~80%, based on googling topic-headlines)
On average about 8-9 out of 10 headlines are published news
Efficiency Aspect
System takes about 0.5 min per 15-min slot (scales well for larger time
slots)
Conclusion
Encouraging results in using Twitter stream as a news
aggregator (truly global)
Both sides now: media outlets (CNN, BBC, Reuters, AP) and
regular people post updates on (breaking) stories
We need a good topic-benchmark to refine techniques (e.g.,
comprehensive set of ground truth topics)
Future Work
Improve retrieval of newsworthy stories
-E.g., ‘This is what happens when you put two pit bulls in a photo booth’, vs
‘Ukraine currency hits record low amid uncertainty’
-May depend on type of stories we are after (BBC vs Sun)
-Tweet/user importance filtering may help
-News streamed in same time frame may help (vocabulary selection)
Fragmentation due to breaking news stories
-Same story discussed from different angles:
Lee Rigby murders: Michael Adebolajo given whole-life jail term
Lee Rigby murder sentence expected shortly. Pictured: the scene outside the Old Bailey in
London
Judge Mr Justice Sweeney says behaviour of Lee #Rigby's killers was "sickening and pitiless"
-Combination of tweet and term clustering may help (e.g., cluster headlines in
term rather than tweet space)
Thank You!
Open source code:
https://github.com/heerme/twitter-topics
Different Newspapers in UK
In the TV comedy series Yes Minister, fictional Prime Minister
Jim Hacker explains to his staff the readership of the main
newspapers:
“The Daily Mirror is read by people who think they run the country,
The Guardian is read by people who think they ought to run the
country, The Times is read by people who actually do run the
country, The Daily Mail is read by the wives of the people who
run the country, The Financial Times is read by people who
own the country, The Morning Star is read by people who think
the country ought to be run by another country, and The Daily
Telegraph is read by people who think it is.” (Source: Wikipedia)