TwitterMonitor: Trend Detection over the Twitter Stream
EvenTweet: Online Localized Event Detection from Twitter
Presenters:
Liu, Ya
Tian, Yujia
Pham, Anh
TwitterMonitor: Trend Detection over
the Twitter Stream
Michael Mathioudakis, Nick Koudas
INTRODUCTION
 TwitterMonitor is a system that performs trend detection over
the Twitter stream.
 It identifies emerging topics on Twitter in real time and
provides analytics that synthesize an accurate description of
each topic.
Michael Mathioudakis, Nick Koudas: TwitterMonitor: Trend Detection over the Twitter Stream.
In: SIGMOD Conference, pp. 1155-1158, 2010
TREND DETECTION AND ANALYSIS
 Trends are detected in two steps and analyzed in a third:
 1) Identify ‘bursty’ keywords.
 2) Group bursty keywords into trends.
 3) Extract additional information about each trend to discover
its interesting aspects.
Detecting Bursty Keywords
 Bursty keyword: a keyword that appears at an unusually high rate in the stream.
 A burst suggests that a new topic has emerged and is worth exploring further.
 Algorithm: QueueBurst
 1) One-pass.
 2) Real-time.
 3) Adjustable against ‘spurious’ bursts.
 4) Adjustable against spam.
 5) Theoretically sound.
From Bursty Keywords to Trends
 Group keywords together.
 At every moment t, the system computes the set of bursty keywords K_t and
divides it into subsets K_t^i, each of which is a ‘trend’.
 The grouping algorithm, GroupBurst, is based on co-occurrences.
 It retrieves a few minutes’ history of tweets and groups
keywords together if they co-occur in a relatively large number
of recent tweets.
 A greedy strategy that produces groups in a small number of
steps.
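The co-occurrence grouping described above can be sketched as follows. This is a minimal illustration, not the GroupBurst implementation: the function name `group_bursty_keywords`, the `min_cooccur` threshold, and the merge rule (merge the groups of two bursty keywords that co-occur in at least `min_cooccur` recent tweets, strongest pairs first) are all assumptions.

```python
from collections import Counter
from itertools import combinations


def group_bursty_keywords(bursty, recent_tweets, min_cooccur=2):
    """Greedily group bursty keywords that co-occur in recent tweets.

    bursty: set of bursty keywords (the set K_t).
    recent_tweets: iterable of token lists from the last few minutes.
    min_cooccur: hypothetical stand-in for "a relatively large number".
    """
    # Count in how many recent tweets each pair of bursty keywords co-occurs.
    pair_counts = Counter()
    for tokens in recent_tweets:
        present = sorted(bursty.intersection(tokens))
        pair_counts.update(combinations(present, 2))

    # Start with one group per keyword, then merge the strongest pairs first.
    group_of = {k: {k} for k in bursty}
    for (a, b), n in pair_counts.most_common():
        if n < min_cooccur:
            break
        if group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for k in merged:
                group_of[k] = merged

    # Several keywords point at the same set object; keep each group once.
    unique = {id(g): g for g in group_of.values()}
    return sorted(sorted(g) for g in unique.values())
```

Each returned list plays the role of one subset K_t^i; a keyword that co-occurs with nothing stays in a group of its own.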
Trend Analysis
 Compose a more accurate description of each trend:
 Identify more keywords associated with it.
- Context extraction algorithms (PCA, SVD, etc.) search the recent history and report the most correlated keywords.
- Grapevine’s entity extractor identifies the entities mentioned in the trend.
 Frequently cited sources are added to the trend description.
 Frequent geographical origins of the tweets are identified.
 A chart is produced for each trend and kept up to date.
Architecture
[Figure: TwitterMonitor architecture, including the Index module; from p. 1156 of the SIGMOD 2010 paper]
Architecture: Back-End
 The StreamListener module receives, via the Twitter API, a sample
of 10M out of the 50M tweets posted per day.
 It then separates each tweet into fields and exports two
feeds:
- All tweets, with all their fields, go to the Index module.
- Only the text and timestamp of each tweet go to the Bursty Keywords
Detection module.
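The two feeds can be pictured with a small sketch. The `Tweet` field names and the `split_feeds` helper are hypothetical; the slides do not list the actual fields or interfaces of the StreamListener module.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Illustrative fields only; the real field set is not given in the slides.
    tweet_id: int
    user: str
    text: str
    timestamp: float


def split_feeds(tweets):
    """Return (index_feed, burst_feed) for a batch of parsed tweets."""
    tweets = list(tweets)
    index_feed = tweets                                    # all fields
    burst_feed = [(t.text, t.timestamp) for t in tweets]   # text and time only
    return index_feed, burst_feed
```

The point of the split is that burst detection only needs a lightweight view of each tweet, while the index keeps everything for later trend analysis.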
Architecture: Back-End(Cont.)
 After bursty keywords are identified and grouped into
trends, the Index is contacted by the Trend Analysis module to
retrieve information on tweets that belong to each trend.
Architecture: Front-End
[Figure: TwitterMonitor front-end; from p. 1157 of the SIGMOD 2010 paper]
Architecture: Front-End (Cont.)
 A webpage reports recent trends in real time
 An interface allows users to rank trends by recency or
current activity rate and submit their own short description
for trends.
 An additional tab displays daily trends.
Demonstration
 Every trend will be represented by its entities and its related bursty
keywords.
 The audience will be able to use the interface to
acquire more information:
- They will be shown additional keywords and can skim through
representative tweets.
- They will be able to track a trend’s popularity over time and spot
its origin.
- They will interact with the system by ranking the displayed
trends according to different criteria and submitting descriptions.
EvenTweet: Online Localized Event
Detection from Twitter
Hamed Abdelhaq, Christian Sengstock, and Michael
Gertz
1. Introduction
2. Localized Event Detection
 Temporal Keyword Extraction
 Spatial Keyword Identification
 Keyword Clustering
 Cluster Scoring
3. System overview
4. Demonstration
INTRODUCTION
 EvenTweet is a system that detects localized events from a stream
of tweets in real time.
 Only about 1% of tweets are georeferenced.
 It adopts a continuous analysis of the most recent tweets
within a time-based sliding window.
 Each event is described by 1) related keywords and 2) an estimate of its start
time and geographic location.
Hamed Abdelhaq, Christian Sengstock, Michael Gertz: EvenTweet: Online Localized Event Detection from Twitter.
PVLDB 6(12): pp. 1326-1329 (2013)
INTRODUCTION
 Tracks the evolution of events over time at a fine-grained temporal
resolution: a scoring scheme gives each event a score
over time.
 Does not estimate geo-coordinates for non-geo-tagged tweets,
but can still identify localized events using a possibly small
amount of geo-tagged tweets:
- Both geo-tagged and non-geo-tagged tweets are used to
identify the words that best describe events.
- Only geo-tagged tweets are used to estimate the spatial
distribution of such words.
Localized Event Detection
Basic Definitions
 Event: a phenomenon that stimulates people to post
messages for a certain period of time.
 Localized event: an event that happens within a small region,
i.e., has a small spatial extent
(e.g., concerts, soccer matches, road works).
Localized Event
 A localized event is described as a tuple:
le = (el, et, K)
 el is the event location, represented as a small set of connected
rectangular cells.
 et is the start time.
 K is a set of words frequently published during the event time
and at that location.
Online Detection
Basic Notation:
 Each tweet tw = (W, uid, l, t)
 W: a set of words
 uid: a user id
 l = (lon, lat): a geographic location
 t: timestamp
 The timeline is divided into a sequence of equal-length time
frames (…, f_{c-1}, f_c), where f_c (written fc in what follows) denotes the current time frame.
 Each time frame represents a short time interval during which
tweets are posted.
Basic Notation (cont.)
 We use a time-based sliding window win^k_{fc} composed of k
time frames, with fc as its end point.
 The detection procedure of EvenTweet is triggered every
time a new time frame elapses.
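The notation above maps naturally onto a small data structure. The sketch below is an illustration under stated assumptions, not EvenTweet's code: the frame length `FRAME_SECONDS`, the class names, and the simplifications (tweets arrive in timestamp order; frames without tweets are skipped) are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

FRAME_SECONDS = 600  # hypothetical frame length; the slides give no value


@dataclass(frozen=True)
class Tweet:
    words: FrozenSet[str]                # W
    uid: int                             # user id
    loc: Optional[Tuple[float, float]]   # (lon, lat), None if not geo-tagged
    t: float                             # timestamp in seconds


def frame_index(t, frame_seconds=FRAME_SECONDS):
    """Index of the equal-length time frame that contains timestamp t."""
    return int(t // frame_seconds)


class SlidingWindow:
    """Keeps the tweets of the k most recent non-empty time frames."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)    # entries: (frame index, [tweets])

    def add(self, tweet):
        """Add a tweet; return True when it opens a new frame.

        True means the previous frame has elapsed, which is the moment
        the detection procedure would be triggered.
        """
        c = frame_index(tweet.t)
        if self.frames and self.frames[-1][0] == c:
            self.frames[-1][1].append(tweet)
            return False
        previous_frame_elapsed = bool(self.frames)
        self.frames.append((c, [tweet]))
        return previous_frame_elapsed
```

Because the deque has `maxlen=k`, appending a new frame silently drops the oldest one, which is what makes the window slide.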
Temporal Keyword Extraction
 Extract words that show a bursty frequency in the
current time frame (these words are called keywords, Yc).
 Given a set of words Wc from the tweets published during the
recent time frame fc, extract a subset Yc ⊆ Wc which
represents words likely to describe localized events.
Temporal Keyword Extraction
(cont.)
 Use the discrepancy paradigm to extract keywords based on their
burstiness.
 Define:
 u(w, c): the usage value of word w during time frame fc, based on the (normalized) number of users publishing tweets containing
word w.
Temporal Keyword Extraction
(cont.)
 In addition,
 histw = (u(w, 1), u(w, 2), …, u(w, m)) is a fixed historical sequence of
usage values for w collected before the current time frame fc, such that
m < c.
 It is used when the system needs to describe the normal behavior of
word w over previous time frames.
Temporal Keyword Extraction
(cont.)
 The discrepancy paradigm measures the deviation
between the word usage value u(w, c) in the current
time frame and an expected word usage baseline,
b(w), which is estimated from histw. The values in histw are assumed to be drawn from a
Gaussian distribution with mean b(w).μ and
standard deviation b(w).σ.
 The higher the deviation, the higher the burstiness degree.
Temporal Keyword Extraction
(cont.)
 The burstiness degree of a word w is its z-score:
b_degree(w, c) := (u(w, c) − b(w).μ) / b(w).σ
 Words whose usage lies more than two
standard deviations above the mean (b_degree(w, c) > 2) are chosen as keywords.
 Keywords observed for the first time will have μ=0 and
σ=0.
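The z-score test above is easy to sketch. The code below is an illustration only: estimating the baseline with the population mean and standard deviation of histw, and treating a word with σ = 0 as maximally bursty whenever its usage exceeds μ, are assumptions made here so that the division is well defined; the slides do not say how EvenTweet handles that case.

```python
import statistics


def b_degree(u_current, hist):
    """Burstiness degree of a word: z-score of its current usage.

    u_current: usage value u(w, c) in the current time frame.
    hist: historical usage values hist_w (possibly empty for a new word).
    """
    mu = statistics.fmean(hist) if hist else 0.0
    sigma = statistics.pstdev(hist) if hist else 0.0
    if sigma == 0.0:
        # Assumption: with no spread in the history, any usage above the
        # mean counts as a burst; the z-score itself is undefined here.
        return float("inf") if u_current > mu else 0.0
    return (u_current - mu) / sigma


def extract_keywords(usage, history, threshold=2.0):
    """Return Y_c: words whose burstiness degree exceeds the threshold.

    usage: dict word -> u(w, c); history: dict word -> list of past usage.
    """
    return {w for w, u in usage.items()
            if b_degree(u, history.get(w, [])) > threshold}
```

For example, a word with history (0.01, 0.02, 0.03) and current usage 0.05 has a burstiness degree of roughly 3.67 and is kept, while the same word at usage 0.025 is not.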
Spatial Keyword Identification
 Find keywords that are highly localized.
 Only georeferenced tweets are used.
 During the sliding window win^k_{fc}, for each keyword ki and each cell g of the grid G:
- Usage ratio of ki in g:
  (# users using ki in cell g) / (# users in cell g)
- Density of keyword ki in cell g:
  (usage ratio of ki in g) / (usage ratio of ki in G)
- Repeating this for all cells in G gives Si, the discrete
spatial distribution of ki, also called the spatial
signature of ki.
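A spatial signature can be computed along these lines. This is a sketch under assumptions: "usage ratio of ki in G" is read here as the sum of the per-cell usage ratios (so the densities sum to 1), and the helper names `grid_cell` and `spatial_signature`, the tuple layout of the input, and the 1-degree cell size are hypothetical.

```python
import math
from collections import defaultdict


def grid_cell(loc, size=1.0):
    """Map (lon, lat) to the id of the square grid cell that contains it."""
    lon, lat = loc
    return (math.floor(lon / size), math.floor(lat / size))


def spatial_signature(keyword, geo_tweets, cell_of=grid_cell):
    """Discrete spatial distribution S_i of a keyword over grid cells.

    geo_tweets: iterable of (uid, words, (lon, lat)) for geo-tagged tweets
    in the current sliding window.
    """
    users_in_cell = defaultdict(set)
    users_with_kw = defaultdict(set)
    for uid, words, loc in geo_tweets:
        g = cell_of(loc)
        users_in_cell[g].add(uid)
        if keyword in words:
            users_with_kw[g].add(uid)

    # Usage ratio per cell: users using the keyword / all users in the cell.
    ratio = {g: len(users_with_kw[g]) / len(users)
             for g, users in users_in_cell.items()}
    total = sum(ratio.values())   # assumed meaning of "usage ratio in G"
    if total == 0:
        return {}
    # Density per cell; cells with zero density are left out (sparse vector).
    return {g: r / total for g, r in ratio.items() if r > 0}
```

Counting users rather than tweets keeps a single very active account from dominating a cell.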
Spatial Keyword Identification
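 Only georeferenced tweets are used.
- Calculate the entropy H(Si) of each spatial signature.
- Discard all keywords with entropy larger than a
threshold ρ. Why? A high entropy means the keyword is spread over many cells, so it is not localized.
- This leaves Yc, the set of filtered keywords.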
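The entropy filter can be sketched in a few lines. This is an illustration, assuming Shannon entropy with base-2 logarithms and spatial signatures stored as dicts from cell to density; the function names and the choice of log base are assumptions, not details from the slides.

```python
import math


def entropy(signature):
    """Shannon entropy (in bits) of a spatial signature {cell: density}."""
    return -sum(p * math.log2(p) for p in signature.values() if p > 0)


def filter_localized(signatures, rho):
    """Keep keywords whose signature entropy does not exceed rho.

    signatures: dict keyword -> {cell: density}; keywords with an empty
    signature (no geo-tagged usage) are dropped as well.
    """
    return {k for k, s in signatures.items() if s and entropy(s) <= rho}
```

A keyword concentrated in one cell has entropy 0, while one spread evenly over four cells has entropy 2 bits, so a threshold between the two separates them.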
Keyword Clustering
 Each Si is a vector.
 Clustering event keywords using their Si
 Similarity calculation: Cosine similarity
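Cosine similarity between two spatial signatures can be computed directly on sparse vectors. The sketch below assumes each Si is stored as a dict from cell to density, with missing cells meaning zero; that representation and the function name are assumptions made for illustration.

```python
import math


def cosine_similarity(s_i, s_j):
    """Cosine similarity of two sparse vectors given as {cell: value} dicts."""
    dot = sum(v * s_j.get(g, 0.0) for g, v in s_i.items())
    norm_i = math.sqrt(sum(v * v for v in s_i.values()))
    norm_j = math.sqrt(sum(v * v for v in s_j.values()))
    if norm_i == 0 or norm_j == 0:
        return 0.0   # convention chosen here for empty signatures
    return dot / (norm_i * norm_j)
```

Two keywords used in the same cells score close to 1, and keywords with no cell in common score 0.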
Cosine Similarity,Wikipedia, http://en.wikipedia.org/wiki/Cosine_similarity
Keyword Clustering
- There is a distance threshold Т.
- If a new keyword falls outside this
threshold, it forms a new cluster by itself.
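A threshold-based, single-pass clustering of this kind could look as follows. This is a sketch, not the clustering used by EvenTweet: measuring distance as 1 minus cosine similarity, comparing each keyword to a cluster centroid, and using the sum of member signatures as that centroid are all assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity for sparse vectors given as {cell: value}."""
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def cluster_keywords(signatures, threshold):
    """Assign each keyword to the nearest cluster, or start a new one.

    signatures: dict keyword -> spatial signature {cell: density}.
    threshold: maximum distance for joining an existing cluster.
    Returns a list of clusters, each a list of keywords.
    """
    clusters = []   # each: {"keywords": [...], "centroid": {cell: value}}
    for kw, sig in signatures.items():
        best, best_d = None, None
        for cl in clusters:
            d = cosine_distance(sig, cl["centroid"])
            if best_d is None or d < best_d:
                best, best_d = cl, d
        if best is not None and best_d <= threshold:
            best["keywords"].append(kw)
            # Cosine distance ignores scale, so a running sum works as centroid.
            for g, v in sig.items():
                best["centroid"][g] = best["centroid"].get(g, 0.0) + v
        else:
            clusters.append({"keywords": [kw], "centroid": dict(sig)})
    return [cl["keywords"] for cl in clusters]
```

Unlike k-means, this scheme does not need the number of clusters in advance; the threshold alone decides when a new cluster appears.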
Saed Sayad, Kmeans clustering, http://www.saedsayad.com/clustering_kmeans.htm
Cluster Scoring
 To determine which clusters of keywords are more likely to
refer to localized events, score the clusters and filter out spurious ones.
 To score a cluster:
1. Score each keyword
2. Sum up all scores
Cluster Scoring
1. Score each keyword:
$$\mathrm{k\_score}(i, cl) := o_i \cdot \mathrm{b\_degree}(k_i, e_i) \cdot \mathrm{prominence}(k_i, cl)$$
$$\mathrm{prominence}(k_i, cl) := \frac{cl.o_i}{cl.lt} \cdot \left(1 - \frac{f_c - cl.e_i}{cl.lt}\right)$$
ei: the time frame we’re looking at
oi: the number of times ki was clustered in cl
2. Sum up all scores:
$$\mathrm{score}(cl) := \sum_{i=1}^{|cl.Y|} \mathrm{k\_score}(i, cl)$$
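The scoring formulas translate into code almost one to one. The sketch below follows the formulas as they appear on the slide; the class and parameter names are hypothetical, and since the slides do not define cl.lt, it is treated here simply as a positive number `lt` supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class ClusteredKeyword:
    o: int            # o_i: number of times k_i was clustered in cl
    e: int            # e_i: index of the time frame being looked at
    b_degree: float   # b_degree(k_i, e_i), the burstiness degree


def prominence(kw, lt, fc):
    """prominence(k_i, cl) = (o_i / lt) * (1 - (fc - e_i) / lt)."""
    return (kw.o / lt) * (1 - (fc - kw.e) / lt)


def k_score(kw, lt, fc):
    """k_score(i, cl) = o_i * b_degree(k_i, e_i) * prominence(k_i, cl)."""
    return kw.o * kw.b_degree * prominence(kw, lt, fc)


def score(keywords, lt, fc):
    """score(cl): sum of k_score over all keywords in the cluster."""
    return sum(k_score(kw, lt, fc) for kw in keywords)
```

With `lt = 4` and current frame `fc = 10`, a keyword with o = 2, e = 8 and burstiness 3.0 gets prominence 0.25 and k_score 1.5; the second factor of the prominence shrinks as e falls further behind the current frame.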
System Overview
Demonstration