SNA: Research
Dr. Nawaporn Wisitpongphan
TwitterMonitor:
Trend Detection
over the Twitter
Stream
Michael Mathioudakis, Nick Koudas
Michael Mathioudakis, Nick Koudas: TwitterMonitor: Trend Detection over the Twitter Stream.
In: SIGMOD Conference, pp. 1155-1158, 2010
INTRODUCTION

TwitterMonitor is a system that performs trend detection over the
Twitter stream.

It identifies emerging topics on Twitter in real time and provides
analytics that synthesize an accurate description of each topic.
TREND DETECTION AND
ANALYSIS

Step 1: Trend Detection.

Identifies ‘bursty’ keywords, i.e., keywords that suddenly appear in tweets at an
unusually high rate

Groups bursty keywords into trends where a trend is identified as a set of
bursty keywords that occur frequently together in tweets
“keyword ‘NBA’ may usually appear in 5 tweets per minute,
yet suddenly exhibit a rate of 100 tweets/min. Such ‘bursts’
in keyword frequency are typically associated with sudden
popular interest in a particular topic”
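
The NBA example above can be turned into a tiny check. This is only a naive rate-ratio sketch, not the QueueBurst algorithm used by TwitterMonitor; the function names and the threshold `factor=10.0` are assumptions made for illustration.

```python
def burst_ratio(current_rate: float, usual_rate: float) -> float:
    """Ratio between the current rate and the usual rate of a keyword."""
    # Guard against division by zero for keywords with no usual activity.
    return current_rate / max(usual_rate, 1e-9)


def is_bursty(current_rate: float, usual_rate: float, factor: float = 10.0) -> bool:
    """Flag a keyword whose current rate is at least `factor` times its usual rate.

    `factor` is a hypothetical threshold, not a value from the paper.
    """
    return burst_ratio(current_rate, usual_rate) >= factor


# 'NBA' usually appears in 5 tweets/min and suddenly reaches 100 tweets/min.
print(is_bursty(current_rate=100, usual_rate=5))
```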

Step 2: Trend Analysis.

Extracts additional information from the tweets that belong to a trend to discover interesting aspects of it.
QueueBurst Algorithm:
Detecting Bursty Keywords

Algorithm: QueueBurst

1) One-pass.

Stream data need only be read once to declare when a keyword is bursty.

2) Real-time.

Identifies a bursty keyword as it arrives.

3) Adjustable against ‘spurious’ bursts.

In some cases, a keyword may appear in many tweets over a short period of time
simply by coincidence. QueueBurst avoids reporting such instances as real bursts.

4) Adjustable against spam.

Ignores spam users: spam user groups repetitively generate large numbers of similar
tweets.

5) Theoretically sound.

QueueBurst is based on queuing theory results.
GroupBurst: From Bursty
Keywords to Trends

Group keywords together.

At every moment t, computes the set of bursty keywords K_t and then divides it into
disjoint subsets K_t^i, each of which is a ‘trend’.

Proposes GroupBurst, based on keyword co-occurrences.

Retrieves a few minutes’ history of tweets and groups keywords
together if they co-occur in a relatively large number of recent tweets.

A greedy strategy that produces groups in a small number of steps.
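
The grouping idea can be sketched as follows. This is a minimal illustration of greedy co-occurrence grouping, not the authors' GroupBurst implementation; the function name, the input shapes, and the `min_cooccurrence` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def group_bursty_keywords(bursty, recent_tweets, min_cooccurrence=2):
    """Greedily merge bursty keywords that co-occur in enough recent tweets.

    bursty: iterable of bursty keywords.
    recent_tweets: list of word sets, one per recent tweet.
    min_cooccurrence: hypothetical threshold on the number of shared tweets.
    """
    bursty = set(bursty)

    # Count, for each pair of bursty keywords, the tweets containing both.
    pair_counts = Counter()
    for words in recent_tweets:
        present = sorted(bursty & set(words))
        pair_counts.update(combinations(present, 2))

    # Every keyword starts in its own group.
    group_of = {k: {k} for k in bursty}

    # Greedy step: visit pairs from most to least frequent and merge groups.
    for (a, b), count in pair_counts.most_common():
        if count < min_cooccurrence:
            break
        if group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for k in merged:
                group_of[k] = merged

    # Return the disjoint groups in a deterministic order.
    groups = {frozenset(g) for g in group_of.values()}
    return sorted(sorted(g) for g in groups)


tweets = [
    {"nba", "lakers", "game"},
    {"nba", "lakers"},
    {"nba", "finals"},
    {"finals", "nba", "tonight"},
    {"earthquake", "chile"},
]
print(group_bursty_keywords({"nba", "lakers", "finals", "earthquake"}, tweets))
```

Each returned group plays the role of one trend; a keyword that never co-occurs often enough with another bursty keyword stays in a group of its own.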
Trend Analysis

Compose description of each trend.

Identify more keywords associated with it.

These may be non-bursty keywords that occur in the same tweets as the bursty ones.

Uses context extraction algorithms (PCA, SVD, etc.) to search the recent
history and report the most correlated words.

Use Grapevine’s entity extractor to identify frequently mentioned
entities in trends.

Frequently cited sources are added to the trend description.

Identifies frequent geographical origins of the tweets that belong to the
trend.

A chart will be produced for each trend.

Show popularity of trends over time.
Architecture
Architecture:
Back-End

The TwitterListener module receives, via the Twitter API, a sample of about 1.2M out
of 6M tweets per day.

It then separates the tweet information into fields and exports two feeds:


Reporting tweets with all their fields to an Index module

Reporting only the text and timestamp of tweets to Bursty Keywords
Detection module
After bursty keywords are identified and grouped into trends, the
Index is contacted by the Trend Analysis module to retrieve
information on tweets that belong to each trend.
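
The two feeds exported by the listener can be pictured with a small sketch. The class `RawTweet`, its field names, and the function `split_feeds` are hypothetical; the slides only state that one feed carries all fields and the other carries text and timestamp.

```python
from dataclasses import dataclass, asdict


@dataclass
class RawTweet:
    # Hypothetical field names; the actual fields are not listed in the slides.
    tweet_id: int
    user: str
    text: str
    timestamp: float
    location: str


def split_feeds(tweet: RawTweet):
    """Return (record for the Index, record for Bursty Keywords Detection)."""
    index_record = asdict(tweet)                   # all fields
    burst_record = (tweet.text, tweet.timestamp)   # text and timestamp only
    return index_record, burst_record


tweet = RawTweet(1, "alice", "NBA finals tonight", 1276000000.0, "Toronto")
index_record, burst_record = split_feeds(tweet)
print(index_record)
print(burst_record)
```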
Architecture: Front-End

A webpage reports recent trends in real time

An interface allows users to rank trends by recency or current
activity rate and submit their own short description for trends.

Use an additional tab to display daily trends.
Demonstration

Every trend will be represented by its entities and by
the related bursty keywords.

The audience will have the option to use the
interface in order to acquire more information.

They will be shown additional keywords and can skim through representative
tweets.

They will be able to track a trend’s popularity over time and spot its origin.

They will interact with the system by ranking the displayed trends
according to different criteria and submitting descriptions.
EvenTweet: Online
Localized Event
Detection from
Twitter
Hamed Abdelhaq, Christian Sengstock, and Michael
Gertz
Hamed Abdelhaq, Christian Sengstock, Michael Gertz: EvenTweet: Online Localized Event Detection from Twitter.
VLDB 6(12): pp. 1326-1329 (2013)
1. Introduction
2. Localized Event Detection

Temporal Keyword Extraction

Spatial Keyword Identification

Keyword Clustering

Cluster Scoring
3. System overview
4. Demonstration
INTRODUCTION

EvenTweet, a system to detect localized events from a
stream of tweets in real-time.

Only about 1% of tweets are georeferenced.

Focuses on detecting localized events from a stream of
tweets in real-time.

Adopts a continuous analysis of the most recent tweets
within a time-based sliding window.

Each event is described by 1) related keywords and 2) an estimate of its
start time and geographic location.
INTRODUCTION (cont.)

Tracks the evolution of events over time at a fine-grained temporal
resolution, using a scoring scheme that gives each event a score
over time.

Identifies localized events using a possibly small number
of geo-tagged tweets:

Both geo- and non-geo-tagged tweets are used to identify
words best describing events.

Only geo-tagged tweets are used to estimate the spatial
distribution of such words.
Localized Event Detection
Basic Definitions

Event: a phenomenon that stimulates people to post messages for a
certain period of time.

Localized events: events that happen within a small region, i.e., have a
small spatial extent.
(e.g., concerts, soccer matches, road works)
Localized Event

A localized event is described as a tuple:
le = (el, et, K)
- el is the event location, represented as a small set of connected rectangular cells.
- et is the start time.
- K is a set of words frequently published during the event time and at that location.
Online Detection
Basic Notation:

Each tweet tw = (W, uid, l, t)
- W: a set of words
- uid: a user id
- l = (lon, lat): a geographic location
- t: a timestamp

Use a timeline divided into a sequence of equal-length time frames (…, fc-1, fc), where fc denotes
the current time frame.
Each time frame represents a short time interval
during which tweets are posted.
Basic Notation (cont.)

Use a time-based sliding window win^k_{fc} composed of k time frames,
with fc as its end point.

The detection procedure of EvenTweet is triggered every time a new
time frame elapses.
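
The notation above can be sketched in code. This is an illustrative sketch only: the frame length `FRAME_SECONDS`, the class `SlidingWindow`, and the assumption that tweets arrive in timestamp order are not taken from the paper.

```python
from collections import deque
from typing import FrozenSet, NamedTuple, Optional, Tuple


class Tweet(NamedTuple):
    W: FrozenSet[str]                   # set of words
    uid: int                            # user id
    l: Optional[Tuple[float, float]]    # (lon, lat), None if not geo-tagged
    t: float                            # timestamp in seconds


FRAME_SECONDS = 600  # assumed frame length; the slides do not give a value


def frame_index(t: float, frame_seconds: int = FRAME_SECONDS) -> int:
    """Map a timestamp to the index of its equal-length time frame."""
    return int(t // frame_seconds)


class SlidingWindow:
    """Keeps the k most recent time frames; the last one is the current frame fc."""

    def __init__(self, k: int):
        self.frames = deque(maxlen=k)   # entries: (frame index, list of tweets)

    def add(self, tweet: Tweet) -> bool:
        """Add a tweet (assumed to arrive in time order).

        Returns True when the tweet opens a new frame, i.e. the previous
        frame has elapsed and detection should be triggered.
        """
        idx = frame_index(tweet.t)
        if self.frames and self.frames[-1][0] == idx:
            self.frames[-1][1].append(tweet)
            return False
        previous_frame_elapsed = bool(self.frames)
        self.frames.append((idx, [tweet]))
        return previous_frame_elapsed


window = SlidingWindow(k=3)
for ts in (0, 100, 700, 1300, 1900, 2500):
    if window.add(Tweet(frozenset({"concert"}), 1, None, ts)):
        print("frame elapsed, run detection; window frames:",
              [idx for idx, _ in window.frames])
```

Using `deque(maxlen=k)` drops the oldest frame automatically once the window holds k frames.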
Temporal Keyword
Extraction

Extraction of words showing a bursty frequency in the current time
frame (these words are called keywords, Yc)

Given a set of words Wc from the tweets published during the recent
time frame fc, extract a subset Yc ⊆ Wc which represents words likely
to describe localized events.
Temporal Keyword
Extraction (cont.)

Use the discrepancy paradigm to extract keywords based on their
burstiness.

For each time frame fc:

u(w, c): number of users publishing tweets containing word w, normalized
by the number of users

histw = (u(w, 1), u(w, 2), …, u(w, m)) is a fixed historical sequence of
usage values for w collected before the current time frame fc, such that
m < c. The history is used to distinguish normal behavior from bursty behavior.

The discrepancy paradigm measures the deviation between the word
usage value u(w, c) in the current time frame and an expected word
usage baseline b(w), which is estimated from histw. histw is assumed to be drawn from
a Gaussian distribution with mean b(w).μ and standard deviation b(w).σ.

The higher the deviation, the higher the burstiness degree.
Temporal Keyword
Extraction (cont.)

The burstiness degree of a word w is the z-score, defined as:
b_degree(w, c) := (u(w, c) − b(w).μ) / b(w).σ

Choose as keywords the words whose usage is more than two standard
deviations above the mean, i.e., b_degree(w, c) > 2.

Keywords observed for the first time will have μ=0 and σ=0.
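
A minimal sketch of the z-score computation follows. Two points are assumptions rather than details from the slides: the baseline uses the population standard deviation, and a zero σ (for example a word seen for the first time) is treated as infinitely bursty when its current usage exceeds the mean.

```python
from statistics import mean, pstdev


def b_degree(u_current, hist):
    """Z-score of the current usage value against the historical usage values."""
    mu = mean(hist) if hist else 0.0
    sigma = pstdev(hist) if hist else 0.0
    if sigma == 0:
        # Assumption: the slides do not say how sigma = 0 is handled.
        return float("inf") if u_current > mu else 0.0
    return (u_current - mu) / sigma


def extract_keywords(current_usage, history, threshold=2.0):
    """Return the words whose burstiness degree exceeds `threshold`.

    current_usage: dict word -> u(w, c)
    history: dict word -> list of past usage values (hist_w)
    """
    return {
        w for w, u in current_usage.items()
        if b_degree(u, history.get(w, [])) > threshold
    }


history = {"concert": [1, 2, 3], "the": [50, 52, 51]}
current = {"concert": 5, "the": 51, "flashmob": 4}
print(extract_keywords(current, history))
```

Here "concert" has mean 2 and standard deviation of about 0.82, so a current usage of 5 gives a z-score of about 3.67 and the word is kept, while "the" stays at its baseline and is dropped.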
Spatial Keyword Identification

Find keywords which are highly localized.

Only use georeferenced tweets.
During the sliding window win^k_{fc}, for each cell g of a grid G:
- Usage ratio of ki in cell g: (# users using ki in cell g) / (# users in cell g)
- Density of keyword ki in cell g: (usage ratio of ki in g) / (usage ratio of ki in G)
- Repeat this for all cells in G. The result is Si, the discrete spatial distribution of
ki, also called the spatial signature of ki.
Spatial Keyword Identification (cont.)

Only use georeferenced tweets.
- Calculate the entropy H(Si).
- Discard all keywords with entropy larger than a threshold ρ. Why?
- Large entropy: the keyword is spread out in space.
- Small entropy: the keyword occurs at only a few locations.
- The result is Yc, the set of filtered keywords.
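
The spatial signature and the entropy filter can be sketched as below. The sketch assumes that "usage ratio of ki in G" means the sum of the per-cell usage ratios, so that Si sums to 1; it also uses base-2 logarithms. Both choices, like the function names, are assumptions.

```python
import math


def spatial_signature(users_with_kw, users_total):
    """Discrete spatial distribution S_i of a keyword over the grid cells.

    users_with_kw: dict cell -> number of users using the keyword in the cell
    users_total:   dict cell -> number of users in the cell
    """
    ratio = {
        g: users_with_kw.get(g, 0) / n
        for g, n in users_total.items() if n > 0
    }
    total = sum(ratio.values())   # assumed meaning of "usage ratio in G"
    if total == 0:
        return {}
    return {g: r / total for g, r in ratio.items()}


def entropy(signature):
    """Shannon entropy (base 2) of a spatial signature; zero cells are skipped."""
    return -sum(p * math.log2(p) for p in signature.values() if p > 0)


users_total = {"a": 10, "b": 10, "c": 10, "d": 10}
localized = spatial_signature({"a": 5}, users_total)
spread = spatial_signature({"a": 5, "b": 5, "c": 5, "d": 5}, users_total)
print(entropy(localized), entropy(spread))
```

A keyword used in a single cell has entropy 0 and would pass any threshold ρ, while a keyword used equally in all four cells reaches the maximum entropy for this grid and is the kind of keyword the filter discards.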
Keyword Clustering

Each Si is a vector.

Clustering event keywords using their Si

Similarity calculation: Cosine similarity
Keyword Clustering
- There is a distance threshold T.
- If a new keyword falls outside the
threshold, it forms a new cluster by itself.
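
A threshold-based clustering step of this kind might look as follows. It is a sketch under stated assumptions, not EvenTweet's actual procedure: distance is taken as 1 minus cosine similarity, each cluster is represented by the running mean of its members' signatures, and the value `T=0.5` is arbitrary.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_keywords(signatures, T=0.5):
    """Assign each keyword to the nearest cluster, or open a new one.

    signatures: dict keyword -> spatial signature vector S_i
    T: distance threshold (hypothetical value)
    """
    clusters = []   # each cluster: {"centroid": [...], "keywords": [...]}
    for keyword, vec in signatures.items():
        best, best_dist = None, None
        for cl in clusters:
            dist = 1 - cosine_similarity(vec, cl["centroid"])
            if best is None or dist < best_dist:
                best, best_dist = cl, dist
        if best is not None and best_dist <= T:
            # Update the centroid as the running mean of member vectors.
            n = len(best["keywords"])
            best["centroid"] = [(c * n + v) / (n + 1)
                                for c, v in zip(best["centroid"], vec)]
            best["keywords"].append(keyword)
        else:
            # The keyword falls outside the threshold: new cluster.
            clusters.append({"centroid": list(vec), "keywords": [keyword]})
    return [cl["keywords"] for cl in clusters]


signatures = {
    "concert": [0.9, 0.1, 0.0, 0.0],
    "band":    [0.8, 0.2, 0.0, 0.0],
    "match":   [0.0, 0.0, 0.1, 0.9],
    "goal":    [0.0, 0.0, 0.2, 0.8],
}
print(cluster_keywords(signatures))
```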
Cluster Scoring
Due to the noisy nature and the increasing vocabulary size
of tweets, the extracted keyword set is enormous and has
many spurious keywords, which results in creating clusters
related to no events.

To determine which clusters of keywords are more likely
to refer to localized events, filter out spurious
clusters.

To score a cluster:
1. Calculate the score of each keyword
2. Calculate the score of each cluster by summing up
the scores of all its keywords
3. The clusters with high scores are considered event
clusters
Cluster Scoring
1. Keyword Score
$$\mathrm{k\_score}(i, cl) := o_i \cdot \mathrm{b\_degree}(k_i, e_i) \cdot \mathrm{prominence}(k_i, cl)$$
$$\mathrm{prominence}(k_i, cl) := \frac{cl.o_i}{cl.lt} \cdot \left(1 - \frac{f_c - cl.e_i}{cl.lt}\right)$$
ei: the time frame we’re looking at
oi: the number of times ki was clustered in cl
cl.lt: cluster lifetime
2. Cluster Score
$$\mathrm{score}(cl) := \sum_{i=1}^{|cl.Y|} \mathrm{k\_score}(i, cl)$$
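
The scoring formulas can be checked with a short worked example. The class `ClusteredKeyword` and all numeric values are made up for illustration, and the code follows the formulas as they appear on the slide, which may differ in detail from the paper.

```python
from dataclasses import dataclass


@dataclass
class ClusteredKeyword:
    o: int            # o_i: number of times k_i was clustered in cl
    e: int            # e_i: index of the time frame considered for k_i
    b_degree: float   # b_degree(k_i, e_i), the burstiness z-score


def prominence(kw: ClusteredKeyword, lt: int, fc: int) -> float:
    """(cl.o_i / cl.lt) * (1 - (f_c - cl.e_i) / cl.lt)"""
    return (kw.o / lt) * (1 - (fc - kw.e) / lt)


def k_score(kw: ClusteredKeyword, lt: int, fc: int) -> float:
    """o_i * b_degree(k_i, e_i) * prominence(k_i, cl)"""
    return kw.o * kw.b_degree * prominence(kw, lt, fc)


def cluster_score(keywords, lt: int, fc: int) -> float:
    """Sum of the keyword scores over all keywords in the cluster."""
    return sum(k_score(kw, lt, fc) for kw in keywords)


# Hypothetical cluster with lifetime 4 frames, current frame index 10.
cluster = [
    ClusteredKeyword(o=2, e=9, b_degree=3.0),    # k_score = 2 * 3.0 * 0.375 = 2.25
    ClusteredKeyword(o=4, e=10, b_degree=2.5),   # k_score = 4 * 2.5 * 1.0   = 10.0
]
print(cluster_score(cluster, lt=4, fc=10))       # 12.25
```

Keywords that were clustered often and whose considered frame is close to the current frame contribute more, which matches the intent of the prominence term.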
System Overview
Demonstration
Next Week

Read one interesting research paper from VLDB
conference

Present in the class =)
Other Interesting Papers from
VLDB

RealTime Twitter Recommendation: Online Motif
Detection in Large Dynamic Graphs

TeRec: A Temporal Recommender System Over Tweet
Stream

Unicorn: A System for Searching the Social Graph

Entity Extraction, Linking, Classification, and Tagging
for Social Media: a Wikipedia-Based Approach

Piggybacking on social networks

Mobility and Social Networking: A Data Management
Perspective

Recommendation by Examples
http://db.disi.unitn.eu/pages/VLDBProgram/lib/FullProgram.html#D1F1400T1530R2