Prasanna Giridhar*, Tanvir Amin*, Lance Kaplan+, Jemin George+, Raghu Ganti++, Tarek Abdelzaher*
* University of Illinois at Urbana-Champaign
+ U.S. Army Research Lab
++ IBM Research, USA
1
Explosive growth in deployment of physical sensors.
Activities recorded by these sensors often deviate from the norm:
Closure of a freeway due to forest fire.
Change in building occupancy due to shutdown.
Unusual behavior tends to attract human attention and gets reported on social media as well.
2
Several past research efforts have addressed event detection in the physical as well as the social domain.
Can we use social media as a tool for explaining the underlying cause of anomalies?
A system for identifying the discriminative social feeds that can be correlated with sensor anomalies.
The more unusual the event, the higher the probability that it is reported in social feeds.
Evaluation performed on real-time traffic data.
3
STEP 1: Initialization of the system
Continuous stream of tweets, filtered using the parameters:
Keywords
Location
Continuous stream of data from physical sensors
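A minimal sketch of the tweet-side filter, assuming each tweet arrives as a dictionary with `text`, `lat`, and `lon` fields. The field names, the sample tweets, and the helper functions are hypothetical; the slides do not say which streaming API or filter logic was used.

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def keep_tweet(tweet, keywords, center, radius_miles):
    """True if the tweet mentions any keyword and lies within the radius of center."""
    text = tweet["text"].lower()
    has_keyword = any(k in text for k in keywords)
    dist = haversine_miles(tweet["lat"], tweet["lon"], center[0], center[1])
    return has_keyword and dist <= radius_miles

# Hypothetical sample data, for illustration only.
SAN_DIEGO = (32.7157, -117.1611)
tweets = [
    {"text": "Stuck in traffic on 15N", "lat": 32.90, "lon": -117.11},
    {"text": "Traffic is awful today", "lat": 34.05, "lon": -118.24},  # Los Angeles
    {"text": "Lovely sunset", "lat": 32.72, "lon": -117.16},
]
kept = [t for t in tweets if keep_tweet(t, ["traffic"], SAN_DIEGO, 30)]
```

Only the first tweet passes both the keyword and the location filter.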
4
STEP 2: Identification of sensor anomalies
Run a black-box anomaly detection algorithm.
Store attributes for sensors that the algorithm classifies as anomalous.
Cluster the sensors that provide redundant data.
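The detector itself is treated as a black box, so the sketch below swaps in a simple z-score test purely as a stand-in. Grouping anomalous sensors by freeway and direction is also an assumption about what "redundant data" means, and the sensor IDs and readings are made up.

```python
from collections import defaultdict
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Stand-in detector: flag readings more than `threshold` std devs from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold

# Hypothetical speed readings (mph) per sensor.
readings = {
    "S1": {"freeway": "I-15", "direction": "N", "history": [62, 65, 63, 64, 66], "current": 21},
    "S2": {"freeway": "I-15", "direction": "N", "history": [60, 61, 63, 62, 64], "current": 25},
    "S3": {"freeway": "I-80", "direction": "E", "history": [58, 60, 59, 61, 60], "current": 59},
}

# Store anomalous sensors, clustered by (freeway, direction) so that
# neighbouring sensors reporting the same incident are handled together.
clusters = defaultdict(list)
for sensor_id, r in readings.items():
    if is_anomalous(r["history"], r["current"]):
        clusters[(r["freeway"], r["direction"])].append(sensor_id)
```

With this data, S1 and S2 end up in a single I-15 Northbound cluster and S3 is not flagged.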
5
STEP 3: Identification of discriminative social feeds
Social feeds often have keywords describing an event
Keywords: malaysian, airlines, 370
8
Single Keyword?
Airlines
9
Keyword pair?
Malaysian, Airlines
10
Keyword triplet?
Malaysia, Airlines, 370
Malaysia, Airlines, Satellite
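To make the notion of a keyword signature concrete, here is a rough sketch that enumerates all keyword combinations of a given size from one tweet. The stopword list, the tokenization, and the sample tweet are assumptions for illustration, not the preprocessing used in the system.

```python
from itertools import combinations

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "of", "is", "in", "on", "for", "to", "and"}

def keyword_signatures(text, size):
    """Return all keyword combinations of the given size found in the text."""
    words = {w.strip(".,!?:;\"'").lower() for w in text.split()}
    keywords = sorted(w for w in words if w and w not in STOPWORDS)
    return list(combinations(keywords, size))

tweet = "Search for Malaysian Airlines 370 continues"  # made-up example
singles = keyword_signatures(tweet, 1)
pairs = keyword_signatures(tweet, 2)
triplets = keyword_signatures(tweet, 3)
```

Sorting the keywords first makes each signature order-independent, so ("airlines", "malaysian") is the same pair no matter how the tweet is phrased.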
11
Signature         Events per Signature   Signatures per Event
Single keyword    3.621                  1.1579
Keyword pair      1.1416                 1.2725
Keyword triplet   1.0628                 0.4393
Table: Signature profile of the collected Twitter data
Keyword pairs come closest to the ideal 1-to-1 mapping between signatures and events.
12
Problem: Given a list of keyword pairs for the current and past windows, how do we find the most discriminating subset?
Difference in rate of occurrences:
(traffic, jam): 50 times today compared to a past average of 35
(drunk, kills): 12 times today compared to a past average of 0
Increase in percentage:
(traffic, jam): 1 time today compared to a past average of 0
(drunk, kills): 12 times today compared to a past average of 2
In both cases the routine pair (traffic, jam) scores higher than the unusual pair (drunk, kills).
Overcome these disadvantages using information gain.
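The two baseline scores can be worked through on the counts above. This is a sketch under the assumption that each baseline ranks pairs by its raw score; the function names are hypothetical.

```python
def rate_difference(today, past_avg):
    """Baseline 1: absolute difference in occurrence counts."""
    return today - past_avg

def percent_increase(today, past_avg):
    """Baseline 2: percentage increase over the past average."""
    if past_avg == 0:
        return float("inf") if today > 0 else 0.0
    return 100.0 * (today - past_avg) / past_avg

# (today, past average) for each keyword pair, taken from the two examples.
case1 = {("traffic", "jam"): (50, 35), ("drunk", "kills"): (12, 0)}
case2 = {("traffic", "jam"): (1, 0), ("drunk", "kills"): (12, 2)}

top_by_difference = max(case1, key=lambda p: rate_difference(*case1[p]))
top_by_percentage = max(case2, key=lambda p: percent_increase(*case2[p]))
```

In the first case the difference is 15 for (traffic, jam) against 12 for (drunk, kills). In the second, a single occurrence against a past average of 0 gives an unbounded percentage increase, which outranks the 500% increase of (drunk, kills).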
13
Information Gain Theory and Entropy
Entropy measures randomness introduced by a variable
Using the conditional entropy, we determine the information gained about an event from the keyword pair. This can be formulated as:
Information Gain = H(Y) − H(Y|X)
Y: variable associated with event; y=0 (normal) and y=1 (anomalous)
X: variable associated with keyword pair; x=0 (absent) and x=1 (present)
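A small sketch of the formula, assuming X and Y are estimated from co-occurrence counts over observation units such as time windows. What exactly is counted is not specified on the slide, so the `counts` layout is an assumption.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(counts):
    """counts[(x, y)]: number of observations with keyword-pair state x and event state y."""
    total = sum(counts.values())
    # H(Y): entropy of the event variable on its own.
    p_y = [sum(c for (x, y), c in counts.items() if y == v) / total for v in (0, 1)]
    h_y = entropy(p_y)
    # H(Y|X): entropy of Y within each value of X, weighted by P(X = x).
    h_y_given_x = 0.0
    for xv in (0, 1):
        n_x = sum(c for (x, y), c in counts.items() if x == xv)
        if n_x == 0:
            continue
        cond = [counts.get((xv, yv), 0) / n_x for yv in (0, 1)]
        h_y_given_x += (n_x / total) * entropy(cond)
    return h_y - h_y_given_x

# Hypothetical counts: the pair appears exactly when the day is anomalous.
perfect = {(0, 0): 5, (1, 1): 5}
# Hypothetical counts: the pair appears independently of the anomaly.
useless = {(0, 0): 5, (0, 1): 5, (1, 0): 5, (1, 1): 5}
```

A pair that appears only on anomalous days yields the full 1 bit of information gain, while a pair that is independent of the anomaly yields 0.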
14
STEP 4: Ranking discriminative events
Identify tweets for discriminative pairs.
Score each pair by its conditional entropy H(Y|X).
The lower the entropy value, the higher the discriminating power.
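As a sketch of the ranking step, pairs can be sorted by ascending conditional entropy and the tweets containing the top pair retrieved. The entropy values and tweets below are invented for illustration, and matching on whitespace-separated lowercase tokens is a simplifying assumption.

```python
def rank_pairs(cond_entropy):
    """Order keyword pairs from most to least discriminative (lowest entropy first)."""
    return sorted(cond_entropy, key=cond_entropy.get)

def tweets_for_pair(tweets, pair):
    """Return the tweets that contain every keyword of the pair."""
    return [t for t in tweets if all(k in t.lower().split() for k in pair)]

# Hypothetical conditional entropy values H(Y|X) per keyword pair.
cond_entropy = {
    ("traffic", "jam"): 0.92,
    ("bomb", "squad"): 0.15,
    ("game", "tonight"): 0.40,
}
tweets = ["Bomb squad blocking US101", "Usual traffic jam downtown"]

ranked = rank_pairs(cond_entropy)
top_tweets = tweets_for_pair(tweets, ranked[0])
```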
15
STEP 5: Matching tweets with sensor anomalies
We align the two data sources based on the spatiotemporal properties associated with the event.
For example
Sensor ID40456 on I-15 Northbound with unusual activity
Unusual Tweet: “SFvSD game tonight, stuck @15N traffic!!!”
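One way the alignment could work is sketched below: extract route mentions such as "15N" from the tweet text and check that the tweet falls inside the anomaly's time window. The regular expression, the one-hour slack, the record layout, and the timestamps are all assumptions; the slides do not describe the actual matching rules.

```python
import re
from datetime import datetime, timedelta

# Matches a route number followed by a compass letter, e.g. "15N" or "101 S".
ROUTE = re.compile(r"(?<!\d)(\d{1,3})\s?([NSEW])\b", re.IGNORECASE)

def routes_in(text):
    """Return the set of (route number, direction) mentions found in the text."""
    return {(num, d.upper()) for num, d in ROUTE.findall(text)}

def matches(anomaly, tweet, slack=timedelta(hours=1)):
    """True if the tweet names the anomaly's route and falls near its time window."""
    same_route = (anomaly["route"], anomaly["direction"]) in routes_in(tweet["text"])
    in_window = anomaly["start"] - slack <= tweet["time"] <= anomaly["end"] + slack
    return same_route and in_window

# Hypothetical timestamps; only the sensor ID, route, and tweet text come from the example.
anomaly = {
    "sensor": "ID40456", "route": "15", "direction": "N",
    "start": datetime(2013, 8, 24, 18, 0), "end": datetime(2013, 8, 24, 21, 0),
}
tweet = {"text": "SFvSD game tonight, stuck @15N traffic!!!",
         "time": datetime(2013, 8, 24, 18, 30)}
late_tweet = {"text": tweet["text"], "time": datetime(2013, 8, 26, 9, 0)}
```

Here `matches(anomaly, tweet)` holds, while the same text posted two days later is rejected on the temporal check.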
16
STEP 6: Output the matched explanations
The final step is to provide the explanations.
A user interface enables users to track unusual events on a per-day basis.
17
Twitter feeds collected over a period of 2 weeks, from August 19 to September 1, 2013, within a radius of 30 miles of each city
Three cities in CA:
• Los Angeles
• San Francisco
• San Diego
Physical sensor data retrieved from PeMS (Caltrans Performance Measurement System, http://pems.dot.ca.gov/): 5-minute reports of flow, speed, occupancy, and delay
18
Performance measured using precision and mean average rank, comparing our information gain approach against the baseline approaches
Table: Precision using different methods
B1 corresponds to Difference in rate of occurrences and B2 to Increase in percentage.
Table: Average position of tweets from the top
19
Sensor anomaly detected
Highway I-80 Eastbound in SF
Landmarks: Bay bridge
Duration: 4 days
20
21
US101 blockage due to Bomb squad in LA
22
Traffic on 15N due to game in SD
23
Abnormal behavior is also recorded in social media.
Tool to explain the abnormalities.
Major activities explained with high precision.
Explanations ranked among the top two tweets.
24
Scalability issues
Credibility of social feeds
Geolocalization of tweets
25
Q+A
26