Spatiotemporal Stream Mining Using TRACDS, Middle East

Spatiotemporal Stream Mining using
TRACDS
Middle East Technical University
October 31, 2012
Margaret H Dunham, Michael Hahsler, Yu Su, Sudheer
Chelluboina, and Hadil Shaiba
Computer Science and Engineering
This work is supported by NSF IIS-0948893
10/31/2012, METU
IDA@SMU
Intelligent Data Analysis Lab
Team led by
Margaret H. Dunham
Michael Hahsler
Mission
At IDA@SMU we create novel techniques inspired by knowledge
discovery, data mining, machine learning, artificial intelligence and
statistical analysis to work with data from various sources.
Current Focus
• Massive data stream modeling: TRACDS™
• Hurricane intensity prediction
• Effective metagenomic classification for the Human Genome Project
• Recommender systems: R/Apache Mahout
http://www.lyle.smu.edu/IDA
Outline
• Spatiotemporal Stream Data
• TRACDS
• Hurricane Intensity Prediction
• PIIH
• PIIH online
From Sensors to Streams
• Data captured and sent by a set of sensors is
usually referred to as “stream data”.
• A stream is a real-time sequence of encoded signals that contain the
desired information. It is a continuous, ordered sequence of items
(ordered implicitly by arrival time, or explicitly by timestamp or
by geographic coordinates).
• May be viewed as arriving in discrete time intervals.
• Stream data is infinite - the data keeps coming.
• Examples: Weather data, network data (VoIP),
traffic data.
Stream Data Format
• Events arriving in a stream
• At any time, t, we can view the state of the
problem as represented by a vector of n
numeric values: Vt = <S1t, S2t, ..., Snt>
        V1     V2     …    Vq      (Time →)
S1      S11    S12    …    S1q
S2      S21    S22    …    S2q
…       …      …      …    …
Sn      Sn1    Sn2    …    Snq
Modeling Stream Data
– Summarization (Synopsis) of data
– Temporal and Spatial
– Dynamic
– Continuous (infinite stream)
– Concept Drift
• Learn
• Forget
– Sublinear growth rate - Clustering
Markov Model (MM)
A first-order Markov chain is a finite or countably infinite sequence of
events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei),
and at any time the future behavior of the process depends solely
on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and
directed arcs, A, such that:
• S = {N1, N2, …, Nm}, and
• A = {Lij | i ∈ {1, 2, …, m}, j ∈ {1, 2, …, m}}, where each arc
Lij = <Ni, Nj> is labeled with a transition probability
Pij = P(Nj | Ni).
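As a rough illustration of the definition above, the transition probabilities Pij can be estimated from an observed state sequence by counting transitions. The function name and the sample sequence are hypothetical and chosen only for this sketch.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P(next = j | current = i) from one observed state sequence."""
    counts = defaultdict(Counter)
    # Count every observed transition current -> next.
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts so that outgoing probabilities sum to 1.
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


# Hypothetical sequence of visited states.
sequence = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = transition_probabilities(sequence)
print(P["N1"])
```

Each row of the result corresponds to the arcs leaving one state, so only arcs that were actually observed receive a label.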
Problem with Markov Chains
• The required structure of the MC may not be certain at
model construction time.
• As the real world being modeled by the MC changes, so should
the structure of the MC.
• Not scalable: the MC grows linearly with the number of events.
• Our solution:
– Extensible Markov Model (EMM)
– Cluster real world events
– Allow Markov chain to grow and shrink dynamically
EMM (Extensible Markov Model)
• Time Varying Discrete First Order Markov
Model
• Continuously evolves
• Nodes are clusters of real world states.
• Learning continues during application phase.
• Learning:
– Transition probabilities between nodes
– Node labels (centroid of cluster)
– Nodes are added and removed as data arrives
• Applications:
– Anomaly/Rare Event Detection
– Prediction
– Classification
EMM Definition
Extensible Markov Model (EMM): at any time t, an EMM consists
of an MC with a designated current node and algorithms to
modify it, where the algorithms include:
• EMMCluster, which defines a technique for matching between
input data at time t + 1 and existing states in the MC at time t.
• EMMIncrement, which updates the MC at time t + 1
given the MC at time t and the clustering result at time t + 1.
• EMMDecrement, which removes nodes from the
EMM when needed.
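A minimal sketch of how the increment and decrement steps could maintain the chain, assuming the clustering step has already mapped each input to a node id. The class name TinyEMM and its methods are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict


class TinyEMM:
    """Toy EMM: nodes are cluster ids, arcs carry transition counts."""

    def __init__(self):
        self.node_counts = defaultdict(int)  # visits per node
        self.link_counts = defaultdict(int)  # (from, to) -> transition count
        self.current = None                  # designated current node

    def increment(self, node):
        """EMMIncrement-style update after clustering assigned `node`."""
        self.node_counts[node] += 1
        if self.current is not None:
            self.link_counts[(self.current, node)] += 1
        self.current = node

    def decrement(self, node):
        """EMMDecrement-style removal of a node and every arc touching it."""
        self.node_counts.pop(node, None)
        for arc in [a for a in self.link_counts if node in a]:
            del self.link_counts[arc]
        if self.current == node:
            self.current = None

    def probability(self, src, dst):
        """Transition probability estimated from the stored counts."""
        out = sum(c for (s, _), c in self.link_counts.items() if s == src)
        return self.link_counts.get((src, dst), 0) / out if out else 0.0


emm = TinyEMM()
for node in ["N1", "N2", "N1", "N3", "N1", "N2"]:  # hypothetical cluster ids
    emm.increment(node)
print(emm.probability("N1", "N2"))
```

Storing counts rather than probabilities keeps both updates cheap: probabilities are derived on demand, so adding or removing a node never requires renormalizing the whole chain.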
EMM Cluster
• Nearest Neighbor (or any clustering
technique)
• If no existing node is “close”, create a new node
• A cluster is labeled by the centroid of its
members or by a Clustering Feature
• O(n) per input, where n is the number of states
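The nearest-neighbor matching described above can be sketched as follows. This assumes Euclidean distance and a fixed closeness threshold, neither of which is specified on the slide; the function and parameter names are hypothetical.

```python
import math


def assign_cluster(point, centroids, counts, threshold):
    """Return the index of the node that absorbs `point`, creating one if needed."""
    best, best_dist = None, math.inf
    for i, c in enumerate(centroids):  # O(n) scan over the current nodes
        d = math.dist(point, c)
        if d < best_dist:
            best, best_dist = i, d
    if best is None or best_dist > threshold:  # nothing "close": new node
        centroids.append(list(point))
        counts.append(1)
        return len(centroids) - 1
    # Running-mean update keeps the node label equal to the centroid.
    counts[best] += 1
    k = counts[best]
    centroids[best] = [c + (p - c) / k for c, p in zip(centroids[best], point)]
    return best


centroids, counts = [], []
labels = [
    assign_cluster(p, centroids, counts, threshold=3.0)
    for p in [(0, 0), (1, 0), (10, 10)]  # hypothetical 2-D inputs
]
print(labels, centroids)
```

The threshold controls how fast the model grows: a larger value merges more inputs into existing nodes, a smaller value creates nodes more readily.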
EMM Sublinear Growth
Servent Data
Growth Rate Automobile Traffic
Minnesota Traffic Data
EMM Learning
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
[Figure: EMM learned from the vectors above, with nodes N1, N2, N3 and arcs labeled by transition probabilities]
EMM Forgetting
[Figure: EMM before and after forgetting, showing nodes N1, N2, N3, N5, N6 and arcs labeled by transition probabilities]
Traditional Stream Clustering
Standard Data Stream Clustering ignores temporal aspect of data
Stream Clustering
• Clusters change over time – they move
• Some techniques use micro clusters/reclustering
• Reclustering is often done offline (in batch, while stream data
keeps arriving).
• STREAM
– Partitions stream data into segments
– Clusters each segment (k-medians)
– Iteratively reclusters the centers of these clusters
S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan. “Clustering data streams: Theory and
practice.” IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
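The three STREAM steps listed above can be sketched for one-dimensional data as follows. This is a simplified illustration, not the algorithm from the cited paper: it uses a plain Lloyd-style k-medians loop with a deterministic start, and it ignores the weighting of intermediate centers. All function names are hypothetical.

```python
from statistics import median


def k_medians_1d(values, k, iters=10):
    """Lloyd-style k-medians on scalars with a deterministic initialization."""
    values = sorted(values)
    # Start from k evenly spaced order statistics.
    centers = [values[(2 * j + 1) * len(values) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        centers = [median(g) if g else centers[j] for j, g in enumerate(groups)]
    return centers


def stream_sketch(stream, segment_size, k):
    """Cluster each segment, then recluster the collected segment centers."""
    centers = []
    for start in range(0, len(stream), segment_size):
        segment = stream[start:start + segment_size]
        centers.extend(k_medians_1d(segment, min(k, len(segment))))
    return k_medians_1d(centers, k)


# Hypothetical stream with two value regimes.
stream = [1, 2, 3, 101, 102, 103, 2, 1, 3, 100, 102, 101]
result = stream_sketch(stream, segment_size=6, k=2)
print(result)
```

Only the segment centers are retained between segments, which is what lets this style of clustering run in memory that does not grow with the raw stream.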
Temporal Relationship Among Clusters in
Data Streams
TRACDS NOTE
• TRACDS is not:
– Another stream clustering algorithm
• TRACDS is:
– A new way of looking at clustering
– Built on top of an existing clustering algorithm
• TRACDS may be used with any stream clustering
algorithm
TRACDS Overview
TRACDS Clustering Operations
TRACDS Example
[Figure: TRACDS example; panel labels: C, EMM]
http://www.lyle.smu.edu/IDA/TRACDS
Lower 9th Ward of New Orleans, Louisiana, Feb 27, 2006
Photographer: Mackenzie Schott
Hurricanes
Hurricanes are tropical cyclones with sustained winds of at least
64 kt (119 km/h, 74 mph).
The major issues in forecasting hurricanes are predicting their
tracks of movement and their intensities. Compared with
prediction of track movement, intensity prediction is still relatively
inaccurate.
Time step [0h, 12h, 24h, …, 120h]
Hurricane Intensity Prediction
• Hurricane intensity: maximum sustained surface wind, i.e. the
highest wind speed averaged over 1 minute at 10 m above the
surface.
“Maximum Sustained Wind”. Wikipedia. Wikimedia Foundation, 27
August 2011. Web. 4 December 2011. Retrieved from
http://en.wikipedia.org/wiki/Maximum_sustained_wind.
• Rapid intensification: a 24-h increase in maximum wind speed
of at least 30 knots.
“Rapid Intensification,” accessed on 10/24/12,
http://www.hurrnet.com/tutorial/forecasts/intensity/rapid.htm.
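The rapid intensification criterion can be checked mechanically on a wind speed series. The sketch below assumes records at 12-hour intervals and wind speeds in knots; the function name and the sample values are illustrative only.

```python
def rapid_intensification_flags(wind_kt, step_hours=12, threshold_kt=30):
    """Flag each record whose wind speed rose >= threshold over the prior 24 h."""
    lag = 24 // step_hours  # number of records that span 24 hours
    return [
        i >= lag and wind_kt[i] - wind_kt[i - lag] >= threshold_kt
        for i in range(len(wind_kt))
    ]


# Illustrative wind speeds (kt) at 0h, 12h, 24h, ...
winds = [30, 30, 35, 65, 75, 95]
flags = rapid_intensification_flags(winds)
print(flags)
```

The first records are never flagged because a full 24-hour history is not yet available for them.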
Predicting Intensity
• Statistical models predict intensity based on measured stream data:
– Current state of the storm
– History of this storm
– How similar storms behaved in the past
• Regression models are the most popular.
• NOAA (branch of the U.S. Government)
– Collects stream data.
– Updates its models yearly based on data from the previous year.
– Makes predictions in a quasi-real-time manner.
Hurricane Intensity Prediction
“Objective: Improve forecast skill to accuracy
and confidence levels required for
decision‐making and risk management”
NOAA’s National Weather
Service Strategic Plan 2010-2020
• Very difficult to predict intensity
(rapid intensification)
• National Hurricane Center (NHC) uses
– Dynamical models: computationally
intensive and slow
– Statistical models: Statistical Hurricane
Intensity Prediction Scheme (SHIPS)
• Current storm: SANDY
http://www.nhc.noaa.gov/archive/2012/SANDY_graphics.shtml
[Figure: Path of Hurricane Katrina (2005); color shows intensity]
• Category 5, 175 mph
• Damage: estimated $125 billion
• Fatalities: >1,800
“Hurricane Katrina – Most Destructive Hurricane Ever to Strike the U.S.”,
August 28, 2005, February 12, 2007, http://www.katrina.noaa.gov/.
Remote Sensing
• Storm features are gathered from Earth observations
using remote sensing.
• Real-time data are gathered every few hours and stored
in large databases.
• More than 20 years of historical data on the earth's
behavior are stored in the database.
Methods:
• Satellite
• Buoy
• Ship
• Aircraft
Hurricane Data
The data contains 16 predictors. The dataset is formed by time-ordered
records at 12-hour intervals and contains hurricane data from the
1982 to 2003 seasons.
[Figure: 274 hurricanes (hurricane 1, hurricane 2, …, hurricane 274), each a sequence of records at 0h, 12h, 24h, …, with 16 predictors per record, spanning 1982 to 2003]
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
25,0,1,-5.83,668,0,140,14.9,-53.5,13.25,40.5,23,6.6,27,372.5,19600
25,0,1,-5.83,708,0,140,12.7,-53.45,13.65,37.5,17.5,5.69,4,317.5,19600
30,5,1,-3.58,682,150,135,12.75,-53.35,13.25,34,1.5,5.79,15,382.5,18225
35,5,1,-4.9,674,175,130,14.2,-53.35,13.4,33,-12,6.66,-13,497,16900
50,15,1,0.44,681,750,113.52,17.1,-53.15,13.2,35,-20,8.32,-7,855,12885.79
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
30,0,0.99,-7.02,656,0,124.55,19.05,-52.55,14.75,51,0.5,6.68,45,571.5,15512.49
30,0,0.98,-7.02,675,0,123.75,17.3,-52.6,14.15,54,5,6.63,22,519,15314.28
35,5,0.98,-4.16,722,175,119.55,17.9,-52.6,14.65,58,10,7.43,34,626.5,14292
65,30,0.97,4.09,635,1950,88.77,19.15,-52.1,14.7,54.5,27.5,8.63,33,1244.75,7879.26
75,10,0.97,6.25,724,750,70.08,17.8,-52.15,12.55,54,48.5,8.61,45,1335,4910.92
95,20,0.96,9.17,641,1900,37.59,14.85,-52.9,11.1,56.5,55,7.87,15,1410.75,1413.13
95,0,0.96,7.2,691,0,33.33,15.6,-53.45,9.25,51.5,44.5,8.97,32,1482,1110.98
95,0,0.95,0.82,713,0,35.62,17.9,-53.25,7.85,47,38,10.72,31,1700.5,1268.43
95,0,0.95,2.4,813,0,28.12,20.85,-52.65,7.25,45,45,12.84,63,1980.75,790.65
115,20,0.93,10.65,635,2300,-11.1,24.45,-52.7,4.55,41.5,57.5,15.81,24,2811.75,123.2
110,-5,0.93,14.51,622,-550,-26.24,30.7,-53.55,1.15,40.5,50.5,21.2,28,3377,688.71
90,-20,0.91,18.15,613,-1800,-17.97,37.05,-53.95,0,46,29.5,27.08,42,3334.5,322.99
70,-20,0.91,21.86,668,-1400,1.01,40.3,-53.7,0,52.5,20,30.72,41,2821,1.02
70,0,0.89,26.22,688,0,2.35,45.05,-52.7,0.25,50.5,37.5,35.18,31,3153.5,5.5
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
……
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Intensity
Construct EMM
Use EMM for Prediction
EMM, TRACDS and Hurricane Data
• Approach: Using TRACDS algorithms, construct multiple EMMs.
One will be built for each time point into the future for which
predictions are to be made: 12 hours, …, 120 hours.
• NOAA provides 16 different features or predictors (attribute values).
• Clustering is performed based on a distance calculation from input
feature vector to centroid of clusters in EMMs.
• However, the importance of these features to intensity prediction
is not uniform.
• How can we determine a weight for each feature, to be used during
clustering?
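One way such feature weights could enter the distance calculation is a weighted Euclidean distance, sketched below. The choice of Euclidean distance and the function name are assumptions; the slides do not state which distance measure is used.

```python
import math


def weighted_distance(x, centroid, u):
    """Euclidean distance after scaling every feature difference by its weight."""
    return math.sqrt(
        sum((ui * (xi - ci)) ** 2 for xi, ci, ui in zip(x, centroid, u))
    )


# A weight of 0 removes a feature from the comparison entirely.
print(weighted_distance([0, 0], [3, 4], [1, 1]))
print(weighted_distance([0, 0], [3, 4], [1, 0]))
```

This is equivalent to multiplying each feature by its weight first and then measuring ordinary Euclidean distance on the scaled vectors.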
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
WFL-EMM assumes that the different predictors contribute differently during
the prediction.
[Figure: bar chart of weights for predictors f1 to f7, each between 0 and 1]
V1 = <20 50 100 30 25 4 10>
V2 = <20 80 50 20 10 10 10>
……
In WFL-EMM, a weight vector u = <u1, …, un> indicates the weights for the
different predictors, where ui ∈ [0, 1]. ui = 1 means the ith predictor is
important and ui = 0 implies that the ith predictor is ignored.
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
GA Learning Process
The question is how to find a fit weight vector u = <u1, …, un> for
hurricane intensity predictions.
A genetic algorithm (GA) is introduced in WFL-EMM to find the fittest
weight vector, i.e. the one that gives the smallest prediction error.
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
Given a weight vector u = <u1, …, un>, the data is transformed in two steps:
• Normalization: normalize every predictor to the range [0, 1].
First standardize the predictor values by zi = (xi − mean(x)) / sd(x),
where mean(x) and sd(x) are the mean and standard deviation of the ith predictor.
Then a non-linear normalization maps zi to the interval [0, 1]
[equation not preserved], with a damping coefficient as its parameter.
• Transformation: Assume a normalized record d = <d1, …, dn>. Then the
record is transformed as d’ = <u1 d1, …, un dn>.
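The two transformation steps can be sketched as below. The non-linear map used on the slide is not available, so a logistic function with a damping parameter stands in for it here purely as an assumption. The use of the population standard deviation and all function names are also assumptions.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """z-score of one predictor column (population standard deviation assumed)."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s if s else 0.0 for x in column]


def squash(z, damping=1.0):
    """ASSUMED logistic map into (0, 1); the slide's own formula is unknown."""
    return 1.0 / (1.0 + math.exp(-damping * z))


def transform(records, weights, damping=1.0):
    """Normalize each predictor, then scale it by its weight: d' = <u1 d1, ...>."""
    columns = [standardize(col) for col in zip(*records)]
    normalized = zip(*[[squash(z, damping) for z in col] for col in columns])
    return [[u * d for u, d in zip(weights, row)] for row in normalized]


# The two example vectors V1 and V2 from the slides, with hypothetical weights.
records = [(20, 50, 100, 30, 25, 4, 10), (20, 80, 50, 20, 10, 10, 10)]
weights = [1, 0, 1, 1, 1, 1, 1]
out = transform(records, weights)
print(out[0])
```

With a weight of 0 the second predictor drops out of every transformed record, which is how an ignored feature stops influencing the clustering distance.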
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
GA Learning Process
• The question is how to find a fit weight vector u =
<u1, …, un> for hurricane intensity predictions.
• These weights are used during clustering and are
applied to the distance/similarity measure used for
clustering.
• A genetic algorithm (GA) is introduced in WFL-EMM to find the
fittest weight vector, i.e. the one that gives the smallest
prediction error.
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
GAs try to locate a fit solution in a solution space.
The weight vector u = <u1, …, un> ranges over the space [0, 1]^n, since each ui is
a real value in [0, 1].
[Figure: solution space with the fit solution marked]
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
GA Learning Process
Genetic algorithm evolution
Each time, two chromosomes are selected
randomly from the ith population with a
probability proportional to their fitness, where
a chromosome is a Gray code string of a
weight vector u.
[Figure: Chromosome 1 and Chromosome 2 selected from Population i]
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
GA Learning Process
Genetic algorithm evolution
• Crossover: combine Chromosome 1 and Chromosome 2 into an offspring.
• Mutation: randomly alter one or more bits
in the offspring based on a given
probability.
• Inversion: randomly select a break point in
a chromosome and then
exchange the positions of the two
pieces.
• Calculate the fitness of the new chromosome and place it
into population i+1.
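The operators above can be sketched on Gray-coded bit strings as follows. The 8 bits per weight, single-point crossover, and the convention that selection scores are higher-is-better (for example, an inverse of the prediction error) are assumptions made for this sketch. All names are hypothetical.

```python
import random

BITS = 8  # ASSUMED resolution: bits used to encode each weight


def encode(weights):
    """Weight vector in [0, 1]^n -> Gray-coded bit string (the chromosome)."""
    out = []
    for u in weights:
        level = round(u * (2**BITS - 1))
        out.append(format(level ^ (level >> 1), f"0{BITS}b"))
    return "".join(out)


def decode(chromosome):
    """Gray-coded bit string -> weight vector in [0, 1]^n."""
    weights = []
    for i in range(0, len(chromosome), BITS):
        gray, level = int(chromosome[i:i + BITS], 2), 0
        while gray:  # XOR of all right shifts inverts the Gray code
            level ^= gray
            gray >>= 1
        weights.append(level / (2**BITS - 1))
    return weights


def select(population, scores, rng):
    """Pick two parents with probability proportional to their scores."""
    return rng.choices(population, weights=scores, k=2)


def crossover(a, b, rng):
    """Single-point crossover: head of `a` joined to tail of `b`."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chrom, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chrom
    )


def invert(chrom, rng):
    """Break at a random point and exchange the positions of the two pieces."""
    point = rng.randrange(1, len(chrom))
    return chrom[point:] + chrom[:point]


rng = random.Random(0)
population = [encode([0.2, 0.8]), encode([1.0, 0.0]), encode([0.5, 0.5])]
parent1, parent2 = select(population, [3.0, 1.0, 2.0], rng)
child = invert(mutate(crossover(parent1, parent2, rng), 0.05, rng), rng)
print(decode(child))
```

Gray coding means neighboring weight levels differ in a single bit, so one mutated bit tends to produce a small change in the decoded weight rather than a large jump.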
Weighted Feature Learning -Extensible
Markov Model (WFL-EMM)
GA Learning Process
Fitness of the chromosome
A chromosome is first decoded into a weight vector u. This u is applied
to generate a GEMM from the training data. The fitness is then
calculated as either the mean absolute deviation (MAD) or the root mean square
error (RMSE) on the testing data. The fittest weight vector u is
located during the evolution of the GA.
[Equation: fitness formula and its symbol definitions not preserved]
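The two error measures named above have standard definitions, sketched here for paired predicted and observed intensities. The exact fitness formula from the slide is not available, so this shows only the textbook forms; the sample numbers are illustrative.

```python
import math


def mad(predicted, observed):
    """Mean absolute deviation between predictions and observations."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


# Illustrative intensities in knots.
predicted = [30, 40]
observed = [35, 30]
print(mad(predicted, observed), rmse(predicted, observed))
```

RMSE penalizes large misses more heavily than MAD, so the choice between them affects which weight vectors the GA favors.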
Results
- Experiment 2: Evaluating WFL-EMM using k-fold cross-validation over
the 1982 to 2003 dataset (with MAD as the fitness measure).
Results
It is interesting to look at the weights of the features because these weights
reveal information about what the main drivers of intensity change might be.
Learn feature weights using Genetic Algorithm.
[Figure: weights for features over time]
PIIH – Prediction Intensity Interval Model for Hurricanes
Historic hurricane data
Features
• Current wind speed
• Various temperatures
• Time of the year
• Direction of movement
• GOES Satellite Data (IR)
Currently 23 features from the
Statistical Hurricane Intensity
Prediction Scheme (SHIPS)
TRACDS™: data stream clustering + temporal order model
Prediction using PIIH – Irene (2011)
Current features of hurricane
Aggregate possible future scenarios into a prediction
PIIH Output for Irene (2011)
Model       MAD     MSE
PIIH        14.28   310.79
SHIFOR 5*   12.64   229.49
LGEM        15.06   411.73
SHIPS       14.80   319.64
D-SHIPS     17.11   500.36
MAD … Mean absolute deviation
MSE … Mean squared error
* Baseline model
PIIH Advantages
• Real Time
• Dynamic
• Machine Learning
• Confidence Bands
• By analyzing the 2011 storms through Nate, we observed
the following:
– 96.33% of observations fell within the 95% confidence band
– 92.8% of observations fell within the 90% confidence band
– 74.27% of observations fell within the 68% confidence band
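Coverage figures like those above can be computed by counting how many observations fall inside their predicted band. The sketch below assumes each observation comes with a lower and an upper band limit; the function name and the numbers are illustrative.

```python
def coverage(observed, lower, upper):
    """Fraction of observations that lie inside their [lower, upper] band."""
    inside = sum(lo <= x <= hi for x, lo, hi in zip(observed, lower, upper))
    return inside / len(observed)


# Illustrative intensities (kt) with per-observation band limits.
observed = [50, 60, 90, 70]
lower = [40, 55, 60, 75]
upper = [60, 70, 80, 85]
print(coverage(observed, lower, upper))
```

A well-calibrated band should yield an empirical coverage close to its nominal level, which is the comparison the percentages above report.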
PIIH online
http://IDA.lyle.smu.edu/PIIH/
Future Work
1. Deploy model with NOAA
• Add decay model over land
• Evaluate additional features
• Predict rapid intensification
• Interface with NOAA’s systems
2. Improve the TRACDS™ model
• Data stream clustering
• Higher-order effects
• Improve model selection and outlier handling
Thank you!
http://www.lyle.smu.edu/IDA