EMM - Southern Methodist University

Spatiotemporal Stream Mining Using EMM
Margaret H. Dunham
Southern Methodist University
Dallas, Texas 75275
mhd@engr.smu.edu
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
4/24/09 - KSU
Completely Data Driven Model
 No assumptions about data
 We only know the general format of the data
 THE DATA WILL TELL US WHAT THE MODEL
SHOULD LOOK LIKE!
Motivation
• A growing number of applications generate streams of data:
  • Computer network monitoring data
  • Call detail records in telecommunications (Cisco VoIP 2003)
  • Highway transportation traffic data (MnDot 2005)
  • Online web purchase log records (JCPenney 2003, Travelocity 2005)
  • Sensor network data (Ouse, Derwent 2002)
  • Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions
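EMM Build
[Figure: EMM Build example. Input vectors shown: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>, … The resulting graph has nodes N1, N2, N3 whose arcs are labeled with transition probabilities such as 1/1, 1/2, 1/3, and 2/3.]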
Spatiotemporal Stream Mining Using EMM
 Spatiotemporal Stream Data
 EMM vs MM vs other dynamic MM
techniques
 EMM Overview
 EMM Applications
Spatiotemporal Environment
• Observations arriving in a stream
• At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
  Vt = <S1t, S2t, ..., Snt>

      V1    V2    …    Vq
S1    S11   S12   …    S1q
S2    S21   S22   …    S2q
…     …     …     …    …
Sn    Sn1   Sn2   …    Snq
      Time →
Data Stream Modeling Requirements
• Single pass: each record is examined at most once
• Bounded storage: limited memory for storing the synopsis
• Real-time: per-record processing time must be low
• Summarization (synopsis) of data
• Use the data, NOT a SAMPLE
• Temporal and spatial
• Dynamic
• Continuous (infinite stream)
• Learn
• Forget
• Sublinear growth rate - clustering
MM
A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process depends solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
• S = {N1, N2, …, Nm}, and
• A = {Lij | i ∈ {1, 2, …, m}, j ∈ {1, 2, …, m}}, where each arc Lij = <Ni, Nj> is labeled with a transition probability Pij = P(Nj | Ni).
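As a minimal sketch of the definition above (not code from the talk), the transition probabilities Pij can be estimated from an observed state sequence by counting transitions. The function name `transition_probabilities` and the toy sequence are illustrative assumptions.

```python
from collections import Counter, defaultdict

def transition_probabilities(states):
    """Estimate first-order transition probabilities P(Nj | Ni) from a state sequence."""
    pair_counts = Counter(zip(states, states[1:]))  # occurrences of each arc <Ni, Nj>
    out_counts = Counter(states[:-1])               # transitions leaving each Ni
    probs = defaultdict(dict)
    for (src, dst), count in pair_counts.items():
        probs[src][dst] = count / out_counts[src]
    return dict(probs)

# Hypothetical toy sequence of visited states.
sequence = ["N1", "N2", "N1", "N3", "N1", "N2"]
P = transition_probabilities(sequence)
```

Each row of `P` sums to 1, because every transition leaving a state is counted exactly once in the denominator.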
Problem with Markov Chains
• The required structure of the MC may not be certain at model construction time.
• As the real world being modeled by the MC changes, so should the structure of the MC.
• Not scalable: grows linearly with the number of events.
• Our solution:
  • Extensible Markov Model (EMM)
  • Cluster real world events
  • Allow the Markov chain to grow and shrink dynamically
Extensible Markov Model (EMM)
 Time Varying Discrete First Order Markov Model
 Nodes (Vertices) are clusters of real world
observations.
 Learning continues during application phase.
 Learning:
 Transition probabilities between nodes
 Node labels (centroid/medoid of cluster)
 Nodes are added and removed as data arrives
Related Work
• Splitting Nodes in HMMs
  • Create new states by splitting an existing state
  • M.J. Black and Y. Yacoob, “Recognizing facial expressions in image sequences using local parameterized models of image motion,” Int. Journal of Computer Vision, 25(1), 1997, 23-48.
• Dynamic Markov Modeling
  • States and transitions are cloned
  • G. V. Cormack and R. N. S. Horspool, “Data compression using dynamic Markov modeling,” The Computer Journal, Vol. 30, No. 6, 1987.
• Augmented Markov Model (AMM)
  • Creates new states if the input data has never been seen in the model, and transition probabilities are adjusted
  • Dani Goldberg and Maja J. Mataric, “Coordinating mobile robot group behavior using a model of interaction dynamics,” Proceedings of the Third International Conference on Autonomous Agents (Agents ’99), Seattle, Washington.
EMM vs AMM
Our proposed EMM model is similar to AMM, but is more
flexible:
 EMM continues to learn during the application phase.
 The EMM is a generic incremental model whose nodes can
have any kind of representatives.
 State matching is determined using a clustering technique.
 EMM not only allows the creation of new nodes, but deletion
(or merging) of existing nodes. This allows the EMM model to
“forget” old information which may not be relevant in the
future. It also allows the EMM to adapt to any main memory
constraints for large scale datasets.
 EMM performs one scan of data and therefore is suitable for
online data processing.
EMM Operations
• Input: EMM
• Output: EMM’
• EMM Build – Modify/add nodes/arcs based on input observations
• EMM Prune – Remove nodes/arcs
• EMM Merge – Combine multiple EMM nodes
• EMM Split – Split a node into multiple nodes
• EMM Age – Modify relative weights of old versus new observations
• EMM Combine – Merge multiple EMMs by merging specific states and transitions
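The Build operation can be sketched roughly as follows. This is an illustrative toy, not the rEMM implementation: the class name `SimpleEMM`, the Euclidean distance threshold used for state matching, the running-mean centroid update, and the threshold value 2.0 are all assumptions. The four input vectors are taken from the EMM Build slide.

```python
import math

class SimpleEMM:
    """Toy EMM: nodes are cluster centroids, arcs carry transition counts."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed rule: join a node if distance <= threshold
        self.centroids = []         # node labels (cluster centroids)
        self.node_counts = []       # CN_i: observations assigned to node i
        self.arc_counts = {}        # CL_ij: observed transitions i -> j
        self.current = None         # index of the current node

    def _nearest(self, vector):
        best, best_dist = None, math.inf
        for i, centroid in enumerate(self.centroids):
            d = math.dist(vector, centroid)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def build(self, vector):
        """Process one observation: match or create a node, then update the arc."""
        node, dist = self._nearest(vector)
        if node is None or dist > self.threshold:
            # No sufficiently close cluster: add a new node.
            self.centroids.append(list(vector))
            self.node_counts.append(0)
            node = len(self.centroids) - 1
        else:
            # Fold the observation into the matched centroid (running mean).
            n = self.node_counts[node]
            self.centroids[node] = [
                (c * n + v) / (n + 1) for c, v in zip(self.centroids[node], vector)
            ]
        self.node_counts[node] += 1
        if self.current is not None:
            key = (self.current, node)
            self.arc_counts[key] = self.arc_counts.get(key, 0) + 1
        self.current = node
        return node

    def transition_probability(self, i, j):
        """Estimate P_ij as CL_ij over all transitions leaving node i."""
        out = sum(c for (src, _), c in self.arc_counts.items() if src == i)
        return self.arc_counts.get((i, j), 0) / out if out else 0.0

emm = SimpleEMM(threshold=2.0)
for v in [(18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0),
          (16, 9, 2, 3, 1, 0, 0), (14, 8, 2, 3, 1, 0, 0)]:
    emm.build(v)
```

With this arbitrary threshold the first three vectors fall into one node and the fourth creates a second node. Prune, Merge, Split, and Age would operate on the same `centroids`, `node_counts`, and `arc_counts` structures.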
Example from rEMM (R Package Available)
      Loc_1  Loc_2  Loc_3  Loc_4  Loc_5  Loc_6  Loc_7
 1    20     50     100    30     25     4      10
 2    20     80     50     20     10     10     10
 3    40     30     75     20     30     20     25
 4    15     60     30     30     10     10     15
 5    40     15     25     10     35     40     9
 6    5      5      40     35     10     5      4
 7    0      35     55     2      1      3      5
 8    20     60     30     11     20     15     10
 9    45     40     15     18     20     20     15
10    15     20     40     40     10     10     14
11    5      45     55     10     10     15     0
12    10     30     10     4      15     15     10
Courtesy Mike Hahsler
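EMM Prune
[Figure: EMM Prune example with nodes N1, N2, N3, N5, N6. Before pruning, the arc labels shown are 1/3, 1/3, 2/2, 1/3, 1/2. After "Delete N2", nodes N1, N3, N5, N6 remain, and the arc labels shown are 1/3, 1/3, 1/6, 1/6, 1/3, 1/6.]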
Artificial Data
[Figure: plot of artificial data; x-axis from −0.2 to 1.0.]
EMM Advantages
• Dynamic
• Adaptable
• Use of clustering
• Learns rare events
• Sublinear growth rate
• Creation/evaluation in quasi-real time
• Distributed / hierarchical extensions
• Overlap of learning and testing
EMM Applications
 Predict – Forecast future state values.
 Evaluate (Score) – Assess degree of model
compliance. Find the probability that a new
observation belongs to the same class of data
modeled by the given EMM.
 Analyze – Report model characteristics
concerning EMM.
 Visualize – Draw graph
 Probe – Report specific detailed information
about a state (if available)
EMM Results
 Predicting Flooding
Ouse and Derwent – River flow data from England
http://www.nercwallingford.ac.uk/ih/nrfa/index.html
 Rare Event Detection
VoIP Traffic Data obtained at Cisco Systems
Minnesota Traffic Data
 Classification
DNA/RNA Sequence Analysis
Derwent River (UK)
[Figure: site labels 28023, 28043, 28117, 28011, 28048, 28010. Plot of the number of states in the model (0 to 800) against the number of input data (total 1574), with one curve for each threshold: 0.994, 0.995, 0.996, 0.997, 0.998.]
Sublinear Growth Rate
Number of states by data set, similarity measure (Sim), and threshold:

                      Threshold
Data     Sim       0.99   0.992  0.994  0.996  0.998
Derwent  Jaccrd    156    190    268    389    667
         Dice      72     92     123    191    389
         Cosine    11     14     19     31     61
         Ovrlap    2      2      3      3      4
Ouse     Jaccrd    56     66     81     105    162
         Dice      40     43     52     66     105
         Cosine    6      8      10     13     24
         Ovrlap    1      1      1      1      1
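The talk does not state which formulas were used for the four similarity measures. The sketch below assumes the common numeric-vector forms of Jaccard, Dice, Cosine, and Overlap; the function names are illustrative, and the two sample vectors are borrowed from the EMM Build slide.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def jaccard(a, b):
    ab = _dot(a, b)
    return ab / (_dot(a, a) + _dot(b, b) - ab)

def dice(a, b):
    return 2 * _dot(a, b) / (_dot(a, a) + _dot(b, b))

def cosine(a, b):
    return _dot(a, b) / math.sqrt(_dot(a, a) * _dot(b, b))

def overlap(a, b):
    # Can exceed 1 for numeric (non-binary) vectors.
    return _dot(a, b) / min(_dot(a, a), _dot(b, b))

# All four functions assume non-zero vectors; a zero vector would divide by zero.
a = [18, 10, 3, 3, 1, 0, 0]
b = [14, 8, 2, 3, 1, 0, 0]
scores = [f(a, b) for f in (jaccard, dice, cosine, overlap)]
```

Under these assumed definitions, Jaccard ≤ Dice ≤ Cosine ≤ Overlap for non-negative vectors. If the same forms were used in the experiments, that ordering would be consistent with Jaccard producing the most states and Overlap the fewest at a fixed threshold.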
Prediction Error Rates
• Normalized Absolute Ratio Error (NARE)
$$NARE = \frac{\sum_{t=1}^{N} |O(t) - P(t)|}{\sum_{t=1}^{N} O(t)}$$
• Root Mean Square (RMS)
$$RMS = \sqrt{\frac{\sum_{t=1}^{N} \left(O(t) - P(t)\right)^2}{N}}$$
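A minimal sketch of the two error measures, assuming O(t) and P(t) are given as equal-length lists of observed and predicted values. The function names and the sample numbers are illustrative, not data from the experiments.

```python
import math

def nare(observed, predicted):
    """Normalized Absolute Ratio Error: sum |O - P| divided by sum O."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / sum(observed)

def rms(observed, predicted):
    """Root mean square error over the N paired values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical observed and predicted values.
observed = [2.0, 4.0, 4.0]
predicted = [1.0, 4.0, 6.0]
nare_value = nare(observed, predicted)  # (1 + 0 + 2) / 10
rms_value = rms(observed, predicted)    # sqrt((1 + 0 + 4) / 3)
```

NARE is scale-free because it divides by the total observed value, while RMS stays in the units of the data and penalizes large individual errors more heavily.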
EMM Performance – Prediction (Ouse)

Model          NARE       RMS       No of States
RLF            0.321423   1.5389
EMM Th=0.95    0.068443   0.43774   20
EMM Th=0.99    0.046379   0.4496    56
EMM Th=0.995   0.055184   0.57785   92
EMM Water Level Prediction – Ouse Data
[Figure: water level (m), 0 to 8, plotted over the input time series (points 1 to 667). Series: RLF Prediction, EMM Prediction, Observed.]
Rare Event
 Rare - Anomalous – Surprising
 Out of the ordinary
 Not outlier detection
 No knowledge of data distribution
 Data is not static
 Must take temporal and spatial values into
account
 May be interested in sequence of events
 Ex: Snow in upstate New York is not rare
 Snow in upstate New York in June is rare
 Rare events may change over time
Rare Event Examples
• The amount of traffic through a site in a particular time interval is extremely high or low.
• The type of traffic (i.e., source IP addresses or destination addresses) is unusual.
• Current traffic behavior is unusual based on recent previous traffic behavior.
• Unusual behavior at several sites.
Rare Event Detection Applications
 Intrusion Detection
 Fraud
 Flooding
 Unusual automobile/network traffic
Our Approach
• By learning what is normal, the model can predict what is not
• Normal is based on likelihood of occurrence
• Use EMM to build a model of behavior
• We view a rare event as:
  • An unusual event
  • A transition between event states that does not frequently occur
• Base rare event detection on determining events or transitions between events that do not frequently occur
• Continue learning
EMMRare
 EMMRare algorithm indicates if the current input event is
rare. Using a threshold occurrence percentage, the input
event is determined to be rare if either of the following
occurs:
 The frequency of the node at time t+1 is below this
threshold
 The updated transition probability of the MC transition
from node at time t to the node at t+1 is below the
threshold
Determining Rare
Occurrence Frequency (OFc) of a node Nc:
$$OF_c = \frac{CN_c}{\sum_i CN_i}$$
Normalized Transition Probability (NTPmn), from one state, Nm, to another, Nn:
$$NTP_{mn} = \frac{CL_{mn}}{\sum_i CN_i}$$
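The two ratios can be computed directly from the model's counters. This sketch assumes CN_i is the number of observations assigned to node N_i and CL_mn is the number of observed transitions from N_m to N_n. The function names and the toy counts are hypothetical.

```python
def occurrence_frequency(node_counts, c):
    """OF_c = CN_c / sum_i CN_i."""
    return node_counts[c] / sum(node_counts.values())

def normalized_transition_probability(node_counts, arc_counts, m, n):
    """NTP_mn = CL_mn / sum_i CN_i (0 if the arc was never observed)."""
    return arc_counts.get((m, n), 0) / sum(node_counts.values())

# Hypothetical counts: 10 observations spread over three nodes.
node_counts = {"N1": 6, "N2": 3, "N3": 1}
arc_counts = {("N1", "N2"): 3, ("N2", "N1"): 2, ("N1", "N1"): 2,
              ("N1", "N3"): 1, ("N3", "N1"): 1}

of_n3 = occurrence_frequency(node_counts, "N3")
ntp_n1_n3 = normalized_transition_probability(node_counts, arc_counts, "N1", "N3")
```

Because NTP divides by the total node count rather than by the count of the source node, values for different arcs are comparable across the whole model.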
EMMRare
Given:
• Rule#1: CNi <= thCN
• Rule#2: CLij <= thCL
• Rule#3: OFc <= thOF
• Rule#4: NTPmn <= thNTP
Input:
  Gt: EMM at time t
  i: Current state at time t
  R = {R1, R2, …, RN}: A set of rules
Output:
  At: Boolean alarm at time t
Algorithm:
  At = 1 if Ri = True
  At = 0 if Ri = False
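One way to read the algorithm above is that the alarm is raised when any rule in R evaluates to true. That "any rule" reading is an assumption here, chosen to match the earlier statement that an event is rare if either condition holds. The function name, dictionary keys, and numeric values below are hypothetical.

```python
def emm_rare_alarm(stats, thresholds):
    """Return 1 if at least one rule fires for the current event, else 0."""
    rules = [
        stats["CN"] <= thresholds["thCN"],    # Rule#1: node count
        stats["CL"] <= thresholds["thCL"],    # Rule#2: arc count
        stats["OF"] <= thresholds["thOF"],    # Rule#3: occurrence frequency
        stats["NTP"] <= thresholds["thNTP"],  # Rule#4: normalized transition probability
    ]
    return int(any(rules))

# Hypothetical thresholds and two hypothetical events.
thresholds = {"thCN": 1, "thCL": 1, "thOF": 0.05, "thNTP": 0.02}
common_event = {"CN": 40, "CL": 25, "OF": 0.40, "NTP": 0.25}
rare_event = {"CN": 40, "CL": 1, "OF": 0.40, "NTP": 0.01}

alarm_common = emm_rare_alarm(common_event, thresholds)
alarm_rare = emm_rare_alarm(rare_event, thresholds)
```

Here the second event reaches a frequently visited node through a rarely used transition, so only the arc-based rules fire.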
VoIP Traffic Data
12/13/05
Rare Event in Cisco Data
Temporal Heat Map
• Also called Temporal Chaos Game Representation (TCGR)
• Temporal Heat Map (THM) is a visualization technique for streaming data derived from multiple sensors.
• It is a two-dimensional structure similar to an infinite table.
• Each row of the table is associated with one sensor value.
• Each column of the table is associated with a point in time.
• Each cell within the THM is a color representation of the sensor value.
• Colors are normalized (in our examples):
  • 0 – White
  • 0.5 – Blue
  • 1.0 – Red
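The color scale can be sketched as a piecewise-linear mapping. The slide fixes only the three anchor colors; the linear blending between them, the specific RGB triples, and the name `thm_color` are assumptions.

```python
def thm_color(value):
    """Map a normalized value in [0, 1] to an RGB triple.

    Anchors: 0 -> white, 0.5 -> blue, 1.0 -> red; linear blend in between (assumed).
    """
    v = min(max(value, 0.0), 1.0)  # clamp out-of-range inputs
    white, blue, red = (255, 255, 255), (0, 0, 255), (255, 0, 0)
    if v <= 0.5:
        lo, hi, t = white, blue, v / 0.5
    else:
        lo, hi, t = blue, red, (v - 0.5) / 0.5
    return tuple(round(a + (b - a) * t) for a, b in zip(lo, hi))

# Rows are sensors, columns are points in time (hypothetical normalized readings).
readings = [[0.0, 0.5, 1.0],
            [1.0, 0.0, 0.5]]
heat_map = [[thm_color(v) for v in row] for row in readings]
```

Each inner list of `heat_map` is one table row, so appending a new column per time step extends the map as the stream arrives.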
Cisco – Internal VoIP Traffic Data
• Complete Stream: CiscoEMM.png
• VoIP traffic data was provided by Cisco Systems and represents logged VoIP traffic in their Richardson, Texas facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17 11:29:11 2003.
[Figure axes: Values →, Time →]
Rare Event Detection
Minnesota DOT Traffic Data
Detected unusual weekend traffic pattern
[Figure: panels labeled Weekdays and Weekend.]
TCGR Example
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window (counts)
             A     C     G     T
Pos 0-8      2     3     3     1
Pos 1-9      1     3     3     2
…
Pos 34-42    2     4     2     1

Moving Window (normalized)
             A     C     G     T
Pos 0-8      0.4   0.6   0.6   0.2
Pos 1-9      0.2   0.6   0.6   0.4
…
Pos 34-42    0.4   0.8   0.4   0.2
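The moving-window counts in the table can be reproduced with a short sliding-window sketch. The function name `window_counts` is illustrative. The divisor 5 for the normalized values is inferred from the table (for example, a count of 2 appears as 0.4) and is not stated in the talk.

```python
def window_counts(sequence, window=9, alphabet="acgt"):
    """Count each symbol in every sliding window of the given length."""
    return [
        {s: sequence[i:i + window].count(s) for s in alphabet}
        for i in range(len(sequence) - window + 1)
    ]

seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
counts = window_counts(seq)  # one count vector per window position

# Normalization as implied by the table values (assumed divisor of 5).
scaled = [{s: c / 5 for s, c in row.items()} for row in counts]
```

The 43-character sequence yields 35 windows (Pos 0-8 through Pos 34-42), and each count vector becomes one column of the TCGR.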
TCGR Example (cont’d)
TCGRs for Sub-patterns of length 1, 2, and 3
TCGR Example (cont’d)
[Figure: TCGR with rows labeled A, C, G, T.]
Window 0: Pos 0-8     acgtgcacg
Window 1: Pos 1-9     cgtgcacgt
Window 17: Pos 17-25  tccggaacc
Window 18: Pos 18-26  ccggaacca
Window 34: Pos 34-42  ccacgtcga
TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure: TCGR panels for C. elegans, Homo sapiens, Mus musculus, and All Mature; pattern labels ACG, CGC, GCG, UCG.]
Research Approach
1. Represent potential miRNA sequence with
TCGR sequence of count vectors
2. Create EMM using count vectors for known
miRNA (miRNA stem loops, miRNA targets)
3. Predict unknown sequence to be miRNA (miRNA
stem loop, miRNA target) based on normalized
product of transition probabilities along clustering
path in EMM
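Step 3 can be sketched as scoring a path through the EMM. The talk does not say how the product is normalized; the sketch below assumes a geometric mean, which normalizes by path length. The function name `path_score` and the toy probabilities are hypothetical.

```python
import math

def path_score(path, transition_prob):
    """Geometric mean of transition probabilities along a node path (assumed normalization)."""
    probs = [transition_prob.get((a, b), 0.0) for a, b in zip(path, path[1:])]
    if not probs or min(probs) == 0.0:
        return 0.0  # an unseen transition makes the whole path score zero
    return math.exp(sum(math.log(p) for p in probs) / len(probs))

# Hypothetical transition probabilities from an EMM built on known miRNA.
transition_prob = {("N1", "N2"): 0.5, ("N2", "N3"): 0.5, ("N3", "N1"): 1.0}
score = path_score(["N1", "N2", "N3", "N1"], transition_prob)
```

Normalizing by length keeps scores comparable between sequences that produce paths of different lengths, and a cut-off on the score would then separate predicted miRNA from non-miRNA.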
Related Work [1]
• Predicted occurrence of pre-miRNA segments from a set of hairpin sequences
• No assumptions about biological function or conservation across species.
• Used SVMs to differentiate the structure of hairpin segments that contained pre-miRNAs from those that did not.
• Sensitivity of 93.3%
• Specificity of 88.1%
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
Preliminary Test Data [1]
• Positive Training: This dataset consists of 163 human pre-miRNAs with lengths of 62-119.
• Negative Training: This dataset was obtained from protein coding regions of human RefSeq genes. As these are from coding regions, it is likely that there are no true pre-miRNAs in this data. This dataset contains 168 sequences with lengths between 63 and 110 characters.
• Positive Test: This dataset contains 30 pre-miRNAs.
• Negative Test: This dataset contains 1000 randomly chosen sequences from coding regions.
[1] C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.
TCGRs for Xue Training Data
[Figure: TCGR panels labeled POSITIVE and NEGATIVE.]
TCGRs for Xue Test Data
[Figure: TCGR panels labeled POSITIVE and NEGATIVE.]