Finding Repeated Structure in Time Series: Algorithms and Applications

advertisement
Finding Repeated
Structure in Time Series:
(we will start 5 min
Algorithms and late to allow folks
to find the room)
Applications
Tutorial 3
发现时间序列中的重复模式 : 算法与应用
Abdullah Mueen
University of New Mexico, USA
Eamonn Keogh
University of California Riverside, USA
Slides available at
http://www.cs.unm.edu/~mueen/Tutorial/
Funding by NSF
IIS-1161997 II
1
Anouncements
• There will be four Q&A segments. Please hold
your question till the next segment.
• There is a feedback form on every chair.
Negative/positive, anonymous/known
feedbacks are welcome.
• Last (i.e. after lunch) part of the tutorial will
be given in Madrid 2. I will remind you one
more time.
2
25.1750
25.2250
25.2500
25.2500
25.2750
25.3250
25.3500
25.3500
25.4000
25.4000
25.3250
25.2250
25.2000
25.1750
..
..
24.6250
24.6750
24.6750
24.6250
24.6250
24.6250
24.6750
24.7500
What are Time Series?
A time series is a collection of
observations made sequentially in
time.
29
28
27
26
25
24
23
0
50
100
150
200
250
300
350
400
450
500
3
Repeated Pattern (Motif)
时间序列中的重复模式
Find the subsequences having very high
similarity to each other.
30
20
10
0
0
0
2000
100
4000
200
300
6000
400
500
8000
600
10000
700
800
Finding Motifs in Time Series, Jessica Lin, Eamonn Keogh, Stefano Lonardi, Pranav Patel, KDD 2002
4
General Outline
• Applications (~40 minutes)
– As Subroutines in Data Mining
– In Other Scientific Research
• Algorithms (~100 minutes)
– Uni-dimensional
– Multi-dimensional
– Open Problems
• Demonstration (~10 minutes)
5
Applications Outline
• Applications
– As Subroutines in Data Mining
•
•
•
•
Never Ending Learning
Time Series Clustering
Rule Discovery
Dictionary Building
– In Other Scientific Research
•
•
•
•
Data center chiller management
Worm locomotion analysis
Physiological Prediction
Activity recognition
– Motifs in Other Data-types
• Audio
• Shapes
• Motion
6
Motifs allow us to learn, forever, without an explicit teacher…
If you have parallel texts, then over time you
can learn a dictionary with high accuracy.
…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …
…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena . Y llamó Dios a …
7
Motifs allow us to learn, forever, without an explicit teacher…
If you have parallel texts, then over time you
can learn a dictionary with high accuracy.
…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …
…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena . Y llamó Dios a …
Note the mapping is non-linear, the learning algorithms in this domain are non-trivial.
Suppose however that the unknown “language” is
not discrete, but real-valued time series? In this
case, repeated pattern discovery can help*…
8
*Yuan Hao, Yanping Chen et al (2013). Towards Never-Ending Learning from Time Series Streams. SIGKDD 2013
Motifs allow us to learn, forever, without an explicit teacher…
P (X-axis acceleration of wrist)
Real-valued
“text”
5 seconds
B (RFID tag)
on
off
Glove
on
off
Iron
(omitted)
0
1000
2000
Binary “text”
3000
This dataset contains standard IADL housekeeping activities
(vacuuming, ironing, dusting, brooming,, watering plants etc).
We have a discrete (binary) “text” that notes if the hand is near
a cleaning instrument, and a real-valued accelerometer “text”
9
We can run motif discovery on the time series stream. If we find
motifs, we can see if they correlate with the discrete streams…
C1 observed
C1 observed
C1 observed
5 seconds
P (X-axis acceleration of right wrist)
B (RFID tag)
on
off
Glove
on
off
Iron
(omitted)
0
1000
Glove 1/1
Iron 0/1
(omitted)
2000
Glove 2/2
Iron 0/2
(omitted)
3000
Glove 3/3
Iron 0/3
(omitted)
In this snippet, the motifs seem to correlate with the presence
of a glove…
10
How well does this work?
Over an hour of activity, we learn to recognize a behavior in the time series that
indicates the user is putting on a glove.
Events shown in last slide take place here
1
0.5
0
P(X-axis acceleration of right wrist)
P(C1=glove) = 0.70
glove
P(C1=iron) = 0.23
iron
fan
0 minutes
P(C1=fan) = 0.07
55 minutes
True positives
Discovered pattern
Note: There are
false positives, but
we considered
only a single axis
for simplicity.
False positives
Learned concept C1
0
200
400
0
200
11400
Motifs allow us to cluster subsequences of a time series…
0
200
400
600
800
1000
1200
And how would we evaluate our answer?
Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams
12
Requires Ignoring Some Data. ICDM 2011
Motifs allow us to cluster subsequences of a time series…
0
200
400
600
800
1000
1200
And how would we evaluate our answer?
== Poem ==
In a sort of Runic rhyme,
To the throbbing of the bells-Of the bells, bells, bells,
To the sobbing of the bells;
Keeping time, time, time,
As he knells, knells, knells,
In a happy Runic rhyme,
To the rolling of the bells,-Of the bells, bells, bells-To the tolling of the bells,
Of the bells, bells, bells, bells,
Bells, bells, bells,-To the moaning and the groaning of the bells.
Data: last 30 seconds from 4-min poem, The Bells by Edgar Allan Poe
Edgar Allan 13
Poe
Motifs allow us to cluster subsequences of a time series…
0
200
400
600
800
1000
1200
And how would we evaluate our answer?
== Poem ==
In a sort of Runic rhyme,
To the throbbing of the bells-Of the bells, bells, bells,
To the sobbing of the bells;
Keeping time, time, time,
As he knells, knells, knells,
In a happy Runic rhyme,
To the rolling of the bells,-Of the bells, bells, bells-To the tolling of the bells,
Of the bells, bells, bells, bells,
Bells, bells, bells,-To the moaning and the groaning of the bells.
== Text in each clusters ==
bells, bells, bells,
Bells, bells, bells,
Of the bells, bells, bells,
Of the bells, bells, bells-To the throbbing of the bells-To the sobbing of the bells;
To the tolling of the bells,
To the rolling of the bells,-To the moaning and the groantime, time, time,
knells, knells, knells,
sort of Runic rhyme,
groaning of the bells.
14
Motifs allow us to cluster subsequences of a time series…
0
200
400
600
800
1000
1200
Key observations that make this possible:
• Time Series Motifs!
• We are willing to allow some data to be unexplained by the clustering
• We score the possible clustering's with MDL, this is parameter-free!
• Allowing the clusters to be of different lengths/sizes
Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams
15
Requires Ignoring Some Data. ICDM 2011
Motifs are useful, but can we predict the future?
What happens next?
-10
-15
-20
-25
-30 0
1000
2000
3000
4000
5000
6000
7000
8000
(unpublished work, email Dr. Keogh for preprint)
Prediction vs. Forecasting (informal definitions)
Forecasting is “always on”, it constantly predicts a value say, two minutes out (we are not doing this)
Prediction only make a prediction occasionally, when it is sure what will happen next
16
Why Predict the (short-term) Future?
If a robot can predict that is it about to fall, it may
be able to..
• Prevent the fall
• Mitigate the damage of the fall
More importantly, if the robot can predict a
human’s actions
• The robot could catch the human!
• This would allow more natural human/robot
interaction.
• Real time is not fast enough for interaction!
We need to be a half second before real time.
•
• Other examples:
• Predict a car crash, tighten seatbelts, apply brakes
• Predict the next spoken word after ‘data’ is
‘mining’, then begin prefetching WebPages..
• etc
17
Previous attempts at this have largely failed…
However, we can do this, and time series
motifs are the key tool
The rule discovery technique will use:
• Time Series Motifs
• MDL (minimum description length)
• Admissible speed-up techniques (not
discussed here)
18
Let us start by finding motifs
-10
-15
-20
-25
-30 0
1000
2000
3000
4000
5000
6000
7000
8000
9000
Second occurrence
First occurrence
0
20
40
60
80
100
19
We can convert the motifs to a rule
-10
-15
-20
-25
-30 0
1000
2000
3000
4000
5000
6000
7000
8000
9000
We can use the motif to make a
rule…
0
20
40
60
80
IF we see thisshape, (antecedent)
THEN we see thatshape, (consequent)
within maxlag time
100
t1 = 7.58
0
20
40
60
0
20
40
maxlag = 0
The Euclidean distance between thisshape and
the observed window must be within a
threshold t1 = 7.58
20
We can monitor streaming data with our rule..
t1 = 7.58
0
20
40
60
0
20
40
maxlag = 0
8000
9000
21
685
The rule gets invoked…
t1 = 7.58
0
20
40
60
0
20
40
maxlag = 0
rule fires here
5735
predicting this shape’s occurrence
5785
5835
5
22
685
It seems to work!
t1 = 7.58
0
20
40
60
0
20
40
maxlag = 0
rule fires here
5735
predicting this shape’s occurrence
5785
5835
5
23
What is the ground truth?
The first verse of The Raven by Poe in MFCC space
Once upon a midnight dreary, while I pondered weak and weary..
-10
-15
-20
-25
-30 0
1000
2000
3000
..rapping at my chamber door……
4000
5000
6000
7000
8000
9000
t1 = 7.58
at
0
my
20
chamber
40
60
0
door
80
20
40
60
0
20
40
maxlag = 0
100
The phrase “at my chamber door” does appear 6 more times, and we do fire our rule
correctly each time, and have no false positives.
What are we invariant to?
• Who is speaking? Somewhat, we can handle other males, but females are tricky.
• Rate of speech? To a large extent, yes.
• Foreign accents? Sore throat? etc. Not yet.
24
t = 0.4
maxlag = 41sec
maxlag = 4 sec
-9
waiting for elevator
elevator moves up
walking away
-10
-11
0
stepping inside
-9
waiting for
40 elevator
y- acceleration
t1 = 0.4
y- acceleration
Why we need the Maxlag
parameter
stops walking away
elevatorelevator
moves up
80
120
-10
elevator stops
stepping inside
-11
0
40
80
120
Here the maxlag depends on the number
of floors we have in our building.
We can hand-edit this rule to generalize
for short buildings to tall buildings
Can physicians edit medical rules to
generalize from male to female…
25
This works, really!
maxlag = 20 minutes
15500
16000
16500
17000
0
120
0
180
IF we see a Clothes Washer used
THEN we will see Clothes Dryer used within 20 minutes
27
Insect Behavior Dictionary
Beet Leafhopper (Circulifer tenellus)
input resistor
conductive glue
to insect
to soil near plant
V
voltage source
voltage reading
2
0 0
6
plant membrane
1
50
100
150
200
Stylet
4
The electrical penetration graph or EPG is a
system used by biologists to study the
interaction of insects with plants.
3
2
1
0
-1
-2
4500
4000
3500
3000
2500
2000
1500
1000
500
0
-3
1
2
3
4
Additional examples
of the motif
5
5
6
7
8
x 104
15 minutes of EPG recorded on Beet Leafhopper
Instance at 20,925
Instance at 25,473
0
50
100
150
200
250
300
350
400
As a bead of sticky secretion, which is byproduct of sap feeding, is ejected, it
temporarily forms a highly conductive
bridge between the insect and the plant.
Exact Discovery of Time Series Motifs. A Mueen, et al. SDM, 2009.
29
Insect Behavior Dictionary
5
5
N
N
S
0
-5
3900
4000
0
4100
4200
3900
4000
4100
-5
4800
4900
Length :334
5
0
0
900
1000 1100
1200
900
1000 1100
1200
-5
600
800
Length :412
5
0
0
7000
7200
7400
7600
7000
Length :799
4800
4900
5000
5100
1000
600
800
1000
Length :565
5
-5
5100
Length :384
5
-5
5000
7200
7400
7600
-5
5200
5400
5600
5800
5200
5400
5600
5800
Length :799
More motifs reveal different feeding patterns of Beet Leafhopper.
30
Applications Outline
• Applications
– As Subroutines in Data Mining
•
•
•
•
Never Ending Learning
Time Series Clustering
Rule Discovery
Dictionary Building
– In Other Scientific Research
•
•
•
•
Data center chiller management
Worm locomotion analysis
Physiological Prediction
Activity recognition
– Motifs in Other Data-types
• Audio
• Shapes
• Motion
31
Sustainable Operation and Management of Data
Center Chillers using Temporal Data Mining
HP Labs with Virginia Tech
“Our primary goal is to link the time series temperature data gathered from
chiller units to high level sustainability characterizations… thus using time series
motifs as a crucial intermediate representation to aid in data reduction.”
“switching from motif 8 to motif 5 gives us a nearly
$40,000 in annual savings!” Patnaik et al. SIGKDD09
32
A dictionary of behavioral motifs reveals clusters
of genes affecting C. elegans locomotion
Laboratory of Molecular Biology, Cambridge,
United Kingdom
Goal: Detect genotype by
using the locomotion only.
Convert postures to four
dimensional time series.
Brown et al. PNAS 2013; 110(2): 791–796.
33
A dictionary of behavioral motifs reveals clusters of genes affecting C. elegans
locomotion
Brown A E X et al. PNAS
2013;110:791-796
34
Variability in motif structure is lower in juvenile Directed
than in Undirected and similar to that in adult song.
Amplitude Envelop
Social performance reveals unexpected vocal competency in young songbirds
Satoshi Kojima and Allison J. Doupe, PNAS 2011, 108 (4) 1687-1692.
35
Motif discovery in physiological datasets: A
methodology for inferring predictive elements
University of Michigan and MIT
We evaluated our solution on a population of patients who
experienced sudden cardiac death (SDDB) and attempted to
discover electrocardiographic activity that may be associated
with the endpoint of death. To assess the predictive patterns
discovered, we compared likelihood scores for time series motifs
in the sudden death population…
Zeeshan Syed et al. TKDD 2010
36
Motif discovery in physiological datasets: A
methodology for inferring predictive elements
University of Michigan and MIT
Using a maximum likelihood separator,
we were able to use our motif to
correctly identify 70% of the patients
who suffered sudden cardiac death
during 24 hours of recording while
classifying none of the normal
individuals, and only 8% of the patients
from the supraventricular dataset as
being at risk.
Zeeshan Syed et al. TKDD 2010
37
Constrained Motif Discovery in Time Series
Toyoaki Nishida, Kyoto University
“we use time series motifs to find gesture patterns with
applications to robot-human interactions” Okada,
Izukura and Nishida 2011
38
Discovering Characteristic Actions from On-Body
Sensor Data
David Minnen, Thad Starner, Irfan Essa, and Charles Isbell, Georgia Tech
Our algorithm successfully discovers motifs that correspond to the real
exercises with a recall rate of 96.3% and overall accuracy of 86.7% over
six exercises and 864 occurrences.
Minnen et al. Symp. Wearable Computers 2006
39
Motifs can Spot Dance Moves…
Choreography
D D
A
Step
Action
A
Side steps with no arm movement
B
Rock steps sideways without arm movement
C
Rock steps sideways with arm movement
D
Side steps with arm movement
E
Side steps with arms up in the air
F
Standing still with head bopping
B
D
E
F
C
D
F
E
C
D
F
D
C
A
B
E
A
A
Leg
z
y
x
0/2
0/2
2/4
Arm
z
y
x
1/4
1/2
1/2
Hand
z
y
x
0/3
0/4
0/2
z
y
x
0/4
0/4
0/4
Motifs are from the same Hip
dance steps or the same
transitions 86% of the time.
0
0.5
1
1.5
2
Time
Data: H. Pohl et al. SMC 2010
Algorithm: Mueen, ICDM 2013
2.5
3
x 104
40
Applications Outline
• Applications
– As Subroutines in Data Mining
•
•
•
•
Never Ending Learning
Time Series Clustering
Rule Discovery
Dictionary Building
– In Other Scientific Research
•
•
•
•
Data center chiller management
Worm locomotion analysis
Physiological Prediction
Activity recognition
– Motifs in Other Data-types
• Audio
• Shapes
• Motion
43
Motifs in Audio
Mice calls are inaudible and have significant noise
Manual inspection over temporal signal is impossible
Features like MFCC are not good for animal song
Just use the raw spectrogram to find repeated calls
44
Yuan Hao et al. Parameter-Free Audio Motif Discovery in Large Data Archives. ICDM 2013: 261-270
Motifs in Shapes
Lampasas River Cornertang
Projectile shapes
Algorithm detects a rare
cornertang segment – an
object that has long intrigued
anthropologists.
Petroglyphs
Algorithm detects similar
petroglyphs drawn across
continents and centuries
Castroville Cornertang
0
50
45
100 150 200 250 300 350
Motifs in Motion
Mueen et al. A disk-aware algorithm for time series motif discovery. Data
Min. Knowl. Discov. 2011.
Yankov et al. Detecting time series motifs under uniform scaling. KDD 2007
Two motion can be stitched
together by transiting from one
motif to the other, a very useful
technique for motion synthesis.
46
Time Series Motifs have 1,000 of Uses
• ..for discovering motifs in the music data is called the Mueen-Keogh (MK)
algorithm.. Cabredo et al. 2011
• we apply the MK motif algorithm to time series retrieved from seismic
signals… Cassisi et al 2012
• we take motif developed by Keogh in order to support a medical expert in
discovering interesting knowledge. Kitaguchi.
• for the problem of estimation of Micro-drilled Hole Wall of PWBs we take
the Motif method developed by Keogh... Toshiki et al. (fabrication)
• the most efficient motif provided a power savings of 41 This translates to
an annual reduction of 287 tons of CO2. Watson InterPACK09.
• We use Keogh's Motifs for unsupervised discovery of abnormal human
behavior in multi-dimensional time series data... Vahdatpour SDM 2010.
• variability of behavior, using motifs, provides more consistent groupings of
households across different clustering algorithms… Ian Dent 2014
47
Questions and Comments
48
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
49
A four-slide digression, to make sure you understand
what invariances are, and why they are important
50
Suppose we are walking in a
cemetery in Japan.
We see an interesting grave
marker, and we want to
learn more about it.
We can take a photo of it
and search a database….
51
Campana and Keogh (2010). A Compression Based Distance
Measure for Texture. SDM 2010.
52
In order to do this, we must have a
distance measure with the right
invariances
Color invariance
Occlusion invariance
Size invariance
53
Time Series Data has Unique Invariances
• These invariances are domain/problem dependent
• They include
–
–
–
–
–
–
–
Complexity invariance
Warping invariance
Uniform scaling invariance
Occlusion invariance
Rotation/phase invariance
Offset invariance
Z-normalization of each subsequence removes these
Amplitude invariance
• Sometimes you achieve the invariance in the distance
measure, sometimes by preprocessing the data.
• In this work, we will just assume offset/amplitude
invariance. See [a] for a visual tour of time series invariances.
54
[a] Batista, et al (2011) A Complexity-Invariant Distance Measure for Time Series. SDM 2011
Z-Normalization ensures scale and offset invariances
2
1
xˆi 
0
x  x
x
-1
Wandering Baseline
-2
3
Query
3
1
1
-1
-1
-3
-3
-5
0
100
-5
Distance to the query
Threshold = 30
120
80
40
0
0
500
1000
1500
Without Normalization only 75% of the beats are missed
55
Distance Measures
• The choices are
– Euclidean Distance
– Correlation
– Dynamic Time Warping
– Longest Common Subsequences
– Uniformly Scaled Euclidean Distance
– Sliding Nearest Neighbor Distance
56
Euclidean Distance Metric
Given two time series
x = x1…xn
x
and
y = y1…yn
their z-Normalized Euclidean distance is
defined as:
xˆi 
x  x
x
d ( x, y ) 
, yˆ i 
y  y
y
y
n
2
ˆ
ˆ
(
x

y
)
 i i
i 1
d(x,y)
Early abandoning reduces
number of operations
when minimizing
sum of squared error exceeded best2
2
0
-2
0
10
20
30
40
50
60
70
80
90
57
Pearson’s Correlation Coefficient
58
Relationship with Euclidean Distance
Minimizing z-normalized Euclidean distance and Maximizing
Pearson’s correlation coefficient are identical in effect for
motif discovery.
59
Euclidean Vs Dynamic Time Warping
Euclidean Distance
Sequences are aligned “one to one”.
“Warped” Time Axis
Nonlinear alignments are possible.
60
How is DTW
Calculated?
DTW (Q, C)  D(m, n)
D(i, j )  (qi  c j )2  min D(i, j 1), D(i 1, j ), D(i 1, j 1)
C
Q
C
Q
• Quadratic time complexity
Warping path w
• DTW is not a metric
61
Definition of Time Series Motifs
1. Length of the motif
2. Support of the motif
3. Similarity of the Pattern
4. Relative Position of the Pattern
2
1
0
-1
-2
20
40
60
80
100
120
140
160
180
200
62
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
63
Simplest Definition of Time Series
Motifs
Given a length, the most similar/least
distant pair of non-overlapping
subsequences
2
1
0
-1
-2
20
40
60
80
100
120
140
160
180
200
1. Length of the motif = Given
2. Support of the motif = 2
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
Exact Discovery of Time Series Motifs. A Mueen, et al. SDM, 2009.
64
Problem Formulation
time:1000
100
200
300
400
500
600
700
800
900
-7000
-7500
-8000
The most similar pair of
non-overlapping
subsequences
1000
1
2
3
4
5
6
7
8
.
.
.
873
The closest pair of points
in high dimensional
space
Optimal algorithm in two dimension : Θ(n log n)
For large dimensionality d, optimum algorithm is
effectively Θ(n2d)
65
Lower Bound
 If P, Q and R are three points in a d-space
d(P,Q)+d(Q,R) ≥ d(P,R)
P
Q

d(P,Q) ≥ |d(Q,R) - d(P,R)|
R
 A third point R provides a very inexpensive lower
bound on the true distance
 If the lower bound is larger than the existing best,
skip d(P, Q)
d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance
66
Circular Projection
Pick a reference point r
Circularly Project all points
on a line passing through the
reference point
9
20
7
16
22
4
15
21
5
18
17
10
r
19
14
3
2
8
1
12
Equivalent to computing
distance from r and then
sorting the points according
to distance
11
24
13
6
23
67
The Order Line
9
20
7
16
22
4
15
5
d(P, r)
r
P
k=3
k=2
k=1
r
BestPairDistance
2
11
13
6

|d(Q, r) - d(P, r)|
3
8
1
24
Q
0
r
14
12
d(Q, r)
18
17
10
19
21
23
k=1:n-1
•Compare every pair
having k-1 points in
between
•Do k scans of the order
line, starting with the 1st
to kth point
68
Correctness
• If we search for all offset=1,2,…,n-1 then all possible
pairs are considered.
– n(n-1)/2 pairs
• if for any offset=k, none of the k scans needs an
actual distance computation
then for the rest of the offsets=k+1,…,n-1 no
distance computation will be needed.
r
69
Performance
X103
4.5
4
3.5
3
2.5
2
1.5
1
0.5
0
-0.5
10
Brute Force
30
50
70
Brute Force with Optimized Distance Computation
X103
110
90
Our Algorithm (MK)
• Orders of Magnitude faster
• Exact in execution
• No sacrifice of the quality of the results
70
Multiple References
• Use multiple reference points for tighter lower
bounds.
Ptolemaic bound
x
y
Pruning by Ptolemaic Inequality
1010
Ptolemaic bound wins
109
108
107
106
Linear bound wins
105
r1
r2
105
106
107
108
109
1010
Pruning by Triangular Inequality
71
Pruning by Multiple References
R3
max( | d(P,Ri) - d(Q,Ri) | )
R2
Logarithm of time
(seconds)
P
Time to find the motif in a
database of 64,000
random walk time series
of length 1,024
8
7
6
Q
5
R1
0
10
20
30
40
Number of Reference Time
Series Used
3000
2500
2000
1500
1000
500
5
10
15
20
25
30
35
1800
1600
1400
1200
1000
800
600
400
200
0
0
50
60
R4
1400
1200
1000
800
600
400
200
5
10
15
20
25
30
35
0
0
5
10
15
20
25
30
35
72
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
73
Questions and Comments
74
Simplest Definition of Time Series
Motifs
The most similar/least distant pairs of
non-overlapping subsequences at all
lengths.
2
1
0
-1
-2
20
40
60
80
100
120
140
160
180
200
1. Length of the motif = Given All
2. Support of the motif = 2
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
75
Abdullah Mueen: Enumeration of Time Series Motifs of All Lengths. ICDM 2013: 547-556
Goals: Enumerating Motifs
1. Remove the length parameter
2. Search for motifs in a range of lengths and
report
– ALL of the motifs of all of the lengths
3. Retain Scalability
76
Bound on Extension
10
9
8
7
6
5
4
3
4
3
2
1
0
-1
-2
-3
-4
1
2
3
4
Without Normalization
5
Values Changed
1
2
3
4
5
With Normalization
77
Intuition
Area between blue and red is
the distance between the signals
Append 20 and re-normalize
Append 10 and re-normalize
Normalized
2
2
2
1.5
1.5
1.5
1
1
1
0.5
0.5
0.5
0
0
0
-0.5
-0.5
-0.5
-1
-1
-1
-1.5
-1.5
-1.5
-2 1
2
3
Length 4
4
5
-2
1
2
3
4
5
-2
1
2
3
4
5
Length 5
Length 5
Area shrinks
Area shrinks further
If infinity is appended to both the signals, they will
have zero area/distance.
78
Bounding Euclidean Distance
79
Experimental Validation of the
Bounds
Normalized Distance
35
30
25
20
15
25
24.5
24
23.5
23
22.5
22
21.5
21
20.5
20
2
10
5
0
2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3
x 105
x 105
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Pairs in ascending order of distances
Blue: Distances of random signals of length 255
Gray: Distances of the signals when they are extended by one random
sample
Red: The upper and lower bounds before observing the extensions
80
Intuition
n = 10000
-7
x 103
-7.5
-8
-8.5
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
Location 2
Location 1
Distance
145,
5410,
1.26
145,
5410,
1.79
8345,
4211,
1.63
8345,
4211,
2.63
8345,
4211,
1.63
145,
5410,
1.79
1655,
9461,
2.96
1655,
9461,
3.61
6531,
2501,
2.71
6531,
2501,
3.17
6531,
2501,
2.71
1655,
9461,
3.61
851,
1440,
3.73
851,
1440,
3.83
851,
1440,
3.83
2512,
3110,
3.98
2512,
3110,
4.18
2512,
3110,
4.18
1685,
9260,
4.57
1685,
9260,
4.27
1685,
9260,
4.27
----,
-----,
-----
----,
-----,
-----
81
Intuition
8345,
4211,
1.63
8345,
4211,
1.23
8345,
4211,
1.23
145,
5410,
1.79
145,
5410,
1.98
6531,
2501,
1.71
6531,
2501,
2.71
6531,
2501,
1.71
145,
5410,
1.98
1655,
9461,
3.61
1655,
9461,
3.68
851,
1440,
3.61
851,
1440,
3.83
851,
1440,
3.61
1655,
9461,
3.68
• Once in every 10 lengths, the exact ordered list is
required to be populated.
• This yields a 10x speed-up from running fixedlength motif discovery for all lengths.
82
Sanity Check
4
2
0
-2
White Noise
0
1000
2000
3000
4000
5000
5
5
5
5
0
0
0
0
(1)
-5
-5
-5
1380 1400 1420 1440 1460 1320 1340 1360 1380 1400 1420
Length :87
Length :105
(4)
(3)
(2)
600
6000
700
Length :299
800
-5
2200
2300
2400
Length :299
• Three Patterns planted in a random signal
with different scaling.
• The algorithm finds them appropriately.
http://www.cs.unm.edu/~mueen/Projects/MOEN/index.html
83
x 105
7
x 104
18
Smart Brute Force
EEG
EOG
Random Walk
Iterative MK
6
5
4
Execution Time in Seconds
Execution Time in Seconds
Experimental Results: Scalability
Smart Brute Force
EEG
Random Walk
EOG
16
14
12
10
3
2
MOEN
1
0
8
6
MOEN
4
2
0
0
2
4
6
8
10
Data Length (n)
12
14
16
x 104
1
2
3
4
5
6
7
8
9
10
x 102
Range of Lengths (maxLen-minLen+1)
84
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
85
Online Time Series Motifs
• Streaming time series
• Sliding window of the recent history
– What minute long trace repeated in the last hour?
Abdullah Mueen, Eamonn J. Keogh: Online discovery and maintenance of time series motifs. KDD 2010
86
Problem Formulation
Discovery
100
200
300
400
500
600
700
800
900
1000
-7000
-7500
-8000
1
2
3
4
5
6
7
8
.
.
.
873
The most similar pair of
non-overlapping
subsequences
Maintenance
time:1001
1
2
3
4
5
6
7
8
.
.
.
873
874
time:1002 time:1003
1
2
3
4
5
6
7
8
.
.
.
873
874
875
1
2
3
4
5
6
7
8
.
.
.
873
874
875
876
time:1004
1
2
3
4
5
6
7
8
.
.
.
873
874
875
876
877
time:1005
1
2
3
4
5
6
7
8
.
.
.
873
874
875
876
877
878
Update motif pair after
every time tick
time
87
Challenges
• A subsequence is a high dimensional point
– The dynamic closest pair of points problem
• Closest pair may change upon every update
• Naïve approach: Do quadratic comparisons.
Deletion
3
3
4
6
7
6
5
2
4
1
7
6
8
3
4
1
7
Insertion
5
8
2
5
9
8
2
88
Related Work
• Goal: Algorithm with Linear update time
• Previous method for dynamic closest pair
(Eppstein,00)
– A matrix of all-pair distances is maintained
• O(w2) space required
– Quad-tree is used to update the matrix
• Maintain a set of neighbors and reverse
neighbors for all points
• We do it in O( w w ) space
89
Maintaining Motif
• Motifs → Closest pair
• Upon insertion
– Find the nearest neighbor; Needs O(w) comparisons.
• Upon deletion
– Find the next NN of all the reverse NN
3
3
4
1
7
6
1
7
6
5
2
4
1
7
6
8
3
4
5
9
8
2
Insertion
5
9
8
2
Deletion
90
Data Structure
Total number of nodes
in R-lists is w
R-lists
Points that
should be
updated
(ptr, distance)
N-lists
Neighbors in order
of distances
7
8
4
5
7
1
1
2
3
4
5
4
7
5
2
3
8
6
8
5
7
1
4
6
3
7
1
6
4
8
1
5
1
7
5
3
2
8
6
2
1
7
8
4
3
6
4
6
3
2
6
7
8
8
3
2
7
5
1
4
1
3
4
5
2
8
6
2
6
5
7
1
3
4
FIFO Sliding window
3
4
1
7
6
5
8
2
91
Observations
While inserting
 Updating NN of old points is not necessary
 A point can be removed from the neighbor list if it
violates the temporal order
Average
length O( w)
92
Example
3
7
4
3
4
8
58
64
7
46
101
6
8
5
2
35
76
5
10
9
1
2
3
4
5
6
7
8
9
10
1
21
31
24
3
31
26
6
5
2
3
4
5
43
67
8
7
54
7
9
65
6
8
2
8
9
93
Insect
RW
EOG
EEG
0
0.5
1
1.5
2
2.5
3
3.5
Window Size (w)
Average Length of N-list
• Up to
4
x 104
0.7
0.6
0.5
0.4
Insect
EOG
RW
EEG
0.3
0.2
0.1
0
0
200
400
600
800
1000
Motif Length (m)
from general dynamic closest pair
per point with increasing window size
RW
110
100
90
80
70
60
50
40
30
20
10
EOG
Insect
EEG
0
Average Update Time (Sec)
0.18
0.16
0.14
0.12
0.1
0.08
0.06
0.04
0.02
0
0.5 1
1.5 2
2.5 3
Window Size (w)
3.5 4 4
x 10
Average Length of N-list
Average Update Time (Sec)
Results
RW
350
300
250
200
150
EOG
100
Insect
EEG
50
0
0
200
400
600
800
Motif Length (m)
1000
94
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
95
Questions and Comments
96
Finding Repeated
Structure in Time
Series: Algorithms and
Applications
Lunch break; We meet
back in Madrid 2 at
1:45pm
Abdullah Mueen
University of New Mexico, USA
Eamonn Keogh
University of California Riverside, USA
Slides available at: http://www.cs.unm.edu/~mueen/Tutorial/
97
General Outline
• Applications (~40 minutes)
– As Subroutines in Data Mining
– In Other Scientific Research
• Algorithms (~100 minutes)
– Uni-dimensional
– Multi-dimensional
• Demonstration (~10 minutes)
98
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
99
Simplest Definition of Time Series
Motifs
The non-overlapping subsequences at
all lengths having k or more τmatches.
2
1
0
-1
-2
20
40
60
80
100
120
140
160
180
200
1. Length of the motif = Given All
2. Support of the motif = 2 k and τ
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping100
Optimal algorithm is hard
• Search for locations of the τ-balls that contain
k subsequences
• NP-Hard
• Instead we search for a motif representative
that has k subsequences within τ
101
How do we find the motif
representative?
• Simply take one of the two occurrences as the
representative
• Take the average of the two
• Find all occurrences within a threshold of pair
and train a HMM to capture the concept
(Minnen’07)
102
Using each one in the pair as the
representative (k=4, τ=0.9)
EEG Data
4
2
0
3
2
-2
1
-4
EPG Data
0
50
100
150 0
5000
10000
0
50
100
150
5000
10000
4
3
5
2
4
0
2
1
-2
-4
0
It takes only k similarity searches to find other
occurrences. The overall complexity remains the same.
103
Finding top-K motif
•
•
Run MK for K times
Replace occurrences by random noise
between iterations
4000
3000
A
2000
1000
1000
2000
3000
4000
5000
6000
Length :155
C
Length :157
B
Length :255
Length :251
A A B A x A x A B A x A A x B x A A B A x B A x B A C AxC A A x C
104
Algorithms Outline
• Algorithms
– Definition, Distance Measures and Invariances
– Exact Algorithms
•
•
•
•
Fixed Length
Enumeration of All length
K-motif Discovery
Online Maintenance
– Approximate Algorithms
• Random Projection Algorithm
– Multi-dimensional Motif Discovery
– Open Problems
105
How do we find approximate motif in a
time series?
The obvious brute force search algorithm is just too slow…
Our algorithm is based on a hot idea from bioinformatics, random projection* and the fact
that SAX allows us to lower bound discrete representations of time series.
* J Buhler and M Tompa. Finding motifs using random projections. In
RECOMB'01. 2001.
Symbolic Aggregate ApproXimation
107
How do we obtain SAX?
C
C
0
40
60
80
100
120
c
First convert the time
series to PAA
representation, then
convert the PAA to
symbols
It take linear time
20
c
c
b
b
a
0
20
b
a
40
60
80
100
baabccbc
120
108
c
c
c
b
b
a
0
20
b
a
40
60
Time series subsequences tend to have a
highly Gaussian distribution
80
100
120
Why a Gaussian?
0.999
0.997
0.99
0.98
Probability
0.95
0.90
0.75
0.50
0.25
0.10
0.05
0.02
0.01
0.003
0.001
-10
0
10
A normal probability plot of the (cumulative) distribution of
values from subsequences of length 128.
109
Visual Comparison
3
2
1
0
-1
-2
-3
DFT
f
e
d
c
b
a
PLA
Haar
APCA
A raw time series of length 128 is transformed into the word
“ffffffeeeddcbaabceedcbaaaaacddee.”
– We can use more symbols to represent the time series since each symbol
requires fewer bits than real-numbers (float, double)
110
A simple worked example of approximate motif discovery algorithm
The next 3 slides
T
0
(m= 1000 )
500
1000
C1
C^1 a c b a
^
S
a
1
2 b
: :
: :
58 a
: :
985 b
c
c
:
:
c
:
c
b
a
:
:
c
:
c
a
b
:
:
a
:
c
Assume that we have a time series T
of length 1,000, and a motif of length
16, which occurs twice, at time T1
and time T58.
a = 3 { a,b,c}
n = 16
w=4
Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic
discovery of time series motifs. KDD 2003: 493-498
111
A simple worked example of approximate motif discovery algorithm
A mask {1,2} was randomly chosen,
so the values in columns {1,2} were
used to project matrix into buckets.
Collisions are recorded by
incrementing the appropriate
location in the collision matrix
112
A simple worked example of approximate motif discovery algorithm
A mask {2,4} was randomly chosen,
so the values in columns {2,4} were
used to project matrix into buckets.
Once again, collisions are
recorded by incrementing the
appropriate location in the
collision matrix
113
We can calculate the expected values in the
matrix, assuming there are NO patterns…
k   i 
E( k , a, w, d , t )      1- 
 2  i 0  w 
d
t is the length of the
projected string. We
conclude that if we have k
random strings of size w,
an entry of the similarity
matrix will be hit on
average
t
 w a 1
 i  a 

 
i
1
 
a
w i
two randomly-generated
words of size w over an
alphabet of size a, the
probability that they
match with up to d errors
114
Motif significance involves several
independent dimensions
Length/size
Similarity
Support
Assessing significance requires estimating a
function S:Rd→R over these dimensions so
we can rank the motifs
115
More invariances mean more
independent dimensions
Dimensionality
Similarity
Complexity
Support
Length/size
116
Multi-dimensional Motif
• Synchronous
– Treat it as an even higher dimensional problem
– Simple extensions of uni-dimensional algorithms
work
– To find sub-dimensional motifs, all possible subspaces have to be considered
117
Multi-dimensional Motif
• Non-Synchronous
– Lags among motifs are common
– Subsets of dimensions can possibly construct a
motif
Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional Motif
Detection in Time Series. IJCAI 2009: 1261-1266
David Minnen, Charles Isbell, Irfan Essa, and Thad Starner. Detecting Subdimensional Motifs: An Efficient Algorithm for
118
Generalized Multivariate Pattern Discovery. ICDM '07
Coincidence table
coincident(ri, rj) is the
number of overlapping
occurrences of motif i
and motif j
119
Single-dimensional motifs to graph
•
•
•
•
Produce a co-occurrence graph
Nodes are single dimensional motifs
An edge between x and y denotes, x and y always co-occur within a time lag
Cluster the graph using min-cut algorithms to find multi-dimensional motifs
Leg
z
y
x
0/2
0/2
2/4
Arm
z
y
x
1/4
1/2
1/2
Hand
z
y
x
0/3
0/4
0/2
Hip
z
y
x
0/4
0/4
0/4
0
0.5
1
1.5
2
2.5
3
x 104
120
Open Problems
• New Invariances:
– P1: Find repeated patterns under warping
distance.
– P2: Finding motifs under Complexity invariance.
• Uniform scaling (Yankov 06)
• Significance:
– P3: Assessing significance of motifs without
discretization.
• Parameter-free
• Data adaptive
121
Open Problems
• Algorithmic:
– P4: Optimal k-motif for a given threshold
– P5: Exact multi-dimensional motif discovery
• Application:
– P6: Finding hidden state machine from motifs where
“states” can be clusters of subsequences and
“transitions” between states can be rules
• Systems:
– P7: A suite with all the techniques packaged together
– P8: Parallel motif discovery using hardware
acceleration
122
Code Snippet
• Live Demonstration of the MOEN tool.
x = load('LSF5_10.txt');
for i = 1:30000-128
y(i,:) = x(i:i+127);
end
tic; y*y'; toc;
!./mk 'LSF5_10.txt' 30000 128
123
References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
Abdullah Mueen, Eamonn J. Keogh: Online discovery and maintenance of time series motifs. KDD 2010: 1089-1098
Abdullah Mueen, Eamonn J. Keogh, Qiang Zhu, Sydney Cash, M. Brandon Westover: Exact Discovery of Time Series
Motifs. SDM 2009: 473-484
Abdullah Mueen, Eamonn J. Keogh, Nima Bigdely Shamlo: Finding Time Series Motifs in Disk-Resident Data. ICDM 2009:
367-376
Abdullah Mueen: Enumeration of Time Series Motifs of All Lengths. ICDM 2013: 547-556
Mueen et al. A disk-aware algorithm for time series motif discovery. Data Min. Knowl. Discov. 2011.
Yuan Hao, Mohammad Shokoohi-Yekta, George Papageorgiou, Eamonn J. Keogh: Parameter-Free Audio Motif Discovery
in Large Data Archives. ICDM 2013: 261-270
Dragomir Yankov, Eamonn J. Keogh, Jose Medina, Bill Yuan-chi Chiu, Victor B. Zordan: Detecting time series motifs under
uniform scaling. KDD 2007: 844-853
Xiaopeng Xi, Eamonn J. Keogh, Li Wei, Agenor Mafra-Neto: Finding Motifs in a Database of Shapes. SDM 2007: 249-260
Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-498
Pranav Patel, Eamonn J. Keogh, Jessica Lin, Stefano Lonardi: Mining Motifs in Massive Time Series Databases. ICDM
2002: 370-377
Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional
Motif Detection in Time Series. IJCAI 2009: 1261-1266
Debprakash Patnaik, Manish Marwah, Ratnesh Sharma, and Naren Ramakrishnan. 2009. Sustainable operation and
management of data center chillers using temporal data mining. In Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining (KDD '09)
Yuan Li, Jessica Lin, Tim Oates: Visualizing Variable-Length Time Series Motifs. SDM 2012: 895-906
Tim Oates, Arnold P. Boedihardjo, Jessica Lin, Crystal Chen, Susan Frankenstein, and Sunil Gandhi. 2013. Motif discovery
in spatial trajectories using grammar inference. In Proceedings of the 22nd ACM international conference on Conference
on information & knowledge management(CIKM '13). ACM, New York, NY, USA, 1465-1468.
Sorrachai Yingchareonthawornchai, Haemwaan Sivaraks, Thanawin Rakthanmanon, Chotirat Ann
Ratanamahatana: Efficient Proper Length Time Series Motif Discovery. ICDM 2013: 1265-1270
David Minnen, Charles Lee Isbell Jr., Irfan A. Essa, Thad Starner: Discovering Multivariate Motifs using Subsequence
Density Estimation and Greedy Mixture Learning. AAAI 2007: 615-620
124
References
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. 2008. Motion-motif graphs. In Proceedings of
the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '08)
David Minnen, Charles L. Isbell, Irfan A. Essa, Thad Starner: Detecting Subdimensional Motifs: An Efficient Algorithm for
Generalized Multivariate Pattern Discovery. ICDM 2007: 601-606
Yuan Hao, Yanping Chen, Jesin Zakaria, Bing Hu, Thanawin Rakthanmanon, Eamonn J. Keogh: Towards never-ending
learning from time series streams. KDD 2013: 874-882
Gustavo E. A. P. A. Batista, Xiaoyue Wang, Eamonn J. Keogh: A Complexity-Invariant Distance Measure for Time
Series. SDM 2011: 699-710
Thanawin Rakthanmanon, Eamonn J. Keogh, Stefano Lonardi, Scott Evans: Time Series Epenthesis: Clustering Time Series
Streams Requires Ignoring Some Data. ICDM 2011: 547-556
Yasser F. O. Mohammad, Toyoaki Nishida:Constrained Motif Discovery in Time Series. New Generation Comput. 27(4):
319-346 (2009)
Nuno Castro, Paulo J. Azevedo:Time Series Motifs Statistical Significance. SDM 2011: 687-698
Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-498
Zeeshan Syed, Collin Stultz, Manolis Kellis, Piotr Indyk, John V. Guttag: Motif discovery in physiological datasets: A
methodology for inferring predictive elements. TKDD 4(1) (2010)
Yasser F. O. Mohammad, Toyoaki Nishida: Exact Discovery of Length-Range Motifs. ACIIDS (2) 2014: 23-32
*Yuan Hao, Yanping Chen et al (2013). Towards Never-Ending Learning from Time Series Streams. SIGKDD 2013
Brown et al. PNAS 2013; 110(2): 791–796. A dictionary of behavioral motifs reveals clusters of genes affecting C. elegans
locomotion
125
Questions and Comments
THANK YOU
126
Download