Finding Repeated Structure in Time Series: Algorithms and Applications
Tutorial 3
(We will start 5 minutes late to allow folks to find the room.)
Finding Repeated Patterns in Time Series: Algorithms and Applications
Abdullah Mueen, University of New Mexico, USA
Eamonn Keogh, University of California Riverside, USA
Slides available at http://www.cs.unm.edu/~mueen/Tutorial/
Funding by NSF IIS-1161997 II

Announcements
• There will be four Q&A segments. Please hold your questions until the next segment.
• There is a feedback form on every chair. Negative or positive, anonymous or signed, all feedback is welcome.
• The last part of the tutorial (i.e., after lunch) will be given in Madrid 2. I will remind you one more time.

What are Time Series?
A time series is a collection of observations made sequentially in time.
25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 .. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500
[Figure: plot of the series over 500 time points, values between about 23 and 29.]

Repeated Pattern (Motif): repeated patterns in time series
Find the subsequences having very high similarity to each other.
[Figure: a long time series with a repeated pattern highlighted.]
Finding Motifs in Time Series, Jessica Lin, Eamonn Keogh, Stefano Lonardi, Pranav Patel, KDD 2002

General Outline
• Applications (~40 minutes)
  – As Subroutines in Data Mining
  – In Other Scientific Research
• Algorithms (~100 minutes)
  – Uni-dimensional
  – Multi-dimensional
  – Open Problems
• Demonstration (~10 minutes)

Applications Outline
• Applications
  – As Subroutines in Data Mining
    • Never Ending Learning
    • Time Series Clustering
    • Rule Discovery
    • Dictionary Building
  – In Other Scientific Research
    • Data center chiller management
    • Worm locomotion analysis
    • Physiological Prediction
    • Activity recognition
  – Motifs in Other Data-types
    • Audio
    • Shapes
    • Motion

Motifs allow us to learn, forever, without an explicit teacher…
If you have parallel texts, then over time you can learn a dictionary with high accuracy.
…And God said, “Let there be light”; and there was light. And God saw the light, that it was good. And God …
…Y dijo Dios: Sea la luz; y fue la luz. Y vio Dios que la luz era buena. Y llamó Dios a …
Note that the mapping is non-linear, so the learning algorithms in this domain are non-trivial.
Suppose, however, that the unknown “language” is not discrete, but a real-valued time series? In this case, repeated pattern discovery can help*…
*Yuan Hao, Yanping Chen et al (2013). Towards Never-Ending Learning from Time Series Streams.
SIGKDD 2013

Motifs allow us to learn, forever, without an explicit teacher…
[Figure: aligned streams over about 3,000 samples (scale bar: 5 seconds). P: X-axis acceleration of the wrist, the real-valued “text”. B: RFID tags for Glove and Iron, each on/off, the binary “text”. Other tags omitted.]
This dataset contains standard IADL housekeeping activities (vacuuming, ironing, dusting, brooming, watering plants, etc.). We have a discrete (binary) “text” that notes whether the hand is near a cleaning instrument, and a real-valued accelerometer “text”.

We can run motif discovery on the time series stream. If we find motifs, we can see whether they correlate with the discrete streams…
[Figure: the same streams with three observed occurrences of motif C1. Running co-occurrence counts after each occurrence: Glove 1/1, 2/2, 3/3; Iron 0/1, 0/2, 0/3.]
In this snippet, the motifs seem to correlate with the presence of a glove…

How well does this work?
Over an hour of activity, we learn to recognize a behavior in the time series that indicates the user is putting on a glove.
[Figure: P (X-axis acceleration of right wrist) from 0 to 55 minutes, marking where the events shown in the last slide take place; the learned concept C1 with the discovered pattern, its true positives, and its false positives. P(C1=glove) = 0.70, P(C1=iron) = 0.23, P(C1=fan) = 0.07.]
Note: There are false positives, but we considered only a single axis for simplicity.

Motifs allow us to cluster subsequences of a time series…
[Figure: a time series of about 1,200 points.]
Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data. ICDM 2011
And how would we evaluate our answer?
== Poem ==
In a sort of Runic rhyme,
To the throbbing of the bells-
Of the bells, bells, bells,
To the sobbing of the bells;
Keeping time, time, time,
As he knells, knells, knells,
In a happy Runic rhyme,
To the rolling of the bells,-
Of the bells, bells, bells-
To the tolling of the bells,
Of the bells, bells, bells, bells,
Bells, bells, bells,-
To the moaning and the groaning of the bells.
Data: last 30 seconds from the 4-minute poem The Bells by Edgar Allan Poe

== Text in each cluster ==
bells, bells, bells,
Bells, bells, bells,
Of the bells, bells, bells,
Of the bells, bells, bells-
To the throbbing of the bells-
To the sobbing of the bells;
To the tolling of the bells,
To the rolling of the bells,-
To the moaning and the groan-
time, time, time,
knells, knells, knells,
sort of Runic rhyme,
groaning of the bells.

Motifs allow us to cluster subsequences of a time series…
Key observations that make this possible:
• Time Series Motifs!
• We are willing to allow some data to be unexplained by the clustering
• We score the possible clusterings with MDL, which is parameter-free!
• We allow the clusters to be of different lengths/sizes
Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, and Scott Evans (2011). Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data. ICDM 2011

Motifs are useful, but can we predict the future?
What happens next?
[Figure: a time series of about 8,000 points with values between -30 and -10.]
(unpublished work, email Dr. Keogh for preprint)
Prediction vs. Forecasting (informal definitions)
• Forecasting is “always on”: it constantly predicts a value, say, two minutes out (we are not doing this).
• Prediction only makes a prediction occasionally, when it is sure what will happen next.

Why Predict the (short-term) Future?
If a robot can predict that it is about to fall, it may be able to:
• Prevent the fall
• Mitigate the damage of the fall
More importantly, if the robot can predict a human’s actions:
• The robot could catch the human!
• This would allow more natural human/robot interaction.
• Real time is not fast enough for interaction! We need to be half a second ahead of real time.
Other examples:
• Predict a car crash, tighten seatbelts, apply brakes
• Predict that the next spoken word after ‘data’ is ‘mining’, then begin prefetching web pages
• etc.

Previous attempts at this have largely failed…
However, we can do this, and time series motifs are the key tool.
The rule discovery technique will use:
• Time Series Motifs
• MDL (minimum description length)
• Admissible speed-up techniques (not discussed here)

Let us start by finding motifs
[Figure: the time series (about 9,000 points) with the first and second occurrences of the motif marked; motif length about 100.]

We can convert the motifs to a rule
We can use the motif to make a rule…
IF we see this shape (antecedent), THEN we see that shape (consequent) within maxlag time.
[Figure: the motif split into an antecedent and a consequent; t1 = 7.58, maxlag = 0.]
The Euclidean distance between the antecedent shape and the observed window must be within a threshold, t1 = 7.58.

We can monitor streaming data with our rule..
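The rule just described (match the antecedent within threshold t1, then predict the consequent within maxlag) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `monitor`, `euclidean`, and `consequent_len` are hypothetical, the distance is plain Euclidean with no normalization, and the predicted interval is assumed to start right after the matching window.

```python
import math
from collections import deque


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def monitor(stream, antecedent, t1, maxlag, consequent_len):
    """Yield (fire_index, earliest_start, latest_end) each time the rule fires.

    The rule fires at index i when the most recent len(antecedent) samples
    lie within Euclidean distance t1 of the antecedent shape. The consequent
    is then expected to start within maxlag samples after the window
    (an assumption about how "within maxlag time" is measured).
    """
    m = len(antecedent)
    window = deque(maxlen=m)  # keeps only the latest m samples
    for i, value in enumerate(stream):
        window.append(value)
        if len(window) == m and euclidean(window, antecedent) <= t1:
            earliest_start = i + 1
            latest_end = earliest_start + maxlag + consequent_len
            yield i, earliest_start, latest_end


if __name__ == "__main__":
    toy_stream = [0, 0, 0, 1, 2, 3, 0, 0, 0, 1, 2, 3, 0]
    for fired in monitor(toy_stream, [1, 2, 3], t1=0.1, maxlag=0, consequent_len=2):
        print(fired)
```

Because the window is a fixed-size deque, each new sample costs one distance computation of length m; the admissible speed-up techniques mentioned on the slides are not attempted here.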
[Figure: the rule (t1 = 7.58, maxlag = 0) shown next to the incoming stream.]

The rule gets invoked…
[Figure: close-up of the stream (x-axis ticks 5735, 5785, 5835) marking where the rule fires and the shape whose occurrence it predicts; t1 = 7.58, maxlag = 0.]

It seems to work!

What is the ground truth?
The first verse of The Raven by Poe in MFCC space
Once upon a midnight dreary, while I pondered weak and weary.. ..rapping at my chamber door……
[Figure: the MFCC time series (about 9,000 points); the rule's shapes are labeled with the words “at my chamber door”; t1 = 7.58, maxlag = 0.]
The phrase “at my chamber door” does appear 6 more times, and we do fire our rule correctly each time, and have no false positives.
What are we invariant to?
• Who is speaking? Somewhat; we can handle other males, but females are tricky.
• Rate of speech? To a large extent, yes.
• Foreign accents? Sore throat? etc. Not yet.

Why we need the Maxlag parameter
[Figure: two y-acceleration traces of an elevator ride, annotated: waiting for elevator, stepping inside, elevator moves up, elevator stops, walking away. Rule parameters shown: t1 = 0.4, maxlag = 4 sec.]
Here the maxlag depends on the number of floors we have in our building. We can hand-edit this rule to generalize from short buildings to tall buildings.
Can physicians edit medical rules to generalize from male to female…

This works, really!
[Figure: a stream (x-axis about 15500 to 17000) with the rule's antecedent and consequent; maxlag = 20 minutes.]
IF we see a Clothes Washer used THEN we will see Clothes Dryer used within 20 minutes

Insect Behavior Dictionary
Beet Leafhopper (Circulifer tenellus)
[Figure: EPG apparatus: voltage source, input resistor, conductive glue to insect, connection to soil near plant, voltage reading; inset showing the stylet and the plant membrane.]
The electrical penetration graph, or EPG, is a system used by biologists to study the interaction of insects with plants.
[Figure: 15 minutes of EPG recorded on Beet Leafhopper, with motif instances at 20,925 and 25,473 and additional examples of the motif.]
As a bead of sticky secretion, which is a byproduct of sap feeding, is ejected, it temporarily forms a highly conductive bridge between the insect and the plant.
Exact Discovery of Time Series Motifs. A Mueen, et al. SDM, 2009.

Insect Behavior Dictionary
[Figure: six motif pairs, with lengths 334, 412, 799, 565, 384, and 799.]
More motifs reveal different feeding patterns of Beet Leafhopper.
Sustainable Operation and Management of Data Center Chillers using Temporal Data Mining
HP Labs with Virginia Tech
“Our primary goal is to link the time series temperature data gathered from chiller units to high level sustainability characterizations… thus using time series motifs as a crucial intermediate representation to aid in data reduction.”
“switching from motif 8 to motif 5 gives us a nearly $40,000 in annual savings!”
Patnaik et al. SIGKDD09

A dictionary of behavioral motifs reveals clusters of genes affecting C. elegans locomotion
Laboratory of Molecular Biology, Cambridge, United Kingdom
Goal: Detect genotype by using the locomotion only. Convert postures to a four-dimensional time series.
Brown A E X et al. PNAS 2013; 110(2): 791–796.

Social performance reveals unexpected vocal competency in young songbirds
Variability in motif structure is lower in juvenile Directed than in Undirected and similar to that in adult song.
[Figure: amplitude envelope of the songs.]
Satoshi Kojima and Allison J. Doupe, PNAS 2011, 108 (4) 1687-1692.

Motif discovery in physiological datasets: A methodology for inferring predictive elements
University of Michigan and MIT
We evaluated our solution on a population of patients who experienced sudden cardiac death (SDDB) and attempted to discover electrocardiographic activity that may be associated with the endpoint of death.
To assess the predictive patterns discovered, we compared likelihood scores for time series motifs in the sudden death population…
Using a maximum likelihood separator, we were able to use our motif to correctly identify 70% of the patients who suffered sudden cardiac death during 24 hours of recording while classifying none of the normal individuals, and only 8% of the patients from the supraventricular dataset, as being at risk.
Zeeshan Syed et al. TKDD 2010

Constrained Motif Discovery in Time Series
Toyoaki Nishida, Kyoto University
“we use time series motifs to find gesture patterns with applications to robot-human interactions”
Okada, Izukura and Nishida 2011

Discovering Characteristic Actions from On-Body Sensor Data
David Minnen, Thad Starner, Irfan Essa, and Charles Isbell, Georgia Tech
Our algorithm successfully discovers motifs that correspond to the real exercises with a recall rate of 96.3% and overall accuracy of 86.7% over six exercises and 864 occurrences.
Minnen et al. Symp. Wearable Computers 2006

Motifs can Spot Dance Moves…
Step  Action
A     Side steps with no arm movement
B     Rock steps sideways without arm movement
C     Rock steps sideways with arm movement
D     Side steps with arm movement
E     Side steps with arms up in the air
F     Standing still with head bopping
[Figure: the choreography (a sequence of steps A to F) over time, aligned with x, y, z sensor channels for Leg, Arm, Hand, and Hip, with per-channel motif counts.]
Motifs are from the same dance steps or the same transitions 86% of the time.
Data: H. Pohl et al.
SMC 2010
Algorithm: Mueen, ICDM 2013

Motifs in Audio
• Mice calls are inaudible and have significant noise
• Manual inspection of the temporal signal is impossible
• Features like MFCC are not good for animal song
• Just use the raw spectrogram to find repeated calls
Yuan Hao et al. Parameter-Free Audio Motif Discovery in Large Data Archives. ICDM 2013: 261-270

Motifs in Shapes
• Projectile shapes: the algorithm detects a rare cornertang segment, an object that has long intrigued anthropologists.
• Petroglyphs: the algorithm detects similar petroglyphs drawn across continents and centuries.
[Figure: Lampasas River Cornertang and Castroville Cornertang, with their shape time series.]

Motifs in Motion
Mueen et al. A disk-aware algorithm for time series motif discovery. Data Min. Knowl. Discov. 2011.
Yankov et al. Detecting time series motifs under uniform scaling. KDD 2007
Two motions can be stitched together by transitioning from one motif to the other, a very useful technique for motion synthesis.

Time Series Motifs have 1,000s of Uses
• ..for discovering motifs in the music data is called the Mueen-Keogh (MK) algorithm.. Cabredo et al. 2011
• we apply the MK motif algorithm to time series retrieved from seismic signals… Cassisi et al 2012
• we take motif developed by Keogh in order to support a medical expert in discovering interesting knowledge. Kitaguchi.
• for the problem of estimation of Micro-drilled Hole Wall of PWBs we take the Motif method developed by Keogh... Toshiki et al. (fabrication)
• the most efficient motif provided a power savings of 41 This translates to an annual reduction of 287 tons of CO2. Watson InterPACK09.
• We use Keogh's Motifs for unsupervised discovery of abnormal human behavior in multi-dimensional time series data... Vahdatpour SDM 2010.
• variability of behavior, using motifs, provides more consistent groupings of households across different clustering algorithms… Ian Dent 2014

Questions and Comments

Algorithms Outline
• Algorithms
  – Definition, Distance Measures and Invariances
  – Exact Algorithms
    • Fixed Length
    • Enumeration of All length
    • K-motif Discovery
    • Online Maintenance
  – Approximate Algorithms
    • Random Projection Algorithm
  – Multi-dimensional Motif Discovery
  – Open Problems

A four-slide digression, to make sure you understand what invariances are, and why they are important

Suppose we are walking in a cemetery in Japan. We see an interesting grave marker, and we want to learn more about it. We can take a photo of it and search a database….
Campana and Keogh (2010). A Compression Based Distance Measure for Texture. SDM 2010.

In order to do this, we must have a distance measure with the right invariances:
• Color invariance
• Occlusion invariance
• Size invariance

Time Series Data has Unique Invariances
• These invariances are domain/problem dependent
• They include
  – Complexity invariance
  – Warping invariance
  – Uniform scaling invariance
  – Occlusion invariance
  – Rotation/phase invariance
  – Offset invariance
  – Amplitude invariance
  (Z-normalization of each subsequence removes the last two, offset and amplitude.)
• Sometimes you achieve the invariance in the distance measure, sometimes by preprocessing the data.
• In this work, we will just assume offset/amplitude invariance.
See [a] for a visual tour of time series invariances.
[a] Batista, et al (2011) A Complexity-Invariant Distance Measure for Time Series.
SDM 2011

Z-Normalization ensures scale and offset invariances
$\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}$
[Figure: a query heartbeat and an ECG trace with a wandering baseline; below, the distance to the query along the trace, with threshold = 30.]
Without Normalization only 75% of the beats are missed

Distance Measures
• The choices are
  – Euclidean Distance
  – Correlation
  – Dynamic Time Warping
  – Longest Common Subsequences
  – Uniformly Scaled Euclidean Distance
  – Sliding Nearest Neighbor Distance

Euclidean Distance Metric
Given two time series $x = x_1 \dots x_n$ and $y = y_1 \dots y_n$, their z-normalized Euclidean distance is defined as:
$\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}, \quad \hat{y}_i = \frac{y_i - \mu_y}{\sigma_y}, \quad d(x, y) = \sqrt{\sum_{i=1}^{n} (\hat{x}_i - \hat{y}_i)^2}$
Early abandoning reduces the number of operations when minimizing: stop summing as soon as the running sum of squared errors exceeds best² (the square of the best distance found so far).
[Figure: two z-normalized series of length about 90 and the point at which the sum is abandoned.]

Pearson’s Correlation Coefficient

Relationship with Euclidean Distance
Minimizing z-normalized Euclidean distance and maximizing Pearson’s correlation coefficient are identical in effect for motif discovery, since for z-normalized series $d(x, y)^2 = 2n\,(1 - \mathrm{corr}(x, y))$.

Euclidean Vs Dynamic Time Warping
• Euclidean Distance: sequences are aligned “one to one”.
• Dynamic Time Warping (“warped” time axis): nonlinear alignments are possible.

How is DTW Calculated?
$DTW(Q, C) = D(m, n)$
$D(i, j) = (q_i - c_j)^2 + \min\{ D(i, j-1),\; D(i-1, j),\; D(i-1, j-1) \}$
• Quadratic time complexity
• DTW is not a metric
[Figure: the warping path w between Q and C.]

Definition of Time Series Motifs
1. Length of the motif
2. Support of the motif
3. Similarity of the Pattern
4. Relative Position of the Pattern

Simplest Definition of Time Series Motifs
Given a length, the most similar/least distant pair of non-overlapping subsequences
1. Length of the motif = Given
2. Support of the motif = 2
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
Exact Discovery of Time Series Motifs. A Mueen, et al. SDM, 2009.

Problem Formulation
[Figure: a time series observed up to time 1000 and its sliding-window subsequences, numbered 1 to 873.]
The most similar pair of non-overlapping subsequences = the closest pair of points in high-dimensional space
• Optimal algorithm in two dimensions: Θ(n log n)
• For large dimensionality d, the optimal algorithm is effectively Θ(n²d)

Lower Bound
If P, Q and R are three points in a d-space:
d(P,Q) + d(Q,R) ≥ d(P,R)
d(P,Q) ≥ |d(Q,R) - d(P,R)|
A third point R provides a very inexpensive lower bound on the true distance.
If the lower bound is larger than the existing best, skip d(P, Q):
d(P,Q) ≥ |d(Q,R) - d(P,R)| ≥ BestPairDistance

Circular Projection
• Pick a reference point r
• Circularly project all points onto a line passing through the reference point
• Equivalent to computing the distance from r and then sorting the points according to distance
[Figure: 24 numbered points around the reference point r.]

The Order Line
[Figure: the points ordered by distance from r; for two points P and Q, the gap |d(Q, r) - d(P, r)| is compared with BestPairDistance; scans shown for k = 1, 2, 3.]
For k = 1 to n-1:
• Compare every pair having k-1 points in between
• Do k scans of the order line, starting with the 1st to kth point

Correctness
• If we search for all offsets = 1, 2, …, n-1, then all possible pairs are considered.
  – n(n-1)/2 pairs
• If for any offset = k none of the k scans needs an actual distance computation, then for the rest of the offsets = k+1, …, n-1 no distance computation will be needed.
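The order-line search and its stopping rule can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the published MK code: it uses a single randomly chosen reference subsequence, population-standard-deviation z-normalization, no early abandoning, and no multiple references. The names `znorm`, `dist`, and `mk_motif_pair` are made up for this example.

```python
import math
import random


def znorm(seq):
    """Z-normalize a sequence (population standard deviation)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(v - mean) / std for v in seq]


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mk_motif_pair(series, m, seed=0):
    """Return (distance, (i, j)) for the closest non-overlapping pair of length m."""
    subs = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    n = len(subs)
    ref = random.Random(seed).randrange(n)        # reference point r
    d_ref = [dist(subs[ref], s) for s in subs]    # distance of every point to r
    order = sorted(range(n), key=d_ref.__getitem__)  # the order line
    best, best_pair = math.inf, None
    for k in range(1, n):                         # offset along the order line
        any_candidate = False
        for j in range(n - k):
            a, b = order[j], order[j + k]
            # Triangle-inequality lower bound on d(a, b).
            if d_ref[b] - d_ref[a] >= best:
                continue
            any_candidate = True
            if abs(a - b) < m:                    # overlapping subsequences
                continue
            d = dist(subs[a], subs[b])
            if d < best:
                best, best_pair = d, (min(a, b), max(a, b))
        if not any_candidate:
            break  # larger offsets can only have larger lower bounds
    return best, best_pair


if __name__ == "__main__":
    toy = [math.sin(i * 0.7) + ((i * 37) % 11) * 0.1 for i in range(60)]
    print(mk_motif_pair(toy, m=8))
```

The early exit relies on the order line being sorted: the gap between points k+1 apart is never smaller than the gap between points k apart, which is the correctness argument given on the slide.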
Performance
[Figure: execution time for Brute Force, Brute Force with Optimized Distance Computation, and Our Algorithm (MK).]
• Orders of magnitude faster
• Exact in execution
• No sacrifice of the quality of the results

Multiple References
• Use multiple reference points for tighter lower bounds.
[Figure: pruning by Ptolemaic inequality versus pruning by triangular inequality (both axes from 10^5 to 10^10), showing the region where the Ptolemaic bound wins and the region where the linear bound wins.]

Pruning by Multiple References
Lower bound from references R_1, R_2, …: $\max_i |d(P, R_i) - d(Q, R_i)|$
[Figure: time to find the motif in a database of 64,000 random walk time series of length 1,024; logarithm of time (seconds) versus number of reference time series used.]

Questions and Comments

Simplest Definition of Time Series Motifs
The most similar/least distant pairs of non-overlapping subsequences at all lengths.
1. Length of the motif = All
2. Support of the motif = 2
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping
Abdullah Mueen: Enumeration of Time Series Motifs of All Lengths. ICDM 2013: 547-556

Goals: Enumerating Motifs
1. Remove the length parameter
2. Search for motifs in a range of lengths and report ALL of the motifs of all of the lengths
3. Retain Scalability

Bound on Extension
[Figure: two signals of length 5, shown without normalization and with normalization; the normalized values change when a sample is appended.]

Intuition
The area between blue and red is the distance between the signals.
[Figure: three panels. Normalized signals of length 4; append 10 and re-normalize (length 5): area shrinks; append 20 and re-normalize (length 5): area shrinks further.]
If infinity is appended to both the signals, they will have zero area/distance.

Bounding Euclidean Distance

Experimental Validation of the Bounds
[Figure: normalized distance for pairs in ascending order of distance.]
• Blue: distances of random signals of length 255
• Gray: distances of the signals when they are extended by one random sample
• Red: the upper and lower bounds before observing the extensions

Intuition
[Figure: a time series with n = 10000 and ordered lists of candidate pairs (Location 1, Location 2, Distance), such as (145, 5410), (8345, 4211), (1655, 9461), (6531, 2501), (851, 1440), (2512, 3110), and (1685, 9260), whose distances and order change as the length grows.]
• Once in every 10 lengths, the exact ordered list is required to be populated.
• This yields a 10x speed-up over running fixed-length motif discovery for all lengths.
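The append-and-re-normalize intuition from the Bound on Extension slides can be checked numerically. The Python sketch below uses two made-up signals of length 4 and is only a single illustration, not a proof and not the bound itself (the slides' bound formula is not reproduced here); `znorm`, `dist`, and `distance_after_append` are hypothetical helper names, and population standard deviation is assumed.

```python
import math


def znorm(seq):
    """Z-normalize a sequence (population standard deviation)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / n)
    return [(v - mean) / std for v in seq]


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def distance_after_append(x, y, value):
    """Z-normalized distance after appending the same value to both signals."""
    return dist(znorm(list(x) + [value]), znorm(list(y) + [value]))


# Two illustrative signals of length 4 (made-up values).
x = [1.0, 3.0, 2.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

if __name__ == "__main__":
    print("length 4:", dist(znorm(x), znorm(y)))
    for v in (10.0, 20.0, 1e6):
        print("append", v, "->", distance_after_append(x, y, v))
```

For these signals the distance drops after appending 10, drops further after appending 20, and is close to zero for a very large appended value, matching the "infinity" remark on the slide.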
Sanity Check
[Figure: a white noise signal of about 6,000 points with planted patterns; discovered motifs (1) to (4) with lengths 87, 105, 299, and 299.]
• Three patterns planted in a random signal with different scaling.
• The algorithm finds them appropriately.
http://www.cs.unm.edu/~mueen/Projects/MOEN/index.html

Experimental Results: Scalability
[Figure: two panels of execution time in seconds, one versus data length (n) and one versus range of lengths (maxLen-minLen+1), comparing Smart Brute Force, Iterative MK, and MOEN on EEG, EOG, and Random Walk data.]

Online Time Series Motifs
• Streaming time series
• Sliding window of the recent history
  – What minute-long trace repeated in the last hour?
Abdullah Mueen, Eamonn J. Keogh: Online discovery and maintenance of time series motifs. KDD 2010

Problem Formulation
Discovery: the most similar pair of non-overlapping subsequences
[Figure: a time series at time 1000 and its subsequences, numbered 1 to 873.]
Maintenance: update the motif pair after every time tick
[Figure: at times 1001 to 1005, subsequences 874 to 878 are added one per tick.]

Challenges
• A subsequence is a high dimensional point
  – The dynamic closest pair of points problem
• Closest pair may change upon every update
• Naïve approach: Do quadratic comparisons.
[Figure: a set of numbered points before and after a deletion and an insertion.]

Related Work
• Goal: Algorithm with linear update time
• Previous method for dynamic closest pair (Eppstein, 00)
  – A matrix of all-pair distances is maintained
    • O(w²) space required
  – Quad-tree is used to update the matrix
• Maintain a set of neighbors and reverse neighbors for all points
• We do it in O(w√w) space

Maintaining Motif
• Motifs → Closest pair
• Upon insertion
  – Find the nearest neighbor; needs O(w) comparisons.
• Upon deletion
  – Find the next NN of all the reverse NN
[Figure: the point set during an insertion and a deletion.]

Data Structure
• N-lists: neighbors in order of distances
• R-lists: points that should be updated (ptr, distance)
• Total number of nodes in R-lists is w
• FIFO sliding window
[Figure: N-lists and R-lists for a window of eight points.]

Observations
• While inserting, updating the NN of old points is not necessary
• A point can be removed from the neighbor list if it violates the temporal order
• Average length O(√w)

Example
[Figure: worked example of the N-lists as points 1 to 10 arrive.]

Results
[Figure: average update time (sec) and average length of N-list, each plotted against window size (w, up to 4 x 10^4) and against motif length (m, up to 1000), for the Insect, RW, EOG, and EEG datasets, including a comparison with general dynamic closest pair.]

Questions and Comments

Finding Repeated Structure in Time Series: Algorithms and Applications
Lunch break; we meet back in Madrid 2 at 1:45pm

Simplest Definition of Time Series Motifs
The non-overlapping subsequences at all lengths having k or more τ-matches.
1. Length of the motif = All
2. Support of the motif = k and τ
3. Similarity of the Pattern = Euclidean distance
4. Relative Position of the Pattern = non-overlapping

Optimal algorithm is hard
• Search for locations of the τ-balls that contain k subsequences
• NP-Hard
• Instead we search for a motif representative that has k subsequences within τ

How do we find the motif representative?
• Simply take one of the two occurrences as the representative
• Take the average of the two
• Find all occurrences within a threshold of the pair and train an HMM to capture the concept (Minnen’07)

Using each one in the pair as the representative (k=4, τ=0.9)
[Figure: examples on EEG data and EPG data.]
It takes only k similarity searches to find other occurrences.
The overall complexity remains the same.

Finding top-K motif
• Run MK K times
• Replace occurrences by random noise between iterations
[Figure: a time series with occurrences of motifs A, B and C marked; the panels are labeled Length: 155, 157, 255 and 251]

How do we find approximate motif in a time series?
The obvious brute-force search algorithm is just too slow…
Our algorithm is based on a hot idea from bioinformatics, random projection*, and on the fact that SAX allows us to lower bound discrete representations of time series.
* J. Buhler and M. Tompa. Finding motifs using random projections. In RECOMB'01. 2001.

Symbolic Aggregate ApproXimation (SAX)

How do we obtain SAX?
First convert the time series to the PAA representation, then convert the PAA to symbols. It takes linear time.
[Figure: a time series C, its PAA approximation, and the resulting symbols a, b, c forming the word "baabccbc"]

Why a Gaussian?
Time series subsequences tend to have a highly Gaussian distribution.
[Figure: a normal probability plot of the (cumulative) distribution of values from subsequences of length 128.]
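The two-step conversion above (time series to PAA, then PAA to symbols) can be sketched in a few lines. This is a minimal illustration rather than the tutorial's own code: the function names znorm, paa and sax are hypothetical, the series length is assumed to be a multiple of the word length w, and the breakpoints are assumed to split a standard normal distribution into a equally likely regions, which is the usual SAX convention that the Gaussian observation motivates.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znorm(values):
    """Z-normalize a sequence (zero mean, unit standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant sequence carries no shape information; map it to zeros.
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def paa(values, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(values)
    if n % w != 0:
        # Assumption of this sketch: no fractional segment boundaries.
        raise ValueError("len(values) must be a multiple of w")
    seg = n // w
    return [mean(values[i * seg:(i + 1) * seg]) for i in range(w)]


def sax(values, w, a):
    """Convert a sequence to a SAX word of length w over an alphabet of size a."""
    if not 2 <= a <= 26:
        raise ValueError("alphabet size a must be between 2 and 26")
    # a - 1 breakpoints cutting N(0, 1) into a equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    # A PAA value equal to a breakpoint is assigned to the higher symbol.
    return "".join(alphabet[bisect_right(breakpoints, x)]
                   for x in paa(znorm(values), w))


print(sax([1, 1, 5, 5, 9, 9], w=3, a=3))  # abc
```

Each step is a single pass over the data, which matches the linear-time remark on the slide.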
Visual Comparison
[Figure: the same raw time series under the DFT, PLA, Haar and APCA representations, alongside its symbolic representation over the alphabet a to f]
A raw time series of length 128 is transformed into the word "ffffffeeeddcbaabceedcbaaaaacddee".
– We can use more symbols to represent the time series, since each symbol requires fewer bits than a real number (float, double).

A simple worked example of the approximate motif discovery algorithm (the next 3 slides)
Assume that we have a time series T of length 1,000 and a motif of length 16 that occurs twice, at time T1 and time T58.
Parameters: a = 3 {a, b, c}, n = 16, w = 4
[Figure: time series T (m = 1000), the subsequence C1 and its SAX word Ĉ1 = "acba"]
Matrix Ŝ of SAX words, one row per subsequence:
  1     a c b a
  2     b c a b
  :     : : : :
  58    a c c a
  :     : : : :
  985   b c c c
Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-498

A simple worked example of the approximate motif discovery algorithm (continued)
A mask {1,2} was randomly chosen, so the values in columns {1,2} were used to project the matrix into buckets. Collisions are recorded by incrementing the appropriate location in the collision matrix.

A simple worked example of the approximate motif discovery algorithm (continued)
A mask {2,4} was randomly chosen, so the values in columns {2,4} were used to project the matrix into buckets. Once again, collisions are recorded by incrementing the appropriate location in the collision matrix.

We can calculate the expected values in the matrix, assuming there are NO patterns…
We conclude that if we have k random strings of size w, an entry of the similarity matrix will be hit on average

\[ E(k, a, w, d, t) = \binom{k}{2} \sum_{i=0}^{d} \left(1 - \frac{i}{w}\right)^{t} \binom{w}{i} \left(\frac{a-1}{a}\right)^{i} \left(\frac{1}{a}\right)^{w-i} \]

times, where t is the length of the projected string. The factor \( \sum_{i=0}^{d} \binom{w}{i} \left(\frac{a-1}{a}\right)^{i} \left(\frac{1}{a}\right)^{w-i} \) is the probability that two randomly generated words of size w over an alphabet of size a match with up to d errors.

Motif significance involves several independent dimensions
• Length/size
• Similarity
• Support
Assessing significance requires estimating a function S: R^d → R over these dimensions so we can rank the motifs.

More invariances mean more independent dimensions
• Dimensionality
• Similarity
• Complexity
• Support
• Length/size

Multi-dimensional Motif
• Synchronous
  – Treat it as an even higher dimensional problem
  – Simple extensions of uni-dimensional algorithms work
  – To find sub-dimensional motifs, all possible subspaces have to be considered

Multi-dimensional Motif
• Non-Synchronous
  – Lags among motifs are common
  – Subsets of dimensions can possibly construct a motif
Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional Motif Detection in Time Series. IJCAI 2009: 1261-1266
David Minnen, Charles Isbell, Irfan Essa, and Thad Starner. Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery. ICDM '07

Coincidence table
coincident(ri, rj) is the number of overlapping occurrences of motif i and motif j.

Single-dimensional motifs to graph
• Produce a co-occurrence graph
• Nodes are single-dimensional motifs
• An edge between x and y denotes that x and y always co-occur within a time lag
• Cluster the graph using min-cut algorithms to find multi-dimensional motifs
[Figure: x, y and z channels for Leg, Arm, Hand and Hip, each annotated with a ratio such as 2/4; time axis up to 3 × 10^4]

Open Problems
• New Invariances:
  – P1: Find repeated patterns under warping distance.
  – P2: Finding motifs under complexity invariance.
    • Uniform scaling (Yankov 06)
• Significance:
  – P3: Assessing significance of motifs without discretization.
    • Parameter-free
    • Data adaptive

Open Problems
• Algorithmic:
  – P4: Optimal k-motif for a given threshold
  – P5: Exact multi-dimensional motif discovery
• Application:
  – P6: Finding hidden state machine from motifs, where "states" can be clusters of subsequences and "transitions" between states can be rules
• Systems:
  – P7: A suite with all the techniques packaged together
  – P8: Parallel motif discovery using hardware acceleration

Code Snippet
• Live demonstration of the MOEN tool.

    x = load('LSF5_10.txt');
    y = zeros(30000-128, 128);   % preallocate one row per subsequence
    for i = 1:30000-128
        y(i,:) = x(i:i+127);     % row i is the length-128 subsequence starting at i
    end
    tic; y*y'; toc;              % time the all-pairs dot products
    % run the mk executable on the same file from the shell
    !./mk 'LSF5_10.txt' 30000 128

References
1. Abdullah Mueen, Eamonn J. Keogh: Online discovery and maintenance of time series motifs. KDD 2010: 1089-1098
2. Abdullah Mueen, Eamonn J. Keogh, Qiang Zhu, Sydney Cash, M. Brandon Westover: Exact Discovery of Time Series Motifs. SDM 2009: 473-484
3. Abdullah Mueen, Eamonn J. Keogh, Nima Bigdely Shamlo: Finding Time Series Motifs in Disk-Resident Data. ICDM 2009: 367-376
4. Abdullah Mueen: Enumeration of Time Series Motifs of All Lengths. ICDM 2013: 547-556
5. Mueen et al. A disk-aware algorithm for time series motif discovery. Data Min. Knowl. Discov. 2011.
6. Yuan Hao, Mohammad Shokoohi-Yekta, George Papageorgiou, Eamonn J. Keogh: Parameter-Free Audio Motif Discovery in Large Data Archives. ICDM 2013: 261-270
7. Dragomir Yankov, Eamonn J. Keogh, Jose Medina, Bill Yuan-chi Chiu, Victor B. Zordan: Detecting time series motifs under uniform scaling. KDD 2007: 844-853
8. Xiaopeng Xi, Eamonn J. Keogh, Li Wei, Agenor Mafra-Neto: Finding Motifs in a Database of Shapes. SDM 2007: 249-260
9. Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-498
10. Pranav Patel, Eamonn J. Keogh, Jessica Lin, Stefano Lonardi: Mining Motifs in Massive Time Series Databases. ICDM 2002: 370-377
11. Alireza Vahdatpour, Navid Amini, Majid Sarrafzadeh: Toward Unsupervised Activity Discovery Using Multi-Dimensional Motif Detection in Time Series. IJCAI 2009: 1261-1266
12. Debprakash Patnaik, Manish Marwah, Ratnesh Sharma, and Naren Ramakrishnan. 2009. Sustainable operation and management of data center chillers using temporal data mining. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '09)
13. Yuan Li, Jessica Lin, Tim Oates: Visualizing Variable-Length Time Series Motifs. SDM 2012: 895-906
14. Tim Oates, Arnold P. Boedihardjo, Jessica Lin, Crystal Chen, Susan Frankenstein, and Sunil Gandhi. 2013. Motif discovery in spatial trajectories using grammar inference. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management (CIKM '13). ACM, New York, NY, USA, 1465-1468.
15. Sorrachai Yingchareonthawornchai, Haemwaan Sivaraks, Thanawin Rakthanmanon, Chotirat Ann Ratanamahatana: Efficient Proper Length Time Series Motif Discovery. ICDM 2013: 1265-1270
16. David Minnen, Charles Lee Isbell Jr., Irfan A. Essa, Thad Starner: Discovering Multivariate Motifs using Subsequence Density Estimation and Greedy Mixture Learning. AAAI 2007: 615-620
17. Philippe Beaudoin, Stelian Coros, Michiel van de Panne, and Pierre Poulin. 2008. Motion-motif graphs. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA '08)
18. David Minnen, Charles L. Isbell, Irfan A. Essa, Thad Starner: Detecting Subdimensional Motifs: An Efficient Algorithm for Generalized Multivariate Pattern Discovery. ICDM 2007: 601-606
19. Yuan Hao, Yanping Chen, Jesin Zakaria, Bing Hu, Thanawin Rakthanmanon, Eamonn J. Keogh: Towards never-ending learning from time series streams. KDD 2013: 874-882
20. Gustavo E. A. P. A. Batista, Xiaoyue Wang, Eamonn J. Keogh: A Complexity-Invariant Distance Measure for Time Series. SDM 2011: 699-710
21. Thanawin Rakthanmanon, Eamonn J. Keogh, Stefano Lonardi, Scott Evans: Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data. ICDM 2011: 547-556
22. Yasser F. O. Mohammad, Toyoaki Nishida: Constrained Motif Discovery in Time Series. New Generation Comput. 27(4): 319-346 (2009)
23. Nuno Castro, Paulo J. Azevedo: Time Series Motifs Statistical Significance. SDM 2011: 687-698
24. Bill Yuan-chi Chiu, Eamonn J. Keogh, Stefano Lonardi: Probabilistic discovery of time series motifs. KDD 2003: 493-498
25. Zeeshan Syed, Collin Stultz, Manolis Kellis, Piotr Indyk, John V. Guttag: Motif discovery in physiological datasets: A methodology for inferring predictive elements. TKDD 4(1) (2010)
26. Yasser F. O. Mohammad, Toyoaki Nishida: Exact Discovery of Length-Range Motifs. ACIIDS (2) 2014: 23-32
27. *Yuan Hao, Yanping Chen et al (2013). Towards Never-Ending Learning from Time Series Streams. SIGKDD 2013
28. Brown et al. A dictionary of behavioral motifs reveals clusters of genes affecting C. elegans locomotion. PNAS 2013; 110(2): 791–796.

Questions and Comments
THANK YOU