All data files and formats can be imagined as rows of data. The number of columns is the dimensionality of the data; the number of rows is the numerosity. For example, a customer database:

Customer 1: 12 23 12 43 25 Yes
Customer 2: 34 56 12 76 23 Yes
Customer 3: 45 23 45 34 23 No
Customer 4: 54 23 23 55 54 Yes

or a sequence of temperature readings: 25.1750 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 … 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500

What are Time Series? A time series is a collection of observations made sequentially in time. [Figure: a time series of about 500 points.] Note that virtually all similarity measurements, indexing and dimensionality reduction techniques discussed in this tutorial can be used with other data types.

Time Series are Ubiquitous! I
People measure things...
• George Bush's popularity rating.
• Their blood pressure.
• The annual rainfall in Dubrovnik.
• The value of their Yahoo stock.
• The number of web hits per second.
… and things change over time. Thus time series occur in virtually every medical, scientific and business domain.

Time Series are Ubiquitous! II
A random sample of 4,000 graphics from 15 of the world's newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series (Tufte, 1983).

Image data may best be thought of as time series… Video data may best be thought of as time series… [Figure: Gun-Draw motion capture over 90 frames — hand at rest, hand moving above holster, hand moving down to grasp gun, hand moving to shoulder level, steady pointing.] Handwriting data may best be thought of as time series… [Figure: George Washington manuscript.]

Time Series are Everywhere…
Bioinformatics: Aach, J. & Church, G. (2001). Aligning gene expression time series with time warping algorithms. Bioinformatics, Volume 17, pp 495-508.
Robotics: Schmill, M., Oates, T. & Cohen, P. (1999).
Learned models for continuous planning. In 7th International Workshop on Artificial Intelligence and Statistics.
Medicine: Caiani, E.G., et al. (1998). Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular volume. IEEE Computers in Cardiology.
Gesture Recognition: Gavrila, D. M. & Davis, L. S. (1995). Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In IEEE IWAFGR.
Chemistry: Gollmer, K., & Posten, C. (1995). Detection of distorted pattern using dynamic time warping algorithm and application for supervision of bioprocesses. IFAC CHEMFAS-4.
Meteorology / Tracking / Biometrics / Astronomy / Finance / Manufacturing …

Why is Working With Time Series so Difficult? Part I
Answer: How do we work with very large databases?
• 1 Hour of EKG data: 1 Gigabyte.
• Typical Weblog: 5 Gigabytes per week.
• Space Shuttle Database: 200 Gigabytes and growing.
• Macho Database: 3 Terabytes, updated with 3 gigabytes a day.
Since most of the data lives on disk (or tape), we need a representation of the data we can efficiently manipulate.

Why is Working With Time Series so Difficult? Part II
Answer: We are dealing with subjectivity. The definition of similarity depends on the user, the domain and the task at hand. We need to be able to handle this subjectivity.

Why is Working With Time Series so Difficult? Part III
Answer: Miscellaneous data handling problems.
• Differing data formats.
• Differing sampling rates.
• Noise, missing values, etc.
We will not focus on these issues in this tutorial.

What do we want to do with the time series data? Clustering, Motif Discovery, Classification, Rule Discovery (e.g. a rule with support s = 0.5 and confidence c = 0.3), Query by Content, Novelty Detection. All these problems require similarity matching.

Here is a simple motivation…. You go to the doctor because of chest pains.
Your ECG looks strange… Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...
Two questions:
• How do we define similar?
• How do we search quickly?

The similarity matching problem can come in two flavors I
1: Whole Matching. Given a Query Q (template), a reference database C = {C1, …, C10} and a distance measure, find the Ci that best matches Q. [Figure: C6 is the best match.]

The similarity matching problem can come in two flavors II
2: Subsequence Matching. Given a Query Q (template), a reference database C (one long sequence) and a distance measure, find the location that best matches Q. [Figure: the best matching subsection.]
Note that we can always convert subsequence matching to whole matching by sliding a window across the long sequence, and copying the window contents.

Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) is denoted by D(O1,O2).
What properties should a distance measure have?
• D(A,B) = D(B,A)                 Symmetry
• D(A,A) = 0                      Constancy of Self-Similarity
• D(A,B) = 0 iff A = B            Positivity
• D(A,B) ≤ D(A,C) + D(B,C)        Triangular Inequality

Why is the Triangular Inequality so Important?
Virtually all techniques to index data require the triangular inequality to hold. Suppose I am looking for the closest point to Q, in a database of 3 objects. Further suppose that the triangular inequality holds, and that we have precompiled a table of distances between all the items in the database:

      a      b
b   6.70
c   7.07   2.30
[Figure: query Q and the objects a, b, c, with the precompiled distance table above.]
I find a and calculate that it is 2 units from Q; it becomes my best-so-far. I find b and calculate that it is 7.81 units away from Q. I don't have to calculate the distance from Q to c!
D(Q,b) ≤ D(Q,c) + D(b,c)
D(Q,b) − D(b,c) ≤ D(Q,c)
7.81 − 2.30 ≤ D(Q,c)
5.51 ≤ D(Q,c)
So I know that c is at least 5.51 units away, but my best-so-far is only 2 units away.

A Final Thought on the Triangular Inequality I
Sometimes the triangular inequality requirement maps nicely onto human intuitions. Consider the similarity between the horse, the zebra and the lion. The horse and the zebra are very similar, and both are very unlike the lion.

A Final Thought on the Triangular Inequality II
Sometimes the triangular inequality requirement fails to map onto human intuition. Consider the similarity between the horse, a man and the centaur. The horse and the man are very different, but both share many features with the centaur. This relationship does not obey the triangular inequality. The centaur example is due to Remco Veltkamp.

The Minkowski Metrics
D(Q,C) = ( Σ_{i=1..n} |qi − ci|^p )^(1/p)
• p = 1: Manhattan (Rectilinear, City Block)
• p = 2: Euclidean
• p = ∞: Max (Supremum, "sup")

Euclidean Distance Metric
Given two time series Q = q1…qn and C = c1…cn, their Euclidean distance is defined as:
D(Q,C) = sqrt( Σ_{i=1..n} (qi − ci)² )
Ninety percent of all work on time series uses the Euclidean distance measure.

Optimizing the Euclidean Distance Calculation
Instead of using the Euclidean distance we can use the Squared Euclidean distance:
Dsquared(Q,C) = Σ_{i=1..n} (qi − ci)²
Euclidean distance and Squared Euclidean distance are equivalent in the sense that they return the same rankings, clusterings and classifications. This optimization helps with CPU time, but most problems are I/O bound.

Preprocessing the data before distance calculations
If we naively try to measure the distance between two "raw" time series, we may get very unintuitive results.
This is because Euclidean distance is very sensitive to some distortions in the data. For most problems these distortions are not meaningful, and thus we can and should remove them. In the next 4 slides I will discuss the 4 most common distortions, and how to remove them.
• Offset Translation
• Amplitude Scaling
• Linear Trend
• Noise

Transformation I: Offset Translation
Q = Q - mean(Q)
C = C - mean(C)
D(Q,C)
[Figure: two series separated by a vertical offset, before and after mean subtraction.]

Transformation II: Amplitude Scaling
Q = (Q - mean(Q)) / std(Q)
C = (C - mean(C)) / std(C)
D(Q,C)
[Figure: two series with different amplitudes, before and after normalization.]

Transformation III: Linear Trend
The intuition behind removing linear trend is this: fit the best fitting straight line to the time series, then subtract that line from the time series.
[Figure: a pair of series after removing offset translation and amplitude scaling, and after additionally removing linear trend.]

Transformation IIII: Noise
Q = smooth(Q)
C = smooth(C)
D(Q,C)
The intuition behind removing noise is this: average each datapoint's value with its neighbors.

A Quick Experiment to Demonstrate the Utility of Preprocessing the Data
Instances from Cylinder-Bell-Funnel with small, random amounts of trend, offset and scaling added. [Figure: a dendrogram clustered using Euclidean distance on the raw data, and a dendrogram clustered using Euclidean distance after removing noise, linear trend, offset translation and amplitude scaling.]

Summary of Preprocessing
The "raw" time series may have distortions which we should remove before clustering, classification etc.
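The four transformations above can be sketched in a few lines of numpy. This is a minimal illustration assuming the usual definitions (mean/std normalization, a least-squares line fit, moving-average smoothing); the function names are mine, not from the tutorial:

```python
import numpy as np

def remove_offset(x):
    # Transformation I: offset translation
    return x - np.mean(x)

def z_normalize(x):
    # Transformation II: removes offset translation AND amplitude scaling
    return (x - np.mean(x)) / np.std(x)

def remove_linear_trend(x):
    # Transformation III: fit the best straight line, then subtract it
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)
    return x - (slope * t + intercept)

def smooth(x, w=3):
    # Transformation IIII: average each datapoint with its neighbors
    kernel = np.ones(w) / w
    return np.convolve(x, kernel, mode="same")
```

In practice one composes these (e.g. detrend, then z-normalize, then smooth) before computing D(Q,C), exactly as in the Cylinder-Bell-Funnel experiment.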
Of course, sometimes the distortions are the most interesting thing about the data; the above is only a general rule. We should keep these problems in mind as we consider the high level representations of time series which we will encounter later (Fourier transforms, Wavelets etc.), since these representations often allow us to handle distortions in elegant ways.

Dynamic Time Warping
Fixed Time Axis: sequences are aligned "one to one". "Warped" Time Axis: nonlinear alignments are possible.
Note: We will first see the utility of DTW, then see how it is calculated.

Let us compare Euclidean Distance and DTW on 4 problems: Gun-Draw, Sign language, Nuclear Trace and Word Spotting. [Figures: example sequences from each dataset.]

Results: Error Rate
Dataset          Euclidean    DTW
Word Spotting      4.78       1.10
Sign language     28.70      25.93
GUN                5.50       1.00
Nuclear Trace     11.00       0.00
We use 1-nearest-neighbor, leave-one-out evaluation.

Results: Time (msec)
Dataset          Euclidean      DTW    Ratio
Word Spotting        40        8,600     215
Sign language        10        1,110     110
GUN                  60       11,820     197
Nuclear Trace       210      144,470     687
DTW is hundreds to thousands of times slower than Euclidean distance.

Utility of Dynamic Time Warping: Example II, Clustering Power-Demand Time Series
Each sequence corresponds to a week's demand for power in a Dutch research facility in 1997 [van Selow 1999]. [Figure: one example week where Wednesday was a national holiday.]
For both dendrograms the two 5-day weeks are correctly grouped. Note however, that for Euclidean distance the three 4-day weeks are not clustered together, and the two 3-day weeks are also not clustered together. In contrast, Dynamic Time Warping clusters the three 4-day weeks together, and the two 3-day weeks together.

Time taken to create hierarchical clustering of power-demand time series:
• Time to create dendrogram using Euclidean Distance: 0.012 seconds
• Time to create dendrogram using Dynamic Time Warping: 2.0 minutes

How is DTW Calculated? I
We create a matrix the size of |Q| by |C|, then fill it in with the distance between every pair of points in our two time series.

How is DTW Calculated? II
Every possible warping between two time series is a path through the matrix. We want the best one:
DTW(Q,C) = min_w { sqrt( Σ_{k=1..K} w_k ) / K }
where w is a warping path of length K (the division by K normalizes for path length).

How is DTW Calculated? III
This recursive function gives us the minimum cost path:
γ(i,j) = d(qi,cj) + min{ γ(i-1,j-1), γ(i-1,j), γ(i,j-1) }

Let us visualize the cumulative matrix on a real world problem I
This example shows 2 one-week periods from the power demand time series. Note that although they both describe 4-day work weeks, the blue sequence had Monday as a holiday, and the red sequence had Wednesday as a holiday.

Let us visualize the cumulative matrix on a real world problem II
[Figure: the cumulative cost matrix with the optimal warping path.]

What we have seen so far…
• Dynamic time warping gives much better results than Euclidean distance on virtually all problems (recall the classification example, and the clustering example).
• Dynamic time warping is very very slow to calculate!
Is there anything we can do to speed up similarity search under DTW?

Fast Approximations to Dynamic Time Warp Distance I
Simple Idea: Approximate the time series with some compressed or downsampled representation, and do DTW on the new representation. How well does this work...

Fast Approximations to Dynamic Time Warp Distance II
[Figure: 22.7 sec on the raw data vs 1.3 sec on the approximation.] … strong visual evidence suggests it works well. Good experimental evidence for the utility of the approach on clustering, classification and query by content problems has also been demonstrated.
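The recursion above translates directly into a dynamic program. Below is a minimal, unoptimized O(|Q||C|) sketch with no global constraint, using the squared point-to-point difference as d(qi,cj); the path-length normalization (the division by K) is omitted here, as is common in implementations. The function name is mine:

```python
import numpy as np

def dtw(q, c):
    """Dynamic time warping via the cumulative cost matrix:
    gamma(i,j) = d(q_i, c_j) + min(gamma(i-1,j-1), gamma(i-1,j), gamma(i,j-1))
    """
    n, m = len(q), len(c)
    gamma = np.full((n + 1, m + 1), np.inf)
    gamma[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2          # local squared distance
            gamma[i, j] = d + min(gamma[i - 1, j - 1],   # match
                                  gamma[i - 1, j],       # insertion
                                  gamma[i, j - 1])       # deletion
    return np.sqrt(gamma[n, m])
```

For example, dtw([0,0,1,0,0], [0,1,0,0,0]) is 0, because the warping path can align the two peaks, while the Euclidean distance between the same pair is sqrt(2); this is exactly the "nonlinear alignment" advantage shown in the slides.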
In general, it is hard to speed up a single DTW calculation. However, if we have to make many DTW calculations (which is almost always the case), we can potentially speed up the whole process by lowerbounding. Keep in mind that the lowerbounding trick works for any situation where you have an expensive calculation that can be lowerbounded (string edit distance, graph edit distance etc).

I will explain how lowerbounding works in a generic fashion in the next two slides, then show concretely how lowerbounding makes dealing with massive time series under DTW possible…

Lower Bounding I
Assume that we have two functions:
• DTW(A,B)
• lower_bound_distance(A,B)
The true DTW function is very slow… The lower bound function is very fast… By definition, for all A, B, we have lower_bound_distance(A,B) ≤ DTW(A,B).

Lower Bounding II
We can speed up similarity search under DTW by using a lower bounding function.
Intuition: Try to use a cheap lower bounding calculation as often as possible. Only do the expensive, full calculations when it is absolutely necessary.

Algorithm Lower_Bounding_Sequential_Scan(Q)
1.  best_so_far = infinity;
2.  for all sequences in database
3.    LB_dist = lower_bound_distance(Ci, Q);
4.    if LB_dist < best_so_far
5.      true_dist = DTW(Ci, Q);
6.      if true_dist < best_so_far
7.        best_so_far = true_dist;
8.        index_of_best_match = i;
9.      endif
10.   endif
11. endfor

Lower Bound of Yi et al.
[Figure: LB_Yi — the parts of C that lie above max(Q) or below min(Q).] The sum of the squared lengths of the gray lines represents the minimum contribution of the corresponding points to the overall DTW distance, and thus can be returned as the lower bounding measure.
Yi, B., Jagadish, H. & Faloutsos, C. Efficient retrieval of similar time sequences under time warping. ICDE 98, pp 23-27.

Lower Bound of Kim et al.
[Figure: LB_Kim — the first (A), last (D), minimum (B) and maximum (C) points of the two sequences.]
Kim, S., Park, S., & Chu, W. An index-based approach for similarity search supporting time warping in large sequence databases.
ICDE 01, pp 607-614.
The squared differences between the two sequences' first (A), last (D), minimum (B) and maximum (C) points are summed and returned as the lower bound.

Global Constraints
• Slightly speed up the calculations
• Prevent pathological warpings
A global constraint constrains the indices of the warping path wk = (i,j)k such that j − r ≤ i ≤ j + r, where r is a term defining the allowed range of warping for a given point in a sequence. For the Sakoe-Chiba Band, r is a constant; for the Itakura Parallelogram, r is a function of the position along the sequence.

A Novel Lower Bounding Technique I
Using the allowed warping range r, we define an envelope U, L around the query Q:
Ui = max(qi-r : qi+r)
Li = min(qi-r : qi+r)
[Figures: the envelope around Q under the Sakoe-Chiba Band and under the Itakura Parallelogram.]

A Novel Lower Bounding Technique II
LB_Keogh(Q,C) = sqrt( Σ_{i=1..n} f(ci) ), where
f(ci) = (ci − Ui)²   if ci > Ui
f(ci) = (ci − Li)²   if ci < Li
f(ci) = 0            otherwise

The tightness of the lower bound for each technique is proportional to the length of the gray lines used in the illustrations: LB_Kim, LB_Yi, LB_Keogh (Sakoe-Chiba), LB_Keogh (Itakura).

Let us empirically evaluate the quality of the lower bounding techniques. This is a good idea, since it is an implementation-free measure of quality. First we must discuss our experimental philosophy…

Experimental Philosophy
• We tested on 32 datasets from such diverse fields as finance, medicine, biometrics, chemistry, astronomy, robotics, networking and industry. The datasets cover the complete spectrum of stationary/non-stationary, noisy/smooth, cyclical/non-cyclical, symmetric/asymmetric etc.
• Our experiments are completely reproducible. We saved every random number, every setting and all data.
• To ensure true randomness, we use random numbers created by a quantum mechanical process.
• We test with the Sakoe-Chiba Band, which is the worst case (the Itakura Parallelogram would give us much better results).
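LB_Keogh as defined above can be sketched directly: build a Sakoe-Chiba envelope of half-width r around the query, then sum the squared excursions of the candidate outside that envelope. A minimal sketch with numpy (function names are mine):

```python
import numpy as np

def envelope(q, r):
    """Sakoe-Chiba envelope of half-width r:
    U_i = max(q_{i-r} : q_{i+r}),  L_i = min(q_{i-r} : q_{i+r})."""
    n = len(q)
    U = np.array([max(q[max(0, i - r): i + r + 1]) for i in range(n)])
    L = np.array([min(q[max(0, i - r): i + r + 1]) for i in range(n)])
    return U, L

def lb_keogh(q, c, r):
    """Sum of squared distances from c to the envelope of q, wherever
    c escapes the envelope; zero contribution inside the envelope."""
    U, L = envelope(q, r)
    c = np.asarray(c, dtype=float)
    above = np.where(c > U, (c - U) ** 2, 0.0)
    below = np.where(c < L, (c - L) ** 2, 0.0)
    return float(np.sqrt(np.sum(above + below)))
```

Note that a wider band (larger r) widens the envelope, so the bound can only get looser — this is the tightness trade-off the slides quantify empirically.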
The 32 datasets, with their lengths:

1  Sunspot        2,899    17  Evaporator      37,830
2  Power         35,040    18  Ballbeam         2,000
3  ERP data     198,400    19  Tongue             700
4  Spot Exrates   2,567    20  Fetal ECG       22,500
5  Shuttle        6,000    21  Balloon          4,002
6  Water          6,573    22  Stand' & Poor   17,610
7  Chaotic        1,800    23  Speech           1,020
8  Steamgen      38,400    24  Soil Temp        2,304
9  Ocean          4,096    25  Wool             2,790
10 Tide           8,746    26  Infrasound       8,192
11 CSTR          22,500    27  Network         18,000
12 Winding       17,500    28  EEG             11,264
13 Dryer2         5,202    29  Koski EEG      144,002
14 Robot Arm      2,048    30  Buoy Sensor     55,964
15 Ph Data        6,003    31  Burst            9,382
16 Power Plant    2,400    32  Random Walk I   65,536

Tightness of Lower Bound Experiment
• We measured T = (Lower Bound Estimate of Dynamic Time Warp Distance) / (True Dynamic Time Warp Distance), so 0 ≤ T ≤ 1. The larger the better.
• For each dataset, we randomly extracted 50 sequences of length 256 and compared each sequence to the 49 others. A query length of 256 is about the mean in the literature.
• For each dataset we report T as the average ratio from the 1,225 (50*49/2) comparisons made.
[Figure: average T for LB_Keogh, LB_Yi and LB_Kim on each of the 32 datasets.]

Effect of Query Length on Tightness of Lower Bounds
[Figure: tightness T of LB_Keogh, LB_Yi and LB_Kim for query lengths 16, 32, 64, 128, 256, 512 and 1024.]

How do the tightness of lower bounds translate to speed up?
[Figure: fraction of the database on which we must do the full DTW calculation, for Sequential Scan vs LB_Keogh, on Random Walk II data of size 2^10 to 2^20. Note that the X-axis is logarithmic.]

These experiments suggest we can use the new lower bounding technique to speed up sequential search. That's super! Excellent!
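The pruning loop behind these speedup numbers is the Lower_Bounding_Sequential_Scan pseudocode from a few slides back. A generic sketch, written so that any expensive distance and any valid lower bound can be plugged in (the names are mine):

```python
import numpy as np

def lower_bounding_scan(query, database, true_dist, lower_bound):
    """Sequential scan with lower-bound pruning: compute the expensive
    true_dist only when the cheap lower_bound cannot rule the candidate
    out. Returns (index_of_best_match, best_so_far)."""
    best_so_far = np.inf
    index_of_best_match = None
    for i, c in enumerate(database):
        if lower_bound(c, query) < best_so_far:   # cheap test
            d = true_dist(c, query)               # expensive calculation
            if d < best_so_far:
                best_so_far = d
                index_of_best_match = i
    return index_of_best_match, best_so_far
```

Plugging in DTW as true_dist and LB_Keogh as lower_bound gives the setup measured above; because the lower bound never exceeds the true distance, the answer is provably identical to a plain sequential scan.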
But what we really need is a technique to index the time series. According to the most referenced paper on time series similarity searching, "dynamic time warping cannot be speeded up by indexing*". As we noted in an earlier slide, virtually all indexing techniques require the triangular inequality to hold, and DTW does NOT obey the triangular inequality!
* Agrawal, R., Lin, K. I., Sawhney, H. S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling, and translation in time-series databases. VLDB, pp 490-501.

In fact, it was shown that DTW can be indexed! (VLDB 02). We won't give details here, other than to note that the technique is based on the lowerbounding technique we have just seen. Let us quickly see some success stories, problems that we can now solve, given that we can index DTW.

Success Story I
The lower bounding technique has been used to support indexing of massive archives of handwritten text. Surprisingly, DTW works better on this problem than more sophisticated approaches like Markov models.
R. Manmatha, T. M. Rath: Indexing of Handwritten Historical Documents - Recent Progress. In: Proc. of the 2003 Symposium on Document Image Understanding Technology (SDIUT), Greenbelt, MD, April 9-11, 2003, pp 77-85.
T. M. Rath and R. Manmatha (2002): Lower-Bounding of Dynamic Time Warping Distances for Multivariate Time Series. Technical Report MM-40, Center for Intelligent Information Retrieval, University of Massachusetts Amherst.

Success Story II
Grease is the word… The lower bounding technique has been used to support "query by humming", by several groups of researchers.
Best 3 Matches: 1) Bee Gees: Grease 2) Robbie Williams: Grease 3) Sarah Black: Heatwave
Yunyue Zhu, Dennis Shasha (2003). Query by Humming: a Time Series Database Approach, SIGMOD.
Ning Hu, Roger B. Dannenberg (2003).
Polyphonic Audio Matching and Alignment for Music Retrieval.

Success Story III
The lower bounding technique is being used by ChevronTexaco for comparing seismic data.

Weighted Distance Measures I
Intuition: For some queries, different parts of the sequence are more important. Weighting features is a well known technique in the machine learning community to improve classification and the quality of clustering.

Weighted Distance Measures II
D(Q,C) = sqrt( Σ_{i=1..n} (qi − ci)² )
D(Q,C,W) = sqrt( Σ_{i=1..n} wi (qi − ci)² )
[Figure: the height of a histogram W indicates the relative importance of that part of the query.]

Weighted Distance Measures III
How do we set the weights? One possibility: Relevance Feedback.
Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
Term Vector: [Jordan, Cow, Bull, River], Term Weights: [1, 1, 1, 1]
Search → Display Results → Gather Feedback → Update Weights
Term Vector: [Jordan, Cow, Bull, River], Term Weights: [1.1, 1.7, 0.3, 0.9]

Relevance Feedback for Time Series
The original query, and the weight vector; initially, all weights are the same. Note: In this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.
The initial query is executed, and the five best matches are shown (in the dendrogram). One by one the 5 best matching sequences will appear, and the user will rank each of them from very bad (-3) to very good (+3). Based on the user feedback, both the shape and the weight vector of the query are changed, and the new query can be executed. The hope is that the query shape and weights will converge to the optimal query.

Similarity Measures
Many papers have introduced a new similarity measure for time series; do they make a contribution?
• 61.9% of the papers fail to show a single example of a matching time series under the measure.
• 52.3% of the papers fail to compare their new measure to a single strawman.
• 85.7% of the papers fail to demonstrate an objective test of their new measure.

38.1% of the papers do show an example of a query and a match, but what does that mean if we don't see all the non-matches? … it means little. [Figure: a query with its best, 2nd best and 3rd best matches, with and without the rest of the database shown.]

Subjective Evaluation of Similarity Measures
We believe that one of the best (subjective) ways to evaluate a proposed similarity measure is to use it to create a dendrogram of several time series from the domain of interest. [Figure: dendrograms of the same time series under Euclidean Distance and under a distance measure introduced in one of the papers in the survey.]

Objective Evaluation of Similarity Measures
We can use nearest neighbor classification to evaluate the usefulness of proposed distance measures. We compared 11 different measures to Euclidean distance, using 1-NN. Some of the measures require the user to set some parameters. In these cases we wrapped the classification algorithm in a loop for each parameter, searched over all possible parameters and reported only the best result.
Cylinder-Bell-Funnel: This synthetic dataset has been in the literature for 8 years, and has been cited at least a dozen times. It is a 3-class problem; we create 128 examples of each class for these experiments.
Control-Chart: This synthetic dataset has been freely available from the UCI Data Archive since June 1998. It is a 6-class problem, with 100 examples of each class.
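The evaluation protocol used throughout these slides (1-nearest-neighbor with leave-one-out) is simple to state in code. A sketch, where dist is any distance function under test and the names are mine:

```python
import numpy as np

def one_nn_error(X, y, dist):
    """1-nearest-neighbor, leave-one-out error rate: classify each
    series by the label of its nearest neighbor among all the others."""
    errors = 0
    n = len(X)
    for i in range(n):
        best_label, best_d = None, np.inf
        for j in range(n):
            if i == j:
                continue                      # leave the query out
            d = dist(X[i], X[j])
            if d < best_d:
                best_d, best_label = d, y[j]
        errors += (best_label != y[i])
    return errors / n
```

Wrapping this in a loop over a measure's parameters, and reporting only the best result, reproduces the (deliberately generous) protocol described above.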
Results: Classification Error Rates

Approach                    Cylinder-Bell-F'   Control-Chart
Euclidean Distance               0.003             0.013
Aligned Subsequence              0.451             0.623
Piecewise Normalization          0.130             0.321
Autocorrelation Functions        0.380             0.116
Cepstrum                         0.570             0.458
String (Suffix Tree)             0.206             0.578
Important Points                 0.387             0.478
Edit Distance                    0.603             0.622
String Signature                 0.444             0.695
Cosine Wavelets                  0.130             0.371
Hölder                           0.331             0.593
Piecewise Probabilistic          0.202             0.321

Let's summarize, then take a break!
• Almost all time series problems require similarity calculations.
• Euclidean Distance accounts for 90% of the literature.
• DTW outperforms Euclidean distance for classification, clustering, indexing etc.
• DTW is very slow compared to Euclidean distance, but recent work on tight lower bounds has opened up the possibility of tackling massive datasets that would have been unthinkable just a few years ago…
• More generally, lower bounding is useful.
• Weighted distance measures may be useful for some problems.

Motivating example revisited…
You go to the doctor because of chest pains. Your ECG looks strange… Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition...
Two questions:
• How do we define similar?
• How do we search quickly?

Indexing Time Series
We have seen techniques for assessing the similarity of two time series. However we have not addressed the problem of finding the best match to a query in a large database (other than the lower bounding trick). The obvious solution, to retrieve and examine every item (sequential scanning), simply does not scale to large datasets. We need some way to index the data...

We can project time series of length n into n-dimensional space. The first value in C is the X-axis, the second value in C is the Y-axis, etc.
One advantage of doing this is that we have abstracted away the details of "time series"; now all query processing can be imagined as finding points in space... We can project the query time series Q into the same n-dimensional space and simply look for the nearest points.

Interesting Sidebar
The Minkowski Metrics have simple geometric interpretations: Euclidean, Weighted Euclidean, Manhattan, Max, Mahalanobis. [Figure: the unit "circle" of each metric.]

…the problem is that we have to look at every point to find the nearest neighbor. We can group clusters of datapoints with "boxes", called Minimum Bounding Rectangles (MBRs). We can further recursively group MBRs into larger MBRs…. These nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree); examples include the R-tree, Hybrid-Tree etc. [Figure: MBRs R1…R9 grouped into R10, R11, R12, organized as a tree whose data nodes contain the points.]

We can define a function, MINDIST(point, MBR), which tells us the minimum possible distance between any point and any MBR, at any level of the tree. [Figure: MINDIST(point, MBR) = 5 for a point outside the MBR; MINDIST(point, MBR) = 0 for a point inside it.] Note that this is another example of a useful lower bound.

We can use MINDIST(point, MBR) to do fast search. [Figure: MINDIST from the query to R10 is 0, to R11 is 10, to R12 is 17, so we descend into R10 first.] We now go to disk, and retrieve all the data objects whose pointers are in the green node. We measure the true distance between our query and those objects. Let us imagine two scenarios; the closest object, the "best-so-far", has a value of…
• 1.5 units (we are done searching!)
• 4.0 units (we have to look in R2, but then we are done)
[Figure: at the next level, MINDIST from the query to R1 is 0, to R2 is 2, to R3 is 10.]

If we project a query into n-dimensional space, how many additional (nonempty) MBRs must we examine before we are guaranteed to find the best match?
For the one dimensional case, the answer is clearly 2...
For the two dimensional case, the answer is 8...
For the three dimensional case, the answer is 26...
More generally, in n-dimensional space we must examine 3^n − 1 MBRs; for n = 21 that is already 10,460,353,202 MBRs. This is known as the curse of dimensionality.

Spatial Access Methods
We can use Spatial Access Methods like the R-Tree to index our data, but… the performance of R-Trees degrades exponentially with the number of dimensions. Somewhere above 6-20 dimensions the R-Tree degrades to linear scanning. Often we want to index time series with hundreds, perhaps even thousands of features….

GEMINI: GEneric Multimedia INdexIng {Christos Faloutsos}
• Establish a distance metric from a domain expert.
• Produce a dimensionality reduction technique that reduces the dimensionality of the data from n to N, where N can be efficiently handled by your favorite SAM.
• Produce a distance measure defined on the N dimensional representation of the data, and prove that it obeys Dindexspace(A,B) ≤ Dtrue(A,B), i.e. the lower bounding lemma.
• Plug into an off-the-shelf SAM.

We have 6 objects in 3-D space. We issue a query to find all objects within 1 unit of the point (-3, 0, -2)...
[Figure: the 6 objects A–F in 3-D space, with a query sphere of radius 1 around (-3, 0, -2). The query successfully finds the object E.]

Consider what would happen if we issued the same query after reducing the dimensionality to 2, assuming the dimensionality reduction technique obeys the lower bounding lemma...

Example of a dimensionality reduction technique in which the lower bounding lemma is satisfied
Informally, it's OK if objects appear closer in the dimensionality reduced space than in the true space. [Figure: the 6 objects projected into 2-D.] Note that because of the dimensionality reduction, object F appears to be less than one unit from the query (it is a false alarm). This is OK so long as it does not happen too much, since we can always retrieve it, then test it in the true, 3-dimensional space. This would leave us with just E, the correct answer.

Example of a dimensionality reduction technique in which the lower bounding lemma is not satisfied
Informally, some objects appear further apart in the dimensionality reduced space than in the true space. [Figure: the 6 objects projected into 2-D.] Note that because of the dimensionality reduction, object E appears to be more than one unit from the query (it is a false dismissal). This is unacceptable. We have failed to find the true answer set to our query.

GEMINI: GEneric Multimedia INdexIng {Christos Faloutsos}
• Establish a distance metric from a domain expert.
• Produce a dimensionality reduction technique that reduces the dimensionality of the data from n to N, where N can be efficiently handled by your favorite SAM.
• Produce a distance measure defined on the N dimensional representation of the data, and prove that it obeys Dindexspace(A,B) ≤ Dtrue(A,B), i.e. the lower bounding lemma.
• Plug into an off-the-shelf SAM.

The examples on the previous slides illustrate why the lower bounding lemma is so important.
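The MINDIST(point, MBR) lower bound used by the spatial access methods a few slides back has a simple closed form: clamp the point into the rectangle and measure how far it had to move. A sketch (numpy assumed; the corner-based MBR representation and function name are my choices):

```python
import numpy as np

def mindist(point, mbr_low, mbr_high):
    """Minimum possible Euclidean distance from a point to any point
    inside an MBR given by its low/high corners; a lower bound on the
    true distance to every object contained in that MBR."""
    p = np.asarray(point, dtype=float)
    lo = np.asarray(mbr_low, dtype=float)
    hi = np.asarray(mbr_high, dtype=float)
    # per-dimension gap; zero where the coordinate lies inside the extent
    gap = np.maximum(lo - p, 0.0) + np.maximum(p - hi, 0.0)
    return float(np.sqrt(np.sum(gap ** 2)))
```

For example, mindist((0, 0), (3, 4), (5, 6)) is 5.0, and any point inside the box gets 0.0, matching the two cases illustrated in the slides.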
Now all we have to do is find a dimensionality reduction technique that obeys the lower bounding lemma, and we can index our time series!

Notation for Dimensionality Reduction
For the following discussion of dimensionality reduction we will assume that:
M is the number of time series in our database.
n is the original dimensionality of the data (i.e. the length of the time series).
N is the reduced dimensionality of the data.
C_ratio = N/n is the compression ratio.

An Example of a Dimensionality Reduction Technique I
[Figure: a time series C of length n = 128, with its raw values shown as a column of numbers: 0.4995, 0.5264, 0.5523, 0.5761, … (just the first 30 or so points are shown).] The graphic shows a time series with 128 points. The raw data used to produce the graphic is also reproduced as a column of numbers.

An Example of a Dimensionality Reduction Technique II
[Figure: the same series C, now alongside its Fourier coefficients: 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, … (just the first 30 or so coefficients are shown).] We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown). The Fourier coefficients are reproduced as a column of numbers. Note that at this stage we have not done dimensionality reduction, we have merely changed the representation…

An Example of a Dimensionality Reduction Technique III
[Figure: C and its reconstruction C' from only the first few Fourier coefficients.] We have discarded 15/16 of the data.
[The raw data and Fourier coefficient columns are shown beside the truncated coefficient column, which keeps only the first eight values: 1.5698, 1.0485, 0.7160, 0.8406, 0.3709, 0.4670, 0.2667, 0.1928.] Here n = 128, N = 8, C_ratio = 1/16. …however, note that the first few sine waves tend to be the largest (equivalently, the magnitudes of the Fourier coefficients tend to decrease as you move down the column). We can therefore truncate most of the small coefficients with little effect.

An Example of a Dimensionality Reduction Technique IV
[Figure: the same columns, but the coefficients are now sorted by magnitude before truncation; the eight best coefficients are kept: 1.5698, 1.0485, 0.7160, 0.8406, 0.2667, 0.1928, 0.1438, 0.1416.] Instead of taking the first few coefficients, we could take the best coefficients. This can help greatly in terms of approximation quality, but makes indexing hard (impossible?).
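The keep-the-first-coefficients scheme can be sketched directly from the sine/cosine view of the DFT used on these slides. This is my own illustrative code, not the tutorial's; it computes the first few sine/cosine pairs by brute force (O(n²), rather than an O(n log n) FFT) and reconstructs the approximation C':

```python
import math

def dft_approx(x, num_waves):
    # Truncated Fourier representation: keep the mean plus the first
    # `num_waves` cosine/sine coefficient pairs (A_k, B_k), then
    # reconstruct the approximation C' from only those coefficients.
    n = len(x)
    a0 = sum(x) / n
    waves = []
    for k in range(1, num_waves + 1):
        A = 2.0 / n * sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        B = 2.0 / n * sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        waves.append((A, B))
    return [a0 + sum(A * math.cos(2 * math.pi * (k + 1) * t / n) +
                     B * math.sin(2 * math.pi * (k + 1) * t / n)
                     for k, (A, B) in enumerate(waves))
            for t in range(n)]
```

A signal dominated by low frequencies (like most "natural" signals mentioned above) is reconstructed almost perfectly from a handful of coefficients, which is exactly why DFT truncation works as a dimensionality reduction.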
Note this applies also to Wavelets.

[Figure: the same series approximated by DFT, DWT, SVD, APCA, PAA and PLA, plus a symbolic (SYM) representation, e.g. the string aabbbccb; a second example series has the symbolic string UUCUCUCD.]

Time Series Representations
• Data Adaptive: Sorted Coefficients; Piecewise Polynomial (Piecewise Linear Approximation: Interpolation, Regression; Adaptive Piecewise Constant Approximation); Singular Value Decomposition; Symbolic (Natural Language, Strings); Trees.
• Non Data Adaptive: Wavelets (Orthonormal: Haar, Daubechies dbn, n > 1; Bi-Orthonormal: Coiflets, Symlets); Random Mappings; Spectral (Discrete Fourier Transform, Discrete Cosine Transform); Piecewise Aggregate Approximation.

Discrete Fourier Transform I
Basic Idea: Represent the time series X as a linear combination of sines and cosines, but keep only the first n/2 coefficients. Why n/2 coefficients? Because each sine wave requires 2 numbers, for the phase (w) and amplitude (A, B). [Jean Fourier, 1768-1830]

C(t) = Σ_{k=1}^{n/2} ( A_k cos(2πw_k t) + B_k sin(2πw_k t) )

Excellent free Fourier primer: Hagit Shatkay, "The Fourier Transform - a Primer", Technical Report CS-95-37, Department of Computer Science, Brown University, 1995. http://www.ncbi.nlm.nih.gov/CBBresearch/Postdocs/Shatkay/

Discrete Fourier Transform II
Pros and Cons of DFT as a time series representation.
• Good ability to compress most natural signals.
• Fast, off-the-shelf DFT algorithms exist. O(n log(n)).
• (Weakly) able to support time warped queries.
• Difficult to deal with sequences of different lengths.
• Cannot support weighted distance measures.
Note: the related transform, the DCT, uses only cosine basis functions.
It does not seem to offer any particular advantage over the DFT.

Discrete Wavelet Transform I
Basic Idea: Represent the time series as a linear combination of wavelet basis functions, but keep only the first N coefficients. [Figure: X, its approximation X', and the basis functions Haar 0 through Haar 7. Alfred Haar, 1885-1933.]
Although there are many different types of wavelets, researchers in time series mining/indexing generally use Haar wavelets. Haar wavelets seem to be as powerful as the other wavelets for most problems, and are very easy to code.
Excellent free wavelets primer: Stollnitz, E., DeRose, T., & Salesin, D. (1995). Wavelets for Computer Graphics: A Primer. IEEE Computer Graphics and Applications.

Discrete Wavelet Transform II
We have only considered one type of wavelet; there are many others. Are the other wavelets better for indexing? [Ingrid Daubechies, 1954-]
YES: I. Popivanov, R. Miller. Similarity Search Over Time Series Data Using Wavelets. ICDE 2002.
NO: K. Chan and A. Fu. Efficient Time Series Matching by Wavelets. ICDE 1999.
Later in this tutorial I will answer this question.

Discrete Wavelet Transform III
Pros and Cons of Wavelets as a time series representation.
• Good ability to compress stationary signals.
• Fast linear time algorithms for DWT exist.
• Able to support some interesting non-Euclidean similarity measures.
• Signals must have a length n = 2^some_integer.
• Works best if N = 2^some_integer; otherwise wavelets approximate the left side of the signal at the expense of the right side.
• Cannot support weighted distance measures.

Singular Value Decomposition I
Basic Idea: Represent the time series as a linear combination of eigenwaves, but keep only the first N coefficients.
[Figure: X, its approximation X', and the first eight eigenwaves (eigenwave 0 through eigenwave 7). Portraits: James Joseph Sylvester 1814-1897, Camille Jordan 1838-1921, Eugenio Beltrami 1835-1899.]
SVD is similar to the Fourier and wavelet approaches in that we represent the data in terms of a linear combination of shapes (in this case eigenwaves). SVD differs in that the eigenwaves are data dependent. SVD has been successfully used in the text processing community (where it is known as Latent Semantic Indexing) for many years.
Good free SVD primer: Singular Value Decomposition - A Primer. Sonia Leach.

Singular Value Decomposition II
How do we create the eigenwaves? We have previously seen that we can regard time series as points in high-dimensional space. We can rotate the axes such that axis 1 is aligned with the direction of maximum variance, axis 2 is aligned with the direction of maximum variance orthogonal to axis 1, etc. Since the first few eigenwaves contain most of the variance of the signal, the rest can be truncated with little loss.

A = UΣV^T

This process can be achieved by factoring an M by n matrix of time series into 3 other matrices, and truncating the new matrices at size N.

Singular Value Decomposition III
Pros and Cons of SVD as a time series representation.
• Optimal linear dimensionality reduction technique.
• The eigenvalues tell us something about the underlying structure of the data.
• Computationally very expensive: Time O(Mn²), Space O(Mn).
• An insertion into the database requires recomputing the SVD.
• Cannot support weighted distance measures or non-Euclidean measures.
Note: There has been some promising research into mitigating SVD's time and space complexity.
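For the Haar wavelets discussed a few slides back, the full decomposition is just repeated pairwise averaging and differencing. A minimal sketch (my own function names, not from any library; it assumes the length is a power of two, as the representation requires):

```python
def haar_dwt(x):
    # Full Haar decomposition: repeatedly replace the signal by pairwise
    # averages, collecting the pairwise half-differences ("details").
    # Returns [overall average, coarsest detail, ..., finest details].
    coeffs = []
    approx = list(x)
    while len(approx) > 1:
        avgs = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
        diffs = [(approx[i] - approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
        coeffs = diffs + coeffs   # coarser details end up before finer ones
        approx = avgs
    return approx + coeffs

def haar_inverse(c):
    # Invert the transform: each average a and detail d expand to (a+d, a-d).
    approx = [c[0]]
    rest = c[1:]
    while rest:
        k = len(approx)
        diffs, rest = rest[:k], rest[k:]
        approx = [v for a, d in zip(approx, diffs) for v in (a + d, a - d)]
    return approx
```

Note that the first output coefficient is the overall mean of the series, so for z-normalized data it is always zero; this small detail resurfaces later in the discussion of implementation bias.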
Chebyshev Polynomials
Basic Idea: Represent the time series X by an approximation X' that is a linear combination of Chebyshev polynomials. [Pafnuty Chebyshev, 1821-1894]
T_0(x) = 1
T_1(x) = x
T_2(x) = 2x^2 − 1
T_3(x) = 4x^3 − 3x
T_4(x) = 8x^4 − 8x^2 + 1
T_5(x) = 16x^5 − 20x^3 + 5x
T_6(x) = 32x^6 − 48x^4 + 18x^2 − 1
T_7(x) = 64x^7 − 112x^5 + 56x^3 − 7x
T_8(x) = 128x^8 − 256x^6 + 160x^4 − 32x^2 + 1

Pros and Cons of Chebyshev polynomials as a time series representation.
• Time series can be of arbitrary length.
• Only O(n) time complexity.
• Is able to support multi-dimensional time series*.
* Raymond T. Ng, Yuhan Cai: Indexing Spatio-Temporal Trajectories with Chebyshev Polynomials. SIGMOD 2004.

Piecewise Linear Approximation I
Basic Idea: Represent the time series as a sequence of straight lines. [Karl Friedrich Gauss, 1777-1855]
Lines could be connected, in which case we are allowed N/2 lines: each line segment has a length and a left_height (the right_height can be inferred by looking at the next segment). If lines are disconnected, we are allowed only N/3 lines: each line segment has a length, a left_height and a right_height.
Personal experience on dozens of datasets suggests disconnected is better. Also, only disconnected allows a lower bounding Euclidean approximation.
How do we obtain the Piecewise Linear Approximation?

Piecewise Linear Approximation II
The optimal solution is O(n²N), which is too slow for data mining. A vast body of work on faster heuristic solutions to the problem can be classified into the following classes:
• Top-Down
• Bottom-Up
• Sliding Window
• Other (genetic algorithms, randomized algorithms, B-spline wavelets, MDL etc.)
A recent extensive empirical evaluation of all approaches suggests that Bottom-Up is the best approach overall.

Piecewise Linear Approximation III
Pros and Cons of PLA as a time series representation.
• Good ability to compress natural signals.
• Fast linear time algorithms for PLA exist.
• Able to support some interesting non-Euclidean similarity measures.
Including weighted measures, relevance feedback, fuzzy queries…
• Already widely accepted in some communities (i.e., biomedical).
• Not (currently) indexable by any data structure (but does allow fast sequential scanning).

Symbolic Approximation I
Basic Idea: Convert the time series into an alphabet of discrete symbols. Use string indexing techniques to manage the data. [Figure: X, its approximation X', and the symbolic version a a b b b c c b.]
Potentially an interesting idea, but all work thus far is very ad hoc.
Pros and Cons of Symbolic Approximation as a time series representation.
• Potentially, we could take advantage of a wealth of techniques from the very mature fields of string processing and bioinformatics.
• It is not clear how we should discretize the time series (discretize the values, the slope, shapes? How big an alphabet? etc.).
• There is no known technique to allow the support of Euclidean queries. (Breaking news, later in tutorial.)

Symbolic Approximation II: SAX
Clipped Data: [Figure: X clipped at its mean into the bit string 0000…0111…1…, e.g. …110000110000001111…, which can then be run-length encoded:]
44 Zeros, 23 Ones, 4 Zeros, 2 Ones, 6 Zeros, 49 Ones → 44 Zeros|23|4|2|6|49

Piecewise Aggregate Approximation I
Basic Idea: Represent the time series as a sequence of box basis functions. Note that each box is the same length. [Figure: X, its approximation X', and the box basis functions x̄_1 … x̄_8.]

x̄_i = (N/n) Σ_{j = (n/N)(i−1)+1}^{(n/N)i} x_j

Given the reduced dimensionality representation we can calculate the approximate Euclidean distance as

DR(X, Y) = sqrt(n/N) sqrt( Σ_{i=1}^{N} (x̄_i − ȳ_i)² )

This measure is provably lower bounding.
Independently introduced by two authors: Keogh, Chakrabarti, Pazzani & Mehrotra, KAIS (2000); Byoung-Kee Yi, Christos Faloutsos, VLDB (2000).

Piecewise Aggregate Approximation II
Pros and Cons of PAA as a time series representation.
• Extremely fast to calculate.
• As efficient as other approaches (empirically).
• Supports queries of arbitrary lengths.
• Can support any Minkowski metric.
• Supports non-Euclidean measures.
• Supports weighted Euclidean distance.
• Simple! Intuitive!
• If visualized directly, looks aesthetically unpleasing.

Adaptive Piecewise Constant Approximation I
Basic Idea: Generalize PAA to allow the piecewise constant segments to have arbitrary lengths. Note that we now need 2 coefficients to represent each segment: its value and its length. [Figure: each segment is a pair <cv_i, cr_i>.]
[Figure: raw data (an electrocardiogram); the adaptive representation (APCA), reconstruction error 2.61; Haar wavelet or PAA, reconstruction error 3.27; DFT, reconstruction error 3.11.]
The intuition is this: many signals have little detail in some places, and high detail in other places. APCA can adaptively fit itself to the data, achieving better approximation.

Adaptive Piecewise Constant Approximation II
The high quality of APCA had been noted by many researchers. However, it was believed that the representation could not be indexed because some coefficients represent values, and some represent lengths. However, an indexing method was discovered! (SIGMOD 2001 best paper award.) Unfortunately, it is non-trivial to understand and implement…

Adaptive Piecewise Constant Approximation III
Pros and Cons of APCA as a time series representation.
• Fast to calculate: O(n).
• More efficient than other approaches (on some datasets).
• Supports queries of arbitrary lengths.
• Supports non-Euclidean measures.
• Supports weighted Euclidean distance.
• Supports fast exact queries, and even faster approximate queries on the same data structure.
• Somewhat complex implementation.
• If visualized directly, looks aesthetically unpleasing.
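The PAA mapping and its lower-bounding distance DR(X, Y) from the earlier slides can be sketched in a few lines (my own sketch; it assumes N divides n evenly):

```python
import math

def paa(x, N):
    # Piecewise Aggregate Approximation: the mean of each of N
    # equal-width frames (assumes len(x) is a multiple of N).
    seg = len(x) // N
    return [sum(x[i * seg:(i + 1) * seg]) / seg for i in range(N)]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dr(xbar, ybar, n):
    # DR(X, Y) = sqrt(n/N) * sqrt(sum (xbar_i - ybar_i)^2),
    # provably a lower bound on the true Euclidean distance.
    N = len(xbar)
    return math.sqrt(n / N) * euclidean(xbar, ybar)
```

For any two series of the same length, dr(paa(x, N), paa(y, N), len(x)) <= euclidean(x, y): exactly the lower bounding lemma that GEMINI requires, so PAA can be plugged into a SAM with no false dismissals.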
Natural Language
Pros and Cons of natural language as a time series representation. [Figure: a series described as "rise, plateau, followed by a rounded peak".]
• The most intuitive representation!
• Potentially a good representation for low bandwidth devices like text-messengers.
• Difficult to evaluate.
To the best of my knowledge only one group is working seriously on this representation: the University of Aberdeen SUMTIME group, headed by Prof. Jim Hunter.

Comparison of all dimensionality reduction techniques
• We can compare the time it takes to build the index, for different sizes of databases, and different query lengths.
• We can compare the indexing efficiency: how long does it take to find the best answer to our query? It turns out that the fairest way to measure this is to measure the number of times we have to retrieve an item from disk.
• We can simply compare features. Does approach X allow weighted queries, queries of arbitrary lengths, is it simple to implement…

[Figure: the time needed to build the index for SVD, DWT, DFT and PAA, over a range of database sizes and query lengths. Black-topped histogram bars (in SVD) indicate experiments abandoned at 1,000 seconds. Dataset is stock market data.]
[Figure: the fraction of the data that must be retrieved from disk to answer a one nearest neighbor query, for DFT, DWT, PAA and APCA, over a range of database sizes and query lengths. Dataset is a mixture of many "structured" datasets, like ECGs.]

Summary of Results
On most datasets, it does not matter too much which representation you choose. Wavelets, PAA, DFT all seem to do about the same.
However, on some datasets, in particular those datasets where the level of detail varies over time, APCA can do much better. [Figure: raw data (an electrocardiogram), with its Haar wavelet and adaptive (APCA) representations.]

A Surprising Claim…
The claim I made on the previous slide might surprise some people… I said that for the most part it does not matter which representation we use (in terms of indexing efficiency), but many people have written papers claiming otherwise. Who is right?
More than 95% of all the time series database/data mining papers published do not make any contribution! In the next 20 minutes I will justify the surprising claim above.

Outline of the next 20 minutes of the Tutorial
• My Claim
• Results of a Survey: size of test datasets; number of rival methods considered; diversity of test datasets
• Implementation Bias: what it is, why it matters
• Data Bias: what it is, why it matters
• Case Study: Similarity Measures: subjective testing; objective testing
• Concrete Suggestions

Important Note
• I am anxious that this work should not be taken as being critical of the database/data mining community.
• Note that several of my papers are among the worst offenders in terms of weak experimental evaluation!!!
• My goal is simply to demonstrate that empirical evaluations in the past have often been inadequate, and I hope this section of the tutorial will encourage more extensive experimental evaluations in the future.

The Claim
Much of the work in the time series data mining literature suffers from two types of experimental flaws: implementation bias and data bias (defined later). Because of these flaws, much of the work has very little generalizability to real world problems.
In More Detail…
Many of the contributions offer an amount of "improvement" that would have been completely dwarfed by the variance that would have been observed by testing on many real world datasets, or the variance that would have been observed by changing minor (unstated) implementation details.

Time Series Data Mining Tasks
Indexing (Query by Content): Given a query time series Q, and some similarity/dissimilarity measure D(Q,C), find the nearest matching time series in database DB.
Clustering: Find natural groupings of the time series in database DB under some similarity/dissimilarity measure D(Q,C).
Classification: Given an unlabeled time series Q, assign it to one of two or more predefined classes.

Literature Survey
We read more than 370 papers, but we only included the subset of 58 papers actually cited in our paper when assessing statistics. The subset was chosen based on the following criteria:
• Was the paper ever referenced?
• Was the paper published in a conference or journal likely to be read by a data miner?
In general the papers come from high quality conferences: (SIG)KDD (11), ICDE (11), VLDB (5), SIGMOD/PODS (5), and CIKM (6).

A Cautionary Note
In presenting the results of the survey, we echo the caution of Prechelt, that "while high numbers resulting from such counting cannot prove that the evaluation has high quality, low numbers (suggest) that the quality is low".
Prechelt, L. (1995). A quantitative study of neural network learning algorithm evaluation practices. In proceedings of the 4th Int'l Conference on Artificial Neural Networks.

Finding 1: Size of Test Datasets
We recorded the size of the test dataset for each paper. Where two or more datasets were used, we considered only the size of the largest. The median size of the test databases was only 10,000 objects. Approximately 84% of the test databases are less than one megabyte in size. This number only reflects the indexing papers; the other papers have even smaller sizes.
Finding 2: Number of Rival Methods
We recorded the number of rival methods to which the contribution of each paper is compared. The median number is 1 (the average is 0.91). This number is even worse than it seems, because many of the strawmen are very unrealistic. More about this later… This number reflects all papers included in the survey.

Finding 3: Number of Test Datasets
We recorded the number of different datasets used in the experimental evaluation. The average is 1.85 datasets (1.26 real and 0.59 synthetic). In fact, this number may be optimistic: if you count stock market data as being the same as random walk data, then the average is 1.28 datasets.
Having seen the statistics, let us see why these low numbers are a real problem…

Data Bias
Definition: Data bias is the conscious or unconscious use of a particular set of testing data to confirm a desired finding.
Example: Suppose you are comparing Wavelets to Fourier methods; the following datasets will produce drastically different results… [Figure: one series that is good for wavelets but bad for Fourier, and one that is good for Fourier but bad for wavelets.]

Example of Data Bias: Who to Believe?
For the task of indexing time series for similarity search, which representation is best, the Discrete Fourier Transform (DFT), or the Discrete Wavelet Transform (Haar)?
• "Several wavelets outperform the DFT".
• "DFT-based and DWT-based techniques yield comparable results".
• "Haar wavelets perform slightly better than DFT".
• "DFT filtering performance is superior to DWT*".

Example of Data Bias: Who to Believe II?
To find out who to believe (if anyone) we performed an extraordinarily careful and comprehensive set of experiments. For example…
• We used a quantum mechanical device to generate random numbers.
• We averaged results over 100,000 experiments!
• For fairness, we used the same (randomly chosen) subsequences for both approaches.
Take another quick look at the conflicting claims; the next slide will tell us who was correct…
• "Several wavelets outperform the DFT".
• "DFT-based and DWT-based techniques yield comparable results".
• "Haar wavelets perform slightly better than DFT".
• "DFT filtering performance is superior to DWT*".

"I tested on the Powerplant, Infrasound and Attas datasets, and I know DFT outperforms the Haar wavelet." [Figure: pruning power of DFT vs. Haar on Powerplant, Infrasound and Attas (Aerospace).]
"Stupid Flanders! I tested on the Network, ERPdata and Fetal EEG datasets and I know that there is no real difference between DFT and Haar." [Figure: pruning power of DFT vs. Haar on Network, ERPdata and Fetal EEG.]
"Those two clowns are both wrong! I tested on the Chaotic, Earthquake and Wind datasets, and I am sure that the Haar wavelet outperforms the DFT." [Figure: pruning power of DFT vs. Haar on Chaotic, Earthquake and Wind (3).]

The Bottom Line
Any claims about the relative performance of a time series indexing scheme that is empirically demonstrated on only 2 or 3 datasets should be viewed with suspicion. [Figure: all three pruning power charts side by side.]

Implementation Bias
Definition: Implementation bias is the conscious or unconscious disparity in the quality of implementation of a proposed approach, vs. the quality of implementation of the competing approaches.
Example: Suppose you want to compare your new representation to DFT. You might use the simple O(n²) DFT algorithm rather than spend the time to code the more complex O(n log n) radix-2 algorithm. This would make your algorithm run relatively faster.
Example of Implementation Bias: Similarity Searching

Algorithm sequential_scan(data, query)
  best_so_far = inf;
  for every item i in the database
    if euclidean_dist(data_i, query) < best_so_far
      pointer_to_best_match = i;
      best_so_far = euclidean_dist(data_i, query);
    end;
  end;

Algorithm accum = euclidean_dist(d, q)
  accum = 0;
  for i = 1 to length(d)
    accum = accum + (d_i - q_i)^2;
  end;
  accum = sqrt(accum);

Optimization 1: Neglect to take the square root.
Optimization 2: Pass the best_so_far into the euclidean_dist function, and abandon the calculation if accum ever gets larger than best_so_far.

Results of a Similarity Searching Experiment on Increasingly Large Datasets
These trivial optimizations can have effects which are larger than the speedups claimed for many new techniques. [Figure: seconds taken by Euclid, Opt1 and Opt2 on 10,000, 50,000 and 100,000 objects.] This experiment only considers the main memory case; disk based techniques offer many other possibilities for implementation bias.

Another Example of Implementation Bias I
Researchers have found a difference between the indexing ability of Haar and PAA… We believe that the reported result might be the result of an (unstated) implementation detail. For normalized data, the first Haar coefficient is always zero. What kind of difference would it make if we forgot that fact? Let's find out!

Another Example of Implementation Bias II
We re-implemented the experiments, averaging the results over 100 experiments as the original authors did. We compared the results of two implementations, one that takes advantage of the fact that the 1st coefficient is zero, and one that does not. We tested on 50 datasets, and plotted the ratio of the results of both implementations as a histogram. This minor (unstated) implementation detail can have effects larger than the claimed improvement!
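A runnable version of the sequential scan with both optimizations applied might look like this (my own sketch of the pseudocode above; the squared distance is kept throughout, and the square root is taken only once at the end):

```python
import math

def euclidean_dist(d, q, best_so_far=float("inf")):
    # Optimization 1: work with the *squared* distance (no square root).
    # Optimization 2: abandon early once the running sum exceeds the
    # best squared distance seen so far.
    accum = 0.0
    for di, qi in zip(d, q):
        accum += (di - qi) ** 2
        if accum > best_so_far:
            return float("inf")    # early abandon: cannot be the best match
    return accum

def sequential_scan(data, query):
    best_so_far = float("inf")
    pointer_to_best_match = None
    for i, item in enumerate(data):
        dist = euclidean_dist(item, query, best_so_far)
        if dist < best_so_far:
            best_so_far = dist
            pointer_to_best_match = i
    return pointer_to_best_match, math.sqrt(best_so_far)
```

Whether a strawman baseline includes such trivialities is exactly the kind of unstated detail that can swamp a claimed speedup.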
The Bottom Line
• There are many, many possibilities for implementation bias when implementing a strawman.
• Minor changes in these minor details can have effects as large as the claimed improvement.
• Experiments should be done in a manner which is completely free of implementation bias.

The Bottom Line
Almost all papers that introduced a new distance measure as their primary contribution failed to demonstrate the utility of their measure on a single objective or subjective test. To the best of our knowledge there are no distance measures in the literature that are better than the decades old Euclidean Distance and Dynamic Time Warping.

The Overall Bottom Line
The time series data mining/database community is generally doing very poor quality work.

Concrete Suggestion I
Algorithms should be tested on a wide range of datasets, unless the utility of the approach is only being claimed for a particular type of data. If possible, one subset of the datasets should be used to fine tune the approach, then a different subset of the datasets should be used to do the actual testing. This methodology is widely used in the machine learning community to help prevent implementation and data bias.

Concrete Suggestion II
Where possible, experiments should be designed to be free of the possibility of implementation bias. Note that this does not preclude the addition of extensive implementation testing.

Concrete Suggestion III
Novel similarity measures should be compared to simple strawmen, such as Euclidean distance or Dynamic Time Warping. Some subjective visualization, or objective experiments, should justify their introduction.

Concrete Suggestion IV
Where possible, all data and code used in experiments should be made freely available to allow independent duplication of findings.

Overall Conclusion
• Sorry that the last 20 minutes of this tutorial have been on a pessimistic note!
• I should emphasize that there have been some really clever ideas introduced in the last decade, and we have seen many of them in the first few hours of this tutorial.

Hot Topics in Time Series Research
• In my (subjective) opinion, time series similarity search is dead (or at least dying) as a research area.
• However, there are lots of interesting problems left.
• In the last 10 to 15 minutes of this tutorial, I will discuss the most interesting open problems…

Hot Topics in Time Series Research
• Exploiting symbolic representations of time series.
• Anomaly (interestingness) detection.
• Motif discovery (finding repeated patterns).
• Rule discovery in time series.
• Visualizing massive time series databases.

Exploiting Symbolic Representations of Time Series
• One central theme of this tutorial is that lower bounding is a very useful property. (Recall the lower bounds of DTW, and that R-trees are based on the MinDist function, which lower bounds the distance between a point and an MBR.)
• Another central theme is that dimensionality reduction is very important. That's why we spent so long discussing DFT, DWT, SVD, PAA etc.
• There does not currently exist a lower bounding, dimensionality reducing representation of time series. In the next slide, let us think about what it would mean if we had such a representation…

Exploiting Symbolic Representations of Time Series
If we had a lower bounding, dimensionality reducing representation of time series, we could…
• Use data structures that are only defined for discrete data, such as suffix trees.
• Use algorithms that are only defined for discrete data, such as hashing, association rules etc.
• Use definitions that are only defined for discrete data, such as Markov models, probability theory.
• More generally, we could utilize the vast body of research in text processing and bioinformatics.

Exploiting Symbolic Representations of Time Series
There is now a lower bounding, dimensionality reducing time series representation!
It is called SAX (Symbolic Aggregate ApproXimation). I expect SAX to have a major impact on time series data mining in the coming years… [Figure: a time series discretized by SAX into a string over the alphabet a-f, shown alongside its DFT, PLA, Haar and APCA representations.]

Anomaly (interestingness) detection
We would like to be able to discover surprising (unusual, interesting, anomalous) patterns in time series. Note that we don't know in advance in what way the time series might be surprising. Also note that "surprising" is very context dependent, application dependent, subjective etc.

Arrr... what be wrong with current approaches?
The blue time series at the top is a normal healthy human electrocardiogram with an artificial "flatline" added. The sequence in red at the bottom indicates how surprising local subsections of the time series are under the measure introduced in Shahabi et al. Note that the beginning of each normal heartbeat is very surprising, but the "flatline" is the least surprising part of the time series!!!

• Note that this problem has been solved for text strings.
• You take a set of text which has been labeled "normal", and you learn a Markov model for it.
• Then, any future data that is not modeled well by the Markov model you annotate as surprising.
• Since we have just seen that we can convert time series to text (i.e., SAX), let us quickly see if we can use Markov models to find surprises in time series…

[Figure: training data, test data (subset), and the Markov model surprise score, each about 12,000 points long.] These were converted to the symbolic representation.
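The conversion to symbols here is the SAX procedure: z-normalize, take the PAA, then map each segment mean to a symbol via breakpoints chosen so the symbols are equiprobable under a Gaussian. A minimal sketch for an alphabet of size 4 (my own code; the breakpoint values are the standard N(0,1) quartiles):

```python
import math

def sax(x, N, breakpoints=(-0.67, 0.0, 0.67)):
    # z-normalize so the series has mean 0 and standard deviation 1
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n) or 1.0  # guard: constant series
    z = [(v - mu) / sd for v in x]
    # PAA: mean of each of N equal-width frames (assumes N divides n)
    seg = n // N
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(N)]
    # discretize: the symbol index is the number of breakpoints below the mean
    alphabet = "abcd"  # alphabet size = len(breakpoints) + 1
    return "".join(alphabet[sum(m > b for b in breakpoints)] for m in paa)
```

Because SAX is built on PAA, a lower-bounding distance between SAX strings can be defined, which is what makes it the "lower bounding, dimensionality reducing" representation claimed above.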
(I am showing the original data for simplicity.) In the next slide we will zoom in on this subsection, to try to understand why it is surprising.
(Figure: the same training data, test data (subset), and Markov model surprise plots, 0–12000, with the most surprising subsection marked.)

Normal Time Series vs. Surprising Time Series
(Figure: a zoom-in on the test data, 0–700, annotated: normal sequence; actor misses holster; laughing and flailing hand; normal sequence; briefly swings gun at target, but does not aim.)

Anomaly (interestingness) detection
In spite of the nice example in the previous slide, the anomaly detection problem is wide open.
How can we find interesting patterns…
• Without (or with very few) false positives…
• In truly massive datasets…
• In the face of concept drift…
• With human input/feedback…
• With annotated data…

Motif discovery (finding repeated patterns)
(Figure: the Winding Dataset (the angular speed of reel 2), plotted over 0–2500.)
Are there any repeated patterns, of about this length, in the above time series?

Motif discovery (finding repeated patterns)
(Figure: the Winding Dataset with three occurrences of a motif marked A, B and C; each occurrence is shown enlarged below, plotted over 0–140.)

Why Find Motifs?
• Mining association rules in time series requires the discovery of motifs. These are referred to as primitive shapes and frequent patterns.
• Several time series classification algorithms work by constructing typical prototypes of each class. These prototypes may be considered motifs.
• Many time series anomaly/interestingness detection algorithms essentially consist of modeling normal behavior with a set of typical shapes (which we see as motifs), and detecting future patterns that are dissimilar to all typical shapes.
• In robotics, Oates et al. have introduced a method to allow an autonomous agent to generalize from a set of qualitatively different experiences gleaned from sensors. We see these “experiences” as motifs.
• In medical data mining, Caraca-Valente and Lopez-Chavarrias have introduced a method for characterizing a physiotherapy patient’s recovery based on the discovery of similar patterns. Once again, we see these “similar patterns” as motifs.
• Animation and video capture… (Tanaka and Uehara; Zordan and Celly)

Motif Discovery Challenges
How can we find motifs…
• Without having to specify the length/other parameters
• In massive datasets
• While ignoring “background” motifs (ECG example)
• Under time warping, or uniform scaling
• While assessing their significance

(Figure: the Winding Dataset (the angular speed of reel 2), 0–2500, with motifs A, B and C marked.) Finding these 3 motifs requires about 6,250,000 calls to the Euclidean distance function.

Rule Discovery in Time Series
(Figure: the two patterns of an example rule, plotted over 0–20 and 0–30 respectively.) Support = 9.1, Confidence = 68.1.
Das, G., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discovery from Time Series. In proceedings of the 4th Int’l Conference on Knowledge Discovery and Data Mining. New York, NY, Aug 27-31. pp 16-22.

Papers Based on Rule Discovery from Time Series
• Mori, T. & Uehara, K. (2001). Extraction of Primitive Motion and Discovery of Association Rules from Human Motion.
• Cotofrei, P. & Stoffel, K. (2002). Classification Rules + Time = Temporal Rules.
• Fu, T. C., Chung, F. L., Ng, V. & Luk, R. (2001). Pattern Discovery from Stock Time Series Using Self-Organizing Maps.
• Harms, S. K., Deogun, J. & Tadesse, T. (2002). Discovering Sequential Association Rules with Constraints and Time Lags in Multiple Sequences.
• Hetland, M. L. & Sætrom, P. (2002). Temporal Rules Discovery Using Genetic Programming and Specialized Hardware.
• Jin, X., Lu, Y. & Shi, C. (2002). Distribution Discovery: Local Analysis of Temporal Rules.
• Yairi, T., Kato, Y. & Hori, K. (2001). Fault Detection by Mining Association Rules in House-keeping Data.
• Tino, P., Schittenkopf, C. & Dorffner, G. (2000). Temporal Pattern Recognition in Noisy Non-stationary Time Series Based on Quantization into Symbolic Streams.
…and many more.

All these people are fooling themselves! They are not finding rules in time series (and, given their representation of time series, they cannot), and it is easy to prove this!

A Simple Experiment...

  w    d    Rule        Sup %   Conf %   J-Mea.   Fig
  20   5.5  7 15 8      8.3     73.0     0.0036   (a)
  30   5.5  18 20 21    1.3     62.7     0.0039   (b)

“if stock rises then falls greatly, followed by a smaller rise, then we can expect to see within 20 time units, a pattern of rapid decrease followed by a leveling out.”

(Figure: the two patterns of this rule, (a) plotted over 0–20 and (b) over 0–30.)

  w    d    Rule        Sup %   Conf %   J-Mea.   Fig
  20   5.5  11 15 3     6.9     71.2     0.0042   (a)
  30   5.5  24 20 19    2.1     74.7     0.0035   (b)

The punch line is…

Rule Discovery in Time Series
A paper that will appear in ICDM 2003 (you have a preview copy on your CD-ROM) completely invalidates 50 or so papers on rule discovery in time series (i.e., the vast majority of the work).
So the good news is that the problem is wide open again! The first challenge is to show that it makes sense as a problem…

Visualizing Massive Time Series Databases
The best data mining/pattern recognition tool is the human eye. Can we exploit this fact?
How can we visually summarize massive time series, such that regularities, outliers, anomalies, etc., become visible?
(Figure: the beginning of Mary Had a Little Lamb, patterns of 3 or more notes. Figure by Martin Wattenberg.)

Conclusions
• As with all computer science problems, the right representation is the key to an effective and efficient solution; with time series, there are many choices.
• The evaluation of work in time series data mining has been rather sloppy.
• There are lots of interesting problems left to tackle.

Thanks!
Thanks to the people with whom I have co-authored Time Series Papers:
• Wagner Truppel
• Marios Hadjieleftheriou
• Dimitrios Gunopulos
• Michail Vlachos
• Marc Cardle
• Stephen Brooks
• Bhrigu Celly
• Jiyuan An, H. Chen, K. Furuse and N. Ohbo
• Michael Pazzani
• Sharad Mehrotra
• Kaushik Chakrabarti
• Selina Chu
• David Hart
• Padhraic Smyth
• Jessica Lin
• Bill ‘Yuan-chi’ Chiu
• Stefano Lonardi
• Shruti Kasetty
• Chotirat (Ann) Ratanamahatana
• Pranav Patel
• Harry Hochheiser
• Ben Shneiderman
• Victor Zordan
• Your name here! (I welcome collaborators)

Questions?
All datasets and code used in this tutorial can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html