Time-Series Data Management
Yonsei University, 2nd Semester, 2014
Sanghyun Park
* The slides were extracted from the material presented at ICDM’01 by Eamonn Keogh

Contents
  Introduction, motivation
  Utility of similarity measurements
  Indexing time series
  Summary, conclusions

What Are Time Series?
A time series is a collection of observations made sequentially in time.
Example values: 25.2250 25.2500 25.2500 25.2750 25.3250 25.3500 25.3500 25.4000 25.4000 25.3250 25.2250 25.2000 25.1750 .. .. 24.6250 24.6750 24.6750 24.6250 24.6250 24.6250 24.6750 24.7500
[Figure: plot of the example series; horizontal axis 0 to 500, vertical axis 23 to 29]

Time Series Are Ubiquitous (1/2)
People measure things …
  The president's approval rating
  Their blood pressure
  The annual rainfall in Riverside
  The value of their Yahoo stock
  The number of web hits per second
… and things change over time. Thus time series occur in virtually every medical, scientific, and business domain.

Time Series Are Ubiquitous (2/2)
A random sample of 4,000 graphics from 15 of the world’s newspapers published from 1974 to 1989 found that more than 75% of all graphics were time series.

Time Series Similarity
Defining the similarity between two time series is at the heart of most time series data mining applications and tasks.
Thus time series similarity will be the primary focus of this lecture.

Utility Of Similarity Search (1/2)
  Classification
  Clustering

Utility Of Similarity Search (2/2)
  Rule Discovery
  [Figure: an example rule annotated with s = 0.5 and c = 0.3]
  Query by Content
  [Figure: a query Q (template) compared against database C, which holds ten series numbered 1 to 10]

Challenges Of Research On Time Series (1/3)
How do we work with very large databases?
  1 hour of ECG data: 1 gigabyte
  Typical web log: 5 gigabytes per week
  Space shuttle database: 158 gigabytes and growing
  Macho database: 2 terabytes, updated with 3 gigabytes per day
Since most of the data lives on disk (or tape), we need a representation of the data that we can manipulate efficiently.

Challenges Of Research On Time Series (2/3)
We are dealing with subjective notions of similarity.
The definition of similarity depends on the user, the domain, and the task at hand. We need to handle this subjectivity.

Challenges Of Research On Time Series (3/3)
Miscellaneous data handling problems
  Differing data formats
  Differing sampling rates
  Noise, missing values, etc.

Whole Matching vs. Subsequence Matching (1/2)
Whole matching
Given a query Q, a reference database C, and a distance measure, find the C_i that best matches Q.
[Figure: query Q (template) compared against the ten series of database C; C6 is the best match]

Whole Matching vs. Subsequence Matching (2/2)
Subsequence matching
Given a query Q, a reference database C, and a distance measure, find the location that best matches Q.
[Figure: query Q (template) slid along a long series in database C, with the best matching subsection highlighted]

Motivation Of Similarity Search
You go to the doctor because of chest pains. Your ECG looks strange …
Your doctor wants to search a database to find similar ECGs, in the hope that they will offer clues about your condition …
Two questions
  How do we define similar?
  How do we search quickly?

Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. Their distance is denoted as D(O1,O2).
What properties should a distance measure have?
  Symmetry: D(A,B) = D(B,A)
  Constancy of self-similarity: D(A,A) = 0
  Positivity: D(A,B) = 0 iff A = B
  Triangular inequality: D(A,B) ≤ D(A,C) + D(B,C)

The Minkowski Metrics
$D(Q,C) = \left( \sum_{i=1}^{n} |q_i - c_i|^p \right)^{1/p}$
  p = 1: Manhattan (Rectilinear, City Block)
  p = 2: Euclidean
  p = ∞: Max (Supremum, “sup”)

Euclidean Distance Metric
Given two time series Q = q1…qn and C = c1…cn, their Euclidean distance is defined as:
$D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$

Processing The Data Before Distance Calculation
If we naively try to measure the distance between two “raw” time series, we may get very unintuitive results.
This is because Euclidean distance is very sensitive to some distortions in the data.
For most problems these distortions are not meaningful, and thus we can and should remove them.
Four most common distortions
  Offset translation
  Amplitude scaling
  Linear trend
  Noise

Offset Translation
  Q = Q - mean(Q)
  C = C - mean(C)
  D(Q,C)
[Figure: Q and C before and after subtracting their means]

Amplitude Scaling
  Q = (Q - mean(Q)) / std(Q)
  C = (C - mean(C)) / std(C)
  D(Q,C)
[Figure: Q and C before and after rescaling]

Linear Trend
The intuition behind removing linear trend is this: fit the best fitting straight line to the time series, then subtract that line from the time series.
[Figure: two series after removing offset translation and amplitude scaling, and the same series after also removing linear trend]

Noise
The intuition behind removing noise is this: average each datapoint value with its neighbors.
  Q = smooth(Q)
  C = smooth(C)
  D(Q,C)
[Figure: Q and C before and after smoothing]

Dynamic Time Warping
We will first see the utility of DTW, then see how it is calculated.
  Fixed Time Axis: sequences are aligned “one to one”.
  “Warped” Time Axis: nonlinear alignments are possible.

Utility of DTW: Example I, Machine Learning
Cylinder-Bell-Funnel
[Figure: example instances of the Cylinder, Bell, and Funnel classes]
This dataset has been studied in a machine learning context by many researchers.
Recall that, by definition, the instances of Cylinder-Bell-Funnel are warped in the time axis.

Classification Experiment on C-B-F Dataset (1/2)
Experimental settings
  Training data consists of 10 exemplars from each class
  (One) nearest neighbor algorithm
  “Leaving-one-out” evaluation, averaged over 100 runs
Results
  Error rate using Euclidean Distance: 26.10%
  Error rate using Dynamic Time Warping: 2.87%
  Time to classify one instance using Euclidean Distance: 1 sec
  Time to classify one instance using Dynamic Time Warping: 4,320 sec

Classification Experiment on C-B-F Dataset (2/2)
Dynamic time warping can reduce the error rate by an order of magnitude.
Its classification accuracy is competitive with sophisticated approaches like decision trees, boosting, neural networks, and Bayesian techniques.
But, it is slow …

Utility of DTW: Example II, Data Mining
Power-demand time series: each sequence corresponds to a week’s demand for power in a Dutch research facility in 1997.
[Figure: one week of power demand, annotated “Wednesday was a national holiday”]

Hierarchical Clustering with Euclidean Distance
[Figure: dendrogram over weeks 1 to 7]
The two 5-day weeks are correctly grouped.
Note, however, that the three 4-day weeks are not clustered together, and neither are the two 3-day weeks.

Hierarchical Clustering with Dynamic Time Warping
[Figure: dendrogram over weeks 1 to 7]
The two 5-day weeks are correctly grouped.
The three 4-day weeks are clustered together.
The two 3-day weeks are also clustered together.
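The experimental setup used in these examples (one nearest neighbor classification, leave-one-out evaluation, Euclidean distance versus DTW) can be sketched as follows. This is a minimal illustration, not the code behind the reported numbers: the toy series, the function names, the use of the squared difference as the local cost, and the omission of any path-length normalization are all assumptions made here.

```python
import math

def znorm(x):
    """Remove offset translation and amplitude scaling: (x - mean) / std."""
    mu = sum(x) / len(x)
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    if sd == 0:
        return [0.0] * len(x)
    return [(v - mu) / sd for v in x]

def euclidean(q, c):
    """Fixed time axis: point i of q is compared only with point i of c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))

def dtw(q, c):
    """Warped time axis: cheapest monotone alignment, found by dynamic programming.
    Local cost is the squared difference (an assumption of this sketch)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(c)          # row 0 of the cumulative cost matrix
    for a in q:
        cur = [inf]
        for j, b in enumerate(c, start=1):
            cur.append((a - b) ** 2 + min(prev[j - 1], prev[j], cur[j - 1]))
        prev = cur
    return math.sqrt(prev[-1])

def loo_error(series, labels, dist):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier."""
    wrong = 0
    for i, s in enumerate(series):
        others = (j for j in range(len(series)) if j != i)
        nearest = min(others, key=lambda j: dist(s, series[j]))
        wrong += labels[nearest] != labels[i]
    return wrong / len(series)

# Hypothetical toy data: the same shape shifted along the time axis.
bumps = [[0, 0, 1, 2, 1, 0, 0, 0], [0, 0, 0, 1, 2, 1, 0, 0], [0, 1, 2, 1, 0, 0, 0, 0]]
plateaus = [[0, 0, 2, 2, 2, 0, 0, 0], [0, 0, 0, 2, 2, 2, 0, 0], [0, 2, 2, 2, 0, 0, 0, 0]]
data = [znorm(s) for s in bumps + plateaus]
labels = ["bump"] * 3 + ["plateau"] * 3

print("Euclidean error:", loo_error(data, labels, euclidean))
print("DTW error:      ", loo_error(data, labels, dtw))
```

Because the diagonal alignment is one of the paths DTW may choose, `dtw(q, c)` never exceeds `euclidean(q, c)` for equal-length inputs under these assumptions. The price is the nested loop: each comparison costs time proportional to the product of the two lengths, which is consistent with the large running-time gap reported above.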
Time Taken to Create Hierarchical Clustering of Power-Demand Time Series
  Time to create dendrogram using Euclidean Distance: 1.2 seconds
  Time to create dendrogram using Dynamic Time Warping: 3.40 hours

Computing the Dynamic Time Warp Distance (1/2)
Note that the input sequences can be of different lengths.
[Figure: search matrix with Q (length p) on one axis and C (length n) on the other; the warping path w_1, …, w_k, … passes through cells (i, j)]

Computing the Dynamic Time Warp Distance (2/2)
Every possible mapping from Q to C can be represented as a warping path in the search matrix.
We simply want to find the cheapest one …
$DTW(Q,C) = \min \left\{ \frac{\sqrt{\sum_{k=1}^{K} w_k}}{K} \right\}$
Although there are exponentially many such paths, we can find the cheapest one in only quadratic time using dynamic programming:
$\gamma(i,j) = d(q_i, c_j) + \min\{ \gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1) \}$

Fast Approximation to Dynamic Time Warping Distance (1/2)
Simple idea: approximate the time series with some compressed or downsampled representation, and do DTW on the new representation.
[Figure: search matrix for Q and C with a warping path computed on the reduced representation]
How well does this work …

Fast Approximation to Dynamic Time Warping Distance (2/2)
[Figure: two alignments of the same pair of series, labeled 22.7 sec and 1.3 sec]
… Strong visual evidence to suggest it works well.

Weighted Distance Measures (1/3)
Intuition: for some queries, different parts of the sequence are more important.

Weighted Distance Measures (2/3)
$D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
$D(Q,C,W) = \sqrt{\sum_{i=1}^{n} w_i (q_i - c_i)^2}$
[Figure: query with a histogram W drawn beneath it]
The height of this histogram indicates the relative importance of that part of the query.

Weighted Distance Measures (3/3)
How do we set the weights?
One possibility: relevance feedback, which is the reformulation of a query in response to feedback provided by the user on the results of the previous query.
  Term Vector [Jordan, Cow, Bull, River], Term Weights [1, 1, 1, 1]
  Search → Display Results → Gather Feedback → Update Weights
  Term Vector [Jordan, Cow, Bull, River], Term Weights [1.1, 1.7, 0.3, 0.9]

Indexing Time Series (1/6)
We have seen techniques for assessing the similarity of two time series.
However, we have not addressed the problem of finding the best match to a query in a large database …
The obvious solution, to retrieve and examine every item (sequential scanning), simply does not scale to large datasets.
We need some way to index the data.
[Figure: database C of ten series, numbered 1 to 10]

Indexing Time Series (2/6)
We can project a time series of length n into n-dimensional space.
The first value in C is the X-axis, the second value in C is the Y-axis, etc.
One advantage of doing this is that we have abstracted away the details of “time series”; now all query processing can be imagined as finding points in space …

Indexing Time Series (3/6)
We can project the query time series Q into the same n-dimensional space and simply look for the nearest points.
[Figure: query point Q among the database points]
The problem is that we have to look at every point to find the nearest neighbor.

Indexing Time Series (4/6)
The Minkowski metrics have simple geometric interpretations.
[Figure: shapes of the query region under Euclidean, Weighted Euclidean, Manhattan, and Max]

Indexing Time Series (5/6)
We can group clusters of datapoints with “boxes” called Minimum Bounding Rectangles (MBR).
[Figure: datapoints grouped into MBRs R1 to R9]
We can further recursively group MBRs into larger MBRs.

Indexing Time Series (6/6)
These nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree). Examples include R-tree, Hybrid-tree, etc.
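The reason MBRs help can be sketched in a few lines: the distance from a query to a box is a lower bound on the distance to every point inside it, so a whole box can be skipped once a closer point is already known. The sketch below uses a flat list of groups rather than a real tree, and the names `mbr`, `mindist`, and `nearest` are hypothetical helpers introduced for illustration, not part of any spatial access method library. It assumes Euclidean distance and Python 3.8 or later (for `math.dist`).

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a group: per-dimension lows and highs."""
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return lows, highs

def mindist(query, box):
    """Smallest Euclidean distance from query to any location inside box."""
    lows, highs = box
    total = 0.0
    for x, lo, hi in zip(query, lows, highs):
        if x < lo:
            total += (lo - x) ** 2
        elif x > hi:
            total += (x - hi) ** 2
    return math.sqrt(total)

def nearest(query, groups):
    """Nearest neighbor search that opens boxes in order of their lower bound."""
    best, best_d, examined = None, float("inf"), 0
    ranked = sorted(((mindist(query, mbr(g)), g) for g in groups), key=lambda t: t[0])
    for lower, group in ranked:
        if lower >= best_d:
            break                      # no remaining box can hold a closer point
        examined += 1
        for p in group:
            d = math.dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d, examined

# Hypothetical 2-D data: three groups of points, one query.
groups = [[(0, 0), (1, 1)], [(10, 10), (11, 12)], [(5, 5), (6, 4)]]
point, dist, opened = nearest((0.5, 0.2), groups)
print(point, round(dist, 3), "boxes opened:", opened)
```

A tree applies the same test recursively, from the largest MBRs down to the data nodes.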
[Figure: tree with internal nodes R10, R11, R12 above the leaf MBRs R1 to R9, which point to data nodes containing points]

Dimensionality Curse (1/4)
If we project a query into n-dimensional space, how many additional MBRs must we examine before we are guaranteed to find the best match?
For the one-dimensional case, the answer is clearly 2.

Dimensionality Curse (2/4)
For the two-dimensional case, the answer is 8.

Dimensionality Curse (3/4)
For the three-dimensional case, the answer is 26.

Dimensionality Curse (4/4)
More generally, in n-dimensional space we must examine $3^n - 1$ MBRs; n = 21 → 10,460,353,202 MBRs.
This is known as the curse of dimensionality.

Spatial Access Methods
We can use Spatial Access Methods (SAMs) like the R-tree to index our data, but …
The performance of R-trees degrades exponentially with the number of dimensions. Somewhere above 6-20 dimensions the R-tree degrades to linear scanning.
Often we want to index time series with hundreds, perhaps even thousands of features.

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (1/8)
  Establish a distance metric from a domain expert
  Produce a dimensionality reduction technique that reduces the dimensionality of the data from n to N, where N can be efficiently handled by your favorite SAM
  Produce a distance measure defined on the N-dimensional representation of the data, and prove that it obeys $D_{indexspace}(A,B) \le D_{true}(A,B)$ (the lower bounding lemma)
  Plug into an off-the-shelf SAM

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (2/8)
We have 6 objects in 3-D space. We issue a query to find all objects within 1 unit of the point (-3, 0, -2).
[Figure: objects A to F plotted in 3-D space]

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (3/8)
The query successfully finds the object E.
[Figure: objects A to F in 3-D space, with E inside the query region]

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (4/8)
Consider what would happen if we issued the same query after reducing the dimensionality to 2, assuming the dimensionality reduction technique obeys the lower bounding lemma.
Informally, it’s OK if objects appear “closer” in the dimensionality-reduced space than in the true space.

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (5/8)
Note that because of the dimensionality reduction, object F appears to be less than one unit from the query (it is a “false alarm”).
[Figure: objects A to F projected into 2-D space]

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (6/8)
This is OK so long as it does not happen too much, since we can always retrieve it, then test it in the true, 3-dimensional space.
This would leave us with just E, the correct answer.

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (7/8)
Now, let’s consider a dimensionality reduction technique in which the lower bounding lemma is not satisfied.
Informally, some objects appear further apart in the dimensionality-reduced space than in the true space.
[Figure: objects A to F projected into 2-D space by a technique that violates the lemma]

GEMINI (GEneric Multimedia INdexIng) {Christos Faloutsos} (8/8)
Note that because of the dimensionality reduction, object E appears to be more than one unit from the query (it is a “false dismissal”).
This is unacceptable because we have failed to find the true answer set to our query.
These examples illustrate why the lower bounding lemma is so important.
Now all we have to do is to find a dimensionality reduction technique that obeys the lower bounding lemma, and we can index our time series.

Notation for Dimensionality Reduction
For the future discussion of dimensionality reduction we will assume that:
  M is the number of time series in our database
  n is the original dimensionality of the data
  N is the reduced dimensionality of the data
  Cratio = N/n is the compression ratio

An Example of a Dimensionality Reduction Technique (1/5)
[Figure: time series C, n = 128]
Raw Data: 0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …
The graphic shows a time series with 128 points.
The raw data used to produce the graphic is also reproduced as a column of numbers (just the first 30 or so points are shown).

An Example of a Dimensionality Reduction Technique (2/5)
We can decompose the data into 64 pure sine waves using the Discrete Fourier Transform (just the first few sine waves are shown in the next slide).
The Fourier coefficients are reproduced as a column of numbers (just the first 30 or so coefficients are shown).
Note that at this stage we have not done dimensionality reduction; we have merely changed the representation.

An Example of a Dimensionality Reduction Technique (3/5)
[Figure: time series C with its first few sine waves drawn beneath it; the raw data column repeats the values listed above]
Fourier Coefficients: 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...
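The claim that the Fourier transform only changes the representation can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a naive O(n²) transform with orthonormal scaling (a choice made here so that energy is preserved exactly), and a hypothetical two-sine series stands in for the series C of the slides, whose full values are not reproduced here.

```python
import cmath
import math

def dft(x):
    """Orthonormal discrete Fourier transform (naive O(n^2) version)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    """Inverse of dft; returns the real part, which is exact for real input series."""
    n = len(coeffs)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real
            for t in range(n)]

# Hypothetical stand-in for C: 128 points built from two sine waves.
n = 128
series = [math.sin(2 * math.pi * t / n) + 0.5 * math.sin(2 * math.pi * 3 * t / n)
          for t in range(n)]
coeffs = dft(series)

# Same information, different representation: the round trip recovers the series.
round_trip_error = max(abs(a - b) for a, b in zip(series, idft(coeffs)))
print("max round-trip error:", round_trip_error)

# With this scaling the energy is the same in both representations (Parseval).
print("energy in time domain:     ", sum(v * v for v in series))
print("energy in frequency domain:", sum(abs(c) ** 2 for c in coeffs))
```

For a real-valued series the coefficients at positions k and n - k are complex conjugates of each other, so only about half of the n complex coefficients carry independent information.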
An Example of a Dimensionality Reduction Technique (4/5)
Note that the first few sine waves tend to be the largest (equivalently, the magnitude of the Fourier coefficients tends to decrease as you move down the column).
We can therefore truncate most of the small coefficients with little effect.
Instead of taking the first few coefficients, we could take the “best” coefficients.
This can help greatly in terms of approximation quality, but makes indexing hard.

An Example of a Dimensionality Reduction Technique (5/5)
[Figure: time series C and its approximation C’ rebuilt from the truncated coefficients; the raw data and Fourier coefficient columns repeat the values listed on the earlier slides]
Truncated Fourier Coefficients: 1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928
n = 128, N = 8, Cratio = 1/16
We have discarded 15/16 of the data.

[Figure: the same series approximated by six representations: DFT, DWT, SVD, APCA, PAA, and PLA]

Directions for Future Research
  Time series in 2, 3, K dimensions
  Transforming other problems into time series problems
  Weighted distance measures
  Relevance feedback
  Approximation to SVD
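As a closing illustration, the truncated Fourier representation can be combined with the GEMINI filter-and-refine idea described earlier. The sketch below is a minimal version under stated assumptions: an orthonormal, naive DFT; a linear scan standing in for the spatial access method; and hypothetical names (`dft_prefix`, `index_dist`, `range_query`). It keeps the first `n_coef` complex coefficients; how the N = 8 values of the slides map onto complex coefficients is not specified in the source, so `n_coef=4` is only a placeholder.

```python
import cmath
import math

def dft_prefix(x, n_coef):
    """First n_coef orthonormal DFT coefficients of x (the reduced representation)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n_coef)]

def true_dist(a, b):
    """Euclidean distance in the original n-dimensional space."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def index_dist(fa, fb):
    """Distance on the kept coefficients. Dropping terms from a sum of squares
    can only shrink it, so this lower-bounds true_dist."""
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(fa, fb)))

def range_query(query, database, radius, n_coef=4):
    """Filter in the reduced space, then refine in the true space."""
    fq = dft_prefix(query, n_coef)
    # A real system would precompute and index these features; here they are
    # recomputed on the fly to keep the sketch short.
    candidates = [s for s in database
                  if index_dist(fq, dft_prefix(s, n_coef)) <= radius]
    answers = [s for s in candidates if true_dist(query, s) <= radius]
    return answers, len(candidates)

# Hypothetical database: sine waves of length 32 with varying frequency and phase.
params = [(1, 0.0), (1, 0.3), (2, 0.0), (3, 1.0), (5, 0.5)]
database = [[math.sin(2 * math.pi * f * t / 32 + ph) for t in range(32)]
            for f, ph in params]

answers, n_candidates = range_query(database[0], database, radius=1.0)
print("candidates after filtering:", n_candidates)
print("answers after refinement:  ", len(answers))
```

Any candidate that passes the filter but fails the refinement is a false alarm in the sense used above. Because the reduced distance never exceeds the true distance (up to floating-point rounding), no true answer is removed by the filter, so there are no false dismissals.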