Prototype Optimization for Temporarily and Spatially Distorted Time Series

Bastian Hartmann, Ingo Schwab, Norbert Link
Institute of Applied Research, University of Applied Sciences Karlsruhe
Moltkestrasse 30, 76133 Karlsruhe, Germany

represents a signal class) of time series for generating prototypes is complicated, we propose a method for automatically finding optimal prototypes in roughly segmented data. Our approach is not limited to DTW, but is transferable to other time series analysis problems.

The problem that we focus on can be formulated as follows: We have roughly segmented signals (e.g. extracted by automated labelling) as representatives of different classes, with a class-relevant signal part and redundant information that is not related to the class. The goal is to find prototypes in the form of subsignals of these class templates which optimally represent their classes. One prototype and a corresponding threshold are found for each class, improving the quality of threshold classification compared to the original class templates. In our approach we utilize two optimization methods to find prototypes: brute force search and a time-saving evolution strategy.

In recent years, work on finding prototypes or motifs has been published in different fields of research. However, many methods assume that the underlying distance measure is a metric or fulfils the triangle inequality (Mueen et al. 2009; Ferreira et al. 2006), which is not the case for DTW. (Ye and Keogh 2009) search for optimal prototypes in subsequences of time series signals. Our approach differs from this work in that we use other target functions and other optimization methods.

Abstract
An important issue in time series classification problems is to find representative prototypes. Especially for roughly segmented time series with spatial distortions, such as human gestures, it is complicated to find templates that optimally represent signal classes.
In this paper we present an approach to find optimal time series prototypes in subseries of class templates. Our optimization approach is based on separability measures for prototype candidates and utilizes (but is not limited to) DTW in order to tackle the problem of spatial and temporal distortions. The search for prototypes in the target space is performed by means of a brute force search as well as an evolution strategy. In our experiments with an artificial dataset we show that brute force search optimization is able to improve the time series classification result and that the application of an evolution strategy yields comparable target function scores while reducing computing time.

Introduction
Time series analysis is a topic of high interest in many fields of research. In our work on interpreting human behavior for integrating human workers in software-controlled manufacturing (Hartmann, Schauer, and Link 2009), we deal with the recognition of human gestures and gesture-like (low level) activities. Because temporal information is an important feature in gesture recognition, we apply time series analysis techniques to signals from wearable sensors, such as accelerometers or angular rate sensors. However, a problem in human gesture recognition is that gestures show high variations in their motion patterns. Thus, besides HMM (Junker et al. 2008), a suitable method used in this field of research is dynamic time warping (DTW) (Ko et al. 2008).

A general question in DTW classification is how to choose representative class templates, which are also called prototypes or motifs. In many DTW applications, prototypes are chosen from a set of sequences which have been recorded with exactly known start and end points or are segmented from recorded time series (Urban et al. 2004; Corradini 2001; Ko et al. 2008).
Since accurate segmentation (that is, choosing a signal time interval which optimally

by aligning and taking the average (Abdulla, Chow, and Sin 2003; Ko et al. 2008). Depending on the selected template, an appropriate threshold has to be chosen for a DTW-based classification method. However, as with the choice of representative class templates, there is no general method for determining such a threshold.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Dynamic Time Warping
In our work we focus on the recognition of human gestures and gesture-like activities, where, due to motion variations, accelerations and angular rates appear as signals of varying time scale. Therefore, the ability to measure distances between dynamically shifted time series makes DTW appropriate for comparing signals from human activities. DTW (Sakoe and Chiba 1978) is based on a dynamic programming technique which has its origins in speech processing. The basic idea of DTW is to calculate the similarity of two time series $C = (c_1, \dots, c_N)$ and $T = (t_1, \dots, t_M)$ by summing distances $d(c_n, t_m)$ of pairs of sampling points $c_n$ and $t_m$ in such a way that the result is minimized. The Euclidean distance is usually chosen as the pair distance measure $d(\cdot)$; however, any positive distance measure can be used. DTW can thus be formulated as an optimization problem for finding the warping path $W$ that attains the minimum in

$$\mathrm{DTW}(C, T) = \min_{W} \sum_{l=1}^{L} d\left(c_{w_c(l)}, t_{w_t(l)}\right), \tag{1}$$

in which $w_c(l)$ and $w_t(l)$ are the elements of the warping path $W = [(w_c(1), w_t(1)), \dots, (w_c(L), w_t(L))]$ containing index pairs of $C$ and $T$. Thus, the warping path contains the solution of sampling point pairs with minimal overall distance. Because arbitrary sequential assignments of warping path elements can create warping paths which are not meaningful in practice, possible paths have to be restricted.
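As a minimal sketch of the dynamic program behind equation (1), the following Python function fills an accumulated-cost table in which each cell extends the cheapest of its three predecessor cells. The function name `dtw_distance`, the absolute difference used as pair distance for scalar samples, and the omission of any band constraint or path-length weighting are assumptions made for illustration; they are not taken from the implementation used in this paper.

```python
import math


def dtw_distance(c, t, dist=lambda a, b: abs(a - b)):
    """Accumulated cost of the cheapest warping path between c and t."""
    n, m = len(c), len(t)
    # acc[i][j]: minimal cost of a path that ends at the pair (c[i-1], t[j-1]).
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0  # paths must start at the first sample of both series
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cheapest_predecessor = min(
                acc[i - 1][j],      # advance in c only
                acc[i][j - 1],      # advance in t only
                acc[i - 1][j - 1],  # advance in both series
            )
            acc[i][j] = dist(c[i - 1], t[j - 1]) + cheapest_predecessor
    return acc[n][m]  # paths must end at the last sample of both series


# A time-stretched copy of a signal is matched without cost.
stretched_cost = dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

The table has (N+1)(M+1) cells, so the cost of one comparison grows with the product of the two series lengths.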
Typical path restrictions are:
• Monotonicity: Successive elements in the warping path may not contain preceding indices: $w_c(l) \geq w_c(l-1)$ and $w_t(l) \geq w_t(l-1)$.
• Continuity: Successive elements have a limited increment: $w_c(l) - w_c(l-1) \leq 1$ and $w_t(l) - w_t(l-1) \leq 1$.
• Boundaries: Paths start at the first index of the time series and end at their last indices: $w_c(1) = 1$ and $w_t(1) = 1$ as well as $w_c(L) = N$ and $w_t(L) = M$.
Some DTW algorithms incorporate warping path weights, which normalize DTW results calculated over different warping path lengths.

An Approach for Finding Optimal Time Series Prototypes
An important issue in time series classification problems is how to find representative prototypes for time series classes. This is particularly interesting, since representative prototypes improve the quality of classification. Therefore, we search for optimal time series prototypes in the subseries of time series templates. Since we utilize DTW threshold classification in our application, we want to find one prototype representing each class and a corresponding threshold.

To obtain a good separation between a given class of time series and other classes of time series, it is desirable that a template has a small distance to time series of its own class (intraclass distance) and a high distance to time series of other classes (interclass distance). The graph at the top of figure 1 illustrates the interclass distance and intraclass distance distributions of DTW score values of a template belonging to class C. For this template the distributions show a high overlap. Therefore, it would be desirable to find a prototype for which the distributions appear more separated (as shown in the graphs in the middle or at the bottom of the figure).

Besides this common representation, different extensions of DTW are possible. A method to avoid unrealistic warping paths (e.g.
appearing in extremely shifted signals) and to reduce computing time is to prohibit extreme combinations of index pairs $w_c(l)$ and $w_t(l)$ in warping path elements. This can be achieved by using constraints such as the Sakoe-Chiba band (Sakoe and Chiba 1978) or the Itakura parallelogram (Itakura 1975). In (Ko et al. 2008), variations in the start points and end points of the time series $T$ are allowed. This extension is necessary for detecting subsignals in long time series or for online DTW processing.

Since our work is focused on recognizing human activities, it often occurs that time series of the same class are not only dynamically shifted in time but also show variations in their amplitudes. To address this problem we apply a range normalization that maps the two time series signals to be compared to the same range of values. A one-dimensional time series $A = (a_1, \dots, a_N)$ or the components of a multi-dimensional time series are normalized via

$$A_{\mathrm{norm}} = \frac{A - \min(A)}{\max(A) - \min(A)} \tag{2}$$

to the $[0, 1]$ range.

Figure 1: Distance distributions of DTW score values.

As mentioned in the previous section, there is no general method for prototype selection and for determining appropriate thresholds. Furthermore, for practical applications, the generation of prototypes requires well segmented signals (e.g. in the form of templates cut out manually by an expert). However, it is often not guaranteed in practice that long time series signals containing candidates for prototypes are accurately segmented in the sense of yielding good representatives. For these reasons, we search for representative prototypes in the subseries of roughly segmented time series. The subseries that we refer to as optimal should have low (DTW)

In our work we utilize the DTW distance measure for the classification of human activities. This requires that prototypes representing a class of time series have to be found. A typical method (Ko et al.
2008) for selecting a prototype template out of a set of class template candidates is to choose the template which yields the best recognition rate or the one which has the smallest average DTW distance to all other class templates. A more sophisticated method is creating an averaged template from a number of class templates

gain. However, the result in the graph at the bottom appears to be better separated and more robust for the application of a threshold classification. The target functions which we study in our work focus on achieving separability by maximizing the spread between the intraclass and interclass DTW scores of the classes. Since there is no general approach to this problem, we have formulated different target functions, which are described in the following. It has to be noted that all proposed target functions are maximization functions, that is, a maximum has to be found by varying the target function parameters (i.e. the parameters of the prototype $C_{\mathrm{opt}}$). Moreover, since we assume that all time series templates to which the target functions are applied consist of a class-relevant signal part and redundant information before and after the signal, it is important to mention that DTW comparisons always have to allow flexible start and end points. This means that $\mathrm{DTW}(C_{\mathrm{opt}}, T_j)$ always measures the distance from a prototype candidate series $C_{\mathrm{opt}}$ to the most similar subseries of template $T_j$.

distance to templates of its own class and high (DTW) distance to templates of other classes. Suppose we have a dataset which consists of time series template instances $C_i = (c_{i,1}, \dots, c_{i,N_i})$; $i = 1, \dots, I$ of class $C$ and instances $T_j = (t_{j,1}, \dots, t_{j,M_j})$; $j = 1, \dots, J$ of class $T$ (extension to the multi-class case is straightforward).
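The flexible start and end points required for the DTW comparisons above can be sketched as a subsequence variant of the DTW recursion: the first row of the cost table is free, so a path may begin anywhere in the template, and the minimum over the last row lets it end anywhere. The name `subsequence_dtw` and the squared difference for scalar samples are illustrative assumptions; range normalization and the Sakoe-Chiba band are left out to keep the sketch short.

```python
import math


def subsequence_dtw(proto, series, dist=lambda a, b: (a - b) ** 2):
    """Smallest DTW cost between proto and any contiguous subseries of series."""
    n, m = len(proto), len(series)
    # A zero first row lets the warping path start at any sample of `series`.
    prev = [0.0] * (m + 1)
    for i in range(1, n + 1):
        # cur[0] stays infinite: proto samples cannot be matched to nothing.
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cur[j] = dist(proto[i - 1], series[j - 1]) + min(
                prev[j], cur[j - 1], prev[j - 1]
            )
        prev = cur
    # The minimum over the last row lets the path end at any sample.
    return min(prev[1:])


# The prototype is found inside a longer, roughly segmented template.
cost_inside = subsequence_dtw([1, 2, 3], [9, 9, 1, 2, 3, 9])
```

Only two rows of the table are kept, since the score alone is needed for the target functions and not the warping path itself.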
In order to separate $C$ from $T$, our problem formulation is now to determine one representative prototype $C_{\mathrm{opt}}$ as a subseries of an instance of $C$ which has low DTW distance to templates of its own class and high DTW distance to templates of class $T$. Thus, $C_{\mathrm{opt}}$ can be described as

$$C_{\mathrm{opt}} = (c_{i,\alpha}, \dots, c_{i,\beta}); \quad 1 \leq \alpha < \beta \leq N_i, \tag{3}$$

which is the subseries from sampling point $\alpha$ to sampling point $\beta$ of the instance $C_i$ of class $C$. Optimally segmented time series prototypes have the following advantages:
• Optimal prototypes separate a class of time series from another time series class better than the original class templates. Thus, a better classification rate is achieved.
• Whereas practical applications usually require that the templates of a class are accurately segmented, we propose a method that takes roughly segmented signals (e.g. from an automated labelling method) as input and finds optimal prototypes among their subseries.
• Because the prototypes found are subsequences of the original templates, their length is equal to or less than the length of the original templates. Therefore, computing time can be reduced, since it depends on the length of the time series.
It should be noted that the presented approach is focused on the DTW similarity measure, but it is not limited to DTW and can also be transferred to other time series similarity measures. In the following subsections we explain how we optimize templates. This includes the definition of target functions as measures for the representativeness of prototypes as well as optimization methods for searching the target space.

Figure 2: Illustration of target functions.

Minimal Interclass to Maximal Intraclass Distance (min max): A straightforward target function formulation is to regard the difference between the minimal interclass distance and the maximal intraclass distance (see the graph at the top of figure 2).
The target function for the optimization problem can then be described as follows:

$$f_{\text{min max}} = \min_{j,k} \mathrm{DTW}(C_{\mathrm{opt}}, T_{j,k}) - \max_{i} \mathrm{DTW}(C_{\mathrm{opt}}, C_i). \tag{4}$$

This target function can easily be extended to multiclass applications (i.e. applications in which a time series $C_{\mathrm{opt}}$ has to be found that best separates the class $C$ from $K$ other classes $T_k$). In this case the minimal interclass distance is calculated from the score values of more than one class. To determine appropriate thresholds for optimized prototypes, we calculate the average of the minimal interclass distance and the maximal intraclass distance (denoted by the $\tau$ symbol in the figure). A drawback of this target function is that single outliers may distort the result to a high extent. On the other hand, computing time can be reduced by using early abandoning or pruning strategies as in (Ye and Keogh 2009).

Target Functions
For finding optimal time series prototypes we need a measure which specifies how well a found prototype separates template classes. This measure has to be formulated as a target function. In the work of (Ye and Keogh 2009) a target function for finding prototypes has been proposed which is based on information gain. An advantage of this target function is that search strategies utilizing this measure can be optimized in speed by applying early abandoning or pruning. However, this target function does not contain any information on how big the spread between the intraclass and interclass distributions of different classes is, which might serve as an indicator of robustness. This is exemplified in the graphs in the middle and at the bottom of figure 1, where the distributions in both graphs have the same information

Center Point Distance (CP dist): In contrast to the min max target function, the CP dist target function is robust to single outliers.
This is achieved by using the difference between the center points of the intraclass and interclass distances (see equations (5) and (6)) as the target function in equation (7). An illustration can be found in the graph in the middle of figure 2.

$$\mu_{cc} = \frac{1}{I} \sum_{i=1}^{I} \mathrm{DTW}(C_{\mathrm{opt}}, C_i) \tag{5}$$

$$\mu_{ct,k} = \frac{1}{J} \sum_{j=1}^{J} \mathrm{DTW}(C_{\mathrm{opt}}, T_{j,k}) \tag{6}$$

$$f_{\text{CP dist}} = \frac{1}{K} \sum_{k=1}^{K} \mu_{ct,k} - \mu_{cc} \tag{7}$$

Similar to the min max function, extension to multiclass problems can be achieved by calculating the interclass center point from the score value distribution of more than one class. However, classes yielding high DTW scores may outweigh the influence of classes yielding low DTW scores and distort the result. Incorporating the logarithm in the target function term for $K$ interclass center points attenuates this problem:

$$f_{\text{CP dist log}} = \frac{1}{K} \sum_{k=1}^{K} \log(\mu_{ct,k}) - \log(\mu_{cc}). \tag{8}$$

Thresholds for prototypes optimized with CP dist target functions can be found by taking the midpoint between the intraclass center point and the lowest interclass center point.

$$\sigma_{ct,k}^2 = \frac{1}{J-1} \sum_{j=1}^{J} \left(\mathrm{DTW}(C_{\mathrm{opt}}, T_{j,k}) - \mu_{ct,k}\right)^2. \tag{10}$$

The target function in equation (11) is a multiclass extension of the Kullback-Leibler divergence, based on the average of the Kullback-Leibler divergences of class $C$ and classes $T_k$ and assuming normally distributed class score values. Similar to the CP dist target function, influences of outlying classes can be attenuated by an alternative function incorporating the logarithm:

$$f_{\text{KL div log}} = \frac{1}{K} \sum_{k=1}^{K} \operatorname{sgn}(\mu_{ct,k} - \mu_{cc}) \cdot \log\left[\frac{1}{2}\left(\frac{\sigma_{ct,k}^2}{\sigma_{cc}^2} + \frac{\sigma_{cc}^2}{\sigma_{ct,k}^2} - 2\right) + \frac{1}{2}(\mu_{cc} - \mu_{ct,k})^2\left(\frac{1}{\sigma_{cc}^2} + \frac{1}{\sigma_{ct,k}^2}\right)\right]. \tag{12}$$

Brute Force Search (BFS)

Kullback-Leibler Divergence (KL div): Although the CP dist target function is a statistical measure for intraclass and interclass score value distributions, it does not take into account to which extent the distributions overlap.
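The min max and CP dist target functions (equations (4), (7) and (8)) and their threshold rules operate only on lists of DTW scores, so they can be sketched independently of the DTW implementation. In the following Python sketch, `intra` holds the scores of a prototype candidate against templates of its own class and `inter_by_class` holds one list of scores per other class; these names and the function names are hypothetical and chosen for illustration.

```python
import math
from statistics import mean


def f_min_max(intra, inter_by_class):
    """Equation (4): lowest interclass score minus highest intraclass score."""
    lowest_inter = min(min(scores) for scores in inter_by_class)
    return lowest_inter - max(intra)


def f_cp_dist(intra, inter_by_class, use_log=False):
    """Equation (7), or equation (8) when use_log is True.

    The logarithmic variant assumes strictly positive center points.
    """
    g = math.log if use_log else (lambda x: x)
    centers = [mean(scores) for scores in inter_by_class]
    return mean(g(mu) for mu in centers) - g(mean(intra))


def threshold_min_max(intra, inter_by_class):
    """Average of the minimal interclass and the maximal intraclass score."""
    lowest_inter = min(min(scores) for scores in inter_by_class)
    return (lowest_inter + max(intra)) / 2


def threshold_cp_dist(intra, inter_by_class):
    """Midpoint between intraclass center and lowest interclass center."""
    lowest_center = min(mean(scores) for scores in inter_by_class)
    return (mean(intra) + lowest_center) / 2


# Invented scores for one prototype candidate and two other classes.
intra = [1.0, 2.0, 3.0]
inter_by_class = [[6.0, 8.0], [10.0, 14.0]]
spread = f_min_max(intra, inter_by_class)        # 6.0 - 3.0 = 3.0
center_gap = f_cp_dist(intra, inter_by_class)    # (7.0 + 12.0) / 2 - 2.0 = 7.5
```

With these invented scores both threshold rules happen to give 4.5; in general they differ, because min max reacts to single extreme scores while CP dist uses class means.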
In order to incorporate a more sophisticated measure in our target function representation, we set up a target function based on the Kullback-Leibler divergence (Kullback and Leibler 1951), which is a measure of the dissimilarity of two probability distributions. Our assumption is that distance scores are normally distributed, which is indicated in the graph at the bottom of figure 2. Provided that there are sufficient score values for calculating the distribution parameters in equations (5), (6), (9) and (10), the target function can be expressed by equation (11).

$$\sigma_{cc}^2 = \frac{1}{I-1} \sum_{i=1}^{I} \left(\mathrm{DTW}(C_{\mathrm{opt}}, C_i) - \mu_{cc}\right)^2. \tag{9}$$

$$f_{\text{KL div}} = \frac{1}{K} \sum_{k=1}^{K} \operatorname{sgn}(\mu_{ct,k} - \mu_{cc}) \cdot \left[\frac{1}{2}\left(\frac{\sigma_{ct,k}^2}{\sigma_{cc}^2} + \frac{\sigma_{cc}^2}{\sigma_{ct,k}^2} - 2\right) + \frac{1}{2}(\mu_{cc} - \mu_{ct,k})^2\left(\frac{1}{\sigma_{cc}^2} + \frac{1}{\sigma_{ct,k}^2}\right)\right]. \tag{11}$$

The prototype threshold for the KL div target function can be determined by calculating the intersection point of the distribution functions between the center points of the distributions (in multiclass problems we propose to use the intersection point with the lowest score value).

Finding optimal prototypes from subseries of time series templates according to equation (3) requires that a target function is evaluated as a function of the optimization parameters $\alpha$ and $\beta$. A straightforward approach is to perform a brute force search (BFS) by searching the whole target space, that is, varying the parameters $\alpha$ and $\beta$ over all possible combinations for each instance $C_i$ of a set of training data. Obviously, BFS is time consuming, but it is guaranteed to find one prototype $C_{\mathrm{opt}}$ per class $C$ for which the target function used is maximal, and it therefore serves as a reference for other optimization methods.

Evolution Strategy (ES)
A drawback of the BFS is its long computing time, especially when applied to large sets of templates.
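The KL div target function and the exhaustive search over $\alpha$ and $\beta$ can be sketched together as follows. This is a minimal illustration under stated assumptions: `f_kl_div` follows equation (11) with sample variances and needs at least two scores per class with nonzero variance, and `brute_force_search` takes a hypothetical `target` callable that maps a candidate subseries to a target function score (in the setting of this paper that callable would compute DTW scores against the training templates first).

```python
import math
from statistics import mean, variance


def f_kl_div(intra, inter_by_class):
    """Equation (11): signed symmetric divergence of normal score models."""
    mu_cc, var_cc = mean(intra), variance(intra)  # sample variance, eq. (9)
    total = 0.0
    for scores in inter_by_class:
        mu_ct, var_ct = mean(scores), variance(scores)  # eq. (6) and (10)
        spread = 0.5 * (var_ct / var_cc + var_cc / var_ct - 2.0)
        shift = 0.5 * (mu_cc - mu_ct) ** 2 * (1.0 / var_cc + 1.0 / var_ct)
        sign = (mu_ct > mu_cc) - (mu_ct < mu_cc)  # sgn(mu_ct - mu_cc)
        total += sign * (spread + shift)
    return total / len(inter_by_class)


def brute_force_search(templates, target, min_len=2):
    """Evaluate `target` on every subseries with at least min_len samples."""
    best_score, best_proto = -math.inf, None
    for template in templates:
        n = len(template)
        for alpha in range(n):
            for beta in range(alpha + min_len, n + 1):
                candidate = template[alpha:beta]
                score = target(candidate)
                if score > best_score:
                    best_score, best_proto = score, candidate
    return best_proto, best_score


# Toy target (not a DTW score): prefer subseries whose sum is close to 5.
proto, score = brute_force_search([[1, 2, 3, 4]], lambda c: -abs(sum(c) - 5))
```

A template of length N contributes N(N-1)/2 candidates with at least two samples, which is why the number of target function evaluations grows quadratically with the template length.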
Since our target functions (excepting min max) do not allow pruning as used in (Ye and Keogh 2009), another approach to reduce computing time is to see the search as an optimization problem and to restrict the search space. Therefore, we used an Evolution Strategy (ES) for finding (nearly) optimal templates in reasonable time.

Evolutionary algorithms (EA) have become an important method for approaching optimization problems. The Evolution Strategy (ES) is the name of the evolutionary algorithm developed in the 1960s and 1970s (Schwefel 1995). It imitates the mutation, selection, and recombination processes in nature by using probability distribution functions for mutation, a deterministic selection (which chooses just the best set of offspring individuals for the next generation), and a broad repertoire of recombination operators. The plus selection strategy works as follows: μ parents generate λ offspring, and then both parents and offspring compose a temporary population from which the best individuals are selected to become parents in the next generation (or iteration). In our approach we use a (5+5)-ES (Schwefel 1995). That means that 5 parents generate 5 offspring and from the resulting 10 individuals the 5 best are taken.

Figure 3: Templates of the artificial dataset.

compared to original templates. As we can see from the figure, optimized templates yield higher recall rates (excepting the min max result), but precision dropped a little for the CP dist, CP dist log and KL div target functions. The explanation for this is that the results of classes 2 and 3 degrade the overall result because their core signals are subsignals of the other classes, which causes confusions and strongly influences precision. Figure 5 shows the result without classes 2 and 3, in which we can see that optimized templates improve the classification result for all target functions, excepting the CP dist target function.
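The plus selection scheme of a (5+5)-ES can be sketched as follows for individuals that encode the bounds $(\alpha, \beta)$ of a subseries. This is an illustrative sketch, not the configuration used in the experiments: the Gaussian integer mutation with a fixed step size `sigma`, the absence of recombination, and the names `evolve` and `fitness` are all assumptions. Indices are zero-based and inclusive here.

```python
import random


def evolve(fitness, n, mu=5, lam=5, iterations=20, sigma=2.0, seed=0):
    """(mu+lam)-ES over index pairs (alpha, beta) with 0 <= alpha < beta < n."""
    rng = random.Random(seed)

    def random_individual():
        alpha = rng.randrange(0, n - 1)
        beta = rng.randrange(alpha + 1, n)
        return (alpha, beta)

    def mutate(individual):
        # Shift both bounds by rounded Gaussian steps, then repair the pair.
        alpha = individual[0] + round(rng.gauss(0, sigma))
        beta = individual[1] + round(rng.gauss(0, sigma))
        alpha = max(0, min(alpha, n - 2))
        beta = max(alpha + 1, min(beta, n - 1))
        return (alpha, beta)

    # Each entry is (score, individual), so parents are never re-evaluated.
    parents = []
    for _ in range(mu):
        individual = random_individual()
        parents.append((fitness(individual), individual))

    for _ in range(iterations):
        offspring = []
        for _ in range(lam):
            child = mutate(rng.choice(parents)[1])
            offspring.append((fitness(child), child))
        # Plus selection: the best mu of parents and offspring survive.
        pool = sorted(parents + offspring, key=lambda p: p[0], reverse=True)
        parents = pool[:mu]

    return max(parents, key=lambda p: p[0])


# Toy fitness (not a DTW score): best subseries runs from index 10 to 20.
def toy_fitness(individual):
    return -abs(individual[0] - 10) - abs(individual[1] - 20)


best_score, best_bounds = evolve(toy_fitness, n=40)
```

Because parents compete with their offspring, the best score in the population can never decrease from one iteration to the next. Each iteration costs `lam` target function evaluations, plus `mu` evaluations for the initial population in this sketch.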
Experiments and Results
We tested our approach to finding prototypes for temporally and spatially distorted time series with an artificial dataset. On this set, brute force search and evolution strategy with different target functions have been applied. The similarity measure which we used in our tests is a DTW implementation that is based on squared Euclidean distance calculation and utilizes varying start and end points, range normalization (with a scaling limit depending on the highest variation in the training data) and a Sakoe-Chiba band constraint.

For a set of training data we used BFS and ES to search the target space in order to find the subseries yielding the highest target function score for each class. Original templates have been selected from class templates according to their best target function score. Besides representative prototypes, appropriate thresholds have been determined in this step as well (as explained in the target function section). In the BFS evaluation, optimized prototypes are compared with original prototypes by their threshold classification performance on a set of test data. In a further experiment, we evaluate the ability of ES to find target function maxima.

Artificial Dataset
For testing our methods we created an artificial dataset consisting of four classes with 30 instances of time series templates per class. All templates of the set are composed of class signals and additional noise before and after the class signal (see figure 3). Class signal parts are of a common shape but have randomly generated variations in length and height. Furthermore, as a difficulty for classification, class signals of classes 2 and 3 have the shape of subsignals of classes 1 and 4. In our experiments the dataset has been split up into three subsets of equal size (i.e. two sets for training and one set for testing) for threefold cross-validation.

Figure 4: Results of classification with original and BFS optimized templates.
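The threshold classification performance used in the evaluation can be illustrated with a short sketch that computes precision and recall for one prototype. It assumes that a test series is assigned to the prototype's class when its DTW score does not exceed the threshold; the function name `precision_recall` and the example numbers are invented for illustration and are not results from the experiments.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule 'score <= threshold means class member'.

    scores: DTW scores of the test series against one prototype.
    labels: True where the test series really belongs to the prototype's class.
    """
    predicted = [score <= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented scores: three true class members and two series of other classes.
scores = [1.0, 2.0, 5.0, 3.0, 6.0]
labels = [True, True, True, False, False]
precision, recall = precision_recall(scores, labels, threshold=4.0)
```

In this toy case one class member is missed and one foreign series is accepted, so both values are 2/3.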
Evolution Strategy Experiments
In our experiments we applied a (5+5)-strategy to find the target function maxima of class 1 for the min max and KL div log target functions. For each target function and cross-validation dataset, 5 test runs have been performed (i.e. 30 runs in total) with 20 iterations. Figure 6 contains a statistical plot of the best individuals at

Brute Force Search Experiments
Figure 4 shows the summarized classification result over all classes (and cross-validation datasets) of BFS optimization

segmented time series data, which optimally represent time series classes, we introduced target functions measuring the degree of separability. In our experiments we have shown that brute force search optimization is able to improve the time series classification result. Furthermore, the application of an evolution strategy yields comparable target function scores while saving computing time.

References
Abdulla, W.; Chow, D.; and Sin, G. 2003. Cross-words reference template for DTW-based speech recognition systems. TENCON 2003. Conference on Convergent Technologies for Asia-Pacific Region 4:1576–1579.
Corradini, A. 2001. Dynamic time warping for off-line recognition of a small gesture vocabulary. 82.
Ferreira, P.; Azevedo, P.; Silva, C.; and Brito, R. 2006. Mining approximate motifs in time series. Discovery Science 89–101.
Hartmann, B.; Schauer, C.; and Link, N. 2009. Worker behavior interpretation for flexible production. CESSE 2009: International Conference on Computer, Electrical, and Systems Science, and Engineering 58:494–502.
Itakura, F. 1975. Minimum prediction residual principle applied to speech recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on 23(1):67–72.
Junker, H.; Amft, O.; Lukowicz, P.; and Tröster, G. 2008. Gesture spotting with body-worn inertial sensors to detect user activities. Pattern Recognition 41(6):2010–2024.
Ko, M. H.; West, G.; Venkatesh, S.; and Kumar, M. 2008.
Using dynamic time warping for online temporal fusion in multisensor systems. Inf. Fusion 9(3):370–388.
Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22(1):79–86.
Mueen, A.; Keogh, E. J.; Zhu, Q.; Cash, S.; and Westover, M. B. 2009. Exact discovery of time series motifs. SDM 473–484.
Sakoe, H., and Chiba, S. 1978. Dynamic programming algorithm optimization for spoken word recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on 26(1):43–49.
Schwefel, H.-P. 1995. Evolution and Optimum Seeking. Sixth-Generation Computer Technology. New York: Wiley Interscience.
Urban, M.; Bajcsy, P.; Kooper, R.; and Lementec, J.-C. 2004. Recognition of arm gestures using multiple orientation sensors: repeatability assessment. Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems 553–558.
Ye, L., and Keogh, E. 2009. Time series shapelets: a new primitive for data mining. KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining 947–956.

Figure 5: Results of classification with original and BFS optimized templates (without class 2 and class 3).

each iteration. The figure shows the maximal and minimal target scores (normalized to the target function range) as well as the mean target score of all test runs. Compared to BFS optimization, which takes 897 target function calculations for templates of class C1 on average, the (5+5)-ES takes 100 target function calculations for 20 iterations or 50 target function calculations for 10 iterations, respectively. In this result, prototypes yielding on average more than 95% of the maximal target function score could be found by evaluating less than 5% of the target space covered by BFS.

Figure 6: ES: Best individuals in test runs.

Conclusion
In this paper we presented our approach to prototype optimization for time series.
In order to tackle the problem of spatial and temporal distortions, we utilized DTW together with range normalization as a similarity measure between time series. However, our general approach for optimization is not limited to this measure. Since our optimization approach is based on the idea of finding subseries in roughly