Prototype Optimization for Temporarily and Spatially Distorted Time Series

Bastian Hartmann, Ingo Schwab, Norbert Link
Institute of Applied Research
University of Applied Sciences Karlsruhe
Moltkestrasse 30
76133 Karlsruhe, Germany
Abstract
An important issue in time series classification is finding representative prototypes. Especially for roughly segmented time series with spatial distortions, such as human gestures, it is difficult to find templates that optimally represent signal classes. In this paper we present an approach for finding optimal time series prototypes among the subseries of class templates. Our optimization approach is based on separability measures for prototype candidates and utilizes (but is not limited to) DTW in order to tackle the problem of spatial and temporal distortions. The search for prototypes in the target space is performed by means of a brute force search as well as an evolution strategy. In our experiments with an artificial dataset we show that brute force search optimization is able to improve the time series classification result and that the application of an evolution strategy yields comparable target function scores while reducing computing time.

Introduction
Time series analysis is a topic of high interest in many fields of research. In our work on interpreting human behavior for integrating human workers into software-controlled manufacturing (Hartmann, Schauer, and Link 2009), we deal with the recognition of human gestures and gesture-like (low-level) activities. Because temporal information is an important feature in gesture recognition, we apply time series analysis techniques to signals from wearable sensors, such as accelerometers or angular rate sensors. However, a problem in human gesture recognition is that gestures show high variations in their motion patterns. Thus, besides HMMs (Junker et al. 2008), a suitable method used in this field of research is dynamic time warping (DTW) (Ko et al. 2008).

A general question in DTW classification is how to choose representative class templates, which are also called prototypes or motifs. In many DTW applications, prototypes are chosen from a set of sequences that have been recorded with exactly known start and end points or are segmented from recorded time series (Urban et al. 2004; Corradini 2001; Ko et al. 2008). Since accurate segmentation of time series (that is, choosing a signal time interval which optimally represents a signal class) for generating prototypes is complicated, we propose a method for automatically finding optimal prototypes in roughly segmented data. Our approach is not limited to DTW, but is transferable to other time series analysis problems.

The problem that we focus on can be formulated as follows: We have roughly segmented signals (e.g. extracted by automated labelling) as representatives of different classes, each with a class-relevant signal part and redundant information that is not related to the class. The goal is to find prototypes in the form of subsignals of these class templates which optimally represent their classes. One prototype and a corresponding threshold are found for each class, improving the quality of threshold classification compared to the original class templates. In our approach we utilize two optimization methods to find prototypes: a brute force search and a time-saving evolution strategy.

In recent years, work on finding prototypes or motifs has been published in different fields of research. However, many methods assume that the underlying distance measure is a metric or fulfils the triangle inequality (Mueen et al. 2009), (Ferreira et al. 2006), which is not the case for DTW. (Ye and Keogh 2009) search for optimal prototypes in subsequences of time series signals. Our approach differs from this work in that we use other target functions and other optimization methods.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Dynamic Time Warping
In our work we focus on the recognition of human gestures and gesture-like activities, where, due to motion variations, accelerations and angular rates appear as signals of varying time scale. The ability to measure distances between dynamically shifted time series therefore makes DTW appropriate for comparing signals from human activities.

DTW (Sakoe and Chiba 1978) is based on a dynamic programming technique which has its origins in speech processing. The basic idea of DTW is to calculate the similarity of two time series $C = (c_1 \dots c_N)$, $T = (t_1 \dots t_M)$ by summing distances $d(c_n, t_m)$ of pairs of sampling points $c_n$ and $t_m$ in such a way that the result is minimized. The pair distance measure $d(\cdot)$ is usually the Euclidean distance; however, any positive distance measure can be used. DTW can thus be formulated as an optimization problem of finding the warping path $W$ that minimizes

$$ DTW(C, T) = \arg\min_{W} \sum_{l=1}^{L} d\left(c_{w_c(l)}, t_{w_t(l)}\right), \qquad (1) $$

in which $w_c(l)$ and $w_t(l)$ are the elements of the warping path $W = [(w_c(1), w_t(1)) \dots (w_c(L), w_t(L))]$ containing index pairs of $C$ and $T$. Thus, the warping path contains the solution of sampling point pairs with minimal overall distance. Because arbitrary sequential assignments of warping path elements can create warping paths which are not meaningful in practice, possible paths have to be restricted. Typical path restrictions are:
• Monotonicity: Successive elements in the warping path may not contain preceding indices: $w_c(l) \ge w_c(l-1)$ and $w_t(l) \ge w_t(l-1)$.
• Continuity: Successive elements have a limited increment: $w_c(l) - w_c(l-1) \le 1$ and $w_t(l) - w_t(l-1) \le 1$.
• Boundaries: Paths start at the first indices of the time series and end at their last indices: $w_c(1) = 1$ and $w_t(1) = 1$ as well as $w_c(L) = N$ and $w_t(L) = M$.

Besides this common representation, different extensions of DTW are possible. Some DTW algorithms incorporate warping path weights, which normalize DTW results calculated over different warping path lengths. A method to avoid unrealistic warping paths (e.g. appearing in extremely shifted signals) and to reduce computing time is to prohibit extreme combinations of index pairs $w_c(l)$ and $w_t(l)$ in warping path elements. This can be achieved by using constraints such as the Sakoe-Chiba band (Sakoe and Chiba 1978) or the Itakura parallelogram (Itakura 1975). In (Ko et al. 2008), variations in the start and end points of the time series $T$ are allowed. This extension is necessary for detecting subsignals in long time series or for online DTW processing.

Since our work is focused on recognizing human activities, it often occurs that time series of the same class are not only dynamically shifted in time but also show variations in their amplitudes. To address this problem we apply a range normalization that maps the two time series to be compared to the same range of values. A one-dimensional time series $A = (a_1 \dots a_N)$ or the components of a multi-dimensional time series are normalized via

$$ A_{norm} = \frac{A - \min(A)}{\max(A) - \min(A)} \qquad (2) $$

to the [0, 1] range.

In our work we utilize the DTW distance measure for the classification of human activities. This requires that prototypes representing a class of time series are found. A typical method (Ko et al. 2008) for selecting a prototype template out of a set of class template candidates is to choose the template which yields the best recognition rate or the one which has the smallest average DTW distance to all other class templates. A more sophisticated method is to create an averaged template from a number of class templates by aligning them and taking the average ((Abdulla, Chow, and Sin 2003), (Ko et al. 2008)). Depending on the selected template, an appropriate threshold has to be chosen for a DTW-based classification method. However, as with the choice of representative class templates, there is no general method for determining such a threshold.

An Approach for Finding Optimal Time Series Prototypes
An important issue in time series classification problems is how to find representative prototypes for time series classes. This is of particular interest because representative prototypes improve the quality of classification. Therefore, we search for optimal time series prototypes in the subseries of time series templates. Since we utilize DTW threshold classification in our application, we want to find one prototype representing each class and a corresponding threshold.

To obtain a good separation between a given class of time series and other classes, it is desirable that a template has a small distance to time series of its own class (intraclass distance) and a high distance to time series of other classes (interclass distance). The graph at the top of Figure 1 illustrates the interclass and intraclass distance distributions of DTW score values of a template belonging to class C. For this template the distributions show a high overlap. Therefore, it would be desirable to find a prototype for which the distributions appear more separated (as shown in the graphs in the middle or at the bottom of the figure).

Figure 1: Distance distributions of DTW score values.

As mentioned in the previous section, there is no general method for prototype selection and for determining appropriate thresholds. Furthermore, in practical applications the generation of prototypes requires well-segmented signals (e.g. in the form of templates cut out manually by an expert). However, in practice it is often not guaranteed that long time series signals containing candidates for prototypes are accurately segmented in the sense of yielding good representatives.

For these reasons, we search for representative prototypes in the subseries of roughly segmented time series. The subseries that we refer to as optimal should have low (DTW) distance to templates of its own class and high (DTW) distance to templates of other classes.

Suppose we have a dataset which consists of time series template instances $C_i = (c_{i,1} \dots c_{i,N_i})$; $i = 1 \dots I$ of class $C$ and instances $T_j = (t_{j,1} \dots t_{j,M_j})$; $j = 1 \dots J$ of class $T$ (the extension to the multi-class case is straightforward). In order to separate $C$ from $T$, our problem is to determine one representative prototype $C_{opt}$ as a subseries of an instance of $C$ which has low DTW distance to templates of its own class and high DTW distance to templates of class $T$. Thus, $C_{opt}$ can be described as

$$ C_{opt} = (c_{i,\alpha} \dots c_{i,\beta}); \quad 1 \le \alpha < \beta \le N, \qquad (3) $$

which is the subseries from sampling point $\alpha$ to sampling point $\beta$ of the instance $C_i$ of class $C$.

Optimally segmented time series prototypes have the following advantages:
• Optimal prototypes separate a class of time series from another time series class better than the original class templates. Thus, a better classification rate is achieved.
• Whereas practical applications usually require accurately segmented class templates, we propose a method that takes roughly segmented signals (e.g. from an automated labelling method) as input and finds optimal prototypes among their subseries.
• Because the prototypes found are subsequences of the original templates, their length is equal to or less than the length of the original templates. Therefore, computing time can be reduced, since it depends on the length of the time series.

It should be noted that the presented approach is focused on the DTW similarity measure, but it is not limited to DTW and can also be transferred to other time series similarity measures.

In the following subsections we explain how we optimize templates. This includes the definition of target functions as measures for the representativeness of prototypes as well as optimization methods for searching the target space.

Target Functions
For finding optimal time series prototypes we need a measure which specifies how well a found prototype separates template classes. This measure has to be formulated as a target function. In the work of (Ye and Keogh 2009) a target function for finding prototypes has been proposed which is based on information gain. An advantage of this target function is that search strategies utilizing this measure can be sped up by applying early abandoning or pruning. However, this target function does not contain any information on how large the spread between the intraclass and interclass distributions of different classes is, which might serve as an indicator of robustness. This is exemplified in the graphs in the middle and at the bottom of Figure 1, where the distributions in both graphs have the same information gain. However, the result in the graph at the bottom appears to be better separated and more robust for the application of a threshold classification.

The target functions which we study in our work focus on achieving separability by maximizing the spread between the intraclass and interclass DTW scores of the classes. Because there is no general approach to this problem, we have formulated different target functions, which are described in the following. Note that all proposed target functions are maximization functions, that is, a maximum has to be found by varying the target function parameters (i.e. the parameters of the prototype $C_{opt}$).

Moreover, since we assume that all time series templates to which the target functions are applied consist of a class-relevant signal part and redundant information before and after the signal, flexible start and end points always have to be used for DTW comparisons. This means that $DTW(C_{opt}, T_j)$ always measures the distance from a prototype candidate series $C_{opt}$ to the most similar subseries of template $T_j$.

Figure 2: Illustration of target functions.

Minimal Interclass to Maximal Intraclass Distance (min max): A straightforward target function formulation is to regard the difference between the minimal interclass distance and the maximal intraclass distance (see the graph at the top of Figure 2). The target function for the optimization problem can then be described as follows:

$$ f_{min\,max} = \min_{j,k}\left(DTW(C_{opt}, T_{j,k})\right) - \max_{i}\left(DTW(C_{opt}, C_i)\right). \qquad (4) $$

This target function can easily be extended to multiclass applications (i.e. an application in which a time series $C_{opt}$ has to be found that best separates the class $C$ from $K$ other classes $T_k$). In this case the minimal interclass distance is calculated from the score values of more than one class. To determine appropriate thresholds for optimized prototypes, we calculate the average of the minimal interclass distance and the maximal intraclass distance (denoted by the τ symbol in the figure).

A drawback of this target function is that single outliers may distort the result to a high extent. On the other hand, computing time can be reduced by using early abandoning or pruning strategies as in (Ye and Keogh 2009).
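
To make equation (4) and the threshold rule more concrete, the following is a minimal sketch in Python. It is not the implementation used in the experiments: the function names (`range_normalize`, `subsequence_dtw`, `min_max_target`) are hypothetical, the DTW variant is a plain dynamic program with free start and end points in the template and squared point distances, and the Sakoe-Chiba band and the scaling limit of the normalization are left out for brevity.

```python
import numpy as np

def range_normalize(a):
    """Equation (2): map a 1-D series to the [0, 1] range."""
    a = np.asarray(a, dtype=float)
    span = a.max() - a.min()
    # A constant series has no range; map it to zeros (an assumption of this sketch).
    return (a - a.min()) / span if span > 0 else np.zeros_like(a)

def subsequence_dtw(c, t):
    """DTW distance from candidate c to its most similar subseries of t."""
    c = np.asarray(c, dtype=float)
    t = np.asarray(t, dtype=float)
    n, m = len(c), len(t)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0  # the path may start at any sampling point of t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (c[i - 1] - t[j - 1]) ** 2
            # Monotonicity and continuity: only three predecessor cells are allowed.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, 1:].min()  # the path may end at any sampling point of t

def min_max_target(candidate, own_class, other_classes):
    """Equation (4) and the threshold tau for one prototype candidate."""
    intra = [subsequence_dtw(candidate, c) for c in own_class]
    inter = [subsequence_dtw(candidate, t) for cls in other_classes for t in cls]
    score = min(inter) - max(intra)
    tau = 0.5 * (min(inter) + max(intra))
    return score, tau

# Toy data (made up for illustration): one class C and two other classes.
own = [[0, 0, 1, 2, 1, 0], [0, 1, 2, 2, 1, 0, 0]]
others = [[[0, 0, 0, 0, 0, 0]], [[2, 2, 2, 2, 2]]]
score, tau = min_max_target([1, 2, 1], own, others)
print(score, tau)
```

In this sketch `range_normalize` is kept separate so that it can be applied to both series before calling `subsequence_dtw`; the toy call above skips it to keep the numbers easy to follow.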
Center Point Distance (CP dist): In contrast to the min max target function, the CP dist target function is robust to single outliers. This is achieved by using the difference between the center points of the intraclass and interclass distances (see equations (5) and (6)) as the target function in equation (7). An illustration can be found in the graph in the middle of Figure 2.

$$ \mu_{cc} = \frac{1}{I} \sum_{i=1}^{I} DTW(C_{opt}, C_i) \qquad (5) $$

$$ \mu_{ct,k} = \frac{1}{J} \sum_{j=1}^{J} DTW(C_{opt}, T_{j,k}) \qquad (6) $$

$$ f_{CP\,dist} = \frac{1}{K} \sum_{k=1}^{K} \mu_{ct,k} - \mu_{cc} \qquad (7) $$

Similar to the min max function, the extension to multiclass problems can be achieved by calculating the interclass center point from the score value distribution of more than one class. However, classes yielding high DTW scores may outweigh the influence of classes yielding low DTW scores and distort the result. Incorporating the logarithm in the target function term for the K interclass center points attenuates this problem:

$$ f_{CP\,dist\,log} = \frac{1}{K} \sum_{k=1}^{K} \log(\mu_{ct,k}) - \log(\mu_{cc}). \qquad (8) $$

Thresholds for prototypes optimized with the CP dist target functions can be found by taking the midpoint between the intraclass center point and the lowest interclass center point.

Kullback-Leibler Divergence (KL div): Although the CP dist target function is a statistical measure for intraclass and interclass score value distributions, it does not take into account to which extent the distributions overlap. In order to incorporate a more sophisticated measure in our target function representation, we set up a target function based on the Kullback-Leibler divergence (Kullback and Leibler 1951), which is a measure for the dissimilarity of two probability distributions. Our assumption is that distance scores follow normal distributions, as indicated in the graph at the bottom of Figure 2. Provided that there are sufficient score values for calculating the distribution parameters in equations (5), (6), (9) and (10), the target function can be expressed by equation (11).

$$ \sigma_{cc}^2 = \frac{1}{I-1} \sum_{i=1}^{I} \left(DTW(C_{opt}, C_i) - \mu_{cc}\right)^2. \qquad (9) $$

$$ \sigma_{ct,k}^2 = \frac{1}{J-1} \sum_{j=1}^{J} \left(DTW(C_{opt}, T_{j,k}) - \mu_{ct,k}\right)^2. \qquad (10) $$

$$ f_{KL\,div} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{sgn}(\mu_{ct,k} - \mu_{cc}) \cdot \left[ \frac{1}{2}\left(\frac{\sigma_{ct,k}^2}{\sigma_{cc}^2} + \frac{\sigma_{cc}^2}{\sigma_{ct,k}^2} - 2\right) + \frac{1}{2}\left(\mu_{cc} - \mu_{ct,k}\right)^2 \left(\frac{1}{\sigma_{cc}^2} + \frac{1}{\sigma_{ct,k}^2}\right) \right]. \qquad (11) $$

The target function in equation (11) is a multiclass extension of the Kullback-Leibler divergence, based on the average of the Kullback-Leibler divergences of class $C$ and the classes $T_k$ and assuming normally distributed class score values. Similar to the CP dist target function, the influence of outlying classes can be attenuated by an alternative function incorporating the logarithm:

$$ f_{KL\,div\,log} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{sgn}(\mu_{ct,k} - \mu_{cc}) \cdot \log\left[ \frac{1}{2}\left(\frac{\sigma_{ct,k}^2}{\sigma_{cc}^2} + \frac{\sigma_{cc}^2}{\sigma_{ct,k}^2} - 2\right) + \frac{1}{2}\left(\mu_{cc} - \mu_{ct,k}\right)^2 \left(\frac{1}{\sigma_{cc}^2} + \frac{1}{\sigma_{ct,k}^2}\right) \right]. \qquad (12) $$

The prototype threshold for the KL div target function can be determined by calculating the intersection point of the distribution functions between the center points of the distributions (in multiclass problems we propose to use the intersection point with the lowest score value).

Brute Force Search (BFS)
Finding optimal prototypes from subseries of time series templates according to equation (3) requires that a target function is evaluated as a function of the optimization parameters α and β. A straightforward approach is to perform a brute force search (BFS) over the whole target space, that is, to vary the parameters α and β over all possible combinations for each instance $C_i$ of a set of training data. Obviously, BFS is time-consuming, but it is guaranteed to find one prototype $C_{opt}$ per class $C$ for which the chosen target function is maximal, and it therefore serves as a reference for other optimization methods.
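
The search over α and β can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: `subsequence_dtw`, `cp_dist` and `brute_force_search` are hypothetical names, indices are 0-based, only a single other class is considered, and the DTW variant omits range normalization and the Sakoe-Chiba band.

```python
import numpy as np

def subsequence_dtw(c, t):
    """DTW distance from c to its most similar subseries of t (free start/end in t)."""
    n, m = len(c), len(t)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, :] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (c[i - 1] - t[j - 1]) ** 2
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, 1:].min()

def cp_dist(candidate, own_class, other_class):
    """CP dist for one other class: interclass minus intraclass center point."""
    mu_cc = np.mean([subsequence_dtw(candidate, c) for c in own_class])
    mu_ct = np.mean([subsequence_dtw(candidate, t) for t in other_class])
    return mu_ct - mu_cc

def brute_force_search(own_class, other_class, target=cp_dist):
    """Evaluate the target for every subseries (alpha < beta) of every instance."""
    best = (-np.inf, None, None, None)  # score, instance index, alpha, beta
    for i, inst in enumerate(own_class):
        for alpha in range(len(inst) - 1):
            for beta in range(alpha + 1, len(inst)):
                score = target(inst[alpha:beta + 1], own_class, other_class)
                if score > best[0]:
                    best = (score, i, alpha, beta)
    return best

# Toy data (made up for illustration).
own = [[0, 0, 1, 2, 1, 0], [0, 1, 2, 1, 0, 0]]
other = [[0, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
score, i, alpha, beta = brute_force_search(own, other)
print(score, own[i][alpha:beta + 1])
```

For an instance of length N the inner loops visit N(N-1)/2 subseries, and each visit runs one DTW comparison per template, which is why the number of target function evaluations grows quickly with template length and training set size.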
Evolution Strategy (ES)
A drawback of BFS is its high computing time, especially when applied to large sets of templates. Since our target functions (excepting min max) do not allow pruning as used in (Ye and Keogh 2009), another approach to reduce computing time is to treat the search as an optimization problem and to restrict the search space. Therefore, we used an Evolution Strategy (ES) for finding (nearly) optimal templates in reasonable time.

Evolutionary algorithms (EA) have become an important method for approaching optimization problems. The Evolution Strategy (ES) is an evolutionary algorithm developed in the 1960s and 1970s (Schwefel 1995). It imitates the mutation, selection, and recombination processes found in nature by using probability distribution functions for mutation, a deterministic selection (which chooses just the best set of offspring individuals for the next generation), and a broad repertoire of recombination operators. The plus selection strategy works as follows: μ parents generate λ offspring, and then both parents and offspring compose a temporary population from which the best individuals are selected to become parents in the next generation (or iteration). In our approach we use a (5+5)-ES (Schwefel 1995). That means that 5 parents generate 5 offspring, and from the resulting 10 individuals the 5 best are taken.
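
The plus selection scheme can be illustrated with a small sketch that searches over index pairs (α, β). Several details are assumptions made only for this illustration and are not taken from the experiments: the function name `plus_es`, the Gaussian integer mutation with a fixed step size `sigma`, the random initial parents, the omission of recombination, and the toy target function that stands in for an expensive DTW-based target.

```python
import random

def plus_es(target, length, mu=5, lam=5, iterations=20, sigma=2.0, seed=0):
    """(mu + lam)-ES over pairs (alpha, beta) with 0 <= alpha < beta < length."""
    rng = random.Random(seed)

    def random_pair():
        a, b = sorted(rng.sample(range(length), 2))
        return a, b

    def mutate(pair):
        # Gaussian steps rounded to integers, then clipped to a valid subseries.
        a = pair[0] + round(rng.gauss(0, sigma))
        b = pair[1] + round(rng.gauss(0, sigma))
        a = min(max(a, 0), length - 2)
        b = min(max(b, a + 1), length - 1)
        return a, b

    # The initial parents are evaluated once, so the total number of target
    # evaluations in this sketch is mu + lam * iterations.
    scored = [(target(*p), p) for p in (random_pair() for _ in range(mu))]
    for _ in range(iterations):
        offspring = [mutate(rng.choice(scored)[1]) for _ in range(lam)]
        scored += [(target(*o), o) for o in offspring]
        # Plus selection: parents and offspring compete, the best mu survive.
        scored = sorted(scored, key=lambda s: s[0], reverse=True)[:mu]
    return scored[0]

def toy_target(alpha, beta):
    """Stand-in target with its maximum at (3, 7); a real run would use DTW scores."""
    return -((alpha - 3) ** 2 + (beta - 7) ** 2)

best_score, (alpha, beta) = plus_es(toy_target, length=30)
print(best_score, alpha, beta)
```

Because parents survive unless better offspring appear, the best score in the population never decreases from one iteration to the next, which is the property that makes the plus strategy attractive when each target evaluation is costly.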
Experiments and Results
We tested our approach for finding prototypes for temporally and spatially distorted time series with an artificial dataset. On this set, brute force search and evolution strategy with different target functions have been applied. The similarity measure used in our tests is a DTW implementation that is based on squared Euclidean distance calculation and utilizes varying start and end points, range normalization (with a scaling limit depending on the highest variation in the training data) and a Sakoe-Chiba band constraint.

For a set of training data we used BFS and ES to search the target space in order to find the subseries yielding the highest target function score for each class. Original templates have been selected from the class templates according to their best target function score. Besides representative prototypes, appropriate thresholds have been determined in this step as well (as explained in the target function section). In the BFS evaluation, optimized prototypes are compared with original prototypes by their threshold classification performance on a set of test data. In a further experiment, we evaluate the ability of ES to find target function maxima.

Artificial Dataset
For testing our methods we created an artificial dataset consisting of four classes with 30 instances of time series templates per class. All templates of the set are composed of class signals and additional noise before and after the class signal (see Figure 3). Class signal parts are of a common shape but have randomly generated variations in length and height. Furthermore, as a difficulty for classification, the class signals of classes 2 and 3 have the shape of subsignals of classes 1 and 4. In our experiments the dataset has been split up into three subsets of equal size (i.e. two sets for training and one set for testing) for threefold cross-validation.

Figure 3: Templates of the artificial dataset.

Brute Force Search Experiments
Figure 4 shows the summarized classification result over all classes (and cross-validation datasets) of BFS optimization compared to original templates. As we can see from the figure, optimized templates yield higher recall rates (excepting the min max result), but precision dropped a little for the CP dist, CP dist log and KL div target functions. The explanation for this is that the results of classes 2 and 3 degrade the overall result because their core signals are subsignals of the other classes, which causes confusions and strongly affects precision.

Figure 4: Results of classification with original and BFS optimized templates.

Figure 5 shows the result without classes 2 and 3, in which we can see that optimized templates improve the classification result for all target functions, excepting the CP dist target function.

Figure 5: Results of classification with original and BFS optimized templates (without class 2 and class 3).

Evolution Strategy Experiments
In our experiments we applied a (5+5)-strategy to find the target function maxima of class 1 for the min max and KL div log target functions. For each target function and cross-validation dataset, 5 test runs have been performed (i.e. 30 runs in total) with 20 iterations.

Figure 6 contains a statistical plot of the best individuals at each iteration. The figure shows the maximal and minimal target scores (normalized to the target function range) as well as the mean target score of all test runs. Compared to BFS optimization, which takes 897 target function calculations for templates of class C1 on average, (5+5)-ES takes 100 target function calculations for 20 iterations or 50 target function calculations for 10 iterations, respectively. In this result, prototypes yielding on average more than 95% of the maximal target function score could be found while evaluating less than 5% of the target space that BFS searches.

Figure 6: ES: Best individuals in test runs.

Conclusion
In this paper we presented our approach to prototype optimization for time series. In order to tackle the problem of spatial and temporal distortions, we utilized DTW together with range normalization as a similarity measure between time series. However, our general approach for optimization is not limited to this measure. Since our optimization approach is based on the idea of finding subseries in roughly segmented time series data which optimally represent time series classes, we introduced target functions measuring the degree of separability. In our experiments we have shown that brute force search optimization is able to improve the time series classification result. Furthermore, the application of an evolution strategy yields comparable target function scores while saving computing time.

References
Abdulla, W.; Chow, D.; and Sin, G. 2003. Cross-words reference template for DTW-based speech recognition systems. TENCON 2003. Conference on Convergent Technologies for Asia-Pacific Region 4:1576–1579.
Corradini, A. 2001. Dynamic time warping for off-line recognition of a small gesture vocabulary. 82.
Ferreira, P.; Azevedo, P.; Silva, C.; and Brito, R. 2006. Mining approximate motifs in time series. Discovery Science 89–101.
Hartmann, B.; Schauer, C.; and Link, N. 2009. Worker behavior interpretation for flexible production. CESSE 2009: International Conference on Computer, Electrical, and Systems Science, and Engineering 58:494–502.
Itakura, F. 1975. Minimum prediction residual principle applied to speech recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on 23(1):67–72.
Junker, H.; Amft, O.; Lukowicz, P.; and Tröster, G. 2008. Gesture spotting with body-worn inertial sensors to detect user activities. Pattern Recognition 41(6):2010–2024.
Ko, M. H.; West, G.; Venkatesh, S.; and Kumar, M. 2008. Using dynamic time warping for online temporal fusion in multisensor systems. Inf. Fusion 9(3):370–388.
Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22(1):79–86.
Mueen, A.; Keogh, E. J.; Zhu, Q.; Cash, S.; and Westover, M. B. 2009. Exact discovery of time series motifs. SDM 473–484.
Sakoe, H., and Chiba, S. 1978. Dynamic programming algorithm optimization for spoken word recognition. Acoustics, Speech and Signal Processing, IEEE Transactions on 26(1):43–49.
Schwefel, H.-P. 1995. Evolution and Optimum Seeking. Sixth-Generation Computer Technology. New York: Wiley Interscience.
Urban, M.; Bajcsy, P.; Kooper, R.; and Lementec, J.-C. 2004. Recognition of arm gestures using multiple orientation sensors: repeatability assessment. Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems 553–558.
Ye, L., and Keogh, E. 2009. Time series shapelets: a new primitive for data mining. KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining 947–956.