Time Series Discretization Using Evolutionary Programming

Abstract. In this work, we present a novel algorithm for time series discretization. Our approach optimizes the word size and the alphabet jointly, as a single parameter. Using evolutionary programming, the search for a good discretization scheme is guided by a cost function that considers three criteria: the entropy regarding the classification, the complexity measured as the number of different strings needed to represent the complete data set, and the compression rate assessed as the length of the discrete representation. Our proposal is compared with some of the most representative algorithms found in the specialized literature, tested on a well-known benchmark of time series data sets. The statistical analysis of the classification accuracy shows that the overall performance of our algorithm is highly competitive.

Keywords: Time series, Discretization, Evolutionary Algorithms, Optimization

1 Introduction

Many real-world applications related to information processing generate temporal data [12]. In most cases, this kind of data requires huge data storage. Therefore, it is desirable to compress this information while maintaining its most important features. Many approaches focus mainly on data compression; however, they do not rely on significant information measured with entropy [11, 13]. In those approaches, the dimensionality reduction is given by the transformation of a time series of length N into a data set of n coefficients, where n < N [7].

The two main characteristics of a time series are the number of segments (word size) and the number of values (alphabet) required to represent its continuous values. Fig. 1 shows a time series with a grid that represents the cut points for word size and alphabet. Most discretization algorithms require the word size and alphabet as input parameters. However, in real-world applications it might be very difficult to know their best values in advance. Hence, their definition requires a careful analysis of the time series data set [9, 13].

Among the approaches proposed to deal with data discretization we can find those which work with one time series at a time, such as the one proposed by Mörchen [14]. His algorithm is centered on the search for persistent states (the most frequent values) in time series. However, such states are not common in many real-world time series applications. Another representative approach was proposed by Dimitrova [3], where a multi-connected graph representation for time series was employed. The links between nodes carry Euclidean distance values which, under this representation, are used to eliminate links in order to obtain a path that defines the discretization scheme. Nonetheless, this way of defining the discretization process could be a disadvantage because not all the time series in a data set will necessarily share the same discretization scheme.

Fig. 1: Word size and alphabet representation. In this case the time series has a word size = 9 and an alphabet = 5 (values A and E do not appear in the time series).

Keogh [13] proposed the Symbolic Aggregate Approximation (SAX) approach. This algorithm is based on Piecewise Aggregate Approximation (PAA), a dimensionality reduction algorithm [8]. After PAA is applied, the values are transformed into categorical values through a probability distribution function. The algorithm requires the alphabet and the word size as inputs. This is SAX's main disadvantage, because it is not clear how to define them from a given time series data set.
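To make the role of these two parameters concrete, the following Python sketch (our illustration, not the reference SAX implementation) applies equal-length PAA and then maps each segment mean to a letter using SAX's equiprobable breakpoints under a standard normal distribution; the function names paa and sax_like are ours.

import numpy as np
from statistics import NormalDist

def paa(series, word_size):
    # Split into word_size (near-)equal segments and average each one.
    segments = np.array_split(np.asarray(series, dtype=float), word_size)
    return np.array([seg.mean() for seg in segments])

def sax_like(series, word_size, alphabet_size):
    # Z-normalize, reduce with PAA, then map each segment mean to a letter
    # using equiprobable breakpoints under a standard normal distribution.
    s = np.asarray(series, dtype=float)
    s = (s - s.mean()) / s.std()
    cuts = [NormalDist().inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]
    return ''.join(chr(ord('a') + k) for k in np.searchsorted(cuts, paa(s, word_size)))

# e.g., sax_like(ts, word_size=9, alphabet_size=5) yields a 9-letter string,
# matching the grid of Fig. 1.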
There are other approaches based on search algorithms. García-López [2] proposed EBLA2, which performs a greedy search for entropy minimization in order to find the word size and alphabet automatically. The main disadvantage of this approach is that the greedy search is prone to getting trapped in local optima. Therefore, in [6] simulated annealing was used as the search algorithm and the results improved. Finally, in [1], a genetic algorithm was used to guide the search; however, the solution was incomplete in the sense that the algorithm considered the minimization of the alphabet as a first stage and attempted to reduce the word size in a second stage. In this way, some solutions could not be generated.

In this work, we present a new approach in which both the word size and the alphabet are optimized at the same time. Due to its simplicity with respect to other evolutionary algorithms, evolutionary programming (EP) is adopted as the search algorithm (e.g., no recombination or parent selection mechanisms are performed, and only mutation and replacement need to be designed). Furthermore, the number of strings and the length of the discretized series are optimized as well.

The contents of this paper are organized as follows: Section 2 details the proposed algorithm. After that, Section 3 presents the obtained results and a comparison against other approaches. Finally, Section 4 draws some conclusions and presents future work.

2 Our approach

In this section we first define the discretization problem. Thereafter, EP is introduced and its adaptation to solve the problem of interest is detailed in four steps: (1) solution encoding, (2) fitness function definition, (3) mutation operator and (4) replacement technique.

2.1 Statement of the problem

The discretization process refers to the transformation of continuous values into discrete values. Formally, the domain is represented as x | x ∈ R, where R is the set of real numbers, and the discretization scheme is D = {[d_0, d_1], (d_1, d_2], ..., (d_{n-1}, d_n]}, where d_0 and d_n are the minimum and maximum values of x, respectively. Each pair in D represents an interval; each continuous value is mapped to one of the elements of the discrete set {1, ..., m}, where m is called the discretization degree and the d_i, i = 1...n, are the limits of the intervals, also known as cut points. The discretization process has to be carried out over both characteristics: the length (word size) and the interval of values taken by the continuous variable (alphabet).

Within our approach we use a modified version of the PAA algorithm [8]. PAA requires the number of segments for the time series as an input value, and all of its partitions have equal length. In our proposed approach each segment is calculated through the same idea as in PAA, by using mean values; however, partitions will not necessarily have equal lengths. This difference can be stated as follows: let C be a time series of length n represented as a vector C = c_1, ..., c_n, and let T = t_1, t_2, ..., t_m be the discretization scheme over the word size, where (t_i, t_{i+1}] is the time interval of segment i of C. The element i of the discretized series is then given by:

c_i = (1 / (t_{i+1} - t_i)) Σ_{j = t_i + 1}^{t_{i+1}} C_j
This difference can be stated as follows: let C be a time series with length n represented as a vector C = c1 , ..., cn and T = t1 , t2 , ..., tm be the discretization scheme over word size, where {(ti , ti+1 ]} is the time interval from segment i of C, where the element i Pti+1 Ctj from C is given by: ci = (ti+11−ti ) j=t i +1 2.2 Evolutionary programming EP is a simple but powerful evolutionary algorithm where evolution is simulated at species level, i.e., no crossover is considered [5]. Instead, asexual reproduction is implemented by a mutation operator. The main steps in EP are: 1. Population initialization. 2. Evaluation of solutions. 3. Offspring generation by mutation. 4 4. Replacement. From the steps mentioned above, the following elements must be defined so as to adapt EP to the time series discretization problem: (a) solution encoding, (b) fitness function to evaluate solutions, (c) mutation operator and (d) replacement mechanism. They are described below. Solution encoding As in other evolutionary algorithms, in EP a complete solution of the problem must be encoded in each individual. A complete discretization scheme is encoded as shown in Fig. 2a, where the word size is encoded first with integer values, followed by the alphabet represented by real numbers, which must be sorted so as to apply the scheme to the time series data set [17] as shown in Fig. 2b. (a) Solution encoding: each block represents a cut point, the first part of the segment (before the red line) encodes the word size. The second part represents the alphabet. The first part indicates that the first segment goes from position 1 to position 23, the second segment goes from position 24 to position 45, and so on. In the same manner, the second part shows the alphabet intervals. See figure 2b (b) Solution decoding: the decoded solution from Fig. 2a, after sorting the values for the word size and the alphabet, is applied to a time series. The solution can be seen as a grid (i.e., the word size over the x-axis and the alphabet over the y-axis) Fig. 2: Solution encoding-decoding Fitness function Different measures have been reported in the specialized literature to determine the quality of discretization schemes, such as information criterion [3], persistence state [14], information entropy maximization (IEM), Times Series Discretization Using Evolutionary Programming 5 information gain, entropy maximization, Petterson-Niblett and minimum description length (MDL) [11, 4]. Our fitness function, which aims to bias EP to promising regions of the search space, is based on three elements: 1. Classification accuracy (accuracy) based on entropy. 2. Strings reduction level (num strings). 3. Compression level (num cutpoints). Those three values are normalized and added into one single value using the relationship in Eq. 1 for individual j in the population P op. F itness(P opj ) = (α accuracy) + (β num strings) + (γ num cutpoints) (1) where: α, β y γ are the weights whose values determine the importance of each element. The whole evaluation process for a given individual, i.e., discretization scheme, requires the following steps: First, the discretization scheme is applied over the complete time series data set S. Then, N S strings are obtained, where N S is equal to the number of time series from the data set S. A m × ns matrix called M is generated, where m is the number of different classes and ns is the number of different strings obtained. From this discretized data set, SU is computed as the list of unique strings. 
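The following Python sketch assembles the three terms as defined in Eqs. 1-4. It is our illustration; the base-2 logarithm in the entropy and the absence of an explicit normalization of the accuracy term are assumptions, since the text does not fully specify them.

import math
from collections import Counter, defaultdict

def fitness(strings, labels, word_size, series_length,
            alpha=0.9009, beta=0.0900, gamma=0.0090):
    # Group the class labels of the discretized series by their string
    # (the columns of the matrix M described above).
    by_string = defaultdict(list)
    for s, c in zip(strings, labels):
        by_string[s].append(c)
    # Eq. 2: accuracy = sum of the class entropies of each column.
    accuracy = 0.0
    for column in by_string.values():
        total = len(column)
        for count in Counter(column).values():
            p = count / total
            accuracy -= p * math.log2(p)  # base-2 entropy (assumed)
    # Eq. 3: string-reduction term.
    num_classes = len(set(labels))
    num_strings = (len(by_string) - num_classes) / (len(strings) + num_classes)
    # Eq. 4: compression term.
    num_cutpoints = word_size / (2 * series_length)
    # Eq. 1: weighted aggregation; lower values are better for all three terms.
    return alpha * accuracy + beta * num_strings + gamma * num_cutpoints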
Mutation operator. The mutation operator is applied to every individual in the population in order to generate one offspring per individual. We use a value NMUT ∈ {1, 2, 3} to define how many changes will be made to an individual; each time an individual is mutated, the NMUT value is drawn anew. Each change consists of choosing a position of the vector defined in Fig. 2a and generating a new valid value at random.

Replacement mechanism. The replacement mechanism consists of sorting the current population and their offspring by their fitness values and letting the first half survive to the next generation, while the second half is eliminated.

The pseudocode of our EP algorithm to tackle the time series discretization problem is presented in Algorithm 1, where a population of individuals, i.e., valid schemes, is generated at random. After that, each individual m generates one offspring by the mutation operator, implemented as one to three random changes in the encoding. The set of current individuals Pop and the set Offspring are merged into one set called Pop′, which is sorted based on fitness, and the first half remains for the next generation. The process finishes when a number MAXGEN of generations has been computed, and the best discretization scheme is then used to discretize the data set to be classified by the k-nearest neighbors (KNN) algorithm.

Algorithm 1 EP pseudocode
1:  Pop = ∅
2:  for m = 0 to popsize do
3:     Pop_m = Valid_Scheme()  % Generate individuals at random
4:  end for
5:  for k = 0 to MAXGEN do
6:     Offspring = ∅
7:     for m = 0 to popsize do
8:        Offspring_m = Mutation(Pop_m)  % Create a new individual by mutation
9:     end for
10:    Pop′ = Replacement(Pop + Offspring)  % Select the best ones
11:    Pop = Pop′
12: end for
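A minimal Python transcription of Algorithm 1 might look as follows; new_scheme, random_valid_value and fitness are hypothetical stand-ins for the encoding, the valid mutation domain and the fitness function described above.

import random

def mutate(ind, random_valid_value):
    # NMUT ∈ {1, 2, 3} random changes, drawn anew on every call (Sect. 2.2).
    child = list(ind)
    for _ in range(random.randint(1, 3)):
        pos = random.randrange(len(child))
        child[pos] = random_valid_value(pos)  # new valid value for that position
    return child

def evolve(popsize, maxgen, new_scheme, random_valid_value, fitness):
    # Algorithm 1: mutation-only reproduction with elitist replacement.
    pop = [new_scheme() for _ in range(popsize)]  # Valid_Scheme()
    for _ in range(maxgen):
        offspring = [mutate(ind, random_valid_value) for ind in pop]
        merged = sorted(pop + offspring, key=fitness)  # lower fitness is better
        pop = merged[:popsize]  # the best half survives
    return min(pop, key=fitness)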
3 Experiments and results

The EP algorithm was tested on twenty data sets, available from a well-known time series classification/clustering web page [10]. A summary of the features of each data set is presented in Table 1. The EP algorithm was executed with the following parameters, found experimentally after preliminary tests: popsize = 250, MAXGEN = 50, α = 0.9009, β = 0.0900 and γ = 0.0090.

Table 1: Data sets used in the experiments.

The quality of the solutions obtained by the EP algorithm was computed by using the best discretization scheme obtained over a set of five independent runs in the k-nearest neighbors classifier with K = 1. Other K values were tested (K = 3 and K = 5), but the performance decreased in all cases; therefore, those results are not included in this paper. The low number of runs (5) is due to the time required by the algorithm (more than an hour) to process one single run for a given data set. The distance measure used in the k-nearest neighbors algorithm was the Euclidean distance. The algorithms used for comparison were GENEBLA and SAX; the raw data was also included as a reference. Since SAX requires the word size and alphabet as inputs, it was run using the parameters obtained by the EP algorithm and also those obtained by GENEBLA.
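For reference, a minimal 1-NN classifier under the Euclidean distance is sketched below; applying it to the discretized series assumes the symbols are first mapped back to numeric levels, a detail the text does not spell out.

import math

def classify_1nn(train_set, train_labels, query):
    # 1-nearest-neighbor classification under the Euclidean distance,
    # as used to score the discretization schemes in the experiments.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = min(range(len(train_set)), key=lambda i: dist(train_set[i], query))
    return train_labels[nearest]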
Table 2 summarizes the error rate on the twenty time series data sets for K = 1 for each evaluated algorithm (EP, SAX(EP), GENEBLA, SAX(GENEBLA)) and the raw data. Values go from zero to one, where lower is better. The values between parentheses indicate the confidence in the significance of the differences observed, based on statistical tests applied to the samples of results per algorithm. In all cases the differences were significant.

From the results in Table 2 it can be noticed that the compared algorithms provided different performances. EP obtained the lowest error rate in nine data sets. On the other hand, GENEBLA had better results in just three data sets. Regarding the combination of EP and GENEBLA with SAX, slightly better results were observed with GENEBLA-SAX with respect to EP-SAX, with better results in five and three data sets, respectively. It is worth noticing that EP provided its best performance in data sets with a lower number of classes (between 2 and 4): CBF, Face Four, Coffee, Gun Point, ECG200 and Two Pattern (Fish is the only exception). On the other hand, GENEBLA performed better in data sets with more classes (Adiac with 37 and Face All with 14). Another interesting finding is that SAX seems to help EP to solve, with the best performance among the compared approaches, those data sets with a higher number of classes (Lighting7 with 7, 50words with 50 and Swedish Leaf with 15). In contrast, the combination of GENEBLA with SAX helped the former to deal with some data sets with a lower number of classes (Beef with 5, Lighting2 with 2, Synthetic Control and OSU Leaf with 6, and Wafer with 2). Finally, there was no clear pattern regarding the algorithm with the best performance when considering the sizes of the training and test sets or the time series length.

Table 2: Error rate obtained by each compared approach in the 20 data sets. The best result for each data set is remarked with a gray background. Raw data is presented only as a reference.

4 Conclusions and Future Work

We presented a novel time series discretization algorithm based on EP. The proposed algorithm was able to automatically find the parameters of a good discretization scheme considering the optimization of accuracy and compression rate. Moreover, as far as we know, this is the first approach that optimizes the word size and the alphabet at the same time. A simple mutation operator was able to sample the search space by generating new and competitive solutions. Our EP algorithm is easy to implement, and the results obtained in 20 different data sets were highly competitive with respect to previously proposed methods, including the raw data, i.e., the original time series, which means that the EP algorithm is able to retain the important information of a continuous time series while disregarding unimportant data. The EP algorithm provided a high performance with respect to GENEBLA in problems with a low number of classes. Moreover, if EP is combined with SAX, the approach is able to outperform GENEBLA and also GENEBLA-SAX in problems with a higher number of classes.

The future work consists of a further analysis of the EP algorithm, such as the effect of the weights in the search as well as the number of changes in the mutation operator. Furthermore, other classification techniques (besides k-nearest neighbors) need to be tested. Finally, Pareto dominance will be explored with the aim of dealing with the three objectives considered in the fitness function [15].

References

1. García-López, D.A., Acosta-Mesa, H.-G., Ernesto, C.-P.: Discretization of Time Series Dataset with a Genetic Search. In: Proceedings of MICAI '09, the 8th Mexican International Conference on Artificial Intelligence. ISBN 978-3-642-05257-6 (2009).
2. Acosta-Mesa, H.G., Cruz-Ramírez, N., García-López, D.-A.: Entropy Based Linear Approximation Algorithm for Time Series Discretization. In: Advances in Artificial Intelligence and Applications, Research in Computing Science, vol. 32, pp. 214-224.
3. Dimitrova, E.S., McGee, J., Laubenbacher, R.: Discretization of Time Series Data (2005). arXiv:q-bio/0505028.
4. Fayyad, U., Irani, K.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence (1993).
5. Fogel, L.: Intelligence Through Simulated Evolution: Forty Years of Evolutionary Programming. Wiley Series on Intelligent Systems (1999).
6. García-López, D.A.: Algoritmo de Discretización de Series de Tiempo Basado en Entropía y su Aplicación en Datos Colposcópicos. Universidad Veracruzana (2007).
7. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (2001).
8. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. ACM Trans. Database Syst. (2002).
9. Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards parameter-free data mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004).
10. Keogh, E., Xi, X., Wei, L., Ratanamahatana, C.A.: The UCR Time Series Classification/Clustering Homepage (2006), http://www.cs.ucr.edu/~eamonn/time_series_data/.
11. Kurgan, L., Cios, K.: CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering (2004).
12. Last, M., Kandel, A., Bunke, H.: Data Mining in Time Series Databases. World Scientific, Singapore (2004).
13. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2003).
14. Mörchen, F., Ultsch, A.: Optimizing Time Series Discretization for Knowledge Discovery. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (2005).
15. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation (2002).
16. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2009).
17. Chiu, C., Nanh, S.C.: An adapted covering algorithm approach for modeling airplanes landing gravities. Expert Systems with Applications 26, 443-450 (2004).