Clustering Time Series Data An Evolutionary Approach Monica Chiş, Soumya Banerjee, Aboul Ella Hassanien Springer 2009 Outline Introduction Related Works: Time Series Clustering Evolutionary Time Series Clustering Algorithm (ETSC) Experimental Results Conclusions and Future Work Introduction Time series are an ordered sequence of values of a variable at equally spaced time intervals. Traditional time series analysis and modeling tend to be based on non-automatic and trialand-error approaches. (Cont.) Time series arise in many important areas. Clustering of time series has attracted the interest of researchers from a wide range of fields. Related Works: Time Series Clustering The mining of time series data has attracted great attention in the data mining community in recent years. There are two main categories in time series clustering: Whole clustering Subsequence clustering the clustering performed on many individual time series to group similar series into clusters based on sliding window extractions of a single time series and aims to find similarity and differences among different time window of a single time series. (Cont.) Clustering Approach Hierarchical Clustering Generality Only small datasets K-Means Distance is minimized (Cont.) Time-Series Use lower dimensionality that preserves the original information Describes the original shape of the time-series data as closely as possible. (Cont.) Another categorization of clustering time series is the classification in model based or model free. Model based assume some form of the underlying generating process, estimate the model from each data then cluster based on similarity between model parameters. Markov chain Hidden Makov Models Model free Fast Fourier transforms Wavelet transforms involve the specification of a specific distance measure and/or a transformation of the data. Evolutionary Time Series Clustering Algorithm (ETSC) A new evolutionary algorithms focused on whole clustering of time series, called Evolutionary Time Series Clustering- ETSC is presented. (Cont.) Apply an evolutionary algorithm for solving a problem is To find : 1. A suitable representation for candidate solutions 2. By finding a good representation for clustering time series dataset. The fitness function will be considered a similarity measure of time series (Cont.) Two time-series are similarity if they have enough non overlapping time ordered subsequences that are similar. Two subsequences are similarity if one is enclosed within an envelope of a user defined width around another Evolutionary Time Series Clustering 選擇與複製 突變 Experimental Results Conclusions and Future Work An evolutionary approach to perform clustering on time series data set was presented. Some different fitness functions are tested. Hierarchical Clustering Calculate the similarity between all possible combinations of two profiles Keys • Similarity • Clustering Two most similar clusters are grouped together to form a new cluster Calculate the similarity between the new cluster and all remaining clusters. 出處:IR講義 (Cont.) 出處:IR講義 The K-Means Clustering Method Example 10 10 9 9 8 8 7 7 6 6 5 5 10 9 8 7 6 5 4 4 3 2 1 0 0 1 2 3 4 5 6 7 8 K=2 Arbitrarily choose K object as initial cluster center 9 10 Assign each objects to most similar center 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 Update the cluster means 4 3 2 1 0 0 1 2 3 4 5 6 reassign 10 9 9 8 8 7 7 6 6 5 5 4 3 2 1 0 1 2 3 4 5 6 7 8 8 9 10 reassign 10 0 7 9 10 Update the cluster means 4 3 2 1 0 0 1 2 3 4 5 6 出處:IR講義 7 8 9 10 Evoluationary Algorithm Evolutionary algorithms are randomized search optimization techniques guided by the principles of natural biology evolution processes (Cont.) (Cont.) They are effective and robust search processes provide a near optimal solution of a fitness function. (Cont.) Evolutionary algorithms are used to improve the robustness and accuracy of the more traditional techniques used in feature extraction, feature selection, classification and clustering Gene Algorithm 1975年 John Holland 學者所提出 成為非常著名的演算法 Genetic Algorithm (GA) 基因演算法 又稱: 遺傳演算法 出處:基因演算法介紹 基因演算法的概念 將問題編碼成 染色體的型式 基因: 0 or 1 染色體: x = 0 1 0 0 1 1 0 設計 適應函數 (fitness function) f(x) :適應函數,給一 x 可得一 適應值 f(x) 適應值愈高, 表愈適應環境 出處:基因演算法介紹 基因演算法的三個基本動作 選擇(Selection)與複製 (Reproduction): X=0100110 選擇 適應值高 之染色體 複製成多個染色體 出處:基因演算法介紹 (Cont.) 交換 (crossover): 父:0 1 0 0 1 1 0 母:1 0 1 1 0 0 1 交換 0111010 1000101 出處:基因演算法介紹 (Cont.) 突變 (mutation): 0100110 1 出處:基因演算法介紹 (Cont.) 產生初始族群 是否滿足 停止條件 否 是 產生最適個體 計算適應値 選擇 複製 交換 突變 產生下一子代 出處:基因演算法介紹 Fitness Function (1) Fitness Function (2) (Cont.) An ordered set of m realvalued variables The dataset could be described in matrix as follows An individual is represented as a vector (Cont.) dj is a time series with m real-valued cj is an integer number from 1 to p Representation indicates how time series are assigned to classes The value cj of the gene j indicates the classes to which the object dj is assigned. All the times series for dataset are assigned to a class (cluster). The number of classes is determining by the evolutionary algorithms