Technical Document of YADING: Fast and Automatic Clustering of Large-Scale Time Series Data
YADING project members: Rui Ding, Qiang Wang, Yingnong Dang, Qiang Fu, Haidong Zhang, Dongmei Zhang
Software Analytics Group, Microsoft Research
December, 2013

1. DOCUMENT
This appendix includes the detailed proofs of several statements, together with other useful information.

1.1 Lemmas and Proofs

1.1.1 Sample Size Determination
Sampling is the most effective mechanism for handling the scale of the input dataset. Since we want to achieve high performance and we do not assume any distribution of the input dataset, we choose random sampling [47] as our sampling algorithm. In practice, a predefined sampling rate is often used to determine the size $s$ of the sampled dataset. As $N$, the size of the input dataset, keeps increasing, $s$ also increases accordingly, which results in slow clustering performance on the sampled dataset. Furthermore, it is unclear what impact the increased number of samples may have on the clustering accuracy. We derive the following theoretical bounds to guide the selection of $s$.

Assume that the ground truth of clustering is known for $\mathbb{T}_{N\times D}$, i.e. all the $T_i \in \mathbb{T}_{N\times D}$ belong to $k$ known groups, and $n_i$ represents the number of time series in the $i$-th group. Let $p_i = n_i/N$ denote the population ratio of group $i$. Similarly, let $p_i' = n_i'/s$ denote the population ratio of the $i$-th group on the sampled dataset. $|p_i - p_i'|$ reflects the ratio deviation between the input dataset $\mathbb{T}_{N\times D}$ and the sampled dataset $\mathbb{T}_{s\times D}$.

We formalize the selection of the sample size $s$ as finding the lower bound $s_l$ and the upper bound $s_u$ such that, given a tolerance $\epsilon$ and a confidence level $1-\alpha$: (1) a group $i$ with $p_i$ less than $\epsilon$ is not guaranteed to have sufficient instances in the sampled dataset when $s < s_l$; and (2) the maximum ratio deviation $|p_i - p_i'|$, $1 \le i \le k$, is within the given tolerance when $s \ge s_u$. Intuitively, the lower bound constrains the smallest size of clusters that can possibly be found, and the upper bound indicates that once the sample size exceeds a threshold, more samples will not change the clustering result.

Lemma 1 (lower bound): Given $m$, the least number of instances required in the sampled dataset for group $i$, the tolerance $\epsilon$, and the confidence level $1-\alpha$, the sample size
$$ s \;\ge\; \frac{m + z_\alpha\left(\frac{z_\alpha}{2} + \sqrt{m + \frac{z_\alpha^2}{4}}\right)}{p_i} $$
satisfies $P(n_i' \ge m) > 1-\alpha$. Here $z_\alpha$ is a function of $\alpha$ defined by $P(Z > z_\alpha) = \alpha$, where $Z \sim N(0,1)$.

With confidence level $1-\alpha$, Lemma 1 provides the lower bound on the sample size $s$ that guarantees $m$ instances in the sampled dataset for any cluster with population ratio higher than $\epsilon$. For example, if a cluster has $p_i > 1\%$, and we set $m = 5$ with confidence 95% (i.e. $1-\alpha = 0.95$), then we get $s_l \ge 1{,}030$. In this case, when $s < 1{,}030$, the clusters with $p_i < 1\%$ have a perceptible probability (>5%) of being missed in the sampled dataset.

It should be noted that the selection of $m$ is related to the clustering method applied to the sampled dataset. For example, DBSCAN is a density-based method, and it typically requires 4 nearest neighbors of a specific object to identify a cluster. Thus, any cluster with fewer than 5 members is difficult to find. This consideration of the clustering method also supports our formalization for deciding $s_l$.

Proof: Event $\{n_i' \ge m\} \Leftrightarrow \left\{\frac{n_i' - s p_i}{\sigma} \ge \frac{m - s p_i}{\sigma}\right\} \Leftrightarrow \left\{Z \ge \frac{m - s p_i}{\sigma}\right\}$; the last equivalence holds since $n_i' \sim N(s p_i,\, s p_i(1-p_i))$ approximately, and here $\sigma = \sqrt{s p_i (1-p_i)}$.
So
$$ P(n_i' \ge m) > 1-\alpha \;\Leftrightarrow\; P\!\left(Z \ge \frac{m - s p_i}{\sigma}\right) > 1-\alpha \;\Leftrightarrow\; \frac{m - s p_i}{\sigma} \le -z_\alpha. $$
Substituting $\sigma = \sqrt{s p_i(1-p_i)}$, squaring both sides, and solving the resulting quadratic inequality in $s p_i$ gives
$$ s\,p_i \;\ge\; m + \frac{z_\alpha^2}{2}(1-p_i) + z_\alpha\sqrt{m(1-p_i) + \frac{z_\alpha^2}{4}(1-p_i)^2}. $$
Since $z_\alpha$ is usually valued in $[0, 3]$ and $1-p_i \le 1$, this requirement is implied by the slightly looser but simpler condition
$$ s \;\ge\; \frac{m + z_\alpha\left(\frac{z_\alpha}{2} + \sqrt{m + \frac{z_\alpha^2}{4}}\right)}{p_i}, $$
hence the lemma is proven.

Lemma 2 (upper bound): Given the tolerance $\epsilon$ and the confidence level $1-\alpha$, the sample size $s \ge \frac{z_{\alpha/2}^2}{4\epsilon^2}$ satisfies $P(\max_i |p_i - p_i'| < \epsilon) > 1-\alpha$. Here $z_{\alpha/2}$ is a function of $\alpha$, $P(Z > z_{\alpha/2}) = \alpha/2$, where $Z \sim N(0,1)$.

Lemma 2 implies that the sample size $s$ depends only on the tolerance $\epsilon$ and the confidence level $1-\alpha$; it is independent of the input data size. For example, if we set $\epsilon = 0.01$ and $1-\alpha = 0.95$, which means that for any group the difference of its population ratio between the input dataset and the sampled dataset is less than 0.01, then the lowest sample size that guarantees this setting is $s \ge \frac{z_{0.025}^2}{4\times 0.01^2} \approx 9{,}600$. More than 9,600 samples are not necessary. This makes $s = 9{,}600$ the upper bound of the sample size. Moreover, this sample size does not change with the size of the input dataset.

Proof: The probability that a sample belongs to the $i$-th group is $p_i$, due to the property of random sampling. Since the samples are independent, the number of sampled instances belonging to the $i$-th group follows a Binomial distribution, $n_i' \sim B(s, p_i)$. $B(s, p_i) \approx N(s p_i,\, s p_i(1-p_i))$ when $s$ is large and the distribution is not too skewed [45], so we assume $n_i' \sim N(s p_i,\, s p_i(1-p_i))$ by approximation. Then $p_i' = \frac{n_i'}{s} \sim N\!\left(p_i, \frac{p_i(1-p_i)}{s}\right)$, so $X = p_i' - p_i \sim N\!\left(0, \frac{p_i(1-p_i)}{s}\right)$. Event $\{|p_i - p_i'| < \epsilon\} \Leftrightarrow \{|X| < \epsilon\} \Leftrightarrow \left\{|Z| < \frac{\epsilon}{\sigma}\right\}$, where $\sigma = \sqrt{\frac{p_i(1-p_i)}{s}}$. So when $\frac{\epsilon}{\sigma} > z_{\alpha/2}$, $P(|X| < \epsilon) > 1-\alpha$ is achieved. Expanding $\sigma$, the valid range of $s$ for the $i$-th group is $s \ge \frac{z_{\alpha/2}^2\, p_i(1-p_i)}{\epsilon^2}$; since $p_i(1-p_i) \le \frac{1}{4}$, $s \ge \frac{z_{\alpha/2}^2}{4\epsilon^2}$ is a valid range for every group, hence the lemma is proven.

Lemma 2 provides a loose upper bound, since we replace $p_i(1-p_i)$ by $\frac{1}{4}$ to bound all possible ratios. So a sample size smaller than 9,600 may still preserve reasonable results while increasing performance. In practice, we vary the sample size from 1,030 (the lower bound) to 10,000 on real datasets to test the clustering results; we finally choose 2,000, since the accuracy of the clustering results is very close once the sample size is larger than 2,000.
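Both bounds are easy to evaluate numerically. The following small sketch (ours, using only the Python standard library; the helper names and the example values m = 5, ε = 0.01, α = 0.05 are our illustrative choices) reproduces the 1,030 and 9,600 figures above.

    from math import sqrt
    from statistics import NormalDist

    def lower_bound_sample_size(m, eps, alpha):
        # Lemma 1: smallest s guaranteeing >= m sampled instances (w.p. > 1 - alpha)
        # for any group whose population ratio is at least eps.
        z = NormalDist().inv_cdf(1 - alpha)          # z_alpha, one-sided
        return (m + z * (z / 2 + sqrt(m + z * z / 4))) / eps

    def upper_bound_sample_size(eps, alpha):
        # Lemma 2: sample size beyond which every group's ratio deviation
        # stays below eps with confidence 1 - alpha.
        z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2}, two-sided
        return z * z / (4 * eps * eps)

    print(round(lower_bound_sample_size(m=5, eps=0.01, alpha=0.05)))   # ~1030
    print(round(upper_bound_sample_size(eps=0.01, alpha=0.05)))        # ~9604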
1.1.2 Phase-Shift Overcoming
In this section, we investigate how the $L_1$ distance combined with density-based clustering can overcome the phase-shift problem. The first observation is that, when the phase shift is small enough, the $L_1$ distance can also be small enough. The second observation is that, when the data scale becomes large, the distance between a particular time series and its kNN ($k$-th nearest neighbor) can be short enough that all the time series become connected when DBSCAN is applied.

Preliminaries: Denote a time series $T(a) = \{f(a+t), f(a+2t), f(a+3t), \dots, f(a+nt)\}$, where $a$ is the initial phase, $t$ is the interval at which the time series is sampled, and $n$ is the length. The time series is generated by an underlying continuous model $f(\cdot)$, which we assume to be an analytic function [46]. Another time series with phase shift $\delta$ is represented by $T(a-\delta) = \{f(a+t-\delta), f(a+2t-\delta), \dots, f(a+nt-\delta)\}$.

Lemma 1: $\exists M$ such that $L_1(T(a), T(a-\delta)) \triangleq \sum_{i=1}^{n} |f(a+it) - f(a+it-\delta)| \le M n \delta$.

Proof: According to the mean-value form of Taylor's theorem for analytic functions,
$$ f(x) = f(b) + f'(\xi)(x - b), \quad \text{where } \xi \in (b, x). $$
We immediately get
$$ L_1(T(a), T(a-\delta)) = \sum_{i=1}^{n} |f(a+it) - f(a+it-\delta)| = \sum_{i=1}^{n} |f'(\xi_i)\,\delta| \le M n \delta, $$
where $M = \max_i |f'(\xi_i)|$ and $\xi_i \in (a+it-\delta,\, a+it)$.

Now suppose we have $N$ time series $T(a - \delta_i)$ that differ from $T(a)$ only by a phase shift $\delta_i$. Without loss of generality, let $\delta_i \in [0, h]$. We assume these time series are generated independently, with $\delta_i$ uniformly distributed in the interval $[0, h]$. Denote the clustering parameter as $\epsilon$ ($\epsilon$ is a distance threshold: if the distance between a specific object and its kNN is smaller than $\epsilon$, then the object is a core point). Denote the event
$$ E_N \triangleq \{T(a-\delta_i),\ i = 1, 2, \dots, N \text{ belong to one cluster}\}. $$

Lemma 2: $P(E_N) \ge 1 - \frac{Mnh}{\epsilon}\left(1 - \frac{\epsilon}{Mnh}\right)^N$.

Proof: Divide the interval $[0, h]$ into buckets of length $\frac{\epsilon}{Mn}$ (there are $\frac{Mnh}{\epsilon}$ of them). According to the mechanism of DBSCAN, if each bucket contains at least one time series, then for every time series $L_1(T(a-\delta_i),\ \text{its kNN}) \le M \times n \times \frac{\epsilon}{Mn} = \epsilon$, so all the time series are core points and they are density-connected; therefore all the time series are grouped into one cluster. Denote the event $B_i \triangleq \{i\text{-th bucket is empty}\}$; then
$$ P(\text{at least one bucket is empty}) = P\Big(\bigcup_i B_i\Big) \le \sum_i P(B_i) = \frac{Mnh}{\epsilon}\left(1 - \frac{\epsilon}{Mnh}\right)^N. $$
Note that the event $\{\text{no bucket is empty}\}$ is a subset of $E_N$, so
$$ P(E_N) \ge P(\text{no bucket is empty}) = 1 - P\Big(\bigcup_i B_i\Big) \ge 1 - \frac{Mnh}{\epsilon}\left(1 - \frac{\epsilon}{Mnh}\right)^N, $$
hence the lemma is proven.

Corollary: $\lim_{N\to\infty} P(E_N) = 1$. This is intuitive.

Description: Lemma 2 provides a probabilistic confidence bound on how likely the phase-shifted time series instances are to be grouped together. As $N$ goes to infinity, the confidence converges to 1.
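To get a feel for the bound in Lemma 2, the short sketch below (ours) evaluates it numerically; the particular values of M, n, h, ε and N are illustrative assumptions, not taken from the paper.

    def phase_shift_cluster_bound(M, n, h, eps, N):
        # Lemma 2: P(E_N) >= 1 - q * (1 - 1/q)**N, with q = M*n*h/eps buckets.
        q = M * n * h / eps
        return 1.0 - q * (1.0 - 1.0 / q) ** N

    # Example: M = 1, n = 100, h = 1, eps = 10  ->  q = 10 buckets.
    # With N = 100 phase-shifted series the bound already exceeds 0.999.
    print(phase_shift_cluster_bound(M=1, n=100, h=1, eps=10, N=100))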
1.1.3 Random Noise Overcoming
This section describes how density-based clustering overcomes the random noise that is encoded into the specific models used to generate particular time series. Here we discuss time series generated by a "precise model" plus "white noise", and we refer to the white noise as the "random noise". In general, such a way of describing time series is very common and natural. Without noise, the time series instances are identical, which makes applying any type of clustering method trivial; when noise is incorporated, the resulting time series instances deviate from one another.

To illustrate how density-based clustering overcomes random noise, we give a theoretical analysis of time series generated by the AR(1) model, which is relatively simple, without loss of generality. Let $x_i$ be the value at the $i$-th epoch of a particular time series instance. AR(1) is represented by $x_i = \phi x_{i-1} + e$, where $|\phi| < 1$ keeps the process stable and $e \sim N(0, \sigma^2)$ is the white noise. For a given time series $\vec{X} \triangleq \{x_1, x_2, \dots, x_n\}^T$, let $x_0 = 0$ be the initial value.

Joint distribution of $\vec{X}$: Denote by $p(x_1, x_2, \dots, x_n)$ the p.d.f. of a time series generated by the given AR(1) model. By the chain rule of probability,
$$ p(x_0, x_1, \dots, x_n) = p(x_n \mid x_0, \dots, x_{n-1})\, p(x_{n-1} \mid x_0, \dots, x_{n-2}) \cdots p(x_1 \mid x_0)\, p(x_0). $$
$P(x_0 = 0) = 1$, and by the Markov property
$$ p(x_i \mid x_0, \dots, x_{i-1}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_i - \phi x_{i-1})^2}{2\sigma^2}\right]. $$
So finally we get
$$ p(x_0, x_1, \dots, x_n) = \frac{1}{(2\pi\sigma^2)^{n/2}}\prod_{i=1}^{n}\exp\!\left[-\frac{(x_i - \phi x_{i-1})^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\!\left[-\frac{\vec{X}^T \Sigma^{-1} \vec{X}}{2\sigma^2}\right]. $$
Here,
$$ \Sigma^{-1} = \begin{pmatrix} 1+\phi^2 & -\phi & & & 0\\ -\phi & 1+\phi^2 & -\phi & & \\ & \ddots & \ddots & \ddots & \\ & & -\phi & 1+\phi^2 & -\phi\\ 0 & & & -\phi & 1 \end{pmatrix}. $$

Now let $N$ time series instances (each of length $n$) be generated independently from the discussed AR(1) model. For density-based clustering, denote the clustering parameter as $\epsilon$ ($\epsilon$ is a distance threshold: if the distance between a specific object and its kNN is smaller than $\epsilon$, then the object is a core point). Define $\rho = \frac{k}{N V_\epsilon}$, where $V_\epsilon \triangleq C_n \times \epsilon^n$ is the volume of the hyper-sphere of radius $\epsilon$ in the $n$-dimensional $L_p$ space (see Figure 2); e.g., in Euclidean space $C_n = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$.

Lemma 1: The expected ratio of the $N$ objects (i.e., these time series instances) that can be clustered together by applying density-based clustering is
$$ P(N, \epsilon) \approx \int \cdots \int_{p(\vec{X}) \ge \rho} p(\vec{X})\, dx_1\, dx_2 \cdots dx_n. $$

The proof is straightforward. Since $p(\vec{X})$ is the p.d.f. of $\vec{X}$, the expected number of objects falling within the $\epsilon$-neighborhood of a given position $\vec{X}$ is approximately $N p(\vec{X}) V_\epsilon$. When this local density exceeds $k$ (equivalently, $p(\vec{X}) \ge \rho$), the points in the neighborhood of $\vec{X}$ can be identified as core points, because $p(\vec{X})$ is a smooth function. By further considering the convexity of $p(\vec{X})$, the whole region with density greater than $\rho$ can be grouped together.

Corollary: $\lim_{N\to\infty} P(N, \epsilon) = 1$. The local density can be made arbitrarily large once the number of objects increases to infinity.

1.1.4 Multi-Density Estimation
Density estimation is key to density-based clustering algorithms. It is performed manually or with slow performance in most of the existing algorithms [23][38][39][40][41]. In this section, we define the concept of density radius and provide a theoretical analysis of its estimation. We use the density radius in YADING to identify the core points of the input dataset and to conduct multi-density clustering accordingly.

The notations used in this section are summarized as follows:
$N$: total number of objects (same as points).
$n$: dimensionality of each object.
$dis_k$: the distance between an object and its kNN.
$dis_k$ curve: the $dis_k$ values of all objects, sorted in descending order.
$EDF_k(\epsilon) \triangleq \frac{|\{\text{objects whose } dis_k \le \epsilon\}|}{N}$; $EDF_k$ is the empirical distribution function of $dis_k$.
$V_\epsilon \triangleq C_n \times \epsilon^n$: the volume of the hyper-sphere of radius $\epsilon$ in the $n$-dimensional $L_p$ space; e.g., in Euclidean space $C_n = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2}+1)}$.
Density radius: the most frequent $dis_k$ value in the $dis_k$ curve.

We define $dis_k$ of an object as the distance between this object and its kNN. A $dis_k$ curve is the list of $dis_k$ values in descending order. Figure 1 shows an example of a $dis_k$ curve with $k = 4$. We define the density radius as the most frequent $dis_k$ value. Intuitively, most objects contain exactly $k$ nearest neighbors in a hyper-sphere whose radius equals the density radius.

Figure 1. 4-dis curve of a time series dataset (X-axis: point index; Y-axis: 4-dis value; three potential inflection points are marked).
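As a concrete illustration of these definitions, the following sketch (ours, not part of YADING's implementation) computes the dis_k values, the descending dis_k curve, and EDF_k for a small dataset using pairwise L1 distances.

    import numpy as np

    def dis_k_curve(data, k=4):
        # data: array of shape (N, n); returns the dis_k value of every object
        # and the dis_k curve (the same values sorted in descending order).
        diffs = np.abs(data[:, None, :] - data[None, :, :]).sum(axis=2)  # pairwise L1 distances
        np.fill_diagonal(diffs, np.inf)              # exclude the object itself
        dis_k = np.sort(diffs, axis=1)[:, k - 1]     # distance to the k-th nearest neighbor
        return dis_k, np.sort(dis_k)[::-1]

    def edf_k(dis_k, eps):
        # EDF_k(eps) = fraction of objects whose dis_k is at most eps.
        return np.mean(dis_k <= eps)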
Preliminaries: We transform the estimation of the density radius into identifying the inflection point on the $dis_k$ curve. Here, an inflection point takes the general definition of a point where the second derivative equals zero. Next, we provide the intuition behind this transformation, followed by a theoretical analysis.

Intuitively, the local area around an inflection point on the $dis_k$ curve is the flattest (i.e. the slopes on its left-hand and right-hand sides have the smallest difference). On the $dis_k$ curve, the points in the neighborhood of an inflection point have close values of $dis_k$. For example, in Figure 1 there are three inflection points, with corresponding $dis_k$ values equal to 1,500, 500, and 200. In other words, most points on this curve have $dis_k$ values close to 1,500, 500, or 200. According to the definition of the density radius, these three values can be used to approximate three density radiuses.

We now provide a theoretical analysis of estimating the density radius by identifying the inflection point on the $dis_k$ curve. We first prove that the Y-value ($dis_k$) of each inflection point on the $dis_k$ curve equals one unique density radius. Specifically, given a dataset with a single density, we provide an analytical form of its $dis_k$ curve and prove that there exists a unique inflection point whose Y-value equals the density radius of the dataset. We then generalize the estimation to datasets with multiple densities.

To make the mathematical deduction easier, we use $EDF_k(\epsilon) \triangleq \frac{|\{\text{objects whose } dis_k \le \epsilon\}|}{N}$ to represent the $dis_k$ curve equivalently. EDF is short for Empirical Distribution Function. It is the $dis_k$ curve rotated 90 degrees clockwise, with the Y-axis normalized. The X-value of an inflection point on $EDF_k(\epsilon)$ equals the Y-value of the corresponding inflection point on the $dis_k$ curve.

Problem 1 (Analytical expression of $EDF_k(\epsilon)$): Suppose $N$ objects are sampled independently from a uniform distribution defined on a region with volume $V$ (see Figure 2), so the density is $\rho = \frac{N}{V}$. $S_\epsilon$ is an arbitrary hyper-sphere region with radius $\epsilon$, centered at $c$, where a particular object is located at $c$. Define the event $E_{j,\epsilon} \triangleq \{\text{exactly } j \text{ objects are inside } S_\epsilon,\ j \ge 1\}$.

Figure 2. One population generated by uniform distribution.

Lemma 1: $P(E_{j,\epsilon}) = C_{N-1}^{j-1}\, p_\epsilon^{j-1}(1-p_\epsilon)^{N-j}$, where $p_\epsilon = \frac{V_\epsilon}{V} = \frac{C_n\,\epsilon^n}{V}$.

Proof: Since there is already one object inside the sphere (at its center), $E_{j,\epsilon}$ requires $j-1$ extra objects inside the sphere and $N-j$ objects outside. $p_\epsilon$ is the probability that one object is sampled inside the sphere, so the lemma follows directly from the binomial distribution.

Lemma 2: $P(dis_k \le \epsilon) = \sum_{j=k+1}^{N} P(E_{j,\epsilon})$.

Proof: If $dis_k \le \epsilon$, then the kNN of the object is inside the hyper-sphere, so there are at least $k+1$ objects inside the hyper-sphere; and vice versa.

Corollary 1: $EDF_k(\epsilon) \approx P(dis_k \le \epsilon)$.

We should point out that, although the $dis_k$ values of all objects share the same distribution, they are not totally independent, so we cannot directly use $EDF_k(\epsilon)$ to approximate $P(dis_k \le \epsilon)$. We assume, however, that this is a good approximation, which is also evidenced by simulation experiments.

Corollary 2: $EDF_1(\epsilon) \approx 1 - e^{-\rho C_n \epsilon^n}$.

$EDF_1$ has the simplest form. According to Lemmas 1 and 2,
$$ EDF_1(\epsilon) = 1 - (1-p_\epsilon)^{N-1} = 1 - \left(1 - \frac{\rho C_n \epsilon^n}{N}\right)^{N-1} \approx 1 - \left(1 - \frac{\rho C_n \epsilon^n}{N}\right)^{N} \approx 1 - e^{-\rho C_n \epsilon^n}. $$
The last two approximations make sense when $N$ is large.

Lemma 3 (existence and uniqueness): There exists one and only one inflection point on $EDF_k(\epsilon)$, and the Y-value of the corresponding inflection point on the $dis_k$ curve is the density radius.

Proof: Denote $\epsilon_r$ as the X-value of the inflection point of $EDF_k(\epsilon)$, so $\frac{d^2 EDF_k(\epsilon)}{d\epsilon^2}\big|_{\epsilon=\epsilon_r} = 0$. For $k = 1$,
$$ \frac{d^2 EDF_1(\epsilon)}{d\epsilon^2} = \rho C_n n\, \epsilon^{n-2}\, e^{-\rho C_n \epsilon^n}\left(n - 1 - n\rho C_n \epsilon^n\right), $$
which has a unique zero in $\epsilon > 0$. Since the first derivative of $EDF_k(\epsilon)$ is the probability density function of $dis_k$, the second derivative being zero means that $\epsilon = \epsilon_r$ has the maximum likelihood. In other words, $\epsilon_r$ is the most frequent value of $dis_k$, which is exactly the definition of the density radius. Since the X-value of an inflection point on $EDF_k(\epsilon)$ equals the Y-value of the corresponding inflection point on the $dis_k$ curve, the lemma is proven.

Corollary 3: For $EDF_1(\epsilon)$, the inflection point is $\epsilon_r = \left(\frac{n-1}{n\rho C_n}\right)^{1/n}$.

Taking $\frac{d^2 EDF_1(\epsilon)}{d\epsilon^2}\big|_{\epsilon=\epsilon_r} = 0$ with the formulation $EDF_1(\epsilon) \approx 1 - e^{-\rho C_n \epsilon^n}$, we obtain the expression of $\epsilon_r$.
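Corollary 2 is easy to check by simulation. The sketch below is an illustration of ours (boundary effects of the sampling region are ignored, which causes a small deviation): it samples N points uniformly in the unit square (n = 2, C_2 = π, V = 1, so ρ = N) and compares the empirical EDF_1 with 1 − exp(−ρπε²).

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    N = 1000
    pts = rng.random((N, 2))                      # uniform on the unit square, rho = N
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    dis_1 = d.min(axis=1)                         # nearest-neighbor (k = 1) distances

    for eps in (0.005, 0.01, 0.02, 0.04):
        empirical = np.mean(dis_1 <= eps)                     # EDF_1(eps)
        predicted = 1 - math.exp(-N * math.pi * eps * eps)    # Corollary 2
        print(eps, round(empirical, 3), round(predicted, 3))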
Problem 2 (mixture of densities): Denote $\theta_i = \{\rho_i, N_i, n_i\}$ as the parameter vector of $EDF_{k,i}(\epsilon)$ for a particular population with a single density. Now suppose there are $h$ regions with different densities, located in the space without overlap. We estimate the expression of the overall $EDF_k(\epsilon)$.

Lemma 4: $EDF_k(\epsilon) = \sum_{i=1}^{h} \frac{N_i}{N}\, EDF_{k,i}(\epsilon)$, where $N = \sum_{i=1}^{h} N_i$.

This follows directly from the definition of $EDF_k(\epsilon)$. The mixture model is just the linear combination of the individual $EDF_{k,i}(\epsilon)$, which enables model inference: given the overall $EDF_k(\epsilon)$, identify the underlying models represented by $\{\theta_1, \dots, \theta_h\}$; from these, the multiple densities can be obtained.

Mixture model identification: Denote the expression of $EDF_k(\epsilon)$ as $EDF_k(\epsilon \mid \theta_1, \dots, \theta_h)$. Given the $dis_k$ curve, represented by a list of pairs $\{y_i, \epsilon_i\}$, we formulate the identification as an optimization problem:
$$ \arg\min \sum_i \left[y_i - EDF_k(\epsilon_i \mid \theta_1, \theta_2, \dots, \theta_h)\right]^2 \quad \text{subject to } \sum_{i=1}^{h} w_i = 1,\ w_i \ge 0\ \forall i, $$
where $w_i = N_i/N$. Several well-developed techniques, such as the EM method, are suitable for solving such a problem.

The next lemma provides theoretical bounds showing that the inflection points on the $EDF_k(\epsilon)$ of mixed densities can approximate the inflection points of each single-density $EDF_{k,i}(\epsilon)$ when the densities are sufficiently different. Without loss of generality, and for simplicity, we consider a mixture of two densities, set $k = 1$, and assume the intrinsic dimensionalities of the two density regions are equal. Denote by $\epsilon_{r1}, \epsilon_{r2}$ the X-values of the inflection points of the two density regions, with densities $\rho_1, \rho_2$ respectively. Denote $c = \frac{\rho_2}{\rho_1}$, $S \triangleq \{\epsilon_a \mid \frac{d^2 EDF_1(\epsilon_a)}{d\epsilon^2} = 0\}$, $w_1 = \frac{N_1}{N}$, and $w_2 = \frac{N_2}{N}$. For the mixture, according to Lemma 4 and Corollary 2,
$$ \frac{d^2 EDF_1(\epsilon)}{d\epsilon^2} = n(n-1)C_n\epsilon^{n-2}\left(w_1\rho_1 e^{-\rho_1 C_n\epsilon^n} + w_2\rho_2 e^{-\rho_2 C_n\epsilon^n}\right) - n^2 C_n^2\epsilon^{2n-2}\left(w_1\rho_1^2 e^{-\rho_1 C_n\epsilon^n} + w_2\rho_2^2 e^{-\rho_2 C_n\epsilon^n}\right). $$

Lemma 5: $\exists\, \epsilon_a \in S$ such that $\lim_{c\to 0} \epsilon_a = \epsilon_{r1}$, and $\exists\, \epsilon_b \in S$ such that $\lim_{c\to\infty} \epsilon_b = \epsilon_{r1}$. The same statement holds for $\epsilon_{r2}$.

Proof: Substituting $\epsilon_{r1} = \left(\frac{n-1}{n\rho_1 C_n}\right)^{1/n}$ into the expression of $\frac{d^2 EDF_1(\epsilon)}{d\epsilon^2}$ above, the terms contributed by the first (density-$\rho_1$) component cancel, and we get
$$ \frac{d^2 EDF_1(\epsilon_{r1})}{d\epsilon^2} = \left(\frac{n-1}{\epsilon_{r1}}\right)^2 w_2\, c\,(1-c)\, e^{-c\left(1-\frac{1}{n}\right)}. $$
This expression tends to 0 whether $c \to 0$ or $c \to \infty$, hence the lemma is proven.

This lemma indicates that, when the density difference is large enough, the inflection points obtained from the mixture $EDF_k(\epsilon)$ approximate the inflection points of the individual single-density regions, which in turn represent the density radiuses.

Corollary 4: For high dimensionality $n \gg 1$, $c = \frac{3\pm\sqrt{5}}{2}$ is the worst density ratio for this approximation.

Letting $1 - \frac{1}{n} \approx 1$, we find that when $c = \frac{3\pm\sqrt{5}}{2}$, $\frac{d^2 EDF_1(\epsilon_{r1})}{d\epsilon^2}$ attains its maximum/minimum value, which is not equal to 0. The corollary implies that, when the density ratio is close to these values, the mixture $EDF_k(\epsilon)$ can hardly be used for density radius identification.

1.2 Tables for Pseudo-code

1.2.1 Dimensionality Reduction
We adopt PAA for dimensionality reduction because of its computational efficiency and its capability of preserving the shape of time series. Denote a time series instance of length $D$ as $T_i \triangleq (t_{i1}, t_{i2}, \dots, t_{iD})$. The transformation from $T_i$ to $T_i' \triangleq (r_{i1}, r_{i2}, \dots, r_{id})$, where
$$ r_{ij} = \frac{d}{D}\sum_{q=\frac{D}{d}(j-1)+1}^{\frac{D}{d}j} t_{iq}, $$
is called PAA with frame length equal to $\frac{D}{d}$. PAA segments a time series instance $T$ into $d$ frames, and uses one value (i.e. the mean value) to represent each frame, so as to reduce its length from $D$ to $d$. One key issue in applying PAA is to specify a proper $d$ automatically.
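A minimal PAA sketch (ours), assuming for simplicity that the original length D is divisible by the target length d:

    import numpy as np

    def paa(series, d):
        # Piecewise Aggregate Approximation: split the series into d frames of
        # equal length D/d and represent each frame by its mean value.
        series = np.asarray(series, dtype=float)
        D = len(series)
        assert D % d == 0, "this sketch assumes D is divisible by d"
        return series.reshape(d, D // d).mean(axis=1)

    # Example: reduce a length-12 series to 4 frames (frame length 3).
    print(paa([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], d=4))   # -> [ 2.  5.  8. 11.]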
As proved by the Nyquist–Shannon sampling theorem, any time series without frequencies higher than B Hertz can be perfectly recovered from its sampled points when the sampling rate is 2·B. This means that using 2·B as the sampling rate preserves the shape of a frequency-bound time series. Although some time series under our study are imperfectly frequency-bound signals, most of them can be approximated by frequency-bound signals because their very high-frequency components usually correspond to noise. Therefore, we transform the problem of determining d into estimating the upper bound of the frequencies.

In this paper, we propose a novel auto-correlation-based approach to identify an approximate value of the frequency upper bound of all the input time series instances. The frame length is then easily determined as the inverse of the frequency upper bound.

In more detail, we first identify the typical frequency of each time series instance $T_i$ by locating the first local minimum of its auto-correlation curve $R_i$, defined as $R_i(y) = \sum_{j=1}^{D-y} t_{ij}\, t_{i(j+y)}$, where $y$ is the lag. If there is a local minimum of $R_i$ at a particular lag $y'$, then $y'$ relates to a typical half-period if $R_i(y') < 0$. In this case, we call $1/y'$ the typical frequency of $T_i$. The smaller $y'$ is, the higher the frequency it represents.

Then, we sort all the detected typical frequencies in ascending order and select the 80th percentile to approximate the frequency upper bound of all the time series instances. The reason we do not use the exact maximum typical frequency is to remove the potential instability caused by the small amount of extraordinary noise in some time series instances.

Regarding implementation, the auto-correlation curves $R_i(y)$ can be obtained efficiently using the Fast Fourier Transform: (1) $F_i(f) = \mathrm{FFT}[T_i]$; (2) $S_i(f) = F_i(f)\,F_i^*(f)$; (3) $R_i(y) = \mathrm{IFFT}[S_i(f)]$, where IFFT is the inverse Fast Fourier Transform and the asterisk denotes the complex conjugate. Table 1 shows the algorithm for automatically estimating the frame length.

Table 1. Auto estimation of the frame length
  FRAMELENGTH(T'_{N×D})
    for each T_i ∈ T'_{N×D}
      R_i(y) ← auto-correlation applied to T_i
      y*_i ← first local minimum of R_i(y)
    y* ← 80th percentile of the sorted {y*_1, ..., y*_N}
    return y*

The time complexity of data reduction is $O(ND\log D + ND)$; specifically, obtaining the frame length costs $O(ND\log D)$, and applying PAA to the entire input dataset costs $O(ND)$.
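Below is a small sketch (ours) of the frame-length estimation of Table 1; the auto-correlation is computed with an FFT, and the zero-padding to length 2D is our addition to obtain the linear (non-circular) auto-correlation. Following the prose above, it takes the 80th percentile of the typical frequencies, which corresponds to the 20th percentile of the detected half-period lags.

    import numpy as np

    def typical_half_period(series):
        # Auto-correlation via FFT (zero-padded to avoid circular wrap-around),
        # then locate the first local minimum with a negative value.
        x = np.asarray(series, dtype=float)
        D = len(x)
        F = np.fft.rfft(x, n=2 * D)
        R = np.fft.irfft(F * np.conj(F))[:D]          # R(y) for lags y = 0 .. D-1
        for y in range(1, D - 1):
            if R[y] < R[y - 1] and R[y] <= R[y + 1] and R[y] < 0:
                return y                              # lag of the first local minimum
        return None

    def frame_length(dataset):
        # 80th percentile of the typical frequencies 1/y*  ->  frame length 1/that,
        # i.e. the 20th percentile of the detected half-period lags.
        lags = [typical_half_period(t) for t in dataset]
        lags = [y for y in lags if y is not None]
        return int(np.percentile(lags, 20))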
1.2.2 Density Estimation
We implement a fast algorithm to estimate the density radiuses (Table 2). We first find the inflection point with the minimum difference between its left-hand and right-hand slopes. We then recursively repeat this process on the two sub-curves segmented by the obtained inflection point, until no more significant inflection points are found.

Table 2. Algorithm for estimating density radiuses
  Function 1:
  DENSITYRADIUSES(dis_k)
    length ← |dis_k|
    allocate res as list
    INFLECTIONPOINT(dis_k, 0, length, res)
    return res
  Function 2:
  INFLECTIONPOINT(dis_k, s, e, res)
    r ← -1, diff ← +∞
    for i ← s to e
      left ← SLOPE(dis_k, s, i)
      right ← SLOPE(dis_k, i, e)
      if left or right greater than threshold1
        continue
      if |left - right| smaller than diff
        diff ← |left - right|
        r ← i
    if diff smaller than threshold2
      /* record the inflection point, and recursively search */
      add the r-th element of dis_k to res
      INFLECTIONPOINT(dis_k, s, r-1, res)
      INFLECTIONPOINT(dis_k, r+1, e, res)

The time complexity of estimating the density radiuses is as follows. The generation of the $dis_k$ curve costs $O(ds^2)$ due to the calculation of the distance between each pair of objects in the sampled dataset. Multi-density estimation costs $O(s\log s)$ since it adopts a divide-and-conquer strategy.

1.2.3 Clustering
Once we obtain the density radiuses, the clustering algorithm is straightforward. With each density radius specified, from the smallest to the largest, DBSCAN is performed accordingly. In our implementation, we set $k = 4$, which is the MinPts value in DBSCAN. The implementation is illustrated in Table 3.

Table 3. Algorithm for multi-density based clustering
  /* p: the sample data set
     radiuses: the density radiuses */
  MULTIDBSCAN(p, radiuses)
    for each radius ∈ radiuses
      objs ← cluster from DBSCAN(p, radius)
      remove objs from p
    mark p as noise objects

Each run of DBSCAN costs $O(s\log s)$. Since it is performed a number of times not exceeding $s$, the total cost is $O(s^2\log s)$.

1.2.4 Assignment
After clustering is performed on the sampled dataset, a cluster label needs to be assigned to each unlabeled time series instance in the input dataset. The assignment process is straightforward. For an unlabeled instance, its closest labeled instance is found. If their distance is less than the density radius of the cluster the labeled instance belongs to, then the unlabeled instance is considered to be in the same cluster as the labeled instance. Otherwise, it is labeled as noise.

Figure 3. Illustration of the pruning strategy in assignment (labeled and unlabeled points; the labeled neighbors of b within distance dis − r are pruned).

The assignment process involves the distance computation between every pair of unlabeled and labeled instances, which has complexity $O(Nsd)$. The observation illustrated in Figure 3 can reduce this computation. If an unlabeled object $a$ is far from a labeled object $b$, i.e. their distance $dis$ is greater than the density radius $r$ of $b$'s cluster, then the distance between $a$ and the labeled neighbors of $b$ (within $dis - r$) is also greater than $r$, according to the triangle inequality. Therefore, the distance computations between $a$ and each of $b$'s neighbors are saved.

We design a data structure named Sorted Neighbor Graph (SNG) to achieve the above pruning strategy. When performing density-based clustering on the sampled dataset, if an instance $b$ is determined to be a core point, then $b$ is added to SNG, and its distances to all the other instances in the sampled dataset are computed and stored in SNG in ascending order. Quick-sort is used in the construction of SNG, so the time complexity of constructing SNG is $O(s^2\log s)$. The implementation of assignment using SNG is shown in Table 4. Its time complexity is $O(Nsd)$: although SNG and pruning can reduce the search space, in the worst case every unlabeled instance still has to be compared to every labeled instance.

Table 4. Algorithm for assignment
  // uObj: the list of unlabeled objects
  ASSIGNMENT(SNG, uObj)
    for each obj ∈ uObj
      set the label of obj as "noisy"
      for each o ∈ {keys of SNG}
        if o has been inspected
          continue
        dis ← L1 distance between o and obj
        if dis less than density radius of o
          mark obj with same label as o
          break
        mark o as inspected
        jump ← dis - density radius of o
        i ← BINARYSEARCH(SNG[o], jump)
        for each neighbor ∈ SNG[o] with index greater than i
          if density radius of neighbor is less than jump
            mark neighbor as inspected
          else
            break    /* SNG[o] is a sorted list */
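The following sketch (ours) implements the basic assignment rule described above using L1 distances, without the SNG pruning; labeled_data, labels and cluster_radius are hypothetical inputs assumed to come from the clustering step.

    import numpy as np

    def assign(unlabeled, labeled_data, labels, cluster_radius, noise=-1):
        # For each unlabeled instance: find its closest labeled instance (L1 distance);
        # inherit that label if the distance is below the density radius of the
        # corresponding cluster, otherwise mark the instance as noise.
        out = np.full(len(unlabeled), noise, dtype=int)
        for idx, u in enumerate(unlabeled):
            d = np.abs(labeled_data - u).sum(axis=1)   # L1 distances to all labeled points
            j = int(np.argmin(d))
            if d[j] < cluster_radius[labels[j]]:
                out[idx] = labels[j]
        return out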
1.3 Evaluation Report
This report includes detailed and complementary information for the YADING paper, "YADING: Fast Clustering of Large-Scale Time Series Data". For easier illustration, the three research questions mentioned in the paper are listed as follows:
RQ1. How efficiently can YADING cluster time series data?
RQ2. How does sample size affect the clustering accuracy?
RQ3. How robust is YADING to time series variation?

1.3.1 Models for Generating Simulation Data
We use five different underlying stochastic models to generate simulated time series data. Below are the detailed descriptions of these models and their parameters. Together, these models cover a wide range of time series characteristics.

AR(1) Model [66]. $x_t = \beta x_{t-1} + \alpha + e$.
The AR(1) model is a simplified version of the general ARMA model. Here $\alpha$, $\beta$ and $\sigma$ are parameters; $|\beta| < 1$ ensures that this is a stationary process; $\alpha$ and $\beta$ decide the asymptotic converged value of $x_t$; and $e \sim N(0, \sigma)$ is white noise.

Forced Oscillation [67].
$$ x_t = \frac{1}{m}\cdot\frac{[A\cos(\gamma t + \beta) + e] + 2x_{t-1} - x_{t-2}}{1 + r^2}. $$
Forced oscillation is used to model cyclical time series. Here $\gamma$ is the circular frequency of the external "force", $\beta$ is the initial phase, $m$ and $r$ are intrinsic properties of the studied time series, and $e \sim N(0, \sigma)$ is white noise.

Drift [68]. This model leverages the formula of the AR(1) model but makes $\beta \ge 1$; $\beta - 1$ then acts as the drift coefficient. To avoid the divergence of $x_t$, we set a threshold $c$, so that $x_t$ is reset to the initial value after $c$ steps.

Peak [69]. This model leverages the forced oscillation model but sets the force to $F(t) = A\cos(\gamma t + \beta) + e + c\,\delta(t - t_s)$, where $\delta(x) = 1$ if $x = 0$ and $0$ otherwise. This type of time series is used to mimic spikes or transient anomalies.

Random Walk [70]. $x_t = x_{t-1} + e$.
We use the random walk to represent noise time series, since the random walk is the accumulation of white noise and contains no extra information.

Using the aforementioned stochastic models, we generate templates for creating the simulation datasets used in the subsequent experiments.
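A small sketch (ours) of how such series can be generated; only the unambiguous models (AR(1), drift with periodic reset, and random walk) are shown, and the reset-every-c-steps behavior of the drift model is our reading of the description above. Sigma is used as the scale of the Gaussian noise.

    import numpy as np

    def ar1(length, alpha, beta, sigma, x0=0.0, rng=None):
        # AR(1): x_t = beta * x_{t-1} + alpha + e,  e ~ N(0, sigma)
        rng = rng or np.random.default_rng()
        x = np.empty(length); prev = x0
        for t in range(length):
            prev = beta * prev + alpha + rng.normal(0.0, sigma)
            x[t] = prev
        return x

    def drift(length, alpha, beta, sigma, c, x0=0.0, rng=None):
        # Drift: same recurrence with beta >= 1; x_t is reset to the initial
        # value every c steps to avoid divergence.
        rng = rng or np.random.default_rng()
        x = np.empty(length); prev = x0
        for t in range(length):
            if t > 0 and t % c == 0:
                prev = x0
            prev = beta * prev + alpha + rng.normal(0.0, sigma)
            x[t] = prev
        return x

    def random_walk(length, sigma=1.0, rng=None):
        # Random walk: x_t = x_{t-1} + e, i.e. the cumulative sum of white noise.
        rng = rng or np.random.default_rng()
        return np.cumsum(rng.normal(0.0, sigma, length))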
1.3.2 Template Details
According to our evaluation design, we use one template, named TemplateA, for RQ1, RQ2 and partially RQ3 (robustness to random noise), and we use TemplateB to mimic the phase perturbation phenomenon.

Specifically, TemplateA consists of 15 groups of time series, where each group corresponds to a specific label assigned to every object in that group. The group sizes vary significantly, represented by population ratios ranging from 0.1% to 30%. The mapping between GroupID and model type is random, and the parameters of the models are also arbitrarily chosen. Table 5 lists the information of each group in the template. Based on this template, the size N of each simulation dataset and the length D of each time series instance are set to meet the requirements of each experiment.

Table 5. Details of TemplateA
GroupID | Ratio | Model              | Params
1       | 30%   | AR(1)              | α = 10, β = 0.5, σ = 10
2       | 20%   | Forced Oscillation | m = 1, r = 2, A = 30, γ = 0.1, β = 0, σ = 10
3       | 13%   | Drifting           | α = 10, β = 1, σ = 10, c = 30
4       | 9.0%  | Drifting           | α = 20, β = 1, σ = 5, c = 30
5       | 6.0%  | AR(1)              | α = 20, β = 0.5, σ = 1
6       | 4.0%  | Forced Oscillation | m = 1, r = 2, A = 30, γ = 1, β = 0, σ = 10
7       | 2.8%  | Peak               | m = 1, r = 2, A = 10, γ = 1, β = 0, σ = 10, c = 1000
8       | 1.8%  | Forced Oscillation | m = 1, r = 2, A = 10, γ = 1, β = 0, σ = 10
9       | 1.2%  | Peak               | m = 1, r = 2, A = 100, γ = 1, β = 0, σ = 10, c = 1000
10      | 0.8%  | AR(1)              | α = 30, β = 0.5, σ = 1
11      | 0.5%  | Forced Oscillation | m = 1, r = 2, A = 20, γ = 1, β = 0, σ = 10
12      | 0.4%  | Drifting           | α = 5, β = 1, σ = 5, c = 30
13      | 0.2%  | AR(1)              | α = 20, β = 0.5, σ = 10
14      | 0.2%  | Forced Oscillation | m = 1, r = 2, A = 20, γ = 0.1, β = 0, σ = 10
15      | 0.1%  | Drifting           | α = 5, β = 1, σ = 10, c = 30

According to the settings in Table 5, a dataset is generated once the data size and dimensionality are specified.

Table 6. Details of TemplateB
GroupID | Ratio | Model              | Params
1       | 31%   | Forced Oscillation | m = 1, r = 2, A = 32, γ = 0.08, σ = 5, β ∈ [0.0, 2π/3]
2       | 20%   | AR(1)              | α = 0, β = −0.5, σ = 10
3       | 14%   | Forced Oscillation | m = 1, r = 2, A = 64, γ = 0.1, σ = 20, β ∈ [0.0, π/3]
4       | 9.0%  | Drifting           | α = 20, β = 0.8, σ = 8, c = 8
5       | 6.0%  | AR(1)              | α = 25, β = 0.5, σ = 1
6       | 6.0%  | Drifting           | α = 14, β = 0.85, σ = 8, c = 32
7       | 2.8%  | Peak               | m = 1, r = 2, A = 8, γ = 1, β = 3, σ = 10, c = 400
8       | 1.2%  | Peak               | m = 1, r = 2, A = 8, γ = 1, β = 3, σ = 10, c = 400

We use TemplateB to mimic the phase perturbation phenomenon. As illustrated in Table 6, we set phase perturbation for the Forced Oscillation models: the initial phase β is drawn from a uniform distribution on the given interval, e.g., β ∈ [0.0, 2π/3] for Group 1 and β ∈ [0.0, π/3] for Group 3.
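For illustration, a sketch (ours) of how a dataset could be instantiated from such a template: group sizes are derived from the population ratios, and each group's series are produced by a generator function such as the hypothetical ar1/drift helpers sketched earlier.

    def build_dataset(template, N, D):
        # template: list of (ratio, generator, params) tuples taken from Table 5;
        # N: dataset size, D: length of each time series instance.
        data, labels = [], []
        for group_id, (ratio, generator, params) in enumerate(template, start=1):
            size = max(1, round(ratio * N))          # group size from its population ratio
            for _ in range(size):
                data.append(generator(D, **params))
                labels.append(group_id)
        return data, labels

    # Example (using the hypothetical ar1/drift generators from the previous sketch):
    # template = [(0.30, ar1,   dict(alpha=10, beta=0.5, sigma=10)),
    #             (0.13, drift, dict(alpha=10, beta=1.0, sigma=10, c=30))]
    # data, labels = build_dataset(template, N=10000, D=1000)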
2. REFERENCES
[1] Debregeas, A., and Hebrail, G. 1998. Interactive interpretation of Kohonen maps applied to curves. In Proc. of KDD'98, pp. 179-183.
[2] Derrick, K., Bill, K., and Vamsi, C. 2012. Large scale/big data federation & virtualization: a case study. http://rhsummit.files.wordpress.com/2012/03/kittler_large_scale_big_data.pdf.
[3] D. A. Patterson. 2002. A simple way to estimate the cost of downtime. In Proc. of LISA'02, pp. 185-188.
[4] Eamonn, K., and Shruti, K. 2002. On the need for time series data mining benchmarks: a survey and empirical demonstration. In Proc. of KDD'02, July 23-26.
[5] T. W. Liao. 2005. Clustering of time series data—A survey. Pattern Recognit., vol. 38, no. 11, pp. 1857-1874, Nov.
[6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. 1994. Fast subsequence matching in time series databases. In Proc. of the ACM SIGMOD Conf., May.
[7] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI. Mag. Resonance Med. 40, 249-260.
[8] D. Rafiei, and A. Mendelzon. 1997. Similarity-based queries for time series data. In Proc. of the ACM SIGMOD Conf., Tucson, AZ, May.
[9] B. K. Yi, H. V. Jagadish, and C. Faloutsos. 1998. Efficient retrieval of similar time sequences under time warping. In IEEE Proc. of ICDE, Feb.
[10] R. Agrawal, K. L. Lin, H. S. Sawhney, and K. Shim. 1995. Fast similarity search in the presence of noise, scaling, and translation in time series databases. In Proc. of the VLDB Conf., Zurich, Switzerland.
[11] J. Han, M. Kamber. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, pp. 346-389.
[12] Chu, K. & Wong, M. 1999. Fast time-series searching with scaling and shifting. In Proc. of PODS, pp. 237-248.
[13] Faloutsos, C., Jagadish, H., Mendelzon, A. & Milo, T. 1997. A signature technique for similarity-based queries. In Proc. of the ICCCS.
[14] Chan, K. & Fu, A. W. 1999. Efficient time series matching by wavelets. In Proc. of ICDE, pp. 126-133.
[15] Popivanov, I. & Miller, R. J. 2002. Similarity search over time series data using wavelets. In Proc. of ICDE, pp. 212-221.
[16] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. 2001. Locally adaptive dimensionality reduction for indexing large time series databases. In Proc. of ACM SIGMOD, pp. 151-162.
[17] Korn, F., Jagadish, H. & Faloutsos, C. 1997. Efficiently supporting ad hoc queries in large datasets of time sequences. In Proc. of ACM SIGMOD, pp. 289-300.
[18] Yi, B. & Faloutsos, C. 2000. Fast time sequence indexing for arbitrary Lp norms. In Proc. of VLDB, pp. 385-394.
[19] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. 2001. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst., 3(3).
[20] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5).
[21] Zhou, S., Zhou, A., Cao, J., Wen, J., Fan, Y., Hu, Y. 2000. Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In Proc. of PAKDD, pp. 169-172.
[22] Stuart, Alan. 1962. Basic Ideas of Scientific Sampling. Hafner Publishing Company, New York.
[23] M. Ester, H. P. Kriegel, and X. Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of 2nd ACM SIGKDD, pp. 226-231.
[24] "Amazon's S3 cloud service turns into a puff of smoke". 2008. In InformationWeek NewsFilter, Aug.
[25] J. N. Hoover. 2008. "Outages force cloud computing users to rethink tactics". In InformationWeek, Aug. 16.
[26] M. Steinbach, L. Ertoz, and V. Kumar. 2003. Challenges of clustering high dimensional data. In L. T. Wille, editor, New Vistas in Statistical Physics – Applications in Econophysics, Bioinformatics, and Pattern Recognition. Springer-Verlag.
[27] N. D. Sidiropoulos and R. Bros. 1999. Mathematical Programming Algorithms for Regression-based Non-linear Filtering in R^N. IEEE Trans. on Signal Processing, Mar.
[28] J. L. Rodgers and W. A. Nicewander. 1988. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1): 59-66, February.
[29] Al-Naymat, G., Chawla, S., & Taheri, J. 2012. SparseDTW: A Novel Approach to Speed up Dynamic Time Warping.
[30] M. Kumar, N. R. Patel, J. Woo. 2002. Clustering seasonality patterns in the presence of errors. In Proc. of KDD'02, Edmonton, Alberta, Canada.
[31] Kullback, S.; Leibler, R. A. 1951. "On Information and Sufficiency". Annals of Mathematical Statistics 22(1): 79-86. doi:10.1214/aoms/1177729694. MR 39968.
[32] Ng, R. T., and Han, J. 1994. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proc. of VLDB, pp. 144-155.
[33] S. Guha, R. Rastogi, K. Shim. 1998. CURE: an efficient clustering algorithm for large databases. In Proc. of SIGMOD, pp. 73-84.
[34] García, J. A., Fdez-Valdivia, J., Cortijo, F. J., and Molina, R. 1994. A Dynamic Approach for Clustering Data. Signal Processing, Vol. 44, No. 2, pp. 181-196.
[35] Jianbo Shi and Jitendra Malik. 2000. Normalized Cuts and Image Segmentation. IEEE Transactions on PAMI, Vol. 22, No. 8, Aug.
[36] W. Wang, J. Yang, R. Muntz. 1997. STING: a statistical information grid approach to spatial data mining. In Proc. of VLDB'97, Athens, Greece, pp. 186-195.
[37] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur Zimek. 2011. Density-based Clustering. WIREs Data Mining and Knowledge Discovery 1(3): 231-240. doi:10.1002/widm.30.
[38] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander. 1999. OPTICS: Ordering Points To Identify the Clustering Structure. In Proc. of ACM SIGMOD, pp. 49-60.
[39] Achtert, E.; Böhm, C.; Kröger, P. 2006. "DeLi-Clu: Boosting Robustness, Completeness, Usability, and Efficiency of Hierarchical Clustering by a Closest Pair Ranking". LNCS: Advances in Knowledge Discovery and Data Mining. Lecture Notes in Computer Science 3918: 119-128.
[40] Liu, P., Zhou, D., Wu, N. J. 2007. VDBSCAN: varied density based spatial clustering of applications with noise. In Proc. of ICSSSM, pp. 1-4.
[41] Tao Pei, Ajay Jasra, David J. Hand, A. X. Zhu, C. Zhou. 2009. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min. Knowl. Disc.
[42] P. Cheeseman, J. Stutz. 1996. Bayesian classification (AutoClass): theory and results. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press.
[43] T. Kohonen. 1990. The self-organizing map. Proc. IEEE 78(9): 1464-1480.
[44] C. Guo, H. Li, and D. Pan. 2010. An improved piecewise aggregate approximation based on statistical features for time series mining. In Proc. of KSEM'10, pp. 234-244.
[45] Box, Hunter and Hunter. 1978. Statistics for Experimenters. Wiley, p. 130.
[46] Krantz, Steven; Parks, Harold R. 2002. A Primer of Real Analytic Functions (2nd ed.). Birkhäuser.
[47] J. Bentley. 1986. Programming Pearls. Addison-Wesley, Reading, MA.
[48] Box, G. E. P.; Jenkins, G. M.; Reinsel, G. C. 1994. Time Series Analysis: Forecasting and Control (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.
[49] Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. 1990. "The R*-tree: an efficient and robust access method for points and rectangles". In Proc. of SIGMOD, p. 322.
[50] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI. Mag. Resonance Med. 40, 249-260.
[51] Y. Kakizawa, R. H. Shumway, N. Taniguchi. 1998. Discrimination and clustering for multivariate time series. J. Amer. Stat. Assoc. 93(441): 328-340.
[52] M. Kumar, N. R. Patel, J. Woo. 2002. Clustering seasonality patterns in the presence of errors. In Proc. of KDD'02.
[53] R. H. Shumway. 2003. Time-frequency clustering and discriminant analysis. Stat. Probab. Lett. 63: 307-314.
[54] J. J. van Wijk, E. R. van Selow. 1999. Cluster and calendar based visualization of time series data. In Proc. of SOIV.
[55] T. W. Liao, B. Bolt, J. Forester, E. Hailman, C. Hansen, R. C. Kaste, J. O'May. 2002. Understanding and projecting the battle state. 23rd Army Science Conference, Orlando, FL, December 2-5.
[56] S. Policker, A. B. Geva. 2000. Nonstationary time series analysis by temporal clustering. IEEE Trans. Syst. Man Cybernet. B: Cybernet. 30(2): 339-343.
[57] T.-C. Fu, F.-L. Chung, V. Ng, R. Luk. 2001. Pattern discovery from stock time series using self-organizing maps. KDD Workshop on Temporal Data Mining, pp. 27-37.
[58] D. Piccolo. 1990. A distance measure for classifying ARMA models. J. Time Ser. Anal. 11(2): 153-163.
[59] J. Beran, G. Mazzola. 1999. Visualizing the relationship between time series by hierarchical smoothing models. J. Comput. Graph. Stat. 8(2): 213-238.
[60] M. Ramoni, P. Sebastiani, P. Cohen. 2002. Bayesian clustering by dynamics. Mach. Learning 47(1): 91-121.
[61] M. Ramoni, P. Sebastiani, P. Cohen. 2000. Multivariate clustering by dynamics. In Proc. of AAAI-2000, pp. 633-638.
[62] K. Kalpakis, D. Gada, V. Puttagunta. Distance measures for effective clustering of ARIMA time-series. In Proc. of ICDM, pp. 273-280.
[63] D. Tran, M. Wagner. 2002. Fuzzy c-means clustering-based speaker verification. In: N. R. Pal, M. Sugeno (Eds.), AFSS 2002, Lecture Notes in Artificial Intelligence 2275, pp. 318-324.
[64] T. Oates, L. Firoiu, P. R. Cohen. 1999. Clustering time series with hidden Markov models and dynamic time warping. In Proc. of IJCAI-99.
[65] Danon, L., Díaz-Guilera, A., Duch, J. and Arenas, A. 2005. J. Stat. Mech. P09008.
[66] http://en.wikipedia.org/wiki/Autoregressive_model
[67] http://en.wikipedia.org/wiki/Harmonic_oscillator
[68] http://en.wikipedia.org/wiki/Stochastic_drift
[69] http://en.wikipedia.org/wiki/Pulse_(signal_processing)
[70] http://en.wikipedia.org/wiki/Random_walk
[71] http://en.wikipedia.org/wiki/Principal_component_analysis
[72] http://research.microsoft.com/en-us/people/juding/yadingdoc.pdf
[73] Oppenheim, Alan V., Ronald W. Schafer, John R. Buck. 1999. Discrete-Time Signal Processing (2nd ed.). Prentice Hall. ISBN 0-13-754920-2.
[74] http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem