Technical Document of YADING
Fast and Automatic Clustering of Large-Scale Time Series Data
YADING project
Members: Rui Ding, Qiang Wang, Yingnong Dang, Qiang Fu, Haidong Zhang, Dongmei Zhang
Software Analytics Group
Microsoft Research
December, 2013
1. DOCUMENT
This appendix includes the detailed proofs of several statements, as well as other useful supporting information.
๐‘ƒ(๐‘›๐‘–′ ≥ ๐‘š) > 1 − ๐›ผ ↔ ๐‘ƒ (๐‘ ≥
−๐‘ง๐›ผ & ๐‘š ≤ ๐‘ ๐‘๐‘– .
๐œŽ
2
(๐‘๐‘– −
1.1.1 Sample Size Determination
Sampling is the most effective mechanism to handle the scale of
the input dataset. Since we want to achieve high performance and
we do not assume any distribution of the input dataset, we choose
random sampling [47] as our sampling algorithm.
In practice, a predefined sampling rate is often used to determine the size $s$ of the sampled dataset. As $N$, the size of the input dataset, keeps increasing, $s$ increases accordingly, which results in slow clustering performance on the sampled dataset. Furthermore, it is unclear what impact the increased number of samples has on the clustering accuracy. We derive the following theoretical bounds to guide the selection of $s$.
Assume that the ground truth of clustering is known for $\mathcal{T}_{N \times D}$, i.e., all the $T_i \in \mathcal{T}_{N \times D}$ belong to $k$ known groups, and $n_i$ represents the number of time series in the $i$-th group. Let $p_i = \frac{n_i}{N}$ denote the population ratio of group $i$. Similarly, let $p_i' = \frac{n_i'}{s}$ denote the population ratio of the $i$-th group in the sampled dataset. $|p_i - p_i'|$ reflects the ratio deviation between the input dataset $\mathcal{T}_{N \times D}$ and the sampled dataset $\mathcal{T}_{s \times d}$. We formalize the selection of the sample size $s$ as finding the lower bound $s_l$ and upper bound $s_u$ such that, given a tolerance $\epsilon$ and a confidence level $1 - \alpha$, (1) a group $i$ with $p_i$ less than $\epsilon$ is not guaranteed to have sufficient instances in the sampled dataset for $s < s_l$, and (2) the maximum ratio deviation $|p_i - p_i'|$, $1 \le i \le k$, is within the given tolerance for $s \ge s_u$. Intuitively, the lower bound constrains the smallest size of clusters that can possibly be found, and the upper bound indicates that once the sample size exceeds a threshold, more samples will not change the clustering result.
Lemma 1 (lower bound): Given $m$, the least number of instances present in the sampled dataset for group $i$, a tolerance $\epsilon$, and the confidence level $1 - \alpha$, the sample size
$$s \ge \frac{m + z_\alpha\left(\frac{z_\alpha}{2} + \sqrt{m + \frac{z_\alpha^2}{4}}\right)}{p_i}$$
satisfies $P(n_i' \ge m) > 1 - \alpha$. Here, $z_\alpha$ is a function of $\alpha$: $P(Z > z_\alpha) = \alpha$, where $Z \sim N(0, 1)$.
With confidence level $1 - \alpha$, Lemma 1 provides the lower bound on the sample size $s$ that guarantees $m$ instances in the sampled dataset for any cluster with population ratio higher than $\epsilon$. For example, if a cluster has $p_i > 1\%$ and we set $m = 5$ with confidence 95% (i.e., $1 - \alpha = 0.95$), then we get $s_l \ge 1{,}030$. In this case, when $s < 1{,}030$, the clusters with $p_i < 1\%$ have a perceptible probability (>5%) of being missed in the sampled dataset.
It should be noted that the selection of $m$ is related to the clustering method applied to the sampled dataset. For example, DBSCAN is a density-based method, and it typically requires 4 nearest neighbors of a specific object to identify a cluster; thus, any cluster with size less than 5 is difficult to find. This consideration of the clustering method also supports our formalization for deciding $s_l$.
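The bound in Lemma 1 is straightforward to evaluate numerically. Below is a minimal sketch (an illustration, not part of the original implementation; it assumes SciPy is available for the normal quantile) that reproduces the example above.

```python
# Sketch: evaluate the Lemma 1 lower bound
#   s_l = (m + z_a * (z_a/2 + sqrt(m + z_a^2/4))) / p_i,
# where z_a satisfies P(Z > z_a) = alpha for Z ~ N(0, 1).
from math import sqrt
from scipy.stats import norm

def lower_bound(m: int, alpha: float, p_i: float) -> float:
    z = norm.ppf(1 - alpha)  # z_alpha
    return (m + z * (z / 2 + sqrt(m + z * z / 4))) / p_i

print(lower_bound(m=5, alpha=0.05, p_i=0.01))  # ~1027, i.e. s_l ≈ 1,030
```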
๐‘š−๐‘ ๐‘
๐‘š−๐‘ ๐‘
๐‘–
๐‘–
Proof: Event {๐‘›๐‘–′ ≥ ๐‘š} ↔ { ๐‘– ๐‘– ≥
} ↔ {๐‘ ≥
}, the
๐œŽ
๐œŽ
๐œŽ
′
last statement holds since ๐‘›๐‘– ~๐‘(๐‘ ๐‘๐‘– , ๐‘ ๐‘๐‘– (1 − ๐‘๐‘– )), and here ๐œŽ =
√๐‘ ๐‘๐‘– (1 − ๐‘๐‘– ). So
)>1−๐›ผ ↔
๐‘š−๐‘ ๐‘๐‘–
๐œŽ
≤
The last inequality can be transformed to
1.1 Lemmas and Proofs
๐‘›′ −๐‘ ๐‘
๐‘š−๐‘ ๐‘๐‘–
2
๐‘š + ๐‘ง๐›ผ2 /2
๐‘š + ๐‘ง๐›ผ2 /2
๐‘š2
, ๐‘š ≤ ๐‘ ๐‘๐‘–
2 ) ≥(
2 ) −
๐‘  + ๐‘ง๐›ผ
๐‘  + ๐‘ง๐›ผ
๐‘ (๐‘  + ๐‘ง๐›ผ2 )
Consider ๐‘ง๐›ผ usually valued in [0, 3], so ๐‘  โ‰ซ ๐‘ง๐›ผ2 , so ๐‘  + ๐‘ง๐›ผ2 ≈ ๐‘ ,
apply this to the inequality above, we get a simplified version ๐‘  ≥
๐‘š+๐‘ง๐›ผ (
๐‘ง๐›ผ
๐‘ง 2
+√๐‘š+ ๐›ผ )
2
4
, hence the lemma is proven.
๐‘๐‘–
Lemma 2 (upper bound): Given a tolerance $\epsilon$ and the confidence level $1 - \alpha$, the sample size $s \ge \frac{z_{\alpha/2}^2}{4\epsilon^2}$ satisfies $P\left(\max_i |p_i - p_i'| < \epsilon\right) > 1 - \alpha$. Here, $z_{\alpha/2}$ is a function of $\alpha$: $P(Z > z_{\alpha/2}) = \alpha/2$, where $Z \sim N(0, 1)$.
Lemma 2 implies that the sample size $s$ depends only on the tolerance $\epsilon$ and the confidence level $1 - \alpha$; it is independent of the input data size. For example, if we set $\epsilon = 0.01$ and $1 - \alpha = 0.95$, which means that for any group the difference between its population ratios in the input dataset and in the sampled dataset is less than 0.01, then the lowest sample size that guarantees this setting is $s \ge \frac{z_{0.025}^2}{4 \times 0.01^2} \approx 9{,}600$. More than 9,600 samples are not necessary. This makes $s = 9{,}600$ the upper bound of the sample size. Moreover, this sample size does not change with the size of the input dataset.
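The upper bound is equally direct to compute; here is a minimal sketch (illustrative only, assuming SciPy) for the setting above.

```python
# Sketch: evaluate the Lemma 2 upper bound s_u = z_{alpha/2}^2 / (4 * eps^2).
from scipy.stats import norm

def upper_bound(eps: float, alpha: float) -> float:
    z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}; z_{0.025} ≈ 1.96
    return z ** 2 / (4 * eps ** 2)

print(upper_bound(eps=0.01, alpha=0.05))  # ~9604, i.e. the ≈9,600 above
```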
Proof: The probability that a sample belongs to the $i$-th group is $p_i$, due to the property of random sampling. Since the samples are independent, the number of instances belonging to the $i$-th group follows a Binomial distribution, $n_i' \sim B(s, p_i)$. $B(s, p_i) \approx N(sp_i,\, sp_i(1-p_i))$ when $s$ is large and the distribution is not too skewed [45], so in the following we assume $n_i' \sim N(sp_i,\, sp_i(1-p_i))$ by approximation. Then
$$p_i' = \frac{n_i'}{s} \sim N\!\left(p_i,\, \frac{p_i(1-p_i)}{s}\right) \;\Rightarrow\; Y = p_i' - p_i \sim N\!\left(0,\, \frac{p_i(1-p_i)}{s}\right).$$
Event $\{|p_i - p_i'| < \epsilon\} \leftrightarrow \{|Y| < \epsilon\} \leftrightarrow \left\{|Z| < \frac{\epsilon}{\sigma}\right\}$, where $\sigma = \sqrt{\frac{p_i(1-p_i)}{s}}$. So when $\frac{\epsilon}{\sigma} > z_{\alpha/2}$, $P\left(|Z| < \frac{\epsilon}{\sigma}\right) > 1 - \alpha$ is achieved. Expanding $\sigma$, the range of $s$ should be $s \ge \frac{p_i(1-p_i)}{\epsilon^2} z_{\alpha/2}^2$; since $p_i(1-p_i) \le \frac{1}{4}$, $s \ge \frac{z_{\alpha/2}^2}{4\epsilon^2}$ is a valid range for every group $i$, hence the lemma is proven.
Lemma 2 provides a loose upper bound, since we replace $p_i(1-p_i)$ with $\frac{1}{4}$ to bound all possible ratio values. So a sample size smaller than 9,600 may still yield reasonable results while increasing performance. In practice, we varied the sample size from 1,030 (the lower bound) to 10,000 on real datasets and tested the clustering results; we finally chose 2,000, since the accuracy of the clustering results is very close once the sample size exceeds 2,000.
1.1.2 Phase-Shift Overcoming
In this section, we investigate how the $L_1$ distance combined with density-based clustering overcomes the phase-shift problem. The first observation is that, when the phase shift is small enough, the $L_1$ distance can also be small enough; the second observation is that, when the data scale becomes large, the distance between a particular time series and its kNN ($k$-th nearest neighbor) can be short enough that all the time series become connected when applying DBSCAN.
Preliminaries: Denote a time series $T(a) = \{f(a+c), f(a+2c), f(a+3c), \ldots, f(a+mc)\}$, where $a$ is the initial phase, $c$ is the interval at which the time series is sampled, and $m$ is the length. The time series is generated by an underlying continuous model $f(t)$; here we assume $f(t)$ is an analytic function [46]. Another time series with phase shift $\delta$ is represented by $T(a-\delta) = \{f(a+c-\delta), f(a+2c-\delta), \ldots, f(a+mc-\delta)\}$.
Without noise, the time series instances are identical, which is trivial for any type of clustering method. When noise is incorporated, the resulting time series instances deviate from one another.
Lemma 1: $\exists M$ s.t. $L_1(T(a), T(a-\delta)) := \sum_{i=1}^{m} |f(a+ic) - f(a+ic-\delta)| \le mM\delta$.
Proof: According to Taylor's theorem for analytic functions,
$$f(x) = f(a) + f'(\theta)(x-a), \quad \text{where } \theta \in (a, x).$$
We immediately get
$$L_1(T(a), T(a-\delta)) = \sum_{i=1}^{m} |f(a+ic) - f(a+ic-\delta)| = \sum_{i=1}^{m} |f'(\theta_i)\,\delta| \le mM\delta,$$
where $M = \max_i |f'(\theta_i)|$, $\theta_i \in (a+ic-\delta,\, a+ic)$.
Now suppose we have $n$ time series $T(a-\delta_i)$ that differ from $T(a)$ only by a phase shift $\delta_i$. Without loss of generality, let $\delta_i \in [0, \Delta]$. We assume these time series are generated independently, with $\delta_i$ uniformly distributed in the interval $[0, \Delta]$. Denote the clustering parameter as $\varepsilon$ ($\varepsilon$ is a distance threshold: if the distance between a specific object and its kNN is smaller than $\varepsilon$, then it is a core point). Denote the event
$$E_n := \{T(a-\delta_i),\; i = 1, 2, \ldots, n \text{ belong to the same cluster}\}.$$
Lemma 2: $P(E_n) \ge 1 - n\left(1 - \frac{\varepsilon}{mMk\Delta}\right)^n$.
Proof: Divide the interval $[0, \Delta]$ into buckets of length $\frac{\varepsilon}{mMk}$. According to the mechanism of DBSCAN, if each bucket contains at least one time series, then $L_1(T(a), \text{its kNN}) \le k \times mM \times \frac{\varepsilon}{mMk} = \varepsilon$, so all the time series are core points and they are density-connected; hence all the time series are grouped into one cluster.
Denote the event $U_j := \{j\text{-th bucket is empty}\}$. Then
$$P(\text{at least one bucket is empty}) = P\Big(\bigcup_j U_j\Big) \le \sum_j P(U_j) = n\left(1 - \frac{\varepsilon}{mMk\Delta}\right)^n.$$
Note that the event $\{\text{no empty bucket}\}$ is a subset of $E_n$, so
$$P(E_n) \ge P(\text{no empty bucket}) = 1 - P\Big(\bigcup_j U_j\Big) \ge 1 - n\left(1 - \frac{\varepsilon}{mMk\Delta}\right)^n,$$
hence the lemma is proven.
Corollary: $\lim_{n \to \infty} P(E_n) = 1$. This is intuitive.
Description: Lemma 2 provides a probabilistic confidence bound on how likely the phase-shifted time series instances are to be grouped together. As $n$ goes to infinity, the confidence converges to 1.
1.1.3 Random Noise Overcoming
This section describes how density-based clustering overcomes the random noise that is encoded into the specific models generating particular time series. Here we discuss time series generated by a "precise model" plus "white noise", and we refer to the white noise as the "random noise". In general, such a way of describing time series is very common and natural.
To illustrate how density-based clustering overcomes the random noise, we give a theoretical analysis of time series generated by the AR(1) model, which is relatively simple, without loss of generality. Let $x_i$ be the value of the $i$-th epoch of a particular time series instance; AR(1) is represented by $x_i = a x_{i-1} + \mu$, where $|a| < 1$ to make it stable, and $\mu \sim N(0, \sigma^2)$ is the white noise. For a given time series $\vec{x} := \{x_1, x_2, \ldots, x_d\}^T$, let $x_1 = 0$ be the initial value.
Joint distribution of $\vec{x}$: Denote $f(x_1, x_2, \ldots, x_d)$ as the p.d.f. of a time series generated by the given AR(1) model. Then
$$f(x_0, x_1, \ldots, x_d) = f(x_d | x_0, x_1, \ldots, x_{d-1})\, f(x_{d-1} | x_0, \ldots, x_{d-2}) \cdots f(x_1 | x_0)\, P(x_0),$$
according to the identity transformation of probability, with $P(x_0 = 0) = 1$ and, by the Markov property,
$$f(x_i | x_0, \ldots, x_{i-1}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_i - a x_{i-1})^2}{2\sigma^2}\right].$$
So finally we get
$$f(\vec{x}) := f(x_0, x_1, \ldots, x_d) = \frac{1}{(2\pi\sigma^2)^{d/2}} \prod_{i=1}^{d} \exp\left[-\frac{(x_i - a x_{i-1})^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left[-\frac{1}{2}\,\frac{\vec{x}^T \Sigma^{-1} \vec{x}}{\sigma^2}\right].$$
Here,
$$\Sigma^{-1} = \begin{pmatrix} 1+a^2 & -a & 0 & \cdots & 0 \\ -a & 1+a^2 & -a & & \vdots \\ 0 & -a & \ddots & \ddots & 0 \\ \vdots & & \ddots & 1+a^2 & -a \\ 0 & \cdots & 0 & -a & 1 \end{pmatrix}.$$
Now let $N$ time series instances (each with length $d$) be generated independently from the discussed AR(1) model. For density-based clustering, denote the clustering parameter as $\varepsilon$ ($\varepsilon$ is a distance threshold: if the distance between a specific object and its kNN is smaller than $\varepsilon$, then it is a core point). Define $\rho = \frac{k}{c_d r^d}$, where $V_r := c_d \times r^d$ is the volume of the hyper-sphere (see Figure 2) with radius $r$ in the $d$-dimensional $L_p$ space; e.g., in Euclidean space, $c_d = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$.
Lemma 1: The ratio of the $N$ objects (these time series instances) that can be clustered together by applying density-based clustering is
$$N(\rho, N) \approx \int \cdots \int_{N f(\vec{x}) \ge \rho} f(\vec{x})\, dx_1\, dx_2 \cdots dx_d.$$
The proof is straightforward. Since $f(\vec{x})$ is the p.d.f. of $\vec{x}$, the local density at a given position $\vec{x}$ is approximately $N f(\vec{x})$. When the local density is greater than $\rho$, the points in the neighborhood of $\vec{x}$ can be identified as core points, because $f(\vec{x})$ is a smooth function. By considering the convexity of $f(\vec{x})$, all the regions with density greater than $\rho$ can be grouped together.
Corollary: $\lim_{N \to \infty} N(\rho, N) = 1$. The local density can be made arbitrarily large once the number of objects increases to infinity.
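As a sanity check on the joint density above, the following sketch (our illustration, not from the paper) builds the tridiagonal precision matrix $\Sigma^{-1}$ and verifies that the quadratic form $\vec{x}^T \Sigma^{-1} \vec{x}$ equals $\sum_i (x_i - a x_{i-1})^2$ with $x_0 = 0$.

```python
# Verify: x^T * Sigma^{-1} * x == sum_i (x_i - a*x_{i-1})^2 with x_0 = 0,
# where Sigma^{-1} has 1+a^2 on the diagonal (1 in the last entry) and -a
# on the two off-diagonals.
import numpy as np

def ar1_precision(d: int, a: float) -> np.ndarray:
    P = np.diag([1 + a ** 2] * (d - 1) + [1.0])
    P += np.diag([-a] * (d - 1), k=1) + np.diag([-a] * (d - 1), k=-1)
    return P

rng = np.random.default_rng(1)
d, a = 8, 0.6
x = rng.normal(size=d)
prev = np.concatenate([[0.0], x[:-1]])  # shifted series, with x_0 = 0
lhs = x @ ar1_precision(d, a) @ x
rhs = np.sum((x - a * prev) ** 2)
print(np.allclose(lhs, rhs))  # True
```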
1.1.4 Multi-Density Estimation
Density estimation is key to density-based clustering algorithms. It is performed manually or with slow performance in most of the existing algorithms [23][38][39][40][41]. In this section, we define the concept of a density radius and provide theoretical proof of its estimation. We use the density radius in YADING to identify the core points of the input dataset and conduct multi-density clustering accordingly.
The notations used in this section are summarized as follows:
$N$: total number of objects (same as points).
$d$: dimensionality of each object.
$k_{dis}$: the distance between an object and its kNN.
$k_{dis}$ curve: the $k_{dis}$ values of all objects, sorted in descending order.
$EDF_k(r) := \frac{|\{\text{objects whose } k_{dis} \le r\}|}{N}$: the empirical distribution function of $k_{dis}$.
$V_r := c_d \times r^d$: the volume of a hyper-sphere with radius $r$ in the $d$-dimensional $L_p$ space; e.g., in Euclidean space, $c_d = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$.
Density radius: the most frequent $k_{dis}$ value on the $k_{dis}$ curve.
We define the $k_{dis}$ of an object as the distance between this object and its kNN. A $k_{dis}$ curve is the list of $k_{dis}$ values in descending order; Figure 1 shows an example of a $k_{dis}$ curve with $k = 4$. We define the density radius as the most frequent $k_{dis}$ value. Intuitively, most objects contain exactly $k$ nearest neighbors in a hyper-sphere whose radius equals the density radius.
We transform the estimation of the density radius into identifying the inflection points on the $k_{dis}$ curve. Here, an inflection point takes the general definition: a point whose second derivative equals zero. Next, we provide the intuition behind this transformation, followed by theoretical proof.
Intuitively, the local area of an inflection point on the $k_{dis}$ curve is the flattest (i.e., the slopes on its left-hand and right-hand sides have the smallest difference), so the points in the neighborhood of an inflection point have close values of $k_{dis}$. For example, in Figure 1 there are three potential inflection points, with corresponding $k_{dis}$ values equal to 1,500, 500, and 200. In other words, most points on this curve have $k_{dis}$ values close to 1,500, 500, or 200. According to the definition of the density radius, these three values can be used to approximate three density radiuses.
We now provide theoretical support for estimating the density radius by identifying the inflection points on the $k_{dis}$ curve. We first prove that the Y-value, $k_{dis}$, of each inflection point on the $k_{dis}$ curve equals one unique density radius. Specifically, given a dataset with a single density, we derive an analytical form of its $k_{dis}$ curve and prove that there exists a unique inflection point whose Y-value equals the density radius of the dataset. We then generalize the estimation to datasets with multiple densities.
To make the mathematical deduction easier, we use $EDF_k(r)$ to represent the $k_{dis}$ curve equivalently: it is the $k_{dis}$ curve rotated 90 degrees clockwise, with a normalized Y-axis. The X-value of an inflection point on $EDF_k(r)$ equals the Y-value of the corresponding inflection point on the $k_{dis}$ curve.
Problem 1 (analytical expression of $EDF_k(r)$): Suppose $N$ objects are sampled independently from a uniform distribution defined on a region with volume $V$ (see Figure 2), so the density is $\rho = \frac{N}{V}$. Let $V_r$ be an arbitrary hyper-sphere region with radius $r$, centered at a point $O$ where a particular object is located. Define the event $E_{m,r} := \{\text{only } m \text{ points inside } V_r,\, m \ge 1\}$.
Lemma 1: $P(E_{m,r}) = C_{N-1}^{m-1} P_r^{m-1} (1 - P_r)^{N-m}$, where $P_r = \frac{V_r}{V} = \frac{c_d \times r^d}{V}$.
Proof: Since there is already one object inside the sphere (located at its center), $E_{m,r}$ requires $m-1$ extra objects inside the sphere and $N-m$ objects outside. $P_r$ is the probability that one object is sampled inside the sphere, so the lemma follows directly from the Binomial distribution.
Lemma 2: $P(k_{dis} \le r) = \sum_{m=k+1}^{N} P(E_{m,r})$.
Proof: If $k_{dis} \le r$, then the kNN of the object is inside the hyper-sphere, so there are at least $k+1$ objects inside the hyper-sphere; and vice versa.
Corollary 1: $EDF_k(r) \approx P(k_{dis} \le r)$.
It should be pointed out that, although the $k_{dis}$ values of the objects share the same distribution, they are not totally independent, so we cannot directly use $EDF_k(r)$ to approximate $P(k_{dis} \le r)$. We assume this is a good approximation, however, which is also evidenced by simulation experiments.
Figure 1. 4-dis curve of a time series dataset, with three potential inflection points (around 1,500, 500, and 200).
Figure 2. One population generated by a uniform distribution.
Corollary 2: $EDF_1(r) \approx 1 - e^{-\rho c_d r^d}$.
1๐‘‘๐‘–๐‘  has the simplest version. According to Lemma 2 and 3,
๐‘−1
๐ธ๐ท๐น1 (๐‘Ÿ) = 1 − (1 − ๐‘ƒ๐‘Ÿ )๐‘−1 = 1 − (1 −
๐‘
≈ 1 − (1 −
๐œŒ๐‘๐‘‘ ๐‘Ÿ ๐‘‘
)
๐‘
๐œŒ๐‘๐‘‘ ๐‘Ÿ ๐‘‘
๐‘‘
) ≈ 1 − ๐‘’ −๐œŒ๐‘๐‘‘ ๐‘Ÿ
๐‘
The last two approximations make sense when ๐‘ is large.
Lemma 3 (existence and uniqueness): There exists one and only one inflection point on $EDF_k(r)$, and the Y-value of the corresponding inflection point on the $k_{dis}$ curve is the density radius.
Proof: Denote $r_i$ as the X-value of an inflection point of $EDF_k(r)$, so that $\frac{d^2 EDF_k(r)}{dr^2}\Big|_{r=r_i} = 0$. For $k = 1$, by Corollary 2,
$$\frac{d^2 EDF_1(r)}{dr^2} = \rho c_d d\, r^{d-2} e^{-\rho c_d r^d} \left(d - 1 - \rho c_d d\, r^d\right).$$
Since the first derivative of $EDF_k(r)$ is the probability density function of $k_{dis}$, the second derivative being zero means that $r = r_i$ is the point of maximum likelihood; in other words, $r_i$ is the most frequent value of $k_{dis}$, which is the definition of the density radius. Since the X-value of an inflection point on $EDF_k(r)$ equals the Y-value of the corresponding inflection point on the $k_{dis}$ curve, the lemma is proven.
Corollary 3: For $EDF_1(r)$, the inflection point is $r_i = \left(\frac{d-1}{d \rho c_d}\right)^{\frac{1}{d}}$.
Taking $\frac{d^2 EDF_k(r)}{dr^2}\Big|_{r=r_i} = 0$ with the formulation $EDF_1(r) \approx 1 - e^{-\rho c_d r^d}$, we get the expression of $r_i$.
Problem 2 (mixture of densities): Denote $\theta_i = \{\rho_i, d_i, n_i\}$ as the parameter vector of $EDF_{i,k}(r)$ for a particular population with a single density. Now suppose there are $h$ regions with different densities, located in space without overlap. We estimate the expression for the overall $EDF_k(r)$.
Lemma 4: $EDF_k(r) = \sum_{i=1}^{h} \frac{n_i\, EDF_{i,k}(r)}{N}$, where $N = \sum_{i=1}^{h} n_i$.
This follows easily from the definition of $EDF_k(r)$. The mixture model is just the linear combination of the individual $EDF_{i,k}(r)$; this enables model inference: given the overall $EDF_k(r)$, identify the underlying models, which can be represented by $\{\theta_1, \ldots, \theta_h\}$; from these, the multiple densities can be obtained.
Mixture model identification: Denote the expression of $EDF_k(r)$ as $EDF_k(r | \theta_1, \ldots, \theta_h)$. Given the $k_{dis}$ curve, which can be represented by a list of $\{y_i, r_i\}$, we formulate the problem as an optimization problem:
$$\arg\min \sum_i \left[y_i - EDF_k(r_i | \theta_1, \theta_2, \ldots, \theta_h)\right]^2$$
$$\text{subject to } \sum_{j=1}^{h} b_j = 1, \quad b_j \ge 0,\; \forall j.$$
Several well-developed techniques, such as the EM method, are suitable for solving such a problem.
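As a concrete (hypothetical) instance of this formulation, the sketch below fits a two-component mixture with $k = 1$ and known $d$ by constrained least squares; it illustrates the optimization problem, not the paper's actual solver, and a careful initialization may be needed in practice.

```python
# Fit EDF_1(r) = b1*(1 - exp(-rho1*c_d*r^d)) + (1-b1)*(1 - exp(-rho2*c_d*r^d))
# to an observed k-dis curve {(r_i, y_i)} by constrained least squares.
import numpy as np
from scipy.optimize import minimize

d, c_d = 2, np.pi  # assumed known dimensionality

def edf_mix(r, b1, rho1, rho2):
    return (b1 * (1 - np.exp(-rho1 * c_d * r ** d))
            + (1 - b1) * (1 - np.exp(-rho2 * c_d * r ** d)))

def fit(r_obs, y_obs):
    loss = lambda p: np.sum((y_obs - edf_mix(r_obs, *p)) ** 2)
    res = minimize(loss, x0=[0.5, 1.0, 10.0],
                   bounds=[(0.0, 1.0), (1e-6, None), (1e-6, None)])
    return res.x  # b1, rho1, rho2

# Synthetic check: the fit should recover the generating parameters.
r = np.linspace(0.01, 2.0, 200)
y = edf_mix(r, 0.3, 2.0, 40.0)
print(fit(r, y))
```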
The next lemma provides theoretical bounds showing that the inflection points on the $EDF_k(r)$ of mixed densities can approximate the inflection points of each single-density $EDF_{i,k}(r)$ when the densities are sufficiently different.
Without loss of generality, and for simplicity, we consider a mixture of two densities, set $k = 1$, and assume the intrinsic dimensionalities of the two density regions are equal. Denote $r_{i1}, r_{i2}$ as the X-values of the inflection points of the two density regions, with densities $\rho_1, \rho_2$ respectively. Denote $\varphi = \frac{\rho_1}{\rho_2}$. Denote $R := \left\{r_i \,\Big|\, \frac{d^2 EDF_1(r_i)}{dr^2} = 0\right\}$. Denote $b_1 = \frac{n_1}{N}$, $b_2 = \frac{n_2}{N}$.
∃๐‘Ÿ๐‘– ∈ ๐‘…, ๐‘ ๐‘œ ๐‘กโ„Ž๐‘Ž๐‘ก lim ๐‘Ÿ๐‘– = ๐‘Ÿ๐‘–1
๐œ‘→0
;
∃๐‘Ÿ๐‘— ∈
and
๐œ‘→∞
Proof: According to the form
Put the form of ๐‘Ÿ๐‘–1 = (
๐‘‘−1
1
๐‘‘
1
๐‘‘ ๐œŒ1 ๐‘๐‘‘
) into it, we get
1
๐‘‘ ๐ธ๐ท๐น1 (๐‘Ÿ๐‘–1 )
๐‘‘−1 2
=
(
) ๐œ‘(1 − ๐œ‘)๐‘’ −๐œ‘(1−๐‘‘ )
๐‘‘๐‘Ÿ 2
๐‘Ÿ๐‘–1
2
The lemma is proven not matter ๐œ‘ → 0 or ๐œ‘ → ∞.
This lemma indicates that, when the density difference is large enough, the inflection points obtained from the mixture $EDF_k(r)$ approximate the inflection points of each single-density region, which in turn represent the density radiuses.
Corollary 4: For high dimensionality $d \gg 1$, $\varphi = \frac{3 \pm \sqrt{5}}{2}$ is the worst density difference for this approximation.
Letting $1 - \frac{1}{d} \approx 1$, we get that when $\varphi = \frac{3 \pm \sqrt{5}}{2}$, $\frac{d^2 EDF_1(r_{i1})}{dr^2}$ reaches its maximum/minimum value, which is not equal to 0.
The corollary implies that, when the density difference is near these values, the mixture $EDF_k(r)$ can hardly be used for density radius identification.
1.2 Tables for Pseudo-code
1.2.1 Dimensionality Reduction
We adopt PAA for dimensionality reduction because of its computational efficiency and its capability of preserving the shape of time series. Denote a time series instance with length $D$ as $T_i := (t_{i1}, t_{i2}, \ldots, t_{iD})$. The transformation from $T_i$ to $T_i' := (\tau_{i1}, \tau_{i2}, \ldots, \tau_{id})$, where
$$\tau_{ij} = \frac{d}{D} \sum_{k=\frac{D}{d}(j-1)+1}^{\frac{D}{d} j} t_{ik},$$
is called PAA with frame length equal to $\frac{D}{d}$. PAA segments a time series instance into $d$ frames and uses one value (i.e., the mean value) to represent each frame, so as to reduce its length from $D$ to $d$.
๐‘
๐œƒ1 ~๐œƒโ„Ž
5:
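A minimal sketch of the PAA transformation above (our illustration; it assumes the frame length $D/d$ is an integer):

```python
# PAA: segment the series into d frames of length D/d and average each frame.
import numpy as np

def paa(series: np.ndarray, d: int) -> np.ndarray:
    D = series.shape[0]
    assert D % d == 0, "frame length D/d must be an integer"
    return series.reshape(d, D // d).mean(axis=1)

print(paa(np.arange(12.0), d=3))  # [1.5 5.5 9.5]
```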
One key issue in applying PAA is to specify a proper $d$ automatically. As proved by the Nyquist-Shannon sampling theorem, any time series without frequencies higher than B Hertz can be perfectly recovered from its sampled points with sampling rate 2B. This means that using 2B as the sampling rate preserves the shape of a frequency-bound time series. Although the time series under our study are often imperfectly frequency-bound signals, most of them can be approximated by frequency-bound signals, because their very high-frequency components usually correspond to noise. Therefore, we transform the problem of determining $d$ into estimating the upper bound of the frequencies.
In this paper, we propose a novel auto-correlation-based approach to identify the approximate frequency upper bound of all the input time series instances. The frame length $d$ is then easily determined as the inverse of the frequency upper bound.
In more detail, we first identify the typical frequency of each time series instance $T_i$ by locating the first local minimum on its auto-correlation curve $g_i$, defined as $g_i(y) = \sum_{j=1}^{D-y} t_{ij} t_{i(j+y)}$, where $y$ is the lag. If there is a local minimum of $g_i$ at a particular lag $y'$, then $y'$ relates to a typical half-period if $g_i(y') < 0$. In this case, we call $1/y'$ the typical frequency for $T_i$. The smaller $y'$ is, the higher the frequency it represents.
Then, we sort all the detected typical frequencies in ascending order and select the 80th percentile to approximate the frequency upper bound of all the time series instances. We do not use the exact maximum typical frequency in order to avoid the potential instability caused by a small amount of extraordinary noise in some time series instances.
Regarding implementation, the auto-correlation curves $g_i(y)$ can be obtained efficiently using the Fast Fourier transform: (1) $F_g(f) = FFT[T_i]$; (2) $S(f) = F_g(f) F_g^*(f)$; (3) $g_i(y) = IFFT[S(f)]$, where IFFT is the inverse Fast Fourier transform and the asterisk denotes complex conjugation.
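A short sketch of this FFT-based auto-correlation (our illustration; zero-padding avoids the circular wrap-around of the plain FFT):

```python
# Steps (1)-(3) above: FFT, power spectrum, inverse FFT.
import numpy as np

def autocorr(t: np.ndarray) -> np.ndarray:
    D = len(t)
    F = np.fft.fft(t, n=2 * D)      # (1) FFT, zero-padded to 2*D
    S = F * np.conj(F)              # (2) S(f) = F_g(f) * conj(F_g(f))
    return np.fft.ifft(S).real[:D]  # (3) auto-correlation g(y)

t = np.sin(np.linspace(0, 8 * np.pi, 256))  # 4 periods, period = 64
g = autocorr(t)
# first local minimum of g gives the typical half-period y'
y1 = next(i for i in range(1, len(g) - 1) if g[i] < g[i - 1] and g[i] <= g[i + 1])
print(y1)  # ≈ 32
```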
Table 1 shows the algorithm for automatically estimating the frame length.
Table 1. Auto estimation of the frame length
FRAMELENGTH(T′_{s×D})
  for each T_i ∈ T′_{s×D}
    g_i(y) ← auto-correlation applied to T_i
    y*_i ← first local minimum of g_i(y)
  y* ← 80th percentile of sorted {y*_1 ... y*_s}
  return y*
The time complexity of data reduction is $O(sD \log D + ND)$: obtaining the frame length costs $O(sD \log D)$, and applying PAA to the entire input dataset costs $O(ND)$.
1.2.2 Density Estimation
We implement a fast algorithm to estimate the density radiuses (Table 2). We first find the inflection point with the minimum difference between its left-hand and right-hand slopes. We then recursively repeat this process on the two sub-curves segmented by the obtained inflection point, until no more significant inflection points are found.
Table 2. Algorithm for estimating density radiuses
Function 1:
DENSITYRADIUSES(k_dis)
  length ← |k_dis|
  allocate res as list
  INFLECTIONPOINT(k_dis, 0, length, res)
  return res
Function 2:
INFLECTIONPOINT(k_dis, s, e, res)
  r ← -1, diff ← +∞
  for i ← s to e
    left ← SLOPE(k_dis, s, i)
    right ← SLOPE(k_dis, i, e)
    if left or right greater than threshold1
      continue
    if |left - right| smaller than diff
      diff ← |left - right|
      r ← i
  if diff smaller than threshold2
    /* record the inflection point, and recursively search */
    add the r-th element of k_dis to res
    INFLECTIONPOINT(k_dis, s, r-1, res)
    INFLECTIONPOINT(k_dis, r+1, e, res)
The time complexity of estimating the density radiuses is as follows. The generation of the $k_{dis}$ curve costs $O(ds^2)$, due to the calculation of the distance between each pair of objects in the sample set. Multi-density estimation costs $O(s \log s)$, since it adopts a divide-and-conquer strategy.
1.2.3 Clustering
Once we obtain the density radiuses, the clustering algorithm is straightforward. With each density radius specified, from the smallest to the largest, DBSCAN is performed accordingly. In our implementation we set $k = 4$, which is the MinPts value in DBSCAN. The implementation is illustrated in Table 3.
Table 3. Algorithm for multi-density based clustering
/* p: the sample data set
   radiuses: the density radiuses */
MULTIDBSCAN(p, radiuses)
  for each radius ∈ radiuses
    objs ← cluster from DBSCAN(p, radius)
    remove objs from p
  mark p as noise objects
DBSCAN costs $O(s \log s)$. Since it is performed at most $s$ times, the total cost is $O(s^2 \log s)$.
1.2.4 Assignment
After clustering is performed on the sampled dataset, a cluster label needs to be assigned to each unlabeled time series instance in the input dataset. The assignment process is straightforward: for an unlabeled instance, its closest labeled instance is found; if their distance is less than the density radius of the cluster the labeled instance belongs to, then the unlabeled instance is considered to be in the same cluster as the labeled instance. Otherwise, it is labeled as noise.
Figure 3. Illustration of the pruning strategy in assignment (an unlabeled point a, a labeled point b, their distance dis, and the ball of radius dis − ε around b).
The assignment process involves the distance computation between every pair of unlabeled and labeled instances, which has complexity $O(Nsd)$. The observation illustrated in Figure 3 can reduce this computation: if an unlabeled object $a$ is far from a labeled object $b$, i.e., their distance $dis$ is greater than the density radius $\varepsilon$ of $b$'s cluster, then the distance between $a$ and any labeled neighbor of $b$ within $dis - \varepsilon$ is also greater than $\varepsilon$ (by the triangle inequality). Therefore, the distance computations between $a$ and those neighbors of $b$ are saved.
We design a data structure named Sorted Neighbor Graph (SNG) to implement this pruning strategy. When performing density-based clustering on the sampled dataset, if an instance $b$ is determined to be a core point, then $b$ is added to the SNG, and its distances to all the other instances in the sampled dataset are computed and stored in the SNG in ascending order. Quick-sort is used in the construction of the SNG, so the time complexity of building the SNG is $O(s^2 \log s)$.
The implementation of assignment using the SNG is shown in Table 4. Its time complexity is $O(Nsd)$: although the SNG and pruning can reduce the search space, in the worst case every unlabeled instance has to be compared to every labeled instance.
Table 4. Algorithm for assignment
// uObj: the list of unlabeled objects
ASSIGNMENT(SNG, uObj)
  for each obj ∈ uObj
    set the label of obj as "noise"
    for each o ∈ {keys of SNG}
      if o has been inspected
        continue
      dis ← L1 distance between o and obj
      if dis less than density radius of o
        mark obj with same label as o
        break
      mark o as inspected
      jump ← dis - density radius of o
      i ← BINARYSEARCH(SNG[o], jump)
      for each neighbor ∈ SNG[o] with index greater than i
        if density radius of neighbor is less than jump
          mark neighbor as inspected
        else break /* this is a sorted list */
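The core pruning rule can be illustrated with a small sketch (ours, not the paper's SNG code; for simplicity it assumes one shared density radius eps_b):

```python
# Triangle-inequality pruning from Figure 3: if dis(a, b) > eps_b, then any
# labeled c with dis(b, c) <= dis(a, b) - eps_b satisfies dis(a, c) >= eps_b,
# so those candidates can be skipped without computing their distances.
import numpy as np

def remaining_candidates(a, b, labeled, dist_b, eps_b):
    # labeled: array of labeled series, ordered so that dist_b (their L1
    # distances to b) is ascending, as stored in the SNG.
    dis = np.abs(a - b).sum()  # L1 distance
    if dis <= eps_b:
        return None  # a is assigned to b's cluster directly
    cut = np.searchsorted(dist_b, dis - eps_b)  # binary search, as in Table 4
    return labeled[cut:]  # only these may still be within eps_b of a
```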
1.3 Evaluation Report
This report includes detailed and complementary information for the YADING paper, "YADING: Fast Clustering of Large-Scale Time Series Data". For easier reference, the three research questions mentioned in the paper are listed as follows:
RQ1. How efficiently can YADING cluster time series data?
RQ2. How does sample size affect the clustering accuracy?
RQ3. How robust is YADING to time series variation?
1.3.1 Models for Generating Simulation Data
We use five different underlying stochastic models to generate simulated time series data. Below are the detailed descriptions of these models and their corresponding parameters. Together, these models cover a wide range of time series characteristics.
AR(1) Model [66]. $x_t = \beta x_{t-1} + \alpha + \mu$
The AR(1) model is a simplified version of the general ARMA model. Here $\alpha$ and $\beta$ are parameters, with $|\beta| < 1$ to ensure a stationary process; $\alpha$ and $\beta$ decide the asymptotic converged value of $x_t$, and $\mu \sim N(0, \sigma)$ is white noise.
Forced Oscillation [67]. $x_t = \frac{\frac{1}{m}\left[f \cos(\gamma t + \beta) + \mu\right] + 2x_{t-1} - x_{t-2}}{1 + \omega^2}$
Forced Oscillation is used to model cyclical time series. Here $\gamma$ is the circular frequency of the external "force", $\beta$ is the initial phase, $m$ and $\omega$ are the intrinsic properties of the studied time series, and $\mu \sim N(0, \sigma)$ is white noise.
Drift [68]. This model leverages the formula of the AR(1) model but makes $\beta > 1$; here $\beta - 1$ becomes the drift coefficient. To avoid the divergence of $x_t$, we set a threshold $b$ so that $x_t$ is reset to its initial value after $b$ steps.
Peak [69]. This model leverages the Forced Oscillation model but sets the force to $F(t) = f \cos(\gamma t + \beta) + \mu + g\delta(t - t_p)$, where $\delta(x) = 1$ if $x = 0$ and $0$ otherwise. This type of time series mimics spikes or transient anomalies.
Random Walk [70]. $x_t = x_{t-1} + \mu$
We use the random walk to represent noise time series, since a random walk is an accumulation of white noise, with no extra information contained.
Using the aforementioned stochastic models, we generate templates for creating the simulation datasets used in the subsequent experiments.
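A generation sketch for these models follows (our illustration; the helper names and the reset convention for the Drift model are our assumptions, and each function returns one series of length n).

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, alpha, beta, sigma):
    # x_t = beta * x_{t-1} + alpha + mu, mu ~ N(0, sigma)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = beta * x[t - 1] + alpha + rng.normal(0, sigma)
    return x

def forced_oscillation(n, m, omega, f, gamma, beta, sigma, g=0.0, t_p=-1):
    # Discretized forced oscillator; g adds the Peak model's impulse at t_p
    # (g = 0 gives the plain Forced Oscillation model).
    x = np.zeros(n)
    for t in range(2, n):
        force = f * np.cos(gamma * t + beta) + rng.normal(0, sigma)
        if t == t_p:
            force += g
        x[t] = (force / m + 2 * x[t - 1] - x[t - 2]) / (1 + omega ** 2)
    return x

def drift(n, alpha, beta, sigma, b):
    # AR(1) formula with drift; reset to the initial value every b steps.
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.0 if t % b == 0 else beta * x[t - 1] + alpha + rng.normal(0, sigma)
    return x

def random_walk(n, sigma=1.0):
    return np.cumsum(rng.normal(0, sigma, size=n))

# e.g., Group 1 of TemplateA: AR(1) with alpha=10, beta=0.5, sigma=10
series = ar1(n=100, alpha=10, beta=0.5, sigma=10)
```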
1.3.2 Template Details
According to our evaluation design, we use one template, named TemplateA, for RQ1, RQ2, and part of RQ3 (robustness to random noise), and we use a second template, TemplateB, to mimic the phase-perturbation phenomenon.
Specifically, TemplateA consists of 15 groups of time series, where each group corresponds to a specific label assigned to every object in that group. The group sizes vary significantly, represented by population ratios ranging from 0.1% to 30%. The mapping between GroupID and model type is random, and the parameters of the models are also arbitrarily chosen. Table 5 lists the information of each group in the template. Based on this template, the size N of each simulation dataset and the length D of each time series instance are set to meet the requirements of each experiment.
Table 5. Details of TemplateA
GroupID | Ratio | Model | Params
1 | 30% | AR(1) | α = 10, β = 0.5, σ = 10
2 | 20% | Forced Oscillation | m = 1, ω = 2, f = 30, γ = 0.1, β = 0, σ = 10
3 | 13% | Drifting | α = 10, β = 1, σ = 10, b = 30
4 | 9.0% | Drifting | α = 20, β = 1, σ = 5, b = 30
5 | 6.0% | AR(1) | α = 20, β = 0.5, σ = 1
6 | 4.0% | Forced Oscillation | m = 1, ω = 2, f = 30, γ = 1, β = 0, σ = 10
7 | 2.8% | Peak | m = 1, ω = 2, f = 10, γ = 1, β = 0, σ = 10, g = 1000
8 | 1.8% | Forced Oscillation | m = 1, ω = 2, f = 10, γ = 1, β = 0, σ = 10
9 | 1.2% | Peak | m = 1, ω = 2, f = 100, γ = 1, β = 0, σ = 10, g = 1000
10 | 0.8% | AR(1) | α = 30, β = 0.5, σ = 1
11 | 0.5% | Forced Oscillation | m = 1, ω = 2, f = 20, γ = 1, β = 0, σ = 10
12 | 0.4% | Drifting | α = 5, β = 1, σ = 5, b = 30
13 | 0.2% | AR(1) | α = 20, β = 0.5, σ = 10
14 | 0.2% | Forced Oscillation | m = 1, ω = 2, f = 20, γ = 0.1, β = 0, σ = 10
15 | 0.1% | Drifting | α = 5, β = 1, σ = 10, b = 30
According to the settings in Table 5, a dataset is generated once the data size and dimensionality are specified.
We use TemplateB to mimic the phase-perturbation phenomenon. As illustrated in Table 6, we set a phase perturbation for the Forced Oscillation models: the initial phase β is drawn from a uniform distribution on the given interval, e.g., β ∈ [0.0, 2π/3] for Group 1 and β ∈ [0.0, π/3] for Group 3.
Table 6. Details of TemplateB
GroupID | Ratio | Model | Params
1 | 31% | Forced Oscillation | m = 1, ω = 2, f = 32, γ = 0.08, σ = 5, β ∈ [0.0, 2π/3]
2 | 20% | AR(1) | α = 0, β = −0.5, σ = 10
3 | 14% | Forced Oscillation | m = 1, ω = 2, f = 64, γ = 0.1, σ = 20, β ∈ [0.0, π/3]
4 | 9.0% | Drifting | α = 20, β = 0.8, σ = 8, b = 8
5 | 6.0% | AR(1) | α = 25, β = 0.5, σ = 1
6 | 6.0% | Drifting | α = 14, β = 0.85, σ = 8, b = 32
7 | 2.8% | Peak | m = 1, ω = 2, f = 8, γ = 1, β = 3, σ = 10, g = 400
8 | 1.2% | Peak | m = 1, ω = 2, f = 8, γ = 1, β = 3, σ = 10, g = 400
2. REFERENCES
[1] Debregeas, A., and Hebrail, G. 1998. Interactive
interpretation of Kohonen maps applied to curves. In Proc.
of KDD’98. 179-183.
[2] Derrick, K., Bill, K., and Vamsi, C. 2012. Large scale/big
data federation & virtualization: a case study.
http://rhsummit.files.wordpress.com/2012/03/kittler_large_
scale_big_data.pdf.
[3] D. A. Patterson. 2002. A simple way to estimate the cost of
downtime. In Proc. of LISA’ 02, pp. 185-188.
[4] Eamonn, K., and Shruti, K. 2002. On the need for time
series data mining benchmarks: a survey and empirical
demonstration. In Proc. of KDD’02, July 23-26.
[5] T. W. Liao. 2005. Clustering of time series data—A survey. Pattern Recognit., vol. 38, no. 11, pp. 1857–1874, Nov.
[6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos.
1994. Fast subsequence matching in time series databases.
In Proc. of the ACM SIGMOD Conf., May.
[7] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P.
Boesiger. 1998. A new correlation-based fuzzy logic
clustering algorithm for fMRI, Mag. Resonance Med. 40
249–260.
[8] D. Rafiei, and A. Mendelzon. 1997. Similarity-based
queries for time series data. In Proc. of the ACM SIGMOD
Conf., Tucson, AZ, May.
[9] B. K. Yi, H. V. Jagadish, and C. Faloutsos. 1998. Efficient
retrieval of similar time sequences under time warping. In
IEEE Proc. of ICDE, Feb.
[10] R. Agrawal, K. L. Lin, H. S. Sawhney, and K. Shim. 1995.
Fast similarity search in the presence of noise, scaling, and
translation in time series database. In Proc. of the VLDB
conf., Zurich, Switzerland.
[11] J. Han, M. Kamber. 2001. Data Mining: Concepts and
Techniques, Morgan Kaufmann, San Francisco, pp. 346–
389.
[12] Chu, K. & Wong, M. 1999. Fast time-series searching with
scaling and shifting. In proc. of PODS. pp 237-248.
[13] Faloutsos, C., Jagadish, H., Mendelzon, A. & Milo, T.
1997. A signature technique for similarity-based queries. In
Proc. of the ICCCS.
[14] Chan, K. & Fu, A. W. 1999. Efficient time series matching
by wavelets. In proc. of ICDE. pp 126-133.
[15] Popivanov, I. & Miller, R. J. 2002. Similarity search over time series data using wavelets. In proc. of ICDE. pp 212-221.
[16] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S.
2001. Locally adaptive dimensionality reduction for
indexing large time series databases. In proc. of ACM
SIGMOD. pp 151-162.
[17] Korn, F., Jagadish, H. & Faloutsos, C. 1997. Efficiently
supporting ad hoc queries in large datasets of time
sequences. In proc. of the ACM SIGMOD. pp 289-300.
[18] Yi, B. & Faloutsos, C. 2000. Fast time sequence indexing
for arbitrary lp norms. In proc. of the VLDB. pp 385-394.
[19] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S.
Mehrotra. 2001. Dimensionality Reduction for Fast
Similarity Search in Large Time Series Databases. Knowl.
Inf. Syst., 3(3).
[20] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5).
[21] Zhou, S., Zhou, A., Cao, J., Wen, J., Fan, Y., Hu. Y. 2000.
Combining sampling technique with DBSCAN algorithm
for clustering large spatial databases. In Proc. of the
PAKDD. 169-172.
[22] Stuart, Alan. 1962. Basic Ideas of Scientific Sampling,
Hafner Publishing Company, New York.
[23] M. Ester, H. P. Kriegel, and X. Xu. 1996. A density-based
algorithm for discovering clusters in large spatial databases
with noise. In Proceedings of 2nd ACM SIGKDD, pages
226–231.
[24] “Amazon’s S3 cloud service turns into a puff of smoke”.
2008. In InformationWeek NewsFilter, Aug.
[25] J. N. Hoover: “Outages force cloud computing users to
rethink tactics”. In InformationWeek, Aug. 16, 2008.
[26] M. Steinbach, L. Ertoz, and V. Kumar. 2003. Challenges of
clustering high dimensional data. In L. T. Wille, editor,
New Vistas in Statistical Physics – Applications in
Econophysics, Bioinformatics, and Pattern Recognition.
Springer-Verlag.
[27] N. D. Sidiropoulos and R. Bros. 1999. Mathematical
Programming Algorithms for Regression-based Non-linear
Filtering in R^N. IEEE Trans. on Signal Processing, Mar.
[28] J. L. Rodgers and W. A. Nicewander. 1988. Thirteen ways
to look at the correlation coefficient. The American
Statistician, 42(1):59–66, February.
[29] Al-Naymat, G., Chawla, S., & Taheri, J. (2012).
SparseDTW: A Novel Approach to Speed up Dynamic Time
Warping.
[30] M. Kumar, N.R. Patel, J. Woo. 2002. Clustering seasonality
patterns in the presence of errors, Proceedings of KDD ’02,
Edmonton, Alberta, Canada.
[31] Kullback, S.; Leibler, R.A. 1951. "On Information and
Sufficiency". Annals of Mathematical Statistics 22 (1): 79–
86. doi:10.1214/aoms/1177729694. MR 39968.
[32] Ng R.T., and Han J. 1994. Efficient and Effective
Clustering Methods for Spatial Data Mining, In proc. of
VLDB, 144-155.
[33] S. Guha, R. Rastogi, K. Shim. 1998. CURE: an efficient
clustering algorithm for large databases. In proc. of
SIGMOD. pp. 73–84.
[34] García J.A., Fdez-Valdivia J., Cortijo F. J., and Molina R.
1994. A Dynamic Approach for Clustering Data. Signal
Processing, Vol. 44, No. 2, 1994, pp. 181-196.
[35] Jianbo Shi and Jitendra Malik. 2000. Normalized Cuts and
Image Segmentation, IEEE Transactions on PAMI, Vol.
22, No. 8, Aug 2000.
[36] W. Wang, J. Yang, R. Muntz, R. 1997. STING: a statistical
information grid approach to spatial data mining, VLDB’97,
Athens, Greek, pp. 186–195.
[37] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur
Zimek 2011. Density-based Clustering. WIREs Data
Mining and Knowledge Discovery 1 (3): 231–240.
doi:10.1002/widm.30.
[38] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel,
Jörg Sander. 1999. OPTICS: Ordering Points To Identify
the Clustering Structure. ACM SIGMOD. pp. 49–60.
[39] Achtert, E.; Böhm, C.; Kröger, P. 2006. "DeLi-Clu:
Boosting Robustness, Completeness, Usability, and
Efficiency of Hierarchical Clustering by a Closest Pair
Ranking". LNCS: Advances in Knowledge Discovery and
Data Mining. Lecture Notes in Computer Science 3918:
119–128.
[40] Liu P, Zhou D, Wu NJ. 2007. VDBSCAN: varied density based spatial clustering of applications with noise. In Proc. of ICSSSM. pp 1–4.
[41] Tao Pei, Ajay Jasra, David J. Hand, A. X. Zhu, C. Zhou. 2009. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min. Knowl. Disc.
[42] P. Cheeseman, J. Stutz. 1996. Bayesian classification (AutoClass): theory and results. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press.
[43] T. Kohonen. 1990. The self-organizing map. Proc. IEEE 78 (9), 1464–1480.
[44] C. Guo, H. Li, and D. Pan. 2010. An improved piecewise aggregate approximation based on statistical features for time series mining. KSEM'10, pp. 234–244.
[45] Box, Hunter and Hunter. 1978. Statistics for Experimenters. Wiley. p. 130.
[46] Krantz, Steven; Parks, Harold R. 2002. A Primer of Real Analytic Functions (2nd ed.). Birkhäuser.
[47] J. Bentley. 1986. Programming Pearls. Addison-Wesley, Reading, MA.
[48] Box, G. E. P.; Jenkins, G. M.; Reinsel, G. C. 1994. Time Series Analysis: Forecasting and Control (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.
[49] Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. In proc. of SIGMOD. p. 322.
[50] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI. Mag. Resonance Med. 40, 249–260.
[51] Y. Kakizawa, R.H. Shumway, N. Taniguchi. 1998. Discrimination and clustering for multivariate time series. J. Amer. Stat. Assoc. 93 (441), 328–340.
[52] M. Kumar, N.R. Patel, J. Woo. 2002. Clustering seasonality patterns in the presence of errors. In Proc. of KDD '02.
[53] R.H. Shumway. 2003. Time–frequency clustering and discriminant analysis. Stat. Probab. Lett. 63, 307–314.
[54] J.J. van Wijk, E.R. van Selow. 1999. Cluster and calendar based visualization of time series data. In Proc. of SOIV.
[55] T.W. Liao, B. Bolt, J. Forester, E. Hailman, C. Hansen, R.C. Kaste, J. O'May. 2002. Understanding and projecting the battle state. 23rd Army Science Conference, Orlando, FL, December 2–5.
[56] S. Policker, A.B. Geva. 2000. Nonstationary time series analysis by temporal clustering. IEEE Trans. Syst. Man Cybernet. B: Cybernet. 30 (2), 339–343.
[57] T.-C. Fu, F.-L. Chung, V. Ng, R. Luk. 2001. Pattern discovery from stock time series using self-organizing maps. KDD Workshop on Temporal Data Mining. pp. 27–37.
[58] D. Piccolo. 1990. A distance measure for classifying ARMA models. J. Time Ser. Anal. 11 (2), 153–163.
[59] J. Beran, G. Mazzola. 1999. Visualizing the relationship between time series by hierarchical smoothing models. J. Comput. Graph. Stat. 8 (2), 213–238.
[60] M. Ramoni, P. Sebastiani, P. Cohen. 2002. Bayesian clustering by dynamics. Mach. Learning 47 (1), 91–121.
[61] M. Ramoni, P. Sebastiani, P. Cohen. 2000. Multivariate clustering by dynamics. Proceedings of AAAI-2000. pp. 633–638.
[62] K. Kalpakis, D. Gada, V. Puttagunta. Distance measures for effective clustering of ARIMA time-series. In proc. of ICDM. pp. 273–280.
[63] D. Tran, M. Wagner. 2002. Fuzzy c-means clustering-based speaker verification. In: N.R. Pal, M. Sugeno (Eds.), AFSS 2002, Lecture Notes in Artificial Intelligence, 2275, pp. 318–324.
[64] T. Oates, L. Firoiu, P.R. Cohen. 1999. Clustering time series with hidden Markov models and dynamic time warping. In Proc. of IJCAI-99.
[65] Danon L., Díaz-Guilera A., Duch J., and Arenas A. 2005. J. Stat. Mech. P09008.
[66] http://en.wikipedia.org/wiki/Autoregressive_model
[67] http://en.wikipedia.org/wiki/Harmonic_oscillator
[68] http://en.wikipedia.org/wiki/Stochastic_drift
[69] http://en.wikipedia.org/wiki/Pulse_(signal_processing)
[70] http://en.wikipedia.org/wiki/Random_walk
[71] http://en.wikipedia.org/wiki/Principal_component_analysis
[72] http://research.microsoft.com/en-us/people/juding/yadingdoc.pdf
[73] Oppenheim, Alan V., Ronald W. Schafer, John R. Buck. 1999. Discrete-Time Signal Processing (2nd ed.). Prentice Hall. ISBN 0-13-754920-2.
[74] http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem