Technical Document of YADING
Fast and Automatic Clustering of Large-Scale Time Series Data
YADING project
Members: Rui Ding, Qiang Wang, Yingnong Dang, Qiang Fu, Haidong Zhang, Dongmei Zhang
Software Analytics Group
Microsoft Research
December, 2013
1. DOCUMENT
This appendix includes the detailed proofs of several statements, as well as other useful supporting information.
๐‘ƒ(๐‘›๐‘–′ ≥ ๐‘š) > 1 − ๐›ผ ↔ ๐‘ƒ (๐‘ ≥
−๐‘ง๐›ผ & ๐‘š ≤ ๐‘ ๐‘๐‘– .
๐œŽ
2
(๐‘๐‘– −
1.1.1 Sample Size Determination
Sampling is the most effective mechanism to handle the scale of
the input dataset. Since we want to achieve high performance and
we do not assume any distribution of the input dataset, we choose
random sampling [47] as our sampling algorithm.
In practice, a predefined sampling rate is often used to determine the size $s$ of the sampled dataset. As $N$, the size of the input dataset, keeps increasing, $s$ increases accordingly, which results in slow clustering performance on the sampled dataset. Furthermore, it is unclear what impact the increased number of samples has on the clustering accuracy. We derive the following theoretical bounds to guide the selection of $s$.
Assume that the ground truth of clustering is known for $\mathcal{T}_{N \times D}$, i.e., all the $T_i \in \mathcal{T}_{N \times D}$ belong to $k$ known groups, and $n_i$ represents the number of time series in the $i$-th group. Let $p_i = \frac{n_i}{N}$ denote the population ratio of group $i$. Similarly, let $p_i' = \frac{n_i'}{s}$ denote the population ratio of the $i$-th group in the sampled dataset. $|p_i - p_i'|$ reflects the ratio deviation between the input dataset $\mathcal{T}_{N \times D}$ and the sampled dataset $\mathcal{T}_{s \times d}$. We formalize the selection of the sample size $s$ as finding the lower bound $s_l$ and upper bound $s_u$ such that, given a tolerance $\epsilon$ and a confidence level $1 - \alpha$, (1) a group $i$ with $p_i$ less than $\epsilon$ is not guaranteed to have sufficient instances in the sampled dataset for $s < s_l$, and (2) the maximum ratio deviation $|p_i - p_i'|$, $1 \le i \le k$, is within the given tolerance for $s \ge s_u$. Intuitively, the lower bound constrains the smallest size of clusters that can possibly be found, and the upper bound indicates that once the sample size exceeds a threshold, more samples will not change the clustering result.
Lemma 1 (lower bound): Given $m$, the least number of instances present in the sampled dataset for group $i$, a tolerance $\epsilon$, and the confidence level $1 - \alpha$, the sample size
$$s \ge \frac{m + z_\alpha\left(\frac{z_\alpha}{2} + \sqrt{m + \frac{z_\alpha^2}{4}}\right)}{p_i}$$
satisfies $P(n_i' \ge m) > 1 - \alpha$. Here, $z_\alpha$ is a function of $\alpha$: $P(Z > z_\alpha) = \alpha$, where $Z \sim N(0, 1)$.
With confidence level $1 - \alpha$, Lemma 1 provides the lower bound on the sample size $s$ that guarantees $m$ instances in the sampled dataset for any cluster with population ratio higher than $\epsilon$. For example, if a cluster has $p_i > 1\%$ and we set $m = 5$ with confidence 95% (i.e., $1 - \alpha = 0.95$), then we get $s_l \ge 1{,}030$. In this case, when $s < 1{,}030$, the clusters with $p_i < 1\%$ have a perceptible probability (>5%) of being missed in the sampled dataset.
It should be noted that the selection of $m$ is related to the clustering method applied to the sampled dataset. For example, DBSCAN is a density-based method, and it typically requires 4 nearest neighbors of a specific object to identify a cluster; thus, any cluster with size less than 5 is difficult to find. This consideration of the clustering method also supports our formalization for deciding $s_l$.
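The bound in Lemma 1 is straightforward to evaluate numerically. Below is a minimal sketch (an illustration, not part of the original implementation; it assumes SciPy is available for the normal quantile) that reproduces the example above.

```python
# Sketch: evaluate the Lemma 1 lower bound
#   s_l = (m + z_a * (z_a/2 + sqrt(m + z_a^2/4))) / p_i,
# where z_a satisfies P(Z > z_a) = alpha for Z ~ N(0, 1).
from math import sqrt
from scipy.stats import norm

def lower_bound(m: int, alpha: float, p_i: float) -> float:
    z = norm.ppf(1 - alpha)  # z_alpha
    return (m + z * (z / 2 + sqrt(m + z * z / 4))) / p_i

print(lower_bound(m=5, alpha=0.05, p_i=0.01))  # ~1027, i.e. s_l ≈ 1,030
```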
๐‘š−๐‘ ๐‘
๐‘š−๐‘ ๐‘
๐‘–
๐‘–
Proof: Event {๐‘›๐‘–′ ≥ ๐‘š} ↔ { ๐‘– ๐‘– ≥
} ↔ {๐‘ ≥
}, the
๐œŽ
๐œŽ
๐œŽ
′
last statement holds since ๐‘›๐‘– ~๐‘(๐‘ ๐‘๐‘– , ๐‘ ๐‘๐‘– (1 − ๐‘๐‘– )), and here ๐œŽ =
√๐‘ ๐‘๐‘– (1 − ๐‘๐‘– ). So
)>1−๐›ผ ↔
๐‘š−๐‘ ๐‘๐‘–
๐œŽ
≤
The last inequality can be transformed to
1.1 Lemmas and Proofs
๐‘›′ −๐‘ ๐‘
๐‘š−๐‘ ๐‘๐‘–
2
๐‘š + ๐‘ง๐›ผ2 /2
๐‘š + ๐‘ง๐›ผ2 /2
๐‘š2
, ๐‘š ≤ ๐‘ ๐‘๐‘–
2 ) ≥(
2 ) −
๐‘  + ๐‘ง๐›ผ
๐‘  + ๐‘ง๐›ผ
๐‘ (๐‘  + ๐‘ง๐›ผ2 )
Consider ๐‘ง๐›ผ usually valued in [0, 3], so ๐‘  โ‰ซ ๐‘ง๐›ผ2 , so ๐‘  + ๐‘ง๐›ผ2 ≈ ๐‘ ,
apply this to the inequality above, we get a simplified version ๐‘  ≥
๐‘š+๐‘ง๐›ผ (
๐‘ง๐›ผ
๐‘ง 2
+√๐‘š+ ๐›ผ )
2
4
, hence the lemma is proven.
๐‘๐‘–
Lemma 2 (upper bound): Given a tolerance $\epsilon$ and the confidence level $1 - \alpha$, the sample size $s \ge \frac{z_{\alpha/2}^2}{4\epsilon^2}$ satisfies $P\left(\max_i |p_i - p_i'| < \epsilon\right) > 1 - \alpha$. Here, $z_{\alpha/2}$ is a function of $\alpha$: $P(Z > z_{\alpha/2}) = \alpha/2$, where $Z \sim N(0, 1)$.
Lemma 2 implies that the sample size $s$ depends only on the tolerance $\epsilon$ and the confidence level $1 - \alpha$; it is independent of the input data size. For example, if we set $\epsilon = 0.01$ and $1 - \alpha = 0.95$, which means that for any group the difference between its population ratios in the input dataset and in the sampled dataset is less than 0.01, then the lowest sample size that guarantees this setting is $s \ge \frac{z_{0.025}^2}{4 \times 0.01^2} \approx 9{,}600$. More than 9,600 samples are not necessary. This makes $s = 9{,}600$ the upper bound of the sample size. Moreover, this sample size does not change with the size of the input dataset.
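The upper bound is equally direct to compute; here is a minimal sketch (illustrative only, assuming SciPy) for the setting above.

```python
# Sketch: evaluate the Lemma 2 upper bound s_u = z_{alpha/2}^2 / (4 * eps^2).
from scipy.stats import norm

def upper_bound(eps: float, alpha: float) -> float:
    z = norm.ppf(1 - alpha / 2)  # z_{alpha/2}; z_{0.025} ≈ 1.96
    return z ** 2 / (4 * eps ** 2)

print(upper_bound(eps=0.01, alpha=0.05))  # ~9604, i.e. the ≈9,600 above
```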
Proof: The probability that a sample belongs to the $i$-th group is $p_i$, due to the property of random sampling. Since the samples are independent, the number of instances belonging to the $i$-th group follows a Binomial distribution, $n_i' \sim B(s, p_i)$. $B(s, p_i) \approx N(sp_i,\, sp_i(1-p_i))$ when $s$ is large and the distribution is not too skewed [45], so in the following we assume $n_i' \sim N(sp_i,\, sp_i(1-p_i))$ by approximation. Then
$$p_i' = \frac{n_i'}{s} \sim N\!\left(p_i,\, \frac{p_i(1-p_i)}{s}\right) \;\Rightarrow\; Y = p_i' - p_i \sim N\!\left(0,\, \frac{p_i(1-p_i)}{s}\right).$$
Event $\{|p_i - p_i'| < \epsilon\} \leftrightarrow \{|Y| < \epsilon\} \leftrightarrow \left\{|Z| < \frac{\epsilon}{\sigma}\right\}$, where $\sigma = \sqrt{\frac{p_i(1-p_i)}{s}}$. So when $\frac{\epsilon}{\sigma} > z_{\alpha/2}$, $P\left(|Z| < \frac{\epsilon}{\sigma}\right) > 1 - \alpha$ is achieved. Expanding $\sigma$, the range of $s$ should be $s \ge \frac{p_i(1-p_i)}{\epsilon^2} z_{\alpha/2}^2$; since $p_i(1-p_i) \le \frac{1}{4}$, $s \ge \frac{z_{\alpha/2}^2}{4\epsilon^2}$ is a valid range for every group $i$, hence the lemma is proven.
Lemma 2 provides a loose upper bound, since we replace $p_i(1-p_i)$ with $\frac{1}{4}$ to bound all possible ratio values. So a sample size smaller than 9,600 may still yield reasonable results while increasing performance. In practice, we varied the sample size from 1,030 (the lower bound) to 10,000 on real datasets and tested the clustering results; we finally chose 2,000, since the accuracy of the clustering results is very close once the sample size exceeds 2,000.
1.1.2 Phase-Shift Overcoming
In this section, we investigate how the $L_1$ distance combined with density-based clustering overcomes the phase-shift problem. The first observation is that, when the phase shift is small enough, the $L_1$ distance can also be small enough; the second observation is that, when the data scale becomes large, the distance between a particular time series and its kNN ($k$-th nearest neighbor) can be short enough that all the time series become connected when applying DBSCAN.
Preliminaries: Denote a time series $T(a) = \{f(a+c), f(a+2c), f(a+3c), \ldots, f(a+mc)\}$, where $a$ is the initial phase, $c$ is the interval at which the time series is sampled, and $m$ is the length. The time series is generated by an underlying continuous model $f(t)$; here we assume $f(t)$ is an analytic function [46]. Another time series with phase shift $\delta$ is represented by $T(a-\delta) = \{f(a+c-\delta), f(a+2c-\delta), \ldots, f(a+mc-\delta)\}$.
Without noise, the time series instances are identical, which is trivial for any type of clustering method. When noise is incorporated, the resulting time series instances deviate from one another.
Lemma 1: $\exists M$ s.t. $L_1(T(a), T(a-\delta)) := \sum_{i=1}^{m} |f(a+ic) - f(a+ic-\delta)| \le mM\delta$.
Proof: According to Taylor's theorem for analytic functions,
$$f(x) = f(a) + f'(\theta)(x-a), \quad \text{where } \theta \in (a, x).$$
We immediately get
$$L_1(T(a), T(a-\delta)) = \sum_{i=1}^{m} |f(a+ic) - f(a+ic-\delta)| = \sum_{i=1}^{m} |f'(\theta_i)\,\delta| \le mM\delta,$$
where $M = \max_i |f'(\theta_i)|$, $\theta_i \in (a+ic-\delta,\, a+ic)$.
Now suppose we have $n$ time series $T(a-\delta_i)$ that differ from $T(a)$ only by a phase shift $\delta_i$. Without loss of generality, let $\delta_i \in [0, \Delta]$. We assume these time series are generated independently, with $\delta_i$ uniformly distributed in the interval $[0, \Delta]$. Denote the clustering parameter as $\varepsilon$ ($\varepsilon$ is a distance threshold: if the distance between a specific object and its kNN is smaller than $\varepsilon$, then it is a core point). Denote the event
$$E_n := \{T(a-\delta_i),\; i = 1, 2, \ldots, n \text{ belong to the same cluster}\}.$$
Lemma 2: $P(E_n) \ge 1 - n\left(1 - \frac{\varepsilon}{mMk\Delta}\right)^n$.
Proof: Divide the interval $[0, \Delta]$ into buckets of length $\frac{\varepsilon}{mMk}$. According to the mechanism of DBSCAN, if each bucket contains at least one time series, then $L_1(T(a), \text{its kNN}) \le k \times mM \times \frac{\varepsilon}{mMk} = \varepsilon$, so all the time series are core points and they are density-connected; hence all the time series are grouped into one cluster.
Denote the event $U_j := \{j\text{-th bucket is empty}\}$. Then
$$P(\text{at least one bucket is empty}) = P\Big(\bigcup_j U_j\Big) \le \sum_j P(U_j) = n\left(1 - \frac{\varepsilon}{mMk\Delta}\right)^n.$$
Note that the event $\{\text{no empty bucket}\}$ is a subset of $E_n$, so
$$P(E_n) \ge P(\text{no empty bucket}) = 1 - P\Big(\bigcup_j U_j\Big) \ge 1 - n\left(1 - \frac{\varepsilon}{mMk\Delta}\right)^n,$$
hence the lemma is proven.
Corollary: $\lim_{n \to \infty} P(E_n) = 1$. This is intuitive.
Description: Lemma 2 provides a probabilistic confidence bound on how likely the phase-shifted time series instances are to be grouped together. As $n$ goes to infinity, the confidence converges to 1.
1.1.3 Random Noise Overcoming
This section describes how density-based clustering overcomes the random noise that is encoded into the specific models generating particular time series. Here we discuss time series generated by a "precise model" plus "white noise", and we refer to the white noise as the "random noise". In general, such a way of describing time series is very common and natural.
To illustrate how density-based clustering overcomes the random noise, we give a theoretical analysis of time series generated by the AR(1) model, which is relatively simple, without loss of generality. Let $x_i$ be the value of the $i$-th epoch of a particular time series instance; AR(1) is represented by $x_i = a x_{i-1} + \mu$, where $|a| < 1$ to make it stable, and $\mu \sim N(0, \sigma^2)$ is the white noise. For a given time series $\vec{x} := \{x_1, x_2, \ldots, x_d\}^T$, let $x_1 = 0$ be the initial value.
Joint distribution of $\vec{x}$: Denote $f(x_1, x_2, \ldots, x_d)$ as the p.d.f. of a time series generated by the given AR(1) model. Then
$$f(x_0, x_1, \ldots, x_d) = f(x_d | x_0, x_1, \ldots, x_{d-1})\, f(x_{d-1} | x_0, \ldots, x_{d-2}) \cdots f(x_1 | x_0)\, P(x_0),$$
according to the identity transformation of probability, with $P(x_0 = 0) = 1$ and, by the Markov property,
$$f(x_i | x_0, \ldots, x_{i-1}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_i - a x_{i-1})^2}{2\sigma^2}\right].$$
So finally we get
$$f(\vec{x}) := f(x_0, x_1, \ldots, x_d) = \frac{1}{(2\pi\sigma^2)^{d/2}} \prod_{i=1}^{d} \exp\left[-\frac{(x_i - a x_{i-1})^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left[-\frac{1}{2}\,\frac{\vec{x}^T \Sigma^{-1} \vec{x}}{\sigma^2}\right].$$
Here,
$$\Sigma^{-1} = \begin{pmatrix} 1+a^2 & -a & 0 & \cdots & 0 \\ -a & 1+a^2 & -a & & \vdots \\ 0 & -a & \ddots & \ddots & 0 \\ \vdots & & \ddots & 1+a^2 & -a \\ 0 & \cdots & 0 & -a & 1 \end{pmatrix}.$$
Now let $N$ time series instances (each with length $d$) be generated independently from the discussed AR(1) model. For density-based clustering, denote the clustering parameter as $\varepsilon$ ($\varepsilon$ is a distance threshold: if the distance between a specific object and its kNN is smaller than $\varepsilon$, then it is a core point). Define $\rho = \frac{k}{c_d r^d}$, where $V_r := c_d \times r^d$ is the volume of the hyper-sphere (see Figure 2) with radius $r$ in the $d$-dimensional $L_p$ space; e.g., in Euclidean space, $c_d = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$.
Lemma 1: The ratio of the $N$ objects (these time series instances) that can be clustered together by applying density-based clustering is
$$N(\rho, N) \approx \int \cdots \int_{N f(\vec{x}) \ge \rho} f(\vec{x})\, dx_1\, dx_2 \cdots dx_d.$$
The proof is straightforward. Since $f(\vec{x})$ is the p.d.f. of $\vec{x}$, the local density at a given position $\vec{x}$ is approximately $N f(\vec{x})$. When the local density is greater than $\rho$, the points in the neighborhood of $\vec{x}$ can be identified as core points, because $f(\vec{x})$ is a smooth function. By considering the convexity of $f(\vec{x})$, all the regions with density greater than $\rho$ can be grouped together.
Corollary: $\lim_{N \to \infty} N(\rho, N) = 1$. The local density can be made arbitrarily large once the number of objects increases to infinity.
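As a sanity check on the joint density above, the following sketch (our illustration, not from the paper) builds the tridiagonal precision matrix $\Sigma^{-1}$ and verifies that the quadratic form $\vec{x}^T \Sigma^{-1} \vec{x}$ equals $\sum_i (x_i - a x_{i-1})^2$ with $x_0 = 0$.

```python
# Verify: x^T * Sigma^{-1} * x == sum_i (x_i - a*x_{i-1})^2 with x_0 = 0,
# where Sigma^{-1} has 1+a^2 on the diagonal (1 in the last entry) and -a
# on the two off-diagonals.
import numpy as np

def ar1_precision(d: int, a: float) -> np.ndarray:
    P = np.diag([1 + a ** 2] * (d - 1) + [1.0])
    P += np.diag([-a] * (d - 1), k=1) + np.diag([-a] * (d - 1), k=-1)
    return P

rng = np.random.default_rng(1)
d, a = 8, 0.6
x = rng.normal(size=d)
prev = np.concatenate([[0.0], x[:-1]])  # shifted series, with x_0 = 0
lhs = x @ ar1_precision(d, a) @ x
rhs = np.sum((x - a * prev) ** 2)
print(np.allclose(lhs, rhs))  # True
```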
1.1.4 Multi-Density Estimation
Density estimation is key to density-based clustering algorithms. It is performed manually or with slow performance in most of the existing algorithms [23][38][39][40][41]. In this section, we define the concept of a density radius and provide theoretical proof of its estimation. We use the density radius in YADING to identify the core points of the input dataset and conduct multi-density clustering accordingly.
The notations used in this section are summarized as follows:
$N$: total number of objects (same as points).
$d$: dimensionality of each object.
$k_{dis}$: the distance between an object and its kNN.
$k_{dis}$ curve: the $k_{dis}$ values of all objects, sorted in descending order.
$EDF_k(r) := \frac{|\{\text{objects whose } k_{dis} \le r\}|}{N}$: the empirical distribution function of $k_{dis}$.
$V_r := c_d \times r^d$: the volume of a hyper-sphere with radius $r$ in the $d$-dimensional $L_p$ space; e.g., in Euclidean space, $c_d = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$.
Density radius: the most frequent $k_{dis}$ value on the $k_{dis}$ curve.
We define the $k_{dis}$ of an object as the distance between this object and its kNN. A $k_{dis}$ curve is the list of $k_{dis}$ values in descending order; Figure 1 shows an example of a $k_{dis}$ curve with $k = 4$. We define the density radius as the most frequent $k_{dis}$ value. Intuitively, most objects contain exactly $k$ nearest neighbors in a hyper-sphere whose radius equals the density radius.
We transform the estimation of the density radius into identifying the inflection points on the $k_{dis}$ curve. Here, an inflection point takes the general definition: a point whose second derivative equals zero. Next, we provide the intuition behind this transformation, followed by theoretical proof.
Intuitively, the local area of an inflection point on the $k_{dis}$ curve is the flattest (i.e., the slopes on its left-hand and right-hand sides have the smallest difference), so the points in the neighborhood of an inflection point have close values of $k_{dis}$. For example, in Figure 1 there are three potential inflection points, with corresponding $k_{dis}$ values equal to 1,500, 500, and 200. In other words, most points on this curve have $k_{dis}$ values close to 1,500, 500, or 200. According to the definition of the density radius, these three values can be used to approximate three density radiuses.
We now provide theoretical support for estimating the density radius by identifying the inflection points on the $k_{dis}$ curve. We first prove that the Y-value, $k_{dis}$, of each inflection point on the $k_{dis}$ curve equals one unique density radius. Specifically, given a dataset with a single density, we derive an analytical form of its $k_{dis}$ curve and prove that there exists a unique inflection point whose Y-value equals the density radius of the dataset. We then generalize the estimation to datasets with multiple densities.
To make the mathematical deduction easier, we use $EDF_k(r)$ to represent the $k_{dis}$ curve equivalently: it is the $k_{dis}$ curve rotated 90 degrees clockwise, with a normalized Y-axis. The X-value of an inflection point on $EDF_k(r)$ equals the Y-value of the corresponding inflection point on the $k_{dis}$ curve.
Problem 1 (analytical expression of $EDF_k(r)$): Suppose $N$ objects are sampled independently from a uniform distribution defined on a region with volume $V$ (see Figure 2), so the density is $\rho = \frac{N}{V}$. Let $V_r$ be an arbitrary hyper-sphere region with radius $r$, centered at a point $O$ where a particular object is located. Define the event $E_{m,r} := \{\text{only } m \text{ points inside } V_r,\, m \ge 1\}$.
Lemma 1: $P(E_{m,r}) = C_{N-1}^{m-1} P_r^{m-1} (1 - P_r)^{N-m}$, where $P_r = \frac{V_r}{V} = \frac{c_d \times r^d}{V}$.
Proof: Since there is already one object inside the sphere (located at its center), $E_{m,r}$ requires $m-1$ extra objects inside the sphere and $N-m$ objects outside. $P_r$ is the probability that one object is sampled inside the sphere, so the lemma follows directly from the Binomial distribution.
Lemma 2: $P(k_{dis} \le r) = \sum_{m=k+1}^{N} P(E_{m,r})$.
Proof: If $k_{dis} \le r$, then the kNN of the object is inside the hyper-sphere, so there are at least $k+1$ objects inside the hyper-sphere; and vice versa.
Corollary 1: $EDF_k(r) \approx P(k_{dis} \le r)$.
It should be pointed out that, although the $k_{dis}$ values of the objects share the same distribution, they are not totally independent, so we cannot directly use $EDF_k(r)$ to approximate $P(k_{dis} \le r)$. We assume this is a good approximation, however, which is also evidenced by simulation experiments.
Figure 1. 4-dis curve of a time series dataset, with three potential inflection points (around 1,500, 500, and 200).
Figure 2. One population generated by a uniform distribution.
Corollary 2: $EDF_1(r) \approx 1 - e^{-\rho c_d r^d}$.
1๐‘‘๐‘–๐‘  has the simplest version. According to Lemma 2 and 3,
๐‘−1
๐ธ๐ท๐น1 (๐‘Ÿ) = 1 − (1 − ๐‘ƒ๐‘Ÿ )๐‘−1 = 1 − (1 −
๐‘
≈ 1 − (1 −
๐œŒ๐‘๐‘‘ ๐‘Ÿ ๐‘‘
)
๐‘
๐œŒ๐‘๐‘‘ ๐‘Ÿ ๐‘‘
๐‘‘
) ≈ 1 − ๐‘’ −๐œŒ๐‘๐‘‘ ๐‘Ÿ
๐‘
The last two approximations make sense when ๐‘ is large.
Lemma 3 (existence and uniqueness): There exists one and only one inflection point on $EDF_k(r)$, and the Y-value of the corresponding inflection point on the $k_{dis}$ curve is the density radius.
Proof: Denote $r_i$ as the X-value of an inflection point of $EDF_k(r)$, so that $\frac{d^2 EDF_k(r)}{dr^2}\Big|_{r=r_i} = 0$. For $k = 1$, by Corollary 2,
$$\frac{d^2 EDF_1(r)}{dr^2} = \rho c_d d\, r^{d-2} e^{-\rho c_d r^d} \left(d - 1 - \rho c_d d\, r^d\right).$$
Since the first derivative of $EDF_k(r)$ is the probability density function of $k_{dis}$, the second derivative being zero means that $r = r_i$ is the point of maximum likelihood; in other words, $r_i$ is the most frequent value of $k_{dis}$, which is the definition of the density radius. Since the X-value of an inflection point on $EDF_k(r)$ equals the Y-value of the corresponding inflection point on the $k_{dis}$ curve, the lemma is proven.
Corollary 3: For $EDF_1(r)$, the inflection point is $r_i = \left(\frac{d-1}{d \rho c_d}\right)^{\frac{1}{d}}$.
Taking $\frac{d^2 EDF_k(r)}{dr^2}\Big|_{r=r_i} = 0$ with the formulation $EDF_1(r) \approx 1 - e^{-\rho c_d r^d}$, we get the expression of $r_i$.
Problem 2 (mixture of densities): Denote $\theta_i = \{\rho_i, d_i, n_i\}$ as the parameter vector of $EDF_{i,k}(r)$ for a particular population with a single density. Now suppose there are $h$ regions with different densities, located in space without overlap. We estimate the expression for the overall $EDF_k(r)$.
Lemma 4: $EDF_k(r) = \sum_{i=1}^{h} \frac{n_i\, EDF_{i,k}(r)}{N}$, where $N = \sum_{i=1}^{h} n_i$.
This follows easily from the definition of $EDF_k(r)$. The mixture model is just the linear combination of the individual $EDF_{i,k}(r)$; this enables model inference: given the overall $EDF_k(r)$, identify the underlying models, which can be represented by $\{\theta_1, \ldots, \theta_h\}$; from these, the multiple densities can be obtained.
Mixture model identification: Denote the expression of $EDF_k(r)$ as $EDF_k(r | \theta_1, \ldots, \theta_h)$. Given the $k_{dis}$ curve, which can be represented by a list of $\{y_i, r_i\}$, we formulate the problem as an optimization problem:
$$\arg\min \sum_i \left[y_i - EDF_k(r_i | \theta_1, \theta_2, \ldots, \theta_h)\right]^2$$
$$\text{subject to } \sum_{j=1}^{h} b_j = 1, \quad b_j \ge 0,\; \forall j.$$
Several well-developed techniques, such as the EM method, are suitable for solving such a problem.
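As a concrete (hypothetical) instance of this formulation, the sketch below fits a two-component mixture with $k = 1$ and known $d$ by constrained least squares; it illustrates the optimization problem, not the paper's actual solver, and a careful initialization may be needed in practice.

```python
# Fit EDF_1(r) = b1*(1 - exp(-rho1*c_d*r^d)) + (1-b1)*(1 - exp(-rho2*c_d*r^d))
# to an observed k-dis curve {(r_i, y_i)} by constrained least squares.
import numpy as np
from scipy.optimize import minimize

d, c_d = 2, np.pi  # assumed known dimensionality

def edf_mix(r, b1, rho1, rho2):
    return (b1 * (1 - np.exp(-rho1 * c_d * r ** d))
            + (1 - b1) * (1 - np.exp(-rho2 * c_d * r ** d)))

def fit(r_obs, y_obs):
    loss = lambda p: np.sum((y_obs - edf_mix(r_obs, *p)) ** 2)
    res = minimize(loss, x0=[0.5, 1.0, 10.0],
                   bounds=[(0.0, 1.0), (1e-6, None), (1e-6, None)])
    return res.x  # b1, rho1, rho2

# Synthetic check: the fit should recover the generating parameters.
r = np.linspace(0.01, 2.0, 200)
y = edf_mix(r, 0.3, 2.0, 40.0)
print(fit(r, y))
```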
The next lemma provides theoretical bounds showing that the inflection points on the $EDF_k(r)$ of mixed densities can approximate the inflection points of each single-density $EDF_{i,k}(r)$ when the densities are sufficiently different.
Without loss of generality, and for simplicity, we consider a mixture of two densities, set $k = 1$, and assume the intrinsic dimensionalities of the two density regions are equal. Denote $r_{i1}, r_{i2}$ as the X-values of the inflection points of the two density regions, with densities $\rho_1, \rho_2$ respectively. Denote $\varphi = \frac{\rho_1}{\rho_2}$. Denote $R := \left\{r_i \,\Big|\, \frac{d^2 EDF_1(r_i)}{dr^2} = 0\right\}$. Denote $b_1 = \frac{n_1}{N}$, $b_2 = \frac{n_2}{N}$.
∃๐‘Ÿ๐‘– ∈ ๐‘…, ๐‘ ๐‘œ ๐‘กโ„Ž๐‘Ž๐‘ก lim ๐‘Ÿ๐‘– = ๐‘Ÿ๐‘–1
๐œ‘→0
;
∃๐‘Ÿ๐‘— ∈
and
๐œ‘→∞
Proof: According to the form
Put the form of ๐‘Ÿ๐‘–1 = (
๐‘‘−1
1
๐‘‘
1
๐‘‘ ๐œŒ1 ๐‘๐‘‘
) into it, we get
1
๐‘‘ ๐ธ๐ท๐น1 (๐‘Ÿ๐‘–1 )
๐‘‘−1 2
=
(
) ๐œ‘(1 − ๐œ‘)๐‘’ −๐œ‘(1−๐‘‘ )
๐‘‘๐‘Ÿ 2
๐‘Ÿ๐‘–1
2
The lemma is proven not matter ๐œ‘ → 0 or ๐œ‘ → ∞.
This lemma indicates that, when the density difference is large enough, the inflection points obtained from the mixture $EDF_k(r)$ approximate the inflection points of each single-density region, which in turn represent the density radiuses.
Corollary 4: For high dimensionality $d \gg 1$, $\varphi = \frac{3 \pm \sqrt{5}}{2}$ is the worst density difference for this approximation.
Letting $1 - \frac{1}{d} \approx 1$, we get that when $\varphi = \frac{3 \pm \sqrt{5}}{2}$, $\frac{d^2 EDF_1(r_{i1})}{dr^2}$ reaches its maximum/minimum value, which is not equal to 0.
The corollary implies that, when the density difference is near these values, the mixture $EDF_k(r)$ can hardly be used for density radius identification.
1.2 Tables for Pseudo-code
1.2.1 Dimensionality Reduction
We adopt PAA for dimensionality reduction because of its computational efficiency and its capability of preserving the shape of time series. Denote a time series instance with length $D$ as $T_i := (t_{i1}, t_{i2}, \ldots, t_{iD})$. The transformation from $T_i$ to $T_i' := (\tau_{i1}, \tau_{i2}, \ldots, \tau_{id})$, where
$$\tau_{ij} = \frac{d}{D} \sum_{k=\frac{D}{d}(j-1)+1}^{\frac{D}{d} j} t_{ik},$$
is called PAA with frame length equal to $\frac{D}{d}$. PAA segments a time series instance into $d$ frames and uses one value (i.e., the mean value) to represent each frame, so as to reduce its length from $D$ to $d$.
๐‘
๐œƒ1 ~๐œƒโ„Ž
5:
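A minimal sketch of the PAA transformation above (our illustration; it assumes the frame length $D/d$ is an integer):

```python
# PAA: segment the series into d frames of length D/d and average each frame.
import numpy as np

def paa(series: np.ndarray, d: int) -> np.ndarray:
    D = series.shape[0]
    assert D % d == 0, "frame length D/d must be an integer"
    return series.reshape(d, D // d).mean(axis=1)

print(paa(np.arange(12.0), d=3))  # [1.5 5.5 9.5]
```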
One key issue in applying PAA is to specify a proper $d$ automatically. As proved by the Nyquist-Shannon sampling theorem, any time series without frequencies higher than B Hertz can be perfectly recovered from its sampled points with sampling rate 2B. This means that using 2B as the sampling rate preserves the shape of a frequency-bound time series. Although the time series under our study are often imperfectly frequency-bound signals, most of them can be approximated by frequency-bound signals, because their very high-frequency components usually correspond to noise. Therefore, we transform the problem of determining $d$ into estimating the upper bound of the frequencies.
In this paper, we propose a novel auto-correlation-based approach to identify the approximate frequency upper bound of all the input time series instances. The frame length $d$ is then easily determined as the inverse of the frequency upper bound.
In more detail, we first identify the typical frequency of each time series instance $T_i$ by locating the first local minimum on its auto-correlation curve $g_i$, defined as $g_i(y) = \sum_{j=1}^{D-y} t_{ij} t_{i(j+y)}$, where $y$ is the lag. If there is a local minimum of $g_i$ at a particular lag $y'$, then $y'$ relates to a typical half-period if $g_i(y') < 0$. In this case, we call $1/y'$ the typical frequency for $T_i$. The smaller $y'$ is, the higher the frequency it represents.
Then, we sort all the detected typical frequencies in ascending order and select the 80th percentile to approximate the frequency upper bound of all the time series instances. We do not use the exact maximum typical frequency in order to avoid the potential instability caused by a small amount of extraordinary noise in some time series instances.
Regarding implementation, the auto-correlation curves $g_i(y)$ can be obtained efficiently using the Fast Fourier transform: (1) $F_g(f) = FFT[T_i]$; (2) $S(f) = F_g(f) F_g^*(f)$; (3) $g_i(y) = IFFT[S(f)]$, where IFFT is the inverse Fast Fourier transform and the asterisk denotes complex conjugation.
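A short sketch of this FFT-based auto-correlation (our illustration; zero-padding avoids the circular wrap-around of the plain FFT):

```python
# Steps (1)-(3) above: FFT, power spectrum, inverse FFT.
import numpy as np

def autocorr(t: np.ndarray) -> np.ndarray:
    D = len(t)
    F = np.fft.fft(t, n=2 * D)      # (1) FFT, zero-padded to 2*D
    S = F * np.conj(F)              # (2) S(f) = F_g(f) * conj(F_g(f))
    return np.fft.ifft(S).real[:D]  # (3) auto-correlation g(y)

t = np.sin(np.linspace(0, 8 * np.pi, 256))  # 4 periods, period = 64
g = autocorr(t)
# first local minimum of g gives the typical half-period y'
y1 = next(i for i in range(1, len(g) - 1) if g[i] < g[i - 1] and g[i] <= g[i + 1])
print(y1)  # ≈ 32
```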
Table 1 shows the algorithm for automatically estimating the frame length.
Table 1. Auto estimation of the frame length
FRAMELENGTH(T′_{s×D})
  for each T_i ∈ T′_{s×D}
    g_i(y) ← auto-correlation applied to T_i
    y*_i ← first local minimum of g_i(y)
  y* ← 80th percentile of sorted {y*_1 ... y*_s}
  return y*
The time complexity of data reduction is $O(sD \log D + ND)$: obtaining the frame length costs $O(sD \log D)$, and applying PAA to the entire input dataset costs $O(ND)$.
1.2.2 Density Estimation
We implement a fast algorithm to estimate the density radiuses (Table 2). We first find the inflection point with the minimum difference between its left-hand and right-hand slopes. We then recursively repeat this process on the two sub-curves segmented by the obtained inflection point, until no more significant inflection points are found.
Table 2. Algorithm for estimating density radiuses
Function 1:
DENSITYRADIUSES(k_dis)
  length ← |k_dis|
  allocate res as list
  INFLECTIONPOINT(k_dis, 0, length, res)
  return res
Function 2:
INFLECTIONPOINT(k_dis, s, e, res)
  r ← -1, diff ← +∞
  for i ← s to e
    left ← SLOPE(k_dis, s, i)
    right ← SLOPE(k_dis, i, e)
    if left or right greater than threshold1
      continue
    if |left - right| smaller than diff
      diff ← |left - right|
      r ← i
  if diff smaller than threshold2
    /* record the inflection point, and recursively search */
    add the r-th element of k_dis to res
    INFLECTIONPOINT(k_dis, s, r-1, res)
    INFLECTIONPOINT(k_dis, r+1, e, res)
The time complexity of estimating the density radiuses is as follows. The generation of the $k_{dis}$ curve costs $O(ds^2)$, due to the calculation of the distance between each pair of objects in the sample set. Multi-density estimation costs $O(s \log s)$, since it adopts a divide-and-conquer strategy.
1.2.3 Clustering
Once we obtain the density radiuses, the clustering algorithm is straightforward. With each density radius specified, from the smallest to the largest, DBSCAN is performed accordingly. In our implementation we set $k = 4$, which is the MinPts value in DBSCAN. The implementation is illustrated in Table 3.
Table 3. Algorithm for multi-density based clustering
/* p: the sample data set
   radiuses: the density radiuses */
MULTIDBSCAN(p, radiuses)
  for each radius ∈ radiuses
    objs ← cluster from DBSCAN(p, radius)
    remove objs from p
  mark p as noise objects
DBSCAN costs $O(s \log s)$. Since it is performed at most $s$ times, the total cost is $O(s^2 \log s)$.
1.2.4 Assignment
After clustering is performed on the sampled dataset, a cluster label needs to be assigned to each unlabeled time series instance in the input dataset. The assignment process is straightforward: for an unlabeled instance, its closest labeled instance is found; if their distance is less than the density radius of the cluster the labeled instance belongs to, then the unlabeled instance is considered to be in the same cluster as the labeled instance. Otherwise, it is labeled as noise.
Figure 3. Illustration of the pruning strategy in assignment (an unlabeled point a, a labeled point b, their distance dis, and the ball of radius dis − ε around b).
The assignment process involves the distance computation between every pair of unlabeled and labeled instances, which has complexity $O(Nsd)$. The observation illustrated in Figure 3 can reduce this computation: if an unlabeled object $a$ is far from a labeled object $b$, i.e., their distance $dis$ is greater than the density radius $\varepsilon$ of $b$'s cluster, then the distance between $a$ and any labeled neighbor of $b$ within $dis - \varepsilon$ is also greater than $\varepsilon$ (by the triangle inequality). Therefore, the distance computations between $a$ and those neighbors of $b$ are saved.
We design a data structure named Sorted Neighbor Graph (SNG) to implement this pruning strategy. When performing density-based clustering on the sampled dataset, if an instance $b$ is determined to be a core point, then $b$ is added to the SNG, and its distances to all the other instances in the sampled dataset are computed and stored in the SNG in ascending order. Quick-sort is used in the construction of the SNG, so the time complexity of building the SNG is $O(s^2 \log s)$.
The implementation of assignment using the SNG is shown in Table 4. Its time complexity is $O(Nsd)$: although the SNG and pruning can reduce the search space, in the worst case every unlabeled instance has to be compared to every labeled instance.
Table 4. Algorithm for assignment
// uObj: the list of unlabeled objects
ASSIGNMENT(SNG, uObj)
  for each obj ∈ uObj
    set the label of obj as "noise"
    for each o ∈ {keys of SNG}
      if o has been inspected
        continue
      dis ← L1 distance between o and obj
      if dis less than density radius of o
        mark obj with same label as o
        break
      mark o as inspected
      jump ← dis - density radius of o
      i ← BINARYSEARCH(SNG[o], jump)
      for each neighbor ∈ SNG[o] with index greater than i
        if density radius of neighbor is less than jump
          mark neighbor as inspected
        else break /* this is a sorted list */
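The core pruning rule can be illustrated with a small sketch (ours, not the paper's SNG code; for simplicity it assumes one shared density radius eps_b):

```python
# Triangle-inequality pruning from Figure 3: if dis(a, b) > eps_b, then any
# labeled c with dis(b, c) <= dis(a, b) - eps_b satisfies dis(a, c) >= eps_b,
# so those candidates can be skipped without computing their distances.
import numpy as np

def remaining_candidates(a, b, labeled, dist_b, eps_b):
    # labeled: array of labeled series, ordered so that dist_b (their L1
    # distances to b) is ascending, as stored in the SNG.
    dis = np.abs(a - b).sum()  # L1 distance
    if dis <= eps_b:
        return None  # a is assigned to b's cluster directly
    cut = np.searchsorted(dist_b, dis - eps_b)  # binary search, as in Table 4
    return labeled[cut:]  # only these may still be within eps_b of a
```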
1.3 Evaluation Report
This report includes detailed and complementary information for the YADING paper, "YADING: Fast Clustering of Large-Scale Time Series Data". For easier reference, the three research questions mentioned in the paper are listed as follows:
RQ1. How efficiently can YADING cluster time series data?
RQ2. How does sample size affect the clustering accuracy?
RQ3. How robust is YADING to time series variation?
1.3.1 Models for Generating Simulation Data
We use five different underlying stochastic models to generate simulated time series data. Below are the detailed descriptions of these models and their corresponding parameters. Together, these models cover a wide range of time series characteristics.
AR(1) Model [66]. $x_t = \beta x_{t-1} + \alpha + \mu$
The AR(1) model is a simplified version of the general ARMA model. Here $\alpha$ and $\beta$ are parameters, with $|\beta| < 1$ to ensure a stationary process; $\alpha$ and $\beta$ decide the asymptotic converged value of $x_t$, and $\mu \sim N(0, \sigma)$ is white noise.
Forced Oscillation [67]. $x_t = \frac{\frac{1}{m}\left[f \cos(\gamma t + \beta) + \mu\right] + 2x_{t-1} - x_{t-2}}{1 + \omega^2}$
Forced Oscillation is used to model cyclical time series. Here $\gamma$ is the circular frequency of the external "force", $\beta$ is the initial phase, $m$ and $\omega$ are the intrinsic properties of the studied time series, and $\mu \sim N(0, \sigma)$ is white noise.
Drift [68]. This model leverages the formula of the AR(1) model but makes $\beta > 1$; here $\beta - 1$ becomes the drift coefficient. To avoid the divergence of $x_t$, we set a threshold $b$ so that $x_t$ is reset to its initial value after $b$ steps.
Peak [69]. This model leverages the Forced Oscillation model but sets the force to $F(t) = f \cos(\gamma t + \beta) + \mu + g\delta(t - t_p)$, where $\delta(x) = 1$ if $x = 0$ and $0$ otherwise. This type of time series mimics spikes or transient anomalies.
Random Walk [70]. $x_t = x_{t-1} + \mu$
We use the random walk to represent noise time series, since a random walk is an accumulation of white noise, with no extra information contained.
Using the aforementioned stochastic models, we generate templates for creating the simulation datasets used in the subsequent experiments.
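A generation sketch for these models follows (our illustration; the helper names and the reset convention for the Drift model are our assumptions, and each function returns one series of length n).

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(n, alpha, beta, sigma):
    # x_t = beta * x_{t-1} + alpha + mu, mu ~ N(0, sigma)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = beta * x[t - 1] + alpha + rng.normal(0, sigma)
    return x

def forced_oscillation(n, m, omega, f, gamma, beta, sigma, g=0.0, t_p=-1):
    # Discretized forced oscillator; g adds the Peak model's impulse at t_p
    # (g = 0 gives the plain Forced Oscillation model).
    x = np.zeros(n)
    for t in range(2, n):
        force = f * np.cos(gamma * t + beta) + rng.normal(0, sigma)
        if t == t_p:
            force += g
        x[t] = (force / m + 2 * x[t - 1] - x[t - 2]) / (1 + omega ** 2)
    return x

def drift(n, alpha, beta, sigma, b):
    # AR(1) formula with drift; reset to the initial value every b steps.
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.0 if t % b == 0 else beta * x[t - 1] + alpha + rng.normal(0, sigma)
    return x

def random_walk(n, sigma=1.0):
    return np.cumsum(rng.normal(0, sigma, size=n))

# e.g., Group 1 of TemplateA: AR(1) with alpha=10, beta=0.5, sigma=10
series = ar1(n=100, alpha=10, beta=0.5, sigma=10)
```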
1.3.2 Template Details
According to our evaluation design, we use one template, named TemplateA, for RQ1, RQ2, and part of RQ3 (robustness to random noise), and we use a second template, TemplateB, to mimic the phase-perturbation phenomenon.
Specifically, TemplateA consists of 15 groups of time series, where each group corresponds to a specific label assigned to every object in that group. The group sizes vary significantly, represented by population ratios ranging from 0.1% to 30%. The mapping between GroupID and model type is random, and the parameters of the models are also arbitrarily chosen. Table 5 lists the information of each group in the template. Based on this template, the size N of each simulation dataset and the length D of each time series instance are set to meet the requirements of each experiment.
Table 5. Details of TemplateA
GroupID | Ratio | Model | Params
1 | 30% | AR(1) | α = 10, β = 0.5, σ = 10
2 | 20% | Forced Oscillation | m = 1, ω = 2, f = 30, γ = 0.1, β = 0, σ = 10
3 | 13% | Drifting | α = 10, β = 1, σ = 10, b = 30
4 | 9.0% | Drifting | α = 20, β = 1, σ = 5, b = 30
5 | 6.0% | AR(1) | α = 20, β = 0.5, σ = 1
6 | 4.0% | Forced Oscillation | m = 1, ω = 2, f = 30, γ = 1, β = 0, σ = 10
7 | 2.8% | Peak | m = 1, ω = 2, f = 10, γ = 1, β = 0, σ = 10, g = 1000
8 | 1.8% | Forced Oscillation | m = 1, ω = 2, f = 10, γ = 1, β = 0, σ = 10
9 | 1.2% | Peak | m = 1, ω = 2, f = 100, γ = 1, β = 0, σ = 10, g = 1000
10 | 0.8% | AR(1) | α = 30, β = 0.5, σ = 1
11 | 0.5% | Forced Oscillation | m = 1, ω = 2, f = 20, γ = 1, β = 0, σ = 10
12 | 0.4% | Drifting | α = 5, β = 1, σ = 5, b = 30
13 | 0.2% | AR(1) | α = 20, β = 0.5, σ = 10
14 | 0.2% | Forced Oscillation | m = 1, ω = 2, f = 20, γ = 0.1, β = 0, σ = 10
15 | 0.1% | Drifting | α = 5, β = 1, σ = 10, b = 30
According to the settings in Table 5, a dataset is generated once the data size and dimensionality are specified.
We use TemplateB to mimic the phase-perturbation phenomenon. As illustrated in Table 6, we set a phase perturbation for the Forced Oscillation models: the initial phase β is drawn from a uniform distribution on the given interval, e.g., β ∈ [0.0, 2π/3] for Group 1 and β ∈ [0.0, π/3] for Group 3.
Table 6. Details of TemplateB
GroupID | Ratio | Model | Params
1 | 31% | Forced Oscillation | m = 1, ω = 2, f = 32, γ = 0.08, σ = 5, β ∈ [0.0, 2π/3]
2 | 20% | AR(1) | α = 0, β = −0.5, σ = 10
3 | 14% | Forced Oscillation | m = 1, ω = 2, f = 64, γ = 0.1, σ = 20, β ∈ [0.0, π/3]
4 | 9.0% | Drifting | α = 20, β = 0.8, σ = 8, b = 8
5 | 6.0% | AR(1) | α = 25, β = 0.5, σ = 1
6 | 6.0% | Drifting | α = 14, β = 0.85, σ = 8, b = 32
7 | 2.8% | Peak | m = 1, ω = 2, f = 8, γ = 1, β = 3, σ = 10, g = 400
8 | 1.2% | Peak | m = 1, ω = 2, f = 8, γ = 1, β = 3, σ = 10, g = 400
2. REFERENCES
[1] Debregeas, A., and Hebrail, G. 1998. Interactive
interpretation of Kohonen maps applied to curves. In Proc.
of KDD’98. 179-183.
[2] Derrick, K., Bill, K., and Vamsi, C. 2012. Large scale/big
data federation & virtualization: a case study.
http://rhsummit.files.wordpress.com/2012/03/kittler_large_
scale_big_data.pdf.
[3] D. A. Patterson. 2002. A simple way to estimate the cost of
downtime. In Proc. of LISA’ 02, pp. 185-188.
[4] Eamonn, K., and Shruti, K. 2002. On the need for time
series data mining benchmarks: a survey and empirical
demonstration. In Proc. of KDD’02, July 23-26.
[5] T. W. Liao. 2005. Clustering of time series data—A survey. Pattern Recognit., vol. 38, no. 11, pp. 1857–1874, Nov.
[6] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos.
1994. Fast subsequence matching in time series databases.
In Proc. of the ACM SIGMOD Conf., May.
[7] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P.
Boesiger. 1998. A new correlation-based fuzzy logic
clustering algorithm for fMRI, Mag. Resonance Med. 40
249–260.
[8] D. Rafiei, and A. Mendelzon. 1997. Similarity-based
queries for time series data. In Proc. of the ACM SIGMOD
Conf., Tucson, AZ, May.
[9] B. K. Yi, H. V. Jagadish, and C. Faloutsos. 1998. Efficient
retrieval of similar time sequences under time warping. In
IEEE Proc. of ICDE, Feb.
[10] R. Agrawal, K. L. Lin, H. S. Sawhney, and K. Shim. 1995.
Fast similarity search in the presence of noise, scaling, and
translation in time series database. In Proc. of the VLDB
conf., Zurich, Switzerland.
[11] J. Han, M. Kamber. 2001. Data Mining: Concepts and
Techniques, Morgan Kaufmann, San Francisco, pp. 346–
389.
[12] Chu, K. & Wong, M. 1999. Fast time-series searching with
scaling and shifting. In proc. of PODS. pp 237-248.
[13] Faloutsos, C., Jagadish, H., Mendelzon, A. & Milo, T.
1997. A signature technique for similarity-based queries. In
Proc. of the ICCCS.
[14] Chan, K. & Fu, A. W. 1999. Efficient time series matching
by wavelets. In proc. of ICDE. pp 126-133.
[15] Popivanov, I. & Miller, R. J. 2002. Similarity search over time series data using wavelets. In proc. of ICDE. pp 212-221.
[16] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S.
2001. Locally adaptive dimensionality reduction for
indexing large time series databases. In proc. of ACM
SIGMOD. pp 151-162.
[17] Korn, F., Jagadish, H. & Faloutsos, C. 1997. Efficiently
supporting ad hoc queries in large datasets of time
sequences. In proc. of the ACM SIGMOD. pp 289-300.
[18] Yi, B. & Faloutsos, C. 2000. Fast time sequence indexing
for arbitrary lp norms. In proc. of the VLDB. pp 385-394.
[19] E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S.
Mehrotra. 2001. Dimensionality Reduction for Fast
Similarity Search in Large Time Series Databases. Knowl.
Inf. Syst., 3(3).
[20] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. 2003. Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE, 15(5).
[21] Zhou, S., Zhou, A., Cao, J., Wen, J., Fan, Y., Hu. Y. 2000.
Combining sampling technique with DBSCAN algorithm
for clustering large spatial databases. In Proc. of the
PAKDD. 169-172.
[22] Stuart, Alan. 1962. Basic Ideas of Scientific Sampling,
Hafner Publishing Company, New York.
[23] M. Ester, H. P. Kriegel, and X. Xu. 1996. A density-based
algorithm for discovering clusters in large spatial databases
with noise. In Proceedings of 2nd ACM SIGKDD, pages
226–231.
[24] “Amazon’s S3 cloud service turns into a puff of smoke”.
2008. In InformationWeek NewsFilter, Aug.
[25] J. N. Hoover: “Outages force cloud computing users to
rethink tactics”. In InformationWeek, Aug. 16, 2008.
[26] M. Steinbach, L. Ertoz, and V. Kumar. 2003. Challenges of
clustering high dimensional data. In L. T. Wille, editor,
New Vistas in Statistical Physics – Applications in
Econophysics, Bioinformatics, and Pattern Recognition.
Springer-Verlag.
[27] N. D. Sidiropoulos and R. Bros. 1999. Mathematical
Programming Algorithms for Regression-based Non-linear
Filtering in R^N. IEEE Trans. on Signal Processing, Mar.
[28] J. L. Rodgers and W. A. Nicewander. 1988. Thirteen ways
to look at the correlation coefficient. The American
Statistician, 42(1):59–66, February.
[29] Al-Naymat, G., Chawla, S., & Taheri, J. (2012).
SparseDTW: A Novel Approach to Speed up Dynamic Time
Warping.
[30] M. Kumar, N.R. Patel, J. Woo. 2002. Clustering seasonality
patterns in the presence of errors, Proceedings of KDD ’02,
Edmonton, Alberta, Canada.
[31] Kullback, S.; Leibler, R.A. 1951. "On Information and
Sufficiency". Annals of Mathematical Statistics 22 (1): 79–
86. doi:10.1214/aoms/1177729694. MR 39968.
[32] Ng R.T., and Han J. 1994. Efficient and Effective
Clustering Methods for Spatial Data Mining, In proc. of
VLDB, 144-155.
[33] S. Guha, R. Rastogi, K. Shim. 1998. CURE: an efficient
clustering algorithm for large databases. In proc. of
SIGMOD. pp. 73–84.
[34] García J.A., Fdez-Valdivia J., Cortijo F. J., and Molina R.
1994. A Dynamic Approach for Clustering Data. Signal
Processing, Vol. 44, No. 2, 1994, pp. 181-196.
[35] Jianbo Shi and Jitendra Malik. 2000. Normalized Cuts and
Image Segmentation, IEEE Transactions on PAMI, Vol.
22, No. 8, Aug 2000.
[36] W. Wang, J. Yang, R. Muntz, R. 1997. STING: a statistical
information grid approach to spatial data mining, VLDB’97,
Athens, Greek, pp. 186–195.
[37] Hans-Peter Kriegel, Peer Kröger, Jörg Sander, Arthur
Zimek 2011. Density-based Clustering. WIREs Data
Mining and Knowledge Discovery 1 (3): 231–240.
doi:10.1002/widm.30.
[38] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel,
Jörg Sander. 1999. OPTICS: Ordering Points To Identify
the Clustering Structure. ACM SIGMOD. pp. 49–60.
[39] Achtert, E.; Böhm, C.; Kröger, P. 2006. "DeLi-Clu:
Boosting Robustness, Completeness, Usability, and
Efficiency of Hierarchical Clustering by a Closest Pair
Ranking". LNCS: Advances in Knowledge Discovery and
Data Mining. Lecture Notes in Computer Science 3918:
119–128.
[40] Liu P, Zhou D, Wu NJ. 2007. VDBSCAN: varied density based spatial clustering of applications with noise. In Proc. of ICSSSM. pp 1–4.
[41] Tao Pei, Ajay Jasra, David J. Hand, A. X. Zhu, C. Zhou. 2009. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min. Knowl. Disc.
[42] P. Cheeseman, J. Stutz. 1996. Bayesian classification (AutoClass): theory and results. Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press.
[43] T. Kohonen. 1990. The self-organizing map. Proc. IEEE 78 (9), 1464–1480.
[44] C. Guo, H. Li, and D. Pan. 2010. An improved piecewise aggregate approximation based on statistical features for time series mining. KSEM'10, pp. 234–244.
[45] Box, Hunter and Hunter. 1978. Statistics for Experimenters. Wiley. p. 130.
[46] Krantz, Steven; Parks, Harold R. 2002. A Primer of Real Analytic Functions (2nd ed.). Birkhäuser.
[47] J. Bentley. 1986. Programming Pearls. Addison-Wesley, Reading, MA.
[48] Box, G. E. P.; Jenkins, G. M.; Reinsel, G. C. 1994. Time Series Analysis: Forecasting and Control (3rd ed.). Upper Saddle River, NJ: Prentice-Hall.
[49] Beckmann, N.; Kriegel, H. P.; Schneider, R.; Seeger, B. 1990. The R*-tree: an efficient and robust access method for points and rectangles. In proc. of SIGMOD. p. 322.
[50] X. Golay, S. Kollias, G. Stoll, D. Meier, A. Valavanis, P. Boesiger. 1998. A new correlation-based fuzzy logic clustering algorithm for fMRI. Mag. Resonance Med. 40, 249–260.
[51] Y. Kakizawa, R.H. Shumway, N. Taniguchi. 1998. Discrimination and clustering for multivariate time series. J. Amer. Stat. Assoc. 93 (441), 328–340.
[52] M. Kumar, N.R. Patel, J. Woo. 2002. Clustering seasonality patterns in the presence of errors. In Proc. of KDD '02.
[53] R.H. Shumway. 2003. Time–frequency clustering and discriminant analysis. Stat. Probab. Lett. 63, 307–314.
[54] J.J. van Wijk, E.R. van Selow. 1999. Cluster and calendar based visualization of time series data. In Proc. of SOIV.
[55] T.W. Liao, B. Bolt, J. Forester, E. Hailman, C. Hansen, R.C. Kaste, J. O'May. 2002. Understanding and projecting the battle state. 23rd Army Science Conference, Orlando, FL, December 2–5.
[56] S. Policker, A.B. Geva. 2000. Nonstationary time series analysis by temporal clustering. IEEE Trans. Syst. Man Cybernet. B: Cybernet. 30 (2), 339–343.
[57] T.-C. Fu, F.-L. Chung, V. Ng, R. Luk. 2001. Pattern discovery from stock time series using self-organizing maps. KDD Workshop on Temporal Data Mining. pp. 27–37.
[58] D. Piccolo. 1990. A distance measure for classifying ARMA models. J. Time Ser. Anal. 11 (2), 153–163.
[59] J. Beran, G. Mazzola. 1999. Visualizing the relationship between time series by hierarchical smoothing models. J. Comput. Graph. Stat. 8 (2), 213–238.
[60] M. Ramoni, P. Sebastiani, P. Cohen. 2002. Bayesian clustering by dynamics. Mach. Learning 47 (1), 91–121.
[61] M. Ramoni, P. Sebastiani, P. Cohen. 2000. Multivariate clustering by dynamics. Proceedings of AAAI-2000. pp. 633–638.
[62] K. Kalpakis, D. Gada, V. Puttagunta. Distance measures for effective clustering of ARIMA time-series. In proc. of ICDM. pp. 273–280.
[63] D. Tran, M. Wagner. 2002. Fuzzy c-means clustering-based speaker verification. In: N.R. Pal, M. Sugeno (Eds.), AFSS 2002, Lecture Notes in Artificial Intelligence, 2275, pp. 318–324.
[64] T. Oates, L. Firoiu, P.R. Cohen. 1999. Clustering time series with hidden Markov models and dynamic time warping. In Proc. of IJCAI-99.
[65] Danon L., Díaz-Guilera A., Duch J., and Arenas A. 2005. J. Stat. Mech. P09008.
[66] http://en.wikipedia.org/wiki/Autoregressive_model
[67] http://en.wikipedia.org/wiki/Harmonic_oscillator
[68] http://en.wikipedia.org/wiki/Stochastic_drift
[69] http://en.wikipedia.org/wiki/Pulse_(signal_processing)
[70] http://en.wikipedia.org/wiki/Random_walk
[71] http://en.wikipedia.org/wiki/Principal_component_analysis
[72] http://research.microsoft.com/en-us/people/juding/yadingdoc.pdf
[73] Oppenheim, Alan V., Ronald W. Schafer, John R. Buck. 1999. Discrete-Time Signal Processing (2nd ed.). Prentice Hall. ISBN 0-13-754920-2.
[74] http://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem