Dynamic clustering of interval data based on hybrid
L1, L2 and L∞ distances
Leandro C. Souza1,2,*, Renata M. C. R. Souza1, Getúlio J. A. Amaral3
1. Universidade Federal de Pernambuco (UFPE), CIn, Recife - PE, Brazil
2. Universidade Federal Rural do Semi-Árido (UFERSA), DCEN, Mossoró - RN, Brazil
3. Universidade Federal de Pernambuco (UFPE), DE, Recife - PE, Brazil
* Contact author: lcs6@cin.ufpe.br
Keywords: Interval Distance, Interval Symbolic Data, Dynamic Clustering
Cluster analysis is a traditional approach to exploratory knowledge discovery. Dynamic partitional
clustering partitions the data and associates a prototype with each cluster. Distance measures are
necessary to perform the clustering. In the literature, a variety of distances have been proposed for
clustering interval data, such as the $L_1$, $L_2$ and $L_\infty$ distances. However, interval data
carry an extra kind of information that is absent from point data: the variation or uncertainty that
each interval represents. We therefore propose a mapping from intervals to points that preserves their
location and internal variation and allows new hybrid $L_1$, $L_2$ and $L_\infty$ distances to be
formulated, all based on the $L_q$ distance for point data. These distances are used to perform
dynamic clustering of interval data.

Let $\Gamma$ be a set of $N$ observations of $p$-dimensional interval data,
$\Gamma = \{I_1, I_2, \ldots, I_n, \ldots, I_N\}$. A multivariate interval instance $I_n \in \Gamma$ is
given by
$$ I_n = \left([a_{1n}, b_{1n}], [a_{2n}, b_{2n}], \ldots, [a_{pn}, b_{pn}]\right), \tag{1} $$
where $n = 1, 2, \ldots, N$ and $a_{jn} \le b_{jn}$ for $j = 1, \ldots, p$. Consider a partition of the
set $\Gamma$ into $K$ clusters. Let $C_k$ be the $k$-th cluster and $G_k$ its $p$-dimensional interval
prototype. Partitional dynamic clustering over $\Gamma$ seeks to minimize the criterion $J_\phi$,
defined by
$$ J_\phi = \sum_{k=1}^{K} \sum_{I_n \in C_k} \phi(I_n, G_k), \tag{2} $$
where $\phi$ is a distance function. For a generic interval instance $I_n$, the mapping $M$, which
preserves location and internal variation, generates one point and one vector (both $p$-dimensional)
and is given by
$$ ([a_{1n}, b_{1n}], \ldots, [a_{pn}, b_{pn}]) \xrightarrow{M} \{(a_{1n}, \ldots, a_{pn}), (\Delta_{1n}, \ldots, \Delta_{pn})\}, \tag{3} $$
with $\Delta_{jn} = b_{jn} - a_{jn}$. Because two different kinds of information are used (lower bounds
for location and ranges for internal variation), the mapping is hybrid.
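The mapping in (3) can be illustrated with a short sketch. The function name `map_interval` and the representation of an instance as a sequence of (lower, upper) pairs are assumptions made for this illustration only.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound a, upper bound b), with a <= b


def map_interval(instance: Sequence[Interval]) -> Tuple[List[float], List[float]]:
    """Map a p-dimensional interval instance to (lower bounds, ranges), as in (3)."""
    lowers: List[float] = []
    ranges: List[float] = []
    for a, b in instance:
        if a > b:
            raise ValueError(f"invalid interval [{a}, {b}]: lower bound exceeds upper bound")
        lowers.append(a)      # location information a_j
        ranges.append(b - a)  # internal variation Delta_j = b_j - a_j
    return lowers, ranges


# The two-dimensional instance ([1, 4], [10, 12]) maps to the point (1, 10)
# and the range vector (3, 2).
point, delta = map_interval([(1.0, 4.0), (10.0, 12.0)])
```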
The hybrid $L_1$ ($HL_1$) distance is given by
$$ d_{HL_1}(I_n, G_k) = \sum_{j=1}^{p} \left( |a_{jn} - a_{jG_k}| + |\Delta_{jn} - \Delta_{jG_k}| \right). \tag{4} $$
The hybrid $L_2$ ($HL_2$) distance has the expression
$$ d_{HL_2}(I_n, G_k) = \sum_{j=1}^{p} \left( |a_{jn} - a_{jG_k}|^2 + |\Delta_{jn} - \Delta_{jG_k}|^2 \right). \tag{5} $$
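A minimal sketch of how the $HL_1$ and $HL_2$ distances in (4) and (5) could drive the minimization of criterion (2). It works on instances already mapped to (lower bounds, ranges). The alternating allocation and representation loop, the random initialization from data instances, and the prototype updates (coordinate-wise median for $HL_1$ and mean for $HL_2$, which minimize sums of absolute and squared deviations, respectively) are assumptions of this illustration, since the text does not spell out these steps. All function names are hypothetical.

```python
import random
from statistics import mean, median
from typing import Callable, List, Sequence, Tuple

# A mapped instance: (lower bounds a_1..a_p, ranges Delta_1..Delta_p), as in (3).
Mapped = Tuple[List[float], List[float]]


def hl1(x: Mapped, g: Mapped) -> float:
    """Hybrid L1 distance, equation (4)."""
    (a, d), (ag, dg) = x, g
    return sum(abs(a[j] - ag[j]) + abs(d[j] - dg[j]) for j in range(len(a)))


def hl2(x: Mapped, g: Mapped) -> float:
    """Hybrid L2 distance, equation (5); no square root is taken."""
    (a, d), (ag, dg) = x, g
    return sum((a[j] - ag[j]) ** 2 + (d[j] - dg[j]) ** 2 for j in range(len(a)))


def prototype(members: Sequence[Mapped], center: Callable) -> Mapped:
    """Coordinate-wise center (median or mean) of a cluster's mapped instances."""
    p = len(members[0][0])
    return ([center([m[0][j] for m in members]) for j in range(p)],
            [center([m[1][j] for m in members]) for j in range(p)])


def dynamic_clustering(data, k, dist=hl1, center=median, iters=50, seed=0):
    """Alternate allocation and representation steps to reduce criterion (2)."""
    rng = random.Random(seed)
    protos = rng.sample(list(data), k)  # assumed initialization: k random instances
    labels = None
    for _ in range(iters):
        # Allocation step: assign each instance to its nearest prototype.
        new_labels = [min(range(k), key=lambda c: dist(x, protos[c])) for x in data]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels
        # Representation step: recompute the prototype of each non-empty cluster.
        for c in range(k):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                protos[c] = prototype(members, center)
    criterion = sum(dist(x, protos[lab]) for x, lab in zip(data, labels))
    return labels, protos, criterion


# Usage with HL1 and median prototypes; for HL2 pass dist=hl2, center=mean.
data = [([0.0], [1.0]), ([1.0], [1.0]), ([2.0], [1.0]),
        ([100.0], [3.0]), ([101.0], [3.0]), ([102.0], [3.0])]
labels, protos, criterion = dynamic_clustering(data, k=2)
```

Because the result depends on the initialization, the procedure would normally be restarted several times and the solution with the lowest criterion kept.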
The hybrid $L_\infty$ ($HL_\infty$) distance is proposed as follows:
$$ d_{HL_\infty}(I_n, G_k) = \max_{j=1}^{p}\{|a_{jn} - a_{jG_k}|\} + \max_{j=1}^{p}\{|\Delta_{jn} - \Delta_{jG_k}|\}, \tag{6} $$
where $\max\{\cdot\}$ is the maximum function. To compare the quality of the clustering results, the
adjusted Rand index (ARI) is computed on a synthetic dataset. ARI values closer to 1 indicate stronger
agreement between the obtained clusters and a known partition. The bootstrap method is used to
construct non-parametric 95% confidence intervals for the mean ARI obtained with the $L_1$, $L_2$,
$L_\infty$, $HL_1$, $HL_2$ and $HL_\infty$ distances.

In the synthetic dataset, intervals are constructed by randomly drawing values for centers and ranges,
which yields three clusters: two ellipsoidal (with 150 elements) and one spherical (with 50 elements).
The centers, with coordinates $(c_x, c_y)$, follow bivariate normal distributions with parameters
$\mu$ and $\Sigma$, where
$$ \mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}, $$
with the following values: Cluster 1: $\mu_x = 30$, $\mu_y = 10$, $\sigma_x^2 = 100$ and
$\sigma_y^2 = 25$; Cluster 2: $\mu_x = 50$, $\mu_y = 30$, $\sigma_x^2 = 36$ and $\sigma_y^2 = 144$;
and Cluster 3: $\mu_x = 30$, $\mu_y = 35$, $\sigma_x^2 = 16$ and $\sigma_y^2 = 16$. The ranges are
generated using uniform distributions over an interval $[v, u]$, denoted $Un(v, u)$. The rectangle
centered at the point $(c_{xi}, c_{yi})$ has ranges $\gamma_{xi}$ and $\gamma_{yi}$ for $x$ and $y$,
respectively. The interval data are constructed as
$$ ([c_{xi} - \gamma_{xi}/2,\; c_{xi} + \gamma_{xi}/2],\; [c_{yi} - \gamma_{yi}/2,\; c_{yi} + \gamma_{yi}/2]). \tag{7} $$
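The construction in (7) can be sketched as follows. The means and variances are those stated in the text and the range bounds are those listed in Table 1. Reading "150 elements" as 150 per ellipsoidal cluster, the seed handling, and the names `CLUSTERS` and `make_dataset` are assumptions of this illustration.

```python
import math
import random
from typing import List, Tuple

Interval = Tuple[float, float]

# (size, mu_x, mu_y, var_x, var_y, gamma_x bounds, gamma_y bounds) per cluster.
# 150 elements for each ellipsoidal cluster is an assumed reading of the text.
CLUSTERS = [
    (150, 30.0, 10.0, 100.0, 25.0, (4.0, 7.0), (1.0, 3.0)),
    (150, 50.0, 30.0, 36.0, 144.0, (1.0, 2.0), (6.0, 9.0)),
    (50, 30.0, 35.0, 16.0, 16.0, (2.0, 3.0), (3.0, 6.0)),
]


def make_dataset(seed: int = 0) -> Tuple[List[Tuple[Interval, Interval]], List[int]]:
    """Draw one synthetic interval dataset and its known partition."""
    rng = random.Random(seed)
    data: List[Tuple[Interval, Interval]] = []
    truth: List[int] = []
    for label, (size, mx, my, vx, vy, gx, gy) in enumerate(CLUSTERS):
        for _ in range(size):
            # Diagonal covariance: the center coordinates are independent normals.
            cx = rng.gauss(mx, math.sqrt(vx))
            cy = rng.gauss(my, math.sqrt(vy))
            # Ranges are drawn from the cluster- and dimension-specific uniforms.
            rx = rng.uniform(*gx)
            ry = rng.uniform(*gy)
            # Equation (7): center minus/plus half the range in each dimension.
            data.append(((cx - rx / 2, cx + rx / 2), (cy - ry / 2, cy + ry / 2)))
            truth.append(label)
    return data, truth


data, truth = make_dataset(seed=1)
```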
A general configuration is used for the ranges, where the uniform distributions differ across
clusters and dimensions. Table 1 shows these distributions.

Table 1: Uniform distributions for interval ranges
Cluster    γx distribution    γy distribution
1          Un(4, 7)           Un(1, 3)
2          Un(1, 2)           Un(6, 9)
3          Un(2, 3)           Un(3, 6)

Table 2 presents the non-parametric confidence intervals for this synthetic configuration. One hundred
datasets were generated. For each one, the clustering was applied 100 times and the solution with the
lowest criterion value was selected, resulting in 100 ARI values. The bootstrap method was applied to
the ARI values with 2000 repetitions and a confidence level of 95%.

Table 2: Non-parametric confidence intervals for the comparison of distances
Distance    ARI confidence interval
L1          [0.72, 0.77]
HL1         [0.85, 0.87]
L2          [0.51, 0.56]
HL2         [0.51, 0.55]
L∞          [0.81, 0.84]
HL∞         [0.78, 0.82]

The confidence intervals for the ARI means reveal that $HL_1$ gives the best fit, since its limits
are greater than those of the other distances.
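The evaluation step can be sketched as below: an adjusted Rand index between an obtained partition and the known one, followed by a bootstrap confidence interval for the mean of a list of ARI values. The text states only that a bootstrap with 2000 repetitions and 95% confidence was used; the percentile method, the function names, and the sample ARI values are assumptions of this illustration.

```python
import random
from collections import Counter
from math import comb
from typing import List, Sequence, Tuple


def adjusted_rand_index(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Adjusted Rand index between two partitions of the same instances."""
    total = comb(len(labels_a), 2)
    if total == 0:
        return 1.0  # fewer than two instances: treated as perfect agreement
    cells = Counter(zip(labels_a, labels_b))  # contingency table counts
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (maximum - expected)


def bootstrap_mean_ci(values: List[float], n_boot: int = 2000,
                      confidence: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    alpha = 1.0 - confidence
    low = means[int((alpha / 2) * n_boot)]
    high = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return low, high


# Made-up ARI values, for illustration only (not results from the study).
ari_values = [0.80, 0.85, 0.90, 0.75, 0.88]
low, high = bootstrap_mean_ci(ari_values)
```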