Dynamic clustering of interval data based on hybrid L1, L2 and L∞ distances

Leandro C. Souza 1,2,*, Renata M. C. R. Souza 1, Getúlio J. A. Amaral 3

1. Universidade Federal de Pernambuco (UFPE), Cin, Recife - PE, Brazil
2. Universidade Federal Rural do Semi-Árido (UFERSA), DCEN, Mossoró - RN, Brazil
3. Universidade Federal de Pernambuco (UFPE), DE, Recife - PE, Brazil
* Contact author: lcs6@cin.ufpe.br

Keywords: Interval Distance, Interval Symbolic Data, Dynamic Clustering

Cluster analysis is a traditional approach to exploratory knowledge discovery. Dynamic partitional clustering partitions the data and associates a prototype with each partition. Distance measures are necessary to perform the clustering. In the literature, a variety of distances have been proposed for clustering interval data, such as the $L_1$, $L_2$ and $L_\infty$ distances. However, interval data carry an extra kind of information that is not present in point data: the variation or uncertainty that the intervals represent. We therefore propose a mapping from intervals to points that preserves their location and internal variation and allows us to formulate new hybrid $L_1$, $L_2$ and $L_\infty$ distances, all based on the $L_q$ distance for point data. These distances are used to perform dynamic clustering of interval data.

Let $\Gamma$ be a set of $p$-dimensional interval data with $N$ observations, $\Gamma = \{I_1, I_2, \dots, I_n, \dots, I_N\}$. A multivariate interval instance $I_n \in \Gamma$ is given by

\[ I_n = \left([a_n^1, b_n^1], [a_n^2, b_n^2], \dots, [a_n^p, b_n^p]\right), \tag{1} \]

where $n = 1, 2, \dots, N$ and $a_n^j \le b_n^j$ for $j = 1, \dots, p$. Consider a partition of the set $\Gamma$ into $K$ clusters. Let $G_k$ be the $p$-dimensional interval prototype of class $k$ and $C_k$ the $k$th class. Partitional dynamic clustering over $\Gamma$ seeks to minimize the criterion $J_\phi$, defined by

\[ J_\phi = \sum_{k=1}^{K} \sum_{I_n \in C_k} \phi(I_n, G_k), \tag{2} \]

where $\phi$ is a distance function.
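As a minimal sketch of how the criterion in Eq. (2) can be evaluated, the Python fragment below computes $J_\phi$ for a given partition and shows an allocation step that assigns each observation to its closest prototype. The function names (`city_block`, `allocate`, `criterion`) and the toy data are illustrative assumptions, not part of the original work; `city_block` is only a stand-in for $\phi$ that sums the absolute differences of the interval bounds, and the prototype update step is deliberately left out because it depends on the chosen distance.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]        # (a, b) with a <= b
IntervalVector = Sequence[Interval]   # one p-dimensional observation I_n
Distance = Callable[[IntervalVector, IntervalVector], float]


def city_block(x: IntervalVector, g: IntervalVector) -> float:
    """Stand-in for phi (an assumption): sum of absolute bound differences."""
    return sum(abs(a - ag) + abs(b - bg) for (a, b), (ag, bg) in zip(x, g))


def allocate(data: Sequence[IntervalVector],
             prototypes: Sequence[IntervalVector],
             phi: Distance) -> List[int]:
    """Allocation step: index of the closest prototype for each observation."""
    return [min(range(len(prototypes)), key=lambda k: phi(x, prototypes[k]))
            for x in data]


def criterion(data: Sequence[IntervalVector],
              prototypes: Sequence[IntervalVector],
              labels: Sequence[int],
              phi: Distance) -> float:
    """J_phi of Eq. (2): total distance from each observation to its prototype."""
    return sum(phi(x, prototypes[k]) for x, k in zip(data, labels))


# Toy example with p = 2 and K = 2 (hypothetical values).
data = [
    [(0.0, 1.0), (0.0, 2.0)],
    [(0.5, 1.5), (0.0, 2.0)],
    [(10.0, 12.0), (5.0, 6.0)],
]
prototypes = [
    [(0.0, 1.0), (0.0, 2.0)],
    [(10.0, 12.0), (5.0, 6.0)],
]
labels = allocate(data, prototypes, city_block)
print(labels, criterion(data, prototypes, labels, city_block))  # [0, 0, 1] 1.0
```

Passing the distance as a parameter keeps the criterion independent of the particular choice of $\phi$, so any interval distance with the same signature can be substituted.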
For a generic interval instance $I_n$, the mapping $M$ that preserves location and internal variation generates one point and one vector (both $p$-dimensional) and is given by

\[ \left([a_n^1, b_n^1], \dots, [a_n^p, b_n^p]\right) \xrightarrow{\;M\;} \left\{(a_n^1, \dots, a_n^p),\ (\Delta_n^1, \dots, \Delta_n^p)\right\}, \tag{3} \]

with $\Delta_n^j = b_n^j - a_n^j$. Because two different kinds of information are used (location and range), the mapping is hybrid.

The hybrid $L_1$ ($HL_1$) distance is given by

\[ d_{HL_1}(I_n, G_k) = \sum_{j=1}^{p} \left( |a_n^j - a_{G_k}^j| + |\Delta_n^j - \Delta_{G_k}^j| \right). \tag{4} \]

The hybrid $L_2$ ($HL_2$) distance has the expression

\[ d_{HL_2}(I_n, G_k) = \sum_{j=1}^{p} \left( |a_n^j - a_{G_k}^j|^2 + |\Delta_n^j - \Delta_{G_k}^j|^2 \right). \tag{5} \]

The hybrid $L_\infty$ ($HL_\infty$) distance is proposed as follows:

\[ d_{HL_\infty}(I_n, G_k) = \max_{j=1}^{p} \{|a_n^j - a_{G_k}^j|\} + \max_{j=1}^{p} \{|\Delta_n^j - \Delta_{G_k}^j|\}, \tag{6} \]

where $\max\{\cdot\}$ is the maximum function.

To compare the quality of the clustering results, the adjusted Rand index (ARI) is used with a synthetic dataset. ARI values closer to 1 indicate stronger agreement between the obtained clusters and a known partition. The bootstrap method is used to construct non-parametric 95% confidence intervals for the mean ARI obtained with the $L_1$, $L_2$, $L_\infty$, $HL_1$, $HL_2$ and $HL_\infty$ distances.

In the synthetic dataset, intervals are constructed by randomly drawing values for centers and ranges, which yields three clusters: two ellipsoidal (with 150 elements) and a third, spherical one (with 50 elements). The centers, with coordinates $(c_x, c_y)$, follow bivariate normal distributions with parameters $\mu$ and $\Sigma$, where

\[ \mu = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}, \]

with the following values:

Cluster 1: $\mu_x = 30$, $\mu_y = 10$, $\sigma_x^2 = 100$ and $\sigma_y^2 = 25$;
Cluster 2: $\mu_x = 50$, $\mu_y = 30$, $\sigma_x^2 = 36$ and $\sigma_y^2 = 144$;
Cluster 3: $\mu_x = 30$, $\mu_y = 35$, $\sigma_x^2 = 16$ and $\sigma_y^2 = 16$.

The ranges are generated using uniform distributions over an interval $[v, u]$, denoted $Un(v, u)$. The rectangle centered at the point $(c_{xi}, c_{yi})$ has ranges $\gamma_{xi}$ and $\gamma_{yi}$ for $x$ and $y$, respectively.
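The mapping in Eq. (3) and the hybrid distances in Eqs. (4) to (6) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`to_point_and_range`, `hybrid_l1`, `hybrid_l2`, `hybrid_linf`) and the example intervals are hypothetical, and each interval is represented as a `(lower, upper)` tuple.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]        # (a, b) with a <= b
IntervalVector = Sequence[Interval]   # p-dimensional interval observation


def to_point_and_range(x: IntervalVector) -> Tuple[List[float], List[float]]:
    """Mapping M of Eq. (3): lower bounds (location) and ranges (variation)."""
    lower = [a for a, _ in x]
    delta = [b - a for a, b in x]
    return lower, delta


def hybrid_l1(x: IntervalVector, g: IntervalVector) -> float:
    """HL1 of Eq. (4): absolute differences of locations plus those of ranges."""
    ax, dx = to_point_and_range(x)
    ag, dg = to_point_and_range(g)
    return (sum(abs(u - v) for u, v in zip(ax, ag))
            + sum(abs(u - v) for u, v in zip(dx, dg)))


def hybrid_l2(x: IntervalVector, g: IntervalVector) -> float:
    """HL2 of Eq. (5): squared differences of locations plus those of ranges."""
    ax, dx = to_point_and_range(x)
    ag, dg = to_point_and_range(g)
    return (sum((u - v) ** 2 for u, v in zip(ax, ag))
            + sum((u - v) ** 2 for u, v in zip(dx, dg)))


def hybrid_linf(x: IntervalVector, g: IntervalVector) -> float:
    """HL-infinity of Eq. (6): largest location gap plus largest range gap."""
    ax, dx = to_point_and_range(x)
    ag, dg = to_point_and_range(g)
    return (max(abs(u - v) for u, v in zip(ax, ag))
            + max(abs(u - v) for u, v in zip(dx, dg)))


# Hypothetical observation and prototype with p = 2.
x = [(1.0, 4.0), (2.0, 3.0)]   # lower bounds (1, 2), ranges (3, 1)
g = [(0.0, 2.0), (2.0, 6.0)]   # lower bounds (0, 2), ranges (2, 4)
print(hybrid_l1(x, g))    # 5.0  = (1 + 0) + (1 + 3)
print(hybrid_l2(x, g))    # 11.0 = (1 + 0) + (1 + 9)
print(hybrid_linf(x, g))  # 4.0  = max(1, 0) + max(1, 3)
```

In this example the two intervals differ more in range than in location in the second dimension, which the range term of each hybrid distance captures separately from the location term.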
Interval data are constructed as

\[ \left([c_{xi} - \gamma_{xi}/2,\ c_{xi} + \gamma_{xi}/2],\ [c_{yi} - \gamma_{yi}/2,\ c_{yi} + \gamma_{yi}/2]\right). \tag{7} \]

A general configuration is used for the ranges, in which the uniform distributions differ across clusters and dimensions. Table 1 shows these distributions.

Table 1: Uniform distributions for interval ranges

Cluster    γx distribution    γy distribution
1          Un(4, 7)           Un(1, 3)
2          Un(1, 2)           Un(6, 9)
3          Un(2, 3)           Un(3, 6)

Table 2 presents the non-parametric confidence intervals for this synthetic configuration. 100 datasets were generated. For each one, the clustering was applied 100 times, and the solution with the lowest criterion value was selected, resulting in 100 ARI values. The bootstrap method is applied to the ARI values with 2000 repetitions and a confidence level of 95%.

Table 2: Non-parametric confidence intervals for the comparison of distances

Distance    ARI confidence interval
L1          [0.72, 0.77]
HL1         [0.85, 0.87]
L2          [0.51, 0.56]
HL2         [0.51, 0.55]
L∞          [0.81, 0.84]
HL∞         [0.78, 0.82]

The confidence intervals for the ARI means reveal that HL1 gives the best fit, since its limits are greater than those of the other distances.

References

[1] Chavent, M., Lechevallier, Y. (2002). Dynamical clustering of interval data: Optimization of an adequacy criterion based on Hausdorff distance, Classification, Clustering, and Data Analysis, 53–60.
[2] Souza, R. M. C. R., De Carvalho, F. D. A. T. (2004). Clustering of interval data based on city-block distances, Pattern Recognition Letters 25, 353–365.
[3] De Carvalho, F. D. A. T., Brito, P., Bock, H.-H. (2006). Dynamic clustering for interval data based on L2 distance, Computational Statistics 21, 231–250.