Comparison of some approaches to clustering categorical data

Rezankova H.1, Husek D.2, Kudova P.2, and Snasel V.3

1 University of Economics, Prague, nám. W. Churchilla 4, 130 67 Prague 3, Czech Republic, rezanka@vse.cz
2 Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic, dusan@cs.cas.cz, petra@cs.cas.cz
3 Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic, vaclav.snasel@vsb.cz

Summary. The mushrooms dataset from the UCI web page was analyzed by different clustering techniques. Hierarchical cluster analysis with several linkage methods, the k-means algorithm, the CLARA algorithm, two-step cluster analysis, a self-organizing map, growing cell structures, and genetic algorithms were applied to this dataset. The objects were clustered into different numbers of clusters. The obtained results are compared with published results obtained by the ROCK, k-modes, and k-histograms algorithms.

Key words: cluster analysis, categorical data, neural networks

1 Introduction

A large amount of categorical data comes from different areas of research, in both the social and natural sciences. The data files comprise either dichotomous variables or variables containing more than two categories. Recently, many new techniques have been developed for the analysis of such data. These approaches stress similarity and dissimilarity measures for two objects (both for symmetric and asymmetric dichotomous variables). Further, some special methods have been designed, for example monothetic cluster analysis for binary data. However, at this moment such techniques are implemented in software packages only rarely and incompletely. For instance, the SPSS system offers similarity and dissimilarity measures for binary variables, and monothetic cluster analysis is implemented in the S-PLUS system.

To overcome this inconvenience, several different approaches have been used in the past. First, the dataset with multi-categorical variables can be transformed into a dataset with only binary variables. Here it is important to take the nominal or ordinal nature of the variables into account. After the transformation, methods for clustering binary data can be applied. However, this approach has some drawbacks: the transformation process is time-consuming and, what is more, the number of resulting variables can be extremely high.

Another possibility is to use the simple matching coefficient s_ij, the ratio of the number of matching pairs of variable values to the total number of variables. This approach is not used frequently because it is not implemented in most common software packages. Percent disagreement (1 - s_ij) is offered in the STATISTICA system. With hierarchical cluster analysis, this measure provides the assignment to clusters only in graphical form (a dendrogram), so the results are hard to interpret for large datasets.

Besides the traditional approaches mentioned above, we can use the log-likelihood measure. This measure is implemented in the TwoStep Cluster Analysis procedure in the SPSS system (from version 11.5). It is a probability-based distance: the distance between two clusters is related to the decrease in log-likelihood as they are combined into one cluster. The distance between clusters C_h and C_{h'} is defined as

    d_{hh'} = \xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle} ,    (1)

where the index \langle h,h' \rangle represents the cluster formed by combining clusters h and h', and

    \xi_g = \sum_{l=1}^{p} \sum_{m=1}^{K_l} N_{glm} \log \frac{N_{glm}}{N_g} ,    (2)

where N_g is the number of objects in the g-th cluster, p is the number of variables, K_l is the number of categories of the l-th categorical variable, and N_glm is the frequency of the m-th category of the l-th categorical variable in the g-th cluster.
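To make this distance concrete, the following is a minimal Python sketch of (1) and (2). The function names and the toy clusters are ours, and each cluster is assumed to be given as an array of category labels with one row per object and one column per categorical variable; this is an illustration of the formulas, not the SPSS implementation.

```python
import numpy as np

def xi(cluster):
    """xi_g from (2): sum over variables l and categories m of
    N_glm * log(N_glm / N_g)."""
    cluster = np.asarray(cluster)
    n_g = cluster.shape[0]                     # N_g: objects in the cluster
    total = 0.0
    for l in range(cluster.shape[1]):          # loop over categorical variables
        _, counts = np.unique(cluster[:, l], return_counts=True)  # N_glm
        total += np.sum(counts * np.log(counts / n_g))
    return total

def loglikelihood_distance(c1, c2):
    """d_hh' from (1): the decrease in log-likelihood caused by
    merging clusters c1 and c2."""
    merged = np.vstack([np.asarray(c1), np.asarray(c2)])
    return xi(c1) + xi(c2) - xi(merged)

# two toy clusters described by three categorical variables
c1 = [["a", "x", "u"], ["a", "x", "v"], ["a", "y", "u"]]
c2 = [["b", "y", "v"], ["b", "y", "v"]]
print(loglikelihood_distance(c1, c2))  # non-negative; larger means less similar
```

The distance is always non-negative, since merging two clusters can only decrease the log-likelihood of the cluster model.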
Further, some new techniques have been developed recently, for example CACTUS (CAtegorical ClusTering Using Summaries), see [GGR99], ROCK (RObust Clustering using linKs), see [GRS00], and k-histograms, see [HXD03]. Moreover, approaches based on neural networks (e.g. the self-organizing map) are used for this purpose.

The aim of this paper is to compare the results described in [GRS00] and [HXD03] with results obtained by statistical techniques implemented in some commercial statistical packages, and with neural networks and genetic algorithms. This comparison is either missing or incorrect in the mentioned publications.

2 The dataset

Several datasets with categorical variables are used as benchmarks for the comparison of different methods. Some of them are accessible on the Internet, and for some of them results have been published in different papers. One example is the mushrooms dataset, which can be downloaded from the UCI web page (Repository of machine learning databases, [NHB98]). We used this dataset for comparison with the results discussed in [GRS00, HXD03, HXD05].

The dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500–525). The number of objects is 8124 and the number of variables is 22 (all nominally valued). Each variable represents a physical characteristic (color, odor, size, shape, etc.). The 23rd variable indicates whether the mushroom is edible or poisonous; identification of the species, however, is not included. The numbers of edible and poisonous mushrooms are 4208 and 3916, respectively. The variables separate these two classes very well: logistic regression classifies the individual mushrooms without error, and discriminant analysis applied to the transformed binary data (see the next paragraph) misclassifies only three objects.

For traditional clustering algorithms, we transformed the data into a dataset with only binary variables. Each variable with K categories was transformed into K binary variables. Some categories described on the UCI web page are not used in the dataset; the new variables corresponding to these categories, containing only zero values, were omitted from the analysis. Further, we omitted the stalk-root variable, which contains 2480 missing values.

3 Methods used for analyses

We applied statistical methods, neural networks, and genetic algorithms to the data described above.

3.1 Statistical methods

The dataset with binary variables was analyzed by the k-means algorithm and several techniques of hierarchical cluster analysis (HCA) in the SPSS system, and by the CLARA (Clustering LARge Applications) algorithm, an extension of the PAM (Partitioning Around Medoids) algorithm, in the S-PLUS system.

We used the Jaccard coefficient (Jac) as a similarity measure for asymmetric binary variables, and the complete linkage (CL), single linkage (SL), average linkage within groups (ALWG), and average linkage between groups (ALBG) algorithms as hierarchical clustering methods. For clustering objects into two clusters, we also applied the Dice coefficient. The Jaccard and Dice coefficients can be expressed by the formula

    s_{ij} = \frac{\Theta \sum_{l=1}^{p} x_{il} x_{jl}}{\Theta \sum_{l=1}^{p} x_{il} x_{jl} + \sum_{l=1}^{p} |x_{il} - x_{jl}|} .    (3)

If Θ = 1, we obtain the Jaccard coefficient; with Θ = 2, the formula expresses the Dice coefficient.
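As a short illustration of (3), here is a minimal Python sketch for 0/1 vectors; the function name and the example vectors are ours.

```python
import numpy as np

def binary_similarity(x, y, theta=1):
    """Formula (3): theta=1 gives the Jaccard coefficient,
    theta=2 the Dice coefficient; x and y are 0/1 vectors."""
    x, y = np.asarray(x), np.asarray(y)
    matches = np.sum(x * y)              # joint presences (1-1 pairs)
    mismatches = np.sum(np.abs(x - y))   # 1-0 and 0-1 pairs
    return theta * matches / (theta * matches + mismatches)

x = [1, 1, 0, 1, 0, 0]
y = [1, 0, 0, 1, 1, 0]
print(binary_similarity(x, y, theta=1))  # Jaccard: 2 / (2 + 2) = 0.5
print(binary_similarity(x, y, theta=2))  # Dice: 4 / (4 + 2) = 0.667
```

Note that 0-0 pairs contribute to neither the numerator nor the denominator, which is what makes both coefficients suitable for asymmetric binary variables.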
The other clustering methods implemented in SPSS are appropriate for the squared Euclidean measure and therefore were not used for the overall comparison. In [GRS00], a Euclidean distance and a centroid-based hierarchical clustering algorithm were used. For the comparison with the results described in [GRS00], we applied both this approach and the more appropriate combination of the squared Euclidean measure with the centroid linkage algorithm.

Moreover, we used two-step cluster analysis (TSCA) with the log-likelihood measure in the SPSS system. It was applied to the dataset with both nominal and binary variables.

3.2 Neural networks and genetic algorithms

Neural networks (NN) and genetic algorithms were applied to the binary data. We used self-organizing maps (SOM), the growing cell structures network (GCS), and genetic algorithms (GAs).

The SOM is a vector quantization algorithm that places a number of reference (codebook) vectors into a high-dimensional data space. This network was proposed by Kohonen [KOH91] and is therefore often referred to as a Kohonen map. The SOM architecture is a regular two-dimensional array of neurons. Each neuron is associated with a weight vector w_i ∈ R^N, where R^N is the N-dimensional Euclidean space and i = 1, ..., k, with k the number of clusters. The resulting SOM in fact represents a projection of the probability density function of the high-dimensional input data onto the two-dimensional array. An input data vector x ∈ R^N is mapped onto the node c, where

    c = \arg\min_i \{ \|x - w_i\| \}

and ||·|| is the Euclidean norm. The weight vectors are adapted during the learning phase until the network reflects the density of the training data. There are many modifications of the SOM; a well-known one is the self-organizing feature map (SOFM) [KOH91, KOH01].

The growing cell structures network (GCS) [FRI94] builds on the SOFM modifications. Its neurons are not organized in a regular grid (as in the SOM) but are placed in the vertices of a network of k-dimensional simplexes (usually a two-dimensional network of triangles is used). The training process is similar to SOM training, except that the learning rate is constant over time and only the best matching unit is updated. The main difference is that the network architecture is adapted during the learning phase. Learning always starts with a minimal network, i.e. with one simplex. After a constant number of adaptation steps, new cells are added to the network and superfluous cells are removed. For a more precise description of the algorithm, see [FRI94].
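Both networks share the competitive step of finding the best matching unit. The following is a minimal SOM-style sketch in Python illustrating the rule c = argmin_i ||x - w_i|| and the neighbourhood-weighted update; the grid size, learning rate, Gaussian neighbourhood, and fixed number of epochs are illustrative assumptions (a practical SOM would decay the learning rate and neighbourhood width over time), not the configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(5, 5), epochs=20, lr=0.5, sigma=1.0):
    """Minimal SOM: codebook vectors on a regular 2-D grid; each input is
    mapped to the best matching unit and the weights of that unit and its
    grid neighbours are pulled towards the input."""
    rows, cols = grid
    w = rng.random((rows * cols, data.shape[1]))            # codebook vectors
    pos = np.array([(r, c) for r in range(rows) for c in range(cols)])
    for _ in range(epochs):
        for x in data:
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))  # best matching unit
            d2 = np.sum((pos - pos[bmu]) ** 2, axis=1)      # squared grid distances
            h = np.exp(-d2 / (2 * sigma ** 2))              # Gaussian neighbourhood
            w += lr * h[:, None] * (x - w)                  # pull weights towards x
    return w

# binary rows, like our transformed mushrooms data (toy sample)
data = rng.integers(0, 2, size=(100, 22)).astype(float)
weights = train_som(data)
clusters = np.argmin(np.linalg.norm(data[:, None] - weights, axis=2), axis=1)
```

After training, each object is assigned to the node whose codebook vector is nearest, so the nodes play the role of cluster prototypes.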
Evolutionary approaches such as GAs represent a class of stochastic algorithms applicable to a wide range of optimization problems (see, e.g., Koza [KOZ92]). They work with a population of individuals, where each individual represents one feasible solution to the given optimization problem. All individuals are evaluated with a fitness function; the better the solution, the higher the fitness value. New generations are created by means of selection, crossover, and mutation operators.

When applying a GA to clustering, an individual consists of k blocks corresponding to k clusters, and each block contains the vector (center) representing its cluster. We minimize the error

    E = \sum_{x_j} \|x_j - \mathrm{center}_s\|^2 , \quad s = \arg\min_i \|x_j - \mathrm{center}_i\| .

The lower the error, the higher the fitness, i.e. the probability that the corresponding individual will be selected for further reproduction. We use two types of crossover operator: the former is a classical one-point crossover applied to whole blocks, the latter modifies one randomly chosen block by combining the values of this block in the parent individuals. Mutation causes random perturbations of an individual.
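A minimal Python sketch of this scheme follows. The population size, mutation model, fitness transform 1/(1 + E), and the use of only the block-wise one-point crossover are illustrative assumptions of ours, not the exact operators and parameters used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

def error(individual, data):
    """E = sum_j ||x_j - center_s||^2 with s the nearest center."""
    d = np.linalg.norm(data[:, None] - individual, axis=2)  # (n_objects, k)
    return np.sum(np.min(d, axis=1) ** 2)

def evolve(data, k, pop_size=30, generations=50, mut_rate=0.1):
    """GA for clustering: an individual is a (k, dim) array of center blocks.
    Assumes k >= 2."""
    n, dim = data.shape
    pop = rng.random((pop_size, k, dim))
    for _ in range(generations):
        fitness = np.array([1.0 / (1.0 + error(ind, data)) for ind in pop])
        probs = fitness / fitness.sum()                 # fitness-proportional selection
        children = []
        for _ in range(pop_size):
            pa, pb = pop[rng.choice(pop_size, 2, p=probs)]
            cut = rng.integers(1, k)                    # one-point crossover over blocks
            child = np.vstack([pa[:cut], pb[cut:]])
            mask = rng.random(child.shape) < mut_rate   # mutation: random perturbation
            child[mask] += rng.normal(0, 0.1, mask.sum())
            children.append(child)
        pop = np.array(children)
    return min(pop, key=lambda ind: error(ind, data))   # best individual found

data = rng.integers(0, 2, size=(200, 22)).astype(float)  # toy binary data
centers = evolve(data, k=2)
labels = np.argmin(np.linalg.norm(data[:, None] - centers, axis=2), axis=1)
```

The fitness transform 1/(1 + E) simply maps a lower error to a higher selection probability, as required by the description above.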
4 Comparison of results

First, we clustered the objects into two clusters. Here we applied only the statistical methods, because neural networks and the genetic algorithm did not provide good results for this dataset. Table 1 shows the obtained results. With the SL, CL, and ALBG algorithms, edible mushrooms predominate in both clusters; these results are not included. Two-step cluster analysis was applied to both the nominal and the binary data, and the results of the two approaches were identical. Further, we should note that the k-means algorithm is the quickest, but its results depend on the order of the objects.

We evaluated the results by the clustering accuracy r according to [HXD03], generally defined as

    r = \frac{\sum_{v=1}^{k} a_v}{N} ,    (4)

where N is the number of objects in the dataset, k is the number of clusters, and a_v is the number of correctly assigned objects. The minimum accuracy is 61.6 percent (for ALWG with the Jaccard coefficient) and the maximum is 89.0 percent (for TSCA). For the k-histograms algorithm presented in [HXD03], the accuracy was about 57 percent.

Table 1. Comparison of clustering into 2 clusters by statistical methods

                              Number of mushrooms
Method             Edible    Edible   Poisonous   Poisonous   Accuracy
                   correct   wrong    correct     wrong
k-means            3836      372      1229        2687        62.3%
HCA, Jac, ALWG     3056      1152     1952        1964        61.6%
HCA, Dice, ALWG    3760      448      3100        816         84.4%
CLARA              4157      51       2988        928         87.9%
TSCA               4208      0        3024        892         89.0%

Moreover, we applied the statistical methods for clustering the objects into 21 clusters, as in [GRS00]. We found that the results obtained by hierarchical cluster analysis using the Jaccard coefficient as a similarity measure and the ALBG algorithm are identical to the results obtained by the ROCK algorithm described in [GRS00]. In that paper, the authors characterize the results by 20 "pure" clusters (the mushrooms in a "pure" cluster are either all edible or all poisonous); the minimum cluster size is 8, the maximum is 1728, and the number of clusters of size less than 100 equals 9. We obtained the same results also with some other similarity and distance measures, e.g. the Dice and Ochiai coefficients and the Euclidean measures.

As mentioned above, we also used the centroid linkage algorithm with the squared Euclidean and Euclidean measures. We applied these approaches for clustering the objects into 20 clusters, as in [GRS00]. In the former case, we obtained 18 "pure" clusters (in the sense mentioned above); the minimum and maximum cluster sizes were 8 and 1728, respectively. In the latter case, we obtained results entirely different from those described in [GRS00]: there was one majority cluster with 8098 objects, and the other 19 clusters were "pure".

Furthermore, we tried to cluster the objects into 23 clusters, because the dataset includes 23 species (this case is not included in [GRS00]). We found two techniques that give all "pure" clusters: the SL and ALBG algorithms with the Jaccard coefficient. A comparison of some methods from the viewpoint of the number of "pure" clusters is shown in Table 2.

Table 2. Number of "pure" clusters for different methods and numbers of clusters

                     Total number of clusters
Method               2    4    6   12   17   22   23   25
k-means              0    0    0    2    9   16   16   16
HCA, Jac, CL         0    2    2    9   15   20   21   23
HCA, Jac, ALWG       0    1    2    7   12   18   19   21
HCA, Jac, ALBG       1    2    3    8   13   21   23   25
HCA, Jac, SL         1    3    4   10   14   22   23   25
TSCA – binary        1    3    4    8   14   20   21   24
TSCA – nominal       1    3    4    8   14   20   21   22
CLARA                0    0    0    7    7   13   15   16
k-modes*             0    0    1    3    6    9   12   12
k-histograms*        0    0    1    7   12   18   20   21

* Results published in [HXD03].

Neural networks and the genetic algorithm were also applied for different numbers of clusters. The number of iterations in the GCS algorithm depends on the number of clusters: for 6, 12, 17, 22, and 32 clusters it was 500, 1000, 1500, 2000, and 3000 iterations, respectively. The GCS algorithm was run ten times and the run with the best results was chosen. These results were compared with the results obtained by the statistical methods. Table 3 shows this comparison from the viewpoint of accuracy (in the sense of (4), for the separation of edible and poisonous mushrooms). As concerns GA and SOM, the GA achieves 89.5% for 4 clusters and the SOM 96.0% for 25 clusters.

Table 3. Accuracy for different methods and numbers of clusters

                       Total number of clusters
Method               4       6       12      17      22      23      32
k-means              78.3%   80.2%   92.8%   94.7%   95.7%   95.8%   98.5%
HCA, Jac, CL         76.2%   82.7%   97.4%   98.7%   98.7%   99.2%   99.2%
HCA, Jac, ALWG       88.8%   88.8%   95.1%   98.4%   99.0%   99.0%   99.4%
HCA, Jac, ALBG       68.0%   89.3%   89.5%   94.4%   99.6%   100.0%  100.0%
HCA, Jac, SL         68.2%   89.4%   89.7%   91.2%   100.0%  100.0%  100.0%
CLARA                90.7%   75.5%   93.1%   96.0%   93.7%   96.8%   98.7%
TSCA – binary        89.0%   89.0%   95.7%   97.2%   98.8%   99.4%   99.6%
TSCA – nominal       89.0%   89.0%   93.4%   98.2%   99.4%   99.4%   99.4%
GCS                  x       90.0%   92.9%   90.4%   93.9%   91.0%   95.9%

5 Conclusion

Although cluster analysis of nominal variables is a popular topic in the literature, it is only rarely implemented in software packages. For large datasets, two-step cluster analysis in SPSS can be used. As the special methods are either not offered or cannot be used for large datasets, users have to transform the dataset into binary variables in order to use traditional techniques. We compared both approaches on the mushrooms dataset. We found that neither neural networks nor the genetic algorithm are better than the statistical techniques with respect to the investigated parameters. For distinguishing edible and poisonous mushrooms in 23 clusters (motivated by the 23 species), we found two techniques that give all "pure" clusters: the single linkage and average linkage between groups algorithms with the Jaccard coefficient. Furthermore, we found that the results for 21 clusters obtained by hierarchical cluster analysis with the average linkage between groups algorithm are identical to the results obtained by the ROCK algorithm described in [GRS00].
References

[FRI94] Fritzke, B.: Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks, 7, 1141–1160 (1994)
[GGR99] Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS – clustering categorical data using summaries. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 73–83. ACM Press, San Diego (1999)
[GRS00] Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Information Systems, 25, 345–366 (2000)
[HXD03] He, Z., Xu, X., Deng, S., Dong, B.: K-histograms: an efficient clustering algorithm for categorical dataset. Technical Report No. Tr-2003-08, Harbin Institute of Technology (2003)
[HXD05] He, Z., Xu, X., Deng, S.: TCSOM: clustering transactions using self-organizing map. Neural Processing Letters, 22, 249–262 (2005)
[KOH91] Kohonen, T.: The self-organizing map. Proc. IEEE, 78, 1464–1480 (1991)
[KOH01] Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Berlin Heidelberg New York (2001)
[KOZ92] Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
[NHB98] Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html

Acknowledgement

This work was partially supported by grant 201/05/0079 awarded by the Grant Agency of the Czech Republic, by project No. 1ET100300414, and by the Institutional Research Plan AVOZ10300504.