Comparison of some approaches to clustering categorical data

Rezankova H.¹, Husek D.², Kudova P.², and Snasel V.³

¹ University of Economics, Prague, nám. W. Churchilla 4, 130 67 Prague 3, Czech Republic, rezanka@vse.cz
² Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague, Czech Republic, dusan@cs.cas.cz, petra@cs.cas.cz
³ Technical University of Ostrava, 17. listopadu 15, 708 33 Ostrava-Poruba, Czech Republic, vaclav.snasel@vsb.cz
Summary. The mushrooms dataset from the UCI web page was analyzed by different clustering techniques. Hierarchical cluster analysis with several linkage methods, the k-means algorithm, the CLARA algorithm, two-step cluster analysis, self-organizing maps, growing cell structures and genetic algorithms were applied to this dataset. The objects were clustered into different numbers of clusters. The obtained results are compared with published results obtained by the ROCK, k-modes and k-histograms algorithms.
Key words: cluster analysis, categorical data, neural networks
1 Introduction
A large amount of categorical data comes from different areas of research, in both the social and the natural sciences. The data files comprise either dichotomous variables or variables containing more than two categories.
Recently, many new techniques have been developed for the analysis of this kind of data. These approaches stress similarity and dissimilarity measures for two objects (both for symmetric and asymmetric dichotomous variables). Further, some special methods have been designed, for example monothetic cluster analysis for binary data. However, at this moment such techniques are implemented in software packages only rarely and incompletely. For instance, the SPSS system offers similarity and dissimilarity measures for binary variables, while monothetic cluster analysis is implemented in the S-PLUS system.
To overcome this inconvenience, several different approaches have been used in the past. First, the dataset with multi-categorical variables can be transformed into a dataset with only binary variables. Here it is important to take the nominal or ordinal nature of the variables into account. After the transformation, methods for clustering binary data can be applied. However, this approach has some peculiarities: the transformation process is time-consuming and, moreover, the number of resulting variables can be extremely high.
Another possibility is to use the simple matching coefficient s_ij, the ratio of the number of matching pairs of variable values to the total number of variables. This approach is not used frequently because it is not implemented in most common software packages. Percent disagreement (1 − s_ij) is offered in the STATISTICA system. In hierarchical cluster analysis, this measure provides the assignment to clusters only in graphical form (a dendrogram), so the results are hard to interpret for large datasets.
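For illustration, a minimal sketch of the simple matching coefficient and of the percent disagreement for two objects (the function names and the NumPy-based implementation are our own, not taken from any of the packages mentioned):

```python
import numpy as np

def simple_matching(x_i, x_j):
    """Simple matching coefficient s_ij: share of variables on which
    two objects take the same category value."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    return np.mean(x_i == x_j)

def percent_disagreement(x_i, x_j):
    """Dissimilarity 1 - s_ij, as offered e.g. by STATISTICA."""
    return 1.0 - simple_matching(x_i, x_j)

# Two objects described by five categorical variables
a = ["red", "small", "round", "smooth", "edible"]
b = ["red", "large", "round", "smooth", "poisonous"]
print(simple_matching(a, b))       # 0.6 (3 of 5 values match)
print(percent_disagreement(a, b))  # 0.4
```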
Besides the traditional approaches mentioned above, we can use the log-likelihood measure. This measure is implemented in the TwoStep Cluster Analysis procedure in the SPSS system (from version 11.5). It is a probability-based distance: the distance between two clusters is related to the decrease in log-likelihood when they are combined into one cluster. The distance between clusters C_h and C_h' is defined as
$$ d_{hh'} = \xi_h + \xi_{h'} - \xi_{\langle h,h' \rangle}, \qquad (1) $$

where the index $\langle h,h' \rangle$ represents the cluster formed by combining clusters $h$ and $h'$, and

$$ \xi_g = \sum_{l=1}^{p} \sum_{m=1}^{K_l} N_{glm} \log \frac{N_{glm}}{N_g}, \qquad (2) $$
where $N_g$ is the number of objects in the $g$th cluster, $K_l$ is the number of categories of the $l$th categorical variable, and $N_{glm}$ is the frequency of the $m$th category of the $l$th categorical variable in the $g$th cluster.
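For concreteness, the following sketch computes $\xi_g$ and the distance (1) for two clusters of categorical objects; it is our own illustration of formulas (1)–(2), not the actual SPSS implementation:

```python
import numpy as np

def xi(cluster):
    """xi_g of formula (2); cluster is a 2-D array of
    objects x categorical variables."""
    n_g = cluster.shape[0]
    total = 0.0
    for l in range(cluster.shape[1]):            # over the p variables
        _, counts = np.unique(cluster[:, l], return_counts=True)
        total += np.sum(counts * np.log(counts / n_g))  # over the K_l categories
    return total

def log_likelihood_distance(c_h, c_hp):
    """Distance (1): decrease in log-likelihood when the
    two clusters are merged into one."""
    merged = np.vstack([c_h, c_hp])
    return xi(c_h) + xi(c_hp) - xi(merged)

c1 = np.array([["e", "x"], ["e", "y"]])          # toy clusters
c2 = np.array([["p", "x"]])
print(log_likelihood_distance(c1, c2))           # >= 0; larger = more dissimilar
```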
Further, some new techniques have been developed recently, for example CACTUS (CAtegorical ClusTering Using Summaries), see [GGR99], ROCK (RObust Clustering using linKs), see [GRS00], and k-histograms, see [HXD03]. Moreover, neural-network-based approaches (e.g. the self-organizing map) are used for this purpose.
The aim of this paper is to compare the results described in [GRS00] and [HXD03] with results obtained by statistical techniques implemented in some commercial statistical packages, and with neural networks and genetic algorithms. Such a comparison is either missing or incorrect in the publications mentioned.
2 The dataset
There already exist datasets with categorical variables that are used as benchmarks for comparing different methods. Some of them are accessible on the Internet, and for some of them analysis results have been published in different papers. One example is the mushrooms dataset, which can be downloaded from the UCI web page (Repository of machine learning databases – [NHB98]). We used this dataset to allow a comparison with the results discussed in [GRS00, HXD03, HXD05].
The dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family (pp. 500-525). There are 8124 objects and 22 variables (all nominally valued). Each
variable represents a physical characteristic (color, odor, size, shape, etc.). The 23rd variable indicates whether the mushroom is edible or poisonous; identification of the species, however, is not included. The numbers of edible and poisonous mushrooms are 4208 and 3916, respectively.
The variables specify the classification into these two classes very well. Logistic regression classifies the individual mushrooms without mistakes. Using discriminant analysis on the transformed binary data (see the next paragraph), only three objects are classified incorrectly.
For the traditional clustering algorithms, we transformed the data into a dataset with only binary variables. Each variable with K categories was transformed into K binary variables. Some categories described on the UCI web page do not occur in the dataset; the new variables corresponding to these categories contain only zero values and were omitted from the analysis. Further, we omitted the stalk-root variable, which contains 2480 missing values.
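A sketch of this transformation in Python/pandas (the file name and column names are illustrative, and we assume stalk-root is the 11th attribute, as in the UCI description):

```python
import pandas as pd

# Load the UCI mushrooms data; column 0 is the edible/poisonous class.
cols = ["class"] + [f"var{i}" for i in range(1, 23)]   # illustrative names
df = pd.read_csv("agaricus-lepiota.data", header=None, names=cols,
                 na_values="?")                        # stalk-root uses '?' for missing

df = df.drop(columns=["var11"])       # assumed position of stalk-root (2480 missing)
X = pd.get_dummies(df.drop(columns=["class"]))  # K categories -> K binary variables
# Categories that never occur in the data produce no columns here, so no
# all-zero variables need to be removed afterwards.
```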
3 Methods used for analyses
We applied statistical methods, neural networks, and genetic algorithms to the data
mentioned above.
3.1 Statistical methods
The dataset with binary variables was analyzed by the k-means algorithm and several techniques of hierarchical cluster analysis (HCA) in the SPSS system, and by the CLARA (Clustering LARge Applications) algorithm, an extension of the PAM (Partitioning Around Medoids) algorithm, in the S-PLUS system. We used the Jaccard coefficient (Jac) as the similarity measure for asymmetric binary variables, and complete linkage (CL), single linkage (SL), average linkage within groups (ALWG), and average linkage between groups (ALBG) as hierarchical clustering methods. For clustering the objects into two clusters, we also applied the Dice coefficient.
Jaccard and Dice coefficients can be expressed by the formula

$$ s_{ij} = \frac{\Theta \sum_{l=1}^{p} x_{il} x_{jl}}{\Theta \sum_{l=1}^{p} x_{il} x_{jl} + \sum_{l=1}^{p} |x_{il} - x_{jl}|}. \qquad (3) $$

If Θ = 1, we obtain the Jaccard coefficient; the formula with Θ = 2 expresses the Dice coefficient.
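A direct transcription of formula (3) for binary vectors (a sketch with our own function name):

```python
import numpy as np

def binary_similarity(x_i, x_j, theta=1):
    """Formula (3): theta=1 gives the Jaccard coefficient,
    theta=2 the Dice coefficient (x_i, x_j are 0/1 vectors)."""
    x_i, x_j = np.asarray(x_i), np.asarray(x_j)
    a = np.sum(x_i * x_j)            # 1-1 matches
    d = np.sum(np.abs(x_i - x_j))    # 1-0 and 0-1 mismatches
    return theta * a / (theta * a + d)

x, y = [1, 1, 0, 1, 0], [1, 0, 0, 1, 1]
print(binary_similarity(x, y, theta=1))  # Jaccard: 2 / (2 + 2) = 0.5
print(binary_similarity(x, y, theta=2))  # Dice:    4 / (4 + 2) = 0.667
```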
The other clustering methods implemented in SPSS are appropriate for the squared Euclidean measure, and were therefore not used in the overall comparison. In [GRS00], a Euclidean distance and a centroid-based hierarchical clustering algorithm were used. For the comparison with the results described in [GRS00], we applied both this approach and the more suitable combination of the squared Euclidean measure with the centroid linkage algorithm.
Moreover, we used two-step cluster analysis (TSCA) with the log-likelihood measure in the SPSS system. It was applied to both the nominal and the binary version of the dataset.
3.2 Neural networks and genetic algorithms
Neural networks (NN) and genetic algorithms were applied to analyze the binary data. We used self-organizing maps (SOM), the growing cell structure network (GCS), and genetic algorithms (GAs).
The SOM is a vector quantization algorithm that places a number of reference or codebook vectors into a high-dimensional data space. This NN was proposed by Kohonen [KOH91] and is therefore often referred to as a Kohonen map. The SOM architecture is represented by a regular two-dimensional array of neurons. Each neuron is associated with a weight vector $w_i \in R^N$, where $R^N$ is the $N$-dimensional Euclidean space and $i = 1, \ldots, k$, with $k$ the number of clusters. In fact, the resulting SOM represents a projection of the probability density function of the high-dimensional input data onto the two-dimensional array. An input data vector $x \in R^N$ is mapped onto the node $c$, where $c = \arg\min_i \{\|x - w_i\|\}$ and $\|\cdot\|$ is the Euclidean norm. The weight vectors are adapted during the learning phase until the network reflects the density of the training data. There are many modifications of the SOM; a well-known one is the self-organizing feature map (SOFM) [KOH91, KOH01].
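A minimal sketch of one SOM adaptation step (the grid size, learning rate, and neighborhood width are illustrative values, not the settings used in our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 5, 5, 20               # 5x5 map over 20-D binary data
W = rng.random((grid_h * grid_w, dim))       # weight vectors w_i
coords = np.array([(r, c) for r in range(grid_h) for c in range(grid_w)])

def som_step(x, W, lr=0.5, sigma=1.0):
    """Map x onto its best matching unit c, then pull the
    neighborhood of c toward x."""
    c = np.argmin(np.linalg.norm(W - x, axis=1))       # c = argmin_i ||x - w_i||
    grid_dist = np.linalg.norm(coords - coords[c], axis=1)
    h = np.exp(-grid_dist**2 / (2 * sigma**2))         # Gaussian neighborhood
    return W + lr * h[:, None] * (x - W)

x = rng.integers(0, 2, dim).astype(float)    # one binary input vector
W = som_step(x, W)
```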
The growing cell structure (GCS) [FRI94] builds on the SOFM modifications. GCS neurons are not organized in a regular grid (as in SOM), but are placed in the vertices of a network of k-dimensional simplexes (usually a 2-dimensional network of triangles is used). The training process is similar to SOM training, except that the learning rate is constant over time and only the best matching unit is updated. The main difference is that the network architecture is adapted during the learning phase. Learning always starts with a minimal network, i.e. with one simplex. After a constant number of adaptation steps, new cells are added to the network and superfluous cells are removed, as sketched below. For a more precise description of the algorithm, see [FRI94].
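A simplified sketch of the growth step (our own illustration of the idea in [FRI94]; the simplex re-wiring and the cell-removal step are omitted, and the error redistribution is a common simplification rather than Fritzke's exact rule):

```python
import numpy as np

def gcs_insert(W, errors, neighbors):
    """Insert a new cell halfway between the cell with the largest
    accumulated error and its most distant topological neighbor.
    W: cell weights, errors: per-cell error counters,
    neighbors: list of neighbor-index lists (the simplex structure)."""
    q = int(np.argmax(errors))                       # worst cell
    f = max(neighbors[q], key=lambda j: np.linalg.norm(W[q] - W[j]))
    w_new = 0.5 * (W[q] + W[f])                      # new cell between q and f
    W = np.vstack([W, w_new])
    errors = np.append(errors, 0.5 * (errors[q] + errors[f]))
    errors[q] *= 0.5; errors[f] *= 0.5               # redistribute the error
    # (re-wiring the simplex edges around q, f and the new cell is omitted)
    return W, errors

W = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # one starting triangle
errors = np.array([3.0, 1.0, 0.5])
neighbors = [[1, 2], [0, 2], [0, 1]]
W, errors = gcs_insert(W, errors, neighbors)
```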
Evolutionary approaches, such as GAs, represent a class of stochastic algorithms applicable to a wide range of optimization problems. GAs were proposed by Koza [KOZ92]. They work with a population of individuals, where each individual represents one feasible solution to a given optimization problem. All individuals are evaluated with a fitness function: the better the solution, the higher the fitness value. New generations of the population are created by means of selection, crossover and mutation operators. When applying a GA to clustering, an individual consists of $k$ blocks corresponding to $k$ clusters. Each block contains the vector (center) representing its cluster. We minimize the error $E = \sum_{x_j} \|x_j - \mathrm{center}_s\|^2$, where $s = \arg\min_i \|x_j - \mathrm{center}_i\|$. The lower the error, the higher the fitness, i.e. the probability that the corresponding individual will be selected for further reproduction. We use two types of crossover operator: the former is a classical one-point crossover applied to whole blocks, and the latter modifies one randomly chosen block by combining the values of this block in the parent individuals. The mutation causes random perturbations of an individual.
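The encoding and fitness evaluation can be sketched as follows (our own minimal illustration; the helper names `decode`, `error` and `fitness` are ours):

```python
import numpy as np

def decode(individual, k, dim):
    """An individual is k blocks of dim values: one center per cluster."""
    return individual.reshape(k, dim)

def error(individual, X, k):
    """E = sum_j ||x_j - center_s||^2 with s the index of the nearest center."""
    centers = decode(individual, k, X.shape[1])
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    return np.sum(d.min(axis=1) ** 2)

def fitness(individual, X, k):
    """Lower error -> higher fitness (selection probability)."""
    return 1.0 / (1.0 + error(individual, X, k))

X = np.array([[0., 0.], [0., 1.], [5., 5.]])
ind = np.array([0., 0.5, 5., 5.])   # two 2-D centers in one chromosome
print(fitness(ind, X, k=2))         # 1 / (1 + 0.5) ~ 0.667
```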
4 Comparison of results
First, we clustered the objects into two clusters. Here we applied only the statistical methods, because neural networks and genetic algorithms did not provide good results for this dataset. Table 1 shows the obtained results. With the SL, CL, and ALBG algorithms,
edible mushrooms predominate in both clusters; these results are not included. Two-step cluster analysis was applied to both the nominal and the binary data, and the results of the two approaches were identical. Further, we should note that the k-means algorithm is the quickest, but its results depend on the order of the objects. We evaluated the results by the clustering accuracy r according to [HXD03]. This accuracy is generally defined as
$$ r = \frac{\sum_{v=1}^{k} a_v}{N}, \qquad (4) $$
where $N$ is the number of objects in the dataset, $k$ is the number of clusters, and $a_v$ is the number of correctly assigned objects in the $v$th cluster. The minimum accuracy is 61.6 percent (for ALWG with the Jaccard coefficient) and the maximum is 89 percent (for TSCA). For the k-histograms algorithm presented in [HXD03], the accuracy was about 57 percent.
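In code, formula (4) amounts to counting, in each cluster, the objects of the cluster's majority class (a sketch under our reading of $a_v$; the names `clusters` and `labels` are illustrative):

```python
import numpy as np

def clustering_accuracy(clusters, labels):
    """r of formula (4): in each cluster, count the objects of the
    majority class as correctly assigned, then divide by N."""
    clusters, labels = np.asarray(clusters), np.asarray(labels)
    correct = 0
    for v in np.unique(clusters):
        _, counts = np.unique(labels[clusters == v], return_counts=True)
        correct += counts.max()                      # a_v
    return correct / len(labels)

# e.g. 2 clusters over 6 mushrooms:
print(clustering_accuracy([0, 0, 0, 1, 1, 1],
                          ["e", "e", "p", "p", "p", "e"]))  # (2 + 2) / 6 = 0.667
```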
Table 1. Comparison of clustering into 2 clusters by statistical methods

                              Number of mushrooms
Method             Edible     Edible   Poisonous   Poisonous   Accuracy
                   correctly  wrong    correctly   wrong
k-means              3836       372      1229        2687       62.3%
HCA, Jac, ALWG       3056      1152      1952        1964       61.6%
HCA, Dice, ALWG      3760       448      3100         816       84.4%
CLARA                4157        51      2988         928       87.9%
TSCA                 4208         0      3024         892       89.0%
Moreover, we applied the statistical methods for clustering the objects into 21 clusters, as in [GRS00]. We found that the results obtained by hierarchical cluster analysis using the Jaccard coefficient as the similarity measure and the ALBG algorithm are identical with the results obtained by the ROCK algorithm described in [GRS00]. In that paper, the authors characterize the results by 20 "pure" clusters (the mushrooms in a "pure" cluster are either all edible or all poisonous); the minimum size is 8, the maximum size is 1728, and the number of clusters with size less than 100 equals 9. We obtained the same results also with some other similarity and distance measures, e.g. the Dice and Ochiai coefficients and the Euclidean measures.
As mentioned above, we also used the centroid linkage algorithm with the squared Euclidean and Euclidean measures. We applied these approaches for clustering the objects into 20 clusters, as in [GRS00]. In the former case, we obtained 18 "pure" clusters (in the sense mentioned above); the minimum and maximum cluster sizes were 8 and 1728, respectively. In the latter case, we obtained results entirely different from those described in [GRS00]: there was one majority cluster with 8098 objects, and the other 19 clusters were "pure".
Furthermore, we tried clustering the objects into 23 clusters, because the dataset includes 23 species (this case is not included in [GRS00]). We found two techniques that give all "pure" clusters: the SL and ALBG algorithms using the Jac-
card coefficient. A comparison of some methods from the viewpoint of the number of "pure" clusters is shown in Table 2.
Table 2. Number of "pure" clusters for different methods and numbers of clusters

                        Total number of clusters
Methods               2    4    6   12   17   22   23   25
k-means               0    0    0    2    9   16   16   16
HCA, Jac, CL          0    2    2    9   15   20   21   23
HCA, Jac, ALWG        0    1    2    7   12   18   19   21
HCA, Jac, ALBG        1    2    3    8   13   21   23   25
HCA, Jac, SL          1    3    4   10   14   22   23   25
TSCA – binary         1    3    4    8   14   20   21   24
TSCA – nominal        1    3    4    8   14   20   21   22
CLARA                 0    0    0    7    7   13   15   16
k-modes*              0    0    1    3    6    9   12   12
k-histograms*         0    0    1    7   12   18   20   21

*results published in [HXD03]
Neural networks and genetic algorithms were also applied for different numbers of clusters. The number of iterations in the GCS algorithm depends on the number of clusters: for 6, 12, 17, 22 and 32 clusters it was 500, 1000, 1500, 2000 and 3000 iterations, respectively. The GCS algorithm was run ten times and the case with the best results was chosen. These results were compared with the results obtained by the statistical methods. Table 3 shows this comparison from the viewpoint of accuracy (in the sense of (4), for the separation of edible and poisonous mushrooms). As for GA and SOM, GA achieves 89.5% for 4 clusters and SOM 96.0% for 25 clusters.
Table 3. Accuracy for different methods and numbers of clusters

                          Total number of clusters
Methods             4      6      12     17     22      23      32
k-means           78.3%  80.2%  92.8%  94.7%   95.7%   95.8%   98.5%
HCA, Jac, CL      76.2%  82.7%  97.4%  98.7%   98.7%   99.2%   99.2%
HCA, Jac, ALWG    88.8%  88.8%  95.1%  98.4%   99.0%   99.0%   99.4%
HCA, Jac, ALBG    68.0%  89.3%  89.5%  94.4%   99.6%  100.0%  100.0%
HCA, Jac, SL      68.2%  89.4%  89.7%  91.2%  100.0%  100.0%  100.0%
CLARA             90.7%  75.5%  93.1%  96.0%   93.7%   96.8%   98.7%
TSCA – binary     89.0%  89.0%  95.7%  97.2%   98.8%   99.4%   99.6%
TSCA – nominal    89.0%  89.0%  93.4%  98.2%   99.4%   99.4%   99.4%
GCS                 x    90.0%  92.9%  90.4%   93.9%   91.0%   95.9%
5 Conclusion
Although cluster analysis of nominal variables is a popular topic in the literature, it is only rarely implemented in software packages. For large datasets, two-step
cluster analysis in SPSS can be used. As the special methods are either not offered or cannot be used for large datasets, users have to transform the dataset into binary variables in order to use traditional techniques. We compared both approaches on the basis of the mushrooms dataset.
We found that neither neural networks nor genetic algorithms are better than the statistical techniques in the investigated parameters. For distinguishing edible and poisonous mushrooms in 23 clusters (corresponding to the 23 species), we found two techniques that give all "pure" clusters: the single linkage and average linkage between groups algorithms using the Jaccard coefficient. Furthermore, we found that the results for 21 clusters obtained by hierarchical cluster analysis using the average linkage between groups algorithm are identical with the results obtained by the ROCK algorithm described in [GRS00].
References
[FRI94]  Fritzke, B.: Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks, 7, 1141–1160 (1994)
[GGR99]  Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS – Clustering Categorical Data Using Summaries. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, San Diego, 73–83 (1999)
[GRS00]  Guha, S., Rastogi, R., Shim, K.: ROCK: A Robust Clustering Algorithm for Categorical Attributes. Information Systems, 25, 345–366 (2000)
[HXD03]  He, Z., Xu, X., Deng, S., Dong, B.: K-Histograms: An Efficient Clustering Algorithm for Categorical Dataset. Technical Report No. Tr-2003-08, Harbin Institute of Technology (2003)
[HXD05]  He, Z., Xu, X., Deng, S.: TCSOM: Clustering Transactions Using Self-Organizing Map. Neural Processing Letters, 22, 249–262 (2005)
[KOH91]  Kohonen, T.: Self-Organizing Maps. Proc. IEEE, 78, 1464–1480 (1991)
[KOH01]  Kohonen, T.: Self-Organizing Maps, 3rd edn. Springer, Berlin Heidelberg New York (2001)
[KOZ92]  Koza, J.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge (1992)
[NHB98]  Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Irvine, CA (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
Acknowledgement
This work was partially supported by grant 201/05/0079 awarded by the Grant Agency of the Czech Republic, by project No. 1ET100300414, and by the Institutional Research Plan AVOZ10300504.