A co-clustering algorithm for interval-valued data

Francisco de A. T. de Carvalho1,*, Roberto C. Fernandes1
1. Centro de Informática - CIn/UFPE
* Contact author: fatc@cin.ufpe.br
Keywords: Co-clustering, Double k-means, Interval-valued data, Symbolic data analysis
Co-clustering methods, also known as bi-clustering or block clustering methods, aim to cluster the
objects and the variables of a data set simultaneously. They summarize the initial data matrix into
a much smaller matrix representing homogeneous blocks, or co-clusters, of similar objects and
variables (Govaert, 1995; Govaert and Nadif, 2013). A clear advantage of this approach over the
traditional sequential approach is that the simultaneous clustering of objects and variables may
provide new insights into the association between isolated clusters of objects and isolated clusters
of variables (Van Mechelen et al., 2004). Co-clustering has been used successfully in many areas,
such as text mining and bioinformatics.
This presentation aims to give a co-clustering algorithm able to simultaneously cluster objects
and interval-valued variables. Interval-valued variables are needed, for example, when an object
represents a group of individuals and the variables used to describe it must take values that
express the variability inherent to the description of a group. Interval-valued data arise in practical
situations such as recording monthly temperature intervals at meteorological stations or daily
stock price intervals. Another source of interval-valued data is the aggregation of huge databases into
a reduced number of groups, the properties of which are described by interval-valued variables.
Therefore, tools for interval-valued data analysis are very much required (Bock and Diday, 2000).
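To make the aggregation idea concrete, the following minimal Python sketch turns individual records into one interval-valued description per group by taking the minimum and maximum of each variable. The function name `aggregate_to_intervals`, the station labels and the readings are hypothetical and serve only as an illustration; other summaries (for example quantile-based bounds) are equally possible.

```python
def aggregate_to_intervals(records):
    """Aggregate (group, values) records into per-group intervals.

    Each variable of a group is described by the interval
    (minimum observed value, maximum observed value).
    This min/max rule is an assumption made for illustration.
    """
    intervals = {}
    for group, values in records:
        if group not in intervals:
            # First record of the group: degenerate intervals [v, v].
            intervals[group] = [(v, v) for v in values]
        else:
            # Widen each interval so that it covers the new value.
            intervals[group] = [
                (min(lo, v), max(hi, v))
                for (lo, hi), v in zip(intervals[group], values)
            ]
    return intervals


# Hypothetical readings: (station, (temperature, wind speed)).
readings = [
    ("station_A", (12.1, 3.0)),
    ("station_A", (18.4, 5.5)),
    ("station_B", (25.0, 1.2)),
    ("station_A", (9.7, 4.1)),
    ("station_B", (31.3, 0.8)),
]
intervals = aggregate_to_intervals(readings)
# station_A -> [(9.7, 18.4), (3.0, 5.5)]
# station_B -> [(25.0, 31.3), (0.8, 1.2)]
```

Each group then becomes one row of an interval-valued data matrix, which is the kind of input the co-clustering algorithm is meant to handle.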
Symbolic data analysis has provided suitable tools for clustering objects described by
interval-valued variables: agglomerative (Gowda and Diday, 1991; Guru et al., 2004) and divisive (Gowda
and Ravi, 1995; Chavent, 2000) hierarchical methods, and hard (Chavent and Lechevallier,
2002; De Souza and De Carvalho, 2004; De Carvalho et al., 2006) and fuzzy (Yang et al., 2004; De
Carvalho, 2007) partitioning algorithms. However, much less attention has been given to the simultaneous
clustering of objects and symbolic variables (Verde and Lechevallier, 2005).
This presentation gives a co-clustering algorithm for interval-valued data using suitable Euclidean,
City-Block and Hausdorff distances. The presented algorithm is of the double k-means type
(Govaert, 1995; Govaert and Nadif, 2013): it is an iterative three-step relocation algorithm
(representation, allocation of variables, allocation of objects) that looks simultaneously for
the representatives of the co-clusters (blocks), for a partition of the set of variables into a fixed
number of clusters, and for a partition of the set of objects into a fixed number of clusters, such
that a clustering criterion (objective function) measuring the fit between the initial data matrix and
the matrix of co-cluster representatives is locally minimized. These steps are repeated until the
algorithm converges; its convergence can be proved (Diday and Coll, 1980).
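The three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the squared Euclidean distance between interval bounds, takes the mean of the lower and upper bounds of a block as its representative, and starts from an arbitrary deterministic partition. The function names and the toy data are hypothetical.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between intervals given as (lower, upper)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def double_kmeans_intervals(X, K, H, max_iter=100):
    """X[i][j] is the interval (lower, upper) of object i on variable j."""
    n, p = len(X), len(X[0])
    rows = [i % K for i in range(n)]  # initial object partition (arbitrary)
    cols = [j % H for j in range(p)]  # initial variable partition (arbitrary)
    G = [[(0.0, 0.0)] * H for _ in range(K)]  # block representatives

    for _ in range(max_iter):
        # Step 1 (representation): mean lower and upper bound of each block.
        for k in range(K):
            for h in range(H):
                cells = [X[i][j] for i in range(n) if rows[i] == k
                         for j in range(p) if cols[j] == h]
                if cells:  # an empty block keeps its previous representative
                    G[k][h] = (sum(c[0] for c in cells) / len(cells),
                               sum(c[1] for c in cells) / len(cells))

        # Step 2 (allocation of variables): objects and representatives fixed.
        new_cols = []
        for j in range(p):
            costs = [sum(sq_euclidean(X[i][j], G[rows[i]][h]) for i in range(n))
                     for h in range(H)]
            new_cols.append(costs.index(min(costs)))

        # Step 3 (allocation of objects): variables and representatives fixed.
        new_rows = []
        for i in range(n):
            costs = [sum(sq_euclidean(X[i][j], G[k][new_cols[j]]) for j in range(p))
                     for k in range(K)]
            new_rows.append(costs.index(min(costs)))

        if new_rows == rows and new_cols == cols:
            break  # neither partition changed: a local minimum was reached
        rows, cols = new_rows, new_cols

    # Value of the criterion for the returned partitions and representatives.
    J = sum(sq_euclidean(X[i][j], G[rows[i]][cols[j]])
            for i in range(n) for j in range(p))
    return rows, cols, G, J


# Toy data: two kinds of objects (A, B) described by three interval variables.
A = [(0.0, 2.0), (0.0, 2.0), (10.0, 12.0)]
B = [(20.0, 22.0), (20.0, 22.0), (40.0, 42.0)]
X = [A, A, A, B, B]
rows, cols, G, J = double_kmeans_intervals(X, K=2, H=2)
# rows -> [0, 0, 0, 1, 1], cols -> [1, 1, 0], J -> 0.0
```

With the squared Euclidean distance, the mean of the bounds minimizes the within-block sum of distances, so none of the three steps can increase the criterion. For the City-Block or Hausdorff distances a different representative would be needed, which this sketch does not cover.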
This presentation gives the clustering criterion (objective function) and the main steps of
the algorithm (the computation of the co-cluster representatives, the determination of the best
partition of the variables and the determination of the best partition of the objects). The
usefulness of the presented co-clustering algorithm is illustrated by running it on some benchmark
interval-valued data sets.
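For orientation, a criterion of the double k-means type is commonly written as below. The abstract does not state the exact formula, so this is an assumed generic form with hypothetical notation: P_1, ..., P_K is the partition of the objects, Q_1, ..., Q_H the partition of the variables, x_ij = [a_ij, b_ij] the interval observed for object i on variable j, and g_kh = [alpha_kh, beta_kh] the representative of block (k, h). The three distances are the usual bound-wise versions for intervals.

```latex
% Assumed generic double k-means criterion for interval-valued data
J = \sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{i \in P_k} \sum_{j \in Q_h} d(x_{ij}, g_{kh})

% Usual distances between x_{ij} = [a_{ij}, b_{ij}] and g_{kh} = [\alpha_{kh}, \beta_{kh}]
d_{E}(x_{ij}, g_{kh})  = (a_{ij} - \alpha_{kh})^{2} + (b_{ij} - \beta_{kh})^{2}
d_{CB}(x_{ij}, g_{kh}) = |a_{ij} - \alpha_{kh}| + |b_{ij} - \beta_{kh}|
d_{H}(x_{ij}, g_{kh})  = \max\{ |a_{ij} - \alpha_{kh}|, |b_{ij} - \beta_{kh}| \}
```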
References
H.-H. Bock and E. Diday (2000). Analysis of Symbolic Data, Springer, Berlin et al.
M. Chavent (2000). Criterion-based divisive clustering for symbolic objects. In H.-H. Bock
and E. Diday (Eds.), Analysis of symbolic data, exploratory methods for extracting statistical
information from complex data (Springer, Berlin), pp. 291–311.
M. Chavent and Y. Lechevallier (2002). Dynamical clustering algorithm of interval data: optimization of an adequacy criterion based on Hausdorff distance. In IFCS 2002, 8th Conference
of the International Federation of Classification Societies (Cracow, Poland), pp. 53–59.
F.A.T. De Carvalho (2007). Fuzzy c-means clustering methods for symbolic interval data. Pattern
Recognition Letters, 28, 423–437.
F.A.T. De Carvalho, P. Brito, and H.-H. Bock (2006). Dynamic clustering for interval data based
on L2 distance. Computational Statistics, 21, 231–250.
R.M.C.R. de Souza and F.A.T. de Carvalho (2004). Clustering of interval data based on City-Block
distances. Pattern Recognition Letters, 25, 353–365.
E. Diday and Coll (1980). Optimisation en Classification Automatique, INRIA, Le Chesnay.
Y. El-Sonbaty and M.A. Ismail (1998). Fuzzy clustering for symbolic data. IEEE Transactions on
Fuzzy Systems, 6, 195–204.
G. Govaert (1995). Simultaneous clustering of rows and columns. Control and Cybernetics, 24,
437–458.
G. Govaert and M. Nadif (2013). Co-clustering: models, algorithms and applications, Wiley, New
York.
K.C. Gowda and E. Diday (1991). Symbolic clustering using a new dissimilarity measure. Pattern
Recognition, 24, 567–578.
K.C. Gowda and T.R. Ravi (1995). Divisive clustering of symbolic objects using the concepts of
both similarity and dissimilarity. Pattern Recognition, 28, 1277–1282.
D.S. Guru, B.B. Kiranagi, and P. Nagabhushan (2004). Multivalued type proximity measure and
concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition
Letters, 25, 1203–1213.
I. Van Mechelen, H.-H. Bock, and P. De Boeck (2004). Two-mode clustering methods: a structured
overview. Statistical Methods in Medical Research, 13, 363–394.
R. Verde and Y. Lechevallier (2005). Crossed Clustering Method on Symbolic Data Tables. In
Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the
Italian Statistical Society - CLADAG 2003 (Bologna, Italy), pp. 87–94.
M.-S. Yang, P.-Y. Hwang, and D.-H. Chen (2004). Fuzzy clustering algorithms for mixed feature
variables. Fuzzy Sets and Systems, 141, 301–317.