A co-clustering algorithm for interval-valued data

Francisco de A. T. de Carvalho1,*, Roberto C. Fernandes1

1. Centro de Informatica - CIn/UFPE
*Contact author: fatc@cin.ufpe.br

Keywords: Co-clustering, Double k-means, Interval-valued data, Symbolic data analysis

Co-clustering methods, also known as bi-clustering or block clustering methods, aim to cluster the objects and the variables of a data set simultaneously. They summarize the initial data matrix into a much smaller matrix representing homogeneous blocks, or co-clusters, of similar objects and variables (Govaert, 1995; Govaert and Nadif, 2013). A clear advantage of this approach over the traditional sequential approach is that the simultaneous clustering of objects and variables may provide new insights about the association between isolated clusters of objects and isolated clusters of variables (Van Mechelen et al., 2004). Co-clustering has been used successfully in many different areas, such as text mining and bioinformatics.

This presentation gives a co-clustering algorithm able to simultaneously cluster objects and interval-valued variables. Interval-valued variables are needed, for example, when an object represents a group of individuals and the variables used to describe it must take values that express the variability inherent to the description of a group. Interval-valued data arise in practical situations such as recording monthly temperature intervals at meteorological stations or daily stock price intervals. Another source of interval-valued data is the aggregation of huge databases into a reduced number of groups, the properties of which are described by interval-valued variables. Tools for interval-valued data analysis are therefore much needed (Bock and Diday, 2000).
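To make the aggregation case concrete, the following minimal sketch summarizes each group of individual records by a [min, max] interval per variable. The function name `aggregate_to_intervals`, the record layout and the temperature readings are hypothetical, and using the minimum and maximum as interval bounds is only one possible choice, assumed here for illustration.

```python
from collections import defaultdict


def aggregate_to_intervals(records, group_key, variables):
    """Describe each group by the (min, max) interval of every variable.

    Hypothetical helper: min/max bounds are an assumption of this sketch.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    table = {}
    for name, members in groups.items():
        table[name] = {
            var: (min(m[var] for m in members), max(m[var] for m in members))
            for var in variables
        }
    return table


# Hypothetical daily temperature readings at two stations.
records = [
    {"station": "A", "temp": 21.5},
    {"station": "A", "temp": 18.0},
    {"station": "A", "temp": 19.2},
    {"station": "B", "temp": 25.1},
    {"station": "B", "temp": 27.4},
]

table = aggregate_to_intervals(records, "station", ["temp"])
for station, description in table.items():
    print(station, description)
```

Each station is now a single object described by an interval-valued variable, which is the kind of input the co-clustering algorithm is meant to handle.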

Symbolic data analysis has provided suitable tools for clustering objects described by interval-valued variables: agglomerative (Gowda and Diday, 1991; Guru et al., 2004) and divisive (Gowda and Ravi, 1995; Chavent, 2000) hierarchical methods, as well as hard (Chavent and Lechevallier, 2002; De Souza and De Carvalho, 2004; De Carvalho et al., 2006) and fuzzy (Yang et al., 2004; De Carvalho, 2007) partitioning algorithms. However, much less attention has been given to the simultaneous clustering of objects and symbolic variables (Verde and Lechevallier, 2001).

This presentation gives a co-clustering algorithm for interval-valued data based on suitable Euclidean, City-Block and Hausdorff distances. The algorithm is of the double k-means type (Govaert, 1995; Govaert and Nadif, 2013): it is an iterative relocation algorithm with three steps (representation, allocation of variables, allocation of objects). It looks simultaneously for the representatives of the co-clusters (blocks), for a partition of the set of variables into a fixed number of clusters, and for a partition of the set of objects into a fixed number of clusters, such that a clustering criterion (objective function) measuring the fit between the initial data matrix and the matrix of co-cluster representatives is locally minimized. The three steps are repeated until the algorithm converges; this convergence can be proved (Diday and Coll, 1980).

This presentation gives the clustering criterion (objective function) and the main steps of the algorithm: the computation of the co-cluster representatives, the determination of the best partition of the variables, and the determination of the best partition of the objects. The usefulness of the algorithm is illustrated by running it on some benchmark interval-valued data sets.

References

H.-H. Bock and E. Diday (2000). Analysis of Symbolic Data, Springer, Berlin.
M. Chavent (2000). Criterion-based divisive clustering for symbolic objects. In H.-H. Bock and E. Diday (Eds.), Analysis of symbolic data, exploratory methods for extracting statistical information from complex data (Springer, Berlin), pp. 291–311.
M. Chavent and Y. Lechevallier (2002). Dynamical clustering algorithm of interval data: optimization of an adequacy criterion based on Hausdorff distance. In IFCS 2002, 8th Conference of the International Federation of Classification Societies (Cracow, Poland), pp. 53–59.
F.A.T. De Carvalho (2007). Fuzzy c-means clustering methods for symbolic interval data. Pattern Recognition Letters, 28, 423–437.
F.A.T. De Carvalho, P. Brito, and H.-H. Bock (2006). Dynamic clustering for interval data based on L2 distance. Computational Statistics, 2, 231–250.
R.M.C.R. de Souza and F.A.T. de Carvalho (2004). Clustering of interval data based on City-Block distances. Pattern Recognition Letters, 25, 353–365.
Diday and Coll (1980). Optimisation en Classification Automatique, INRIA, Le Chesnay.
Y. El-Sonbaty and M.A. Ismail (1998). Fuzzy clustering for symbolic data. IEEE Transactions on Fuzzy Systems, 6, 195–204.
G. Govaert (1995). Simultaneous clustering of rows and columns. Control and Cybernetics, 24, 437–458.
G. Govaert and M. Nadif (2013). Co-clustering: models, algorithms and applications, Wiley, New York.
K.C. Gowda and E. Diday (1991). Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 24, 567–578.
K.C. Gowda and T.R. Ravi (1995). Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition, 28, 1277–1282.
D.S. Guru, B.B. Kiranagi, and P. Nagabhushan (2004). Multivalued type proximity measure and concept of mutual similarity value useful for clustering symbolic patterns. Pattern Recognition Letters, 25, 1203–1213.
I. Van Mechelen, H.-H. Bock, and P. De Boeck (2004). Two-mode clustering methods: a structured overview. Statistical Methods in Medical Research, 13, 363–394.
R. Verde and Y. Lechevallier (2005). Crossed Clustering Method on Symbolic Data Tables. In Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society - CLADAG 2003 (Bologna, Italy), pp. 87–94.
M.-S. Yang, P.-Y. Hwang, and D.-H. Chen (2004). Fuzzy clustering algorithms for mixed feature variables. Fuzzy Sets and Systems, 141, 301–317.
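
Illustrative sketch. The three-step relocation scheme described in the abstract (representation, allocation of variables, allocation of objects) can be sketched as follows. This is not the authors' implementation: it assumes a squared Euclidean distance on the interval bounds, for which the representative of a block is taken as the mean lower and mean upper bound, and it does not attempt the City-Block or Hausdorff variants. The function name `interval_double_kmeans`, its parameters and the toy data are hypothetical.

```python
import numpy as np


def criterion(X, G, rows, cols):
    """Sum of squared differences between each interval and its block prototype."""
    return float(((X - G[rows][:, cols]) ** 2).sum())


def interval_double_kmeans(X, K, H, n_iter=50, seed=0, tol=1e-12):
    """Double k-means style sketch for X of shape (n, p, 2) = (objects, variables, [lower, upper]).

    Assumption: squared Euclidean distance on bounds, random initial partitions.
    """
    rng = np.random.default_rng(seed)
    n, p, _ = X.shape
    rows = rng.integers(K, size=n)   # cluster label of each object
    cols = rng.integers(H, size=p)   # cluster label of each variable
    G = np.zeros((K, H, 2))          # interval prototype of each block
    history, prev = [], np.inf

    for _ in range(n_iter):
        # Step 1 (representation): mean lower/upper bound of each non-empty block.
        for k in range(K):
            for h in range(H):
                block = X[rows == k][:, cols == h]
                if block.size:
                    G[k, h] = block.mean(axis=(0, 1))

        # Step 2 (allocation of variables): col_cost[j, h] is the cost of
        # putting variable j in variable-cluster h, given the object partition.
        diff = X[:, :, None, :] - G[rows][:, None, :, :]
        cols = (diff ** 2).sum(axis=(0, 3)).argmin(axis=1)

        # Step 3 (allocation of objects): row_cost[i, k] is the cost of
        # putting object i in object-cluster k, given the variable partition.
        diff = X[:, :, None, :] - G[:, cols].transpose(1, 0, 2)[None]
        rows = (diff ** 2).sum(axis=(1, 3)).argmin(axis=1)

        value = criterion(X, G, rows, cols)
        history.append(value)
        if prev - value <= tol:      # no further improvement: local minimum reached
            break
        prev = value

    return rows, cols, G, history


# Hypothetical toy data: 6 objects x 4 interval variables with a 2 x 2 block structure.
lower = np.kron(np.array([[0.0, 10.0], [20.0, 30.0]]), np.ones((3, 2)))
X = np.stack([lower, lower + 1.0], axis=-1)

rows, cols, G, history = interval_double_kmeans(X, K=2, H=2)
print("object clusters:  ", rows)
print("variable clusters:", cols)
print("criterion per iteration:", history)
```

Because each step can only keep or lower the criterion for the other two quantities held fixed, the recorded values are non-increasing, which is the property behind the convergence argument. The solution reached is a local minimum and depends on the random initial partitions.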