August 21, 2007

Discovering Clusters of Arbitrary Shapes

Rachsuda Jiamthapthaksin, Jiyeon Choo, Christian Giusti¹, Dan Jiang, Christoph F. Eick, and Ricardo Vilalta
Department of Computer Science, University of Houston, Houston, Texas 77204-3010, U.S.A.
{rachsuda, jchoo, djiang, ceick, vilalta}@cs.uh.edu

Clustering is an unsupervised learning technique used for exploratory data analysis, for summary generation, and as a preprocessing step for other data mining tasks. Our ongoing research applies clustering to scientific discovery, such as identifying pollution hotspots in environmental data, discovering co-location patterns of chemicals on the Martian surface and in Texas groundwater, and region discovery in general. Because of the characteristics and diverse nature of the data used, clusters may be of arbitrary shape and may be nested within one another. Examples of such shapes are the chain-like patterns that represent active and inactive volcanoes, as depicted in Figure 1 (a). Traditional clustering algorithms, such as k-means and k-medoids, generally fail to detect non-spherical shapes, as demonstrated in Figure 1 (b).

Our research aims to find solutions for this challenging problem. We focus on developing novel techniques that discover clusters of arbitrary shapes efficiently and effectively. In addition, the following criteria are taken into consideration: speed and scalability, comprehensive parameter tuning, robustness to noise and outliers, and the capability to detect clusters at different levels of granularity.

(a) Chain-like patterns (b) Clusters detected by K-means
Figure 1: Discovering Chain-like Clusters in the Volcano Dataset.

Two approaches are currently explored in order to find clusters of arbitrary shapes. The first approach is a generic clustering framework that approximates arbitrary shape clusters using unions of small convex polygons [CJCCGE2007], as depicted in Figure 2.
One benefit of this framework is its flexibility; it allows three components to be plugged in: 1) the formation of a large number of sub-clusters, 2) the construction of the neighboring relation between sub-clusters (Fig. 3), and 3) the fitness function that measures cluster quality. In this framework, fitness functions play an important role in capturing arbitrary shapes of clusters; unfortunately, the use of fitness functions in clustering algorithms has received little attention in past research. We investigate two approaches for acquiring fitness functions: 1) using shape signatures [SS06] as summarized information that captures the shape characteristics of clusters, and 2) learning fitness functions from feedback provided by domain experts or other sources.

¹ Department of Mathematics and Computer Science, University of Udine, Via delle Scienze, 33100 Udine, Italy. {giusti@dimi.uniud.it}

(a) Input (b) Output
Figure 2: An illustration of MOSAIC's Clustering Approach for the Complex9 Dataset
Figure 3: An illustration of the neighboring relations between sub-clusters

The second approach explores the use of density estimation techniques for discovering arbitrary shape clusters [J2006, JEC2007]. These techniques rely on influence functions, in which a point's influence on another point decreases as the distance between the two points increases. Density functions are constructed from these influence functions, as depicted in Fig. 4, and clusters are formed by hill climbing algorithms that associate the objects in the dataset with maxima of the so-defined function; objects that are associated with the same maximum belong to the same cluster.

The ability to discover clusters with arbitrary shapes is important in many real applications, three of which are highlighted in this paragraph.
Finding non-spherical regions is of critical importance for hotspot and co-location discovery in spatial datasets [EVJW2006, EDPS2007]; otherwise, regional patterns of arbitrary shape will remain undetected in spatial data mining. Another application is the use of clustering as a preprocessing step for classification. [VAE2003] proposes decomposing classes into subclasses in order to simplify the approximation of class distributions. The resulting subclasses do not always have spherical shapes, and therefore the capability to discover arbitrary shape clusters is important for class decomposition. Third, in image processing, arbitrary shape clustering is important for obtaining a spatial clustering of the features extracted from an image. [GPP2006] presents a method for discovering objects in images using spatial clustering of features. The method extracts features from a picture through a weak segmentation procedure and, relying on clustering algorithms, uses the displacement of the features over the picture to subdivide it into meaningful parts.

Figure 4: (a) Binary Complex9 Dataset (b) A Density Function for the Binary Complex9 Dataset

References

[CJCCGE2007] J. Choo, R. Jiamthapthaksin, C. Chen, O. Celepcikay, C. Giusti, and C. Eick, MOSAIC: A Proximity Graph Approach to Agglomerative Clustering, in Proc. 9th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Regensburg, Germany, September 2007.
[EDPS2007] C. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, Finding Regional Colocations Patterns for Sets of Continuous Variables, submitted for publication, October 2007.
[EVJW2006] C. Eick, B. Vaezian, D. Jiang, and J. Wang, Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering, in Proc. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany, September 2006.
[JEC2007] D. Jiang, C. Eick, and C. Chen, On Supervised Density Estimation Techniques and Their Application to Spatial Data Mining, in Proc. ACM-GIS Conference, Seattle, Washington, November 2007.
[J2006] D. Jiang, Design and Implementation of a Density-based Supervised Clustering Algorithm, Master Thesis, University of Houston, December 2006.
[GPP2006] C. Giusti, G. G. Pieroni, and L. Pieroni, Attention trees and semantic paths, in Proc. Human Vision and Electronic Imaging XII Conference, San Jose, California, February 2007.
[SS06] C. Shahabi and M. Safar, An experimental study of alternative shape-based image retrieval techniques, Springer Science + Business Media, LLC, 2006.
[VAE2003] R. Vilalta, M. Achari, and C. Eick, Class Decomposition via Clustering: A New Framework for Low-Variance Classifiers, in Proc. Third IEEE International Conference on Data Mining (ICDM), Melbourne, Florida, November 2003.
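
Illustrative sketch of the density-based approach. The second approach described above builds a density function from influence functions and assigns each object to the maximum it reaches by hill climbing. The following is a minimal sketch of that idea, not the algorithm of [JEC2007] or the thesis by Jiang. It assumes a Gaussian influence function and a mean-shift style hill-climbing step, neither of which is specified in the text, and the names `influence`, `density`, `hill_climb`, `cluster`, `sigma`, and `merge_radius` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def influence(p: Sequence[float], q: Sequence[float], sigma: float) -> float:
    """Assumed Gaussian influence of q on p; it decreases as distance grows."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def density(p: Sequence[float], points: Sequence[Point], sigma: float) -> float:
    """Density at p: the sum of the influences of all objects in the dataset."""
    return sum(influence(p, q, sigma) for q in points)


def hill_climb(start: Sequence[float], points: Sequence[Point], sigma: float,
               max_steps: int = 200, tol: float = 1e-6) -> Point:
    """Climb from start toward a local maximum of the density function.

    Each step moves to the influence-weighted mean of the data (a mean-shift
    update), which does not decrease the Gaussian density.
    """
    x: Point = tuple(start)
    for _ in range(max_steps):
        weights = [influence(x, q, sigma) for q in points]
        total = sum(weights)
        if total == 0.0:  # numerically no object influences x; stop here
            return x
        new_x = tuple(
            sum(w * q[i] for w, q in zip(weights, points)) / total
            for i in range(len(x))
        )
        if math.dist(x, new_x) < tol:
            return new_x
        x = new_x
    return x


def cluster(points: Sequence[Point], sigma: float,
            merge_radius: Optional[float] = None) -> Tuple[List[int], List[Point]]:
    """Give objects that climb to the same maximum the same cluster label."""
    if merge_radius is None:
        merge_radius = sigma  # assumption: maxima closer than sigma coincide
    maxima: List[Point] = []
    labels: List[int] = []
    for p in points:
        peak = hill_climb(p, points, sigma)
        for index, known in enumerate(maxima):
            if math.dist(peak, known) <= merge_radius:
                labels.append(index)
                break
        else:
            maxima.append(peak)
            labels.append(len(maxima) - 1)
    return labels, maxima
```

Calling `cluster(points, sigma)` on a list of coordinate tuples returns one label per object together with the maxima that were found. In this sketch `sigma` sets the granularity: a larger value smooths the density function, so fewer maxima, and therefore fewer clusters, tend to remain.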