COMPARISON OF CLUSTERING ALGORITHMS: PARTITIONAL AND HIERARCHICAL

Principal Investigator
Dr. Sanjay Ranka, Professor
Department of Computer Science, University of Florida

Teaching Assistant
Manas Somaiya

Authors
Joyesh Mishra, Gnana Sundar Rajendiran, Vasanth Prabhu Sundararaj

Department of Computer Science, University of Florida, Gainesville
www.cise.ufl.edu

Final Report
December 2007

TABLE OF CONTENTS

I. ABSTRACT
II. DETAILED REPORT
   1. K-Means Partitional clustering
      1.1 Characteristics of K - means
      1.2 Algorithm
      1.3 Observations
   2. Agglomerative Hierarchical Clustering
      2.1 Definition
      2.2 Algorithms implemented in this Project
      2.3 Datasets and Experiments
   3. DBSCAN (Using KD Trees)
      3.1 DBSCAN Algorithm
      3.2 DBSCAN Performance Enhancements Using KD Trees
      3.3 Observations regarding DBSCAN Issues
   4. CURE – Hierarchical Clustering (Using KD Trees)
      4.1 CURE Hierarchical Clustering Algorithm
      4.2 CURE Overview
      4.3 CURE - Data Structures Used
      4.4 Benefits of CURE against Other Algorithms
      4.5 Observations towards Sensitivity to Parameters
III. CONCLUSION
IV. REFERENCES

LIST OF FIGURES

Figure 1 K – means Initial K clusters
Figure 2 K – means Clusters getting rearranged by computing new centroids
Figure 3 K – means Converged clusters
Figure 4 Union By Rank
Figure 5 SPAETH dataset
Figure 6 Agglomerative Clusters After 28000 iterations
Figure 7 Agglomerative Clusters After 64000 iterations
Figure 8 Agglomerative Clusters After 65388 iterations
Figure 9 Agglomerative Non – globular clusters
Figure 10 CURE Non – globular clusters
Figure 11 Complete Link
Figure 12 Complete Link clusters After 2000 iterations
Figure 13 Complete Link clusters After 2012 iterations
Figure 14 CURE clusters
Figure 15 DBSCAN performance measurements
Figure 16 Partitioning results

I. ABSTRACT

Clustering is one of the important streams in data mining, useful for discovering groups and identifying interesting distributions in the underlying data. This project aims to analyze and compare partitional and hierarchical clustering algorithms, namely DBSCAN and k-means (partitional) with Agglomerative and CURE (hierarchical). The comparison is based on the extent to which each algorithm identifies the clusters, its pros and cons, and the time it takes to identify the clusters present in the dataset. For each clustering algorithm, computation time was measured as the size of the data set increased. This was used to test the scalability of the algorithm and whether it could be decomposed and executed concurrently on several machines.

k-means is a partitional clustering technique that helps to identify k clusters from a given set of n data points in d-dimensional space.
It starts with k random centers and a single cluster, and refines the clustering at each step until it arrives at k clusters. Currently, the time complexity of k-means is O(I * k * d * n), where I is the number of iterations. If the KD-Tree data structure is used in the implementation, the complexity can be further reduced to O(I * k * d * log(n)).

DBSCAN discovers clusters of arbitrary shape, relying on a density based notion of clusters. Given Eps as the input parameter, unlike k-means clustering, it tries to find all possible clusters by classifying each point as core, border or noise. DBSCAN can be expensive, as computing nearest neighbors requires computing all pairwise proximities. Our implementation additionally stores the data in KD-Trees, which allows efficient retrieval and brings the time complexity down from O(m^2) to O(m log m).

Agglomerative Hierarchical Clustering is one of the non-parametric approaches to clustering, based on measures of the dissimilarity among the current set of clusters in each iteration. In general we start with the points as individual clusters and at each step merge the closest pair of clusters, using a defined notion of cluster proximity. We will implement two algorithms, namely Single-Linkage Clustering and Complete-Linkage Clustering. We will analyze the advantages and drawbacks of Agglomerative Hierarchical Clustering by comparing it with the other algorithms: CURE, DBSCAN and K-Means.

The CURE clustering algorithm helps in attaining scalability for clustering in large databases without sacrificing the quality of the generated clusters. The algorithm uses KD-Trees and Min Heaps for efficient data analysis and repetitive clustering. Random sampling, partitioning of clusters and two-pass merging help in scaling the algorithm to large datasets. Our implementation provides a comparative study of CURE against other partitional and hierarchical algorithms.

II. DETAILED REPORT

1. K-Means Partitional clustering

Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center. There are no efficient solutions known to any of these problems and some formulations are NP-hard. Asymptotically efficient approximation algorithms have been proposed, but their large constant factors suggest that they are not good candidates for practical implementation. One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally minimal solution. This algorithm is often called the k-means algorithm.

1.1 Characteristics of K - means

a. It is a prototype based clustering. It can only be applied to clusters that have the notion of a centre.
b. The algorithm has a time complexity of O(I * K * m * n), where I is the number of iterations, K is the number of clusters, m is the number of dimensions and n is the number of points.
c. Using KD Trees, the overall time complexity can be reduced to O(n * log n). A KD Tree is a data structure that helps group together the points that are most likely to form a cluster, at each point of decision when isolating the clusters.

1.2 Algorithm

a. Select K initial centroids
b. Repeat
   For each point, find its closest centroid and assign that point to the centroid. This results in the formation of K clusters
   Recompute the centroid for each cluster
   Until the centroids do not change

In the first step, points are assigned to the initial centroids, which are all in the larger group of points. After points are assigned to a centroid, the centroid is then updated. In the second step, points are assigned to the updated centroids, and the centroids are updated again. When the k-means algorithm terminates, the centroids will have identified the natural groupings of points.
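
The iteration above can be sketched in a few lines of Python. This is a minimal illustration and not the LabVIEW implementation used in the project: the function name `kmeans`, the choice to pass the initial centroids in explicitly, the squared Euclidean distance, and the handling of empty clusters are all assumptions made for the example.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means on a list of equal-length tuples; returns (centroids, labels)."""
    # Step a: the K initial centroids are supplied by the caller (an assumption).
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step b, part 1: assign every point to its closest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Step b, part 2: recompute each centroid as the mean of its cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
                continue
            dim = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        # Stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
```

Each pass over the data costs on the order of K * m * n distance-term evaluations, which is where the per-iteration factor in the complexity given in 1.1 comes from.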
For some combinations of proximity functions and types of centroids, k-means always converges to a solution, i.e., k-means reaches a state in which no points are shifting from one cluster to another and hence the centroids do not change.

1.3 Observations

The dataset used for running the k-means algorithm is a 2D array of x, y points obtained from SPAETH (http://people.scs.fsu.edu/~burkardt/datasets/spaeth/spaeth.html). The figures listed below show how the k-means algorithm converges for this set of data points.

Figure 1 K – means Initial K clusters
Figure 2 K – means Clusters getting rearranged by computing new centroids
Figure 3 K – means Converged clusters

With the LabVIEW 8.2.1 compiler and 3360 points, the k-means algorithm took 355 ms to converge. The hardware used was an Intel® Core™2 IV 1.73 GHz with 1 GB RAM.

The pros of the k-means algorithm are:
a. It is very simple to implement
b. It is very fast for low dimensional data
c. It can find pure sub-clusters if a large number of clusters is specified

The cons of the k-means algorithm are:
a. K-Means cannot handle non-globular clusters or clusters of different sizes and densities
b. K-Means will not identify outliers
c. K-Means is restricted to data which has the notion of a centre (centroid)

2. Agglomerative Hierarchical Clustering

2.1 Definition

Hierarchical clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data at different levels of granularity. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point (singleton) clusters and recursively merges the two or more most appropriate clusters. A divisive clustering starts with one cluster of all data points and recursively splits the most appropriate cluster.
The process continues until a stopping criterion (frequently, the requested number k of clusters) is achieved. In this project, we deal with Agglomerative Hierarchical Clustering.

Advantages of hierarchical clustering:
a. Embedded flexibility regarding the level of granularity
b. Ease of handling any form of similarity or distance
c. Applicability to any attribute type

Disadvantages of hierarchical clustering:
a. Vagueness of the termination criteria
b. Most hierarchical algorithms do not revisit (intermediate) clusters once they are constructed in order to improve them

2.2 Algorithms implemented in this Project

In this project, we have implemented two linkage metric algorithms, Single-Link (MIN) and Complete-Link (MAX). The time complexity is O(n^2 log n).

Single Link Algorithm

In this algorithm, the proximity of two clusters is defined as the minimum of the distance (maximum of the similarity) between any two points in the two different clusters. In graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then these single links combine the points into clusters.

In the project, a different method is used to implement the single link algorithm. A minimum spanning tree is built using Kruskal's algorithm. Union-by-Rank and Path Compression are used for optimization.

Minimum Spanning Tree - Given a connected, undirected graph, a spanning tree of that graph is a sub-graph which is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, which is a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree or minimum weight spanning tree is then a spanning tree with weight less than or equal to the weight of every other spanning tree.
More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its connected components.

Kruskal's algorithm - Kruskal's algorithm is an algorithm in graph theory that finds a minimum spanning tree for a connected weighted graph. This means it finds a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is minimized. If the graph is not connected, then it finds a minimum spanning forest (a minimum spanning tree for each connected component). Kruskal's algorithm is an example of a greedy algorithm. It works as follows:

a. create a forest F (a set of trees), where each vertex in the graph is a separate tree
b. create a set S containing all the edges in the graph
c. while S is nonempty
   o remove an edge with minimum weight from S
   o if that edge connects two different trees, then add it to the forest, combining two trees into a single tree
   o otherwise discard that edge

For a connected graph, at the termination of the algorithm the forest has only one component and forms a minimum spanning tree of the graph.

Union By Rank – Here the root of the shallower tree is made to point to the root of the other tree. We maintain rank(x) as an upper bound on the depth of the tree rooted at x. Consider the following example:

Figure 4 Union By Rank

If, say, rank(x) = 3 and rank(y) = 2, then Union(x, y) results in a tree whose rank is the greater of the two ranks, here 3. If the two trees have the same rank, then the rank of the resulting tree increases by one.

Path Compression
1st walk: Find the name of the set. Walk up until we reach the root.
2nd walk: Retrace the path and join all the elements along the path to the root using another pointer.
This enables future finds to take shorter paths.

In the implementation of the Single Link algorithm, each point is initially considered as a singleton cluster.
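
The combination described above (Kruskal's edge ordering over a disjoint-set forest with union by rank and path compression) can be sketched as follows. This is an illustrative reconstruction and not the project's code: the class name `DisjointSet`, the function `single_link`, the use of Euclidean distance over all point pairs, and stopping once k clusters remain are assumptions made for the example.

```python
from itertools import combinations
from math import dist


class DisjointSet:
    """Union-find with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # First walk: locate the root of the tree containing x.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second walk: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False  # already in the same tree: the edge is discarded
        # Union by rank: the root of the shallower tree points to the other root.
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True


def single_link(points, k):
    """Merge the closest clusters in Kruskal order until k clusters remain."""
    # All pairwise edges, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    sets = DisjointSet(len(points))
    clusters = len(points)  # every point starts as a singleton cluster
    for _, i, j in edges:
        if clusters == k:
            break
        if sets.union(i, j):
            clusters -= 1
    # The root of each point's tree serves as its cluster label.
    return [sets.find(i) for i in range(len(points))]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = single_link(points, k=2)
```

Sorting all n(n-1)/2 edges dominates the cost, which is consistent with the O(n^2 log n) bound quoted in 2.2; the union and find operations are close to constant time in practice.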
During single-link clustering, when the Euclidean distance between two clusters (trees) is the minimum over all pairs of clusters, the two clusters are merged into a single cluster (tree) and the root node is updated.

Complete Link Algorithm

In this algorithm, the proximity of two clusters is defined as the maximum of the distance (minimum of the similarity) between any two points in the two different clusters. In graph terminology, if you start with all points as singleton clusters and add links between points one at a time, shortest links first, then a group of points is not a cluster until all the points in it are completely linked, i.e. form a clique.

Single Link is susceptible to noise and outliers. Complete Link may not work well with non-globular clusters.

2.3 Datasets and Experiments

Single Link Algorithm

Testing dataset: SPAETH2 dataset (2D voice modulation data) from the Florida State University website (around 900 data points).

Figure 5 SPAETH dataset

Output cluster plots, globular clusters:

After 28000 iterations (3 clusters remain)
Figure 6 Agglomerative Clusters After 28000 iterations

After 64000 iterations (2 clusters remain)
Figure 7 Agglomerative Clusters After 64000 iterations

Final cluster (after 65388 iterations)
Figure 8 Agglomerative Clusters After 65388 iterations

Non-globular clusters (run on CheckBoard data):

Single Link
Figure 9 Agglomerative Non – globular clusters

CURE
Figure 10 CURE Non – globular clusters

Complete Link

The algorithm was executed on a part of the Census data obtained from the UCI Repository.

Figure 11 Complete Link

Output cluster plots (compared with the CURE algorithm):

After 2000 iterations (13 clusters remain)
Figure 12 Complete Link clusters After 2000 iterations

Final cluster (after 2012 iterations)
Figure 13 Complete Link clusters After 2012 iterations

CURE
Figure 14 CURE clusters

3. DBSCAN (Using KD Trees)

The main reason why natural clusters are recognizable is that within each cluster there is a typical density of points which is considerably higher than outside the cluster. Furthermore, the density within areas of noise is lower than the density in any of the clusters. With this understanding, we can describe the core, border and noise points of a given data set.

Core points: A point is a core point if the number of points within a given neighborhood around the point, as determined by the distance function and a user-specified distance parameter Eps, exceeds a certain threshold MinPts, which is also a user-specified parameter.

Border points: A border point is not a core point, but falls within the neighborhood of a core point.

Noise points: A noise point is any point that is neither a core point nor a border point.

3.1 DBSCAN Algorithm

1. Label all points as core, border or noise points
2. Eliminate noise points
3. Put an edge between all core points that are within Eps of each other
4. Make each group of connected core points into a separate cluster
5. Assign each border point to one of the clusters of its associated core points

3.2 DBSCAN Performance Enhancements Using KD Trees

We used KD Trees to improve the efficiency of DBSCAN clustering. The worst case time complexity of the DBSCAN algorithm is O(m^2). However, it can be shown that for low dimensional data this time complexity can be reduced to O(m * log m) using KD Trees. Initializing the KD Tree is a one-time cost which the algorithm incurs while reading the data points from file. Once the KD Tree has been initialized, it can be used throughout the algorithm to classify core points, border points and noise points based on the number of nearest neighbors found, as well as to find the nearest core point for a border point. The KD Tree helps to decrease the search time for the nearest neighbor of a point from O(n) to O(log n), where n is the size of the data set.
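
To make the role of the KD Tree concrete, the following sketch builds a simple tree, uses it to answer Eps-neighborhood queries, and then follows the five steps of 3.1. It is an illustration under stated assumptions and not the project's Java code: the function names, the tuple-based tree layout, counting a point as its own neighbor, and treating "at least MinPts" as the core-point test are all choices made for the example.

```python
from math import dist


def build_kdtree(points, idxs=None, depth=0):
    """Return a KD tree as nested tuples (point index, axis, left, right)."""
    if idxs is None:
        idxs = list(range(len(points)))
    if not idxs:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2           # split at the median point
    return (
        idxs[mid],
        axis,
        build_kdtree(points, idxs[:mid], depth + 1),
        build_kdtree(points, idxs[mid + 1:], depth + 1),
    )


def region_query(node, points, q, eps, out):
    """Collect indices of all points within eps of q (q itself included)."""
    if node is None:
        return out
    i, axis, left, right = node
    if dist(points[i], q) <= eps:
        out.append(i)
    diff = q[axis] - points[i][axis]
    # Descend only into the half-spaces that can intersect the eps-ball.
    if diff <= eps:
        region_query(left, points, q, eps, out)
    if diff >= -eps:
        region_query(right, points, q, eps, out)
    return out


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    tree = build_kdtree(points)  # one-time construction cost
    neighbors = [region_query(tree, points, p, eps, []) for p in points]
    # Step 1: core points have at least min_pts neighbors (assumed threshold).
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * len(points)  # Step 2: whatever stays at -1 is noise
    cluster = 0
    for i in range(len(points)):
        if not core[i] or labels[i] != -1:
            continue
        # Steps 3 and 4: grow a cluster over core points within eps of each other.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            for n in neighbors[j]:
                if labels[n] == -1:
                    labels[n] = cluster  # Step 5: border points join here
                    if core[n]:
                        stack.append(n)
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The pruning test in `region_query` is what avoids comparing every pair of points: a subtree is skipped whenever the splitting plane lies more than Eps away from the query point.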
We saw performance improvements by using KD Trees. The algorithm was run on an Intel Pentium IV 1.8 GHz (Duo Core) system with 1 GB RAM. The program was compiled using the Java 1.6 compiler.

No. of Points    Clustering Time (sec)
1572             3.5
3568             10.9
7502             39.5
10256            78.4

Figure 15 DBSCAN performance measurements

3.3 Observations regarding DBSCAN Issues

The following are our observations:

1. The DBSCAN algorithm performs efficiently for low dimensional data.
2. The algorithm is robust towards outliers and noise points.
3. Using a KD Tree improves the efficiency over the traditional DBSCAN algorithm.
4. DBSCAN is highly sensitive to the user parameters MinPts and Eps. A slight change in the values may produce different clustering results, and prior knowledge about these values cannot be inferred easily.
5. The dataset cannot be sampled, as sampling would affect the density measures.
6. The algorithm is not partitionable for multi-processor systems.
7. DBSCAN fails to identify clusters if the density varies and if the dataset is too sparse.

4. CURE – Hierarchical Clustering (Using KD Trees)

Partitional clustering algorithms attempt to determine k partitions that optimize a certain criterion function. The square error criterion, defined below, is the most commonly used (m_i is the mean of the cluster C_i):

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2

The square error is a good measure of the within-cluster variation across all the partitions. This objective tries to make the k clusters as compact and separated as possible. However, when there are large differences in the sizes or geometries of clusters, the square error method could split large clusters to minimize the square error.

Next we considered DBSCAN, which has been explained above. Apart from its problems with clusters of varying density, DBSCAN could not be made concurrent.
In this age, when computation power is getting cheaper day by day and dedicated grids have been set up for data intensive computations, an algorithm is required which can be parallelized and takes advantage of all the resources available. DBSCAN, as we saw, had scaling problems as the data sets grew.

In comparison to Agglomerative Clustering, CURE certainly performs well. Agglomerative Clustering provides the option of choosing Single Link or Complete Link to cluster points. While the former identifies only globular clusters efficiently, the latter is computationally intensive. Hence for data sets of more than 5000 points, agglomerative clustering was highly inefficient, though quality of clustering could be achieved by using one of the above options.

Our experiments on the CURE clustering algorithm suggest that CURE depends on a few parameters, and once they are tuned for a given data set pertaining to a domain, the algorithm can scale well by adding more resources and partitioning the data.

4.1 CURE Hierarchical Clustering Algorithm

The CURE clustering algorithm is a hierarchical algorithm which merges two clusters at every step, and the clustering process is carried out in two passes. The overall hierarchical algorithm is as follows:

To enhance performance, scalability and quality of clustering, CURE adds a few more pre-clustering and post-clustering steps.

4.2 CURE Overview

While drawing the random sample, due importance was given to ensuring that all clusters were represented and none of them was missed, by estimating a minimum probability.

4.3 CURE - Data Structures Used

We used two data structures, namely the KD Tree and the Min Heap. Brief descriptions of both follow.

4.3.1 KD Tree

A KD-Tree (short for k-dimensional tree) is a space-partitioning data structure for organizing points in a k-dimensional space.
KD-Trees are a useful data structure for several applications, such as searches involving a multidimensional search key (e.g. range searches and nearest neighbour searches). KD-Trees are a special case of BSP trees.

A KD-Tree uses only splitting planes that are perpendicular to one of the coordinate system axes. This differs from BSP trees, in which arbitrary splitting planes can be used. In addition, in the typical definition every node of a KD-Tree, from the root to the leaves, stores a point. This differs from BSP trees, in which leaves are typically the only nodes that contain points (or other geometric primitives). As a consequence, each splitting plane must go through one of the points in the KD-Tree. KD-Tries are a variant that stores data only in leaf nodes. It is worth noting that in an alternative definition of the KD-Tree the points are stored in its leaf nodes only, although each splitting plane still goes through one of the points.

In CURE, the KD Tree is initialized during the initial phase of clustering to hold all the points. Later in the algorithm, we use this tree for nearest neighbor search and for finding the closest clusters based on the representative points of a cluster. When a new cluster is formed, its new representative points are added to the KD Tree. The representative points of the older clusters are deleted from the tree. The KD Tree improves the search for points in k-dimensional space from O(n) to O(log n), as it uses binary partitioning across the coordinate axes.

4.3.2 Min Heap

A Min Heap is a simple heap data structure created using a binary tree. It can be seen as a binary tree with two additional constraints:

1. The shape property: all levels of the tree, except possibly the last (deepest) one, are fully filled, and if the last level of the tree is not complete, the nodes of that level are filled from left to right.
2. The heap property: each node is less than or equal to each of its children.

The Min Heap stores the minimum element at the root of the heap.
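
As a small illustration of the heap property, the sketch below keeps cluster entries keyed by the distance to their closest neighboring cluster, so that the pair to merge next is always at the root. It uses Python's standard `heapq` module as a stand-in for a priority queue; the cluster names and distances are made-up values for the example.

```python
import heapq

# Hypothetical entries: (distance to closest cluster, cluster id, closest cluster id).
entries = [
    (4.2, "C1", "C3"),
    (1.5, "C2", "C4"),
    (3.1, "C3", "C1"),
    (1.5, "C4", "C2"),
    (9.0, "C5", "C1"),
]
heap = list(entries)
heapq.heapify(heap)                      # O(n) construction

closest = heap[0]                        # O(1): the minimum is always at the root
first = heapq.heappop(heap)              # O(log n): remove it, restore the heap property
heapq.heappush(heap, (0.7, "C6", "C5"))  # O(log n): insert an entry for a new cluster
new_root = heap[0]
```

Reading the root is constant time, while removing it or inserting an updated entry after a merge costs O(log n).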
In CURE, we always merge two clusters at every step. The cluster to be merged is necessarily the one with the smallest distance to another nearby cluster, as the heap is ordered by inter-cluster distance comparisons. Hence we can always get this cluster in O(1) time. We used java.util.PriorityQueue, which supports all the Min Heap operations.

4.4 Benefits of CURE against Other Algorithms

K-Means (and centroid based algorithms): Unsuitable for non-spherical clusters and clusters of differing sizes.

CLARANS: Needs multiple data scans (R* Trees were proposed later on). CURE uses KD Trees inherently to store the dataset and reuses them across passes.

BIRCH: Suffers from identifying only convex or spherical clusters of uniform size.

DBSCAN: No parallelism, high sensitivity to parameters, and sampling of data may affect the density measures.

4.5 Observations towards Sensitivity to Parameters

We observed that the random sample size was an important criterion while pre-clustering the data set, since random sampling often missed some of the smaller clusters. Hence we used the Chernoff bounds as given in [3] to calculate the minimum size of the sample to be selected.

The next important parameter was the Shrink Factor of Representative Points (α). If we increased α to 1, the algorithm would degenerate to MST based algorithms. If the parameter α is reduced to 0.1, CURE starts behaving as a centroid based algorithm. For a range of 0.3 to 0.7, CURE identified the right clusters.

The number of Representative Points present in a cluster is also an important parameter. If a cluster is too sparse, it may need more representative points than a compact, smaller cluster. We observed that if the number of representative points is increased to 8 or 10, sparse clusters with variable size and density were identified properly. But as the number of representative points increased, the computation time for clustering increased, since for every new cluster formed, new representative points have to be calculated and shrunk.
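
The selection and shrinking of representative points can be sketched as follows. This is a hedged illustration and not the project's implementation: the function name `representatives`, the farthest-point heuristic used to pick well-scattered points, and the convention for `alpha` are assumptions. In particular, the code follows the convention of Guha et al. [3], in which `alpha` is the fraction of the way each point is moved toward the centroid, so `alpha = 1` collapses every representative onto the centroid and `alpha = 0` leaves the scattered points untouched. The parameter described in the text above appears to run in the opposite direction, so the mapping between the two should be treated as an assumption.

```python
from math import dist


def representatives(cluster, c, alpha):
    """Pick up to c well-scattered points and shrink them toward the centroid."""
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
    scattered = []
    for _ in range(min(c, len(cluster))):
        if not scattered:
            # First representative: the point farthest from the centroid.
            best = max(cluster, key=lambda p: dist(p, centroid))
        else:
            # Next: the point farthest from its nearest already-chosen point.
            best = max(cluster, key=lambda p: min(dist(p, s) for s in scattered))
        scattered.append(best)
    # Shrink: move each chosen point a fraction alpha of the way to the centroid.
    return [
        tuple(x + alpha * (m - x) for x, m in zip(p, centroid))
        for p in scattered
    ]


cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
reps = representatives(cluster, c=4, alpha=0.5)
```

Every merge repeats both loops for the new cluster, which is one way to see why the clustering time grows with the number of representative points.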
One of the most important observations of our experiments concerned the partitioning of data sets, as CURE supports concurrent execution of the first pass of the algorithm. As the number of partitions was increased from 2 to 6 or 10, the clustering time dropped significantly. Though the number of clusters to be merged increased in the second step, the advantage of concurrent execution far outweighed it. However, we noticed that if we increased the number of partitions to higher numbers such as 50, the clustering would not give proper results, as some of the partitions would not have any data to cluster. Hence, though the time consumed would be lower, the quality of the clusters is affected and CURE could not identify all the clusters correctly. Some of them got merged to form bigger clusters. A partitioning of 10 – 20 would therefore give an efficient speed-up of the algorithm while maintaining the quality of the clusters.

Partitioning Results (time in sec)

No. of Points      1572    3568    7502    10256
Partition P = 2    6.4     7.8     29.4    75.7
Partition P = 2    6.5     7.6     21.6    43.6
Partition P = 5    6.1     7.3     12.2    21.2

Figure 16 Partitioning results

If a chart is plotted for the same data, we can see that as the partitioning is increased, the time taken to cluster increases very slowly even though the data set size has increased more than sixfold (from 1572 to 10256 points).

III. CONCLUSION

From the clusters obtained through the various algorithms and the time taken by each algorithm on the datasets, we can say that K – means is not the best of the clustering methods: its cost of O(I * K * m * n) grows with the dimensionality, so for high dimensional data K – means takes a lot of time and memory. It is also not guaranteed to converge for every combination of proximity function and centroid type.

Our experiments suggest that DBSCAN fared well for low-dimensional data. Also, if the density of the clusters did not vary too much, DBSCAN identified all the clusters fairly well. But if the size of the data increases and if the shapes and densities of the clusters vary too much, DBSCAN ends up combining or splitting those clusters.
CURE could identify all the clusters properly. But CURE depends on some user parameters which have to be data specific. The range of such parameters does not vary too much, many of them lying between 0 and 1. CURE could identify several clusters with high purity which K-means and DBSCAN failed to identify.

With respect to agglomerative clustering, clusters with high purity could be obtained, but the computation time for clustering was high. Applying Kruskal's algorithm with Union-By-Rank helped to improve the efficiency, but the computation time still increased significantly as the size of the data set increased.

IV. REFERENCES

1. Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation.
2. Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD '96.
3. S. Guha, R. Rastogi, K. Shim. CURE: An Efficient Clustering Algorithm for Large Databases. 1998.
4. Leo Wanner. Introduction to Clustering Techniques.
5. Glenn Fung. A Comprehensive Overview of Basic Clustering Algorithms.
6. Tan, Steinbach, Kumar. Introduction to Data Mining.
7. Thomas T. Cormen, Charles E. Leiserson, Ronald L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
8. Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, June 4-6, 1996, Montreal, Quebec, Canada.
9. K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1998.
10. K. Bennett, U. Fayyad, D. Geiger. Density Based Indexing for Approximate Nearest Neighbor Queries. Microsoft Research, 1998.
11. T. Kanungo, D. M. Mount, N. S. Netanyahu, C. Piatko, R. Silverman, A. Y. Wu. The Analysis of a Simple K-Means Algorithm. 2000.
12. D. Pelleg, A. Moore. Accelerating Exact K-Means Algorithms with Geometric Reasoning. 1999.