CLUSTERING
Revision
Spring 2007, SJSU
Falguni Negandhi

Overview
• Definition
• Main Features
• Types of Clustering
• Some Clustering Approaches
• Why we use Clustering
• Some methods of Clustering
• Conclusion
• References

Definition
Clustering is “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Clustering Main Features
• Clustering: a data mining technique
• Usage:
    • Statistical Data Analysis
    • Machine Learning
    • Data Mining
    • Pattern Recognition
    • Image Analysis
    • Bioinformatics

Types of Clustering
• Hierarchical: finding new clusters using previously found ones
• Partitional: finding all clusters at once
• Self-Organizing Maps
• Hybrids (incremental)

Some Clustering Approaches
• Clustering
    • Hierarchical
        • Agglomerative
        • Divisive
    • Partitional
    • Categorical
    • Large DB
        • Sampling
        • Compression

Why clustering? A few good reasons ...
• Simplifications
• Pattern detection
• Useful in data concept construction
• Unsupervised learning process

Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics

Major Existing clustering methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic

Distance based method
• In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance. This is called distance-based clustering.

Hierarchical clustering
Agglomerative (bottom up)
1. Start with each point as its own cluster (singleton).
2. Recursively merge two or more appropriate clusters.
3. Stop when k clusters are reached.
Divisive (top down)
1. Start with one big cluster.
2. Recursively divide it into smaller clusters.
3. Stop when k clusters are reached.

Partitioning clustering
1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical clustering).
This recursive relocation yields higher-quality clusters.

Steps of hierarchical clustering
Given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into K clusters.

Hierarchical algorithms
• Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
• Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Hierarchical agglomerative general algorithm
1. Find the 2 closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.

Probabilistic clustering
1. Data are picked from a mixture of probability distributions.
2. Use the mean and variance of each distribution as the parameters for a cluster.
3. Single cluster membership.

K-Clustering
K-clustering algorithm
• Result: given the input set S and a fixed integer k, a partition of S into k subsets must be returned.
• K-means clustering is the most common partitioning algorithm.
• K-Means re-assigns each record in the dataset to only one of the new clusters formed.
• A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.

K-clustering algo cont'd
1. Select k initial cluster centroids, c1, c2, c3, ..., ck.
2. Assign each instance x in S to the cluster whose centroid is nearest to x.
3. For each cluster, re-compute its centroid based on the elements it contains.
4. Go to (2) until convergence is achieved.

K-Means Clustering
• Separate the objects (data points) into K clusters.
• Cluster center (centroid) = the average of all the data points in the cluster.
• Assign each data point to the cluster whose centroid is nearest (using a distance function).

K-Means Algorithm
1. Place K points into the space of the objects being clustered. They represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. Recalculate the positions of the K centroids.
4. Repeat Steps 2 & 3 until the group centroids no longer move.

K-Means Algorithm: Example Output

Conclusion
• Very useful in data mining
• Applicable to both text and graphical based data
• Helps simplify data complexity
• Classification
• Detects hidden patterns in data

References
• Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
• Dr. Lee, Sin-Min, San Jose State University
• Mu-Yu Lu, SJSU
• Database System Concepts, Silberschatz, Korth, Sudarshan
• Wikipedia, http://en.wikipedia.org/wiki/Data_clustering#Types_of_clustering
• Enrique Blanco Garcia, http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy
• Data Clustering - Wikipedia, http://en.wikipedia.org/wiki/Data_clustering
• A Tutorial on Clustering Algorithms, http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/
• The K-Means Data Clustering Problem, http://people.scs.fsu.edu/~burkardt/f_src/kmeans/kmeans.html
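
Worked sketch for the "Steps of hierarchical clustering" slide. The four steps can be illustrated in a few lines of Python. This is only a sketch under stated assumptions: the slides do not say how the distance between two clusters is measured, so single linkage (distance between the closest members) is assumed here, and the function name `agglomerative` and the sample points are hypothetical.

```python
import math

def agglomerative(points, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    # Step 1: every item starts in its own cluster.
    clusters = [[p] for p in points]

    def cluster_distance(a, b):
        # Single linkage (an assumption): distance between the closest members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Step 2: find the closest pair of clusters.
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge the pair. Distances to the merged cluster (step 3) are
        # recomputed on demand by cluster_distance in the next pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Step 4: the loop ends once k clusters are left.
    return clusters

# Hypothetical sample data: two tight pairs and one isolated point.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerative(data, k=3)
```

Recomputing every pairwise distance on each pass keeps the sketch short; a practical implementation would maintain the N×N distance matrix mentioned on the slide and update it after each merge.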
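
Worked sketch for the "Probabilistic clustering" slide. Assuming the mean and variance of each distribution are already known, single cluster membership can be illustrated by assigning each point to the distribution under which it is most likely. The sketch assumes one-dimensional Gaussian distributions with equal weights, which the slides do not specify; the names `gaussian_pdf` and `assign` and the parameter values are hypothetical. It does not show how the parameters would be estimated from data.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(
        2 * math.pi * variance
    )

def assign(x, components):
    """Return the index of the component with the highest density at x."""
    return max(
        range(len(components)),
        key=lambda i: gaussian_pdf(x, *components[i]),
    )

# Hypothetical (mean, variance) pairs, one per cluster.
components = [(0.0, 1.0), (10.0, 4.0)]
labels = [assign(x, components) for x in [-0.5, 1.2, 8.0, 11.5]]
```

Each point receives exactly one label, matching the single cluster membership listed on the slide.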
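
Worked sketch for the "K-Means Algorithm" slide. The four steps can be written out directly in Python. This is a minimal sketch, not a tuned implementation: the slides do not say how the initial centroids are placed, so the first k points are used here as an assumption, and the function name `k_means`, the `max_iter` safety limit, and the sample data are hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    """Minimal K-means on tuples of floats (illustrative sketch)."""
    # Step 1: use the first k points as the initial centroids (an assumption).
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign each point to the group with the closest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: recompute each centroid as the average of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment

# Hypothetical sample data: two well-separated groups in the plane.
data = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, labels = k_means(data, k=2)
```

With this data the points alternate between the two groups, so `labels` alternates between cluster 0 and cluster 1, and each centroid ends at the average of its three points.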