CSE 634 Data Mining Techniques
CLUSTERING Part 2 (Group no: 1)
By: Anushree Shibani Shivaprakash & Fatima Zarinni
Spring 2006
Professor Anita Wasilewska
SUNY Stony Brook

References
Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm
Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
Margaret H. Dunham. Data Mining: Introductory and Advanced Topics.
http://cs.sunysb.edu/~cse634/

Presentation 9 – Cluster Analysis
Introduction
Major clustering methods:
Partitioning methods
Hierarchical methods
Density-based methods
Grid-based methods

Hierarchical Methods
Here we group data objects into a tree of clusters. There are two types of hierarchical clustering:
1. Agglomerative hierarchical clustering
2. Divisive hierarchical clustering

Agglomerative Hierarchical Clustering
Groups data objects in a bottom-up fashion. Initially each data object is in its own cluster. We then merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. A user can specify the desired number of clusters as a termination condition.

Divisive Hierarchical Clustering
Groups data objects in a top-down fashion. Initially all data objects are in one cluster. We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or certain termination conditions are satisfied, such as a desired number of clusters being obtained.

AGNES & DIANA
Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects, {a, b, c, d, e}.
[Figure: AGNES merges the singletons {a}, {b}, {c}, {d}, {e} step by step into {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e} (steps 0-4); DIANA performs the same splits in the reverse order (steps 4-0).]

AGNES Explored
Given a set of N items to be clustered, and an N x N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
3. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
4. Compute the distances (similarities) between the new cluster and each of the old clusters.
5. Repeat steps 3 and 4 until all items are clustered into a single cluster of size N.
Step 4 can be done in different ways, which is what distinguishes single-link from complete-link and average-link clustering.

Similarity/Distance Metrics
In single-link clustering, the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster; in complete-link clustering it is the longest such distance; in average-link clustering it is the average such distance.

Single Linkage Hierarchical Clustering
1. Say "Every point is its own cluster."
2. Find the "most similar" pair of clusters.
3. Merge them into a parent cluster.
4. Repeat.
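As a minimal sketch of the agglomerative procedure just described (an illustration, not the presenters' code), the following Python function repeatedly merges the closest pair of clusters; the `linkage` parameter switches between the single-, complete-, and average-link distance definitions above. The function name, parameters, and example data are assumptions made for the sketch.

```python
import math

def agglomerative(points, num_clusters=1, linkage="single"):
    """Johnson-style agglomerative clustering on 2-D points.

    points: list of (x, y) tuples; num_clusters: termination condition;
    linkage: 'single', 'complete', or 'average'."""
    def dist(a, b):
        return math.dist(points[a], points[b])

    def cluster_dist(c1, c2):
        # Distances between every member of c1 and every member of c2.
        ds = [dist(i, j) for i in c1 for j in c2]
        if linkage == "single":      # shortest distance
            return min(ds)
        if linkage == "complete":    # longest distance
            return max(ds)
        return sum(ds) / len(ds)     # average distance

    # Every point starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        # Find the closest (most similar) pair of clusters ...
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        # ... and merge them into a single cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Example: two natural groups, analogous to {a, b} and {c, d, e} in the AGNES figure.
print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6), (6, 6)], num_clusters=2))
```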
DIANA (DIvisive ANAlysis)
Introduced in Kaufmann and Rousseeuw (1990). Inverse order of AGNES. Eventually each node forms a cluster on its own.
[Figure: three scatter plots on 0-10 axes showing DIANA progressively splitting a single cluster into smaller clusters.]

Overview
Divisive clustering starts by placing all objects into a single group. Before we start the procedure, we need to decide on a threshold distance. The procedure is as follows:
1. The distance between all pairs of objects within the same group is determined, and the pair with the largest distance is selected.
2. This maximum distance is compared to the threshold distance. If it is larger than the threshold, the group is divided in two. This is done by placing the selected pair into different groups and using them as seed points. All other objects in this group are examined and placed into the new group with the closest seed point. The procedure then returns to Step 1.
3. If the distance between the selected objects is less than the threshold, the divisive clustering stops.
To run a divisive clustering, you simply need to decide upon a method of measuring the distance between two objects.

DIANA Explored
In DIANA, a divisive hierarchical clustering method, all of the objects initially form one cluster. The cluster is split according to some principle, such as the minimum Euclidean distance between the closest neighboring objects in the cluster. The cluster-splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.
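The following rough Python sketch follows the threshold-based divisive procedure above (an illustration of the described steps, not DIANA as published): it repeatedly finds the farthest pair within a group and, if that distance exceeds the threshold, uses the pair as seed points to split the group. The function name, helper layout, and threshold value are assumptions.

```python
import math

def divisive(points, threshold):
    """Split groups while the farthest pair in some group exceeds `threshold`."""
    groups = [list(range(len(points)))]          # start: all objects in one group
    while True:
        # 1. Find the pair with the largest distance within any single group.
        best = None
        for g_idx, g in enumerate(groups):
            for i in range(len(g)):
                for j in range(i + 1, len(g)):
                    d = math.dist(points[g[i]], points[g[j]])
                    if best is None or d > best[0]:
                        best = (d, g_idx, g[i], g[j])
        if best is None:                         # only singleton groups remain
            break
        d, g_idx, s1, s2 = best
        # 2. Compare the maximum distance to the threshold; stop if it is smaller.
        if d <= threshold:
            break
        # 3. Split: the selected pair become seed points and every other object
        #    joins the new group with the closest seed point.
        old = groups.pop(g_idx)
        new1, new2 = [s1], [s2]
        for o in old:
            if o in (s1, s2):
                continue
            if math.dist(points[o], points[s1]) <= math.dist(points[o], points[s2]):
                new1.append(o)
            else:
                new2.append(o)
        groups.extend([new1, new2])
    return groups

print(divisive([(0, 0), (1, 0), (9, 9), (10, 9)], threshold=5.0))
```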
Difficulties with Hierarchical Clustering
Hierarchical clustering encounters difficulties regarding the selection of merge and split points. Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters. It will not undo what was done previously. Thus, split or merge decisions, if not well chosen at some step, may lead to low-quality clusters.

Solutions to Improve Hierarchical Clustering
One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:
1. BIRCH
2. CURE
3. Chameleon

BIRCH: An Efficient Data Clustering Method for Very Large Databases
Paper by: Tian Zhang, Raghu Ramakrishnan, and Miron Livny, Computer Sciences Dept., University of Wisconsin-Madison (zhang@cs.wisc.edu, raghu@cs.wisc.edu, miron@cs.wisc.edu).
In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996.
Reference for paper: www2.informatik.hu-berlin.de/wm/mldm2004/zhang96birch.pdf

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies)
A hierarchical clustering method. It introduces two concepts:
1. Clustering feature
2. Clustering feature tree (CF tree)
These structures help the clustering method achieve good speed and scalability in large databases.

Clustering Feature Definition
Given N d-dimensional data points in a cluster {Xi}, i = 1, 2, ..., N, the clustering feature is
CF = (N, LS, SS)
where N is the number of data points in the cluster, LS is the linear sum of the N data points, and SS is the square sum of the N data points.

Clustering Feature Concepts
Each record (data object) is a tuple of attribute values and is treated here as a vector: Oi = (Vi1, ..., Vid).

Linear Sum Definition
LS = ∑_{i=1}^{N} Oi = ( ∑_{i=1}^{N} Vi1, ∑_{i=1}^{N} Vi2, ..., ∑_{i=1}^{N} Vid )

Square Sum Definition
SS = ∑_{i=1}^{N} Oi² = ( ∑_{i=1}^{N} Vi1², ∑_{i=1}^{N} Vi2², ..., ∑_{i=1}^{N} Vid² )

Example of a Case
Assume N = 5 and d = 2. Then
LS = ∑_{i=1}^{5} Oi = ( ∑_{i=1}^{5} Vi1, ∑_{i=1}^{5} Vi2 )
SS = ( ∑_{i=1}^{5} Vi1², ∑_{i=1}^{5} Vi2² )

Example 2
Clustering feature CF = (N, LS, SS) for the following five objects:

Object  Attribute1  Attribute2
O1      3           4
O2      2           6
O3      4           5
O4      4           7
O5      3           8

N = 5, LS = (16, 30), SS = (54, 190), so CF = (5, (16, 30), (54, 190)).
[Figure: scatter plot of the five objects on 0-10 axes.]

CF-Tree
A CF tree is a height-balanced tree with two parameters: the branching factor (B for a nonleaf node and L for a leaf node) and the threshold T. Each entry in a nonleaf node has the form [CFi, childi]. Each entry in a leaf node is a CF; each leaf node also has two pointers, 'prev' and 'next'. The CF tree is basically a tree used to store all the clustering features.
[Figure: a CF tree with a root, nonleaf nodes holding [CFi, childi] entries, and leaf nodes holding CF entries chained together by 'prev' and 'next' pointers.]

BIRCH Clustering
Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree.
[Figure: BIRCH algorithm overview.]

Summary of BIRCH
Scales linearly: with a single scan you get good clustering, and the quality of the clustering improves with a few additional scans. It handles noise (data points that are not part of the underlying pattern) effectively.
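To make the CF definition concrete, here is a small Python sketch (an illustration only, not the BIRCH authors' code) that computes CF = (N, LS, SS) for a cluster and merges two CFs by componentwise addition, the CF additivity property from the BIRCH paper that lets a CF tree update its summaries when clusters are combined. Run on the five example objects above, it reproduces CF = (5, (16, 30), (54, 190)). Function names are illustrative.

```python
def clustering_feature(objects):
    """CF = (N, LS, SS) for a list of d-dimensional points given as tuples."""
    n = len(objects)
    d = len(objects[0])
    ls = tuple(sum(o[k] for o in objects) for k in range(d))        # linear sum
    ss = tuple(sum(o[k] ** 2 for o in objects) for k in range(d))   # square sum
    return (n, ls, ss)

def merge_cf(cf1, cf2):
    """CF additivity: the CF of the union is the componentwise sum."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

objects = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]   # O1 .. O5 from the example
print(clustering_feature(objects))                   # (5, (16, 30), (54, 190))
# Splitting the cluster in two and merging the CFs gives the same result.
print(merge_cf(clustering_feature(objects[:2]), clustering_feature(objects[2:])))
```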
Density-Based Clustering Methods
Clustering based on density (density-connected points) rather than on a distance metric alone. A cluster is a set of "density-connected" points. Major features: discovers clusters of arbitrary shape; handles noise; needs "density parameters" as a termination condition (the cluster is complete when no new objects can be added to it). Examples: DBSCAN (Ester et al. 1996), OPTICS (Ankerst et al. 1999), DENCLUE (Hinneburg & Keim 1998).

Density-Based Clustering: Background
Eps-neighborhood: the neighborhood within a radius Eps of a given object.
MinPts: the minimum number of points required in the Eps-neighborhood of an object.
Core object: if the Eps-neighborhood of an object contains at least a minimum number of points MinPts, then the object is a core object.
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if 1) p is within the Eps-neighborhood of q, and 2) q is a core object.
[Figure: p directly density-reachable from q, with MinPts = 5 and Eps = 1.]

Figure showing density reachability and density connectivity in density-based clustering (MinPts = 3; Eps = radius of the circles):
M, P, O, R and S are core objects, since each has an Eps-neighborhood containing at least 3 points.
Directly density-reachable: Q is directly density-reachable from M. M is directly density-reachable from P, and vice versa.
Indirectly density-reachable: Q is indirectly density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.

Core, Border, and Noise Points
DBSCAN is a density-based algorithm. Density = number of points within a specified radius (Eps). A point is a core point if it has more than a specified number of points (MinPts) within Eps; these are points in the interior of a cluster. A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point. A noise point is any point that is neither a core point nor a border point.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm
Arbitrarily select a point p. Retrieve all points density-reachable from p w.r.t. Eps and MinPts. If p is a core point, a cluster is formed. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. Continue the process until all of the points have been processed.
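The Python sketch below follows the DBSCAN steps just described (a simplified illustration, not the original implementation): it labels core, border, and noise points and grows a cluster from each unvisited core point by collecting everything density-reachable from it. The parameters are the Eps and MinPts of the text; the example data and the brute-force neighborhood search are assumptions.

```python
import math

def dbscan(points, eps, min_pts):
    """Return a cluster id for each point; -1 marks noise."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)              # None = not yet processed
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        seeds = neighbors(p)
        if len(seeds) < min_pts:               # p is not a core point
            labels[p] = -1                     # tentatively noise
            continue
        labels[p] = cluster_id                 # p is a core point: new cluster
        queue = [q for q in seeds if q != p]
        while queue:                           # collect density-reachable points
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id         # former noise point -> border point
                continue
            if labels[q] is not None:
                continue                       # already assigned to a cluster
            labels[q] = cluster_id
            q_neighbors = neighbors(q)
            if len(q_neighbors) >= min_pts:    # q is also a core point: keep expanding
                queue.extend(q_neighbors)
        cluster_id += 1
    return labels

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (5, 20)]
print(dbscan(pts, eps=1.5, min_pts=3))         # two clusters plus one noise point
```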
Conclusions
We discussed two hierarchical clustering methods, agglomerative and divisive. We also discussed BIRCH, a hierarchical clustering method that produces good clusters over a single scan and better clusters with a few additional scans. DBSCAN is a density-based clustering algorithm through which we discover clusters of arbitrary shape; unlike the hierarchical methods, it clusters by density rather than by distance alone.

GRID-BASED CLUSTERING METHODS
This is the approach in which we quantize space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed. So, for example, assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (plane) into a grid structure and find the clusters there.
[Figure: a salary (in units of $10,000) versus age plane, labeled "Our 'space' is this plane".]

Techniques for Grid-Based Clustering
The following are some techniques that are used to perform grid-based clustering:
CLIQUE (CLustering In QUest)
STING (STatistical Information Grid)
WaveCluster

Looking at CLIQUE as an Example
CLIQUE is used for the clustering of high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes. CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.

Definitions That Need to Be Known
Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
Dense: a unit is dense if the fraction of the total data points contained in the unit exceeds the input model parameter.
Cluster: a cluster is defined as a maximal set of connected dense units.

How Does CLIQUE Work?
Let us say that we have a set of records that we would like to cluster in terms of n attributes, so we are dealing with an n-dimensional space.
MAJOR STEPS: CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals. Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.

CLIQUE: Major Steps (Cont.)
Now CLIQUE's goal is to identify the dense n-dimensional units. It does this in the following way: CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces. So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces). It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality would exist.

CLIQUE: Major Steps (Cont.)
Each maximal set of connected dense units is considered a cluster. Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces. The information from the subspaces is then used to find clusters in the n-dimensional space. It must be noted that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells.

Example for CLIQUE
Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation and age. The data space for this data would be 3-dimensional.
[Figure: a 3-dimensional data space with axes salary, vacation, and age.]

Example (Cont.)
After plotting the data objects, each dimension (i.e., salary, vacation and age) is split into intervals of equal length. Then we form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle. Now our goal is to find the dense 3-D rectangular units.

Example (Cont.)
To do this, we find the dense units of the subspaces of this 3-D space. So we find the dense units with respect to age and salary: we look at the salary-age plane and find all the 2-D rectangular units that are dense. We also find the dense 2-D rectangular units for the vacation-age plane.
[Figure: dense 2-D units in the salary-age plane (salary in $10,000 vs. age) and in the vacation-age plane (vacation in weeks vs. age).]

Example (Cont.)
Now let us try to visualize the dense units of the two planes on a 3-D figure.
[Figure: the dense units of the salary-age and vacation-age planes shown in the 3-D salary-vacation-age space.]

Example (Cont.)
We can extend the dense areas in the vacation-age plane inwards, and the dense areas in the salary-age plane upwards. The intersection of these two spaces gives us a candidate search space in which 3-dimensional dense units exist. We then find the dense units in the salary-vacation plane and form an extension of the subspace that represents these dense units.

Example (Cont.)
Now we perform an intersection of the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-D dense units.

So, what was the main idea?
We used the dense units in subspaces in order to find the dense units in the 3-dimensional space. After finding the dense units, it is very easy to find clusters.

Reflecting upon CLIQUE
Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces? Because the Apriori property employs prior knowledge of the items in the search space so that portions of the space can be pruned. The property for CLIQUE says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.

Strengths and Weaknesses of CLIQUE
Strengths: it automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces. It is quite efficient. It is insensitive to the order of the records in the input and does not presume some canonical data distribution. It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
Weakness: the accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
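As a toy illustration of this bottom-up idea (not the actual CLIQUE implementation), the Python sketch below partitions each dimension into equal-length intervals, finds the dense 1-dimensional units, and then counts only those 2-dimensional candidate units whose two 1-dimensional projections are dense, which is the Apriori-style pruning described above. The interval count, density threshold, and example data are assumptions.

```python
from collections import Counter
from itertools import combinations

def dense_units_2d(points, intervals=5, density_threshold=0.2):
    """Find dense 2-D units whose 1-D projections are both dense (Apriori pruning)."""
    n, dims = len(points), len(points[0])
    lo = [min(p[k] for p in points) for k in range(dims)]
    hi = [max(p[k] for p in points) for k in range(dims)]

    def interval_of(value, k):
        # Index of the equal-length interval containing `value` along dimension k.
        width = (hi[k] - lo[k]) / intervals or 1.0
        return min(int((value - lo[k]) / width), intervals - 1)

    # 1-D pass: which intervals of each single dimension are dense?
    dense_1d = {}
    for k in range(dims):
        counts = Counter(interval_of(p[k], k) for p in points)
        dense_1d[k] = {i for i, c in counts.items() if c / n > density_threshold}

    # 2-D pass: only candidate units whose 1-D projections are dense are counted.
    result = {}
    for k1, k2 in combinations(range(dims), 2):
        counts = Counter(
            (interval_of(p[k1], k1), interval_of(p[k2], k2))
            for p in points
            if interval_of(p[k1], k1) in dense_1d[k1]
            and interval_of(p[k2], k2) in dense_1d[k2]
        )
        result[(k1, k2)] = {u for u, c in counts.items() if c / n > density_threshold}
    return result

pts = [(25, 3), (26, 3.2), (27, 3.1), (55, 7), (56, 7.2), (57, 6.9), (40, 5)]
print(dense_units_2d(pts, intervals=4, density_threshold=0.25))
```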
STING: A Statistical Information Grid Approach to Spatial Data Mining
Paper by: Wei Wang, Jiong Yang, and Richard Muntz, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A. (weiwang@cs.ucla.edu, jyang@cs.ucla.edu, muntz@cs.ucla.edu).
VLDB Conference, Athens, Greece, 1997.
Reference for paper: http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF

Definitions That Need to Be Known
Spatial data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.
Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.

STING (Introduction)
STING is used for performing clustering on spatial data. STING uses a hierarchical multi-resolution grid data structure to partition the spatial area. STING's big benefit is that it processes many common "region-oriented" queries on a set of points efficiently. We want to cluster the records that are in a spatial table in terms of location. The placement of a record in a grid cell is completely determined by its physical location.

Hierarchical Structure of the Grid Cells
The spatial area is divided into rectangular cells (using latitude and longitude). The cells form a hierarchical structure: each cell at a higher level is further partitioned into 4 smaller cells at the lower level. In other words, each cell at the i-th level (except the leaves) has 4 children at level i+1, and the union of the 4 child cells gives back the parent cell at the level above them.

Hierarchical Structure of Cells (Cont.)
The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants. So, why do we have a hierarchical structure for cells? We have it in order to provide better granularity, or higher resolution.
[Figure: a hierarchical structure of grid cells for STING clustering.]

Statistical Parameters Stored in Each Cell
For each cell in each layer we have attribute-dependent and attribute-independent parameters.
Attribute-independent parameter: Count – the number of records in this cell.
Attribute-dependent parameters (we assume that our attribute values are real numbers):

Statistical Parameters (Cont.)
For each attribute of each cell we store the following parameters:
M – the mean of all values of the attribute in this cell.
S – the standard deviation of all values of the attribute in this cell.
Min – the minimum value of the attribute in this cell.
Max – the maximum value of the attribute in this cell.
Distribution – the type of distribution that the attribute values in this cell follow (e.g., normal, exponential, etc.). None is assigned to Distribution if the distribution is unknown.

Storing of Statistical Parameters
Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand. The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table. The statistical parameters for the cells at all the other levels are computed from their respective child cells at the lower level, as sketched below.
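A minimal Python sketch of this bottom-up parameter computation, assuming a single numeric attribute per cell and the usual formulas for combining counts, means, and standard deviations (the variance-combination step is standard statistics, not something spelled out in the slides; cell layout and function names are illustrative, not the STING authors' code): leaf-cell statistics are computed directly from the records, and a parent cell's statistics are derived from its four children without touching the records again.

```python
import math

def leaf_stats(values):
    """Statistics for one attribute of one leaf cell, computed from the raw records."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)   # population std. deviation
    return {"count": n, "mean": m, "std": s, "min": min(values), "max": max(values)}

def parent_stats(children):
    """Statistics of a parent cell derived only from its (up to 4) child cells."""
    n = sum(c["count"] for c in children)
    m = sum(c["count"] * c["mean"] for c in children) / n
    # Combine variances via the sum of squares accumulated from each child.
    sum_sq = sum(c["count"] * (c["std"] ** 2 + c["mean"] ** 2) for c in children)
    s = math.sqrt(max(sum_sq / n - m ** 2, 0.0))
    return {"count": n, "mean": m, "std": s,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}

# Four leaf cells holding, say, apartment rents; the parent cell summarizes them all.
leaves = [leaf_stats(v) for v in ([800, 850], [900, 950, 1000], [700], [1200, 1100])]
print(parent_stats(leaves))
```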
How Are Queries Processed?
STING can answer many queries (especially region queries) efficiently, because we do not have to access the full database. We use a top-down approach to answer spatial data queries: start from a pre-selected layer, typically one with a small number of cells. The pre-selected layer does not have to be the topmost layer. For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.

Query Processing (Cont.)
The confidence interval is calculated by using the statistical parameters of each cell. Remove irrelevant cells from further consideration. When finished with the current layer, proceed to the next lower level; processing of the next lower level examines only the remaining relevant cells. Repeat this process until the bottom layer is reached.
[Figure: the different grid levels examined during query processing.]

Sample Query Example
Assume that the spatial area is the map of the regions of Long Island, Brooklyn and Queens, and that our records represent apartments located throughout this region.
Query: "Find all the apartments that are for rent near Stony Brook University that have a rent range of $800 to $1000."
The above query depends upon the parameter "near." For our example, "near" means within 15 miles of Stony Brook University.

Advantages and Disadvantages of STING
ADVANTAGES: Very efficient. The computational complexity is O(k), where k is the number of grid cells at the lowest level; usually k << N, where N is the number of records. STING is a query-independent approach, since the statistical information exists independently of queries. Incremental update.
DISADVANTAGES: All cluster boundaries are either horizontal or vertical; no diagonal boundary is selected.

Thank you!