Data Mining: Concepts and Techniques
— Chapter 7 —
Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
6. Density-Based Methods

What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters, using distance (or similarity) measures
- Unsupervised learning: no predefined classes
- Typical applications:
  - As a stand-alone tool to get insight into the data distribution
  - As a preprocessing step for other algorithms

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Example: clustering for ontology alignment
(figure)

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Chapter 7. Cluster Analysis
(outline repeated; next: 2. Types of Data in Cluster Analysis)

Data Structures
- Data matrix (two modes): n objects, p attributes; one row represents one object
    x_11 ... x_1f ... x_1p
     :        :        :
    x_i1 ... x_if ... x_ip
     :        :        :
    x_n1 ... x_nf ... x_np
- Dissimilarity matrix (one mode): a distance table
       0
    d(2,1)    0
    d(3,1)  d(3,2)   0
      :       :
    d(n,1)  d(n,2)  ...  0

Type of data in clustering analysis
- Interval-scaled variables: continuous measurements (weight, temperature, ...)
- Binary variables: variables with 2 states (on/off, yes/no)
- Nominal variables: a generalization of the binary variable in that it can take more than 2 states (color: red, yellow, blue, green)
- Ordinal variables: ranking is important (e.g., medals: gold, silver, bronze)
- Ratio variables: a positive measurement on a nonlinear scale (growth)
- Variables of mixed types
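The two data structures above can be illustrated with a short sketch. The code below is an assumed illustration (not part of the slides): it builds a small hypothetical n x p data matrix and derives the n x n dissimilarity matrix from it with SciPy.

```python
# A minimal sketch of the two data structures: an n x p data matrix and the
# n x n dissimilarity matrix derived from it (symmetric, zero diagonal).
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data matrix: n = 4 objects, p = 2 interval-scaled attributes.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [4.0, 5.0],
              [5.0, 7.0]])

# Dissimilarity matrix d(i, j) using Euclidean distance; only the lower
# triangle carries information.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```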
Interval-valued variables
- Sometimes we need to standardize the data
- Calculate the mean absolute deviation:
  s_f = (1/n) (|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|),
  where m_f = (1/n) (x_{1f} + x_{2f} + ... + x_{nf})
- Calculate the standardized measurement (z-score):
  z_{if} = (x_{if} - m_f) / s_f

Distances between objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Properties:
  - d(i,j) >= 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) <= d(i,k) + d(k,j)

Distances between objects
- Let i = (x_{i1}, x_{i2}, ..., x_{ip}) and j = (x_{j1}, x_{j2}, ..., x_{jp}) be two p-dimensional data objects
- Euclidean distance: d(i,j) = sqrt(|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + ... + |x_{ip} - x_{jp}|^2)
- Manhattan distance: d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + ... + |x_{ip} - x_{jp}|

Distances between objects
- Minkowski distance: d(i,j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q)^{1/q}, where q is a positive integer
- If q = 1, d is the Manhattan distance
- If q = 2, d is the Euclidean distance

Binary Variables
- Symmetric binary variables: both states are equally important; 0/1
- Asymmetric binary variables: one state is more important than the other (e.g., the outcome of a disease test); 1 is the important state, 0 the other

Contingency tables for Binary Variables
               Object j
                 1      0      sum
  Object i  1    a      b      a+b
            0    c      d      c+d
          sum   a+c    b+d      p
- a: number of attributes having 1 for object i and 1 for object j
- b: number of attributes having 1 for object i and 0 for object j
- c: number of attributes having 0 for object i and 1 for object j
- d: number of attributes having 0 for object i and 0 for object j
- p = a + b + c + d

Distance measure for symmetric binary variables
- d(i,j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables
- d(i,j) = (b + c) / (a + b + c)
- Jaccard coefficient: sim_Jaccard(i,j) = a / (a + b + c) = 1 - d(i,j)

Dissimilarity between Binary Variables
- Example:
  Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack   M       Y      N      P       N       N       N
  Mary   F       Y      N      P       N       P       N
  Jim    M       Y      P      N       N       N       N
- Gender is a symmetric attribute; the remaining attributes are asymmetric binary
- Distance based on the asymmetric attributes: let the values Y and P be set to 1, and the value N be set to 0
  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Nominal or Categorical Variables
- A generalization of the binary variable in that it can take more than 2 states
- Method 1: simple matching
  d(i,j) = (p - m) / p, where m is the # of matches and p the total # of variables
- Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states

Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled:
  - replace x_{if} by its rank r_{if} in {1, ..., M_f}
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_{if} = (r_{if} - 1) / (M_f - 1)
  - compute the dissimilarity using methods for interval-scaled variables
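The following is a small sketch (assumed code, not from the slides) that recomputes the asymmetric binary dissimilarities of the Jack/Mary/Jim example above, with Y and P mapped to 1 and N to 0.

```python
# Asymmetric binary dissimilarity: d(i, j) = (b + c) / (a + b + c),
# i.e. the (0, 0) matches counted in d are ignored.
def asym_binary_dist(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 (gender is left out: symmetric)
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(asym_binary_dist(jack, mary), 2))  # 0.33
print(round(asym_binary_dist(jack, jim), 2))   # 0.67
print(round(asym_binary_dist(jim, mary), 2))   # 0.75
```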
Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^{Bt} or Ae^{-Bt}
- Methods:
  - treat them like interval-scaled variables—not a good choice! (why?—the scale can be distorted)
  - apply a logarithmic transformation, y_{if} = log(x_{if})
  - treat them as continuous ordinal data and treat their rank as interval-scaled

Variables of Mixed Types
- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
- One may use a weighted formula to combine their effects:
  d(i,j) = sum_f ( delta_ij^(f) * d_ij^(f) ) / sum_f delta_ij^(f)
  - delta_ij^(f) = 0 iff (i) the x-value is missing, or (ii) both x-values are 0 and f is an asymmetric binary attribute
  - f is binary or nominal: d_ij^(f) = 0 if x_{if} = x_{jf}, and d_ij^(f) = 1 otherwise
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute the ranks r_{if} and z_{if} = (r_{if} - 1) / (M_f - 1), and treat z_{if} as interval-scaled

Vector Objects
- Vector objects: keywords in documents, gene features in micro-arrays, etc.
- Broad applications: information retrieval, biologic taxonomy, etc.
- Cosine measure

Vector model for information retrieval (simplified)
- sim(d, q) = (d . q) / (|d| x |q|)
- Example with terms (adrenergic, receptor, cloning): Q = (1,1,1), Doc1 = (1,1,0), Doc2 = (0,1,0)

Chapter 7. Cluster Analysis
(outline repeated; next: 3. A Categorization of Major Clustering Methods)

Major Clustering Approaches (I)
- Partitioning approach: construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors. Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach: create a hierarchical decomposition of the set of data (or objects) using some criterion. Typical methods: Diana, Agnes, BIRCH, ROCK, CHAMELEON
- Density-based approach: based on connectivity and density functions. Typical methods: DBSCAN, OPTICS, DenClue

Major Clustering Approaches (II)
- Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE
- Model-based: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model. Typical methods: EM, SOM, COBWEB
- Frequent pattern-based: based on the analysis of frequent patterns. Typical methods: pCluster
- User-guided or constraint-based: clustering by considering user-specified or application-specific constraints. Typical methods: COD (obstacles), constrained clustering

Typical Alternatives to Calculate the Distance between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min(t_ip, t_jq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max(t_ip, t_jq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg(t_ip, t_jq)
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj); a medoid is one chosen, centrally located object in the cluster

Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster
  C_m = ( sum_{i=1..N} t_ip ) / N
- Radius: square root of the average distance from any point of the cluster to its centroid
  R_m = sqrt( sum_{i=1..N} (t_ip - c_m)^2 / N )
- Diameter: square root of the average mean squared distance between all pairs of points in the cluster
  D_m = sqrt( sum_{i=1..N} sum_{j=1..N} (t_ip - t_jq)^2 / (N (N - 1)) )
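The centroid, radius and diameter defined above are easy to compute directly. The sketch below is an assumed illustration on a small hypothetical cluster, following the formulas on the slide.

```python
# Centroid C_m, radius R_m and diameter D_m of one cluster of numeric points.
import numpy as np

pts = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])  # hypothetical cluster
N = len(pts)

centroid = pts.mean(axis=0)                                               # C_m
radius = np.sqrt((np.linalg.norm(pts - centroid, axis=1) ** 2).sum() / N)  # R_m

# D_m: average squared distance over all pairs (the i = j terms are zero), then sqrt
sq_d = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=2)
diameter = np.sqrt(sq_d.sum() / (N * (N - 1)))

print(centroid, round(radius, 3), round(diameter, 3))
```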
Chapter 7. Cluster Analysis
(outline repeated; next: 4. Partitioning Methods)

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:
  sum_{m=1..k} sum_{t_mi in K_m} (C_m - t_mi)^2
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: the k-means and k-medoids algorithms
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
- Given k and data D:
  1. Arbitrarily choose k objects as initial cluster centers
  2. Repeat:
     - assign each object to the cluster to which it is most similar, based on the mean values of the objects in the cluster
     - update the cluster means (calculate the mean value of the objects in each cluster)
     until no change

The K-Means Clustering Method
(figure: example with K = 2; arbitrarily choose K objects as initial cluster centers, assign each object to the most similar center, update the cluster means, reassign, update again, and repeat until the assignments no longer change)

Comments on the K-Means Method
- Strength: relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
  - Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Weakness:
  - applicable only when the mean is defined; what about categorical data?
  - need to specify k, the number of clusters, in advance
  - unable to handle noisy data and outliers
  - not suitable for discovering clusters with non-convex shapes

What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
- K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, medoids can be used, which are the most centrally located objects in a cluster
(figure)

The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987): starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering
- PAM works effectively for small data sets, but does not scale well to large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): randomized sampling
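Before turning to the k-medoids algorithms in detail, here is a minimal sketch of the k-means loop described above (assign to the nearest mean, update the means, stop when nothing changes). It is an assumed illustration, not the slides' code; empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # arbitrary initial centers
    for _ in range(n_iter):
        # assignment step: each object goes to the nearest center (Euclidean)
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # update step: mean of the objects in each cluster (empty clusters not handled)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):            # "until no change"
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])
labels, centers = kmeans(X, k=2)
```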
A Typical K-Medoids Algorithm (PAM) - idea
(figure: example with K = 2; total cost 26 before and 20 after a swap)
- Arbitrarily choose k objects as initial medoids
- Assign each remaining object to the nearest medoid
- Do loop, until no change:
  - randomly select a non-medoid object O_random
  - compute the total cost of swapping a medoid O with O_random
  - swap O and O_random if the quality is improved

PAM (Partitioning Around Medoids) (1987)
- PAM (Kaufman and Rousseeuw, 1987)
- Algorithm:
  1. Select k representative objects arbitrarily
  2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
  3. Select the pair i and h that corresponds to the minimum swapping cost; if TC_ih < 0, i is replaced by h, and each non-selected object is then assigned to the most similar representative object
  4. Repeat steps 2-3 until there is no change

PAM Clustering: total swapping cost TC_ih = sum_j C_jih
(figure: the four cases for the contribution C_jih of a non-selected object j when medoid i is swapped with non-medoid h, with t another current medoid)
- j moves from i to h: C_jih = d(j, h) - d(j, i)
- j moves from i to t: C_jih = d(j, t) - d(j, i)
- j stays with t: C_jih = 0
- j moves from t to h: C_jih = d(j, h) - d(j, t)

What Is the Problem with PAM?
- PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean
- PAM works efficiently for small data sets but does not scale well to large data sets: O(k(n-k)^2) per iteration, where n is the # of data points and k is the # of clusters
- Sampling-based method: CLARA (Clustering LARge Applications)

CLARA (Clustering Large Applications) (1990)
- CLARA (Kaufmann and Rousseeuw, 1990)
- Built into statistical analysis packages, such as S+
- It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness:
  - efficiency depends on the sample size
  - a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased

CLARA (Clustering Large Applications) (1990)
- Algorithm (n = 5, s = 40 + 2k); repeat n times:
  - draw a sample of s objects from the entire data set and perform PAM to find k medoids of the sample
  - assign each non-selected object in the entire data set to the most similar medoid
  - calculate the average dissimilarity of the clustering; if this value is smaller than the current minimum, use it as the current minimum and retain the k medoids as the best so far

Graph abstraction
- A node represents k objects (medoids), a potential solution for the clustering
- Nodes are neighbors if their sets of objects differ by one object
- Each node has k(n-k) neighbors
- The cost differential between two neighbors is TC_ih (with O_i and O_h being the differing objects in the medoid sets)

Graph Abstraction
- PAM searches for the node in the graph with minimum cost
- CLARA searches in smaller graphs (as it uses PAM on samples of the entire data set)
- CLARANS:
  - searches in the original graph
  - searches only part of the graph
  - uses the neighbors to guide the search
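A compact PAM sketch is given below. It is an assumed illustration of the swap loop described above (not the original implementation): starting from arbitrary medoids, it accepts any medoid/non-medoid swap whose total cost change TC_ih is negative, and stops when no swap improves the clustering.

```python
import numpy as np

def total_cost(D, medoids):
    """Sum over all objects of the distance to their nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, k, replace=False))    # step 1: arbitrary medoids
    best = total_cost(D, medoids)
    improved = True
    while improved:                                     # repeat until no change
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(D, candidate)         # TC_ih = cost - best
                if cost < best:                         # TC_ih < 0: accept the swap
                    medoids, best, improved = candidate, cost, True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Usage: D is an n x n dissimilarity matrix, e.g. squareform(pdist(X)) from SciPy.
```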
CLARANS ("Randomized" CLARA) (1994)
- CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han'94)
- It is more efficient and scalable than both PAM and CLARA

CLARANS ("Randomized" CLARA) (1994)
- Algorithm parameters:
  - numlocal: number of local minima to be found
  - maxneighbor: maximum number of neighbors to compare
- Repeat numlocal times (find a local minimum):
  1. Take an arbitrary node in the graph
  2. Consider a random neighbor S of the current node and calculate the cost differential. If S has lower cost, set S as the current node and repeat this step; if S does not have lower cost, repeat this step (checking at most maxneighbor neighbors)
  3. Compare the cost of the current node with the minimum cost so far. If the cost is lower, set the minimum cost to the cost of the current node, and bestnode to the current node

Chapter 7. Cluster Analysis
(outline repeated; next: 5. Hierarchical Methods)

Hierarchical Clustering
- Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition
- Agglomerative (AGNES): merges bottom-up, e.g. a, b, c, d, e -> ab -> de -> cde -> abcde (steps 0 to 4)
- Divisive (DIANA): splits top-down, traversing the same hierarchy in the reverse order (steps 4 to 0)

Complete-link Clustering Example
- Dissimilarity matrix for objects 1-5:
       1    2    3    4    5
  1    0
  2    2    0
  3    6    3    0
  4   10    9    7    0
  5    9    8    5    4    0
- Merge the closest pair {1,2} (distance 2). Distances to the new cluster are the maxima:
  d((1,2),3) = max{d(1,3), d(2,3)} = max{6,3} = 6
  d((1,2),4) = max{d(1,4), d(2,4)} = max{10,9} = 10
  d((1,2),5) = max{d(1,5), d(2,5)} = max{9,8} = 9

Complete-link Clustering Example
- Next merge {4,5} (distance 4):
  d((1,2),(4,5)) = max{d((1,2),4), d((1,2),5)} = max{10,9} = 10
  d(3,(4,5)) = max{d(3,4), d(3,5)} = max{7,5} = 7

Complete-link Clustering Example
- Next merge 3 into (1,2) (distance 6):
  d((1,2,3),(4,5)) = max{d((1,2),(4,5)), d(3,(4,5))} = 10
- The final merge occurs at distance 10. Cutting the resulting dendrogram at threshold th = 5 yields the clusters {1,2}, {3}, {4,5}; cutting at th = 9 yields {1,2,3} and {4,5}

AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Goes on in a non-descending fashion
- Eventually all nodes belong to the same cluster
(figure)

Dendrogram: Shows How the Clusters are Merged
- Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
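The complete-link example and the dendrogram cuts above can be checked with SciPy. The code below is an assumed sketch, not part of the slides; it feeds the example's dissimilarity matrix to scipy's hierarchical clustering and cuts the tree at the two thresholds used above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

D = np.array([[0, 2, 6, 10, 9],
              [2, 0, 3, 9, 8],
              [6, 3, 0, 7, 5],
              [10, 9, 7, 0, 4],
              [9, 8, 5, 4, 0]], dtype=float)

Z = linkage(squareform(D), method="complete")   # linkage expects condensed distances
print(Z[:, 2])                                  # merge heights: 2, 4, 6, 10
print(fcluster(Z, t=5, criterion="distance"))   # th = 5 -> {1,2}, {3}, {4,5}
print(fcluster(Z, t=9, criterion="distance"))   # th = 9 -> {1,2,3}, {4,5}
```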
DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Inverse order of AGNES
- Eventually each node forms a cluster on its own
(figure)

Recent Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods:
  - they do not scale well: time complexity of at least O(n^2), where n is the number of total objects
  - they can never undo what was done previously
- Integration of hierarchical with distance-based clustering:
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - ROCK (1999): clustering categorical data by neighbor and link analysis
  - CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan & Livny, SIGMOD'96)
- Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan the DB to build an initial in-memory CF-tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
- Weakness: handles only numeric data, is sensitive to the order of the data records, and does not always find natural clusters

Clustering Feature Vector in BIRCH
- Clustering Feature: CF = (N, LS, SS)
  - N: number of data points
  - LS: sum_{i=1..N} X_i (linear sum)
  - SS: sum_{i=1..N} X_i^2 (square sum)
- Example: for the points (3,4), (2,6), (4,5), (4,7), (3,8), CF = (5, (16,30), (54,190))
(figure)

CF-Tree in BIRCH
- Clustering feature:
  - a summary of the statistics for a given subcluster: the 0th, 1st and 2nd moments of the subcluster from the statistical point of view
  - registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - a nonleaf node has children and stores the sums of the CFs of its children
  - a nonleaf node represents a cluster made of the subclusters represented by its children
  - a leaf node represents a cluster made of the subclusters represented by its entries
- A CF-tree has two parameters:
  - branching factor: specifies the maximum number of children
  - threshold: the maximum diameter of the sub-clusters stored at the leaf nodes

The CF Tree Structure
(figure: a root and non-leaf nodes holding CF entries with child pointers (branching factor B = 6), and leaf nodes holding CF entries chained by prev/next pointers (threshold T = 7))
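A small sketch of the clustering feature (assumed code, not BIRCH itself) is shown below. It reproduces CF = (5, (16,30), (54,190)) for the five example points above and shows how the centroid and radius of a subcluster can be derived from the CF alone, which is what makes the CF-tree summary useful.

```python
import numpy as np

def clustering_feature(points):
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)   # N, LS, SS

N, LS, SS = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(N, LS, SS)        # 5, [16. 30.], [54. 190.]

# Centroid and radius can be computed from the CF without the raw points.
centroid = LS / N
radius = np.sqrt((SS / N - centroid ** 2).sum())
print(centroid, round(radius, 3))
```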
ROCK: Clustering Categorical Data
- ROCK: RObust Clustering using linKs (S. Guha, R. Rastogi & K. Shim, ICDE'99)
- Major ideas:
  - use links to measure similarity/proximity
  - maximize the sum of the number of links between points within a cluster, and minimize the sum of the number of links for points in different clusters
- Computational complexity: O(n^2 + n m_m m_a + n^2 log n)

Similarity Measure in ROCK
- Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient
- Example: two groups (clusters) of transactions
  - C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 intersect T2| / |T1 union T2|
  - Ex. Let T1 = {a, b, c}, T2 = {c, d, e}: Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
- The Jaccard coefficient may lead to a wrong clustering result:
  - within C1 it ranges from 0.2 ({a, b, c}, {b, d, e}) to 0.5 ({a, b, c}, {a, b, d})
  - between C1 and C2 it could be as high as 0.5 ({a, b, c}, {a, b, f})

Link Measure in ROCK
- Neighbors: p1 and p2 are neighbors iff sim(p1, p2) >= t (sim and t between 0 and 1)
- Link(pi, pj) is the number of common neighbors between pi and pj

Link Measure in ROCK
- Links: # of common neighbors
- C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
- C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}, with sim the Jaccard coefficient and t = 0.5
  - link(T1, T2) = 4, since they have 4 common neighbors: {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
  - link(T1, T3) = 5, since they have 5 common neighbors: {a, b, d}, {a, b, e}, {a, b, g}, {a, b, c}, {a, b, f}

Link Measure in ROCK
- Link(Ci, Cj) = the number of cross links between clusters Ci and Cj
- G(Ci, Cj): goodness measure for merging Ci and Cj = Link(Ci, Cj) divided by the expected number of cross links

The ROCK Algorithm
- Sampling-based clustering:
  - draw a random sample
  - perform hierarchical clustering with links, using the goodness measure for merging
  - label the data on disk: a point is assigned to the cluster for which it has the most neighbors after normalization

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
- CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar'99
- Measures the similarity based on a dynamic model:
  - two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
- A two-phase algorithm:
  1. Use a graph-partitioning algorithm: cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm: find the genuine clusters by repeatedly combining these sub-clusters

Overall Framework of CHAMELEON
(figure: Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters)

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
- Phase 1: use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
- Based on the k-nearest-neighbor graph (a construction sketch follows below):
  - there is an edge between two nodes if the point corresponding to either node is among the k most similar points of the point corresponding to the other node
  - the edge weight is the density of the region
  - dynamic notion of neighborhood: in regions of high density a neighborhood radius is small, while in sparse regions the neighborhood radius is large

CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
- Phase 2: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining the sub-clusters
  - interconnectivity between clusters Ci and Cj: normalized sum of the weights of the edges that connect nodes in Ci and Cj
  - closeness of clusters Ci and Cj: average similarity between points in Ci that are connected to points in Cj
  - merge if both measures are above user-defined thresholds

CHAMELEON (Clustering Complex Objects)
(figure: clustering results on complex 2-D data sets)
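The k-nearest-neighbor graph used in CHAMELEON's first phase can be sketched as follows. This is an assumed illustration (not CHAMELEON's implementation): an edge joins two points if either point is among the k nearest neighbors of the other, so the directed k-NN relation is symmetrized with a logical OR.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(0).rand(100, 2)    # hypothetical 2-D data set
k = 5

A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")  # directed k-NN edges
A = A.maximum(A.T)                            # keep an edge if either direction exists
print(A.nnz, "edges in the symmetrized k-NN graph")
```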
Chapter 7. Cluster Analysis
(outline repeated; next: 6. Density-Based Methods)

Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features:
  - discover clusters of arbitrary shape
  - handle noise
  - one scan
  - need density parameters as a termination condition
- Several interesting studies:
  - DBSCAN: Ester, et al. (KDD'96)
  - OPTICS: Ankerst, et al. (SIGMOD'99)
  - DENCLUE: Hinneburg & D. Keim (KDD'98)
  - CLIQUE: Agrawal, et al. (SIGMOD'98) (more grid-based)

Density-Based Clustering: Basic Concepts
- Two parameters:
  - Eps: maximum radius of the neighborhood
  - MinPts: minimum number of points in an Eps-neighborhood of that point
- N_Eps(p): {q belongs to D | dist(p,q) <= Eps}
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - p belongs to N_Eps(q), and
  - the core point condition holds: |N_Eps(q)| >= MinPts
(figure: Eps = 1 cm, MinPts = 5)

Density-Reachable and Density-Connected
- Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
- Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
(figure)

DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
(figure: core, border and outlier points for MinPts = 5, Eps = 1 cm)

DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed

DBSCAN: Sensitive to Parameters
(figure: clustering results for different Eps/MinPts settings)

CHAMELEON (Clustering Complex Objects)
(figure)
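The following is a brief sketch (assumed code, not from the slides) that runs DBSCAN with the two parameters described above, eps (the neighborhood radius) and min_samples (MinPts), on a hypothetical data set of two dense blobs plus scattered noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) * 0.3,              # a dense blob
               rng.randn(50, 2) * 0.3 + [3, 3],     # a second blob
               rng.uniform(-2, 5, size=(10, 2))])   # scattered noise

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids; the label -1 marks points treated as noise/outliers
```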
OPTICS: A Cluster-Ordering Method (1999)
- OPTICS: Ordering Points To Identify the Clustering Structure (Ankerst, Breunig, Kriegel, and Sander, SIGMOD'99)
- Produces a special order of the database w.r.t. its density-based clustering structure
- This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
- Good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure
- Can be represented graphically or using visualization techniques

OPTICS: Some Extensions from DBSCAN
- Core distance of p w.r.t. MinPts: the smallest distance eps' between p and an object in its eps-neighborhood such that p would be a core object for eps' and MinPts; otherwise, undefined
- Reachability distance of p w.r.t. o: max(core-distance(o), d(o, p)) if o is a core object; undefined otherwise
(figure: with eps = 3 cm and MinPts = 5, r(p1, o) = 1.5 cm and r(p2, o) = 4 cm)

OPTICS
- Algorithm:
  (1) Select a non-processed object o
  (2) Find its neighbors (eps-neighborhood), compute the core distance of o, write o to the ordered file and mark o as processed
  - If o is not a core object, restart at (1)
  - If o is a core object, put the neighbors of o in the SeedList and order it: if a neighbor n is not yet in the SeedList, add (n, reachability from o); otherwise, if the reachability from o is smaller than the current reachability, update it and reorder the SeedList w.r.t. reachability
  - Take the object from the SeedList with the smallest reachability and restart at (2)

Reachability plot
(figure: reachability-distance plotted over the cluster-order of the objects; some reachability values are undefined)

DENCLUE: Using Statistical Density Functions
- DENsity-based CLUstEring by Hinneburg & Keim (KDD'98)
- Uses statistical density functions
- Major features:
  - solid mathematical foundation
  - good for data sets with large amounts of noise
  - allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  - significantly faster than existing algorithms (e.g., DBSCAN)
  - but needs a large number of parameters

Denclue: Technical Essence
- Uses grid cells, but only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure
- Influence function: describes the impact of a data point within its neighborhood, e.g. the Gaussian influence function
  f_Gaussian(x, y) = exp( -d(x, y)^2 / (2 sigma^2) )
- The overall density of the data space can be calculated as the sum of the influence functions of all data points:
  f^D_Gaussian(x) = sum_{i=1..N} exp( -d(x, x_i)^2 / (2 sigma^2) )
- Its gradient is
  grad f^D_Gaussian(x) = sum_{i=1..N} (x_i - x) * exp( -d(x, x_i)^2 / (2 sigma^2) )
- Clusters can be determined mathematically by identifying density attractors; density attractors are local maxima of the overall density function

Density Attractor
(figure: density attractors as local maxima of the overall density function)

Denclue: Technical Essence
- Significant density attractor for threshold k: a density attractor with density larger than or equal to k
- Set of significant density attractors X for threshold k: for each pair of density attractors x1, x2 in X there is a path from x1 to x2 such that each point on the path has density larger than or equal to k
- Center-defined cluster for a significant density attractor x and threshold k: the points that are density-attracted by x
  - points that are attracted to a density attractor with density less than k are called outliers
- Arbitrary-shape cluster for a set of significant density attractors X and threshold k: the points that are density-attracted to some density attractor in X

Center-Defined and Arbitrary Clusters
(figure: center-defined clusters vs. arbitrary-shape clusters)
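To make the density-attractor idea concrete, the toy sketch below (an assumed illustration, not DENCLUE itself) evaluates the Gaussian density function and its gradient from the formulas above and follows the gradient with a crude fixed-step hill-climb toward a local maximum, i.e. a density attractor.

```python
import numpy as np

def density(x, data, sigma=1.0):
    # f^D_Gaussian(x) = sum_i exp(-d(x, x_i)^2 / (2 sigma^2))
    d2 = ((data - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2)).sum()

def gradient(x, data, sigma=1.0):
    # sum_i (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))
    d2 = ((data - x) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return ((data - x) * w[:, None]).sum(axis=0)

rng = np.random.RandomState(1)
data = np.vstack([rng.randn(30, 2) * 0.4, rng.randn(30, 2) * 0.4 + [4, 0]])

x = data[0].copy()
for _ in range(300):                  # crude hill-climb with a small fixed step
    x = x + 0.02 * gradient(x, data)  # ends up near a local density maximum
print(x, round(density(x, data), 3))
```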