Clustering Lecture 1: Basics
Jing Gao, SUNY Buffalo

Class Structure
• Topics
– Clustering, Classification
– Network mining
– Anomaly detection
• Expectation
– Sign-in
– Take quiz in class
– Two more projects on clustering and classification
– One more homework on network mining or anomaly detection
• Website
– http://www.cse.buffalo.edu/~jing/cse601/fa12/

Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering, etc.

Readings
• Tan, Steinbach, Kumar. Chapters 8 and 9.
• Han, Kamber, Pei. Data Mining: Concepts and Techniques. Chapters 10 and 11.
• Additional readings posted on website

Clustering Basics
• Definition and Motivation
• Data Preprocessing and Similarity Computation
• Objective of Clustering
• Clustering Evaluation

Clustering
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized

Application Examples
• A stand-alone tool: explore data distribution
• A preprocessing step for other algorithms
• Pattern recognition, spatial data analysis, image processing, market research, WWW, …
– Cluster documents
– Cluster web log data to discover groups of similar access patterns

Clustering Co-expressed Genes
• Pipeline: gene expression data matrix → gene expression patterns → co-expressed genes
• Why look for co-expressed genes?
– Co-expression indicates co-function
– Co-expression also indicates co-regulation

Gene-based Clustering
• Examples of co-expressed genes and coherent patterns in gene expression data (Iyer's data [2])
[Figure: three groups of gene expression profiles, expression value vs. time points, each group showing a coherent pattern]
[2] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.

Other Applications
• Land use: identification of areas of similar land use in an earth observation database
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Climate: understanding earth climate; finding patterns in atmospheric and ocean data

Two Important Aspects
• Properties of input data
– Define the similarity or dissimilarity between points
• Requirement of clustering
– Define the objective and methodology

Clustering Basics
• Definition and Motivation
• Data Preprocessing and Distance Computation
• Objective of Clustering
• Clustering Evaluation

Data Representation
• Data: a collection of data objects and their attributes
• An attribute is a property or characteristic of an object
– Examples: eye color of a person, temperature, etc.
– Attribute is also known as dimension, variable, field, characteristic, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
• Example (rows are objects, columns are attributes):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
• Represents n objects with p variables
– An n-by-p matrix; x_{if} is the value of the i-th object on the f-th attribute (rows are objects, columns are attributes)

X = [ x_{11} ... x_{1f} ... x_{1p} ]
    [ ...                          ]
    [ x_{i1} ... x_{if} ... x_{ip} ]
    [ ...                          ]
    [ x_{n1} ... x_{nf} ... x_{np} ]

Gene Expression Data
• Clustering genes
– Genes are objects
– Experimental conditions (samples) are attributes
– Goal: find genes with similar function

          sample 1  sample 2  sample 3  sample 4  ...
gene 1      0.13      0.72      0.1       0.57
gene 2      0.34      1.58      1.05      1.15
gene 3      0.43      1.1       0.97      1
gene 4      1.22      0.97      1         0.85
gene 5     -0.89      1.21      1.29      1.08
gene 6      1.1       1.45      1.44      1.12
gene 7      0.83      1.15      1.1       1
gene 8      0.87      1.32      1.35      1.13
gene 9     -0.33      1.01      1.38      1.21
gene 10     0.10      0.85      1.03      1
gene ...

Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are
– Higher when objects are more alike
– Often falls in the range [0, 1]
• Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies

Distance Matrix
• Represents pairwise distances between n objects
– An n-by-n matrix
– d(i, j): distance or dissimilarity between objects i and j
– Nonnegative
– Close to 0: similar

[ 0                              ]
[ d(2,1)  0                      ]
[ d(3,1)  d(3,2)  0              ]
[ ...                            ]
[ d(n,1)  d(n,2)  ...         0  ]

Data Matrix -> Distance Matrix
• Computing d(i, j) for every pair of objects in the original data matrix (e.g., the gene expression matrix above) yields the distance matrix:

      g1   g2      g3      g4      ...
g1    0    d(1,2)  d(1,3)  d(1,4)
g2         0       d(2,3)  d(2,4)
g3                 0       d(3,4)
g4                         0
...

Types of Attributes
• Discrete
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of documents
– Note: binary attributes are a special case of discrete attributes
• Ordinal
– Has only a finite or countably infinite set of values
– Order of values is important
– Examples: rankings (e.g., pain level 1-10), grades (A, B, C, D)
• Continuous
– Has real numbers as attribute values
– Examples: temperature, height, or weight
– Continuous attributes are typically represented as floating-point variables

Similarity/Dissimilarity for Simple Attributes
• p and q are the attribute values for two data objects; typical choices of dissimilarity d and similarity s for each attribute type:

Attribute type   Dissimilarity                          Similarity
Discrete         d = 0 if p = q, d = 1 if p != q        s = 1 if p = q, s = 0 if p != q
Ordinal          d = |p - q| / (n - 1), with values     s = 1 - d
                 mapped to integers 0 to n - 1
Continuous       d = |p - q|                            s = -d, or s = 1 / (1 + d)

Minkowski Distance: Continuous Attributes
• Minkowski distance: a generalization

d(i, j) = ( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + ... + |x_{ip} - x_{jp}|^q )^{1/q},  q > 0

• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• Example: for x_i = (1, 7) and x_j = (7, 1), the Manhattan distance (q = 1) is 6 + 6 = 12 and the Euclidean distance (q = 2) is sqrt(6^2 + 6^2) ≈ 8.48

Standardization
• Calculate the mean and the mean absolute deviation of attribute f:

m_f = (1/n)(x_{1f} + x_{2f} + ... + x_{nf})
s_f = (1/n)(|x_{1f} - m_f| + |x_{2f} - m_f| + ... + |x_{nf} - m_f|)

• Calculate the standardized measurement (z-score):

z_{if} = (x_{if} - m_f) / s_f

Mahalanobis Distance

d(p, q) = (p - q) Σ^{-1} (p - q)^T

where Σ is the covariance matrix of the input data X:

Σ_{j,k} = (1 / (n - 1)) Σ_{i=1}^{n} (X_{ij} - X̄_j)(X_{ik} - X̄_k)

• Belongs to the family of Bregman divergences
• Example: for the two red points in the figure, the Euclidean distance is 14.7 while the Mahalanobis distance is 6
[Figure: elongated, correlated point cloud illustrating why the Mahalanobis distance differs from the Euclidean distance]

Mahalanobis Distance: Example
• Covariance matrix:

Σ = [ 0.3  0.2 ]
    [ 0.2  0.3 ]

• A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5)
• Mahal(A, B) = 5
• Mahal(A, C) = 4
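The distance computations above are easy to reproduce. Below is a minimal NumPy sketch (not part of the original slides) of the Minkowski family, the mean-absolute-deviation z-score, and the Mahalanobis distance, checked against the small examples on the slides; the Mahalanobis function follows the slide's definition (the squared quadratic form, without a square root).

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance; q = 1 gives Manhattan, q = 2 gives Euclidean."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def standardize(X):
    """z-scores z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation."""
    m = X.mean(axis=0)
    s = np.mean(np.abs(X - m), axis=0)   # mean absolute deviation per attribute
    return (X - m) / s

def mahalanobis(p, q, cov):
    """Mahalanobis distance as defined on the slide: (p - q) Sigma^-1 (p - q)^T.
    (Squared form; some texts take the square root of this quantity.)"""
    diff = np.asarray(p) - np.asarray(q)
    return float(diff @ np.linalg.inv(cov) @ diff)

# Minkowski example from the slides: x_i = (1, 7), x_j = (7, 1)
xi, xj = np.array([1.0, 7.0]), np.array([7.0, 1.0])
print(minkowski(xi, xj, 1))   # Manhattan: 12.0
print(minkowski(xi, xj, 2))   # Euclidean: 8.485...

# Mahalanobis example from the slides: Sigma = [[0.3, 0.2], [0.2, 0.3]]
cov = np.array([[0.3, 0.2], [0.2, 0.3]])
print(mahalanobis([0.5, 0.5], [0.0, 1.0], cov))   # A vs B: 5.0
print(mahalanobis([0.5, 0.5], [1.5, 1.5], cov))   # A vs C: 4.0
```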
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q (positive definiteness)
2. d(p, q) = d(q, p) for all p and q (symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r (triangle inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q
• A distance that satisfies these properties is a metric

Similarity for Binary Attributes
• A common situation is that objects p and q have only binary attributes
• Compute similarities using the following quantities:
– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard coefficients:

SMC = number of matches / total number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of matches / number of not-both-zero attribute values = M11 / (M01 + M10 + M11)

SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (attributes where p was 0 and q was 1)
M10 = 1 (attributes where p was 1 and q was 0)
M00 = 7 (attributes where p was 0 and q was 0)
M11 = 0 (attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0

Document Data
• Each document becomes a "term" vector
– each term is a component (attribute) of the vector
– the value of each component is the number of times the corresponding term occurs in the document

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6     0     2      0       2
Document 2     0     7      0     2     1      0     0     3      0       0
Document 3     0     1      0     0     1      2     2     0      3       0

Cosine Similarity
• If d1 and d2 are two document vectors, then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d
• Example:

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 ≈ 2.449
cos(d1, d2) ≈ 0.3150

Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q and then take their dot product (continuous attributes):

p'_k = (p_k - mean(p)) / std(p)
q'_k = (q_k - mean(q)) / std(q)
s(p, q) = p' · q'

Common Properties of a Similarity
• Similarities also have some well-known properties:
1. s(p, q) = 1 (or maximum similarity) only if p = q
2. s(p, q) = s(q, p) for all p and q (symmetry)
where s(p, q) is the similarity between points (data objects) p and q
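As with the distances, these similarity measures take only a few lines. The following sketch (assuming NumPy; not part of the original slides) reproduces the SMC/Jaccard and cosine examples above and adds a correlation helper.

```python
import numpy as np

def smc_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for binary vectors."""
    p, q = np.asarray(p), np.asarray(q)
    m11 = np.sum((p == 1) & (q == 1))
    m00 = np.sum((p == 0) & (q == 0))
    m10 = np.sum((p == 1) & (q == 0))
    m01 = np.sum((p == 0) & (q == 1))
    smc = (m11 + m00) / (m11 + m00 + m10 + m01)
    jac = m11 / (m11 + m10 + m01) if (m11 + m10 + m01) > 0 else 0.0
    return smc, jac

def cosine(d1, d2):
    """Cosine similarity between two document vectors."""
    return float(d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

def correlation(p, q):
    """Pearson correlation: standardize each object, then take the scaled dot
    product (equivalently np.corrcoef(p, q)[0, 1])."""
    ps = (p - p.mean()) / p.std()
    qs = (q - q.mean()) / q.std()
    return float(ps @ qs) / len(p)

# binary example from the slides
p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc_jaccard(p, q))          # SMC = 0.7, Jaccard = 0.0

# document example from the slides
d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)
print(round(cosine(d1, d2), 4))   # 0.315
print(round(correlation(d1, d2), 4))   # ~0.018, nearly uncorrelated
```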
Characteristics of the Input Data Are Important
• Sparseness
• Attribute type
• Type of data
• Dimensionality
• Noise and outliers
• Type of distribution
=> Conduct preprocessing and select the appropriate dissimilarity or similarity measure
=> Determine the objective of clustering and choose the appropriate method

Clustering Basics
• Definition and Motivation
• Data Preprocessing and Distance Computation
• Objective of Clustering
• Clustering Evaluation

Considerations for Cluster Analysis
• Partitioning criteria
– Single-level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. overlapping (e.g., one document may belong to more than one topic)
• Hard versus fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
– Weights must sum to 1
– Probabilistic clustering has similar characteristics
• Similarity measure and data types
• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities

Requirements of Clustering
• Scalability
• Ability to deal with different types of attributes
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Discovery of clusters with arbitrary shape
• Insensitivity to the order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
• What clustering results do we want to get?

Notion of a Cluster Can Be Ambiguous
• How many clusters? The same set of points can reasonably be grouped into two, four, or six clusters
[Figure: the same points partitioned into two, four, and six clusters]

Partitional Clustering
[Figure: input data (left) and a partitional clustering of it (right)]

Hierarchical Clustering
[Figure: two hierarchical clustering solutions (dendrograms and nested groupings) over points p1, p2, p3, p4]

Types of Clusters: Center-Based
• Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster
– The center of a cluster is often a centroid (the average of all the points in the cluster) or a medoid (the most "representative" point of a cluster)
[Figure: 4 center-based clusters]

Types of Clusters: Density-Based
• Density-based
– A cluster is a dense region of points, which is separated from other regions of high density by low-density regions
– Used when the clusters are irregular or intertwined, and when noise and outliers are present
[Figure: 6 density-based clusters]

Clustering Basics
• Definition and Motivation
• Data Preprocessing and Distance Computation
• Objective of Clustering
• Clustering Evaluation
Cluster Validation
• Cluster validation
– Quality: "goodness" of clusters
– Assess the quality and reliability of clustering results
• Why validation?
– To avoid finding clusters formed by chance
– To compare clustering algorithms
– To choose clustering parameters (e.g., the number of clusters)

Aspects of Cluster Validation
• Comparing the clustering results to ground truth (externally known results)
– External index
• Evaluating the quality of clusters without reference to external information
– Use only the data
– Internal index
• Determining the reliability of clusters
– To what confidence level are the clusters not formed by chance?
– Statistical framework

Comparing to Ground Truth
• Notation
– N: number of objects in the data set
– P = {P1, ..., Ps}: the set of "ground truth" clusters
– C = {C1, ..., Ct}: the set of clusters reported by a clustering algorithm
• The "incidence matrix"
– N x N (both rows and columns correspond to objects)
– P_{ij} = 1 if O_i and O_j belong to the same "ground truth" cluster in P; P_{ij} = 0 otherwise
– C_{ij} = 1 if O_i and O_j belong to the same cluster in C; C_{ij} = 0 otherwise

Rand Index and Jaccard Coefficient
• A pair of data objects (O_i, O_j) falls into one of the following categories:
– SS: C_{ij} = 1 and P_{ij} = 1 (agree)
– DD: C_{ij} = 0 and P_{ij} = 0 (agree)
– SD: C_{ij} = 1 and P_{ij} = 0 (disagree)
– DS: C_{ij} = 0 and P_{ij} = 1 (disagree)
• Rand index

Rand = |Agree| / (|Agree| + |Disagree|) = (|SS| + |DD|) / (|SS| + |SD| + |DS| + |DD|)

– may be dominated by DD
• Jaccard coefficient

Jaccard = |SS| / (|SS| + |SD| + |DS|)

Example
• Clustering: {g1, g2, g3} and {g4, g5}; ground truth: {g1, g2} and {g3, g4, g5}
• Incidence matrices:

Clustering C              Ground truth P
     g1 g2 g3 g4 g5            g1 g2 g3 g4 g5
g1    1  1  1  0  0       g1    1  1  0  0  0
g2    1  1  1  0  0       g2    1  1  0  0  0
g3    1  1  1  0  0       g3    0  0  1  1  1
g4    0  0  0  1  1       g4    0  0  1  1  1
g5    0  0  0  1  1       g5    0  0  1  1  1

• Counting the entries of the two incidence matrices:

                              Ground truth
                              Same cluster   Different cluster
Clustering  Same cluster            9               4
            Different cluster       4               8

Rand = (|SS| + |DD|) / (|SS| + |SD| + |DS| + |DD|) = 17/25
Jaccard = |SS| / (|SS| + |SD| + |DS|) = 9/17

Entropy and Purity
• Notation
– |Ck ∩ Pj|: the number of objects in both the k-th cluster of the clustering solution and the j-th cluster of the ground truth
– |Ck|: the number of objects in the k-th cluster of the clustering solution
– |Pj|: the number of objects in the j-th cluster of the ground truth
• Purity

Purity = (1/N) Σ_k max_j |Ck ∩ Pj|

• Normalized Mutual Information

NMI = I(C, P) / sqrt( H(C) H(P) )

I(C, P) = Σ_k Σ_j (|Ck ∩ Pj| / N) log( N |Ck ∩ Pj| / (|Ck| |Pj|) )
H(C) = - Σ_k (|Ck| / N) log(|Ck| / N)
H(P) = - Σ_j (|Pj| / N) log(|Pj| / N)

Example

         P1    P2    P3    P4    P5    P6   Total
C1        3     5    40   506    96    27     677
C2        4     7   280    29    39     2     361
C3        1     1     1     7     4   671     685
C4       10   162     3   119    73     2     369
C5      331    22     5    70    13    23     464
C6        5   358    12   212    48    13     648
Total   354   555   341   943   273   738    3204

Purity = (506 + 280 + 671 + 162 + 331 + 358) / 3204 = 0.7203
NMI is computed from the same contingency table using the formulas above.
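A small sketch of these external indices, assuming NumPy and plain label vectors (not part of the original slides). The Rand/Jaccard counts are taken over the full N x N incidence matrices, diagonal included, to match the 17/25 and 9/17 worked example above; the NMI helper uses the common I(C, P) / sqrt(H(C) H(P)) normalization.

```python
import numpy as np

def incidence(labels):
    """N x N incidence matrix: entry (i, j) = 1 iff objects i and j share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(int)

def rand_and_jaccard(labels_c, labels_p):
    """Rand index and Jaccard coefficient from the two incidence matrices
    (counts include the diagonal, as in the slides' worked example)."""
    Cm, Pm = incidence(labels_c), incidence(labels_p)
    ss = int(np.sum((Cm == 1) & (Pm == 1)))
    dd = int(np.sum((Cm == 0) & (Pm == 0)))
    sd = int(np.sum((Cm == 1) & (Pm == 0)))
    ds = int(np.sum((Cm == 0) & (Pm == 1)))
    return (ss + dd) / (ss + sd + ds + dd), ss / (ss + sd + ds)

def purity(labels_c, labels_p):
    """Purity = (1/N) * sum over clusters of the size of the dominant truth class."""
    labels_c, labels_p = np.asarray(labels_c), np.asarray(labels_p)
    n = len(labels_c)
    return sum(
        max(np.sum((labels_c == c) & (labels_p == p)) for p in np.unique(labels_p))
        for c in np.unique(labels_c)
    ) / n

def nmi(labels_c, labels_p):
    """NMI = I(C, P) / sqrt(H(C) * H(P)), using natural logs."""
    labels_c, labels_p = np.asarray(labels_c), np.asarray(labels_p)
    n = len(labels_c)
    I = hc = hp = 0.0
    for c in np.unique(labels_c):
        pc = np.sum(labels_c == c) / n
        hc -= pc * np.log(pc)
        for p in np.unique(labels_p):
            pp = np.sum(labels_p == p) / n
            pcp = np.sum((labels_c == c) & (labels_p == p)) / n
            if pcp > 0:
                I += pcp * np.log(pcp / (pc * pp))
    for p in np.unique(labels_p):
        pp = np.sum(labels_p == p) / n
        hp -= pp * np.log(pp)
    return I / np.sqrt(hc * hp)

# five-object example from the slides: C = {g1,g2,g3},{g4,g5}; P = {g1,g2},{g3,g4,g5}
C = [1, 1, 1, 2, 2]
P = [1, 1, 2, 2, 2]
print(rand_and_jaccard(C, P))   # (0.68, 0.529...) = 17/25 and 9/17
print(purity(C, P))             # 0.8
```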
Internal Index
• "Ground truth" may be unavailable
• Use only the data to measure cluster quality
– Measure the "cohesion" and "separation" of clusters
– Calculate the correlation between clustering results and the distance matrix

Cohesion and Separation
• Cohesion is measured by the within-cluster sum of squares

WSS = Σ_i Σ_{x ∈ C_i} (x - m_i)^2

• Separation is measured by the between-cluster sum of squares

BSS = Σ_i |C_i| (m - m_i)^2

where |C_i| is the size of cluster i, m_i is the centroid of cluster i, and m is the centroid of the whole data set
• BSS + WSS = constant
• The WSS (cohesion) measure is called the Sum of Squared Error (SSE), a commonly used measure
• A larger number of clusters tends to result in smaller SSE

Example
• Data: four points 1, 2, 4, 5 on a line; overall centroid m = 3
• K = 1 (one cluster {1, 2, 4, 5} with centroid 3):
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4*(3-3)^2 = 0
Total = 10 + 0 = 10
• K = 2 (clusters {1, 2} and {4, 5} with centroids 1.5 and 4.5):
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2*(3-1.5)^2 + 2*(4.5-3)^2 = 9
Total = 1 + 9 = 10
• K = 4 (each point its own cluster):
WSS = (1-1)^2 + (2-2)^2 + (4-4)^2 + (5-5)^2 = 0
BSS = 1*(1-3)^2 + 1*(2-3)^2 + 1*(4-3)^2 + 1*(5-3)^2 = 10
Total = 0 + 10 = 10

Silhouette Coefficient
• The silhouette coefficient combines ideas of both cohesion and separation
• For an individual point i:
– Calculate a = average distance of i to the points in its own cluster
– Calculate b = min (average distance of i to the points in another cluster)
– The silhouette coefficient for the point is then s = 1 - a/b if a < b (s = b/a - 1 if a ≥ b, not the usual case)
– Typically between 0 and 1; the closer to 1 the better
• Can calculate the average silhouette width for a cluster or a whole clustering

Correlation with Distance Matrix
• Distance matrix
– D_{ij} is the distance (dissimilarity) between objects O_i and O_j
• Incidence matrix
– C_{ij} = 1 if O_i and O_j belong to the same cluster; C_{ij} = 0 otherwise
• Compute the correlation between the two matrices
– Only n(n-1)/2 entries need to be calculated
• A strong correlation between the incidence and distance matrices indicates good clustering (with distances, the correlation is negative: points in the same cluster should be close together)

Correlation with Distance Matrix
• Given the distance matrix D = {d_{11}, d_{12}, ..., d_{nn}} and the incidence matrix C = {c_{11}, c_{12}, ..., c_{nn}}, the correlation r between D and C is

r = Σ_{i,j} (d_{ij} - d̄)(c_{ij} - c̄) / sqrt( Σ_{i,j} (d_{ij} - d̄)^2 · Σ_{i,j} (c_{ij} - c̄)^2 )

Are There Clusters in the Data?
• Clustering algorithms will return clusters even when the data are purely random
[Figure: 100 random points, and the "clusters" found in them by DBSCAN, K-means, and complete link]

Measuring Cluster Validity via Correlation
• Correlation of incidence and distance matrices for the K-means clusterings of two data sets:
– Well-separated clusters: Corr = -0.9235
– Random points: Corr = -0.5810
[Figure: the two data sets]

Using Similarity Matrix for Cluster Validation
• Order the similarity matrix with respect to cluster labels and inspect visually
[Figure: well-separated clusters and the corresponding similarity matrix, which shows a clear block structure]

Using Similarity Matrix for Cluster Validation
• Clusters in random data are not so crisp
[Figure: random points and the corresponding similarity matrix, with no clear block structure]

Reliability of Clusters
• Need a framework to interpret any measure
– For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity
– The more "atypical" a clustering result is, the more likely it represents valid structure in the data

Statistical Framework for SSE
• Example
– Compare an SSE of 0.005, obtained for three clusters in the data set of interest, against the SSE of three clusters in random data
– Histogram of the SSE over 500 sets of 100 random data points: the lowest SSE observed is 0.0173, so an SSE of 0.005 is very unlikely to arise by chance
[Figure: the clustered data set and the histogram of SSE values for random data]

Determine the Number of Clusters Using SSE
• Plot SSE against the number of clusters K; a "knee" in the SSE curve suggests the natural number of clusters
[Figure: clustering of the input data and the SSE curve as a function of K, for K from 2 to 30]

Take-away Message
• What is clustering?
• Why is clustering important?
• How to preprocess data and compute dissimilarity/similarity from data?
• What is a good clustering solution?
• How to evaluate the clustering results?
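The internal measures from the last part of the lecture (WSS/BSS and the silhouette coefficient) can be sketched in the same way. The code below assumes NumPy, reuses the four-point example from the slides, and is illustrative rather than part of the original deck.

```python
import numpy as np

def wss_bss(X, labels):
    """Within- and between-cluster sums of squares (cohesion / separation)."""
    labels = np.asarray(labels)
    m = X.mean(axis=0)                        # centroid of the whole data set
    wss = bss = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        mc = pts.mean(axis=0)                 # centroid of cluster c
        wss += np.sum((pts - mc) ** 2)
        bss += len(pts) * np.sum((m - mc) ** 2)
    return wss, bss

def silhouette(X, labels, i):
    """Silhouette coefficient of point i: s = 1 - a/b when a < b, else b/a - 1."""
    labels = np.asarray(labels)
    dists = np.linalg.norm(X - X[i], axis=1)
    a = dists[(labels == labels[i]) & (np.arange(len(X)) != i)].mean()
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return 1 - a / b if a < b else b / a - 1

# one-dimensional example from the slides: points 1, 2, 4, 5 (WSS + BSS is always 10)
X = np.array([[1.0], [2.0], [4.0], [5.0]])
for labels in ([0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]):
    print(labels, wss_bss(X, labels))   # (10, 0), (1, 9), (0, 10)

print(silhouette(X, [0, 0, 1, 1], 0))   # ~0.714 for the point at 1
```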