Unsupervised Learning & Cluster Analysis: Basic Concepts and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to Data Mining, by Tan, Steinbach, and Kumar.

What is unsupervised learning & cluster analysis?
- "Learning without a priori knowledge about the classification of samples; learning without a teacher." Kohonen (1995), "Self-Organizing Maps"
- "Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual." B. S. Everitt (1998), "The Cambridge Dictionary of Statistics"

What do we cluster?
- Features/variables
- Samples/instances

Applications of Cluster Analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
- Data exploration: get insight into the data distribution and understand patterns in the data.
- Summarization: reduce the size of large data sets, often as a preprocessing step.

Objectives of Cluster Analysis
- Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Competing objectives: intra-cluster distances are minimized while inter-cluster distances are maximized.

Notion of a Cluster Can Be Ambiguous
- How many clusters? The same points can plausibly be grouped into two, four, or six clusters.
- The answer depends on the "resolution" at which you look at the data.

Prerequisites
- Understand the nature of your problem, the type of features, etc.
- The metric that you choose for similarity (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.

Similarity/Distance Measures
- Euclidean distance: $d(x, y) = \sqrt{\sum_{n=1}^{N} (x_n - y_n)^2}$
  Highly dependent on the scale of the features; may require normalization.
- City block (Manhattan) distance: $d(x, y) = \sum_{n=1}^{N} |x_n - y_n|$
- [Figure: pairs of expression profiles with d_euc = 0.5846, 2.6115, and 1.1345. These examples of Euclidean distance match our intuition of dissimilarity pretty well. Two further pairs with d_euc = 1.41 and 1.22: but what about these? What might be going on with the expression profiles on the left? On the right?]
- Cosine similarity: $C_{cosine}(x, y) = \frac{\sum_{i=1}^{N} x_i y_i}{\|x\| \, \|y\|}$
- Pearson correlation: $C_{pearson}(x, y) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_{i=1}^{N} (x_i - m_x)^2 \sum_{i=1}^{N} (y_i - m_y)^2}}$
  Both are invariant to scaling (Pearson is also invariant to addition of a constant). Use the Spearman correlation when working with ranks.
- Jaccard similarity, when you are interested in the size of the intersection: $JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
(A short Python sketch of these measures appears below, after the overview of clustering types.)

Types of Clusterings
- An important distinction is between hierarchical and partitional sets of clusters.
- Partitional clustering: a division of the data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
- [Figure: original points and a partitional clustering of them]
- [Figure: a hierarchical clustering of points p1-p4, shown as two alternative dendrograms]

Other Distinctions Between Sets of Clustering Methods
- Exclusive versus non-exclusive: in non-exclusive clusterings, points may belong to multiple clusters, which can represent multiple classes or "border" points.
- Fuzzy versus non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1.
- Partial versus complete: in some cases, we only want to cluster some of the data.
- Heterogeneous versus homogeneous: clusters of widely different sizes, shapes, and densities.
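To make the distance and similarity measures above concrete, here is a minimal Python sketch using NumPy and SciPy. It is an illustration only: the toy vectors x, y and the gene-name sets are made up, not taken from the slides.

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, spearmanr

# Two toy expression profiles (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

print("Euclidean:", distance.euclidean(x, y))           # sqrt(sum (x_n - y_n)^2)
print("City block:", distance.cityblock(x, y))          # sum |x_n - y_n|
print("Cosine similarity:", 1 - distance.cosine(x, y))  # SciPy returns cosine *distance*

r, _ = pearsonr(x, y)       # invariant to scaling and to adding a constant
rho, _ = spearmanr(x, y)    # correlation of ranks
print("Pearson r:", r)
print("Spearman rho:", rho)

# Jaccard similarity for two sets (e.g., two hypothetical gene lists)
A, B = {"g1", "g2", "g3"}, {"g2", "g3", "g4", "g5"}
print("Jaccard:", len(A & B) / len(A | B))
```

Note that SciPy's `distance.cosine` is a dissimilarity (1 minus the cosine similarity), which is why the sketch subtracts it from 1.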
Clustering Algorithms
- Hierarchical clustering
- K-means
- Bi-clustering

Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree.
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.
- [Figure: six points (1-6), their nested clusters, and the corresponding dendrogram]

Strengths of Hierarchical Clustering
- You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level.
- The clusters may correspond to meaningful taxonomies, for example in the biological sciences (animal kingdom, phylogeny reconstruction, ...).

Hierarchical Clustering: Two Main Types
- Agglomerative (bottom up): start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
- Divisive (top down): start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
- Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4.   Merge the two closest clusters
5.   Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters. Different approaches to defining the distance between clusters distinguish the different algorithms.

Starting Situation
- Start with clusters of individual points (p1, p2, p3, ...) and a proximity matrix. [Figure: points p1-p12 and the initial proximity matrix]

Intermediate Situation
- After some merging steps, we have some clusters C1-C5. [Figure: clusters C1-C5 and the current proximity matrix]
- Suppose we want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

After Merging
- The question is: how do we update the proximity matrix? The entries between the merged cluster C2 U C5 and the remaining clusters are unknown until we define an inter-cluster proximity. [Figure: updated proximity matrix with "?" entries for C2 U C5]

How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group average
- Distance between centroids
- Other methods driven by an objective function, e.g., Ward's method uses the squared error (not discussed further here).
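A minimal sketch of agglomerative clustering in Python using SciPy. The random 2-D data and the choice of average linkage are illustrative assumptions, not part of the slides; the `method` argument maps onto the inter-cluster similarity choices listed above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))             # 20 toy points in 2 dimensions

# Agglomerative clustering: SciPy computes the proximities and merges step by step.
# 'single', 'complete', 'average', and 'ward' correspond to MIN, MAX,
# group average, and Ward's method from the slides.
Z = linkage(X, method='average', metric='euclidean')

dendrogram(Z)                            # visualize the sequence of merges
plt.show()

labels = fcluster(Z, t=3, criterion='maxclust')   # "cut" the dendrogram into 3 clusters
print(labels)
```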
Cluster Similarity: MIN or Single Link
- The similarity of two clusters is based on the two most similar (closest) points in the different clusters.
- It is therefore determined by one pair of points, i.e., by one link in the proximity graph.
- Example similarity matrix for five items I1-I5:

        I1    I2    I3    I4    I5
  I1   1.00  0.90  0.10  0.65  0.20
  I2   0.90  1.00  0.70  0.60  0.50
  I3   0.10  0.70  1.00  0.40  0.30
  I4   0.65  0.60  0.40  1.00  0.80
  I5   0.20  0.50  0.30  0.80  1.00

Hierarchical Clustering: MIN
- [Figure: nested clusters and dendrogram produced by single-link (MIN) clustering of the six-point example]
- Strength: can handle non-elliptical shapes. [Figure: original points and the two clusters recovered]
- Limitation: sensitive to noise and outliers. [Figure: original points and the two clusters recovered]

Cluster Similarity: MAX or Complete Linkage
- The similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
- It is therefore determined by all pairs of points in the two clusters (same similarity matrix as above).

Hierarchical Clustering: MAX
- [Figure: nested clusters and dendrogram produced by complete-link (MAX) clustering of the six-point example]
- Strength: less susceptible to noise and outliers. [Figure]
- Limitations: tends to break large clusters and is biased towards globular clusters. [Figure]

Cluster Similarity: Group Average
- The proximity of two clusters is the average pairwise proximity between points in the two clusters:
  $proximity(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i} \sum_{p_j \in Cluster_j} proximity(p_i, p_j)}{|Cluster_i| \cdot |Cluster_j|}$
- We need to use the average connectivity for scalability, since the total proximity favors large clusters (same similarity matrix as above).

Hierarchical Clustering: Group Average
- [Figure: nested clusters and dendrogram produced by group-average clustering of the six-point example]
- A compromise between single and complete link.
- Strength: less susceptible to noise and outliers.
- Limitation: biased towards globular clusters.

Hierarchical Clustering: Comparison
- [Figure: side-by-side comparison of the nested clusters produced by MIN, MAX, and group average on the six-point example]

Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone.
- Different schemes have problems with one or more of the following: sensitivity to noise and outliers; difficulty handling clusters of different sizes and convex shapes; breaking large clusters (divisive).
- The dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree goes on the left and which on the right.
- Hierarchical methods impose structure on the data instead of revealing the structure in the data.
- How many clusters? (Some suggestions later.)

K-means Clustering
- A partitional clustering approach.
- Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple (see the minimal sketch below).

K-means Clustering: Details
- Initial centroids are often chosen randomly, and the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- "Closeness" is measured most often by Euclidean distance (the typical choice), but cosine similarity, correlation, etc. can also be used.
- K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to "until relatively few points change clusters".
- Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.
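The basic K-means loop described above is short enough to write directly. Below is a minimal NumPy sketch of Lloyd's algorithm (my own illustration, not code from the slides); it uses random initial centroids and stops when no point changes cluster.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means on an (n, d) data matrix X. Returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initial centroids
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                # no point changed cluster: converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two toy blobs
labels, centroids = kmeans(X, k=2)
print(centroids)
```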
Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE): for each point, the error is its distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them:
  $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
  where x is a data point in cluster C_i and m_i is the representative point for cluster C_i. One can show that m_i corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce the SSE is to increase K, the number of clusters; however, a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K.

Issues and Limitations for K-means
- How to choose the initial centers?
- How to choose K?
- How to handle outliers?
- Clusters that differ in shape, density, and size.
- K-means assumes clusters are spherical in vector space and is sensitive to coordinate changes.

Two Different K-means Clusterings
- [Figure: the same original points clustered two ways with K = 3, one optimal clustering and one sub-optimal clustering]

Importance of Choosing Initial Centroids
- [Figure: iterations 1-6 of K-means from a good set of initial centroids, converging to the desired clustering]
- [Figure: iterations 1-5 of K-means from a poor set of initial centroids, converging to a sub-optimal clustering]

Solutions to the Initial Centroids Problem
- Multiple runs.
- Sample the data and use hierarchical clustering to determine the initial centroids.
- Select more than k initial centroids and then select among these initial centroids the most widely separated ones.
- Bisecting K-means: not as susceptible to initialization issues.

Bisecting K-means
- A variant of K-means that can produce a partitional or a hierarchical clustering.
- [Figure: bisecting K-means example]

Issues and Limitations for K-means (revisited)
- How to choose the initial centers? See the solutions above.
- How to choose K? Depends on the problem; some suggestions later.
- How to handle outliers? Preprocessing.
- Clusters that differ in shape, density, and size remain difficult, as shown next.

Limitations of K-means
- Differing sizes: [Figure: original points versus the K-means result with 3 clusters]
- Differing density: [Figure: original points versus the K-means result with 3 clusters]
- Non-globular shapes: [Figure: original points versus the K-means result with 2 clusters]

Overcoming K-means Limitations
- One solution is to use many clusters: K-means then finds parts of the natural clusters, but these parts need to be put together afterwards.
- [Figures: original points and the K-means clusters obtained with many clusters, for the differing-size, differing-density, and non-globular examples]
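One practical way to address both "how to choose K" and the initialization problem is to run K-means several times for each K and track the SSE (exposed as `inertia_` in scikit-learn). The sketch below is an illustration of that idea, not part of the original slides; the blob data is synthetic.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # synthetic data

for k in range(2, 8):
    # n_init=10 -> 10 random initializations; the run with the lowest SSE is kept
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}  SSE={km.inertia_:.1f}")

# Increasing K always lowers the SSE; look for an "elbow" where the
# improvement flattens out rather than simply taking the largest K.
```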
K-means Pros and Cons
- Pros: simple; fast for low-dimensional data; can find pure sub-clusters if a large number of clusters is specified.
- Cons: K-means cannot handle non-globular data or clusters of different sizes and densities; it will not identify outliers; it is restricted to data for which there is a notion of a center (centroid).

Biclustering/Co-clustering
- Two genes can have similar expression patterns only under some conditions; similarly, in two related conditions, some genes may exhibit different expression patterns. (The data is an N-genes by M-conditions matrix.)
- As a result, each cluster may involve only a subset of the genes and a subset of the conditions, which together form a "checkerboard" structure.
- In general this is a hard task (NP-hard). Heuristic algorithms, described briefly below:
  - Cheng & Church: deletion of rows and columns; biclusters are discovered one at a time.
  - Order-Preserving SubMatrices (OPSM), Ben-Dor et al.
  - Coupled Two-Way Clustering (CTWC), Getz et al.
  - Spectral co-clustering.

Cheng and Church
- Objective function for the heuristic (to be minimized), defined on the submatrix with row set I and column set J:
  $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$
  where a_iJ is the mean of row i, a_Ij is the mean of column j, and a_IJ is the mean of the whole submatrix.
- Greedy method:
  - Initialization: the bicluster contains all rows and columns.
  - Iteration: (1) compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse; (2) remove the row or column that gives the maximum decrease of H.
  - Termination: when no action will decrease H, or H drops below a preset threshold.
  - Mask this bicluster and continue.
- Problem: removing "trivial" biclusters.

Ben-Dor et al. (OPSM)
- Model: for a condition set T and a gene g, the conditions in T can be ordered so that the expression values are sorted in ascending order (assuming the values are all unique). A submatrix A is a bicluster if there is an ordering (permutation) of T such that the expression values of all genes in G are sorted in ascending order.
- Idea of the algorithm: grow partial models until they become complete models.
- Example (three genes, five conditions):

        t1  t2  t3  t4  t5
  g1     7  13  19   2  50
  g2    19  23  39   6  42
  g3     4   6   8   2  10

  Induced permutation: 2 3 4 1 5 (every gene is lowest at t4, then increases through t1, t2, t3, t5).

Getz et al. (CTWC)
- Idea: repeatedly perform one-way clustering on genes/conditions. Stable clusters of genes are used as the attributes for condition clustering, and vice versa.

Spectral Co-clustering
- Main idea: normalize the two dimensions, form a matrix of size m+n (using SVD), and use k-means to cluster both types of data.
- http://adios.tau.ac.il/SpectralCoClustering/
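scikit-learn ships a spectral co-clustering implementation that follows the same general recipe (normalize the matrix, take an SVD, run k-means on the combined row/column embedding). The sketch below is only a stand-in for the tool referenced above; the synthetic data from `make_biclusters` and the choice of 4 clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Synthetic gene-by-condition style matrix with 4 hidden biclusters
data, rows, cols = make_biclusters(shape=(120, 40), n_clusters=4,
                                   noise=5, shuffle=True, random_state=0)

model = SpectralCoclustering(n_clusters=4, random_state=0)
model.fit(data)

print("row cluster sizes:   ", np.bincount(model.row_labels_))     # genes per bicluster
print("column cluster sizes:", np.bincount(model.column_labels_))  # conditions per bicluster

# Reordering rows and columns by their labels makes the checkerboard/block
# structure visible (e.g., with plt.matshow on the reordered matrix).
reordered = data[np.argsort(model.row_labels_)][:, np.argsort(model.column_labels_)]
```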
Evaluating Cluster Quality
- With known classes, use external measures such as the pairwise F-measure or the best-class F-measure.
- In general, clusters can be evaluated with "internal" as well as "external" measures:
  - Internal measures are related to the inter/intra cluster distances.
  - External measures are related to how representative the current clusters are of the "true" classes.

Inter/Intra Cluster Distances
- Intra-cluster distance: the sum/min/max/average of the (absolute or squared) distances between
  - all pairs of points in the cluster, or
  - the centroid and all points in the cluster, or
  - the "medoid" and all points in the cluster.
- Inter-cluster distance: sum the (squared) distances between all pairs of clusters, where the distance between two clusters is defined as
  - the distance between their centroids/medoids (suited to spherical clusters), or
  - the distance between the closest pair of points belonging to the clusters (suited to chain-shaped clusters).

Davies-Bouldin Index
- A function of the ratio of within-cluster (intra-cluster) scatter to between-cluster (inter-cluster) separation; lower values indicate more compact, better-separated clusters.
- Let C = {C_1, ..., C_k} be a clustering of a set of N objects:
  $DB = \frac{1}{k} \sum_{i=1}^{k} R_i$, with $R_i = \max_{j=1,\dots,k,\; j \ne i} R_{ij}$ and $R_{ij} = \frac{var(C_i) + var(C_j)}{\|c_i - c_j\|}$
  where C_i is the i-th cluster and c_i is the centroid of cluster i.

Davies-Bouldin Index: Example
- For the three clusters shown [figure], compute R_ij = (var(C_i) + var(C_j)) / ||c_i - c_j||:
  var(C1) = 0, var(C2) = 5.5, var(C3) = 2.33; the centroid here is simply the mean, so c1 = 3, c2 = 8.5, c3 = 18.33.
  This gives R12 = 1, R13 = 0.152, R23 = 0.797.
- Now compute R1 = 1 (max of R12 and R13), R2 = 1 (max of R21 and R23), R3 = 0.797 (max of R31 and R32).
- Finally, DB = (1/k) * (R1 + R2 + R3) = (1 + 1 + 0.797) / 3 = 0.932.

Davies-Bouldin Index: Example (continued)
- For the two clusters shown [figure]: var(C1) = 12.33, var(C2) = 2.33; c1 = 6.67, c2 = 18.33, so R12 = 1.26.
- Since we have only 2 clusters here, R1 = R12 = 1.26 and R2 = R21 = 1.26.
- Finally, DB = (1/2)(1.26 + 1.26) = 1.26.

Other Criteria
- Dunn index: let $\delta(X_i, X_j)$ be the inter-cluster distance between clusters X_i and X_j, and $\Delta(X_k)$ the intra-cluster distance of cluster X_k. Then
  $V(U) = \min_{1 \le i \le c} \; \min_{\substack{1 \le j \le c \\ j \ne i}} \left\{ \frac{\delta(X_i, X_j)}{\max_{1 \le k \le c} \Delta(X_k)} \right\}$
  (larger values indicate better clusterings).
- Silhouette method.
- Identifying outliers.
- C-index: compare the sum of distances S over all pairs from the same cluster against the sums obtained from the same number of smallest and of largest pairwise distances.

Example Dataset
- AML/ALL leukemia dataset (Golub et al.): 72 patients (samples), 7129 genes (features), 4 groups.
- Two major types, ALL and AML; T and B cells within ALL; with/without treatment within the AML samples.

Validity Indices on the AML/ALL Dataset
- [Table: values of the validity index variants V11-V63 and their average for c = 2 through 6 on the AML/ALL data]
- Davies-Bouldin index: c = 4. Dunn method: c = 2. Silhouette method: c = 2.

Visual Evaluation: Coherency
- Cluster quality example: do you see clusters? [Figure: two example datasets]
- Silhouette values as a function of the number of clusters C:

  First dataset:            Second dataset:
  C    Silhouette           C    Silhouette
  2    0.4922               2    0.4863
  3    0.5739               3    0.5762
  4    0.4773               4    0.5957
  5    0.4991               5    0.5351
  6    0.5404               6    0.5701
  7    0.5410               7    0.5487
  8    0.5171               8    0.5083
  9    0.5956               9    0.5311
  10   0.6446               10   0.5229
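scikit-learn provides ready-made internal validation scores, including the Davies-Bouldin index and the mean silhouette, so a "which C?" scan like the one above can be reproduced on any dataset. The sketch below uses synthetic blobs; it is an illustration, not the AML/ALL analysis from the slides.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=7)   # synthetic data

for c in range(2, 7):
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    db = davies_bouldin_score(X, labels)    # lower is better
    sil = silhouette_score(X, labels)       # in [-1, 1], higher is better
    print(f"c={c}  Davies-Bouldin={db:.3f}  silhouette={sil:.3f}")
```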
Dimensionality Reduction
- Map points in a high-dimensional space to a lower number of dimensions while preserving structure (pairwise distances, etc.).
- Useful for further processing: less computation, fewer parameters, and the data is easier to understand and visualize.

Dimensionality Reduction: Feature Selection vs. Feature Extraction
- Feature selection: select important features.
  - Pros: the selected features remain meaningful; less work acquiring the data.
  - Unsupervised criteria: variance, fold change, UFF.
- Feature extraction: transforms the entire feature set to a lower dimension.
  - Pros: uses an objective function to select the best projection; sometimes single features are not good enough.
  - Unsupervised methods: PCA, SVD.

Principal Components Analysis (PCA)
- Approximates a high-dimensional data set with a lower-dimensional linear subspace.
- [Figure: data points with the original axes and the first and second principal components]
- The principal components can be computed via the Singular Value Decomposition (SVD) of the data matrix.
- Rule of thumb for selecting the number of components: look for a "knee" in the scree plot, or use the cumulative percentage of variance explained. (A short PCA sketch appears after the summary.)

Tools for Clustering
- Matlab: COMPACT, http://adios.tau.ac.il/compact/
- Cluster + TreeView (Eisen et al.), http://rana.lbl.gov/eisen/?page_id=42

Summary
- Clustering is ill-defined and considered an "art". In practice, this means you need to understand your data beforehand and know how to interpret the clusters afterwards.
- The problem determines the best solution (which measure, which clustering algorithm), so try to experiment with different options.
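As referenced in the PCA section above, here is a minimal scikit-learn sketch of the two rules of thumb for picking the number of components (a knee in the scree plot, or cumulative explained variance). The random matrix and the 90% threshold are illustrative assumptions standing in for a real samples-by-genes matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(72, 500))              # stand-in for a samples-by-genes matrix

X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first
pca = PCA().fit(X_std)

explained = pca.explained_variance_ratio_
cumulative = np.cumsum(explained)

# "Knee" rule: plot explained[:20] as a scree plot and look for the bend.
# Cumulative-variance rule: keep enough components to reach, say, 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print("components for 90% variance:", n_components)

X_reduced = PCA(n_components=n_components).fit_transform(X_std)
print("reduced shape:", X_reduced.shape)
```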