Clustering

Clustering
Credit: Padhraic Smyth
University of California, Irvine
Clustering
• “automated detection of group structure in data”
– Typically: partition N data points into K groups (clusters) such
that the points in each group are more similar to each other
than to points in other groups
– descriptive technique (contrast with predictive)
– for real-valued vectors, clusters can be thought of as clouds of
points in p-dimensional space
Clustering
Sometimes easy
Sometimes impossible
and sometimes in between
Why is Clustering useful?
• “Discovery” of new knowledge from data
  – Contrast with supervised classification (where labels are known)
  – Long history in the sciences of categories, taxonomies, etc.
  – Can be very useful for summarizing large data sets
    • For large n and/or high dimensionality
• Applications of clustering
  – Clustering of documents produced by a search engine
  – Segmentation of customers for an e-commerce store
  – Discovery of new types of galaxies in astronomical data
  – Clustering of genes with similar expression profiles
  – Clustering of pixels in an image into regions of similar intensity
  – … many more
General Issues in Clustering
• Clustering algorithm = Representation + Score + Optimization
• Cluster Representation:
  – What types or “shapes” of clusters are we looking for? What defines a cluster?
  – A clustering = assignment of n objects to K clusters
• Score:
  – Score = quantitative criterion used to evaluate different clusterings
• Optimization and Search:
  – Finding the optimal (minimal/maximal score) clustering is typically NP-hard
  – Greedy algorithms to optimize the score are widely used
• Other issues
  – Distance function D[x(i), x(j)] is a critical aspect of clustering, both for
    • distances between individual pairs of objects
    • distances of individual objects from clusters
  – How is K selected?
  – Different types of data
    • Real-valued versus categorical
    • Attribute-valued vectors vs. an n² distance matrix
Different Types of Clustering Algorithms
• partition-based clustering
– e.g. K-means
• probabilistic model-based clustering
– e.g. fuzzy k-means, mixture models
[both of the above work with measurement data, e.g., feature vectors]
• hierarchical clustering
– e.g. hierarchical agglomerative clustering
• graph-based clustering
– E.g., min-cut algorithms
[both of the above work with distance data, e.g., distance matrix]
Different Types of Input to Clustering Algorithms
• Data matrix:
– N rows, d columns
• Distance matrix
– N x N distances between objects
Partition-Based Clustering
• input:
  – n data points X = {x(1) … x(n)}, each of dimension d
  – K = number of clusters
• output: C = {C1 … CK} = specification of K clusters
  – implicit representation:
    • each x(i) is assigned to a unique Cj (hard assignment)
  – explicit representation:
    • each Cj is specified in some manner, e.g., as a mean or a region in input space
• Optimization algorithm
  – require that score[C] is minimized (or maximized)
    • e.g., sum of squares of within-cluster distances
  – exhaustive search is intractable
  – combinatorial optimization problem: assign n objects to K classes
  – large search space: number of possible clusterings is approximately Kⁿ / K!
    • so, use a greedy iterative method
    • will be subject to local optima
Score Functions for Partition-Based Clustering
• want compact clusters
  – minimize within-cluster distances wc(C)
• want different clusters far apart
  – maximize between-cluster distances bc(C)
• given a cluster partitioning C, find centers c1 … cK
  – e.g., for vectors, use the centroid of the points in cluster Ci:
    ci = (1/ni) Σx∈Ci x
• wc(C) = sum-of-squares within-cluster distance (minimize)
  – wc(C) = Σi=1…K wc(Ci), where wc(Ci) = Σx∈Ci d(x, ci)²
• bc(C) = distance between clusters (maximize)
  – bc(C) = Σi,j=1…K d(ci, cj)²
K-means Clustering
• basic idea:
– Score = wc(C) = sum-of-squares within cluster distance
– start with randomly chosen cluster centers c1 … cK
– repeat until no cluster memberships change:
• assign each point x to cluster with nearest center
– find smallest d(x,ci), over all c1 … cK
• recompute cluster centers over data assigned to them
– ci = (1/ni) Σx∈Ci x
• algorithm terminates (finite number of steps)
– Score(C) decreases at each iteration (if any membership changes)
• converges to at least a local minimum of Score(C)
– not necessarily the global minimum …
– different initial centers (seeds) can lead to diff local minima
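To make the loop above concrete, here is a minimal NumPy sketch of basic K-means (random seeding from the data, hard assignments, centroid updates); the function and variable names are illustrative, and a production implementation would add smarter initialization such as K-means++.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means on an (n, d) array X; returns centers, labels, SSE."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # start with K distinct, randomly chosen data points as the centers
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iter):
        # assignment step: squared Euclidean distance from every point to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                            # no membership changed -> converged
        labels = new_labels
        # update step: each center becomes the centroid of the points it owns
        for k in range(K):
            if np.any(labels == k):          # guard against empty clusters
                centers[k] = X[labels == k].mean(axis=0)
    sse = dists[np.arange(len(X)), labels].sum()
    return centers, labels, sse

# toy usage: two well-separated 2-d blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, labels, sse = kmeans(X, K=2)
```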
Squared Errors and Cluster Centers
• Squared error (distance) between a data point x and a cluster center c:
  d[x, c] = Σj (xj - cj)²
• Total squared error between a cluster center ck and all Nk points assigned to that cluster:
  Sk = Σi d[xi, ck]
• Total squared error summed across the K clusters:
  SSE = Σk Sk
• Distance is usually defined as Euclidean distance
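A direct NumPy transcription of the three quantities above, under the assumption that X holds the data, centers the K cluster centers, and labels the hard assignments (all names are illustrative):

```python
import numpy as np

def total_sse(X, centers, labels):
    """SSE = sum over k of S_k, where S_k = sum of d[x_i, c_k]
    (squared Euclidean distance) over the points assigned to cluster k."""
    total = 0.0
    for k in range(len(centers)):
        diffs = X[labels == k] - centers[k]   # x_i - c_k for every point in cluster k
        total += (diffs ** 2).sum()           # S_k added into the overall SSE
    return total
```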
K-means Complexity
• time complexity = O(I e n K) << exhaustive ≈ Kⁿ / K!
  – I = number of iterations (steps)
  – e = cost of one distance computation (e = p for Euclidean distance in p dimensions)
• Approximations/speed-up tricks for very large n
– use x(i)’s nearest to means as cluster centers instead of actual mean
• reuse of cached distances from the n² distance matrix D (lowers the effective “e”)
– “condense”: reduce “n” by replacing group with prototype
– Additional references:
K-means
1. Ask user how many
clusters they’d like.
(e.g. K=5)
(Example is courtesy of
Andrew Moore, CMU)
K-means
1. Ask user how many
clusters they’d like.
(e.g. K=5)
2. Randomly guess K
cluster Center
locations
K-means
1. Ask user how many
clusters they’d like.
(e.g. K=5)
2. Randomly guess K
cluster Center
locations
3. Each datapoint finds
out which Center it’s
closest to. (Thus
each Center “owns”
a set of datapoints)
K-means
1. Ask user how many
clusters they’d like.
(e.g. k=5)
2. Randomly guess k
cluster Center
locations
3. Each datapoint finds
out which Center it’s
closest to.
4. Each Center finds
the centroid of the
points it owns
K-means
1. Ask user how many
clusters they’d like.
(e.g. k=5)
2. Randomly guess k
cluster Center
locations
3. Each datapoint finds
out which Center it’s
closest to.
4. Each Center finds
the centroid of the
points it owns
5. New Centers =>
new boundaries
6. Repeat until no
change
Image
K-means clustering of RGB (3 value) pixel
color intensities, K = 11 segments
(from David Forsyth, UC Berkeley)
Image
Clusters on color
K-means clustering of RGB (3 value) pixel
color intensities, K = 11 segments
(from David Forsyth, UC Berkeley)
Issues in K-means clustering
• Simple, but useful
– tends to select compact “isotropic” cluster shapes
– can be useful for initializing more complex methods
– many algorithmic variations on the basic theme
• e.g., in signal processing/data compression it is similar to vector quantization
• Choice of distance measure
– Euclidean distance
– Weighted Euclidean distance
– Many others possible
• Selection of K
– “scree diagram”: plot SSE versus K and look for the “knee” of the curve (see the sketch below)
• Limitation: may not be any clear K value
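One hedged way to produce such a scree diagram is with scikit-learn, if it is available; its `inertia_` attribute is exactly the within-cluster SSE used above.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def scree_plot(X, k_max=10):
    """Fit K-means for K = 1..k_max and plot SSE (inertia_) versus K."""
    ks = list(range(1, k_max + 1))
    sses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]
    plt.plot(ks, sses, marker="o")
    plt.xlabel("K (number of clusters)")
    plt.ylabel("within-cluster SSE")
    plt.title("Scree diagram: look for the knee")
    plt.show()
```

The knee, if there is one, suggests a value of K beyond which adding more clusters buys little reduction in SSE.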
Convergence to Global Minimum
• Does K-means always converge to the best possible solution?
– i.e., the set of K centers that minimize the SSE?
• No: always converges to *some* solution, but not necessarily the
best
– Depends on the starting point chosen
• To think about: prove that SSE always decreases after every
iteration of the K-means algorithm, until convergence. (hint: need to
prove that assignment step and computation of cluster centers both
decrease the SSE)
Local Search and Local Minima
Suboptimal Results from K-means on
Simulated Data
Why does k-means not perform so well on this example?
Finite Mixture Models

p(x) = Σk=1…K p(x, ck)
     = Σk=1…K p(x | ck) p(ck)
     = Σk=1…K p(x | ck, θk) wk

where p(x | ck, θk) is the k-th component model with parameters θk, and wk is the k-th component weight.
[Figure: two component densities (Component 1, Component 2) and the resulting Mixture Model, plotted as p(x) vs. x]
[Figure: a second pair of component densities and the resulting Mixture Model, p(x) vs. x]
[Figure: more sharply peaked Component Models and the resulting Mixture Model, p(x) vs. x]
Interpretation of Mixtures
1. C has a direct (physical) interpretation
   e.g., C = {age of fish}, C = {male, female}
2. C is a convenient hidden variable (i.e., the cluster variable)
   – focuses attention on subsets of the data
     e.g., for visualization, clustering, etc.
   – C might have a physical/real interpretation, but not necessarily so
Probabilistic Clustering: Mixture Models
• assume a probabilistic model for each component cluster
• mixture model: f(x) = Σk=1…K wk fk(x; θk)
• the wk are the K mixing weights
  – 0 ≤ wk ≤ 1 and Σk=1…K wk = 1
• the K component densities fk(x; θk) can be:
  – Gaussian
  – Poisson
  – exponential
  – ...
• Note:
– Assumes a model for the data (advantages and disadvantages)
– Results in probabilistic membership: p(cluster k | x)
Gaussian Mixture Models (GMM)
• model for the k-th component is normal, N(μk, Σk)
  – often assume a diagonal covariance: Σjj = σj², Σij = 0 for i ≠ j
  – or sometimes even simpler: Σjj = σ², Σij = 0 for i ≠ j
• f(x) = Σk=1…K wk fk(x; θk) with θk = <μk, Σk> or <μk, σk>
• generative model (see the sketch below):
  – randomly choose a component, where component k is selected with probability wk
  – generate x ~ N(μk, Σk)
  – note: μk and σk are both d-dimensional vectors
Learning Mixture Models from Data
• Score function = log-likelihood L(θ)
  – L(θ) = log p(X | θ) = log ΣH p(X, H | θ)
  – H = hidden variables (cluster memberships of each x)
  – L(θ) cannot be optimized directly
• EM Procedure
  – General technique for maximizing the log-likelihood with missing data
  – For mixtures:
    • E-step: compute “memberships” p(k | x) = wk fk(x; θk) / f(x)
    • M-step: pick a new θ to maximize the expected data log-likelihood
    • Iterate: guaranteed to climb to a (local) maximum of L(θ)
The E (Expectation) Step
[Diagram: n data points by K clusters, with the current parameters (mean and covariance) for each cluster]
E step: compute p(data point i is in group k), given the mean and covariance of each cluster
The M (Maximization) Step
[Diagram: n data points and their memberships, producing new parameters for the K clusters]
M step: compute θ, a new estimate of the mean and covariance for each cluster, given the n data points and their memberships
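Putting the two steps together, here is a sketch of a single EM iteration for a full-covariance Gaussian mixture, using scipy.stats for the component densities; the array names and shapes are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration. X: (n, d); weights: (K,); means: (K, d); covs: (K, d, d)."""
    n, K = len(X), len(weights)
    # E-step: responsibilities p(k | x_i) = w_k f_k(x_i) / f(x_i)
    resp = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=c)
                            for w, m, c in zip(weights, means, covs)])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, covariances from the responsibilities
    Nk = resp.sum(axis=0)                      # expected number of points per cluster
    new_weights = Nk / n
    new_means = (resp.T @ X) / Nk[:, None]
    new_covs = np.zeros((K, X.shape[1], X.shape[1]))
    for k in range(K):
        diff = X - new_means[k]
        new_covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k]
    return new_weights, new_means, new_covs
```

Iterating em_step until the log-likelihood stops improving gives the (local) maximum that EM is guaranteed to reach.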
Complexity of EM for mixtures
[Diagram: n data points by K models]
Complexity per iteration scales as O(n K f(p))
Comments on Mixtures and EM Learning
• Complexity of each EM iteration
  – Depends on the probabilistic model being used
    • e.g., for Gaussians, the E-step is O(nK), the M-step is O(nKp²)
  – Sometimes the E-step or M-step is not closed form
    • => can require numerical optimization or sampling within each iteration
    • Generalized EM (GEM): instead of maximizing the likelihood, just increase the likelihood
    • EM can be thought of as hill-climbing, with direction and step-size provided automatically
• K-means as a special case of EM
  – Gaussian mixtures with isotropic (diagonal, equi-variance) Σk’s
  – Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
• Generalizations…
  – Mixtures of multinomials for text data
  – Mixtures of Markov chains for Web sequences
  – + more
  – Will be discussed later in the lectures on text and Web data
[Figure: ANEMIA PATIENTS AND CONTROLS. Axes: Red Blood Cell Hemoglobin Concentration vs. Red Blood Cell Volume]
[Figure: EM ITERATION 1 (same axes)]
[Figure: EM ITERATION 3 (same axes)]
[Figure: EM ITERATION 5 (same axes)]
[Figure: EM ITERATION 10 (same axes)]
[Figure: EM ITERATION 15 (same axes)]
[Figure: EM ITERATION 25 (same axes)]
[Figure: ANEMIA DATA WITH LABELS (same axes)]
[Figure: LOG-LIKELIHOOD AS A FUNCTION OF EM ITERATIONS. Axes: Log-Likelihood vs. EM Iteration]
Selecting K in mixture models
• Cannot just choose the K that maximizes the likelihood
  – the likelihood L(θ) is always larger for larger K
• Model selection alternatives:
  – 1) Penalize complexity
    • e.g., BIC = L(θ) - (d/2) log n, with d = number of parameters (Bayesian Information Criterion)
    • Asymptotically correct under certain assumptions
    • Often used in practice for mixture models even though the assumptions for the theory are not met
  – 2) Bayesian: compute the posteriors p(k | data)
    • p(k | data) requires computation of p(data | k) = the marginal likelihood
    • Can be tricky to compute for mixture models
    • Recent work on Dirichlet process priors has made this more practical
  – 3) (Cross-)validation:
    • Score different models by log p(Xtest | θ)
    • Split the data into train and validation sets
    • Works well on large data sets
    • Can be noisy on small data (logL is sensitive to outliers)
• Note: all of these methods evaluate the quality of the clustering as a density estimator, rather than with any explicit notion of clustering
Example of BIC Score for Red-Blood Cell Data
[Figure: BIC score versus number of mixture components; the true number of classes (2) is selected by BIC]
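With scikit-learn's GaussianMixture (assuming it is available), this kind of BIC sweep is a few lines; note that sklearn defines bic() with the opposite sign convention (essentially -2 times the penalized log-likelihood above), so smaller is better and we take the argmin.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_max=8):
    """Fit GMMs for K = 1..k_max and return the K with the lowest BIC."""
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
        bics.append(gmm.bic(X))                # lower BIC = preferred model
    return int(np.argmin(bics)) + 1, bics
```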
Hierarchical Clustering
• Representation: tree of nested clusters
• Works from a distance matrix
  – advantage: x’s can be any type of object
  – disadvantage: computation
• Two basic approaches:
  – merge points (agglomerative)
  – divide superclusters (divisive)
• Visualize both via “dendrograms”
  – shows nesting structure
  – merges or splits = tree nodes
• Applications
  – e.g., clustering of gene expression data
  – useful for seeing hierarchical structure, for relatively small data sets
Simple example of hierarchical clustering
Agglomerative Methods: Bottom-Up
•
algorithm based on distance between clusters:
– for i=1 to n let Ci = { x(i) }, i.e. start with n singletons
– while more than one cluster left
• let Ci and Cj be cluster pair with minimum distance, dist[Ci , Cj ]
• merge them, via Ci = Ci  Cj and remove Cj
•
time complexity = O(n2) to O(n3)
– n iterations (start: n clusters; end: 1 cluster)
– 1st iteration: O(n2) to find nearest singleton pair
•
space complexity = O(n2)
– accesses all distances between x(i)’s
•
interpreting large n dendrogram difficult anyway (like decision trees)
– large n idea: partition-based clusters at leafs
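A naive transcription of this merge loop, using the single-link cluster distance defined on the next slide; it is O(n³) or worse and purely illustrative, since real implementations cache and update the pairwise distances.

```python
import numpy as np

def agglomerate(D, K=1):
    """Single-link agglomerative clustering on an (n, n) distance matrix D,
    merging until K clusters remain; returns the clusters as lists of indices."""
    clusters = [[i] for i in range(len(D))]            # start with n singletons
    while len(clusters) > K:
        # find the cluster pair with minimum single-link distance
        best_a, best_b, best_d = None, None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        clusters[best_a].extend(clusters[best_b])      # merge C_b into C_a
        del clusters[best_b]
    return clusters
```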
Distances Between Clusters
• single link / nearest neighbor measure:
  – D(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }
  – can be outlier/noise sensitive
• complete link / furthest neighbor measure:
  – D(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }
  – enforces more “compact” clusters
• intermediates between those extremes:
  – average link: D(Ci, Cj) = avg { d(x, y) | x ∈ Ci, y ∈ Cj }
  – centroid: D(Ci, Cj) = d(ci, cj), where ci, cj are the cluster centroids
    • note that the centroid method requires that a vector mean can be defined
• Which to choose? Different methods may be used for exploratory purposes; it depends on the goals and the application (the SciPy sketch below compares several of them)
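In SciPy (assuming scipy and matplotlib are installed), the linkage criterion is just a string argument, so the different choices are easy to compare on the same data; the toy data below is an assumption for the example.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.randn(50, 2)                    # any (n, d) array of real-valued vectors

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)             # (n-1, 4) merge history
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
    print(method, np.bincount(labels)[1:])    # cluster sizes under each linkage

dendrogram(linkage(X, method="single"))       # visualize the single-link tree
plt.title("single-link dendrogram")
plt.show()
```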
Old-Faithful Eruption Timing Data Set
• Notice that these variables are not scaled: so waiting time will get a
lot more emphasis in Euclidean distance calculations than duration
Dendrogram Using Single-Link Method
Old Faithful Eruption Duration vs Wait Data
Notice how single-link
tends to “chain”.
dendrogram y-axis = crossbar’s distance score
Dendrogram Using Ward’s SSE Distance
Old Faithful Eruption Duration vs Wait Data
More balanced than
single-link.
Hierarchical Cluster
Structure of languages
Scalability of Hierarchical Clustering
• N objects to cluster
• Hierarchical clustering algorithms scale as O(N2) to O(N3)
– Why?
• This is problematic for large N……
• Solutions?
– Use K-means (or a similar algorithm) to create an initial set of K clusters
and then use hierarchical clustering from there
– Use approximate fast algorithms
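A sketch of the first work-around: reduce the N points to a few hundred K-means prototypes, then run hierarchical clustering on the prototypes. The parameter values and the use of scikit-learn plus SciPy are assumptions for the example.

```python
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster

def two_stage_cluster(X, n_prototypes=200, n_clusters=5):
    """K-means down to n_prototypes centers, then hierarchical clustering of the centers."""
    km = KMeans(n_clusters=n_prototypes, n_init=4, random_state=0).fit(X)
    Z = linkage(km.cluster_centers_, method="average")
    proto_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # each original point inherits the label of its prototype
    return proto_labels[km.labels_]
```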
Divisive Methods: Top-Down
• algorithm:
– begin with single cluster containing all data
– split into components, repeat until clusters = single points
• two major types:
– monothetic:
• split by one variable at a time -- restricts search space
• analogous to decision trees
– polythetic
• splits by all variables at once -- many choices makes this difficult
• less commonly used than agglomerative methods
– generally more computationally intensive
• more choices in search space
MATLAB Demo
• In-class Matlab demo comparing different hierarchical algorithms
and k-means, on the same data sets
Spectral/Graph-based Clustering
Idea:
• think of the distance matrix as a weighted graph, where
  – objects are nodes
  – edges exist for objects with high similarity
• finding a good grouping of the objects is somewhat equivalent to finding low-weight subgraphs
  – can be reduced to “min-cut” algorithms
  – related to the eigenstructure of the distance matrix (see the sketch below)
  – e.g., Shi and Malik, 1997, for image segmentation
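A minimal sketch of the eigenstructure idea (a simplified normalized-cut recipe in the spirit of, but not identical to, Shi and Malik): turn distances into affinities, form the normalized graph Laplacian, and run K-means on its leading eigenvectors. The Gaussian affinity and the sigma parameter are assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_cluster(D, K, sigma=1.0):
    """Cluster from an (n, n) distance matrix D via the eigenvectors
    of the symmetric normalized graph Laplacian."""
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))           # affinities = graph edge weights
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg + 1e-12))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt   # L_sym = I - D^-1/2 W D^-1/2
    vals, vecs = eigh(L)                               # eigenvalues in ascending order
    U = vecs[:, :K]                                    # eigenvectors of the K smallest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalize
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(U)
```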
Clustering non-vector objects
• E.g., sequences, images, documents, etc.
  – Can be of varying lengths, sizes
• Distance matrix approach (see the sketch below)
  – E.g., compute edit distances/transformations for pairs of sequences
  – Apply clustering (e.g., hierarchical) based on the distance matrix
  – However… does not scale well computationally
• “Vectorization”
  – Represent each object as a vector
  – Cluster the resulting vectors using a vector-space algorithm
  – However… can lose (e.g., sequence) information by going to a vector space
• Probabilistic model-based clustering
  – Treat as a mixture of stochastic models (e.g., Markov models)
  – Can naturally handle variable lengths/sizes
  – Will discuss the application to Web session clustering later in the quarter
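A toy sketch of the distance-matrix route for sequences: compute pairwise edit (Levenshtein) distances with a small dynamic program, then hand the matrix to hierarchical clustering. As noted above, this is only sensible for modest n; the example sequences are made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

seqs = ["ACCTG", "ACTTG", "GGGTA", "GGCTA", "ACCTA"]
D = np.array([[edit_distance(s, t) for t in seqs] for s in seqs], dtype=float)
Z = linkage(squareform(D), method="average")     # condensed distance matrix as input
print(fcluster(Z, t=2, criterion="maxclust"))    # e.g., cut into 2 clusters of sequences
```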
Clustering of TROPICAL CYCLONES Western North Pacific 1983-2002
(from Scott Gaffney’s PhD thesis, 2004; uses mixture-of-regressions clustering)
K-Means Clustering
Task: Clustering
Representation: Partition based on K centers
Score Function: Within-cluster sum of squared errors
Search/Optimization: Iterative greedy search
Data Management: None specified
Models, Parameters: K centers
Probabilistic Model-Based Clustering
Task: Clustering
Representation: Mixture of probability components
Score Function: Log-likelihood
Search/Optimization: EM (iterative)
Data Management: None specified
Models, Parameters: Probability model
Single-Link Hierarchical Clustering
Task: Clustering
Representation: Tree of nested groupings
Score Function: No global score
Search/Optimization: Iterative merging of nearest neighbors
Data Management: None specified
Models, Parameters: Dendrogram
Summary
• Many different approaches and algorithms
• No “optimal” or “best” approach
– What type of cluster structure are you looking for?
• Computational complexity may be an issue for large n
• Dimensionality is also an issue
• Validation/selection of K is often an ill-posed problem
– Often no “right answer” on what the optimal number of clusters is