Machine Learning Clustering

Untrained Methods
• So far we have analyzed TRAINED machine learning methods: given a set of categories and some labeled training examples, the learner acquires a categorization MODEL that can be applied to classify unseen examples.
• The last methods we describe are UNTRAINED machine learning: no labeled examples are available.

Clustering: the task
• Partition unlabeled examples into disjoint subsets, called clusters, such that:
– Examples within a cluster are "very similar"
– Examples in different clusters are "very different"
• Discover new categories in an unsupervised manner (no sample category labels provided).
• Notice that clusters define a category extensionally (by enumeration) rather than intensionally (through some label or general property of the category).
• Automated labeling of clusters is a sub-problem of clustering.

Clustering Example
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized, inter-cluster distances are maximized.

Applications
• Understanding: group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
• Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).

Notion of a Cluster can be Ambiguous
• How many clusters? (Figure: the same set of points grouped plausibly into two, four, or six clusters.)

Clustering types
• Hierarchical clustering
– Clusters are created iteratively, using clusters created in the previous step.
• Partitional clustering
– A single partition is created, minimizing some cost function.
(Figures: original points and a partitional clustering; a traditional hierarchical clustering of points p1–p4 and the corresponding dendrogram.)

Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples, e.g. animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).
• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Direct Clustering Method
• Direct clustering methods require a specification of the number of clusters, k, desired.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
So, the main issue is the similarity function; a sketch of this loop is given below.
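A minimal sketch of the HAC loop above, assuming one-dimensional instances and using negative absolute distance as the pairwise similarity; the names `hac` and `single_link_sim` are illustrative, not part of the original notes:

```python
def single_link_sim(ci, cj, instances):
    # Similarity of the two most similar members of clusters ci and cj
    # (higher is more similar; here similarity = negative absolute distance).
    return max(-abs(instances[x] - instances[y]) for x in ci for y in cj)

def hac(instances, sim):
    # Start with each instance in its own cluster.
    clusters = [frozenset([i]) for i in range(len(instances))]
    merges = []  # the history of merging forms a binary tree / hierarchy
    while len(clusters) > 1:
        # Determine the two current clusters that are most similar.
        ci, cj = max(((a, b) for i, a in enumerate(clusters)
                             for b in clusters[i + 1:]),
                     key=lambda pair: sim(pair[0], pair[1], instances))
        # Replace ci and cj with the single cluster ci | cj.
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)
        merges.append((ci, cj))
    return merges
```

For example, `hac([1.0, 1.2, 5.0, 5.2], single_link_sim)` first merges the two close pairs and only then joins them into one cluster.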
Cluster Similarity
• Assume a similarity function that determines the similarity of two instances: sim(x, y).
– For example, cosine similarity of document vectors.
• How to compute the similarity of two clusters, each possibly containing multiple instances?
– Single Link: similarity of the two most similar members.
– Complete Link: similarity of the two least similar members.
– Group Average: average similarity between members.

Single Link Agglomerative Clustering
• Use the maximum similarity of pairs: $\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Can result in "straggly" (long and thin) clusters due to a chaining effect.
– Appropriate in some domains, such as clustering islands.

Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs: $\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Makes "tighter," more spherical clusters that are typically preferable.

Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²).
• In each of the subsequent n−2 merging iterations, the algorithm must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Computing Cluster Similarity
• After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
– Single Link: $\mathrm{sim}(c_i \cup c_j, c_k) = \max(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k))$
– Complete Link: $\mathrm{sim}(c_i \cup c_j, c_k) = \min(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k))$

Group Average Agglomerative Clustering
• Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:
  $\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \sum_{y \in c_i \cup c_j:\, y \neq x} \mathrm{sim}(x, y)$
• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged cluster, instead of unordered pairs between the two clusters, to encourage tight clusters.

Computing Group Average Similarity
• Assume cosine similarity and normalized vectors with unit length.
• Always maintain the sum of vectors in each cluster: $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$
• Compute the similarity of clusters in constant time:
  $\mathrm{sim}(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}$

Non-Hierarchical Clustering
• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.

K-Means
• Assumes instances are real-valued vectors.
• Clusters are based on centroids, the center of gravity or mean of the points in a cluster c: $\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics
• Euclidean distance (L2 norm): $L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
• L1 norm: $L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$
• Cosine similarity (transformed to a distance by subtracting from 1): $1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$
These three measures are sketched in code below.
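A small sketch of the three distance measures in plain Python (no external libraries assumed; the function names are illustrative):

```python
import math

def l2(x, y):
    # Euclidean distance (L2 norm) between two vectors of equal length.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1(x, y):
    # L1 (Manhattan) norm.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_distance(x, y):
    # Cosine similarity turned into a distance by subtracting it from 1.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norm
```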
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj: sj = μ(cj)

K-Means Example (K = 2)
• Pick seeds → assign points to clusters → compute centroids → reassign clusters → recompute centroids → reassign clusters → converged.

Time Complexity
• Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, i.e. O(knm).
• Computing centroids: each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.

K-Means Objective
• The objective of k-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid:
  $\sum_{l=1}^{K} \sum_{\vec{x}_i \in X_l} \|\vec{x}_i - \vec{\mu}_l\|^2$
• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to converge to a local optimum.

Seed Choice
• Results can vary based on the random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic or the results of another method.
• K-means++: a principled seed-selection method (David Arthur and Sergei Vassilvitskii, 2007).
• With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive with the optimal k-means solution.

K-means++
• Choose one center uniformly at random from among the data points.
• For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
• Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
• Repeat steps 2 and 3 until k centers have been chosen.
• Now that the initial centers have been chosen, proceed using standard k-means clustering (see the sketch below).
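A minimal sketch of k-means with k-means++ seeding, assuming points are tuples of floats; the function names (`dist`, `kmeans_pp_seeds`, `kmeans`) are illustrative:

```python
import math
import random

def dist(x, y):
    # Euclidean distance between two points represented as tuples of floats.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans_pp_seeds(points, k):
    # k-means++ seeding: pick the first seed uniformly at random, then pick each
    # new seed with probability proportional to D(x)^2, the squared distance
    # from x to the nearest seed chosen so far.
    seeds = [random.choice(points)]
    while len(seeds) < k:
        d2 = [min(dist(p, s) ** 2 for s in seeds) for p in points]
        seeds.append(random.choices(points, weights=d2, k=1)[0])
    return seeds

def kmeans(points, k, iters=100):
    # Standard k-means: assign each point to its nearest centroid, then move
    # each centroid to the mean of its cluster, until assignments stabilize.
    centroids = kmeans_pp_seeds(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, `kmeans([(1.0, 2.0), (1.2, 1.9), (8.0, 8.1), (7.9, 8.3)], k=2)` should recover the two obvious groups of points.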
Soft Clustering
• Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
• This does not allow uncertainty in class membership, or an instance to belong to more than one cluster.
• Soft clustering gives the probabilities that an instance belongs to each of a set of clusters.
• Each instance is assigned a probability distribution across a set of discovered categories (the probabilities of all categories must sum to 1).

Learning from Probabilistically Labeled Data
• Instead of training data labeled with "hard" category labels, the training data is labeled with "soft" probabilistic category labels.
• When estimating model parameters from training data, weight counts by the corresponding probability of the given category label.
• For example, if P(c1 | E) = 0.8 and P(c2 | E) = 0.2, each item xj in E contributes only 0.8 towards the counts n1 and n1j, and 0.2 towards the counts n2 and n2j.

Expectation Maximization (EM)
• Probabilistic method for soft clustering.
• Direct method that assumes k clusters: {c1, c2, …, ck}.
• Soft version of k-means.
• Assumes a probabilistic model of categories that allows computing P(ci | E) for each category ci, for a given example E.

EM Algorithm
• Iterative method for learning a probabilistic categorization model from unsupervised data.
• Initially assume a random assignment of examples to categories.
• Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
• Iterate the following two steps until convergence:
– Expectation (E-step): compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
– Maximization (M-step): re-estimate the model parameters, θ, from the probabilistically re-labeled data.

EM (illustrated)
• Initialize: assign random probabilistic labels to the unlabeled data.
• Initialize: give the soft-labeled training data to a probabilistic learner.
• Initialize: produce a probabilistic classifier.
• E step: re-label the unlabeled data using the trained classifier.
• M step: retrain the classifier on the re-labeled data.
• Continue EM iterations until the probabilistic labels on the unlabeled data converge.

A 2-Cluster Example
• In this example, it is assumed that there are two clusters and that the attribute-value distributions in both clusters are normal distributions, N(μ1, σ1²) and N(μ2, σ2²) (μ, σ = mean and standard deviation).
• Assume that we have the following 51 samples, and it is believed that these samples are drawn from two normal distributions:
  51 43 62 64 45 42 46 45 45 62 47 52 64 51 65 48 49 46 64 51 52 62 49 48 62 43 40 48 64 51 63 43 65 66 65 46 39 62 64 52 63 64 48 64 48 51 48 64 42 48 41
• The examples have just one attribute, whose values are shown here, e.g. age at myocardial infarction. We suspect that the ages vary between countryside and city, and want to verify this.

Operation of the EM Algorithm
• The EM algorithm is used to figure out the parameters of the model.
• Let {s1, s2, …, sn} denote the set of (51) samples.
• In this example, we need to figure out the following 5 parameters: μ1, σ1, μ2, σ2, P(C1).
• For a general 1-dimensional case that has k clusters, we need to figure out a total of 2k + (k − 1) parameters.
• The EM algorithm begins with an initial guess of the parameter values.

Example (2)
• For our example, we initially assume that P(C1) = 0.5. Accordingly, we divide the dataset into two subsets (at random):
– C1: {51(×2.5), 49(×2), 48(×7), 47(×1), 46(×3), 45(×3), 43(×3), 42(×2), 40(×1), 39(×1)}
– C2: {51(×2.5), 52(×3), 62(×5), 63(×2), 64(×8), 65(×3), 66(×1), 41(×1)}
– μ1 = 47.63, σ1 = 3.79
– μ2 = 58.53, σ2 = 5.57
• (×n = number of occurrences of a value assigned to a cluster; the five occurrences of the value 51 are split 2.5/2.5, so each subset totals 25.5 samples.)

Example (3)
• E step: the probability that sample si belongs to each of the two clusters is recomputed as follows (remember the hypothesis of normal distributions):
  $P(C_1 \mid x_i) = \frac{P(x_i \mid C_1)\,P(C_1)}{P(x_i)} = \frac{P(C_1)}{P(x_i)\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}}$
  $P(C_2 \mid x_i) = \frac{P(x_i \mid C_2)\,P(C_2)}{P(x_i)} = \frac{P(C_2)}{P(x_i)\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}}$
  $p_i = \frac{P(C_1 \mid x_i)}{P(C_1 \mid x_i) + P(C_2 \mid x_i)}$, where xi is the attribute value of sample si.

Example (4)
• For example, for an object with value 52 (3 occurrences) we have
  $P(C_1 \mid 52) = \frac{0.5}{P(x{=}52)\sqrt{2\pi}\cdot 3.79}\, e^{-\frac{(52 - 47.63)^2}{2(3.79)^2}} = \frac{0.5 \cdot 0.1357}{P(x{=}52)\sqrt{2\pi}}$
  $P(C_2 \mid 52) = \frac{0.5}{P(x{=}52)\sqrt{2\pi}\cdot 5.57}\, e^{-\frac{(52 - 58.53)^2}{2(5.57)^2}} = \frac{0.5 \cdot 0.0903}{P(x{=}52)\sqrt{2\pi}}$
  $p_{52} = \frac{P(C_1 \mid 52)}{P(C_1 \mid 52) + P(C_2 \mid 52)} = \frac{0.1357}{0.1357 + 0.0903} \approx 0.600$

Example (5)
• M step: the new estimated values of the parameters are computed as follows:
  $\mu_1 = \frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} p_i}; \quad \mu_2 = \frac{\sum_{i=1}^{n} (1 - p_i) x_i}{\sum_{i=1}^{n} (1 - p_i)}$
  $\sigma_1^2 = \frac{\sum_{i=1}^{n} p_i (x_i - \mu_1)^2}{\sum_{i=1}^{n} p_i}; \quad \sigma_2^2 = \frac{\sum_{i=1}^{n} (1 - p_i)(x_i - \mu_2)^2}{\sum_{i=1}^{n} (1 - p_i)}$
  $P(C_1) = \frac{\sum_{i=1}^{n} p_i}{n}$
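A compact sketch of this E/M iteration for the one-attribute, two-Gaussian example; the function names are illustrative, and the data and initial parameters are those given on the slides above:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(xs, mu1, s1, mu2, s2, p_c1=0.5, iters=50):
    for _ in range(iters):
        # E step: p[i] = P(C1 | x_i), the probability that sample i belongs to C1.
        p = [p_c1 * normal_pdf(x, mu1, s1)
             / (p_c1 * normal_pdf(x, mu1, s1) + (1 - p_c1) * normal_pdf(x, mu2, s2))
             for x in xs]
        # M step: re-estimate means, standard deviations and the prior P(C1)
        # using the soft counts p_i and (1 - p_i).
        w1, w2 = sum(p), sum(1 - pi for pi in p)
        mu1 = sum(pi * x for pi, x in zip(p, xs)) / w1
        mu2 = sum((1 - pi) * x for pi, x in zip(p, xs)) / w2
        s1 = math.sqrt(sum(pi * (x - mu1) ** 2 for pi, x in zip(p, xs)) / w1)
        s2 = math.sqrt(sum((1 - pi) * (x - mu2) ** 2 for pi, x in zip(p, xs)) / w2)
        p_c1 = w1 / len(xs)
    return mu1, s1, mu2, s2, p_c1
```

Called with the 51 ages above and the initial guesses μ1 = 47.63, σ1 = 3.79, μ2 = 58.53, σ2 = 5.57, the repeated E and M steps move the parameters toward the two underlying age groups.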
Example (6)
• The process is repeated until the clustering results converge.
• Generally, we attempt to maximize the following likelihood function:
  $\prod_i \left[ P(C_1)\,P(x_i \mid C_1) + P(C_2)\,P(x_i \mid C_2) \right]$
(From the lecture notes of Dr. Yen-Jen Oyang.)

Example (7)
• Once we have figured out the approximate parameter values, we assign sample si to C1 if
  $p_i = \frac{P(C_1 \mid x_i)}{P(C_1 \mid x_i) + P(C_2 \mid x_i)} > 0.5$,
  where $P(C_j \mid x_i) = \frac{P(x_i \mid C_j)\,P(C_j)}{P(x_i)} = \frac{P(C_j)}{P(x_i)\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$.
• Otherwise, si is assigned to C2.

Semi-Supervised Learning
• For supervised categorization, generating labeled training data is expensive.
• Idea: use unlabeled data to aid supervised categorization.
• Use EM in a semi-supervised mode by training EM on both labeled and unlabeled data.
– Train the initial probabilistic model on the user-labeled subset of the data instead of randomly labeled unsupervised data.
– The labels of user-labeled examples are "frozen" and never relabeled during the EM iterations.
– The labels of the unlabeled data are continually probabilistically relabeled by EM.

Semi-Supervised EM (illustrated)
• (Figures: the probabilistic learner is trained on the user-labeled training examples together with the probabilistically labeled unlabeled examples; retraining iterations continue until the probabilistic labels on the unlabeled data converge.)

Issues in Unsupervised Learning
• How to evaluate clustering?
– Internal:
  • Tightness and separation of the clusters (e.g. the k-means objective; see the short sketch at the end of these notes)
  • Fit of a probabilistic model to the data
– External:
  • Compare to known class labels on benchmark data
• Improving search to converge faster and avoid local minima.
• Overlapping clustering.
• Ensemble clustering.
• Clustering structured relational data.
• Semi-supervised methods other than EM:
– Co-training
– Transductive SVMs
– Semi-supervised clustering (must-link, cannot-link)

Conclusions
• Unsupervised learning induces categories from unlabeled data.
• There are a variety of approaches, including:
– HAC
– k-means
– EM
• Semi-supervised learning uses both labeled and unlabeled data to improve results.
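As a closing supplement to the internal-evaluation point above (tightness of clusters, i.e. the k-means objective), a short illustrative helper that scores a clustering by its within-cluster sum of squared distances; the function name is hypothetical:

```python
def within_cluster_sse(clusters, centroids):
    # Internal evaluation: total squared Euclidean distance of every point
    # to its cluster centroid (the k-means objective); lower means tighter.
    def dist2(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(dist2(p, centroids[j])
               for j, cluster in enumerate(clusters) for p in cluster)
```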