Machine Learning Clustering

Untrained Methods
• So far we have analyzed TRAINED machine learning methods: given a set of categories and some labeled training examples, the learner acquires a categorization MODEL that can be applied to classify unseen examples.
• The last methods we describe are UNTRAINED machine learning: no labeled examples are available.

Clustering: the task
• Partition unlabeled examples into disjoint subsets, called clusters, such that:
– Examples within a cluster are "very similar"
– Examples in different clusters are "very different"
• Discover new categories in an unsupervised manner (no sample category labels provided).
• Notice that clusters define a category extensionally (by enumeration) rather than intensionally (through some label or general property of the category).
• Automated labeling of clusters is a sub-problem of clustering.

Clustering Example
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups: intra-cluster distances are minimized, inter-cluster distances are maximized.

Applications
• Understanding: group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
• Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).

Notion of a Cluster can be Ambiguous
• How many clusters? (Figure: the same set of points grouped plausibly into two, four, or six clusters.)

Clustering types
• Hierarchical clustering
– Clusters are created iteratively, using clusters created in the previous step.
• Partitional clustering
– A single partition is created, minimizing some cost function.
(Figures: original points and a partitional clustering; a traditional hierarchical clustering of points p1–p4 and the corresponding dendrogram.)

Hierarchical Clustering
• Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples, e.g. animal → vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).
• Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (partitional, top-down) methods separate all examples immediately into clusters.

Direct Clustering Method
• Direct clustering methods require a specification of the number of clusters, k, desired.
• A clustering evaluation function assigns a real-valued quality measure to a clustering.
• The number of clusters can be determined automatically by explicitly generating clusterings for multiple values of k and choosing the best result according to a clustering evaluation function.

Hierarchical Agglomerative Clustering (HAC)
• Assumes a similarity function for determining the similarity of two instances.
• Starts with all instances in separate clusters and then repeatedly joins the two clusters that are most similar, until there is only one cluster.
• The history of merging forms a binary tree or hierarchy.

HAC Algorithm
Start with all instances in their own cluster.
Until there is only one cluster:
  Among the current clusters, determine the two clusters, ci and cj, that are most similar.
  Replace ci and cj with a single cluster ci ∪ cj.
So, the main issue is the similarity function; a sketch of this loop is given below.
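A minimal sketch of the HAC loop above, assuming one-dimensional instances and using negative absolute distance as the pairwise similarity; the names `hac` and `single_link_sim` are illustrative, not part of the original notes:

```python
def single_link_sim(ci, cj, instances):
    # Similarity of the two most similar members of clusters ci and cj
    # (higher is more similar; here similarity = negative absolute distance).
    return max(-abs(instances[x] - instances[y]) for x in ci for y in cj)

def hac(instances, sim):
    # Start with each instance in its own cluster.
    clusters = [frozenset([i]) for i in range(len(instances))]
    merges = []  # the history of merging forms a binary tree / hierarchy
    while len(clusters) > 1:
        # Determine the two current clusters that are most similar.
        ci, cj = max(((a, b) for i, a in enumerate(clusters)
                             for b in clusters[i + 1:]),
                     key=lambda pair: sim(pair[0], pair[1], instances))
        # Replace ci and cj with the single cluster ci | cj.
        clusters.remove(ci)
        clusters.remove(cj)
        clusters.append(ci | cj)
        merges.append((ci, cj))
    return merges
```

For example, `hac([1.0, 1.2, 5.0, 5.2], single_link_sim)` first merges the two close pairs and only then joins them into one cluster.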
Cluster Similarity
• Assume a similarity function that determines the similarity of two instances: sim(x, y).
– For example, cosine similarity of document vectors.
• How to compute the similarity of two clusters, each possibly containing multiple instances?
– Single Link: similarity of the two most similar members.
– Complete Link: similarity of the two least similar members.
– Group Average: average similarity between members.

Single Link Agglomerative Clustering
• Use the maximum similarity of pairs: $\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Can result in "straggly" (long and thin) clusters due to a chaining effect.
– Appropriate in some domains, such as clustering islands.

Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs: $\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
• Makes "tighter," more spherical clusters that are typically preferable.

Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of the n individual instances, which is O(n²).
• In each of the subsequent n−2 merging iterations, the algorithm must compute the distance between the most recently created cluster and all other existing clusters.
• In order to maintain an overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.

Computing Cluster Similarity
• After merging ci and cj, the similarity of the resulting cluster to any other cluster, ck, can be computed by:
– Single Link: $\mathrm{sim}(c_i \cup c_j, c_k) = \max(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k))$
– Complete Link: $\mathrm{sim}(c_i \cup c_j, c_k) = \min(\mathrm{sim}(c_i, c_k),\ \mathrm{sim}(c_j, c_k))$

Group Average Agglomerative Clustering
• Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:
  $\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \sum_{y \in c_i \cup c_j:\, y \neq x} \mathrm{sim}(x, y)$
• Compromise between single and complete link.
• Averaged across all ordered pairs in the merged cluster, instead of unordered pairs between the two clusters, to encourage tight clusters.

Computing Group Average Similarity
• Assume cosine similarity and normalized vectors with unit length.
• Always maintain the sum of vectors in each cluster: $\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$
• Compute the similarity of clusters in constant time:
  $\mathrm{sim}(c_i, c_j) = \frac{(\vec{s}(c_i) + \vec{s}(c_j)) \cdot (\vec{s}(c_i) + \vec{s}(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)(|c_i| + |c_j| - 1)}$

Non-Hierarchical Clustering
• Typically must provide the number of desired clusters, k.
• Randomly choose k instances as seeds, one per cluster.
• Form initial clusters based on these seeds.
• Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
• Stop when the clustering converges or after a fixed number of iterations.

K-Means
• Assumes instances are real-valued vectors.
• Clusters are based on centroids, the center of gravity or mean of the points in a cluster c: $\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$
• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Distance Metrics
• Euclidean distance (L2 norm): $L_2(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$
• L1 norm: $L_1(\vec{x}, \vec{y}) = \sum_{i=1}^{m} |x_i - y_i|$
• Cosine similarity (transformed to a distance by subtracting from 1): $1 - \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$
These three measures are sketched in code below.
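A small sketch of the three distance measures in plain Python (no external libraries assumed; the function names are illustrative):

```python
import math

def l2(x, y):
    # Euclidean distance (L2 norm) between two vectors of equal length.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1(x, y):
    # L1 (Manhattan) norm.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine_distance(x, y):
    # Cosine similarity turned into a distance by subtracting it from 1.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norm
```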
K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj: sj = μ(cj)

K-Means Example (K = 2)
• Pick seeds → assign points to clusters → compute centroids → reassign clusters → recompute centroids → reassign clusters → converged.

Time Complexity
• Assume computing the distance between two instances is O(m), where m is the dimensionality of the vectors.
• Reassigning clusters: O(kn) distance computations, i.e. O(knm).
• Computing centroids: each instance vector gets added once to some centroid: O(nm).
• Assume these two steps are each done once for I iterations: O(Iknm).
• Linear in all relevant factors, assuming a fixed number of iterations; more efficient than O(n²) HAC.

K-Means Objective
• The objective of k-means is to minimize the total sum of the squared distances of every point to its corresponding cluster centroid:
  $\sum_{l=1}^{K} \sum_{\vec{x}_i \in X_l} \|\vec{x}_i - \vec{\mu}_l\|^2$
• Finding the global optimum is NP-hard.
• The k-means algorithm is guaranteed to converge to a local optimum.

Seed Choice
• Results can vary based on the random seed selection.
• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
• Select good seeds using a heuristic or the results of another method.
• K-means++: a principled seed-selection method (David Arthur and Sergei Vassilvitskii, 2007).
• With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive with the optimal k-means solution.

K-means++
• Choose one center uniformly at random from among the data points.
• For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
• Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
• Repeat steps 2 and 3 until k centers have been chosen.
• Now that the initial centers have been chosen, proceed using standard k-means clustering (see the sketch below).
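A minimal sketch of k-means with k-means++ seeding, assuming points are tuples of floats; the function names (`dist`, `kmeans_pp_seeds`, `kmeans`) are illustrative:

```python
import math
import random

def dist(x, y):
    # Euclidean distance between two points represented as tuples of floats.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans_pp_seeds(points, k):
    # k-means++ seeding: pick the first seed uniformly at random, then pick each
    # new seed with probability proportional to D(x)^2, the squared distance
    # from x to the nearest seed chosen so far.
    seeds = [random.choice(points)]
    while len(seeds) < k:
        d2 = [min(dist(p, s) ** 2 for s in seeds) for p in points]
        seeds.append(random.choices(points, weights=d2, k=1)[0])
    return seeds

def kmeans(points, k, iters=100):
    # Standard k-means: assign each point to its nearest centroid, then move
    # each centroid to the mean of its cluster, until assignments stabilize.
    centroids = kmeans_pp_seeds(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[j].append(p)
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters
```

For example, `kmeans([(1.0, 2.0), (1.2, 1.9), (8.0, 8.1), (7.9, 8.3)], k=2)` should recover the two obvious groups of points.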
Soft Clustering
• Clustering typically assumes that each instance is given a "hard" assignment to exactly one cluster.
• This does not allow uncertainty in class membership, or an instance to belong to more than one cluster.
• Soft clustering gives the probabilities that an instance belongs to each of a set of clusters.
• Each instance is assigned a probability distribution across a set of discovered categories (the probabilities of all categories must sum to 1).

Learning from Probabilistically Labeled Data
• Instead of training data labeled with "hard" category labels, the training data is labeled with "soft" probabilistic category labels.
• When estimating model parameters from training data, weight counts by the corresponding probability of the given category label.
• For example, if P(c1 | E) = 0.8 and P(c2 | E) = 0.2, each item xj in E contributes only 0.8 towards the counts n1 and n1j, and 0.2 towards the counts n2 and n2j.

Expectation Maximization (EM)
• Probabilistic method for soft clustering.
• Direct method that assumes k clusters: {c1, c2, …, ck}.
• Soft version of k-means.
• Assumes a probabilistic model of categories that allows computing P(ci | E) for each category ci, for a given example E.

EM Algorithm
• Iterative method for learning a probabilistic categorization model from unsupervised data.
• Initially assume a random assignment of examples to categories.
• Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
• Iterate the following two steps until convergence:
– Expectation (E-step): compute P(ci | E) for each example given the current model, and probabilistically re-label the examples based on these posterior probability estimates.
– Maximization (M-step): re-estimate the model parameters, θ, from the probabilistically re-labeled data.

EM (illustrated)
• Initialize: assign random probabilistic labels to the unlabeled data.
• Initialize: give the soft-labeled training data to a probabilistic learner.
• Initialize: produce a probabilistic classifier.
• E step: re-label the unlabeled data using the trained classifier.
• M step: retrain the classifier on the re-labeled data.
• Continue EM iterations until the probabilistic labels on the unlabeled data converge.

A 2-Cluster Example
• In this example, it is assumed that there are two clusters and that the attribute-value distributions in both clusters are normal distributions, N(μ1, σ1²) and N(μ2, σ2²) (μ, σ = mean and standard deviation).
• Assume that we have the following 51 samples, and it is believed that these samples are drawn from two normal distributions:
  51 43 62 64 45 42 46 45 45 62 47 52 64 51 65 48 49 46 64 51 52 62 49 48 62 43 40 48 64 51 63 43 65 66 65 46 39 62 64 52 63 64 48 64 48 51 48 64 42 48 41
• The examples have just one attribute, whose values are shown here, e.g. age at myocardial infarction. We suspect that the ages vary between countryside and city, and want to verify this.

Operation of the EM Algorithm
• The EM algorithm is used to figure out the parameters of the model.
• Let {s1, s2, …, sn} denote the set of (51) samples.
• In this example, we need to figure out the following 5 parameters: μ1, σ1, μ2, σ2, P(C1).
• For a general 1-dimensional case that has k clusters, we need to figure out a total of 2k + (k − 1) parameters.
• The EM algorithm begins with an initial guess of the parameter values.

Example (2)
• For our example, we initially assume that P(C1) = 0.5. Accordingly, we divide the dataset into two subsets (at random):
– C1: {51(×2.5), 49(×2), 48(×7), 47(×1), 46(×3), 45(×3), 43(×3), 42(×2), 40(×1), 39(×1)}
– C2: {51(×2.5), 52(×3), 62(×5), 63(×2), 64(×8), 65(×3), 66(×1), 41(×1)}
– μ1 = 47.63, σ1 = 3.79
– μ2 = 58.53, σ2 = 5.57
• (×n = number of occurrences of a value assigned to a cluster; the five occurrences of the value 51 are split 2.5/2.5, so each subset totals 25.5 samples.)

Example (3)
• E step: the probability that sample si belongs to each of the two clusters is recomputed as follows (remember the hypothesis of normal distributions):
  $P(C_1 \mid x_i) = \frac{P(x_i \mid C_1)\,P(C_1)}{P(x_i)} = \frac{P(C_1)}{P(x_i)\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_i - \mu_1)^2}{2\sigma_1^2}}$
  $P(C_2 \mid x_i) = \frac{P(x_i \mid C_2)\,P(C_2)}{P(x_i)} = \frac{P(C_2)}{P(x_i)\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x_i - \mu_2)^2}{2\sigma_2^2}}$
  $p_i = \frac{P(C_1 \mid x_i)}{P(C_1 \mid x_i) + P(C_2 \mid x_i)}$, where xi is the attribute value of sample si.

Example (4)
• For example, for an object with value 52 (3 occurrences) we have
  $P(C_1 \mid 52) = \frac{0.5}{P(x{=}52)\sqrt{2\pi}\cdot 3.79}\, e^{-\frac{(52 - 47.63)^2}{2(3.79)^2}} = \frac{0.5 \cdot 0.1357}{P(x{=}52)\sqrt{2\pi}}$
  $P(C_2 \mid 52) = \frac{0.5}{P(x{=}52)\sqrt{2\pi}\cdot 5.57}\, e^{-\frac{(52 - 58.53)^2}{2(5.57)^2}} = \frac{0.5 \cdot 0.0903}{P(x{=}52)\sqrt{2\pi}}$
  $p_{52} = \frac{P(C_1 \mid 52)}{P(C_1 \mid 52) + P(C_2 \mid 52)} = \frac{0.1357}{0.1357 + 0.0903} \approx 0.600$

Example (5)
• M step: the new estimated values of the parameters are computed as follows:
  $\mu_1 = \frac{\sum_{i=1}^{n} p_i x_i}{\sum_{i=1}^{n} p_i}; \quad \mu_2 = \frac{\sum_{i=1}^{n} (1 - p_i) x_i}{\sum_{i=1}^{n} (1 - p_i)}$
  $\sigma_1^2 = \frac{\sum_{i=1}^{n} p_i (x_i - \mu_1)^2}{\sum_{i=1}^{n} p_i}; \quad \sigma_2^2 = \frac{\sum_{i=1}^{n} (1 - p_i)(x_i - \mu_2)^2}{\sum_{i=1}^{n} (1 - p_i)}$
  $P(C_1) = \frac{\sum_{i=1}^{n} p_i}{n}$
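A compact sketch of this E/M iteration for the one-attribute, two-Gaussian example; the function names are illustrative, and the data and initial parameters are those given on the slides above:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def em_two_gaussians(xs, mu1, s1, mu2, s2, p_c1=0.5, iters=50):
    for _ in range(iters):
        # E step: p[i] = P(C1 | x_i), the probability that sample i belongs to C1.
        p = [p_c1 * normal_pdf(x, mu1, s1)
             / (p_c1 * normal_pdf(x, mu1, s1) + (1 - p_c1) * normal_pdf(x, mu2, s2))
             for x in xs]
        # M step: re-estimate means, standard deviations and the prior P(C1)
        # using the soft counts p_i and (1 - p_i).
        w1, w2 = sum(p), sum(1 - pi for pi in p)
        mu1 = sum(pi * x for pi, x in zip(p, xs)) / w1
        mu2 = sum((1 - pi) * x for pi, x in zip(p, xs)) / w2
        s1 = math.sqrt(sum(pi * (x - mu1) ** 2 for pi, x in zip(p, xs)) / w1)
        s2 = math.sqrt(sum((1 - pi) * (x - mu2) ** 2 for pi, x in zip(p, xs)) / w2)
        p_c1 = w1 / len(xs)
    return mu1, s1, mu2, s2, p_c1
```

Called with the 51 ages above and the initial guesses μ1 = 47.63, σ1 = 3.79, μ2 = 58.53, σ2 = 5.57, the repeated E and M steps move the parameters toward the two underlying age groups.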
Example (6)
• The process is repeated until the clustering results converge.
• Generally, we attempt to maximize the following likelihood function:
  $\prod_i \left[ P(C_1)\,P(x_i \mid C_1) + P(C_2)\,P(x_i \mid C_2) \right]$
(From the lecture notes of Dr. Yen-Jen Oyang.)

Example (7)
• Once we have figured out the approximate parameter values, we assign sample si to C1 if
  $p_i = \frac{P(C_1 \mid x_i)}{P(C_1 \mid x_i) + P(C_2 \mid x_i)} > 0.5$,
  where $P(C_j \mid x_i) = \frac{P(x_i \mid C_j)\,P(C_j)}{P(x_i)} = \frac{P(C_j)}{P(x_i)\sqrt{2\pi}\,\sigma_j}\, e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$.
• Otherwise, si is assigned to C2.

Semi-Supervised Learning
• For supervised categorization, generating labeled training data is expensive.
• Idea: use unlabeled data to aid supervised categorization.
• Use EM in a semi-supervised mode by training EM on both labeled and unlabeled data.
– Train the initial probabilistic model on the user-labeled subset of the data instead of randomly labeled unsupervised data.
– The labels of user-labeled examples are "frozen" and never relabeled during the EM iterations.
– The labels of the unlabeled data are continually probabilistically relabeled by EM.

Semi-Supervised EM (illustrated)
• (Figures: the probabilistic learner is trained on the user-labeled training examples together with the probabilistically labeled unlabeled examples; retraining iterations continue until the probabilistic labels on the unlabeled data converge.)

Issues in Unsupervised Learning
• How to evaluate clustering?
– Internal:
  • Tightness and separation of the clusters (e.g. the k-means objective; see the short sketch at the end of these notes)
  • Fit of a probabilistic model to the data
– External:
  • Compare to known class labels on benchmark data
• Improving search to converge faster and avoid local minima.
• Overlapping clustering.
• Ensemble clustering.
• Clustering structured relational data.
• Semi-supervised methods other than EM:
– Co-training
– Transductive SVMs
– Semi-supervised clustering (must-link, cannot-link)

Conclusions
• Unsupervised learning induces categories from unlabeled data.
• There are a variety of approaches, including:
– HAC
– k-means
– EM
• Semi-supervised learning uses both labeled and unlabeled data to improve results.
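As a closing supplement to the internal-evaluation point above (tightness of clusters, i.e. the k-means objective), a short illustrative helper that scores a clustering by its within-cluster sum of squared distances; the function name is hypothetical:

```python
def within_cluster_sse(clusters, centroids):
    # Internal evaluation: total squared Euclidean distance of every point
    # to its cluster centroid (the k-means objective); lower means tighter.
    def dist2(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(dist2(p, centroids[j])
               for j, cluster in enumerate(clusters) for p in cluster)
```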