CS315-L17-Clustering

Basic Machine Learning: Clustering
CS 315 – Web Search and Data Mining

Supervised vs. Unsupervised Learning
Two fundamental methods in machine learning:

Supervised Learning ("learn from my example")
- Goal: a program that performs a task as well as humans.
- TASK: well defined (the target function)
- EXPERIENCE: training data provided by a human
- PERFORMANCE: error/accuracy on the task

Unsupervised Learning ("see what you can find")
- Goal: to find some kind of structure in the data.
- TASK: vaguely defined
- No EXPERIENCE
- No PERFORMANCE (but there are some evaluation metrics)

What is Clustering?
- The most common form of unsupervised learning.
- Clustering is the process of grouping a set of physical or abstract objects into classes ("clusters") of similar objects.
- It can be used in IR:
  - To improve recall in search
  - For better navigation of search results
[Figure: plot with a vertical axis from 0 to 1 and a horizontal axis from 0 to 1.2]

Ex1: Cluster to Improve Recall
- Cluster hypothesis: documents with similar text are related.
- Thus, when a query matches a document D, also return the other documents in the cluster containing D.

Ex2: Cluster for Better Navigation

Clustering Characteristics
Flat clustering vs. hierarchical clustering
- Flat: just dividing objects into groups (clusters)
- Hierarchical: organizing clusters in a hierarchy
Evaluating clustering
- Internal criteria
  - The intra-cluster similarity is high (tightness)
  - The inter-cluster similarity is low (separateness)
- External criteria
  - Did we discover the hidden classes? (We need gold-standard data for this evaluation.)

Clustering for Web IR
Representation for clustering
- Document representation
- Need a notion of similarity/distance
How many clusters?
- Fixed a priori?
- Completely data driven?
- Avoid "trivial" clusters: too large or too small

Recall: Documents as vectors
- Each doc j is a vector of tf.idf values, one component for each term.
- Can normalize to unit length.
\[ \hat{d}_j = \frac{\vec{d}_j}{|\vec{d}_j|} = \frac{\vec{d}_j}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}} \quad \text{where } w_{i,j} = tf_{i,j} \cdot idf_i \]

Vector space
- Terms are axes (aka features)
- N docs live in this space
- Even with stemming, may have 20,000+ dimensions
What makes documents related?

Intuition for relatedness
[Figure: documents D1, D2, D3, D4 and points x, y plotted against term axes t1 and t2]
- Documents that are "close together" in vector space talk about the same things.

What makes documents related?
- Ideal: semantic similarity.
- Practical: statistical similarity
  - We will use cosine similarity.
  - We will describe algorithms in terms of cosine similarity.
- Cosine similarity of normalized d_j, d_k:
  \[ sim(\vec{d}_j, \vec{d}_k) = \sum_{i=1}^{n} w_{i,j} \, w_{i,k} \]
- This is known as the "normalized inner product".

Clustering Algorithms
Hierarchical algorithms
- Bottom-up, agglomerative clustering
Partitioning ("flat") algorithms
- Usually start with a random (partial) partitioning
- Refine it iteratively
The famous k-means partitioning algorithm:
- Given: a set of n documents and the number k
- Compute: a partition into k clusters that optimizes the chosen partitioning criterion

K-means
- Assumes documents are real-valued vectors.
- Clusters are based on centroids of the points in a cluster c (the centroid is the center of gravity, or mean):
  \[ \vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x} \]
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
See animation.

K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or other stopping criterion:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)

K-means: Different Issues
When to stop?
- When a fixed number of iterations is reached
- When centroid positions do not change
Seed choice
- Results can vary based on random seed selection.
- Try out multiple starting points
Example showing sensitivity to seeds
- If you start with centroids B and E, you converge to: A B C
- If you start with centroids D and F, you converge to: D E F

Hierarchical clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
[Figure: dendrogram with root "animal", branches "vertebrate" (fish, reptile, amphib., mammal) and "invertebrate" (worm, insect, crustacean)]

Hierarchical Agglomerative Clustering
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
Start with all instances in their own cluster.
Until there is only one cluster:
    Among the current clusters, determine the two clusters, ci and cj, that are most similar.
    Replace ci and cj with a single cluster ci ∪ cj
Watch animation of HAC.

What is the most similar cluster?
- Single-link: similarity of the most cosine-similar pair of points
- Complete-link: similarity of the "furthest" points, the least cosine-similar pair
- Group-average agglomerative clustering: average cosine between pairs of elements
- Centroid clustering: similarity of the clusters' centroids

Single link clustering
1) Use maximum similarity of pairs:
   \[ sim(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} sim(x, y) \]
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
   \[ sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k), sim(c_j, c_k)) \]

Complete link clustering
1) Use minimum similarity of pairs:
   \[ sim(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} sim(x, y) \]
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
   \[ sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k), sim(c_j, c_k)) \]

Major issue - labeling
- After a clustering algorithm finds clusters, how can they be useful to the end user?
- Need a concise label for each cluster
  - In search results, say "Animal" or "Car" in the jaguar example.
  - In topic trees (Yahoo), need navigational cues.
    - Often done by hand, a posteriori.

How to Label Clusters
Show titles of typical documents
- Titles are easy to scan
- Authors create them for quick scanning!
- But you can only show a few titles, which may not fully represent the cluster
Show words/phrases prominent in cluster
- More likely to fully represent the cluster
- Use distinguishing words/phrases
- But harder to scan

Further issues
Complexity:
- Clustering is computationally expensive. Implementations need careful balancing of needs.
How to decide how many clusters are best?
Evaluating the "goodness" of clustering
- There are many techniques; some focus on implementation issues (complexity/time), some on the quality of
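
Worked sketch: k-means with restarts
The K-Means Algorithm slide and the seed-choice issue can be tried out with a small Python sketch. This is one possible reading of the pseudocode, not the course's reference implementation. Assumptions made here: points are plain tuples of floats, the distance d is squared Euclidean distance (for unit-length vectors this ranks neighbors the same way as cosine similarity), and the function names (kmeans, within_cluster_ss, best_of_restarts) are hypothetical. The within-cluster sum of squares stands in for the "tightness" internal criterion.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean (center of gravity) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iters=100, seed=0):
    """Return (assignment, seeds): a cluster index per point and the k centroids."""
    rng = random.Random(seed)
    seeds = rng.sample(points, k)  # select k random instances as seeds
    assignment = None
    for _ in range(max_iters):  # stopping criterion 1: fixed iteration cap
        # Assign each instance to the cluster whose seed is closest.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, seeds[j]))
            for p in points
        ]
        if new_assignment == assignment:  # stopping criterion 2: no change
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous seed
                seeds[j] = centroid(members)
    return assignment, seeds


def within_cluster_ss(points, assignment, seeds):
    """Internal criterion: total squared distance of points to their centroid."""
    return sum(squared_distance(p, seeds[a]) for p, a in zip(points, assignment))


def best_of_restarts(points, k, restarts=5):
    """Try several random seedings and keep the tightest clustering."""
    runs = [kmeans(points, k, seed=s) for s in range(restarts)]
    return min(runs, key=lambda run: within_cluster_ss(points, *run))


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, centers = best_of_restarts(data, k=2)
    print(labels, centers, within_cluster_ss(data, labels, centers))
```

Keeping the run with the lowest within-cluster sum of squares is only one way to choose among restarts. It says nothing about whether k itself is a good choice, since this value generally shrinks as k grows.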
