lecture_17

Clustering
- Supervised vs. Unsupervised Learning
- Examples of clustering in Web IR
- Characteristics of clustering
- Clustering algorithms
- Cluster Labeling

Supervised vs. Unsupervised Learning
Supervised Learning
- Goal: a program that performs a task as well as humans.
- TASK: well defined (the target function)
- EXPERIENCE: training data provided by a human
- PERFORMANCE: error/accuracy on the task
Unsupervised Learning
- Goal: to find some kind of structure in the data.
- TASK: vaguely defined
- No EXPERIENCE
- No PERFORMANCE (but there are some evaluation metrics)

What is Clustering?
- Clustering is the most common form of unsupervised learning.
- Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects.
- It can be used in IR:
  - To improve recall in search applications
  - For better navigation of search results

Example 1: Improving Recall
Cluster hypothesis: documents with similar text are related.
Thus, when a query matches a document D, also return other documents in the cluster containing D.

Example 2: Better Navigation

Clustering Characteristics
Flat versus hierarchical clustering
- Flat means dividing objects into groups (clusters).
- Hierarchical means organizing clusters in a subsuming hierarchy.
Evaluating clustering
- Internal criteria
  - The intra-cluster similarity is high (tightness)
  - The inter-cluster similarity is low (separateness)
- External criteria
  - Did we discover the hidden classes? (Gold-standard data is needed for this evaluation.)

Clustering for Web IR
Representation for clustering
- Document representation: vector space? Normalization?
- Need a notion of similarity/distance
How many clusters?
- Fixed a priori?
- Completely data driven?
  - Avoid “trivial” clusters that are too large or too small

Recall documents as vectors
Each doc j is a vector of tf-idf values, one component for each term. It can be normalized to unit length:
\[ \frac{\vec{d}_j}{|\vec{d}_j|}, \qquad |\vec{d}_j| = \sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}, \qquad \text{where } w_{i,j} = tf_{i,j} \cdot idf_i \]
So we have a vector space:
- terms are axes (aka features)
- n docs live in this space
- even with stemming, it may have 20,000+ dimensions

What makes documents related?
- Ideal: semantic similarity.
- Practical: statistical similarity
  - We will use cosine similarity.
  - Documents as vectors.
  - We will describe algorithms in terms of cosine similarity.
Cosine similarity of normalized d_j, d_k:
\[ sim(d_j, d_k) = \sum_{i=1}^{n} w_{i,j} \cdot w_{i,k} \]
This is known as the normalized inner product.

Intuition for relatedness
(Figure: documents D1, D2, D3 and D4 plotted in a vector space with term axes t1 and t2.)
Documents that are “close together” in vector space talk about the same things.

Clustering Algorithms
Partitioning “flat” algorithms
- Usually start with a random (partial) partitioning
- Refine it iteratively
  - k-means clustering
  - Model-based clustering (we will not cover it)
Hierarchical algorithms
- Bottom-up, agglomerative
- Top-down, divisive (we will not cover it)

Partitioning “flat” algorithms
Partitioning method: construct a partition of n documents into a set of k clusters.
- Given: a set of documents and the number k
- Find: a partition into k clusters that optimizes the chosen partitioning criterion

K-means
Assumes documents are real-valued vectors.
Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster c:
\[ \vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x} \]
Reassignment of instances to clusters is based on distance to the current cluster centroids.

K-Means Algorithm
Let d be the distance measure between instances.
Select k random instances {s1, s2, …, sk} as seeds.
Until clustering converges or another stopping criterion is met:
    For each instance xi:
        Assign xi to the cluster cj such that d(xi, sj) is minimal.
    (Update the seeds to the centroid of each cluster)
    For each cluster cj:
        sj = μ(cj)

K-means: Different Issues
When to stop?
- When a fixed number of iterations is reached
- When centroid positions do not change
Seed choice
- Results can vary based on random seed selection.
- Try out multiple starting points.
Example showing sensitivity to seeds (figure: six points labeled A, B, C, D, E, F)
- If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}.
- If you start with D and F, you converge to {A,B,D,E} and {C,F}.

Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
animal
    vertebrate
        fish
        reptile
        amphib.
        mammal
    invertebrate
        worm
        insect
        crustacean

Hierarchical Agglomerative Clustering
We assume there is a similarity function that determines the similarity of two instances.
Algorithm:
    Start with all instances in their own cluster.
    Until there is only one cluster:
        Among the current clusters, determine the two clusters, ci and cj, that are most similar.
        Replace ci and cj with a single cluster ci ∪ cj.

What is the most similar cluster?
- Single-link: similarity of the most cosine-similar pair
- Complete-link: similarity of the “furthest” points, the least cosine-similar pair
- Group-average agglomerative clustering: average cosine between pairs of elements
- Centroid clustering: similarity of the clusters’ centroids

Single link clustering
1) Use the maximum similarity of pairs:
\[ sim(c_i, c_j) = \max_{x \in c_i,\; y \in c_j} sim(x, y) \]
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
\[ sim((c_i \cup c_j), c_k) = \max(sim(c_i, c_k),\; sim(c_j, c_k)) \]

Complete link clustering
1) Use the minimum similarity of pairs:
\[ sim(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} sim(x, y) \]
2) After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:
\[ sim((c_i \cup c_j), c_k) = \min(sim(c_i, c_k),\; sim(c_j, c_k)) \]

Major issue - labeling
After a clustering algorithm finds clusters, how can they be useful to the end user?
Need a concise label for each cluster
- In search results, say “Animal” or “Car” in the jaguar example.
- In topic trees (Yahoo), need navigational cues.
  - Often done by hand, a posteriori.

How to Label Clusters
Show titles of typical documents
- Titles are easy to scan
- Authors create them for quick scanning!
- But you can only show a few titles, which may not fully represent the cluster
Show words/phrases prominent in the cluster
- More likely to fully represent the cluster
- Use distinguishing words/phrases
- But harder to scan

Not covered in this lecture
Complexity:
- Clustering is computationally expensive. Implementations need careful balancing of needs.
How to decide how many clusters are best?
Evaluating the “goodness” of clustering
- There are many techniques, some focus on implementation issues (complexity/time), some on the quality of
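
Sketch: K-Means Algorithm and seed sensitivity
The K-Means Algorithm slide and its seed-sensitivity example can be illustrated with a minimal sketch. Several things here are assumptions and not part of the lecture: Euclidean distance is used as the distance measure d, the seeds are passed in explicitly so the two runs are repeatable, and the coordinates of A to F are hypothetical values chosen to reproduce the two outcomes stated on the slide. The function names (kmeans, centroid, cluster_names) are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance, used here as the distance measure d (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(members):
    """mu(c): component-wise mean of the vectors in one cluster."""
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))

def kmeans(points, seeds, max_iters=100, d=euclidean):
    """Return (assignment, centroids); assignment[i] is the cluster index of points[i]."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iters):  # stopping criterion 1: fixed number of iterations
        # Assign each instance to the cluster whose seed/centroid is nearest.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: d(x, centroids[j]))
            for x in points
        ]
        # Stopping criterion 2: unchanged assignments imply unchanged centroids.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster.
        for j in range(len(centroids)):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = centroid(members)
    return assignment, centroids

# Hypothetical coordinates for the six points on the slide (not given in the lecture).
POINTS = {"A": (0, 1), "B": (2, 1), "C": (5, 1),
          "D": (0, 0), "E": (2, 0), "F": (5, 0)}

def cluster_names(points_by_name, seed_names):
    """Run k-means from the named seeds and return the clusters as sorted name lists."""
    names = list(points_by_name)
    vectors = [points_by_name[n] for n in names]
    seeds = [points_by_name[s] for s in seed_names]
    assignment, _ = kmeans(vectors, seeds)
    return [sorted(n for n, a in zip(names, assignment) if a == j)
            for j in range(len(seeds))]

print(cluster_names(POINTS, ["B", "E"]))  # [['A', 'B', 'C'], ['D', 'E', 'F']]
print(cluster_names(POINTS, ["D", "F"]))  # [['A', 'B', 'D', 'E'], ['C', 'F']]
```

With these coordinates both runs converge, but to different partitions, which is why the slide suggests trying multiple starting points.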
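
Sketch: single-link vs. complete-link HAC
The Hierarchical Agglomerative Clustering loop, together with the single-link and complete-link update rules, can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation: similarity is cosine similarity, the linkage is selected by passing Python's built-in max (single-link) or min (complete-link), and the four example vectors (unit vectors at 0, 12, 25 and 44 degrees) are hypothetical. The function names cosine and hac are illustrative.

```python
import math
from itertools import combinations

def cosine(x, y):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def hac(vectors, linkage=max):
    """Agglomerative clustering; linkage=max is single-link, linkage=min is complete-link.

    Returns the list of merged clusters (tuples of instance indices) in merge order.
    """
    # Start with every instance in its own cluster: cluster id -> member indices.
    clusters = {i: (i,) for i in range(len(vectors))}
    # Pairwise cluster similarities, keyed by the unordered pair of cluster ids.
    sim = {frozenset((i, j)): cosine(vectors[i], vectors[j])
           for i, j in combinations(clusters, 2)}
    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        # Determine the two most similar current clusters ci and cj.
        pair = max(sim, key=sim.get)
        i, j = sorted(pair)
        del sim[pair]
        merged = tuple(sorted(clusters.pop(i) + clusters.pop(j)))
        # Update rule from the slides:
        # sim(ci U cj, ck) = linkage(sim(ci, ck), sim(cj, ck)).
        for k in clusters:
            sim[frozenset((next_id, k))] = linkage(
                sim.pop(frozenset((i, k))), sim.pop(frozenset((j, k))))
        clusters[next_id] = merged
        merges.append(merged)
        next_id += 1
    return merges

# Hypothetical example: unit vectors at 0, 12, 25 and 44 degrees.
VECTORS = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
           for a in (0, 12, 25, 44)]

print(hac(VECTORS, linkage=max))  # single-link:   [(0, 1), (0, 1, 2), (0, 1, 2, 3)]
print(hac(VECTORS, linkage=min))  # complete-link: [(0, 1), (2, 3), (0, 1, 2, 3)]
```

In this example single-link chains instance 2 onto the first cluster because of its one close neighbor, while complete-link pairs it with instance 3 because the furthest member of the first cluster is too dissimilar.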
