Text Clustering
http://net.pku.edu.cn/~wbia
黄连恩 (hle@net.pku.edu.cn)
北京大学信息工程学院 (School of Information Engineering, Peking University)
11/19/2013

What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects.
- The most common form of unsupervised learning.
- Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given.
- A common and important task with many applications in IR and elsewhere.

Clustering: Internal Criterion
- How many clusters?
- High intra-cluster similarity
- Low inter-cluster similarity

Issues for clustering
- Representation for clustering
  - Document representation: vector space or language model?
  - Similarity/distance: cosine similarity or KL distance?
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid "trivial" clusters that are too large or too small.

Clustering Algorithms
- Hard clustering algorithms compute a hard assignment: each document is a member of exactly one cluster.
- Soft clustering algorithms compute a soft assignment: a document's assignment is a distribution over all clusters.

Clustering Algorithms
- Flat algorithms
  - Create a cluster set without explicit structure
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
  - K-means clustering
  - Model-based clustering
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive

Applications in IR

Navigating document collections
- Information retrieval is like a book index:
  Index: Aardvark, 15; Blueberry, 200; Capricorn, 1, 45-55; Dog, 79-99; Egypt, 65; Falafel, 78-90; Giraffes, 45-59; …
- Document clusters are like a table of contents:
  Table of Contents
  1. Science of Cognition
    1.a. Motivations
      1.a.i. Intellectual Curiosity
      1.a.ii. Practical Applications
    1.b. History of Cognitive Psychology
  2. The Neural Basis of Cognition
    2.a. The Nervous System
    2.b. Organization of the Brain
    2.c. The Visual System
  3. Perception and Attention
    3.a. Sensory Memory
    3.b. Attention and Sensory Information Processing

Scatter/Gather: Cutting, Karger, and Pedersen

For better navigation of search results

Navigating search results (2)
- Cluster documents by the sense of a word.
- Group the search results (say for "Jaguar" or "NLP") into sets of related documents.
- This can be viewed as a form of word sense disambiguation.

For speeding up vector space retrieval
- Retrieval in the VSM needs to find the doc vectors closest to the query vector.
- Computing the similarity between the query and every doc in the collection is slow (for some applications).
- An optimization: use the inverted index and score only the docs that contain at least one query term.

By clustering docs in corpus a priori
- Compute similarities on a subset only: the cluster that the query falls into.

K Means Algorithm

Partitioning Algorithms
- Given: a set of documents D and the number K.
- Find: a partition into K clusters that optimizes the partitioning criterion.
  - Globally optimal: exhaustively enumerate all partitions.
  - Effective heuristic methods: K-means algorithms.
- Partitioning criterion: residual sum of squares (RSS).

K-Means
- Assume documents are real-valued vectors.
- Clusters are based on the centroid (aka the center of gravity or mean) of the points in a cluster ω: $\mu(\omega) = \frac{1}{|\omega|} \sum_{x \in \omega} x$
- Instances are assigned to clusters by their distance to the cluster centroids: each instance goes to the nearest centroid.

K Means Example (K=2)
[Figure: pick seeds, reassign clusters, compute centroids, reassign clusters, compute centroids, reassign clusters, converged.]

K-Means Algorithm: Convergence
- Why does the K-means algorithm converge?
- Convergence: a state in which the clusters don't change.
- Reassignment: RSS decreases monotonically (it never increases), because each vector is assigned to its nearest centroid.
- Recomputation: each $RSS_k$ decreases monotonically ($m_k$ is the number of members in cluster k). Which value of $a = \mu(\omega_k)$ minimizes $RSS_k$?
  $RSS_k(a) = \sum_{x \in \omega_k} |x - a|^2$
  $\sum_{x \in \omega_k} -2(x - a) = 0$
  $\sum_{x \in \omega_k} x = \sum_{x \in \omega_k} a = m_k a$
  $a = \frac{1}{m_k} \sum_{x \in \omega_k} x$

Convergence = Global Minimum?
- Unfortunately, there is no guarantee that a global minimum of the objective function will be reached.
[Figure: example with an outlier.]

Seed Choice
- In the example figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}.
- If you start with D and F you converge to {A,B,D,E} and {C,F}.
- The choice of seeds affects the result.
- Some seeds lead to slow convergence, or to convergence to sub-optimal clusterings.
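The seed-choice effect described above can be reproduced with a small K-means sketch. The slide's figure is not available in the text, so the coordinates of A to F below are an assumption (two rows of three points, chosen so that the two seedings behave as the slide states). The function names `kmeans` and `sq_dist` are illustrative, not from any library.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: List[Point], seeds: List[Point], max_iter: int = 100):
    """Plain K-means: alternate reassignment and centroid recomputation."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each point goes to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:  # clusters did not change: converged
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, rss


# Assumed layout (not taken from the slide): A, B, C on top, D, E, F below.
names = ["A", "B", "C", "D", "E", "F"]
points = [(0.0, 1.0), (1.0, 1.0), (4.0, 1.0), (0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]

for seed_ids in [(1, 4), (3, 5)]:  # seeds (B, E), then seeds (D, F)
    assignment, _, rss = kmeans(points, [points[i] for i in seed_ids])
    clusters = [[n for n, a in zip(names, assignment) if a == k] for k in range(2)]
    print([names[i] for i in seed_ids], clusters, round(rss, 2))
```

With this layout, seeds B and E converge to {A,B,C} and {D,E,F} with an RSS of about 17.3, while seeds D and F converge to {A,B,D,E} and {C,F} with an RSS of 2.5, so the first run is stuck in a local minimum.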
Remedies for poor seeds:
- Select seeds with a heuristic (e.g., the doc least similar to any existing mean).
- Try multiple starting points.
- Initialize with the result of another clustering method (e.g., by sampling).

How Many Clusters?
- How do we choose a suitable K?
- Strike a balance between producing more clusters (each one internally more coherent) and producing too many clusters (e.g., a high browsing cost).
- For example:
  - Define the Benefit of a doc as the cosine similarity to the centroid of its cluster. The Total Benefit is the sum of the benefits of all docs.
  - Define a Cost per cluster; the Total Cost is the sum of the cluster costs.
  - Define the Value of a clustering = Total Benefit - Total Cost.
  - Among all possible K, pick the one with the largest Value.

Is K-Means Efficient?

Time Complexity
- Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
- Reassigning clusters: O(KN) distance computations, or O(KNM).
- Computing centroids: each doc gets added once to some centroid: O(NM).
- Assume these two steps are each done once per iteration, for I iterations: O(IKNM).
- M is …
- A document is a sparse vector, but a centroid is not.
- K-medoids algorithms: use the element closest to the center as "the medoid".

Efficiency: Medoid As Cluster Representative
- Medoid: use a single document to represent the cluster, e.g., the document closest to the centroid.
- One reason this is useful:
  - Consider the representative of a large cluster (>1000 documents).
  - The centroid of this cluster will be a dense vector.
  - The medoid of this cluster will be a sparse vector.
- Centroid vs. medoid is analogous to mean vs. median.

#1. K-Means
Exercise 16-11
(i) Give an example consisting of a set of points and 3 initial centroids (the initial centroids need not be members of the point set) for which 3-means clustering converges to a clustering that contains an empty cluster.
(ii) Can such a clustering with an empty cluster be globally optimal with respect to RSS?

Hierarchical Clustering Algorithm

Hierarchical Agglomerative Clustering (HAC)
- Assume a similarity function that determines the similarity of two instances.
- Greedy algorithm:
  - Start with each instance as a separate cluster.
  - Pick the two most similar clusters and merge them into a new cluster.
  - Repeat until only one cluster is left.
- The merge history forms a binary tree or hierarchy: the dendrogram.

Dendrogram: Document Example
- As clusters agglomerate, docs are likely to fall into a hierarchy of "topics" or concepts.
[Figure: dendrogram over d1 to d5 with the merged clusters {d1,d2}, {d4,d5}, and {d3,d4,d5}.]

HAC Algorithm

Hierarchical Clustering algorithms
- Agglomerative (bottom-up): start with each document being a single cluster. Eventually all documents belong to the same cluster.
- Divisive (top-down): start with all documents belonging to the same cluster.
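  Eventually each node forms a cluster on its own.
- The number of clusters k does not need to be fixed in advance.

Key notion: cluster representative
- How do we compute which two clusters are closest?
- To make this computation efficient, how should each cluster be represented (cluster representation)?
- The representative can be some "typical" or central point of the cluster.

"Closest pair" of clusters
- Single-link: nearest points (similarity of the most cosine-similar pair).
- Complete-link: farthest points (similarity of the "furthest" pair).
- Centroid ("center of gravity"): clusters whose centroids (centers of gravity) are the most cosine-similar.
- Average-link: average cosine similarity over all pairs of elements.

Single Link Example
- Can lead to chaining.
  $\mathrm{sim}(c_i, c_j) = \max_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
  $\mathrm{sim}(c_i \cup c_j, c_k) = \max(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k))$

Complete Link Example
- Affected by outliers.
  $\mathrm{sim}(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} \mathrm{sim}(x, y)$
  $\mathrm{sim}(c_i \cup c_j, c_k) = \min(\mathrm{sim}(c_i, c_k), \mathrm{sim}(c_j, c_k))$

Outlier in Complete-Link
[Figure: outlier in complete-link clustering, with the x-coordinates of the points.]

Computational Complexity
- In the first iteration, HAC computes the similarity of all pairs: $O(n^2)$.
- In each of the subsequent n-1 merging iterations, the largest of the n × n similarity values has to be selected. A naive implementation is $O(n^3)$; an optimized one reaches $O(n^2 \log n)$.
- In each of the subsequent n-1 merging iterations, the similarity between the newly created cluster and every other existing cluster has to be computed. All other similarities are unchanged.
- Computing the similarity to each other cluster must take constant time; otherwise the total is $O(n^3)$.

Group Average Agglomerative Clustering
- Average similarity of all pairs in the merged cluster:
  $\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in c_i \cup c_j} \; \sum_{y \in c_i \cup c_j,\, y \neq x} \mathrm{sim}(x, y)$
- Can this be computed in constant time?
  - The vectors are all normalized to unit length.
  - Keep the sum of vectors for each cluster: $s(c_j) = \sum_{x \in c_j} x$
  $\mathrm{sim}(c_i, c_j) = \frac{(s(c_i) + s(c_j)) \cdot (s(c_i) + s(c_j)) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$

Centroid Agglomerative Clustering
- Centroid similarity is equivalent to the average similarity of all pairs of points that come from different clusters.

Centroid Agglomerative Clustering
- Example: n=6, k=3, closest pair of centroids.
- Keep the centroid of each cluster.
[Figure: points d1 to d6, with the centroid after the first step and the centroid after the second step.]

Centroid Clustering is not Monotonic
- d1 at (1+ε, 1), d2 at (5, 1), and d3 at (3, 1 + 2√3).

Comparison of HAC Algorithms

Evaluation

Think about it…
Evaluation by:
- High internal criterion scores?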
  Objective function for high intra-cluster similarity and low inter-cluster similarity.
- Application
- User judgment
- Internal judgment

Example
[Figure: Cluster I, Cluster II, Cluster III.]

External criteria for clustering quality
- What is the test set? What is the ground truth?
- Assume documents with C gold standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, with $n_i$ members each.
- A simple measure: purity, defined as the ratio between the number of documents of the dominant class in cluster $\omega_k$ and the size of $\omega_k$:
  $\mathrm{purity}(\omega_k) = \frac{1}{n_k} \max_j |\omega_k \cap c_j|$
  $\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|$
- Ω = {ω1, ω2, . . . , ωK} is the set of clusters and C = {c1, c2, . . . , cJ} is the set of classes.

Purity example
- Cluster I: Purity = 1/6 * max(5, 1, 0) = 5/6
- Cluster II: Purity = 1/6 * max(1, 4, 1) = 4/6
- Cluster III: Purity = 1/5 * max(2, 0, 3) = 3/5
- Total: Purity = 1/17 * (5 + 4 + 3) = 12/17

NMI: Normalized Mutual Information
  $\mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}$
- I(Ω;C) measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are.

Rand Index
- View clustering as a series of decisions, one for each of the N(N − 1)/2 pairs of documents in the collection.
- A true positive (TP) decision assigns two similar documents to the same cluster.
- A true negative (TN) decision assigns two dissimilar documents to different clusters.
- A false positive (FP) decision assigns two dissimilar documents to the same cluster.
- A false negative (FN) decision assigns two similar documents to different clusters.

Rand Index
  Number of pairs                      Same cluster in clustering   Different clusters in clustering
  Same class in ground truth           TP                           FN
  Different classes in ground truth    FP                           TN

  $RI = \frac{TP + TN}{TP + FP + FN + TN}$

Rand index Example
[Figure: Cluster I, Cluster II, Cluster III.]

Summary of this lecture
- Text Clustering
- Evaluation
  - Purity, NMI, Rand Index
- Partition Algorithm
  - K-Means: Reassignment, Recomputation
- Hierarchical Algorithm
  - Cluster representation
  - Closeness measure for a pair of clusters: single link, complete link, average link, centroid

Resources
- Weka 3 - Data Mining with Open Source Machine Learning Software in Java

Thank You!
Q&A

Readings
[1] IIR Ch16.1-4, Ch17.1-4.
[2] F. Beil, M. Ester, and X. Xu, "Frequent term-based text clustering," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada: ACM, 2002.

Cluster Labeling

Major issue - labeling
- After a clustering algorithm finds clusters, how can they be made useful to the end user?
- Need a pithy label for each cluster.
  - In search results, say "Animal" or "Car" in the jaguar example.
  - In topic trees (Yahoo), need navigational cues.
  - Often done by hand, a posteriori.

How to Label Clusters
- Show titles of typical documents
  - Titles are easy to scan.
  - Authors create them for quick scanning!
  - But you can only show a few titles, which may not fully represent the cluster.
- Show words/phrases prominent in the cluster
  - More likely to fully represent the cluster.
  - Use distinguishing words/phrases: differential labeling (think about feature selection).
  - But harder to scan.

Labeling
- Common heuristic: list the 5-10 most frequent terms in the centroid vector.
  - Drop stop-words; stem.
- Differential labeling by frequent terms
  - Within a collection "Computers", all clusters have the word "computer" as a frequent term.
  - Discriminant analysis of centroids.
- Perhaps better: a distinctive noun phrase.
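
The two labeling heuristics above can be compared on a toy collection. This is a minimal sketch under stated assumptions: documents are already tokenized, stop-words are dropped and terms are stemmed beforehand, and "differential" is approximated by subtracting the collection-wide centroid weight from the cluster centroid weight (a simple stand-in, not the discriminant analysis the slide mentions). The function names and the sample documents are illustrative.

```python
from collections import Counter
from typing import Dict, List

Doc = List[str]  # a document as a list of (already stemmed) tokens


def centroid(docs: List[Doc]) -> Dict[str, float]:
    """Mean term-frequency vector of a set of documents."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return {term: count / len(docs) for term, count in total.items()}


def top_terms(scores: Dict[str, float], k: int) -> List[str]:
    """The k highest-scoring terms; ties are broken alphabetically."""
    return [t for t, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


def frequent_labels(cluster: List[Doc], k: int = 5) -> List[str]:
    """Label by the most frequent terms in the cluster centroid."""
    return top_terms(centroid(cluster), k)


def differential_labels(cluster: List[Doc], collection: List[Doc], k: int = 5) -> List[str]:
    """Label by terms that are more frequent in the cluster than in the collection."""
    cluster_centroid = centroid(cluster)
    collection_centroid = centroid(collection)
    scores = {
        term: weight - collection_centroid.get(term, 0.0)
        for term, weight in cluster_centroid.items()
    }
    return top_terms(scores, k)


# Toy "Computers" collection: every cluster contains the term "computer".
graphics = [["computer", "graphics", "render"], ["computer", "graphics", "shader"]]
networking = [["computer", "network", "router"], ["computer", "network", "packet"]]

print(frequent_labels(graphics, k=2))
print(differential_labels(graphics, graphics + networking, k=2))
```

On this toy data the frequent-term labels include "computer", which is shared by every cluster, while the differential labels drop it in favor of terms specific to the graphics cluster.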