Design and Evaluation of Clustering Approaches for Large Document Collections: The "BIC-Means" Method
Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering
13/4/2015

Motivation
- Large document collections arise in many applications: digital libraries, the Web.
- There is growing interest in methods for more effective management of information: abstraction, browsing, classification, retrieval.
- Clustering is a means of achieving better organization of information: the data space is partitioned into groups of entities with similar content.

Outline
- Background: state-of-the-art clustering approaches
  - Partitional and hierarchical methods
  - K-Means and its variants: Incremental K-Means, Bisecting Incremental K-Means
- Proposed method: BIC-Means, Bisecting Incremental K-Means using BIC as a stopping criterion
- Evaluation of clustering methods
- Application in information retrieval

Hierarchical Clustering (1/3)
- Produces a nested sequence of clusters. Two approaches:
  A. Agglomerative: starting from singleton clusters, recursively merges the two most similar clusters until only one cluster remains.
  B. Divisive (e.g., Bisecting K-Means): starting with all documents in one root cluster, iteratively splits each cluster into K clusters.

Hierarchical Clustering – Example (2/3)
[Figure: dendrogram over seven points, showing how clusters 1-7 are nested across the levels of the hierarchy.]

Hierarchical Clustering (3/3)
- Organization and browsing of large document collections call for hierarchical clustering, but agglomerative clustering has quadratic time complexity, which is prohibitive for large data sets.

Partitional Clustering
- We focus on partitional clustering: K-Means, Incremental K-Means, Bisecting K-Means.
- At least as good as hierarchical clustering, with low complexity, O(KN).
- Faster than hierarchical methods for large document collections.

K-Means
1. Randomly select K centroids.
2. Repeat ITER times or until the centroids do not change:
   a) Assign each instance to the cluster whose centroid is closest to it.
   b) Re-compute the cluster centroids.
- Generates a flat partition of K clusters (K must be known in advance).
- A centroid is the mean of a group of instances.

K-Means Example
[Figure: points partitioned around three centroids C; the crosses (x) mark the centroid positions as they are re-computed.]

K-Means demo (1/7-7/7)
[Seven screenshots of an animated K-Means demo: http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html]

Comments
- No proof of convergence.
- Converges to a local minimum of the distortion measure (the average squared distance of the points from their nearest centroids):
  \sum_{c} \sum_{d \in D_c} (d - \mu_c)^2
- Too slow for practical databases.
- K-Means is fully deterministic once the initial centroids are selected; a bad choice of initial centroids leads to poor clusters.
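By way of illustration, here is a minimal Python sketch of the flat K-Means loop above (the function name, NumPy usage, and empty-cluster handling are ours, not part of the thesis):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Flat K-Means over an (N, M) data matrix X; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random initial centroids
    for _ in range(iters):  # step 2: at most ITER passes
        # step 2a: assign each point to the cluster with the closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # step 2b: re-compute each centroid as the mean of its assigned points
        new = np.array([X[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):  # stop early once centroids no longer change
            break
        centroids = new
    return assign, centroids
```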
Incremental K-Means (IK)
- In K-Means, new centroids are computed after each iteration, i.e., after all documents have been examined.
- In Incremental K-Means, each cluster centroid is updated as soon as a document is assigned to the cluster:
  C' = \frac{S \cdot C + d}{S + 1}
  where C is the current centroid, S the number of documents already in the cluster, and d the newly assigned document.

Comments
- Not as sensitive as K-Means to the selection of the initial centroids.
- Faster convergence; much faster in general.

Bisecting IK-Means (1/4)
- A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection.
- The documents are initially partitioned into two clusters.
- The algorithm iteratively selects and bisects each of the leaf clusters until singleton clusters are reached.

Bisecting IK-Means (2/4)
Input: (d1, d2, …, dN). Output: a hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply IK-Means to split C into K clusters (K = 2); C1, C2, …, CK become leaf clusters.
3. Iteratively split each cluster Ci until K clusters or singleton clusters are produced at the leaves.

Bisecting IK-Means (3/4)
- The algorithm is exhaustive, terminating at singleton clusters (unless K is known in advance).
- Terminating at singleton clusters is time consuming, and singleton clusters are meaningless; intermediate clusters are more likely to correspond to real classes.
- There is no criterion for stopping the bisections before singleton clusters are reached.

Bayesian Information Criterion (BIC) (1/3)
- To prevent over-splitting, we define a strategy that stops the bisecting algorithm once meaningful clusters are reached.
- We use the Bayesian Information Criterion (BIC), also known as the Schwarz Criterion [Schwarz 1978].
- X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K within a given range of values.

Bayesian Information Criterion (BIC) (2/3)
- In this work, we suggest using BIC as the splitting criterion of a cluster, i.e., to decide whether a cluster should be split or not.
- It measures the improvement in cluster structure between a cluster and its two children.
- We compute the BIC score of a cluster and of its two children clusters.

Bayesian Information Criterion (BIC) (3/3)
- If the BIC score of the produced children clusters is less than the BIC score of their parent cluster, we reject the split and keep the parent cluster as it is.
- Otherwise, we accept the split and the algorithm proceeds in the same way at the lower levels.

Example
[Figure: cluster C bisected into children C1 and C2.]
- Parent cluster: BIC(K=1) = 1980. Two resulting clusters: BIC(K=2) = 2245.
- The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.

Computing BIC
The BIC score of a data collection is defined as (Kass and Wasserman, 1995):
  BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log R
where \hat{l}_j(D) is the log-likelihood of the data set D, p_j = M \cdot K + 1 is a function of the number of independent parameters, and R is the number of points.

Log-likelihood
- Given a cluster of points modeled by a Gaussian distribution N(μ, σ²), the log-likelihood is the log-probability that a neighborhood of data points follows this distribution.
- The log-likelihood of the data can be considered a measure of the cohesiveness of a cluster: it estimates how close the points of the cluster are to its centroid.
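To make the split test concrete, here is a minimal Python sketch of the BIC score under the spherical-Gaussian model (it follows the closed-form log-likelihood summarized on the next slides; the function names, array layout, and the assumption of non-empty clusters are ours):

```python
import numpy as np

def bic(X, centroids, assign):
    """BIC = log-likelihood - (p/2) * log R, using the X-Means spherical-Gaussian
    log-likelihood; X is (R, M), centroids is (K, M), assign maps points to clusters."""
    R, M = X.shape
    K = len(centroids)
    # pooled MLE variance: sigma^2 = 1/(R-K) * sum_i ||x_i - mu_(i)||^2
    sigma2 = ((X - centroids[assign]) ** 2).sum() / (R - K)
    loglik = 0.0
    for n in range(K):  # per-cluster term l(D_n), assuming every cluster is non-empty
        Rn = float((assign == n).sum())
        loglik += (-Rn / 2 * np.log(2 * np.pi) - Rn * M / 2 * np.log(sigma2)
                   - (Rn - K) / 2 + Rn * np.log(Rn) - Rn * np.log(R))
    p = M * K + 1  # K*M centroid coordinates plus the shared variance
    return loglik - p / 2 * np.log(R)

def accept_split(X, child_centroids, child_assign):
    """Accept a bisection only if the two children score a higher BIC than their parent."""
    parent = X.mean(axis=0, keepdims=True)
    return bic(X, child_centroids, child_assign) > bic(X, parent, np.zeros(len(X), dtype=int))
```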
Parameters p_j
- Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions.
- We therefore penalize the log-likelihood by a function of the number of independent parameters: the term (p_j / 2) \log R.

Notation
- μ_j: coordinates of the j-th centroid
- μ_(i): centroid nearest to the i-th data point
- D: input set of data points
- D_j: set of data points that have μ_j as their closest centroid
- R = |D| and R_j = |D_j|
- M: the number of dimensions
- M_j: family of alternative models (different models correspond to different clustering solutions)
- BIC scores the alternative models and chooses the best among them.

Computing BIC (1/3)
- To compute the log-likelihood of the data we need the parameters of the Gaussian fitted to the data.
- Maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption):
  \hat{\sigma}^2 = \frac{1}{R - K} \sum_i \| x_i - \mu_{(i)} \|^2

Computing BIC (2/3)
- Probability of a point x_i: a Gaussian with the estimated \hat{\sigma} and, as mean, the cluster centroid nearest to x_i:
  \hat{P}(x_i) = \frac{R_{(i)}}{R} \cdot \frac{1}{(\sqrt{2\pi}\,\hat{\sigma})^{M}} \exp\!\left( -\frac{\| x_i - \mu_{(i)} \|^2}{2 \hat{\sigma}^2} \right)
- Log-likelihood of the data:
  l(D) = \log \prod_i \hat{P}(x_i) = \sum_i \left( \log \frac{R_{(i)}}{R} - \frac{M}{2} \log(2\pi \hat{\sigma}^2) - \frac{\| x_i - \mu_{(i)} \|^2}{2 \hat{\sigma}^2} \right)

Computing BIC (3/3)
- Focusing on the set D_n of points that belong to centroid n:
  l(D_n) = -\frac{R_n}{2} \log(2\pi) - \frac{R_n M}{2} \log(\hat{\sigma}^2) - \frac{R_n - K}{2} + R_n \log R_n - R_n \log R

Proposed Method: BIC-Means (1/2)
- BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion.
- BIC-Means performs a splitting test at each leaf cluster to prevent it from over-splitting.
- BIC-Means does not terminate at singleton clusters; it terminates when there are no separable clusters left according to BIC.

Proposed Method: BIC-Means (2/2)
- Combines the strengths of partitional and hierarchical clustering methods:
  - Hierarchical clustering output
  - Low complexity, O(N·K)
  - Good clustering quality
  - Meaningful clusters at the leaves

BIC-Means Algorithm
Input: S = (d1, d2, …, dn), the data in one cluster. Output: a hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply Incremental K-Means to split C into C1, C2.
3. Compute BIC for C and for C1, C2:
   I. If BIC(C) < BIC(C1, C2), put C1 and C2 in the queue.
   II. Otherwise do not split C.
4. Repeat steps 2 and 3 until there are no separable leaf clusters left in the queue according to BIC.

Evaluation
- Evaluation of document clustering algorithms on two data sets: OHSUMED (233,445 Medline documents) and Reuters-21578 (21,578 documents).
- Application of clustering to information retrieval: evaluation of several cluster-based retrieval strategies, compared with retrieval by exhaustive search on OHSUMED.

F-Measure
- Measures how well the clusters approximate the data classes (a minimal computation is sketched after the figure below).
- The F-measure for cluster C and class T is defined as
  F(T, C) = \frac{2PR}{P + R}, \quad P = N / |C|, \quad R = N / |T|
  where N is the number of members of class T in cluster C.
- The F-measure of a class T is the maximum value it achieves over all clusters C: F_T = \max_C F(T, C).
- The F-measure of the clustering solution is the mean of F_T over all classes, weighted by class size:
  F = \frac{\sum_T |T| \, F_T}{\sum_T |T|}

Comparison of Clustering Algorithms
[Figure: average F-measure (10 trials) of K-Means, Incremental K-Means, and Bisecting Incremental K-Means on the OHSUMED1 and Reuters1 data sets.]
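Here is a minimal Python sketch of this evaluation measure (the function name and label encoding are ours; it assumes one class label and one cluster id per document):

```python
from collections import Counter

def f_measure(labels, clusters):
    """Clustering F-measure: best F over all clusters for each class,
    averaged over classes weighted by class size."""
    class_sizes = Counter(labels)             # |T| for each class T
    cluster_sizes = Counter(clusters)         # |C| for each cluster C
    overlap = Counter(zip(labels, clusters))  # N = members of class T in cluster C
    total, score = len(labels), 0.0
    for t, t_size in class_sizes.items():
        best = 0.0
        for c, c_size in cluster_sizes.items():
            n = overlap[(t, c)]
            if n:
                p, r = n / c_size, n / t_size          # precision, recall
                best = max(best, 2 * p * r / (p + r))  # F_T = max over clusters
        score += (t_size / total) * best               # weighted mean over classes
    return score
```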
Evaluation of Incremental K-Means
[Figure: average F-measure (10 trials) on Reuters1 as a function of the number of iterations (1-4) of centroid adjustment in Incremental K-Means.]

MeSH Representation of Documents
- We use MeSH terms for describing the medical documents (OHSUMED).
- Each document is represented by a vector of MeSH terms (multi-word terms instead of single-word terms).
- This leads to a more compact representation: each vector contains fewer terms, about 20.
- A sequential approach is used to extract the MeSH terms from the OHSUMED documents.

Bisecting Incremental K-Means – Clustering Quality
[Figure: average F-measure (10 trials) of Bisecting Incremental K-Means on OHSUMED2, MeSH-term vs. single-word-term representation.]

Speed of Clustering
[Figure: average clustering time of Bisecting Incremental K-Means on OHSUMED2: 97.6 min with the single-word-term representation vs. 14 min with the MeSH-term representation.]

Evaluation of BIC-Means
[Figure: average F-measure (10 trials) of BIC-Means vs. Bisecting Incremental K-Means on OHSUMED2, Reuters1, and Reuters2.]

Speed of Clustering
[Figure: average clustering time of BIC-Means vs. Bisecting Incremental K-Means on OHSUMED2, Reuters1, and Reuters2.]

Comments
- BIC-Means is much faster than Bisecting Incremental K-Means: it is not an exhaustive algorithm.
- It achieves approximately the same F-measure as the exhaustive bisecting approach.
- It is therefore better suited to clustering large document collections.

Application of Clustering to Information Retrieval
- We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED).
- BIC-Means is applied to the entire OHSUMED collection; each document is represented by MeSH terms.
- We chose 61 queries from the original OHSUMED query set developed by Hersh et al., with relevance judgments over the OHSUMED documents.

Query – Document Similarity
Similarity is defined as the cosine of the angle θ between the document and query vectors:
  Sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1| |d_2|} = \frac{\sum_{i=1}^{M} w_{i,d_1} w_{i,d_2}}{\sqrt{\sum_{i=1}^{M} w_{i,d_1}^2} \sqrt{\sum_{i=1}^{M} w_{i,d_2}^2}}

Information Retrieval Methods
- Method 1: search the M clusters closest to the query, computing the similarity between each cluster centroid and the query.
- Method 2: search the M clusters closest to the query, where each cluster is represented by the 20 most frequent terms of its centroid.
- Method 3: search the M clusters whose centroid contains the terms of the query (a sketch follows).
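As an illustration, here is a minimal Python sketch of Method 3 combined with cosine ranking (the data structures — a cluster id mapped to its centroid's term set and document ids, plus a dict of document vectors — are hypothetical, not the thesis implementation):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_terms, query_vec, clusters, doc_vecs):
    """Method 3: keep only the clusters whose centroid contains every query term,
    then rank the documents of those clusters by cosine similarity to the query.
    clusters: {cluster_id: (centroid_terms: set, doc_ids: list)}."""
    candidates = [doc for terms, docs in clusters.values()
                  if set(query_terms) <= terms   # centroid must contain all query terms
                  for doc in docs]
    return sorted(candidates, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
```

Only the documents inside the surviving clusters are scored, which is what cuts the search down to a fraction of the collection in the experiments that follow.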
Method 1: Search the M clusters closest to the query (similarity between cluster centroid and query).
[Figure: precision-recall curves for the top 1, 3, 10, 30, 50, 100, and 150 clusters vs. exhaustive search.]

Method 2: Search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.
[Figure: precision-recall curves for the top 10, 50, 100, and 150 clusters under the 20-term representation vs. exhaustive search.]

Method 3: Search the M clusters containing the terms of the query.
[Figure: precision-recall curves for the top 15, 30, 50, and all clusters whose centroid contains all query terms vs. exhaustive search.]

Size of Search
[Figure: average number of documents searched over the 61 queries when retrieving the clusters that contain all the MeSH query terms in their centroid: exhaustive VSM search vs. all / top-50 / top-30 / top-15 matching clusters.]

Comments
- Best cluster-based retrieval strategy (Method 3): retrieve only the clusters which contain all the MeSH query terms in their centroid vector, then search the documents contained in the retrieved clusters and order them by similarity to the query.
- Advantages: it searches only 30% of all OHSUMED documents, as opposed to exhaustively searching all 233,445, and it is almost as effective as retrieval by exhaustive search (searching without clustering).

Conclusions (1/2)
We implemented and evaluated various partitional clustering techniques:
- Incremental K-Means
- Bisecting Incremental K-Means (the exhaustive approach)
- BIC-Means, which incorporates BIC as a stopping criterion to prevent clustering from over-splitting and produces meaningful clusters at the leaves.

Conclusions (2/2)
- BIC-Means is much faster than Bisecting Incremental K-Means, as effective as the exhaustive bisecting approach, and better suited to clustering large document collections.
- Cluster-based retrieval strategies reduce the size of the search; the best proposed retrieval method is as effective as exhaustive search (searching without clustering).

Future Work
- Evaluation using more, or application-specific, data sets.
- Examination of additional cluster-based retrieval strategies (top-down, bottom-up).
- Clustering and browsing on Medline.
- Clustering of dynamic document collections.
- Semantic similarity methods in document clustering.

References
- Nikos Hourdakis, Michalis Argyriou, Euripides G.M. Petrakis, Evangelos Milios, "Hierarchical Clustering in Medical Document Collections: the BIC-Means Method", Journal of Digital Information Management (JDIM), Vol. 8, No. 2, pp. 71-77, April 2010.
- Dan Pelleg, Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proc. of the 17th Intern. Conf. on Machine Learning (ICML), 2000, pp. 727-734.

Thank you! Questions?