Design and Evaluation of Clustering Approaches for Large Document Collections: The "BIC-Means" Method
Nikolaos Hourdakis
Technical University of Crete, Department of Electronic and Computer Engineering
13/4/2015

Motivation
- Large document collections arise in many applications: digital libraries, the Web.
- There is growing interest in methods for more effective management of information: abstraction, browsing, classification, retrieval.
- Clustering is a means of achieving better organization of information: the data space is partitioned into groups of entities with similar content.

Outline
- Background: state-of-the-art clustering approaches
  - Partitional and hierarchical methods
  - K-Means and its variants: Incremental K-Means, Bisecting Incremental K-Means
- Proposed method: BIC-Means, Bisecting Incremental K-Means using BIC as a stopping criterion
- Evaluation of clustering methods
- Application in information retrieval

Hierarchical Clustering (1/3)
- Produces a nested sequence of clusters. Two approaches:
  A. Agglomerative: starting from singleton clusters, recursively merges the two most similar clusters until only one cluster remains.
  B. Divisive (e.g., Bisecting K-Means): starting with all documents in one root cluster, iteratively splits each cluster into K clusters.

Hierarchical Clustering – Example (2/3)
[Figure: dendrogram over seven points, showing how clusters 1-7 are nested across the levels of the hierarchy.]

Hierarchical Clustering (3/3)
- Organization and browsing of large document collections call for hierarchical clustering, but agglomerative clustering has quadratic time complexity, which is prohibitive for large data sets.

Partitional Clustering
- We focus on partitional clustering: K-Means, Incremental K-Means, Bisecting K-Means.
- At least as good as hierarchical clustering, with low complexity, O(KN).
- Faster than hierarchical methods for large document collections.

K-Means
1. Randomly select K centroids.
2. Repeat ITER times or until the centroids do not change:
   a) Assign each instance to the cluster whose centroid is closest to it.
   b) Re-compute the cluster centroids.
- Generates a flat partition of K clusters (K must be known in advance).
- A centroid is the mean of a group of instances.

K-Means Example
[Figure: points partitioned around three centroids C; the crosses (x) mark the centroid positions as they are re-computed.]

K-Means demo (1/7-7/7)
[Seven screenshots of an animated K-Means demo: http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html]

Comments
- No proof of convergence.
- Converges to a local minimum of the distortion measure (the average squared distance of the points from their nearest centroids):
  \sum_{c} \sum_{d \in D_c} (d - \mu_c)^2
- Too slow for practical databases.
- K-Means is fully deterministic once the initial centroids are selected; a bad choice of initial centroids leads to poor clusters.
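By way of illustration, here is a minimal Python sketch of the flat K-Means loop above (the function name, NumPy usage, and empty-cluster handling are ours, not part of the thesis):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Flat K-Means over an (N, M) data matrix X; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1: random initial centroids
    for _ in range(iters):  # step 2: at most ITER passes
        # step 2a: assign each point to the cluster with the closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # step 2b: re-compute each centroid as the mean of its assigned points
        new = np.array([X[assign == j].mean(axis=0) if (assign == j).any() else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):  # stop early once centroids no longer change
            break
        centroids = new
    return assign, centroids
```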
Incremental K-Means (IK)
- In K-Means, new centroids are computed after each iteration, i.e., after all documents have been examined.
- In Incremental K-Means, each cluster centroid is updated as soon as a document is assigned to the cluster:
  C' = \frac{S \cdot C + d}{S + 1}
  where C is the current centroid, S the number of documents already in the cluster, and d the newly assigned document.

Comments
- Not as sensitive as K-Means to the selection of the initial centroids.
- Faster convergence; much faster in general.

Bisecting IK-Means (1/4)
- A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection.
- The documents are initially partitioned into two clusters.
- The algorithm iteratively selects and bisects each of the leaf clusters until singleton clusters are reached.

Bisecting IK-Means (2/4)
Input: (d1, d2, …, dN). Output: a hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply IK-Means to split C into K clusters (K = 2); C1, C2, …, CK become leaf clusters.
3. Iteratively split each cluster Ci until K clusters or singleton clusters are produced at the leaves.

Bisecting IK-Means (3/4)
- The algorithm is exhaustive, terminating at singleton clusters (unless K is known in advance).
- Terminating at singleton clusters is time consuming, and singleton clusters are meaningless; intermediate clusters are more likely to correspond to real classes.
- There is no criterion for stopping the bisections before singleton clusters are reached.

Bayesian Information Criterion (BIC) (1/3)
- To prevent over-splitting, we define a strategy that stops the bisecting algorithm once meaningful clusters are reached.
- We use the Bayesian Information Criterion (BIC), also known as the Schwarz Criterion [Schwarz 1978].
- X-Means [Pelleg and Moore, 2000] used BIC for estimating the best K within a given range of values.

Bayesian Information Criterion (BIC) (2/3)
- In this work, we suggest using BIC as the splitting criterion of a cluster, i.e., to decide whether a cluster should be split or not.
- It measures the improvement in cluster structure between a cluster and its two children.
- We compute the BIC score of a cluster and of its two children clusters.

Bayesian Information Criterion (BIC) (3/3)
- If the BIC score of the produced children clusters is less than the BIC score of their parent cluster, we reject the split and keep the parent cluster as it is.
- Otherwise, we accept the split and the algorithm proceeds in the same way at the lower levels.

Example
[Figure: cluster C bisected into children C1 and C2.]
- Parent cluster: BIC(K=1) = 1980. Two resulting clusters: BIC(K=2) = 2245.
- The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.

Computing BIC
The BIC score of a data collection is defined as (Kass and Wasserman, 1995):
  BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log R
where \hat{l}_j(D) is the log-likelihood of the data set D, p_j = M \cdot K + 1 is a function of the number of independent parameters, and R is the number of points.

Log-likelihood
- Given a cluster of points modeled by a Gaussian distribution N(μ, σ²), the log-likelihood is the log-probability that a neighborhood of data points follows this distribution.
- The log-likelihood of the data can be considered a measure of the cohesiveness of a cluster: it estimates how close the points of the cluster are to its centroid.
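To make the split test concrete, here is a minimal Python sketch of the BIC score under the spherical-Gaussian model (it follows the closed-form log-likelihood summarized on the next slides; the function names, array layout, and the assumption of non-empty clusters are ours):

```python
import numpy as np

def bic(X, centroids, assign):
    """BIC = log-likelihood - (p/2) * log R, using the X-Means spherical-Gaussian
    log-likelihood; X is (R, M), centroids is (K, M), assign maps points to clusters."""
    R, M = X.shape
    K = len(centroids)
    # pooled MLE variance: sigma^2 = 1/(R-K) * sum_i ||x_i - mu_(i)||^2
    sigma2 = ((X - centroids[assign]) ** 2).sum() / (R - K)
    loglik = 0.0
    for n in range(K):  # per-cluster term l(D_n), assuming every cluster is non-empty
        Rn = float((assign == n).sum())
        loglik += (-Rn / 2 * np.log(2 * np.pi) - Rn * M / 2 * np.log(sigma2)
                   - (Rn - K) / 2 + Rn * np.log(Rn) - Rn * np.log(R))
    p = M * K + 1  # K*M centroid coordinates plus the shared variance
    return loglik - p / 2 * np.log(R)

def accept_split(X, child_centroids, child_assign):
    """Accept a bisection only if the two children score a higher BIC than their parent."""
    parent = X.mean(axis=0, keepdims=True)
    return bic(X, child_centroids, child_assign) > bic(X, parent, np.zeros(len(X), dtype=int))
```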
Parameters p_j
- Sometimes, due to the complexity of the data (many dimensions or many data points), the data may follow other distributions.
- We therefore penalize the log-likelihood by a function of the number of independent parameters: the term (p_j / 2) \log R.

Notation
- μ_j: coordinates of the j-th centroid
- μ_(i): centroid nearest to the i-th data point
- D: input set of data points
- D_j: set of data points that have μ_j as their closest centroid
- R = |D| and R_j = |D_j|
- M: the number of dimensions
- M_j: family of alternative models (different models correspond to different clustering solutions)
- BIC scores the alternative models and chooses the best among them.

Computing BIC (1/3)
- To compute the log-likelihood of the data we need the parameters of the Gaussian fitted to the data.
- Maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption):
  \hat{\sigma}^2 = \frac{1}{R - K} \sum_i \| x_i - \mu_{(i)} \|^2

Computing BIC (2/3)
- Probability of a point x_i: a Gaussian with the estimated \hat{\sigma} and, as mean, the cluster centroid nearest to x_i:
  \hat{P}(x_i) = \frac{R_{(i)}}{R} \cdot \frac{1}{(\sqrt{2\pi}\,\hat{\sigma})^{M}} \exp\!\left( -\frac{\| x_i - \mu_{(i)} \|^2}{2 \hat{\sigma}^2} \right)
- Log-likelihood of the data:
  l(D) = \log \prod_i \hat{P}(x_i) = \sum_i \left( \log \frac{R_{(i)}}{R} - \frac{M}{2} \log(2\pi \hat{\sigma}^2) - \frac{\| x_i - \mu_{(i)} \|^2}{2 \hat{\sigma}^2} \right)

Computing BIC (3/3)
- Focusing on the set D_n of points that belong to centroid n:
  l(D_n) = -\frac{R_n}{2} \log(2\pi) - \frac{R_n M}{2} \log(\hat{\sigma}^2) - \frac{R_n - K}{2} + R_n \log R_n - R_n \log R

Proposed Method: BIC-Means (1/2)
- BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion.
- BIC-Means performs a splitting test at each leaf cluster to prevent it from over-splitting.
- BIC-Means does not terminate at singleton clusters; it terminates when there are no separable clusters left according to BIC.

Proposed Method: BIC-Means (2/2)
- Combines the strengths of partitional and hierarchical clustering methods:
  - Hierarchical clustering output
  - Low complexity, O(N·K)
  - Good clustering quality
  - Meaningful clusters at the leaves

BIC-Means Algorithm
Input: S = (d1, d2, …, dn), the data in one cluster. Output: a hierarchy of clusters.
1. Place all documents in one cluster C.
2. Apply Incremental K-Means to split C into C1, C2.
3. Compute BIC for C and for C1, C2:
   I. If BIC(C) < BIC(C1, C2), put C1 and C2 in the queue.
   II. Otherwise do not split C.
4. Repeat steps 2 and 3 until there are no separable leaf clusters left in the queue according to BIC.

Evaluation
- Evaluation of document clustering algorithms on two data sets: OHSUMED (233,445 Medline documents) and Reuters-21578 (21,578 documents).
- Application of clustering to information retrieval: evaluation of several cluster-based retrieval strategies, compared with retrieval by exhaustive search on OHSUMED.

F-Measure
- Measures how well the clusters approximate the data classes (a minimal computation is sketched after the figure below).
- The F-measure for cluster C and class T is defined as
  F(T, C) = \frac{2PR}{P + R}, \quad P = N / |C|, \quad R = N / |T|
  where N is the number of members of class T in cluster C.
- The F-measure of a class T is the maximum value it achieves over all clusters C: F_T = \max_C F(T, C).
- The F-measure of the clustering solution is the mean of F_T over all classes, weighted by class size:
  F = \frac{\sum_T |T| \, F_T}{\sum_T |T|}

Comparison of Clustering Algorithms
[Figure: average F-measure (10 trials) of K-Means, Incremental K-Means, and Bisecting Incremental K-Means on the OHSUMED1 and Reuters1 data sets.]
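Here is a minimal Python sketch of this evaluation measure (the function name and label encoding are ours; it assumes one class label and one cluster id per document):

```python
from collections import Counter

def f_measure(labels, clusters):
    """Clustering F-measure: best F over all clusters for each class,
    averaged over classes weighted by class size."""
    class_sizes = Counter(labels)             # |T| for each class T
    cluster_sizes = Counter(clusters)         # |C| for each cluster C
    overlap = Counter(zip(labels, clusters))  # N = members of class T in cluster C
    total, score = len(labels), 0.0
    for t, t_size in class_sizes.items():
        best = 0.0
        for c, c_size in cluster_sizes.items():
            n = overlap[(t, c)]
            if n:
                p, r = n / c_size, n / t_size          # precision, recall
                best = max(best, 2 * p * r / (p + r))  # F_T = max over clusters
        score += (t_size / total) * best               # weighted mean over classes
    return score
```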
Evaluation of Incremental K-Means
[Figure: average F-measure (10 trials) on Reuters1 as a function of the number of iterations (1-4) of centroid adjustment in Incremental K-Means.]

MeSH Representation of Documents
- We use MeSH terms for describing the medical documents (OHSUMED).
- Each document is represented by a vector of MeSH terms (multi-word terms instead of single-word terms).
- This leads to a more compact representation: each vector contains fewer terms, about 20.
- A sequential approach is used to extract the MeSH terms from the OHSUMED documents.

Bisecting Incremental K-Means – Clustering Quality
[Figure: average F-measure (10 trials) of Bisecting Incremental K-Means on OHSUMED2, MeSH-term vs. single-word-term representation.]

Speed of Clustering
[Figure: average clustering time of Bisecting Incremental K-Means on OHSUMED2: 97.6 min with the single-word-term representation vs. 14 min with the MeSH-term representation.]

Evaluation of BIC-Means
[Figure: average F-measure (10 trials) of BIC-Means vs. Bisecting Incremental K-Means on OHSUMED2, Reuters1, and Reuters2.]

Speed of Clustering
[Figure: average clustering time of BIC-Means vs. Bisecting Incremental K-Means on OHSUMED2, Reuters1, and Reuters2.]

Comments
- BIC-Means is much faster than Bisecting Incremental K-Means: it is not an exhaustive algorithm.
- It achieves approximately the same F-measure as the exhaustive bisecting approach.
- It is therefore better suited to clustering large document collections.

Application of Clustering to Information Retrieval
- We demonstrate that it is possible to reduce the size of the search (and therefore the retrieval response time) on large data sets (OHSUMED).
- BIC-Means is applied to the entire OHSUMED collection; each document is represented by MeSH terms.
- We chose 61 queries from the original OHSUMED query set developed by Hersh et al., with relevance judgments over the OHSUMED documents.

Query – Document Similarity
Similarity is defined as the cosine of the angle θ between the document and query vectors:
  Sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1| |d_2|} = \frac{\sum_{i=1}^{M} w_{i,d_1} w_{i,d_2}}{\sqrt{\sum_{i=1}^{M} w_{i,d_1}^2} \sqrt{\sum_{i=1}^{M} w_{i,d_2}^2}}

Information Retrieval Methods
- Method 1: search the M clusters closest to the query, computing the similarity between each cluster centroid and the query.
- Method 2: search the M clusters closest to the query, where each cluster is represented by the 20 most frequent terms of its centroid.
- Method 3: search the M clusters whose centroid contains the terms of the query (a sketch follows).
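As an illustration, here is a minimal Python sketch of Method 3 combined with cosine ranking (the data structures — a cluster id mapped to its centroid's term set and document ids, plus a dict of document vectors — are hypothetical, not the thesis implementation):

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def retrieve(query_terms, query_vec, clusters, doc_vecs):
    """Method 3: keep only the clusters whose centroid contains every query term,
    then rank the documents of those clusters by cosine similarity to the query.
    clusters: {cluster_id: (centroid_terms: set, doc_ids: list)}."""
    candidates = [doc for terms, docs in clusters.values()
                  if set(query_terms) <= terms   # centroid must contain all query terms
                  for doc in docs]
    return sorted(candidates, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
```

Only the documents inside the surviving clusters are scored, which is what cuts the search down to a fraction of the collection in the experiments that follow.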
Method 1: Search the M clusters closest to the query (similarity between cluster centroid and query).
[Figure: precision-recall curves for the top 1, 3, 10, 30, 50, 100, and 150 clusters vs. exhaustive search.]

Method 2: Search the M clusters closest to the query; each cluster is represented by the 20 most frequent terms of its centroid.
[Figure: precision-recall curves for the top 10, 50, 100, and 150 clusters under the 20-term representation vs. exhaustive search.]

Method 3: Search the M clusters containing the terms of the query.
[Figure: precision-recall curves for the top 15, 30, 50, and all clusters whose centroid contains all query terms vs. exhaustive search.]

Size of Search
[Figure: average number of documents searched over the 61 queries when retrieving the clusters that contain all the MeSH query terms in their centroid: exhaustive VSM search vs. all / top-50 / top-30 / top-15 matching clusters.]

Comments
- Best cluster-based retrieval strategy (Method 3): retrieve only the clusters which contain all the MeSH query terms in their centroid vector, then search the documents contained in the retrieved clusters and order them by similarity to the query.
- Advantages: it searches only 30% of all OHSUMED documents, as opposed to exhaustively searching all 233,445, and it is almost as effective as retrieval by exhaustive search (searching without clustering).

Conclusions (1/2)
We implemented and evaluated various partitional clustering techniques:
- Incremental K-Means
- Bisecting Incremental K-Means (the exhaustive approach)
- BIC-Means, which incorporates BIC as a stopping criterion to prevent clustering from over-splitting and produces meaningful clusters at the leaves.

Conclusions (2/2)
- BIC-Means is much faster than Bisecting Incremental K-Means, as effective as the exhaustive bisecting approach, and better suited to clustering large document collections.
- Cluster-based retrieval strategies reduce the size of the search; the best proposed retrieval method is as effective as exhaustive search (searching without clustering).

Future Work
- Evaluation using more, or application-specific, data sets.
- Examination of additional cluster-based retrieval strategies (top-down, bottom-up).
- Clustering and browsing on Medline.
- Clustering of dynamic document collections.
- Semantic similarity methods in document clustering.

References
- Nikos Hourdakis, Michalis Argyriou, Euripides G.M. Petrakis, Evangelos Milios, "Hierarchical Clustering in Medical Document Collections: the BIC-Means Method", Journal of Digital Information Management (JDIM), Vol. 8, No. 2, pp. 71-77, April 2010.
- Dan Pelleg, Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proc. of the 17th Intern. Conf. on Machine Learning (ICML), 2000, pp. 727-734.

Thank you! Questions?