Design and Evaluation of Clustering
Approaches for Large Document Collections,
The “BIC-Means” Method
Nikolaos Hourdakis
Technical University of Crete
Department of Electronic and Computer Engineering
MSc Thesis, 13/4/2015
Motivation
 Large document collections in many applications.
 Digital libraries, Web
 There is additional interest in methods for more
effective management of information.
 Abstraction, Browsing, Classification, Retrieval
 Clustering is the means for achieving better
organization of information.
 The data space is partitioned into groups of entities
with similar content.
Outline
 Background
 State-of-the-art clustering approaches
 Partitional, hierarchical methods
 K-Means and its variants
 Incremental K-Means, Bisecting Incremental K-Means
 Proposed method: BIC-Means
 Bisecting Incremental K-Means using BIC as stopping
criterion.
 Evaluation of clustering methods
 Application in Information Retrieval
Hierarchical Clustering (1/3)
 Nested sequence of clusters.
 Two approaches:
A. Agglomerative: starting from singleton clusters, recursively merges the two most similar clusters until there is only one cluster.
B. Divisive (e.g., Bisecting K-Means): starting with all documents in the same root cluster, iteratively splits each cluster into K clusters.
Hierarchical Clustering – Example (2/3)
[Figure: a point set is recursively bisected into nested clusters (numbered 1-7), shown with the corresponding cluster tree.]
Hierarchical Clustering (3/3)
 Organization and browsing of large document
collections call for hierarchical clustering but:
 Agglomerative clustering has quadratic time complexity.
 Prohibitive for large data sets.
Partitional Clustering
 We focus on Partitional Clustering
 K-Means,
 Incremental K-Means,
 Bisecting K-Means
 At least as good as hierarchical.
 Low complexity, O(KN)
 Faster than hierarchical for large document
collections.
K-Means
1. Randomly select K centroids
2. Repeat ITER times or until the centroids do
not change:
a) Assign each instance to the cluster whose centroid is closest.
b) Re-compute the cluster centroids.
 Generates a flat partition of K Clusters (K
must be known in advance).
 Centroid is the mean of a group of instances.
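To make the loop concrete, here is a minimal Python/NumPy sketch of the algorithm above; the empty-cluster guard and the exact convergence test are implementation choices, not part of the slides.

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Flat partition of the rows of X (N x M array) into k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly select k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Step 2a: assign each instance to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2b: re-compute each centroid as the mean of its instances.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):              # keep the old centroid if empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break                         # centroids did not change
        centroids = new_centroids
    return labels, centroids
```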
K-Means Example
[Figure: K-Means iterations on a point set; C marks the cluster centroids and x the re-computed centroid positions.]
K-Means demo
http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
[Seven screenshots stepping through the K-Means iterations omitted.]
Comments
 No guarantee of reaching the global optimum.
 Converges to a local minimum of the distortion measure (the sum of squared distances of the points from their nearest centroids):
$\sum_{i=1}^{K} \sum_{d \in C_i} (d - \mu_i)^2$
 Too slow for practical databases.
 K-Means is fully deterministic once the initial centroids are selected.
 Bad choice of initial centroids leads to poor
clusters.
Incremental K-Means (IK)
 In K-Means new centroids are computed
after each iteration (after all documents
have been examined).
 In Incremental K-Means each cluster
centroid is updated after a document is
assigned to a cluster:
$C' = \frac{(S-1)\,C + d}{S}$
where C is the current centroid, d the newly assigned document, and S the cluster size after the assignment.
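A small sketch of this per-document update, assuming `centroids` is a k×M NumPy array and `sizes` tracks the current cluster sizes; names are illustrative.

```python
import numpy as np

def assign_and_update(doc, centroids, sizes):
    """Assign one document to its nearest centroid and update that
    centroid in place, following C' = ((S - 1) * C + d) / S,
    where S is the cluster size after the assignment."""
    j = int(np.argmin(np.linalg.norm(centroids - doc, axis=1)))
    sizes[j] += 1
    S = sizes[j]
    centroids[j] = ((S - 1) * centroids[j] + doc) / S
    return j
```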
Comments
 Not as sensitive as K-Means to the
selection of initial centroids.
 Faster convergence; much faster in general.
Bisecting IK-Means (1/4)
 A hierarchical clustering solution is produced by recursively applying Incremental K-Means to a document collection.
 The documents are initially partitioned into two
clusters.
 The algorithm iteratively selects and bisects each
one of the leaf clusters until singleton clusters are
reached.
Bisecting IK-means (2/4)
 Input: (d1, d2, …, dN)
 Output: a hierarchy of clusters
1. Place all documents in one cluster C.
2. Apply IK-Means to split C into K=2 clusters C1, C2 (leaf clusters).
3. Iteratively split each leaf cluster Ci until K clusters or singleton clusters are produced at the leaves (see the sketch below).
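A minimal Python sketch of the exhaustive bisecting loop; `ik_means_split` stands in for an Incremental K-Means bisection (e.g., built on the update above), and the policy of splitting the largest leaf first is an assumption, since the slides do not fix the selection order.

```python
def bisecting_ik_means(docs, ik_means_split):
    """Exhaustive bisecting: repeatedly split a leaf cluster in two
    until every leaf is a singleton."""
    leaves, hierarchy = [list(docs)], []
    while any(len(c) > 1 for c in leaves):
        # Assumed policy: pick the largest splittable leaf.
        idx = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        cluster = leaves.pop(idx)
        c1, c2 = ik_means_split(cluster)      # bisect with IK-Means (K=2)
        hierarchy.append((cluster, c1, c2))
        leaves.extend([c1, c2])
    return hierarchy, leaves
```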
Bisecting IK-Means (3/4)
 The algorithm is exhaustive, terminating at singleton clusters (unless K is known).
 Terminating at singleton clusters:
 Is time consuming.
 Singleton clusters are meaningless.
 Intermediate clusters are more likely to correspond to real classes.
 There is no criterion for stopping bisections before singleton clusters are reached.
Bayesian Information Criterion (BIC) (1/3)
 To prevent over-splitting we define a
strategy to stop the Bisecting algorithm
when meaningful clusters are reached.
 Bayesian Information Criterion (BIC) or
Schwarz Criterion [Schwarz 1978].
 X-Means [Pelleg and Moore, 2000] used
BIC for estimating the best K in a given
range of values.
Bayesian Information Criterion (BIC) (2/3)
 In this work, we suggest using BIC as the splitting criterion, i.e., to decide whether a cluster should be split or not.
 It measures the improvement of the cluster
structure between a cluster and its two children
clusters.
 We compute the BIC score of:
 A cluster and of its
 Two children clusters.
Bayesian Information Criterion (BIC) (3/3)
 If the BIC score of the produced children
clusters is less than the BIC score of their
parent cluster we do not accept the split.
 We keep the parent cluster as it is.
 Otherwise, we accept the split and the
algorithm proceeds similarly to lower levels.
Example
[Figure: a parent cluster C is bisected into children C1 and C2.]
Parent cluster: BIC(K=1) = 1980
Two resulting clusters: BIC(K=2) = 2245
 The BIC score of the parent cluster is less than the BIC score of the generated cluster structure, so we accept the bisection.
Computing BIC
 The BIC score of a data collection is defined as
(Kass and Wasserman, 1995):
$BIC(M_j) = \hat{l}_j(D) - \frac{p_j}{2} \log R$
where $\hat{l}_j(D)$ is the log-likelihood of the data set D, $p_j = M \cdot K + 1$ is a function of the number of independent parameters, and R is the number of points.
Log-likelihood
 Given a cluster of points that follows a Gaussian distribution N(μ, σ²), the log-likelihood is the probability that a neighborhood of data points follows this distribution.
 The log-likelihood of the data can be
considered as a measure of the cohesiveness
of a cluster.
 It estimates how close the points of the cluster are to the centroid.
Parameters pj
 Sometimes, due to the complexity of the
data (many dimensions or many data
points), the data may follow other
distributions.
 We penalize the log-likelihood by a function of the number of independent parameters, $(p_j/2)\log R$.
Notation
 μj: coordinates of the j-th centroid
 μ(i): centroid nearest to the i-th data point
 D: input set of data points
 Dj: set of data points that have μ(j) as their closest centroid
 R = |D| and Ri = |Di|
 M: the number of dimensions
 Mj: family of alternative models (different models correspond to clustering solutions)
 BIC scores the models and chooses the best among the K models
Computing BIC (1/3)
 To compute the log-likelihood of the data we need the parameters of the Gaussian that models the data.
 Maximum likelihood estimate (MLE) of the variance (under the spherical Gaussian assumption):
$\hat{\sigma}^2 = \frac{1}{R-K} \sum_i \left\|x_i - \mu_{(i)}\right\|^2$
Computing BIC (2/3)
 Probability of point $x_i$: Gaussian with the estimated $\hat{\sigma}^2$ and mean the cluster centroid nearest to $x_i$:
$P(x_i) = \frac{R_i}{R} \cdot \frac{1}{(2\pi)^{M/2}\,\hat{\sigma}^{M}} \exp\left(-\frac{1}{2\hat{\sigma}^2}\left\|x_i - \mu_{(i)}\right\|^2\right)$
 Log-likelihood of the data:
$l(D) = \log \prod_i P(x_i) = \sum_i \left(\log \frac{1}{(2\pi)^{M/2}\,\hat{\sigma}^{M}} - \frac{1}{2\hat{\sigma}^2}\left\|x_i - \mu_{(i)}\right\|^2 + \log \frac{R_i}{R}\right)$
Computing BIC (3/3)
 Focusing on the set Dn of points
which belong to centroid n
$l(D_n) = -\frac{R_n}{2}\log(2\pi) - \frac{R_n M}{2}\log(\hat{\sigma}^2) - \frac{R_n - K}{2} + R_n \log R_n - R_n \log R$
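Putting the pieces together, a Python sketch of the BIC score under these formulas: the pooled spherical-Gaussian variance, the closed-form l(D_n) per cluster, and the p_j = M·K + 1 penalty. The degenerate-case guards are assumptions, not part of the slides.

```python
import numpy as np

def bic_score(clusters):
    """BIC of a clustering solution, using the closed-form l(D_n) above.

    clusters: list of (R_n x M) NumPy arrays, one array per cluster.
    """
    K = len(clusters)
    R = sum(len(c) for c in clusters)
    M = clusters[0].shape[1]
    if R <= K:
        return float("-inf")           # too few points to estimate variance
    # Pooled MLE of the variance under the spherical Gaussian assumption.
    var = sum(((c - c.mean(axis=0)) ** 2).sum() for c in clusters) / (R - K)
    if var <= 0:
        return float("-inf")           # degenerate: all points coincide
    log_likelihood = 0.0
    for c in clusters:
        Rn = len(c)
        log_likelihood += (-Rn / 2.0 * np.log(2 * np.pi)
                           - Rn * M / 2.0 * np.log(var)
                           - (Rn - K) / 2.0
                           + Rn * np.log(Rn) - Rn * np.log(R))
    p = M * K + 1                      # p_j = M*K + 1, as defined above
    return log_likelihood - p / 2.0 * np.log(R)
```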
Proposed Method: BIC-Means (1/2)
 BIC-Means: Bisecting InCremental K-Means clustering incorporating BIC as the stopping criterion.
 BIC-Means performs a splitting test at each leaf cluster to prevent it from over-splitting.
 BIC-Means doesn’t terminate at singleton
clusters.
 BIC-Means terminates when there are no
separable clusters according to BIC.
Proposed Method: BIC-Means (2/2)
 Combines the strengths of partitional
and hierarchical clustering methods
 Hierarchical clustering
 Low complexity (O(N·K))
 Good clustering quality
 Produces meaningful clusters at the leaves
BIC-Means Algorithm
 Input: S = (d1, d2, …, dn), all data in one cluster
 Output: a hierarchy of clusters
1. Place all documents in one cluster C.
2. Apply Incremental K-Means to split C into C1, C2.
3. Compute BIC for C and for C1, C2:
   I. If BIC(C) < BIC(C1, C2), put C1 and C2 in the queue.
   II. Otherwise do not split C.
4. Repeat steps 2 and 3 until there are no separable leaf clusters in the queue according to BIC (see the sketch below).
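A Python sketch of this loop, reusing the hypothetical `ik_means_split` and the `bic_score` function sketched earlier; a deque plays the role of the queue of leaf clusters.

```python
from collections import deque

def bic_means(docs, ik_means_split, bic_score):
    """BIC-Means sketch: bisect a leaf only when BIC favors the split."""
    queue, leaves, hierarchy = deque([docs]), [], []
    while queue:
        cluster = queue.popleft()
        if len(cluster) < 2:             # nothing left to split
            leaves.append(cluster)
            continue
        c1, c2 = ik_means_split(cluster)
        # Accept the bisection only if the children score higher (step 3).
        if bic_score([cluster]) < bic_score([c1, c2]):
            hierarchy.append((cluster, c1, c2))
            queue.extend([c1, c2])
        else:
            leaves.append(cluster)       # keep the parent cluster as is
    return hierarchy, leaves
```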
Evaluation
 Evaluation of document clustering algorithms.
 Two data sets: OHSUMED (233,445 Medline documents), Reuters-21578 (21,578 documents).
 Application of clustering to information retrieval
 Evaluation of several cluster-based retrieval
strategies.
 Comparison with retrieval by exhaustive search
on OHSUMED.
F-Measure
 How well the clusters approximate the data classes.
 The F-Measure for cluster C and class T is defined as
$F(C,T) = \frac{2PR}{P+R}$, where $P = N/|C|$, $R = N/|T|$, and N is the number of documents of class T that fall in cluster C.
 The F-Measure of a class T is the maximum value it achieves over all clusters C:
$F_T = \max_C F(C,T)$
 The F-Measure of the clustering solution is the class-size weighted mean of $F_T$ over all classes:
$F = \sum_T \frac{|T|}{\sum_{T'} |T'|} F_T$
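A small sketch of this computation in Python, assuming clusters and classes are given as sets of document ids; names are illustrative.

```python
def f_measure(clusters, classes):
    """Overall F-Measure: class-size weighted mean over classes of the
    best F(C, T) over all clusters."""
    total = sum(len(t) for t in classes)
    overall = 0.0
    for t in classes:
        best = 0.0
        for c in clusters:
            n = len(c & t)               # class-t documents inside cluster c
            if n:
                p, r = n / len(c), n / len(t)
                best = max(best, 2 * p * r / (p + r))
        overall += (len(t) / total) * best
    return overall
```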
Comparison of Clustering Algorithms
[Bar chart: average F-Measure over 10 trials of K-Means, Incremental K-Means, and Bisecting Incremental K-Means on the OHSUMED1 and Reuters1 data sets.]
Evaluation of Incremental K-Means
[Bar chart: average F-Measure over 10 trials of Incremental K-Means on Reuters1, varying the number of center-adjustment iterations from 1 to 4.]
MeSH Representation of Documents
 We use MeSH terms for describing medical
documents (OHSUMED).
 Each document is represented by a vector of
MeSH terms (multi-word terms instead of single
word terms).
 Leads to a more compact representation (each vector contains fewer terms, about 20).
 A sequential approach is used to extract MeSH terms from OHSUMED documents.
Bisecting Incremental K-Means – Clustering Quality
[Bar chart: average F-Measure over 10 trials of Bisecting Incremental K-Means on OHSUMED2, comparing MeSH-term and single-word-term representations.]
Speed of Clustering
[Bar chart: average clustering time of Bisecting Incremental K-Means on OHSUMED2: 97.6 min with the single-word-term representation vs. 14 min with the MeSH-term representation.]
Evaluation of BIC-Means
[Bar chart: average F-Measure over 10 trials of BIC-Means vs. Bisecting Incremental K-Means on the Ohsumed2, Reuters1, and Reuters2 data sets.]
Speed of Clustering
[Bar chart: average clustering time of BIC-Means vs. Bisecting Incremental K-Means on the Ohsumed2, Reuters1, and Reuters2 data sets.]
Comments
 BIC-Means is much faster than Bisecting Incremental K-Means:
 It is not an exhaustive algorithm.
 Achieves approximately the same F-Measure as the exhaustive Bisecting approach.
 It is more suited for clustering large document collections.
Application of Clustering to
Information Retrieval
 We demonstrate that it is possible to reduce the
size of the search (and therefore retrieval
response time) on large data sets (OHSUMED).
 BIC-Means is applied to the entire OHSUMED collection.
 Each document is represented by MeSH terms.
 We chose 61 queries from the original OHSUMED query set developed by Hersh et al.
 Relevance judgments against these queries are available for the OHSUMED documents.
Query – Document Similarity
$Sim(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1|\,|d_2|} = \frac{\sum_{i=1}^{M} w_{i,d_1}\, w_{i,d_2}}{\sqrt{\sum_{i=1}^{M} w_{i,d_1}^2}\;\sqrt{\sum_{i=1}^{M} w_{i,d_2}^2}}$
[Figure: document vectors d1 and d2 separated by angle θ.]
 Similarity is defined as the cosine of the angle
between document and query vectors.
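A minimal Python sketch of this similarity for sparse term-weight vectors stored as dictionaries, which suits the compact MeSH representation (about 20 terms per document).

```python
import math

def cosine_sim(d1, d2):
    """Cosine of the angle between two term-weight dictionaries."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```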
Information Retrieval Methods
 Method 1: Search the M clusters closest to the query.
 Compute the similarity between each cluster centroid and the query.
 Method 2: Search the M clusters closest to the query.
 Each cluster is represented by the 20 most frequent terms of its centroid.
 Method 3: Search the M clusters whose centroids contain the terms of the query (sketch below).
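A Python sketch of Method 3; the cluster interface (`centroid_terms`, `documents`) is hypothetical, as the slides do not specify data structures.

```python
def method3_search(query_terms, query_vec, clusters, cosine_sim):
    """Keep only clusters whose centroid vector contains every query
    term, then rank their pooled documents by similarity to the query.
    Each cluster is assumed to expose `centroid_terms` (a set) and
    `documents` (a list of (doc_id, term_vector) pairs)."""
    matching = [c for c in clusters
                if set(query_terms) <= c.centroid_terms]
    pooled = [d for c in matching for d in c.documents]
    # Order the pooled documents by similarity with the query.
    return sorted(pooled,
                  key=lambda d: cosine_sim(query_vec, d[1]),
                  reverse=True)
```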
Method 1: Search the M clusters closest to the query (similarity between cluster centroid and query).
[Precision-recall curves for Method 1, searching the top 1, 3, 10, 30, 50, 100, and 150 clusters, compared with exhaustive search.]
Method 2: Search the M clusters closest to the query. Each cluster is represented by the 20 most frequent terms of its centroid.
[Precision-recall curves for Method 2, searching the top 10, 50, 100, and 150 clusters with the 20-term representation, compared with Method 1 (top 150 clusters) and exhaustive search.]
Method 3: Search M clusters containing the
terms of the query.
[Precision-recall curves for Method 3, searching the top 15, 30, 50, and all matching clusters, compared with exhaustive search.]
Size of Search
"Avg Num ber of Docum ents searched over the 61 queries"
Retrieval Strategy: Retrieve the clusters w hich contain all the MeSH Query
Term s in their Centroid.
245000
Num OF Docs
210000
175000
140000
105000
70000
35000
0
VSM
AllClusters
Top_50Clusters Top_30Clusters Top_15Clusters
Search Strategy
13/4/2015
Nikos Hourdakis, MSc Thesis
53
Comments
 Best cluster-based retrieval strategy:
 Retrieve only the clusters which contain all the MeSH
query terms in their centroid vector (Method 3).
 Search the documents which are contained in the
retrieved clusters and order them by similarity with the
query.
 Advantages:
 Searches only 30% of all OHSUMED documents as
opposed to exhaustive searching (233,445 docs).
 Almost as effective as the retrieval by exhaustive
searching (searching without clustering).
Conclusions (1/2)
 We implemented and evaluated various
partitional clustering techniques
 Incremental K-Means
 Bisecting Incremental K-Means (exhaustive
approach)
 BIC-Means
 Incorporates BIC as a stopping criterion, preventing the clustering from over-splitting.
 Produces meaningful clusters at the leaves.
Conclusions (2/2)
 BIC-Means
 Much faster than Bisecting Incremental K-Means.
 As effective as the exhaustive Bisecting approach.
 More suited for clustering large document collections.
 Cluster-based retrieval strategies
 Reduce the size of the search.
 The best proposed retrieval method is as effective as exhaustive searching (searching without clustering).
Future Work
 Evaluation using more, or application-specific, data sets.
 Examine additional cluster-based retrieval
strategies (top-down, bottom-up).
 Clustering and Browsing on Medline.
 Clustering Dynamic Document Collections.
 Semantic Similarity Methods in document
clustering.
References
 Nikos Hourdakis, Michalis Argyriou, Euripides G.M. Petrakis, Evangelos Milios, "Hierarchical Clustering in Medical Document Collections: the BIC-Means Method", Journal of Digital Information Management (JDIM), Vol. 8, No. 2, pp. 71-77, April 2010.
 Dan Pelleg, Andrew Moore, "X-means: Extending K-means with Efficient Estimation of the Number of Clusters", Proc. of the 17th Intern. Conf. on Machine Learning (ICML), 2000, pp. 727-734.
Thank you!!!
Questions?