Enhancing Cluster Labeling Using Wikipedia

SIGIR’09
David Carmel, Haggai Roitman, Naama Zwerdling
IBM Research Lab
Document Clustering

A method of aggregating a set of documents such that:
- Documents within a cluster are as similar as possible.
- Documents from different clusters should be dissimilar.
[Figure: three example document clusters, Cluster 1, Cluster 2, and Cluster 3]
Cluster Labeling

To assign each cluster a human-readable label that best represents the cluster.
[Figure: the same three clusters labeled Electronics, Bowling, and Ice Hockey]
The traditional method is to pick the label from the important terms within the cluster:
- The statistically most significant terms may not make a good label.
- A good label may not occur directly in the text.
Approach


- Utilize an external resource to help with cluster labeling.
- Besides the important terms extracted from the cluster, Wikipedia metadata such as page titles and categories is used to provide candidate labels.
A General Framework
[Figure: the general framework pipeline, covering indexing, clustering, important-term extraction, candidate label extraction, and label ranking]
Step 1: Indexing
- Documents are parsed and tokenized.
- Term weights are determined by tf-idf.
- Lucene is used to build a search index so that the tf and idf values of a term t can be accessed quickly.
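The paper uses Lucene for the index; the following is a minimal pure-Python sketch of the same tf/idf bookkeeping (the helper names build_index and tf_idf are illustrative, not from the paper):

```python
import math
from collections import Counter

def build_index(documents):
    """documents: list of token lists. Returns per-document tf counters and df counts."""
    tf = [Counter(doc) for doc in documents]
    df = Counter(term for doc_tf in tf for term in doc_tf)
    return tf, df

def tf_idf(term, doc_id, tf, df, n_docs):
    """tf-idf weight of a term in one document."""
    if df.get(term, 0) == 0:
        return 0.0
    return tf[doc_id][term] * math.log(n_docs / df[term])
```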
Step 2: Clustering
- Given the document collection D, return a set of document clusters C = {C1, C2, …, Cn}.
- A cluster is represented by the centroid of its documents.
- The term weights of the cluster's centroid are a slightly modified tf-idf.
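The exact modification of the centroid weights is not reproduced in these notes; the sketch below assumes the plain, unmodified form in which the centroid simply averages the documents' tf-idf vectors:

```python
from collections import defaultdict

def centroid(doc_vectors):
    """doc_vectors: list of {term: tf-idf weight} dicts, one per document in the cluster."""
    acc = defaultdict(float)
    for vec in doc_vectors:
        for term, weight in vec.items():
            acc[term] += weight
    # average of the documents' tf-idf vectors (the paper uses a slight variant)
    return {term: total / len(doc_vectors) for term, total in acc.items()}
```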




Step 3: Important Term Extraction
- Given a cluster Ci, find a list of important terms ordered by their estimated importance.
- This can be achieved by:
  - Selecting the top-weighted terms from the cluster centroid, or
  - Using the Jensen-Shannon divergence (JSD) to measure the distance between the cluster and the whole collection.
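A minimal sketch of the JSD-based option, assuming each term is scored by its contribution to the Jensen-Shannon divergence between the cluster's term distribution and the collection's:

```python
import math

def jsd_term_scores(p_cluster, p_collection):
    """p_cluster, p_collection: dicts mapping term -> probability of the term."""
    scores = {}
    for term in set(p_cluster) | set(p_collection):
        p = p_cluster.get(term, 0.0)
        q = p_collection.get(term, 0.0)
        m = (p + q) / 2
        contrib = 0.0
        if p > 0:
            contrib += 0.5 * p * math.log(p / m)
        if q > 0:
            contrib += 0.5 * q * math.log(q / m)
        scores[term] = contrib
    return scores  # rank terms by descending contribution
```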
Step 4: Label Extraction
- One way is to use the top k important terms directly as labels.
- The other way is to use the top k important terms to query Wikipedia; the title and the set of categories of each returned Wikipedia document serve as candidate labels.
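A minimal sketch of candidate extraction by querying Wikipedia; for illustration it uses the public MediaWiki search API (the paper's own Wikipedia access may differ), and only the result titles are collected here, page categories would need a further request:

```python
import requests

def wikipedia_candidates(important_terms, k=5, num_results=10):
    """Query Wikipedia with the top-k important terms and return result titles."""
    query = " ".join(important_terms[:k])
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search", "srsearch": query,
                "srlimit": num_results, "format": "json"},
        timeout=10,
    )
    # Each search hit's title is a candidate label; the page's categories
    # (also candidates) can be fetched with a follow-up prop=categories call.
    return [hit["title"] for hit in resp.json()["query"]["search"]]
```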
Step 5: Output the Recommended Labels from the Candidate Labels
MI (Mutual Information) Judge
- Score each candidate label by its pointwise mutual information with the cluster's important terms.
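A minimal sketch of the MI judge, assuming PMI is estimated from document-level co-occurrence counts in some reference collection; the paper's exact estimation corpus and smoothing may differ:

```python
import math

def pmi(df_label, df_term, df_both, n_docs):
    """Pointwise mutual information estimated from document frequencies."""
    if df_both == 0 or df_label == 0 or df_term == 0:
        return 0.0
    return math.log((df_both / n_docs) / ((df_label / n_docs) * (df_term / n_docs)))

def mi_judge(label, important_terms, doc_freq, co_doc_freq, n_docs):
    """Average PMI between a candidate label and the cluster's important terms."""
    scores = [pmi(doc_freq.get(label, 0), doc_freq.get(t, 0),
                  co_doc_freq.get((label, t), 0), n_docs)
              for t in important_terms]
    return sum(scores) / len(scores)
```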
SP (Score Propagation) Judge
- Propagate the document scores to the candidate labels.
- The document score can be the original score of the IR system or the reciprocal rank, rank(d)^-1.
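A minimal sketch of the SP judge using reciprocal rank as the document score; substituting the IR system's retrieval score works the same way:

```python
from collections import defaultdict

def sp_judge(ranked_results):
    """ranked_results: for each Wikipedia result (best first), the list of its
    candidate labels (title and categories)."""
    label_scores = defaultdict(float)
    for rank, labels in enumerate(ranked_results, start=1):
        for label in labels:
            # each candidate label accumulates the score of the result it came from
            label_scores[label] += 1.0 / rank
    return label_scores
```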
Score Aggregation
- A linear combination is used to combine the two judges' scores.
- The recommended labels are the top-ranked candidates.
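A minimal sketch of the aggregation, assuming the two judges' scores are already normalized to a comparable range; the interpolation weight lam is a free parameter, not a value from the paper:

```python
def aggregate(mi_scores, sp_scores, lam=0.5):
    """Linear combination of the MI and SP judge scores per candidate label."""
    labels = set(mi_scores) | set(sp_scores)
    combined = {l: lam * mi_scores.get(l, 0.0) + (1 - lam) * sp_scores.get(l, 0.0)
                for l in labels}
    return sorted(combined, key=combined.get, reverse=True)  # top-ranked labels first
```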

Data Collection

20 Newsgroups
- 20 clusters x 1,000 documents per cluster.
Open Directory Project (ODP)
- 100 clusters x 100 documents per cluster.
The Ground Truth
- The correct label itself.
- Inflections of the correct label.
- WordNet synonyms of the correct label.
Evaluation Metrics

Match@K
- The fraction of clusters for which at least one of the top-K recommended labels is correct.
- Ex: with two clusters c1 and c2, if a correct label appears among c1's top-4 labels but not among c2's, then Match@4 = 1/2 = 0.5.
Mean Reciprocal Rank (MRR@K)
- The reciprocal rank of the first correct label within the top K, averaged over clusters.
- Ex: if the first correct label is ranked 2nd for c1 and 3rd for c2, then MRR@4 = ((1/2)+(1/3))/2 = 0.416…
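A minimal sketch of both metrics, where is_correct stands in for the ground-truth matching described above (exact label, inflections, WordNet synonyms):

```python
def match_at_k(ranked_labels, is_correct, k):
    """ranked_labels: {cluster: [labels, best first]}; is_correct(cluster, label) -> bool."""
    hits = [any(is_correct(c, l) for l in labels[:k])
            for c, labels in ranked_labels.items()]
    return sum(hits) / len(hits)

def mrr_at_k(ranked_labels, is_correct, k):
    """Reciprocal rank of the first correct label in the top k, averaged over clusters."""
    total = 0.0
    for c, labels in ranked_labels.items():
        for rank, label in enumerate(labels[:k], start=1):
            if is_correct(c, label):
                total += 1.0 / rank
                break
    return total / len(ranked_labels)
```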
Parameters
1. The important-term selection method (JSD, ctf-cdfidf, MI, chi-square).
2. The number of important terms used to query Wikipedia.
3. The number of Wikipedia results used for label extraction.
4. The judges used for candidate evaluation.
Evaluation 1
- The effectiveness of using Wikipedia to enhance cluster labeling.
Evaluation 2
- Candidate label extraction.
Evaluation 3
- Judge effectiveness.
Evaluation 4.1
- The effect of the clusters' coherency on label quality.
- Testing on "noisy clusters": for a noise level p in [0,1], each document in a cluster is swapped, with probability p, with a document from another cluster.
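A minimal sketch of the noise injection described above; the per-document swap with a random document from another cluster is an assumed reading of the procedure:

```python
import random

def add_noise(clusters, p, seed=0):
    """clusters: {cluster id: list of documents}. With probability p, each document
    is swapped in place with a randomly chosen document from another cluster."""
    rng = random.Random(seed)
    ids = list(clusters)
    for cid in ids:
        for i in range(len(clusters[cid])):
            if rng.random() < p:
                other = rng.choice([c for c in ids if c != cid])
                j = rng.randrange(len(clusters[other]))
                clusters[cid][i], clusters[other][j] = clusters[other][j], clusters[cid][i]
    return clusters
```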
Evaluation 4.2
- The effect of the clusters' coherency on label quality.
Conclusion
- Proposed a general framework for solving the cluster labeling problem.
- Wikipedia metadata can boost the performance of cluster labeling.
- The proposed method is resilient to noisy clusters.