Semi-Supervised Clustering and its Application to Text Clustering and Record Linkage

Machine Learning Group
Raymond J. Mooney
Sugato Basu
Mikhail Bilenko
Arindam Banerjee
University of Texas at Austin
Supervised Classification Example
[figure]

Unsupervised Clustering Example
[figure]

Semi-Supervised Learning
• Combines labeled and unlabeled data during training to improve performance:
– Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
– Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
Semi-Supervised Classification Example
[figure]

Semi-Supervised Classification
• Algorithms:
– Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00].
– Co-training [Blum:COLT98].
– Transductive SVMs [Vapnik:98, Joachims:ICML99].
• Assumptions:
– Known, fixed set of categories given in the labeled data.
– Goal is to improve classification of examples into these known categories.
Semi-Supervised Clustering Example
[figure]

Second Semi-Supervised Clustering Example
[figure]

Semi-Supervised Clustering
• Can group data using the categories in the initial labeled data.
• Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.
• Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.
Search-Based Semi-Supervised Clustering
• Alter the clustering algorithm that searches for a good partitioning by:
– Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz:ANNIE99].
– Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
– Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].
Unsupervised KMeans Clustering
• KMeans is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.
• Algorithm: Initialize K cluster centers $\{\mu_l\}_{l=1}^{K}$ randomly, then repeat until convergence:
– Cluster Assignment Step: Assign each data point x to the cluster $X_l$ whose center $\mu_l$ has the minimum L2 distance from x.
– Center Re-estimation Step: Re-estimate each cluster center $\mu_l$ as the mean of the points currently in that cluster.
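For reference, a minimal NumPy sketch of this two-step loop (function and parameter names are illustrative, not from the talk):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain KMeans by iterative relocation (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize K cluster centers by picking K random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Cluster assignment step: nearest center under L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: each center becomes the mean of its points.
        new_centers = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                                else centers[l] for l in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```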
KMeans Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  $\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$

• Initialization of the K cluster centers:
– Totally random
– Random perturbation from the global mean
– Heuristic to ensure well-separated centers, etc.
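A one-line helper for this objective, under the same illustrative conventions as the sketch above:

```python
def kmeans_objective(X, centers, labels):
    """Sum of squared L2 distances from each point to its assigned cluster center."""
    return float(((X - centers[labels]) ** 2).sum())
```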
K Means Example
[figure sequence illustrating the KMeans iterations:]
– Randomly Initialize Means
– Assign Points to Clusters
– Re-estimate Means
– Re-assign Points to Clusters
– Re-estimate Means
– Re-assign Points to Clusters
– Re-estimate Means and Converge

Semi-Supervised KMeans
• Seeded KMeans:
– Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
– Seed points are only used for initialization, not in subsequent steps.
• Constrained KMeans:
– Labeled data provided by the user are used to initialize the KMeans algorithm.
– Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
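A sketch of both variants on top of the KMeans loop shown earlier (illustrative, not the authors' code; seeds are assumed to be NumPy index/label arrays):

```python
import numpy as np

def semi_supervised_kmeans(X, k, seed_idx, seed_labels, constrained=False, n_iter=100):
    """Seeded / Constrained KMeans sketch.

    seed_idx    : indices of the labeled (seed) points
    seed_labels : their cluster labels in {0, ..., k-1}
    constrained : if True, seed labels are clamped in every assignment step
    """
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    # Initialization: center l is the mean of the seed points labeled l.
    centers = np.array([X[seed_idx[seed_labels == l]].mean(axis=0) for l in range(k)])
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if constrained:
            labels[seed_idx] = seed_labels   # Constrained KMeans: keep seed assignments fixed
        new_centers = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                                else centers[l] for l in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```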
Semi-Supervised K Means Example
[figure sequence illustrating the seeded KMeans iterations:]
– Initialize Means Using Labeled Data
– Assign Points to Clusters
– Re-estimate Means and Converge

Similarity-Based Semi-Supervised Clustering
• Train an adaptive similarity function to fit the labeled data.
• Use a standard clustering algorithm with the trained similarity function to cluster the unlabeled data.
• Adaptive similarity functions:
– Altered Euclidean distance [Klein:ICML02]
– Trained Mahalanobis distance [Xing:NIPS02]
– EM-trained edit distance [Bilenko:KDD03]
• Clustering algorithms:
– Single-link agglomerative [Bilenko:KDD03]
– Complete-link agglomerative [Klein:ICML02]
– K-means [Xing:NIPS02]
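As an illustration of the "trained Mahalanobis distance" entry above, here is the parameterized distance such methods evaluate; fitting the matrix A from the labeled pairs is the learning step and is not shown:

```python
import numpy as np

def mahalanobis_distance(x, y, A):
    """d_A(x, y) = sqrt((x - y)^T A (x - y)) for a positive semi-definite matrix A.

    Metric-learning approaches fit A so that pairs labeled "similar" come out
    close and pairs labeled "dissimilar" come out far apart.
    """
    d = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(d @ A @ d))
```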
Semi-Supervised Clustering Example: Similarity-Based
[figure sequence:]
– Distances Transformed by Learned Metric
– Clustering Result with Trained Metric

Experiments
• Evaluation measures:
– Objective function value for KMeans.
– Mutual Information (MI) between distributions of
computed cluster labels and human-provided class
labels.
• Experiments:
– Change of objective function and MI with increasing
fraction of seeding (for complete labeling and no
noise).
– Change of objective function and MI with increasing
noise in seeds (for complete labeling and fixed
seeding).
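For concreteness, a small sketch of an MI measure between cluster-label and class-label distributions (natural log, unnormalized; the exact variant used in the experiments may differ):

```python
from collections import Counter
import math

def mutual_information(cluster_labels, class_labels):
    """Mutual information between the empirical cluster-label and class-label distributions."""
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, class_labels))
    clusters = Counter(cluster_labels)
    classes = Counter(class_labels)
    mi = 0.0
    for (c, k), n_ck in joint.items():
        # p(c,k) * log( p(c,k) / (p(c) * p(k)) )
        mi += (n_ck / n) * math.log(n_ck * n / (clusters[c] * classes[k]))
    return mi
```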
Experimental Methodology
• The clustering algorithm is always run on the entire dataset.
• Learning curves with 10-fold cross-validation:
– 10% of the data is set aside as a test set whose labels are always hidden.
– Learning curve generated by training on different “seed fractions” of the remaining 90% of the data, whose labels are provided.
• Objective function is calculated over the entire dataset.
• MI measure is calculated only on the independent test set.
Experimental Methodology (contd.)
• For each fold in the Seeding experiments:
– Seeds selected from the training dataset by varying the seed fraction from 0.0 to 1.0, in steps of 0.1.
• For each fold in the Noise experiments:
– Noise simulated by changing the labels of a fraction of the seed values to a random incorrect value.
COP-KMeans
• COP-KMeans [Wagstaff et al.: ICML01] is KMeans with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
• Initialization: Cluster centers are chosen randomly, but as each one is chosen, any must-link constraints that it participates in are enforced (so that those points cannot later be chosen as the center of another cluster).
• Algorithm: During the cluster assignment step in COP-KMeans, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
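A sketch of the constraint check used during assignment (data structures are illustrative, not from Wagstaff et al.'s implementation):

```python
def violates_constraints(point, cluster, assignments, must_link, cannot_link):
    """Return True if placing `point` in `cluster` breaks a pairwise constraint.

    assignments : dict mapping already-assigned points to their clusters
    must_link   : set of frozenset pairs that must share a cluster
    cannot_link : set of frozenset pairs that must not share a cluster
    """
    for other, other_cluster in assignments.items():
        pair = frozenset((point, other))
        if pair in must_link and other_cluster != cluster:
            return True
        if pair in cannot_link and other_cluster == cluster:
            return True
    return False
```

In COP-KMeans each point goes to the nearest center for which this check returns False; if every cluster violates some constraint, the algorithm aborts.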
Datasets
• Data sets:
– UCI Iris (3 classes; 150 instances)
– CMU 20 Newsgroups (20 classes; 20,000 instances)
– Yahoo! News (20 classes; 2,340 instances)
• Data subsets created for experiments:
– Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study the effect of dataset size on the algorithms.
– Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms.
– Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).
Text Data
• Vector space model with TF-IDF weighting for text data.
• Non-content-bearing words removed:
– Stopwords
– High- and low-frequency words
– Words of length < 3
• Text-handling software:
– Spherical KMeans was used as the underlying clustering algorithm: it uses cosine similarity instead of Euclidean distance between word vectors.
– Code base built on top of the MC and SPKMeans packages developed at UT Austin.
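A minimal sketch of the cosine-similarity assignment step used by spherical KMeans (this is not the MC/SPKMeans code, only an illustration):

```python
import numpy as np

def spherical_assign(doc_vectors, centers):
    """Assign each TF-IDF document vector to the center with the highest cosine
    similarity (largest dot product after normalizing both sides to unit length)."""
    X = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-12)
    C = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12)
    return (X @ C.T).argmax(axis=1)
```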
Results: MI and Seeding
Zero noise in seeds
[Small-20 NewsGroup]
– Semi-Supervised KMeans substantially better than unsupervised KMeans
Results: Objective Function and Seeding
User-labeling consistent with KMeans assumptions
[Small-20 NewsGroup]
– Objective function of the data partition increases exponentially with seed fraction
Results: MI and Seeding
Zero noise in seeds
[Yahoo! News]
– Semi-Supervised KMeans still better than unsupervised KMeans
Results: Objective Function and Seeding
User-labeling inconsistent with KMeans assumptions
[Yahoo! News]
– Objective function of constrained algorithms decreases with seeding
Results: Dataset Separability
Difficult datasets: lots of overlap between the clusters
[Same-3 NewsGroup]
– Semi-supervision gives substantial improvement
Results: Dataset Separability
Easy datasets: not much overlap between the clusters
[Different-3 NewsGroup]
– Semi-supervision does not give substantial improvement
Results: Noise Resistance
Seed fraction: 0.1
[20 NewsGroup]
– Seeded-KMeans most robust against noisy seeding
Record Linkage
• Identify and merge duplicate field values and
duplicate records in a database.
• Applications
– Duplicates in mailing lists
– Information integration of multiple databases of stores,
restaurants, etc.
– Matching bibliographic references in research papers
(Cora/ResearchIndex)
– Different published editions in a database of books.
Experimental Datasets
• 1,200 artificially corrupted mailing list addresses.
• 1,295 Cora research paper citations.
• 864 restaurant listings from Fodor’s and Zagat’s
guidebooks.
• 1,675 Citeseer research paper citations.
Record Linkage Examples

Duplicate citation records:
  Author:  Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby
  Title:   Information, prediction, and query by committee
  Venue:   Advances in Neural Information Processing System
  Address: San Mateo, CA
  Year:    1993

  Author:  Freund, Y., Seung, H.S., Shamir, E. & Tishby, N.
  Title:   Information, prediction, and query by committee
  Venue:   Advances in Neural Information Processing Systems
  Address: San Mateo, CA.

Duplicate restaurant records:
  Name:    Second Avenue Deli
  Address: 156 2nd Ave. at 10th
  City:    New York
  Cuisine: Delicatessen

  Name:    Second Avenue Deli
  Address: 156 Second Ave.
  City:    New York City
  Cuisine: Delis

Traditional Record Linkage
• Apply a static text-similarity metric to each field.
– Cosine similarity
– Jaccard similarity
– Edit distance
• Combine similarity of each field to determine
overall similarity.
– Manually weighted sum
• Threshold overall similarity to detect duplicates.
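A sketch of this static pipeline (field names, weights, and the threshold value are illustrative):

```python
def record_similarity(a, b, field_sims, weights):
    """Static record similarity: manually weighted sum of per-field similarities.

    field_sims : dict mapping field name -> similarity function over two strings
    weights    : dict mapping field name -> manually chosen weight
    """
    return sum(weights[f] * field_sims[f](a[f], b[f]) for f in field_sims)

def is_duplicate(a, b, field_sims, weights, threshold=0.9):
    """Flag a pair of records as duplicates when overall similarity exceeds the threshold."""
    return record_similarity(a, b, field_sims, weights) >= threshold
```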
Edit (Levenshtein) Distance
• Minimum number of character deletions, additions, or substitutions needed to make two strings equivalent.
– “misspell” to “mispell” is distance 1
– “misspell” to “mistell” is distance 2
– “misspell” to “misspelling” is distance 3
• Can be computed efficiently using dynamic programming in O(mn) time, where m and n are the lengths of the two strings being compared.
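The standard dynamic program, included here as a reference sketch:

```python
def levenshtein(s, t):
    """Edit distance between s and t in O(len(s) * len(t)) time."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[m][n]

# levenshtein("misspell", "mispell") == 1; levenshtein("misspell", "misspelling") == 3
```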
Edit Distance with Affine Gaps
• Contiguous deletions/additions are less expensive than non-contiguous ones.
– “misspell” to “misspelling” is distance < 3
• Relative cost of contiguous vs. non-contiguous deletions/additions is determined by a manually set parameter.
• Affine-gap edit distance is better for identifying duplicates than Levenshtein distance.
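A Gotoh-style sketch of edit distance with affine gap costs; the particular cost values below are illustrative stand-ins for the manually set parameter mentioned above:

```python
def affine_gap_distance(s, t, mismatch=1.0, gap_open=1.0, gap_extend=0.25):
    """Edit distance where a gap costs gap_open to start and gap_extend to extend,
    so contiguous insertions/deletions are cheaper than isolated ones."""
    INF = float("inf")
    m, n = len(s), len(t)
    # M ends in a match/substitution; D ends in a deletion from s; I ends in an insertion.
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    I = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        D[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, n + 1):
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + min(M[i-1][j-1], D[i-1][j-1], I[i-1][j-1])
            D[i][j] = min(M[i-1][j] + gap_open, I[i-1][j] + gap_open, D[i-1][j] + gap_extend)
            I[i][j] = min(M[i][j-1] + gap_open, D[i][j-1] + gap_open, I[i][j-1] + gap_extend)
    return min(M[m][n], D[m][n], I[m][n])

# affine_gap_distance("misspell", "misspelling") == 1.5 with these costs, i.e. < 3
```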
Trainable Record Linkage
• MARLIN (Multiply Adaptive Record Linkage using INduction)
• Learn parameterized similarity metrics for comparing each field:
– Trainable edit distance: use EM to set edit-operation costs.
• Learn to combine multiple similarity metrics for each field to determine equivalence:
– Use an SVM to decide on duplicates.
Trainable Edit Distance
• Learnable edit distance based on a generative probabilistic model for producing matched pairs of strings.
• Parameters trained using EM to maximize the probability of producing training pairs of equivalent strings.
• Originally developed for Levenshtein distance by Ristad & Yianilos (1998).
• We modified it for affine-gap edit distance.
Sample Learned Edit Operations
• Inexpensive operations:
– Deleting/adding space
– Substituting ‘/’ for ‘-’ in phone numbers
– Deleting/adding ‘e’ and ‘t’ in addresses (Street → St.)
• Expensive operations:
– Deleting/adding a digit in a phone number.
– Deleting/adding a ‘q’ in a name
Combining Field Similarities
• Record similarity is determined by combining the similarities of individual fields.
• Some fields are more indicative of record similarity than others:
– For addresses, city similarity is less relevant than restaurant/person name or street address.
– For bibliographic citations, venue (i.e., conference or journal name) is less relevant than author or title.
• Field similarities should be weighted when combined to determine record similarity.
• Weights should be learned using a learning algorithm.
MARLIN Record Linkage Framework
[diagram: trainable similarity metrics m1 … mk compare each field pair (A.Field1, B.Field1) through (A.Fieldn, B.Fieldn); their outputs feed a trainable duplicate detector]

Learned Record Similarity
• Field similarities are used as feature vectors describing a pair of records.
• An SVM is trained on these feature vectors to discriminate duplicate from non-duplicate pairs.
• Record similarity is based on the distance of the feature vector from the separator.
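A sketch of this step using scikit-learn as a stand-in classifier (the feature values below are made up; MARLIN's actual setup may differ):

```python
import numpy as np
from sklearn.svm import SVC

# Each row holds the per-field similarities for one pair of records;
# y marks whether that pair is a known duplicate (1) or not (0).
pair_features = np.array([[0.95, 0.90, 0.60],   # illustrative values only
                          [0.20, 0.10, 0.30],
                          [0.90, 0.80, 0.70],
                          [0.10, 0.30, 0.20]])
y = np.array([1, 0, 1, 0])

svm = SVC(kernel="linear").fit(pair_features, y)

# Record-pair similarity taken as the signed distance from the separating hyperplane.
scores = svm.decision_function(pair_features)
```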
Record Pair Classification Example
[figure]

Clustering Records into Equivalence Classes
• Use similarity-based semi-supervised clustering to identify groups of equivalent records.
• Use single-link agglomerative clustering to cluster records based on the learned similarity metric.
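A small union-find sketch of single-link grouping at a similarity threshold (names and the threshold convention are illustrative):

```python
def single_link_clusters(records, similarity, threshold):
    """Merge records transitively whenever a pair's similarity exceeds `threshold`
    (single-link agglomerative clustering cut at that threshold)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(i)] = find(j)   # merge the two groups

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(records[i])
    return list(groups.values())
```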
Experimental Methodology
• 2-fold cross-validation with equivalence classes of records randomly assigned to folds.
• Results averaged over 20 runs of cross-validation.
• Accuracy of duplicate detection on test data measured using:
– Precision $= \dfrac{\#\text{CorrectlyIdentifiedDuplicatePairs}}{\#\text{IdentifiedDuplicatePairs}}$
– Recall $= \dfrac{\#\text{CorrectlyIdentifiedDuplicatePairs}}{\#\text{TrueDuplicatePairs}}$
– F-Measure $= \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
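The same measures as code, treating identified and true duplicates as sets of record pairs (a sketch, not the evaluation script used in the experiments):

```python
def pairwise_precision_recall_f(identified_pairs, true_pairs):
    """Precision, recall, and F-measure over sets of duplicate record pairs."""
    correct = identified_pairs & true_pairs
    precision = len(correct) / len(identified_pairs) if identified_pairs else 0.0
    recall = len(correct) / len(true_pairs) if true_pairs else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```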
Mailing List Name Field Results
[figure]

Cora Title Field Results
[figure]

Maximum F-measure for Detecting Duplicate Field Values

Metric                      Restaurant  Restaurant  Citeseer  Citeseer  Citeseer  Citeseer
                            Name        Address     Reason    Face      RL        Constraint
Static Affine Edit Dist.    0.29        0.68        0.93      0.95      0.89      0.92
Learned Affine Edit Dist.   0.35        0.71        0.94      0.97      0.91      0.94

T-test results indicate the differences are significant at the .05 level.

Mailing List Record Results
[figure]

Restaurant Record Results
[figure]

Combining Similarity and Search-Based Semi-Supervised Clustering
• Can apply seeded/constrained clustering with a trained similarity metric.
• We developed a unified framework for Euclidean distance with soft pairwise constraints (must-link, cannot-link).
• Experiments on UCI data comparing the approaches.
• With small amounts of training data, seeded/constrained clustering tends to do better than similarity-based clustering.
• With larger amounts of labeled data, similarity-based clustering tends to do better.
• Combining both approaches outperforms either individual approach.
Active Semi-Supervision
• Use active learning to select the most informative labeled examples.
• We have developed an active approach for selecting good pairwise queries to get must-link and cannot-link constraints:
– Should these two examples be in the same or different clusters?
• Experimental results on UCI and text data.
• Active learning achieves much higher accuracy with fewer labeled training pairs.
Future Work
• Adaptive metric learning for vector-space cosine similarity:
– Supervised learning of better token weights than TF-IDF.
• A unified method for text data (cosine similarity) that combines seeded/constrained clustering with a learned similarity measure.
• Active learning results for duplicate detection:
– Static-active learning.
• Exploiting external data/knowledge (e.g., from the web) to improve similarity measures for duplicate detection.
Conclusion
• Semi-supervised clustering is an alternative way of combining labeled and unlabeled data in learning.
• Search-based and similarity-based are two alternative approaches.
• They have useful applications in text clustering and database record linkage.
• Experimental results for these applications illustrate their utility.
• The two approaches can be combined to produce even better results.