Constrained K-means Clustering with Background Knowledge

Wagstaff, Cardie, Rogers, Schroedl
Proc. 18th ICML, 2001
Background Knowledge
• How to integrate background information (about the
domain or the data set) into clustering algorithms
• Supervision in clustering can take two forms
– Specify class labels for a subset of points (instances)
– Specify pairs of points that belong to same or different clusters
• Supervision in the form of constraints is more realistic
than providing class labels
• Authors propose a variant of K-means that can utilize
pair-wise “instance-level” constraints (COP-KMEANS,
constrained pairwise K-means)
Constrained K-means Clustering
• Must-link constraints: two instances (objects,
patterns) have to be in the same cluster
• Cannot-link constraints: two instances must not be
placed in the same cluster
• How do we get the constraints? Either from partially
labeled data or from background knowledge about
the domain
• Given a set of constraints, we take a transitive
closure over them (if di must link to dj which cannot
link to dk, then we know that di cannot link to dk)
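The closure step above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `transitive_closure`, the integer instance indices, and the use of union-find to group must-linked instances are all assumptions made for the example.

```python
from collections import defaultdict


def transitive_closure(must_link, cannot_link, n):
    """Expand pairwise constraints over n instances (indices 0..n-1)."""
    # Union-find groups the instances joined by must-link constraints.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in must_link:
        parent[find(i)] = find(j)

    groups = defaultdict(set)
    for i in range(n):
        groups[find(i)].add(i)

    # Every pair inside a must-link group is itself must-linked.
    ml = {(a, b) for g in groups.values() for a in g for b in g if a < b}

    # A cannot-link between two groups applies to every cross pair.
    cl = set()
    for i, j in cannot_link:
        gi, gj = find(i), find(j)
        if gi == gj:
            raise ValueError(f"inconsistent constraints on {i} and {j}")
        for a in groups[gi]:
            for b in groups[gj]:
                cl.add((min(a, b), max(a, b)))
    return ml, cl
```

With `must_link=[(0, 1)]` and `cannot_link=[(1, 2)]`, the closure adds the entailed cannot-link between instances 0 and 2, matching the di, dj, dk example on the slide.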
Constrained K-means Algorithm
• Major modification: when updating the cluster assignments, we ensure
that none of the specified constraints are violated; if a legal cluster
cannot be found for an instance di, empty partition is returned
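The constrained assignment step might look like the sketch below. It is an illustration under stated assumptions, not the published implementation: `cop_kmeans`, `violates`, and `sq_dist` are hypothetical names, initial centers are sampled at random, constraints are checked only against instances already assigned in the current pass, and an empty list stands in for the "empty partition".

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def violates(i, cluster, assign, must_link, cannot_link):
    """True if putting instance i in `cluster` breaks a constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-linked partner already placed elsewhere is a violation.
        if other is not None and assign.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-linked partner already in this cluster is a violation.
        if other is not None and assign.get(other) == cluster:
            return True
    return False


def cop_kmeans(points, k, must_link, cannot_link, iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = {}
    for _ in range(iters):
        new_assign = {}
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first legal one.
            order = sorted(range(k), key=lambda c: sq_dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_assign, must_link, cannot_link):
                    new_assign[i] = c
                    break
            else:
                return []  # no legal cluster for instance i
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        # Move each center to the mean of its members.
        for c in range(k):
            members = [points[i] for i, a in assign.items() if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [assign[i] for i in range(len(points))]
```

Because the outcome depends on the order in which instances are visited, a run that fails with one ordering can succeed with another.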
Evaluation Model
• Use Rand index to find the agreement between the correct
labels and clustering results
• Given two partitions P1 and P2 of the same data set D with
n instances,
Rand(P1, P2) = (a + b) / (n(n − 1)/2)
a = no. of decisions where di is in the same cluster as dj in both P1 and P2
b = no. of decisions where di and dj are in different clusters in both P1 and P2
• 10-fold cross-validation; generate constraints on nine folds and
evaluate performance on the tenth
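The Rand index defined above can be computed directly by iterating over all n(n − 1)/2 pairs. The sketch below assumes each partition is given as a list of cluster labels indexed by instance; the function name `rand_index` is chosen for illustration.

```python
from itertools import combinations


def rand_index(p1, p2):
    """Rand index between two labelings of the same n instances (n >= 2)."""
    if len(p1) != len(p2):
        raise ValueError("partitions must cover the same instances")
    agree = 0  # a + b: pairs on which the two partitions agree
    pairs = 0  # n * (n - 1) / 2
    for i, j in combinations(range(len(p1)), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        agree += same1 == same2
        pairs += 1
    return agree / pairs
```

The measure ignores label names, so `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` counts as full agreement.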
Experimental Results: Artificial Constraints
• True value of K (no. of clusters) is known
• Constraint generation: if two randomly picked instances
have the same label, generate a must-link constraint,
otherwise a cannot-link constraint
• 100 trials on each data set; each trial is one 10-fold cross-validation
• Soybean data: 47 instances, 35 attributes, 4 classes
– With 100 constraints, performance improved from 87% to 99%
– Rand index between the constraints and true labels = 48%;
combining clustering and constraints achieves better
performance than either in isolation
• Mushroom data: 50 instances, 21 attributes, 2 classes
– With 100 constraints, performance improved from 69% to 96%
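The constraint-generation procedure described above is simple to sketch. The function name `generate_constraints` and the seeding scheme are assumptions for illustration; the sketch also does not remove duplicate pairs, which the slides do not address.

```python
import random


def generate_constraints(labels, n_constraints, seed=0):
    """Draw random pairs and turn each into a must-link or cannot-link."""
    rng = random.Random(seed)
    must_link, cannot_link = [], []
    for _ in range(n_constraints):
        i, j = rng.sample(range(len(labels)), 2)  # two distinct instances
        if labels[i] == labels[j]:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link
```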
[Figure: COP-KMEANS vs. K-Means on the soybean data]
[Figure: COP-KMEANS vs. K-Means on the mushroom data]
Integrating Constraints and Metric
Learning in Semi-Supervised Clustering
Bilenko, Basu and Mooney
ICML 2004
Semi-Supervised Clustering
• Two ways to incorporate domain knowledge
– Constraint-based approach: Modify the clustering objective function
to satisfy the pairwise constraints
– Metric learning-based approach: train the metric/distance function
used by the clustering algorithm to satisfy the constraints
• MPCK-MEANS: incorporates both metric learning and the use
of pairwise constraints
– Learns individual metric for each cluster, allowing clusters of different shapes
– Allows violation of constraints if it leads to more cohesive clustering
• Euclidean distance is parameterized by using a symmetric
positive-definite matrix A; matrix A is learned for each cluster
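As a small illustration of the parameterized distance, the sketch below evaluates sqrt((x − y)^T A (x − y)) for a given matrix A. It assumes A is supplied as a list of rows and does not check that A is symmetric positive-definite; the name `metric_distance` is hypothetical.

```python
import math


def metric_distance(x, y, A):
    """sqrt((x - y)^T A (x - y)) for a symmetric positive-definite A."""
    d = [xi - yi for xi, yi in zip(x, y)]
    # Matrix-vector product A d, one row at a time.
    Ad = [sum(a * dj for a, dj in zip(row, d)) for row in A]
    return math.sqrt(sum(di * adi for di, adi in zip(d, Ad)))
```

With A set to the identity this reduces to the ordinary Euclidean distance. Keeping one A per cluster, for example in a dictionary keyed by cluster index, lets each cluster stretch or shrink different directions.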
Integrating Constraints & Metric
• Goal of pairwise constrained K-Means is to minimize the
following objective function, where point 𝑥𝑖 is assigned to the
partition 𝜒𝑙𝑖 with centroid 𝜇𝑙𝑖
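$J_{pckm} = \sum_{x_i \in X} \|x_i - \mu_{l_i}\|^2 + \sum_{(x_i, x_j) \in M} w_{ij} \, 1[l_i \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij} \, 1[l_i = l_j]$
– $M$, $C$: sets of must-link and cannot-link pairs
– $w_{ij}$, $\bar{w}_{ij}$: penalty for violating the constraint between 𝑥𝑖 and 𝑥𝑗
– Distance between points 𝑥𝑖 and 𝑥𝑗 under the learned metric: $\|x_i - x_j\|_A = \sqrt{(x_i - x_j)^T A (x_i - x_j)}$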
• Ensure the penalty for violating a must-link (ML) constraint between distant points
(according to the current distance metric) is greater than the penalty for violating an ML
constraint between nearby points
• Ensure the penalty for violating a cannot-link (CL) constraint between nearby points is
greater than the penalty for violating a CL constraint between distant points
• Use the EM algorithm to find the cluster labels and the distance metric
• A neighborhood consists of points connected by must-links; in weighted farthest-first traversal, the goal
is to find K points that are maximally separated from each other in terms of a weighted distance
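To make the structure of the objective concrete, the sketch below evaluates a simplified version of it: squared Euclidean distance to the assigned centroid plus a fixed penalty for each violated constraint. This is an assumption-laden illustration, not MPCK-MEANS itself: it uses a single Euclidean metric, uniform penalties `w` and `w_bar`, and omits the distance-dependent scaling of penalties described above. The name `pck_objective` is hypothetical.

```python
def pck_objective(points, labels, centroids, must_link, cannot_link,
                  w=1.0, w_bar=1.0):
    """Clustering cost plus penalties for violated pairwise constraints."""
    # Squared distance from each point to its assigned centroid.
    cost = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )
    # Penalty w for each must-link pair split across clusters.
    cost += sum(w for i, j in must_link if labels[i] != labels[j])
    # Penalty w_bar for each cannot-link pair placed in the same cluster.
    cost += sum(w_bar for i, j in cannot_link if labels[i] == labels[j])
    return cost
```

An assignment that violates a constraint can still have the lower total cost when the gain in cluster cohesion outweighs the penalty, which is how this style of objective tolerates violations.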
Evaluation Model
• Use F-measure to measure agreement between the true
labels and the estimated cluster labels
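The slide does not say which form of F-measure is used. One common choice for clustering is the pairwise F-measure, sketched below as an assumption: a pair counts as a true positive when both the true labels and the cluster labels put its two points together. The name `pairwise_f_measure` is hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(true_labels, cluster_labels):
    """Harmonic mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1  # pair correctly placed together
        elif same_cluster:
            fp += 1  # pair wrongly placed together
        elif same_class:
            fn += 1  # pair wrongly separated
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```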
Experimental Results
• Constraint generation: Randomly select pairs
of points and generate constraint based on
their true labels. Set penalty to 1 for all pairs
• 50 runs of five-fold cross-validation; F-measure reported on each data set
• Iris dataset: 150 points in 4-dim, 3 clusters
• Wine dataset: 178 points in 13-dim, 3 clusters
Experimental Results
MPCK-MEANS: involves both seeding and metric learning in a unified framework
MK-MEANS: K-Means clustering with the metric learning component
PCK-MEANS: Utilizes constraints for seeding the initial cluster centers and cluster assignments
K-MEANS: Unsupervised clustering
SUPERVISED-MEANS: Assigns points to the nearest cluster centroids inferred from constraints;
serves as a baseline for the performance of purely supervised learning based on constraints
• MPCK-MEANS unifies constraint based and metric based
methods for semi-supervised clustering
• The integrated approach outperforms the two techniques applied individually
• MPCK-MEANS allows clusters to lie in different subspaces
and have different shapes
• Future work: extending to high-dimensional data sets
(where Euclidean distance does not work well), finding the
most informative constraints, handling noisy constraints,…
Pairwise-Constraints Via Crowdsourcing
“Crowdsourcing is the practice of obtaining
needed services, ideas, or content by soliciting
contributions from a large group of people, and
especially from an online community, rather than
from traditional employees or suppliers.
The general concept is to combine the efforts of
crowds of volunteers or part-time workers, where
each one could contribute a small portion, which
adds into a relatively large or significant result.”
[Figure: image collection, sample images, sample annotations by the worker, and the resulting pair-wise labels]