Constrained K-means Clustering with Background Knowledge
Wagstaff, Cardie, Rogers, Schroedl
Proc. 18th ICML, 2001

Background Knowledge
• How to integrate background information (about the domain or the data set) into clustering algorithms
• Supervision in clustering can take two forms
  – Specify class labels for a subset of points (instances)
  – Specify pairs of points that belong to the same or different clusters
• Supervision in the form of constraints is more realistic than providing class labels
• Authors propose a variant of K-means that can utilize pair-wise “instance-level” constraints (COP-KMEANS, constrained pairwise K-means)

Constrained K-means Clustering
• Must-link constraints: two instances (objects, patterns) have to be in the same cluster
• Cannot-link constraints: two instances must not be placed in the same cluster
• How do we get the constraints? Either from partially labeled data or from background knowledge about the domain
• Given a set of constraints, we take a transitive closure over them (if di must link to dj, which cannot link to dk, then we know that di cannot link to dk)

Constrained K-means Algorithm (COP-KMEANS)
• Major modification: when updating the cluster assignments, we ensure that none of the specified constraints is violated; if no legal cluster can be found for an instance di, an empty partition is returned

Evaluation Model
• Use the Rand index to measure the agreement between the correct labels and the clustering results
• Given two partitions P1 and P2 of the same data set D with n instances, Rand(P1, P2) = (a + b) / (n*(n-1)/2), where
  – a = no. of decisions where di is in the same cluster as dj in both P1 and P2
  – b = no. of decisions where di and dj are in different clusters in both P1 and P2
• 10-fold cross-validation; generate constraints on nine folds and evaluate performance on the tenth

Experimental Results: Artificial Constraints
• True value of K (no.
of clusters) is known
• Constraint generation: if two randomly picked instances have the same label, generate a must-link constraint, otherwise a cannot-link constraint
• 100 trials on each data set; each trial is one 10-fold cross-validation
• Soybean data: 47 instances, 35 attributes, 4 classes
  – With 100 constraints, performance improved from 87% to 99%
  – Rand index between the constraints and true labels = 48%; combining clustering and constraints achieves better performance than either in isolation
• Mushroom data: 50 instances, 21 attributes, 2 classes
  – With 100 constraints, performance improved from 69% to 96%

COP-KMEANS vs. K-Means: Soybean data

COP-KMEANS vs. K-Means: Mushroom data

Integrating Constraints and Metric Learning in Semi-Supervised Clustering
Bilenko, Basu and Mooney
ICML 2004

Semi-Supervised Clustering
• Two ways to incorporate domain knowledge
  – Constraint-based approach: modify the clustering objective function to satisfy the pairwise constraints
  – Metric learning-based approach: train the metric/distance function used by the clustering algorithm to satisfy the constraints
• MPCK-MEANS: incorporates both metric learning and the use of pairwise constraints
  – Learns an individual metric for each cluster, allowing clusters of different shapes
  – Allows violation of constraints if it leads to more cohesive clustering
• Euclidean distance is parameterized by a symmetric positive-definite matrix A; a matrix A is learned for each cluster

Integrating Constraints & Metric Learning
• Goal of pairwise constrained K-Means is to minimize an objective function in which point $x_i$ is assigned to the partition $\chi_{l_i}$ with centroid $\mu_{l_i}$; its annotated terms are
  – Distance between points $x_i$ and $x_j$: $\|x_i - x_j\|_A = \sqrt{(x_i - x_j)^T A \,(x_i - x_j)}$
  – Penalty for violating a constraint between $x_i$ and $x_j$
• Ensure penalty for violating ML constraint between distant points (according to the current distance metric) is greater than penalty for violating ML constraint between nearby points, AND
• Ensure penalty for
violating CL constraint between nearby points is greater than penalty for violating CL constraint between distant points

MPCK-MEANS
• Use the EM algorithm to find the cluster labels and the distance metric
• Neighborhood consists of points connected by must-links; in weighted farthest-first traversal, the goal is to find K points that are maximally separated from each other in terms of a weighted distance

MPCK-MEANS: E-step, M-step

Evaluation Model
• Use F-measure to measure agreement between the true labels and the estimated cluster labels

Experimental Results
• Constraint generation: randomly select pairs of points and generate a constraint based on their true labels; set the penalty to 1 for all pairs
• 50 five-fold F-measure reported on each dataset
• Iris dataset: 150 points in 4-dim, 3 clusters
• Wine dataset: 178 points in 13-dim, 3 clusters

Experimental Results
• MPCK-MEANS: involves both seeding and metric learning in the unified framework
• MK-MEANS: K-Means clustering with the metric learning component
• PCK-MEANS: utilizes constraints for seeding the initial cluster centers and cluster assignments
• K-MEANS: unsupervised clustering
• SUPERVISED-MEANS: assigns points to the nearest cluster centroids inferred from constraints; serves as a baseline for the performance of pure supervised learning based on constraints

Summary
• MPCK-MEANS unifies constraint-based and metric-based methods for semi-supervised clustering
• The integrated approach outperforms the two techniques individually
• MPCK-MEANS allows clusters to lie in different subspaces and have different shapes
• Future work: extending to high-dimensional data sets (where Euclidean distance does not work well), finding the most informative constraints, noisy constraints,…

Pairwise-Constraints Via Crowdsourcing
“Crowdsourcing is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people, and especially from an online community, rather than from traditional employees or suppliers.
The general concept is to combine the efforts of crowds of volunteers or part-time workers, where each one could contribute a small portion, which adds into a relatively large or significant result.”
http://en.wikipedia.org/wiki/Crowdsourcing

Image Collection: HIT 1, HIT 2, HIT 3, ……
• Sample images in a HIT
• Sample annotations by the worker
• Annotated keywords: butterfly, flower, tree, leaf; butterfly, mountain, lake, sky
• Pair-wise labels generated from human annotations
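
The slides do not say how pair-wise labels are derived from the keyword annotations, so the following is only a minimal sketch under an assumed rule: two images get a must-link label if their keyword sets overlap, and a cannot-link label otherwise. The function name `pairwise_labels`, the image ids, and the assignment of keywords to images are all hypothetical and chosen for illustration.

```python
from itertools import combinations


def pairwise_labels(annotations):
    """Return {(id_a, id_b): label} for every unordered pair of images.

    ASSUMPTION (not stated in the slides): a pair is labeled "must-link"
    when the two keyword sets share at least one keyword, and
    "cannot-link" otherwise.
    """
    labels = {}
    # sorted() fixes the pair order so each pair appears exactly once
    for a, b in combinations(sorted(annotations), 2):
        shared = set(annotations[a]) & set(annotations[b])
        labels[(a, b)] = "must-link" if shared else "cannot-link"
    return labels


# Hypothetical worker annotations for three images in one HIT.
annotations = {
    "img1": ["butterfly", "flower"],
    "img2": ["butterfly", "leaf"],
    "img3": ["mountain", "lake", "sky"],
}

labels = pairwise_labels(annotations)
for pair, label in labels.items():
    print(pair, label)
```

Labels of this kind have the same form as the must-link and cannot-link constraints consumed by COP-KMEANS and MPCK-MEANS. Because crowd annotations can be noisy, a real pipeline would probably need a more tolerant rule than a single shared keyword.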