Robust hierarchical k-center clustering
Ilya Razenshteyn (MIT), Silvio Lattanzi (Google), Stefano Leonardi (Sapienza University of Rome) and Vahab Mirrokni (Google)

k-Center clustering
• Given: an n-point metric space (symmetric distance, triangle inequality)
• Goal: cover all points with k balls of the smallest radius
• Simple 2-approximation, NP-hard to approximate better (Gonzalez 1985), (Hochbaum, Shmoys 1986); a sketch of the greedy heuristic appears after these slides

k-Center clustering with z outliers
• Given: an n-point metric space (symmetric distance, triangle inequality)
• Goal: cover all but z points with k balls of the smallest radius
• Simple 3-approximation, NP-hard to approximate better (Charikar, Khuller, Mount, Narasimhan 2001); see the sketch after these slides

Universal outliers
• The set of z outliers depends on k
• Is there a set of outliers that “works” for every k?
• Notation: OPT_{k,z} is the cost of the optimal k-center clustering with z outliers
• Formalization:
  • a universal set S of size f(z)
  • for every k, one can cover everything but S with k balls of radius O(1) · OPT_{k,z}
• The main result: one can always achieve f(z) = z², and this is tight (up to a constant)

Greedy construction
• Set S to the empty set
• For k ranging from 1 to n:
  • if the cost of covering everything but S with k balls is much larger than the cost of k-clustering with z outliers, then add the z optimal outliers to S
• Obviously correct
• Not much control over |S|: potentially S can be updated at every iteration

Greedy & sparsification (a schematic code sketch follows these slides)
• Let S’ be S together with the z optimal outliers for k-clustering
• Obtain the new S from S’ via sparsification: remove a point x from S’ if
  • either x is at distance ≤ 2 · OPT_{k,z} from the complement of S’,
  • or there are more than z points in the ball B(x, 2 · OPT_{k,z})
[Figure: a point x is removed from S’ when its ball B(x, 2 · OPT_{k,z}) reaches X \ S’ or contains more than z points]

Quality
• Fix k and suppose the resulting S does not contain some outlier x from the optimal k-clustering with z outliers
• Suppose x was added during iteration k; then it must have been removed later, during some iteration k’ ≥ k
• So either the ball B(x, 2 · OPT_{k’,z}) has cardinality > z, or x is (2 · OPT_{k’,z})-close to some point y from the complement of S’
• Case 1: y is not an outlier in the best k-clustering; then attach x to y
• Case 2: y is an outlier; proceed by induction (crucial: the distances telescope)
[Figure: x is charged to a nearby point y in X \ S’ at distance at most 2 · OPT_{k’,z}]

Size
• At every iteration |S| ≤ z²
• Consider an update during step k:
  • there are < z clusters that consist exclusively of points from the old S (true, since we demanded an update from the old S)
  • points outside of these “exclusive” clusters are removed from S
  • large “exclusive” clusters (of size > z) are removed from S
  • at most z points are added: ≤ (z − 1) · z + z = z² points in total

Lower bound
• Will sketch an Ω(z log z) lower bound; for Ω(z²) see the paper
• Say z = 4
[Figure: lower-bound instance with inter-point distances at geometric scales Δ, Δ², Δ³]

Additional results and applications
• One cannot obtain a set of f(z) outliers that is 1-competitive for every k
• After finding a universal set of outliers of size O(z²), one can run the algorithm from (Dasgupta, Long 2005) and obtain a hierarchical clustering with O(z²) outliers that is O(1)-competitive with OPT_{k,z} for every k
• Maybe z outliers suffice for the hierarchical case (with different sets of outliers for different k)? No, see the paper for details!
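A minimal sketch of the farthest-point greedy heuristic behind the 2-approximation cited on the k-center slide (in the style of Gonzalez). The point representation and the `dist` callable are illustrative assumptions, not part of the slides.

```python
def greedy_k_center(points, k, dist):
    """Pick k centers greedily; the covering radius is at most twice optimal."""
    centers = [points[0]]                      # arbitrary first center
    # d[i] = distance from points[i] to its closest chosen center so far
    d = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: d[j])   # farthest point
        centers.append(points[i])
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers, max(d)                     # centers and the covering radius
```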
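A rough sketch in the spirit of the 3-approximation with outliers of Charikar, Khuller, Mount and Narasimhan: guess the optimal radius r, greedily pick the center whose r-ball covers the most uncovered points, and discard everything within 3r of it. The function name, the `dist` callable, and the outer search over candidate radii are assumptions for illustration.

```python
def k_center_with_outliers(points, k, z, dist, r):
    """For a guessed radius r: return k centers covering all but at most z
    points within radius 3r, or None if the guess r is too small."""
    uncovered = set(range(len(points)))
    centers = []
    for _ in range(k):
        # pick the point whose r-ball contains the most still-uncovered points
        best = max(range(len(points)),
                   key=lambda c: sum(1 for j in uncovered
                                     if dist(points[c], points[j]) <= r))
        centers.append(points[best])
        # charge everything within the expanded 3r-ball to this center
        uncovered = {j for j in uncovered
                     if dist(points[best], points[j]) > 3 * r}
    return centers if len(uncovered) <= z else None
```

A typical way to use such a routine is to try the pairwise distances as candidate values of r in increasing order; the smallest r for which it returns a solution yields covering radius at most 3r.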
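A schematic transcription of the greedy-plus-sparsification construction as described on the slides, not the paper's exact algorithm. It assumes idealized oracles `opt_with_outliers(k)` (returning OPT_{k,z} and the z optimal outliers) and `cost_without(k, S)` (the cost of covering X \ S with k balls), hashable points, and a placeholder constant C for the "much larger" test.

```python
def universal_outliers(X, z, dist, opt_with_outliers, cost_without, C=4):
    """Schematic greedy construction of a universal outlier set S with |S| <= z^2."""
    S = set()
    for k in range(1, len(X) + 1):
        opt_kz, outliers = opt_with_outliers(k)   # OPT_{k,z} and its z outliers
        if cost_without(k, S) <= C * opt_kz:
            continue                              # current S already works for this k
        S_prime = S | set(outliers)               # add the z optimal outliers
        new_S = set()
        for x in S_prime:
            # sparsification: drop x if it is 2*OPT_{k,z}-close to X \ S',
            # or if its 2*OPT_{k,z}-ball contains more than z points
            near_outside = any(dist(x, y) <= 2 * opt_kz
                               for y in X if y not in S_prime)
            big_ball = sum(1 for y in X if dist(x, y) <= 2 * opt_kz) > z
            if not (near_outside or big_ball):
                new_S.add(x)
        S = new_S
    return S
```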
Conclusions and open problems
• Introduced the notion of universal outliers for k-center clustering
• Tight bounds
• Applications to hierarchical clustering
• Open problems:
  • generalize to k-medians, k-means, and other optimization problems
  • improve the approximation factors: currently a 28-approximation if we know OPT_{k,z} exactly, and a 2163-approximation if we insist on running in polynomial time
  • find interesting classes of metrics where the z² bound can be improved
• Questions?