slides - Ilya Razenshteyn

advertisement
Robust hierarchical k-center
clustering
Ilya Razenshteyn (MIT)
Silvio Lattanzi (Google), Stefano Leonardi (Sapienza University of Rome)
and Vahab Mirrokni (Google)
k-Center clustering
• Given: n-point metric space (symmetric distance, triangle inequality)
• Goal: cover all points with k balls of the smallest radius
• Simple 2-approximation, NP-hard to approximate better (Gonzalez
1985), (Hochbaum, Shmoys 1986)
k-Center clustering with z outliers
• Given: n-point metric space (symmetric distance, triangle inequality)
• Goal: cover all but z points with k balls of the smallest radius
• Simple 3-approximation, NP-hard to approximate better (Charikar,
Khuller, Mount, Narasimhan 2001)
Universal outliers
• The set of z outliers depends on k
• Is there a set of outliers that “works” for every k?
• Notation: OPTk,z is the cost of the optimal k-center clustering with z
outliers
• Formalization:
• Universal set S of size f(z)
• For every k one can cover everything but S with k balls of radius O(1) • OPTk,z
• The main result: one can always achieve f(z) = z2, and this is tight (up
to a constant)
Greedy construction
• Set S to the empty set
• For k ranging from 1 to n
• If the cost of covering everything but S with k balls is much larger than the
cost of k-clustering with z outliers, then
• Add z optimal outliers to S
• Obviously correct
• Not much control over |S|: potentially can update S at every iteration
Greedy & Sparsification
• Let S’ be equal to S together with z optimal outliers for k-clustering
• Obtain the new S from S’ via sparsification: remove a point x from S’
if
• Either x is at distance ≤ 2 • OPTk,z from the complement of S’
• There are more than z points in the ball B(x, 2 • OPTk,z)
2 • OPTk,z
2 • OPTk,z
X \ S’
S’
>z
x
remove from S’
Quality
• Fix k and suppose the resulting S does not contain some outlier x
from the optimal k-clustering with z outliers
• Suppose x was added during iteration k, so it must have been
removed later (during iteration k’ ≥ k)
• The ball B(x, 2 • OPTk’,z) has cardinality > z
• x was (2 • OPTk’,z)-close y from the complement of S’
2 • OPTk’,z
>z
x
• Case 1: y is not an outlier in the
best k-clustering, then attach x to y
• Case 2: y is an outlier, proceed by
induction, crucial: the distances
telescope
2 • OPTk’,z
X \ S’
y
x
S’
Size
2 • OPTk,z
2 • OPTk,z
X \ S’
S’
>z
x
• At every iteration |S| ≤ z2
remove from S’
• Update during step k
• There are < z clusters that consist exclusively of points from the old S
• True, since we demanded an update from the old S
• Points outside of the “exclusive” clusters are removed from S
• Large “exclusive” clusters (of size > z) are removed from S
• At most z points added: ≤ (z-1) • z + z = z2 points in total
Lower bound
• Will sketch Ω(z log z) lower bound: for Ω(z2) see the paper
• Say z = 4
Δ2
Δ3
Δ
Δ
Δ
Δ2
Δ
Δ2
Δ3
Additional results and applications
• One can’t obtain a set of f(z) outliers that would be 1-competitive for
every k
• After finding a universal set of outliers of size O(z2), one can run the
algorithm from (Dasgupta, Long 2005) and obtain a hierarchical
clustering with O(z2) outliers that is O(1)-competitive with OPTk,z for
every k
• Maybe z outliers is possible for the hierarchical case (different sets of
outliers for different k)? No, see the paper for details!
Conclusions and open problems
• Introduced the notion of universal outliers for k-center clustering
• Tight bounds
• Applications to hierarchical clustering
• Open problems:
• Generalize to k-medians, k-means, other optimization problems…
• Improve approximation factors: right now 28-approximation, if know OPTk,z
exactly, and 2163-, if we insist on running in polynomial time
• Interesting class of metrics, where the bound z2 can be improved
• Questions?
Download