Data reduction for weighted and
outlier-resistant clustering
Leonard J. Schulman
Caltech
joint with
Dan Feldman
MIT
Talk outline
• Clustering-type problems:
– k-median
– weighted k-median
– k-median with m outliers (small m)
– k-median with penalty (clustering with many outliers)
– k-line median
• Unifying framework: tame loss functions
• Core-sets, a.k.a. ε-approximations
• Common existence proof and algorithm
Voronoi regions have spherical boundaries
k-Median with penalty
k-Median with penalty: good for outliers
2-median clustering of a data set:
Same data set plus an outlier:
Now cluster with h-robust loss function:
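As a toy numerical illustration (my own example, not from the slides; the data, the single center, and the threshold h = 5 below are all made up), a 1-median version on the line shows how the h-robust loss min{h, dist(p,c)} caps the influence of a single outlier:

# One outlier dominates the plain 1-median cost, but contributes at most h
# to the h-robust cost.
points = [0.0, 1.0, 2.0, 3.0]
outlier = 1000.0
h = 5.0
center = 1.5

def median_cost(ps, c):
    return sum(abs(p - c) for p in ps)

def robust_cost(ps, c, h):
    return sum(min(h, abs(p - c)) for p in ps)

print(median_cost(points + [outlier], center))      # 1002.5: dominated by the outlier
print(robust_cost(points + [outlier], center, h))   # 9.0: the outlier adds only h = 5.0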
Related work and our results
(Table columns: Problem, Approx., Time, Reference)
• Charikar et al., SODA'01 vs. our result
• K. Chen, SODA'08 vs. our result
• Har-Peled, FSTTCS'06 vs. our result
• Feldman, Fiat, Sharir, FOCS'06 vs. our result
Why are all these problems in the same paper?
In each case the objective function is a suitably
tame “loss function”.
The loss in representing a point p by a center c is:
k-median: D(p) = dist(p,c)
Weighted k-median: D(p) = w · dist(p,c)
Robust k-median: D(p) = min{h, dist(p,c)}
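A minimal sketch of these three loss functions in code (my own illustration; Euclidean distance on coordinate tuples is assumed, each point is charged to its nearest center in the query C, and the weighted variant puts the weights on the centers, which is the case the later slides call the hard part):

import math

# Loss of a single point p under a query C (a collection of centers).
def dist(p, c):
    return math.dist(p, c)                             # Euclidean distance, points as tuples

def loss_kmedian(p, C):
    return min(dist(p, c) for c in C)                  # nearest-center distance

def loss_weighted_kmedian(p, C_weighted):
    return min(w * dist(p, c) for c, w in C_weighted)  # C_weighted: (center, weight) pairs

def loss_robust_kmedian(p, C, h):
    return min(h, loss_kmedian(p, C))                  # any point's contribution capped at h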
What qualifies as a “tame” loss function?
Log-Log Lipschitz (LgLgLp)
condition on the loss function
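A sketch of the standard formulation of the condition (the exact notation and constant on the original slide may differ): a loss f : [0,∞) → [0,∞), typically assumed monotone non-decreasing, is ρ-log-log Lipschitz for a constant ρ if g(z) = log f(e^z) satisfies
g(z + Δ) ≤ g(z) + ρ·Δ   for every z and every Δ ≥ 0;
equivalently, f(Δ·x) ≤ Δ^ρ · f(x) for all x > 0 and Δ ≥ 1.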
Many examples of LgLgLp loss functions:
Robust M-estimators in Statistics
figure: Z. Zhang
Classic Data Reduction
Same notion for LgLgLp loss functions
k-clustering core-set for loss D
Weighted-k-clustering core-set for loss D
Handling arbitrary-weight centers is the “hard part”
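For concreteness, a minimal sketch of the core-set guarantee in code (my own rendering of the standard notion; the function names are made up, D is whichever loss is in play, and the check enumerates queries by brute force, which no real construction does):

def cost(P, C, D):
    return sum(D(p, C) for p in P)                    # true loss of the full data set

def weighted_cost(S, C, D):
    return sum(w * D(p, C) for p, w in S)             # S: list of (point, weight) pairs

def is_coreset(P, S, queries, D, eps):
    # S is an eps-core-set w.r.t. the given queries if its weighted loss is
    # within a (1 ± eps) factor of the true loss for every query C
    return all(abs(weighted_cost(S, C, D) - cost(P, C, D)) <= eps * cost(P, C, D)
               for C in queries)

For a weighted-(D,k)-core-set the queries C range over sets of k centers with arbitrary weights, which is exactly what makes the construction hard.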
Our main technical result
1. For every LgLgLp loss function D on a metric space, for every set P of n points, there is a weighted-(D,k)-core-set S of size
|S| = O(log² n)
(In more detail: |S| = (d·k^O(k)/ε²)·log² n in ℝ^d. For finite metrics, d = log n.)
2. S can be computed in time O(n)
Sensitivity [Langberg and S, SODA’11]
The sensitivity of a point p ∈ P determines how important it is to include p in a core-set:
s(p) = max_C  D_W(p,C) / Σ_{q∈P} D_W(q,C)
Why this works:
If s(p) is small, then p has many “surrogates” in the data, and we can take any one of them for the core-set.
If s(p) is large, then there is some C for which p alone
contributes a significant fraction of the loss, so we need
to include p in any core-set.
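Rendered as brute-force code (for intuition only; my own illustration, with queries drawn from some finite candidate set, whereas the definition takes the max over all queries and the algorithm below only ever computes upper bounds):

def sensitivity(p, P, queries, D):
    # largest fraction of the total loss that p alone can be responsible for
    return max(D(p, C) / sum(D(q, C) for q in P) for C in queries)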
Total sensitivity
The total sensitivity T(P) is the sum of the
sensitivities of all the points:
T(P)=sP s(p)
The total sensitivity of the problem is the maximum
of T(P) over all input sets P.
Total sensitivity ~ n: cannot have small core-sets.
Total sensitivity constant or polylog: there may
exist small core-sets.
Small total sensitivity ⇒ small core-set
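The route from a small total sensitivity to a small core-set is the sampling scheme of [Langberg and S, SODA’11]: draw points with probability proportional to (upper bounds on) their sensitivities and reweight by the inverse probability. A minimal sketch, with the sample size m left as a parameter and the function name my own:

import random

def sample_coreset(P, s_ub, m):
    # s_ub: dict mapping each point (hashable, e.g. a tuple) to an upper
    # bound on its sensitivity; draw m points i.i.d. proportionally to s_ub
    # and reweight so the weighted loss is an unbiased estimate of the full
    # loss for every query
    total = sum(s_ub[p] for p in P)
    sample = random.choices(P, weights=[s_ub[p] for p in P], k=m)
    return [(p, total / (m * s_ub[p])) for p in sample]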
The main thing we need to do in order
to produce a small core-set for
weighted-k-median:
For each p ∈ P compute a good upper bound on s(p) in amortized O(1) time per point.
(The upper bounds should be tight enough that their sum is small.)
Algorithm for computing sensitivities
Recursive-Robust-Median(P,k)
• Input:
– A set P of n points in a metric space
– An integer k ≥ 1
• Output:
– A subset Q ⊆ P of Ω(n/k^k) points
We prove that any two points in Q can serve as each other’s surrogates w.r.t. any query. Hence each point p ∈ Q has sensitivity s(p) ≤ O(1/|Q|).
Outer loop (sketched in code below): call Recursive-Robust-Median(P,k), then set P := P - Q. Repeat until P is empty.
Total sensitivity bound: T ≤ O(# calls to Recursive-Robust-Median) = O(k^k log n).
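The outer loop in code (a sketch; recursive_robust_median stands for the subroutine described on the next slides, and both function names are placeholders of my own):

def sensitivity_upper_bounds(P, k, recursive_robust_median):
    # Peel off a large set Q of mutual surrogates, give each of its points
    # the bound 1/|Q| (constants suppressed), remove Q, and repeat.
    s_ub = {}
    remaining = set(P)
    while remaining:
        Q = recursive_robust_median(remaining, k)    # |Q| = Omega(|remaining| / k**k)
        for p in Q:
            s_ub[p] = 1.0 / len(Q)
        remaining -= set(Q)
    return s_ub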
The algorithm to find the Ω(n)-size set Q:
Recursive-Robust-Median: illustration
(illustration figures omitted; the label c* marks a center)
A detail
Actually it’s more complicated than described, because we can’t afford to look for a (1+ε)-approximation, or even a 2-approximation, to the best k-median of any b·n points (b constant).
Instead we look for a bicriteria approximation: a 2-approximation of the best k-median of any b·n/2 points. Linear-time algorithm from [Feldman, Langberg, STOC’11].
High-level intuition for the correctness of
Recursive-Robust-Median
Consider any p in the “output” set Q.
If for all queries C, D(p,C) is small, then p has low
sensitivity.
If there is a query C for which D(p,C) is large, then in that query all points of Q are assigned to the same center c ∈ C, and are closer to each other than to c; so they are surrogates for one another.
Thank you
appendices
Many examples of LgLgLp loss functions:
Robust M-estimators in Statistics
(table of M-estimators, figure: Z. Zhang)
M-estimators: Huber, “fair”, Cauchy, Geman-McClure, Welsch, Tukey, Andrews