Stability Yields a PTAS for
k-Median and k-Means Clustering
Pranjal Awasthi, Avrim Blum, Or Sheffet
Carnegie Mellon University
November 3rd, 2010
Stability Yields a PTAS for
k-Median and k-Means Clustering
1. Introduce k-Median / k-Means problems.
2. Define stability
I. Previous notion [ORSS06]
II. Weak Deletion Stability
III. β-distributed instances
3. The algorithm for k-Median
4. Conclusion + open problems.
Clustering In Real Life
Clustering: come up with the desired partition
Clustering in a Metric Space
Clustering: come up with the desired partition
Input:
• n points
• A distance function d: points × points → R≥0 satisfying:
  – Reflexive: ∀ p, d(p,p) = 0
  – Symmetry: ∀ p,q, d(p,q) = d(q,p)
  – Triangle Inequality: ∀ p,q,r, d(p,q) ≤ d(p,r) + d(r,q)
• Goal: output a k-partition
  (k is large, e.g. k = polylog(n))
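To make the definition concrete, here is a minimal Python sketch (not from the talk; the 4-point distance matrix is a made-up example) that checks the three metric axioms:

```python
import itertools

# Hypothetical 4-point metric given as a symmetric distance matrix.
d = [
    [0, 2, 3, 4],
    [2, 0, 2, 3],
    [3, 2, 0, 2],
    [4, 3, 2, 0],
]
n = len(d)

# Reflexive: d(p, p) = 0
assert all(d[p][p] == 0 for p in range(n))
# Symmetry: d(p, q) = d(q, p)
assert all(d[p][q] == d[q][p] for p in range(n) for q in range(n))
# Triangle inequality: d(p, q) <= d(p, r) + d(r, q)
assert all(d[p][q] <= d[p][r] + d[r][q]
           for p, q, r in itertools.product(range(n), repeat=3))
```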
k-Median
• Input:
  1. n points in a finite metric space
  2. k
• Goal:
  – Partition the points into k disjoint subsets: C*1, C*2, …, C*k
  – Choose a center c*i per subset
  – Cost: cost(C*i) = Σ_{x∈C*i} d(x, c*i)
  – Cost of partition: Σ_i cost(C*i)
• Given centers ⇒ easy to get the best partition
• Given partition ⇒ easy to get the best centers
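The two "easy" directions lend themselves to a short sketch. This is an illustrative Python rendering, not the paper's code; the function names and the `dist` callback are my own. Given centers, assign each point to its nearest center; given a partition, try every point as a candidate center:

```python
def best_partition(points, centers, dist):
    """Given centers, the best partition assigns each point to its
    nearest center (centers are assumed hashable)."""
    clusters = {c: [] for c in centers}
    for p in points:
        clusters[min(centers, key=lambda c: dist(p, c))].append(p)
    return clusters

def best_center(cluster, points, dist):
    """Given one cluster, the best k-median center is the point
    minimizing the sum of distances to the cluster: try every candidate."""
    return min(points, key=lambda c: sum(dist(x, c) for x in cluster))

def kmedian_cost(clusters, dist):
    """cost = sum over clusters of sum_{x in C_i} d(x, c_i)."""
    return sum(sum(dist(x, c) for x in cluster)
               for c, cluster in clusters.items())
```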
k-Means
• Input:
  1. n points in Euclidean space
  2. k
• Goal:
  – Partition the points into k disjoint subsets: C*1, C*2, …, C*k
  – Choose a center c*i per subset
  – Cost: cost(C*i) = Σ_{x∈C*i} d²(x, c*i)
  – Cost of partition: Σ_i cost(C*i)
• Given centers ⇒ easy to get the best partition
• Given partition ⇒ easy to get the best centers
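For k-means the "best centers given a partition" step even has a closed form: with squared Euclidean cost, the optimal center of a cluster is its mean. A small sketch assuming numpy (the sample cluster is made up):

```python
import numpy as np

def best_kmeans_center(cluster):
    """For squared Euclidean cost, the minimizer of sum_x ||x - c||^2
    over all c in R^d is the centroid (mean) of the cluster."""
    return np.mean(cluster, axis=0)

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
c = best_kmeans_center(cluster)      # array([1., 1.])
cost = np.sum((cluster - c) ** 2)    # sum_x d^2(x, c)
```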
We Would Like To…
• Solve the k-median / k-means problems.
• NP-hard to compute OPT (= cost of the optimal partition).
• Find a c-approximation algorithm:
  – a poly-time algorithm guaranteed to output a clustering whose cost ≤ c·OPT.
• Ideally, find a PTAS (Polynomial Time Approximation Scheme):
  – a c-approximation algorithm where c = (1+ε), for any ε > 0;
  – runtime can be exponential in 1/ε.
Related Work
• Small k:
  – k-Median / k-Means: easy (try all centers) in time n^k.
  – k-Means: PTAS, exponential in (k/ε) [KSS04].
• General k:
  – k-Median: (3+ε)-apx [GK98, CGTS99, AGKMMP01, JMS02, dlVKKR03];
    (1.367...)-apx hardness [GK98, JMS02].
  – k-Means: 9-apx [OR00, BHPI02, dlVKKR03, ES04, HPM04, KMNPSW02].
  – No PTAS!
• Special cases:
  – Euclidean k-Median [ARR98]: PTAS if the dimension is small ((loglog n)^c).
  – Stable instances [ORSS06].
• We focus on large k (e.g. k = polylog(n)).
• Runtime goal: poly(n, k).
World
All possible instances
[ORSS Result (k-Means): figure walk-through of an example instance, asking "Why use 5 sites?"]
ORSS Result (k-Means)
• Instance is stable if OPT(k−1) > (1/α)²·OPT(k)   (requires 1/α > 10).
• They give a (1+O(α))-approximation.
Our Result (k-Means)
• Instance is stable if OPT(k−1) > (1+α)·OPT(k)   (requires only α > 0).
• We give a PTAS (a (1+ε)-approximation for every ε > 0).
• Runtime: poly(n, k) · exp(1/α, 1/ε).
Philosophical Note
• Stable instances: ∃ α > 0 s.t. OPT(k−1) > (1+α)·OPT(k).
• Unstable instances: ∀ α > 0, OPT(k−1) ≤ (1+α)·OPT(k).
  – So a (1+α)-approximation can return a (k−1)-clustering.
  – In fact, any PTAS can return a (k−1)-clustering.
  – It is not really a k-clustering problem; it is a (k−1)-clustering problem!
• If we believe our instance inherently has k clusters, stability is a
  "necessary condition" to guarantee that a PTAS returns a "meaningful" clustering.
• Our result: it is also a sufficient condition to get a PTAS.
World
All possible instances
Any (k−1)-clustering is significantly costlier than OPT(k)
ORSS Stable
A Weaker Guarantee
[figure walk-through, revisiting the "Why use 5 sites?" example]
(1+®)-Weak Deletion Stability
• Consider OPT(k).
• Take any cluster C*i and reassign all of its points to some other optimal center c*j.
• This must increase the cost to at least (1+α)·OPT(k).
1. An obvious relaxation of ORSS-stability: the reassignment leaves only k−1
   nonempty clusters, so ORSS-stability already forces its cost up.
2. Our result: this weaker condition suffices to get a PTAS.
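In symbols, one way to formalize the slide (here OPT_{i→j} denotes the cost after reassigning C*i's points to c*j):

```latex
\[
\text{(1+}\alpha\text{)-weak deletion stability:}\quad
\forall\, i \neq j:\;
\mathrm{OPT}_{i \to j} \;:=\;
\mathrm{OPT} + \sum_{x \in C^*_i}\bigl(d(x, c^*_j) - d(x, c^*_i)\bigr)
\;\geq\; (1+\alpha)\,\mathrm{OPT}.
\]
```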
World
All possible instances
ORSS Stable
Weak-Deletion Stable
Merging any two clusters of OPT(k) increases the cost significantly
¯-Distributed Instances
For every cluster C*i and every point p not in C*i, we have:
d(p, c*i) > β · (OPT / |C*i|)
(for k-means: the same form with squared distances)
We show that:
• k-median: (1+α)-weak deletion stability ⇒ (α/2)-distributed.
• k-means: (1+α)-weak deletion stability ⇒ (α/4)-distributed.
Claim: (1+®)-Weak Deletion Stability )
(®/2)-Distributed
Let p ∉ C*i, and let C*j be p's cluster in OPT. Reassigning C*i's points to c*j gives:
α·OPT ≤ Σ_{x∈C*i} d(x, c*j) − Σ_{x∈C*i} d(x, c*i)
      ≤ Σ_{x∈C*i} [d(x, c*i) + d(c*i, c*j)] − Σ_{x∈C*i} d(x, c*i)
      = |C*i| · d(c*i, c*j)
⇒ α·(OPT/|C*i|) ≤ d(c*i, c*j) ≤ d(c*i, p) + d(p, c*j) ≤ 2·d(c*i, p),
where the last step uses that OPT assigns p to its nearest center, so d(p, c*j) ≤ d(p, c*i).
World
All possible instances
ORSS Stable
Weak-Deletion Stable
In the optimal solution: large distance between a center and any "outside" point
β-Distributed
Main Result
• We give a PTAS for β-distributed k-median and k-means instances.
• Running time: poly(n, k) · exp(1/β, 1/ε).
• There are NP-hard β-distributed instances,
  so superpolynomial dependence on 1/ε is unavoidable!
Stability Yields a PTAS for
k-Median and k-Means Clustering
1. Introduce k-Median / k-Means problems.
2. Define stability
3. PTAS for k-Median
I. High level description
II. Intuition (“if only we knew more…”)
III. Description
4. Conclusion + open problems.
k-Median Algorithm’s Overview
Input: metric, k, β, OPT
0. Handle "extreme" clusters
   (brute-force guessing of some clusters' centers)
1. Populate L with components
   (L := list of "suspected" clusters' "cores")
2. Pick the best center in each component
3. Try all possible k-centers
k-Median Algorithm’s Overview
Input: metric, k, β, OPT
0. Handle "extreme" clusters
   (brute-force guessing of some clusters' centers)
1. Populate L with components
2. Pick the best center in each component
3. Try all possible k-centers
We need:
• the right definition of "core";
• to get the core of each cluster;
• L must not get too big.
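As an illustration of stages 2 and 3, here is a hedged Python sketch (the names are mine, not the paper's; it assumes L already holds the candidate cores and contains at least k components, which the analysis bounds by k + O(1/β)):

```python
from itertools import combinations

def stages_2_and_3(L, points, k, dist):
    # Stage 2: best k-median center (medoid) of each suspected core.
    candidates = [min(comp, key=lambda c: sum(dist(x, c) for x in comp))
                  for comp in L]

    # Stage 3: try every k-subset of candidate centers, keep the cheapest.
    def cost(centers):
        return sum(min(dist(p, c) for c in centers) for p in points)

    return min(combinations(candidates, k), key=cost)
```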
Intuition: “Mind the Gap”
• We know: every p outside C*i has d(p, c*i) > β·(OPT/|C*i|).
• In contrast, an "average" cluster contributes only OPT/k to the cost.
• So an "average" point p in an "average" cluster C*i has
  d(p, c*i) ≈ OPT/(k·|C*i|), far below the β·(OPT/|C*i|) threshold when βk is large.
Intuition: “Mind the Gap”
• We know: every p outside C*i has d(p, c*i) > β·(OPT/|C*i|).
• Denote by core(C*i) the points of C*i within distance (β/8)·(OPT/|C*i|) of c*i.
Intuition: “Mind the Gap”
• We know: every p outside C*i has d(p, c*i) > β·(OPT/|C*i|).
• Denote by core(C*i) the points of C*i within distance (β/8)·(OPT/|C*i|) of c*i.
• Formally, call cluster C*i cheap if cost(C*i) ≤ (εβ/32)·OPT.
• Assume for now that all clusters are cheap.
• In general: we brute-force guess the O(1/(βε)) centers of expensive clusters in Stage 0.
Intuition: “Mind the Gap”
• We know: every p outside C*i has d(p, c*i) > β·(OPT/|C*i|).
• Denote by core(C*i) the points of C*i within distance (β/8)·(OPT/|C*i|) of c*i.
• Formally, call cluster C*i cheap if cost(C*i) ≤ (εβ/32)·OPT.
• Markov: at most an (ε/4)-fraction of the points of a cheap cluster lie outside the core.
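The Markov computation behind this bullet, using the cheapness threshold and core radius reconstructed above (cost(C*i) ≤ (εβ/32)·OPT and radius r/8 with r = β·OPT/|C*i|):

```latex
\[
\#\{x \in C^*_i : d(x, c^*_i) > r/8\}
\;\le\; \frac{\mathrm{cost}(C^*_i)}{r/8}
\;\le\; \frac{(\varepsilon\beta/32)\,\mathrm{OPT}}{\beta\,\mathrm{OPT}/(8\,|C^*_i|)}
\;=\; \frac{\varepsilon}{4}\,|C^*_i|.
\]
```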
Intuition: “Mind the Gap”
• We know: every p outside C*i has d(p, c*i) > β·(OPT/|C*i|).
• Denote by core(C*i) the points of C*i within distance (β/8)·(OPT/|C*i|) of c*i.
• Formally, call cluster C*i cheap if cost(C*i) ≤ (εβ/32)·OPT.
• Markov: in particular, at least half of the points of a cheap cluster lie inside the core.
Magic (r/4) Ball
Denote r = β·(OPT/|C*i|).
If p belongs to the core ⇒ B(p, r/4) contains ≥ |C*i|/2 points.
Call a ball "heavy" if it has mass ≥ |C*i|/2.
(Recall: the core has radius r/8 around c*i, while every point outside C*i is at distance > r.)
Magic (r/4) Ball
1. Draw a ball of radius r/4 around every point.
2. Unite "heavy" balls whose centers overlap.
Denote r = β·(OPT/|C*i|).
All points in the core are merged into one set!
(Any two core points are within 2·(r/8) = r/4 of each other, so their heavy balls' centers overlap.)
Magic (r/4) Ball
1. Draw a ball of radius r/4 around every point.
2. Unite "heavy" balls whose centers overlap.
Denote r = β·(OPT/|C*i|).
Could we merge core points with points from other clusters?
Magic (r/4) Ball
1. Draw a ball of radius r/4 around every point.
2. Unite "heavy" balls whose centers overlap.
Denote r = β·(OPT/|C*i|).
Suppose a heavy ball B(p, r/4) could link the core to another cluster;
such a p satisfies r/2 ≤ d(p, c*i) ≤ 3r/4.
Then every x ∈ B(p, r/4) satisfies:
r/4 = r/2 − r/4 ≤ d(x, c*i) ≤ 3r/4 + r/4 = r
Magic (r/4) Ball
1. Draw a ball of radius r/4 around every point.
2. Unite "heavy" balls whose centers overlap.
Denote r = β·(OPT/|C*i|).
r/4 ≤ d(x, c*i) ≤ r ⇒ x falls outside the core (radius r/8),
yet x belongs to C*i (every point outside C*i is at distance > r).
Magic (r/4) Ball
1. Draw a ball of radius r/4 around every point.
2. Unite "heavy" balls whose centers overlap.
Denote r = β·(OPT/|C*i|).
So such a heavy ball would put more than |C*i|/2 points outside the core,
contradicting Markov (at least half the points of a cheap cluster lie inside the core). ⇒⇐
Finding the Right Radius
1. Draw a ball of radius r/4 around every point.
2. Unite "heavy" balls whose centers overlap.
Denote r = β·(OPT/|C*i|).
• Problem: we don't know |C*i|.
• Solution: try all sizes, in order!
  – For s = n, n−1, n−2, …, 1:
  – set r_s = β·(OPT/s).
• Complication: when s gets small (s = 4, 3, 2, 1) we collect many "leftovers" of one cluster.
• Solution: once we add a subset to L, we remove all close-by points.
Population Stage
• For s = n, n−1, n−2, …, 1:
  – Set r_s = β·(OPT/s).
  – Draw a ball of radius r_s/4 around each point.
  – Unite balls containing ≥ s/2 points whose centers overlap.
  – Once a set of size ≥ s/2 is found:
    • put this set in L;
    • remove from the instance all points in an (r_s/2) "buffer zone" around it.
Remainder of the proof:
1. Even with the "buffer zones", we still collect all cores.
2. The number of components in L without core points is O(1/β).
3. cost(k centers chosen from the cores) ≤ (1+ε)·OPT.
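Putting the stage together, a Python sketch of the population loop as described on the slide (illustrative and unoptimized; `dist`, `beta`, and `opt` are assumed inputs, and the overlap/buffer radii follow the reconstruction above):

```python
def population_stage(points, dist, beta, opt):
    """Stage 1 sketch: collect suspected cluster cores into the list L."""
    L, active = [], set(points)
    for s in range(len(points), 0, -1):      # try all sizes, in order
        r = beta * opt / s                   # r_s = beta * (OPT / s)
        # Centers of "heavy" balls: B(p, r/4) holding >= s/2 active points.
        heavy = [p for p in active
                 if sum(1 for q in active if dist(p, q) <= r / 4) >= s / 2]
        # Unite heavy balls whose centers overlap (centers within r/4).
        comps = []
        for p in heavy:
            touching = [c for c in comps
                        if any(dist(p, q) <= r / 4 for q in c)]
            merged = {p}.union(*touching) if touching else {p}
            comps = [c for c in comps if c not in touching] + [merged]
        for comp in comps:
            # The component's set: all active points covered by its balls.
            covered = {q for q in active
                       if any(dist(q, p) <= r / 4 for p in comp)}
            if len(covered) >= s / 2:
                L.append(covered)
                # Buffer zone: drop everything within r/2 of the new set.
                active -= {q for q in active
                           if any(dist(q, x) <= r / 2 for x in covered)}
    return L
```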
A Note About k-Means
• Roughly the same algorithm; the constants get roughly squared.
• Problem:
  – We can't brute-force guess centers for expensive clusters
    (the optimal center is a centroid, which can be any point of R^d).
• Solution:
  – A random sample of O(1/ε) points from each cluster approximates its center of mass.
  – Brute-force guess O(1/ε) points from each of the O(1/(βε)) expensive clusters.
• Better solution:
  – Randomly sample O(1/ε) points from expensive clusters whose size is at least
    a poly(1/k) fraction of the instance.
  – Slight complication: introduce intervals.
• Expected runtime:
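A quick numerical illustration of the sampling fact used here (assumes numpy; the Gaussian cluster is synthetic): the mean of a small random sample approximates the cluster's center of mass, and the extra k-means cost incurred by using the approximate center is exactly n·‖approx − true‖².

```python
import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(loc=[5.0, -3.0], scale=1.0, size=(10_000, 2))

true_center = cluster.mean(axis=0)
sample = cluster[rng.choice(len(cluster), size=20, replace=False)]
approx_center = sample.mean(axis=0)

# By the bias-variance identity, sum_x ||x - c||^2 equals
# sum_x ||x - mu||^2 + n * ||c - mu||^2, so the extra cost is:
extra_cost = len(cluster) * np.sum((approx_center - true_center) ** 2)
```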
Conclusion
• World: ORSS Stable ⊂ Weak-Deletion Stable ⊂ β-Distributed.
• ∀ ε > 0, a (1+ε)-approximation algorithm for β-distributed instances
  of k-median / k-means.
• Improve the constants?
• Other clustering objectives (k-center)?
Take Home Message
Life gives you a k-median instance.
– "Can you solve it?"
– "NO!!!" (But that's not new!)
Stability = a belief that a PTAS is meaningful.
This belief allows us to introduce a PTAS!
• Stability gives us an "Archimedean point" that allows us to bypass NP-hardness.
• To what other NP-hard problems does similar logic apply?
Thank you!
World
All possible instances
BBG Stable+
ORSS Stable
Weak-Deletion Stable
β-Distributed
BBG Result
• We have a target clustering.
• k-median is a proxy: the target is close to OPT(k).
• Problem: k-median is NP-hard.
• Solution: use an approximation algorithm.
• We would like: our (1+α)-approximation algorithm outputs
  a meaningful k-clustering.
• Implicit assumption: any k-clustering with cost at most (1+α)·OPT
  is δ-close (pointwise) to the target.
BBG Result
• Instance is (BBG) stable if: any two k-partitions with cost ≤ (1+α)·OPT(k)
  differ over no more than a (2δ)-fraction of the input.
• They give an algorithm that gets O(δ/α)-close to the target.
• Additionally (k-median): if all clusters have size Ω(δn/α),
  then it gets δ-close to the target.
• Our result: if BBG-stability holds and all clusters are > 2δn,
  then PTAS for k-median (which implies getting δ-close to the target).
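Formally, restating the first bullet (C, C' range over k-partitions, and dist(C, C') is the fraction of points clustered differently under the best matching between the two partitions):

```latex
\[
\text{(BBG stability):}\quad
\mathrm{cost}(\mathcal{C}) \le (1+\alpha)\,\mathrm{OPT}(k)
\;\wedge\;
\mathrm{cost}(\mathcal{C}') \le (1+\alpha)\,\mathrm{OPT}(k)
\;\Longrightarrow\;
\mathrm{dist}(\mathcal{C}, \mathcal{C}') \le 2\delta .
\]
```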
Claim: BBG-Stability & Large Clusters )
(1+®)-Weak Deletion Stability
• We know:
  (i) any two k-partitions with cost ≤ (1+α)·OPT(k) differ over
      at most a 2δ-fraction of the input;
  (ii) all clusters contain > 2δn points.
• Take the optimal k-clustering.
• Take C*i and move all of its points except c*i to C*j.
• By (ii), the new partition and OPT differ on > 2δn points,
  i.e. on more than a 2δ-fraction, so by (i) its cost exceeds (1+α)·OPT(k).
• Hence cost(OPT_{i→j}) ≥ cost(new partition) ≥ (1+α)·OPT(k).