Improved approximation for k-median
Shi Li
Department of Computer Science, Princeton University, Princeton, NJ 08540
04/20/2013

Facility Location Problem
  minimize maintenance cost + transportation cost
  [Figure: example instance labeled with dollar costs]

  BALINSKI, M. L. 1966. On finding integer solutions to linear programs. In Proceedings of the IBM Scientific Computing Symposium on Combinatorial Problems. IBM, New York, pp. 225–248.
  KUEHN, A. A., AND HAMBURGER, M. J. 1963. A heuristic program for locating warehouses.
  STOLLSTEIMER, J. F. 1961. The effect of technical change and output expansion on the optimum number, size and location of pear marketing facilities in a California pear producing region. Ph.D. thesis, Univ. California at Berkeley, Berkeley, Calif.
  STOLLSTEIMER, J. F. 1963. A working model for plant numbers and locations. J. Farm Econom. 45, 631–645.

Uncapacitated Facility Location (UFL)
  F : potential facility locations
  C : set of clients
  f_i, i ∈ F : cost for opening i
  d : metric over F ∪ C
  find S ⊆ F, minimize Σ_{i∈S} f_i + Σ_{j∈C} d(j, S)
    (facility cost + connection cost)
  [Figure: facilities and clients, with dollar costs on the facilities]

Wal-mart Stores in New Jersey
  Question : Suppose you have budget for 50 stores, how will you select 50 locations?

k-median
  F : potential facility locations
  C : set of clients
  k : number of facilities to open
  d : metric over F ∪ C
  find S ⊆ F, |S| = k, minimize Σ_{j∈C} d(j, S)
  [Figure: k-median clustering]

Known Results: UFL
  O(log n)-approximation [Hoc82]
  constant approximations:
    3.16 [STA98], 2.41 [GK99], 3 [JV99], 1.853 [CG99], 1.728 [CG99], 5+ε [Kor00], 1.861 [MMSV01], 1.736 [CS03], 1.61 [JMS02], 1.582 [Svi02], 1.52 [MYZ02], 1.50 [Byr07], 1.488 [Li11]
  1.463-hardness of approximation [GK98]

  4 Deterministic rounding of linear programs
    4.5 The uncapacitated facility location problem
  5 Random sampling and randomized rounding of linear programs
    5.8 The uncapacitated facility location problem
  7 The primal-dual method
    7.6 The uncapacitated facility location problem
  9 Further uses of greedy and local search algorithms
    9.1 A local search algorithm for the uncapacitated facility location problem
    9.4 A greedy algorithm for the uncapacitated facility location problem
  12 Further uses of random sampling and randomized rounding of linear programs
    12.1 The uncapacitated facility location problem

Known Results: k-median
  pseudo-approximation
    1-approx. with O(k log n) facilities [Hoc82]
    2(1+ε)-approx. with (1+1/ε)k facilities [LV92]
  super-constant approximation
    O(log n loglog n) [Bar96, Bar98]
    O(log k loglog k) [CCGS98]

Known Results: k-median
  constant approximation
    LP rounding : 6.667 [CGTS99], 3.25 [CL12]
    Primal-Dual : 6 [JV99], 4 [JMS03], 4 [CG99], 1+√3+ε [LS13]
    Local Search : 3+ε [AGK+01]
  (1+2/e)-hardness of approximation [JMS03]

Lloyd Algorithm [Lloyd82]
  k-means clustering : min total squared distances

k-means vs k-median
  • clustering: k-means is more often used
  • Walmart example: k-median is more appropriate
  • approximation: k-median is “easier”

Local Search
  Can we improve the solution by p swaps?
    No : stop
    Yes : swap and repeat
  Approximation :
    k-median : 3+2/p [AGK+01]
    k-means : (3+2/p)² [KMN+02]

LP for k-median
  y_i : whether to open i
  x_{i,j} : whether to connect j to i
  minimize Σ_{i∈F, j∈C} d(i, j) x_{i,j}
  subject to
    Σ_{i∈F} y_i ≤ k                  (open at most k facilities)
    Σ_{i∈F} x_{i,j} = 1 for all j ∈ C  (client j must be connected)
    x_{i,j} ≤ y_i for all i, j        (client j can only be connected to an open facility)
    x_{i,j}, y_i ≥ 0
  integrality gap is at least 2
  integrality gap is at most 3 (proof non-constructive)

(1+√3+ε)-approximation on k-median

k-median and UFL
  f = cost of a facility
  [Figure: number of open facilities as a function of f]
  Given a black-box α-approximation A for UFL
  Naïve try : find an f such that A opens k facilities
  → α-approximation for k-median?
  No. Proof : α ≈ 1.488 is achievable for UFL [Li11], while α > 1.736 is required for k-median (the (1+2/e)-hardness [JMS03])

k-median and UFL
  Naïve try : find an f such that A opens k facilities
  2 issues with naïve try :
    1. need LMP α-approximation for UFL
         α-approximation :      F + C ≤ α·OPT
         LMP α-approximation :  α·F + C ≤ α·OPT
         (F = facility cost, C = connection cost; LMP = Lagrangian Multiplier Preserving)
    2. can not find f s.t. A opens exactly k facilities
         S1 : set of k1 < k facilities
         S2 : set of k2 > k facilities
         → bi-point solution

k-median and UFL
                LMP approx. factor   bi-point → integral   final ratio for k-median
  [JV]          3                    ×2                    6
  [JMS]         2                    ×2                    4
  our result    2
  LMP approx. factor 2 : do not know how to improve
  bi-point → integral : this factor of 2 is tight!!

bi-point solution
  S1, S2 with k1 = |S1| < k ≤ |S2| = k2
  a, b : a·k1 + b·k2 = k, a + b = 1
  bi-point solution : aS1 + bS2
  cost(aS1 + bS2) = a·cost(S1) + b·cost(S2)

gap-2 instance
  S1 : k1 = 1, cost(S1) = k+1
  S2 : k2 = k+1, cost(S2) = 0
  (1/k)·k1 + ((k−1)/k)·k2 = k
  cost of integral solution = 2

k-median and UFL
                LMP approx. factor   bi-point → integral   bi-point → pseudo-integral   final ratio for k-median
  [JV]          3                    ×2                                                 6
  [JMS]         2                    ×2                                                 4
  our result    2                                          ×(1+√3+ε)/2                  1+√3+ε
  bi-point → integral : this factor of 2 is tight!!
  Main Lemma 1 : suffice to give an α-approximate solution with k+O(1) facilities
  Main Lemma 2 : bi-point solution of cost C → solution of cost (1+√3+ε)/2 · C with k+O(1/ε) facilities

Main Lemma 1
  A : black-box α-approximation with k+c open facilities
  → A' : (α+ε)-approximation with k open facilities
  A' calls A n^{O(c/ε)} times.
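The bi-point weights and the gap-2 instance above can be checked with a few lines of exact arithmetic. This is only an illustrative sketch: the function names `bipoint_weights` and `gap2_instance` are hypothetical, the instance is described purely by the numbers on the slide (k1 = 1, cost(S1) = k+1, k2 = k+1, cost(S2) = 0), and the integral cost of 2 is taken from the slide rather than computed.

```python
from fractions import Fraction


def bipoint_weights(k1, k2, k):
    """Solve a + b = 1 and a*k1 + b*k2 = k for (a, b), assuming k1 < k <= k2."""
    if not (k1 < k <= k2):
        raise ValueError("need k1 < k <= k2")
    b = Fraction(k - k1, k2 - k1)
    return 1 - b, b


def gap2_instance(k):
    """Numbers of the gap-2 instance as stated on the slide (hypothetical helper)."""
    k1, cost1 = 1, k + 1      # S1: one facility, cost k+1
    k2, cost2 = k + 1, 0      # S2: k+1 facilities, cost 0
    a, b = bipoint_weights(k1, k2, k)
    bipoint_cost = a * cost1 + b * cost2   # cost(aS1 + bS2)
    integral_cost = 2                      # taken from the slide, not computed here
    return a, b, bipoint_cost, integral_cost / bipoint_cost


if __name__ == "__main__":
    for k in (4, 10, 1000):
        a, b, frac_cost, ratio = gap2_instance(k)
        print(k, a, b, frac_cost, float(ratio))
```

Under these assumptions the weights come out as a = 1/k and b = (k−1)/k, the bi-point cost is (k+1)/k, and the ratio 2k/(k+1) approaches 2 as k grows, which is the sense in which the factor of 2 is tight.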
  bad instance : with k+1 open facilities, cost = 0; with k open facilities, cost huge

Dense Facility
  B_i : set of clients in a small ball around i
  i is A-dense, if connection cost of B_i in OPT is ≥ A
  this instance : i is A-dense for A ≈ opt
  [Figure: facility i and its ball B_i]

Dense Facility
  Reduction component works directly if there are no opt/t-dense facilities, t = O(c/ε)
  can reduce to such an instance in n^{O(t)} time

Lemma 1 from [ABS]
  Main Lemma 1 : suffice to give an α-approximate solution with k+O(1) facilities
  k-median clustering is easy in practice
  reason : there is a “meaningful” clustering
  [Awasthi-Blum-Sheffet] : ε, δ > 0 constants, OPT_{k−1} ≥ (1+δ)·OPT_k → can find (1+ε)-approximation

Lemma 1 from [ABS]
  A : α-approximation algorithm for k-median with k+c medians
  Algorithm
    Apply A to (k−c, F, C, d) → solution with k facilities of cost ≤ α·OPT_{k−c}
    Apply [ABS] to each (k−i, F, C, d) for i = 0, 1, 2, …, c−1
    Output the best of the c+1 solutions
  Proof
    If OPT_{k−c} ≤ (1+ε)·OPT_k, then done.
    Otherwise, consider the smallest i s.t. OPT_{k−i−1} ≥ (1+ε)^{1/c}·OPT_{k−i}
    (by minimality of i, OPT_{k−i} ≤ (1+ε)^{i/c}·OPT_k ≤ (1+ε)·OPT_k)
    [ABS] on (k−i, F, C, d) → solution of cost (1+ε)·OPT_{k−i} ≤ (1+ε)²·OPT_k

Main Lemma 2
  bi-point solution of cost C → solution of cost (1+√3+ε)/2 · C with k+O(1/ε) facilities
  [JV] : bi-point solution of cost C → solution of cost 2C
  based on improving [JV] algorithm

JV algorithm
  given : bi-point solution aS1 + bS2
  τ_i = nearest facility of i
  select S'2 ⊆ S2, |S'2| = |S1| = k1
  with prob. a, open S1
  with prob. b, open S'2
  randomly open k−k1 facilities in S2 \ S'2
  guarantee : either i is open, or τ_i is open

Analysis of JV algorithm
  [Figure: client j with d(j, i1) = d1 and d(j, i2) = d2; an edge of length ≤ d1+d2 leads to i3]
  i1 ∈ S1, i3 ∈ S'2
  either i1 or i3 is open
  If i2 is open, connect j to i2
  Otherwise, if i1 is open, connect j to i1
  Otherwise connect j to i3
  E[cost of j] ≤ 2 × [cost of j in aS1 + bS2]

Our Algorithm
  [Figure: client j with d(j, i1) = d1 and d(j, i2) = d2; an edge of length ≤ d1+d2 leads to i3]
  on average, d1 >> d2
  d(j, i3) ≤ d1 + 2·d2 (compared with 2·d1 + d2 in the JV analysis)
  If i2 is open, connect j to i2
  Otherwise, if i1 is open, connect j to i1
  Otherwise connect j to i3
  E[cost of j] ≤ (1+√3)/2 × [cost of j in aS1 + bS2]

Our Algorithm
  need to guarantee : either i is open, or τ_i is open
  for a star, either the center is open, or all leaves are open
  first try : open each star independently?
    with prob. a, open the center
    with prob. b, open the leaves
    problem : can not bound the number of open facilities
  idea :
    big stars : always open the center, open each leaf with prob. ≈ b
    group small stars of the same size, dependent rounding
    for each group, open 3 more facilities than expected

small stars
  small star : star of size ≤ 2/(abε)
  M_h : set of stars of size h, m = |M_h|
  Roughly, for am stars, open the center; for bm stars, open the leaves
  More accurately, permute the stars and the facilities
    open top ⌈am⌉ + 1 centers
    open bottom ⌈bhm⌉ leaves

big stars
  size h > 2/(abε)
  always open the center
  randomly open ⌊a+bh⌋ − 1 leaves
  ⌊a+bh⌋ − 1 ≈ bh for big star

Lemma : we open at most k + 6/(abε) facilities.
  for a big star of size h
    FRAC : a + bh
    ALG : ⌊a+bh⌋ ≤ a + bh
  for a group of m small stars of size h
    FRAC : m(a + bh)
    ALG : ⌈am⌉ + 1 + ⌈bhm⌉ ≤ m(a + bh) + 3
  there are at most 2/(abε) groups

Summary
                LMP approx. factor   bi-point → integral   bi-point → pseudo-integral   final ratio for k-median
  [JV]          3                    ×2                                                 6
  [JMS]         2                    ×2                                                 4
  our result    2                                          ×(1+√3+ε)/2                  1+√3+ε
  Main Lemma 1 : suffice to give an α-approximate solution with k+O(1) facilities
  Main Lemma 2 : bi-point solution of cost C → solution of cost (1+√3+ε)/2 · C with k+O(1/ε) facilities

Open Problems
  gap between integral solution with k+1 open facilities and LP value (with k open facilities)?
  tight analysis?
  algorithm works for k-means?

Questions?
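The counting lemma for big and small stars can be illustrated numerically. The sketch below only counts how many facilities the stated rules would open. It does not perform the random selection, it applies the slide's formulas verbatim without capping by the number of available centers or leaves, and it assumes every star has size at least 1. The function name `count_opened` and the example star sizes are hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction


def count_opened(a, b, eps, star_sizes):
    """Count facilities opened by the big-star / small-star rules.

    a, b       : bi-point weights with a + b = 1 (Fractions for exact arithmetic)
    eps        : accuracy parameter
    star_sizes : number of leaves of each star (assumed >= 1 here)
    Returns (opened, fractional, number_of_small_groups).
    """
    threshold = 2 / (a * b * eps)          # stars of size <= threshold are small
    small_groups = Counter(h for h in star_sizes if h <= threshold)

    opened = 0
    # Group of m small stars of size h: ceil(a*m) + 1 centers, ceil(b*h*m) leaves.
    for h, m in small_groups.items():
        opened += math.ceil(a * m) + 1 + math.ceil(b * h * m)
    # Big star of size h: the center plus floor(a + b*h) - 1 leaves.
    for h in star_sizes:
        if h > threshold:
            opened += 1 + (math.floor(a + b * h) - 1)

    # Fractional count: each star contributes a + b*h, summing to a*k1 + b*k2 = k.
    fractional = sum(a + b * h for h in star_sizes)
    return opened, fractional, len(small_groups)


if __name__ == "__main__":
    a = b = Fraction(1, 2)
    eps = Fraction(1, 2)
    sizes = [1, 1, 1, 2, 2, 3, 20, 40]     # hypothetical star sizes
    opened, k, groups = count_opened(a, b, eps, sizes)
    print(opened, k, groups, k + 6 / (a * b * eps))
```

With these example values the threshold 2/(abε) is 16, so the stars of size 20 and 40 are big and the rest fall into three small groups. Each small group contributes at most 3 facilities beyond its fractional share, which is where the 6/(abε) term in the lemma comes from.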