Randomized Approximation Algorithms for Offline and Online Set Multicover Problems

Bhaskar DasGupta†
Department of Computer Science, Univ of IL at Chicago
dasgupta@cs.uic.edu

Joint works with Piotr Berman (Penn State) and Eduardo Sontag (Rutgers)

Collection of results that appeared in APPROX-2004 and WADS-2005, and that are to appear in Discrete Applied Math (special issue on computational biology)

† Supported by NSF grants CCR-0206795, CCR-0208749 and a CAREER award IIS-0346973

4/11/2020 UIC

More interesting title for the theoretical computer science community:
Randomized Approximation Algorithms for Set Multicover Problems with Applications to Reverse Engineering of Protein and Gene Networks

More interesting title for the biological community:
Randomized Approximation Algorithms for Set Multicover Problems with Applications to Reverse Engineering of Protein and Gene Networks

Set k-multicover (SCk)
Input: universe U = {1,2,…,n}, sets S1,S2,…,Sm ⊆ U, integer (coverage factor) k ≥ 1
Valid solution: cover every element of the universe at least k times:
  a subset of indices I ⊆ {1,2,…,m} such that ∀x∈U: |{ j∈I : x∈Sj }| ≥ k
Objective: minimize the number of picked sets |I|
• k=1 is simply called (unweighted) set-cover, a well-studied problem
• Special case of interest in our applications: k is large, e.g., k = n−1

Known positive results (a = maximum size of any set):
Set-cover (k=1):
• can approximate with approx.
ratio of 1 + ln a (deterministic or randomized): Johnson 1974, Chvátal 1979, Lovász 1975
Set-multicover (k>1):
• same holds for k ≥ 1, e.g., via primal-dual fitting: Rajagopalan and Vazirani 1999

Known negative results for set-cover (i.e., k=1):
– (modulo NP ⊄ DTIME(n^(log log n))) approx. ratio better than (1−ε) ln n is not possible for any constant 0 < ε < 1 (Feige 1998)
– (modulo NP ≠ P) better than (1−ε) ln n is not possible for some constant 0 < ε < 1 (Raz and Safra 1997)
– the lower bound can be generalized in terms of the set size a: better than ln a − O(ln ln a) is not possible (Trevisan, 2001)

r(a,k) = approx. ratio of an algorithm as a function of a and k
• We know that for the greedy algorithm r(a,k) ≤ 1 + ln a
  – at every step select the set that contains the maximum number of elements not yet covered k times
• Can we design an algorithm such that r(a,k) decreases with increasing k?
  – possible approaches:
    • improved analysis of greedy?
    • randomized approach (LP + rounding)?

Our results (very "roughly")
n = number of elements of universe U
k = number of times each element must be covered
a = maximum size of any set
• Greedy would not do any better
  – r(a,k) = Ω(log n) even if k is large, e.g., k = n
• But we can design a randomized algorithm based on the LP + rounding approach such that the expected approx. ratio is better:
  E[r(a,k)] ≤ max{ 2+o(1), ln(a/k) }   (as appears in the conference proceedings)
  E[r(a,k)] ≤ max{ 1+o(1), ln(a/k) }   (further improvement, via comments from Feige)

More precise bounds on E[r(a,k)]
E[r(a,k)] ≤
  1 + ln a                                  if k = 1
  (1 + e^(−(k−1)/5)) ln(a/(k−1))            if a/(k−1) ≥ e² ≈ 7.4 and k > 1
  min{ 2 + 2e^(−(k−1)/5), 2 + 0.46·a/k }    if ¼ < a/(k−1) < e² and k > 1
  1 + 2(a/k)^½                              if a/(k−1) ≤ ¼ and k > 1
[Plot, approximate and not drawn to scale: E[r(a,k)] against a/k, with marks at ¼ and e² on the horizontal axis and at 1, 2 and 4 on the vertical axis; the curve is labeled ln(a/k)]

Can E[r(a,k)] converge to 1 at a much faster rate?
Probably not... for example, the problem can be shown to be APX-hard for a/k ≥ 1
Can we prove matching lower bounds of the form max{ 1+o(1), 1+ln(a/k) }?
Do not know...

How about the "weighted" case?
• each set has an arbitrary positive weight
• minimize the sum of the weights of the selected sets
It seems that the multicover version may not be much easier than the single-cover version:
– take a single-cover instance
– add a few new elements and new "must-select" sets with "almost-zero" weights that cover the original elements k−1 times and all new elements k times

Our randomized algorithm
Standard LP-relaxation for set multicover (SCk):
• selection variable xi for each set Si (1 ≤ i ≤ m)
• minimize Σ_{i=1..m} xi
  subject to:
    Σ_{i : u∈Si} xi ≥ k    for every element u∈U
    0 ≤ xi ≤ 1             for all i

Our randomized algorithm
• Solve the LP-relaxation
• Select a scaling factor β carefully:
    β = ln a             if k = 1
    β = ln(a/(k−1))      if a/(k−1) ≥ e² and k > 1
    β = 2                if ¼ < a/(k−1) < e² and k > 1
    β = 1 + (a/k)^½      otherwise
• Deterministic rounding: select Si if βxi ≥ 1
    C0 = { Si | βxi ≥ 1 }
• Randomized rounding: select Si ∈ {S1,…,Sm} \ C0 with prob. βxi
    C1 = collection of such selected sets
• Greedy choice: if an element u∈U is covered less than k times, pick sets from {S1,…,Sm} \ (C0 ∪ C1) arbitrarily

The most non-trivial part of the analysis involved proving the following bound for E[r(a,k)]:
  E[r(a,k)] ≤ (1 + e^(−(k−1)/5)) ln(a/(k−1))    if a/(k−1) ≥ e² and k > 1
• We needed to do an amortized analysis of the interaction between the deterministic and randomized rounding steps and the greedy step.
• For a tight analysis, the standard Chernoff bounds were not always sufficient, and hence we needed to devise more appropriate bounds for certain parameter ranges.
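The rounding stages listed on the algorithm slide can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the fractional LP solution x is taken as given (no LP solver is called), the names scaling_factor and round_multicover are invented for this example, and the case boundaries for β (≥ e² and > ¼) are one reading of the slide.

```python
import math
import random


def scaling_factor(a, k):
    """Scaling factor beta from the algorithm slide (a = maximum set size)."""
    if k == 1:
        return math.log(a)
    ratio = a / (k - 1)
    if ratio >= math.e ** 2:
        return math.log(ratio)
    if ratio > 0.25:
        return 2.0
    return 1.0 + math.sqrt(a / k)


def round_multicover(sets, universe, k, x, rng=None):
    """Return (C0, C1, C2) as disjoint sets of indices into `sets`.

    ASSUMPTION: `x` is a fractional LP solution supplied by the caller.
    """
    rng = rng or random.Random()
    beta = scaling_factor(max(len(s) for s in sets), k)

    # Deterministic rounding: keep every set with beta * x_i >= 1.
    c0 = {i for i, xi in enumerate(x) if beta * xi >= 1}
    # Randomized rounding: keep each remaining set with probability beta * x_i.
    c1 = {i for i, xi in enumerate(x)
          if i not in c0 and rng.random() < beta * xi}

    # Greedy choice: top up any element still covered fewer than k times.
    c2 = set()
    for u in universe:
        picked = c0 | c1 | c2
        missing = k - sum(1 for i in picked if u in sets[i])
        for i, s in enumerate(sets):
            if missing <= 0:
                break
            if i not in picked and u in s:
                c2.add(i)
                missing -= 1
        if missing > 0:
            raise ValueError(f"element {u} lies in fewer than {k} sets")
    return c0, c1, c2


# Toy instance (hypothetical): every element lies in exactly 3 of the 6 sets.
SETS = [{1, 2}, {2, 3}, {3, 4}, {1, 4}, {1, 3}, {2, 4}]
UNIVERSE = [1, 2, 3, 4]
X = [2 / 3] * 6  # feasible for k = 2: each element gets 3 * (2/3) = 2
c0, c1, c2 = round_multicover(SETS, UNIVERSE, 2, X, random.Random(0))
cover = c0 | c1 | c2
```

Whatever the random choices are, the greedy stage guarantees that the union C0 ∪ C1 ∪ C2 covers every element at least k times; the analysis in the talk concerns how large that union is in expectation.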
Proof of the simplest of the bounds
E[r(a,k)] ≤ 1 + 2(a/k)^½ if a/k ≤ ¼
Notational simplification:
– α = (k/a)^½ ≥ 2
– thus, β = 1 + (1/α)
– need to show that E[r(a,k)] ≤ 1 + (2/α)
– (x1, x2, …, xm) is the solution vector for the LP
– thus, OPT ≥ Σi xi
– also, obviously, OPT ≥ (n k)/a = n α²

Focus on a single element j∈U
Remember the algorithm:
• Deterministic rounding: select Si if βxi ≥ 1
    C0 = { Si | βxi ≥ 1 }
    Let C0,j = those sets in C0 that contain j
• Randomized rounding: select Si ∈ {S1,…,Sm} \ C0 with prob. βxi
    C1 = collection of such selected sets
    Let C1,j = those sets in C1 that contain j
    p = sum of the probabilities of those sets that contain j [formula lost in the source]
• Greedy choice: if an element j∈U is covered less than k times, arbitrarily pick sets from {S1,…,Sm} \ (C0 ∪ C1) that contain j; let C2 be all such sets selected
    Let C2,j = those sets in C2 that contain j

What is E[ |C0| + |C1| ]?
Obvious:
  E[ |C0| + |C1| ] ≤ β (Σi xi) ≤ (1 + α⁻¹)·OPT
  (no set is both in C0 and C1)

What is E[ |C2,j| ]?
Suppose that |C0,j| = k−f for some f
Sets containing j: S1, S2, …, Sk−f (labeled C0,j on the slide), Sk−f+1, …
[the remaining formulas on this slide are garbled in the source]

(Focus on a single element j∈U)
Goal is to
• first determine E[ |C0| + |C1| ]
• then determine E[ |C2,j| ]
• sum it up over all j to get E[ |C2| ]
• finally determine E[ |C0| + |C1| + |C2| ]

What is E[ |C2,j| ]? (contd.)
|C1,j| = f − |C2,j| and thus, after some algebra, [formulas lost in the source]

One application
We used the randomized algorithm for robust string barcoding
Check the publications on the software webpage
http://dna.engr.uconn.edu/~software/barcode/
(joint project with Kishori Konwar, Ion Mandoiu and Alex Shvartsman at Univ.
of Connecticut)

Another (the original) motivation for looking at set-multicover
Reverse engineering of biological networks

Biological Motivation
[Diagram linking: Biological problem; via Differential Equations; Linear Algebraic formulation; Set-multicover formulation; Randomized Algorithm; Selection of appropriate biological experiments]

[Example: C = A × B, with C of size n × m, A of size n × n and B of size n × m; here n = 3 and m = 5]

      [ 0  2  0  1  3 ]        [ −1   1   3 ]        B0  B1  B2  B3  B4
  C = [ 4  1  2  0  0 ]    A = [  2  −1   4 ]    B = [ 4   3  37   1  10 ]
      [ 0  0  5  0  1 ]        [  0   0  −1 ]        [ 4   5  52   2  16 ]
                                                     [ 0   0  −5   0  −1 ]

C0 = A × B
• C0 (n × m): the zero structure of C is known (each entry is marked "= 0" or "≠ 0")
• A (n × n): unknown (all entries "?")
• B (n × m): initially unknown, but columns can be queried (columns are in general position)
  – e.g., what is B2? The query returns (37, 52, −5)

– Rough objective: obtain as much information about A as possible while performing as few queries as possible
– Obviously, the best we can hope for is to identify A up to scaling

[Figure: the example again, now showing the queried columns (37, 52, −5) and (10, 16, −1) of B; |J1| ≥ 2 = n−1, so the corresponding row of A can be recovered (up to scaling)]

– Suppose we query columns Bj for j∈J = { j1, …, jl }
– Let Ji = { j | j∈J and cij = 0 }
– Suppose |Ji| ≥ n−1. Then each Ai is uniquely determined up to a scalar multiple (theoretically the best possible)
– Thus, the combinatorial question is: find J of minimum cardinality such that |Ji| ≥ n−1 for all i

Combinatorial Question
Input: sets Ji ⊆ {1,2,…,n} for 1 ≤ i ≤ m
Valid solution: a subset I ⊆ {1,2,…,m} such that ∀ 1 ≤ i ≤ n: |{ Jℓ : ℓ∈I and i∈Jℓ }| ≥ n−1
Goal: minimize |I|
This is the set-multicover problem with coverage factor n−1
More generally, one can ask for a lower coverage factor, n−k for some k ≥ 1, to allow fewer queries at the price of an ambiguous determination of A

[Diagram linking: Biological problem; via Differential Equations; Linear Algebraic formulation; Combinatorial formulation; Combinatorial Algorithms (randomized); Selection of appropriate biological experiments]

• Time evolution of state variables (x1(t), x2(t), …, xn(t)) given by a set of differential equations:
    ∂x/∂t = f(x,p), i.e.,
    ∂x1/∂t = f1(x1,x2,…,xn,p1,p2,…,pm)
    …
    ∂xn/∂t = fn(x1,x2,…,xn,p1,p2,…,pm)
• p = (p1,p2,…,pm) represents the concentration of certain enzymes
• f(x̄,p̄) = 0: p̄ is the "wild type" (i.e.
normal) condition of p; x̄ is the corresponding steady-state condition

Goal
We are interested in obtaining information about the sign of ∂fi/∂xj(x̄,p̄)
e.g., if ∂fi/∂xj > 0, then xj has a positive (catalytic) effect on the formation of xi

Assumption
We do not know f, but do know that certain parameters pj do not affect certain variables xi
This gives the zero structure of matrix C: matrix C0 = (c0ij) with c0ij = 0 ⇒ ∂fi/∂pj = 0

m experiments
• change one parameter, say pk (1 ≤ k ≤ m)
• for perturbed p ≈ p̄, measure the steady state vector x = ξ(p)
• estimate n "sensitivities" bij [formula lost in the source], where ej is the jth canonical basis vector
• consider matrix B = (bij)

In practice, a perturbation experiment involves:
• letting the system relax to steady state
• measuring expression profiles of variables xi (e.g., using microarrays)

Biology to linear algebra (continued)
• Let A be the Jacobian matrix ∂f/∂x
• Let C be the negative of the Jacobian matrix ∂f/∂p
• From f(ξ(p),p) = 0, taking the derivative with respect to p and using the chain rule, we get C = AB. This gives the linear algebraic formulation of the problem.
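Because C = AB, a known zero cij = 0 says that row Ai of A is orthogonal to column Bj of B, so n−1 queried columns in general position pin down Ai up to scaling. The sketch below illustrates this on the 3 × 5 example from the slides. It rests on assumptions: the entries of A are one reading of the example slide and are used only as a check, the helper names are invented here, and the cross product is a shortcut that works only for n = 3 (a general n would need a null-space computation).

```python
def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


# Columns B0..B4 of the 3 x 5 matrix B from the example slide.
B = [(4, 4, 0), (3, 5, 0), (37, 52, -5), (1, 2, 0), (10, 16, -1)]

# ASSUMPTION: rows of A as read off the example slide; used only as a check.
A_ASSUMED = [(-1, 1, 3), (2, -1, 4), (0, 0, -1)]

# C = A B, so c_ij is the dot product of row i of A with column j of B.
C = [[dot(row, col) for col in B] for row in A_ASSUMED]

# J_i = indices of queried columns with c_ij = 0 (here every column is queried).
J = [[j for j, c in enumerate(row) if c == 0] for row in C]


def recover_row(cols):
    """For n = 3: a vector orthogonal to the first two columns given."""
    return cross(cols[0], cols[1])


# Each recovered row is a scalar multiple of the corresponding row of A.
recovered = [recover_row([B[j] for j in Ji]) for Ji in J]
```

Each J_i here has at least n−1 = 2 elements, which is exactly the coverage condition of the combinatorial question; with fewer zeros among the queried columns, the row would no longer be determined up to a single scalar.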
Online Set-multicover

Performance measure
Via competitive ratio: ratio of the total cost of the online algorithm to that of an optimal offline algorithm that knows the entire input in advance
For a randomized algorithm, we measure the expected competitive ratio

Parameters of interest (for performance measure)
• "frequency" m (maximum number of sets to which any presented element belongs): unknown
• maximum "set size" d (maximum number of presented elements a set contains): unknown
• total number of elements in the universe n (≥ d): unknown
• coverage factor k: given

Previous result
Alon, Awerbuch, Azar, Buchbinder, and Naor (STOC 2003 and SODA 2004)
• considered k=1
• both deterministic and randomized algorithms
• competitive ratio O(log m log n), worst-case/expected
• almost matching lower bound [formula lost in the source] for deterministic algorithms and "almost all" parameter values

Our improved algorithm
• expected competitive ratio: O(log m log n) improved to O(log m log d), where d ≤ n
• small precise constants: log₂ m ln d + lower order term
• ratio improves with larger k
• c = largest weight / smallest weight

Even more precise smaller constants for the unweighted k=1 case via improved analysis

Our lower bounds on competitive ratio (for deterministic algorithms)
• unweighted case [bound lost in the source]
• weighted case [bound lost in the source]
for many values of parameters

Work concurrent to our conference publication
Alon, Azar and Gutner (SPAA 2005)
• different version of the online problem (weighted case)
  – the same element can be presented multiple times
  – if the same element is presented k times, our goal is to cover it by at least k different sets
• expected competitive ratio O(log m log n)
• easy to see that it applies to our version with the same bounds
Conversely, our algorithm and analysis can be easily adapted to provide an expected competitive ratio of log₂ m ln(d/....)
for the above version

Yet another version of online set-cover
Awerbuch, Azar, Fiat, Leighton (STOC 96)
• elements presented one at a time
• allowed to pick k sets at a given time for a specified k
• goal: maximize the number of presented elements for which:
  – at least one set containing the element was selected before the element was presented
• provides efficient randomized approximation algorithms and matching lower bounds

Our algorithmic approach
Randomized version of the so-called "winnowing" approach:
• the (deterministic) winnowing approach was first used long ago:
  N. Littlestone, "Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm", Machine Learning, 2, pp. 285–318, 1988.
• this approach was also used by Alon, Awerbuch, Azar, Buchbinder and Naor in their STOC-2003 paper

Very very rough description of our approach
• every set starts with zero probability of selection
• start with an empty solution
• when the next element i is presented:
  – if k selected sets already contain i, terminate the processing of i;
  – "appropriately" increase the probabilities of all sets containing i ("promotion" step of winnowing)
  – select sets containing i with the above probabilities
  – if k sets containing i are still not selected, then just select more sets "greedily":
    • select the least-cost set not selected already, then the next least-cost set, etc.
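The very rough description above can be turned into a small Python sketch. The slides only say that probabilities are increased "appropriately", so the promotion rule below (double the probability and add a boost that is inversely proportional to the set's cost and to the frequency of the element) is a hypothetical placeholder chosen for illustration, not the rule analyzed in the talk; the function name and the toy instance are likewise invented.

```python
import random


def online_multicover(sets, costs, k, stream, rng=None):
    """Process `stream` online and return the indices of the selected sets."""
    rng = rng or random.Random()
    prob = [0.0] * len(sets)   # every set starts with zero probability
    selected = set()           # start with an empty solution
    cheapest = min(costs)

    for elem in stream:
        containing = [i for i, s in enumerate(sets) if elem in s]
        if sum(1 for i in containing if i in selected) >= k:
            continue           # k selected sets already contain elem

        # Promotion step. HYPOTHETICAL rule: the slides do not give the formula.
        for i in containing:
            boost = cheapest / (costs[i] * len(containing))
            prob[i] = min(1.0, 2.0 * prob[i] + boost)

        # Randomized step: select sets containing elem with these probabilities.
        for i in containing:
            if i not in selected and rng.random() < prob[i]:
                selected.add(i)

        # Greedy step: if coverage is still short, add least-cost sets.
        missing = k - sum(1 for i in containing if i in selected)
        if missing > 0:
            rest = sorted((i for i in containing if i not in selected),
                          key=lambda i: costs[i])
            selected.update(rest[:missing])
    return selected


# Toy instance (hypothetical): each element lies in exactly 3 sets.
SETS = [{1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
COSTS = [1.0, 1.0, 1.0, 5.0]
chosen = online_multicover(SETS, COSTS, 2, [1, 2, 3], random.Random(0))
```

The greedy step makes the coverage guarantee independent of the random choices; the promotion rule only affects how many sets, and of what cost, end up selected.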
Many desirable (and sometimes conflicting) goals
• the increase in probability of each set should not be "too much"
  – else, e.g., the randomized step may select "too many" sets
• the increase in probability of each set should not be "too little"
  – else, e.g., optimal sets may be missed too many times and the greedy step may dominate too much
• "light" sets should be preferred over "heavy" sets unless heavy sets are in an optimal solution
• the increase in probability should be somehow inversely linked to the frequency of i, to avoid selecting too many sets in the randomized step

Slightly improved algorithm for the unweighted case
(expected competitive ratio has better constants/asymptotics)
Modify the "promotion" step slightly: change [formula lost in the source] to [formula lost in the source]

New expected competitive ratio: [formula lost in the source]

Motivation for the online version
Similar to before except that:
• we use fluorescent proteins instead of microarrays
  – fluorescent proteins can be used to measure the rate at which a certain gene transcribes in a cell under a set of conditions
• a priori, matrix C is not known completely but has to be learnt by doing experiments

Thank you for your attention!