Randomized Approximation Algorithms for Offline and Online Set Multicover Problems Bhaskar DasGupta

Randomized Approximation Algorithms for
Offline and Online Set Multicover Problems
Bhaskar DasGupta†
Department of Computer Science
Univ of IL at Chicago
dasgupta@cs.uic.edu
Joint work with Piotr Berman (Penn State) and Eduardo
Sontag (Rutgers)
Collection of results that appeared in APPROX-2004 and
WADS-2005, and are to appear in Discrete Applied Math
(special issue on computational biology)
† Supported by NSF grants CCR-0206795, CCR-0208749
and a CAREER award IIS-0346973
4/11/2020
UIC
1
More interesting title for the theoretical computer
science community:
Randomized Approximation Algorithms for
Set Multicover Problems
with Applications to
Reverse Engineering of Protein and Gene Networks
More interesting title for the biological community:
Randomized Approximation Algorithms for
Set Multicover Problems
with Applications to
Reverse Engineering of Protein and Gene Networks
Set k-multicover (SCk)
Input: Universe U={1,2,…,n}, sets S1,S2,…,Sm ⊆ U,
integer (coverage factor) k≥1
Valid Solution: cover every element of universe k times:
subset of indices I ⊆ {1,2,…,m} such that
∀x∈U: |{j∈I : x∈Sj}| ≥ k
Objective: minimize number of picked sets |I|
k=1 → simply called (unweighted) set-cover
a well-studied problem
Special case of interest in our applications:
k is large, e.g., k=n-1
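The definition of a valid solution can be checked mechanically. The following is a minimal sketch; the function name `is_k_multicover` and the toy instance are illustrative assumptions, not part of the talk.

```python
from collections import Counter

def is_k_multicover(universe, sets, chosen, k):
    """Return True if the sets indexed by `chosen` cover every element
    of `universe` at least k times (each set counted at most once)."""
    coverage = Counter()
    for j in set(chosen):          # a set may be picked only once
        coverage.update(sets[j])   # each element of S_j gains one unit of coverage
    return all(coverage[x] >= k for x in universe)

# Toy instance (hypothetical): U = {1,2,3}, four sets, coverage factor k = 2.
universe = {1, 2, 3}
sets = [{1, 2}, {2, 3}, {1, 3}, {1, 2, 3}]
```

Here the index set I = {0, 1, 2} is a valid 2-multicover, while I = {0, 1} is not, since element 1 is then covered only once.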
Known positive results:
Set-cover (k=1):
• can approximate with approx. ratio of 1+ln a, where a = maximum size of any set
(deterministic or randomized)
Johnson 1974, Chvátal 1979, Lovász 1975
Set-multicover (k>1):
• same holds for k≥1
e.g., primal-dual fitting: Rajagopalan and Vazirani 1999
Known negative results for set-cover (i.e., k=1):
- (modulo NP ⊄ DTIME(n^(log log n)))
approx ratio better than (1-ε)ln n is not
possible for any constant 0<ε<1 (Feige 1998)
- (modulo NP≠P)
better than (1-ε)ln n not possible for
some constant 0<ε<1 (Raz and Safra 1997)
- lower bound can be generalized in terms of
set size a
better than ln a-O(ln ln a) is not possible
(Trevisan, 2001)
r(a,k)= approx. ratio of an algorithm as function of a,k
• We know that for greedy algorithm r(a,k) ≤ 1+ln a
– at every step select set that contains maximum number
of elements not covered k times yet
• Can we design algorithm such that r(a,k) decreases with
increasing k ?
– possible approaches:
• improved analysis of greedy?
• randomized approach (LP + rounding) ?
• …
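The greedy rule mentioned above (at every step select the set containing the maximum number of elements not yet covered k times) can be sketched as follows. The function name and the tie-breaking by lowest index are assumptions made for illustration.

```python
def greedy_multicover(universe, sets, k):
    """Greedy heuristic: repeatedly pick the unused set containing the most
    elements not yet covered k times. Returns the chosen indices in pick
    order, or None if some element lies in fewer than k sets."""
    need = {x: k for x in universe}        # residual demand per element
    unused = set(range(len(sets)))
    chosen = []

    def gain(j):
        # number of elements of S_j that still need coverage
        return sum(1 for x in sets[j] if need.get(x, 0) > 0)

    while any(need.values()):
        # sorted() makes tie-breaking deterministic (lowest index wins)
        best = max(sorted(unused), key=gain, default=None)
        if best is None or gain(best) == 0:
            return None                    # instance is infeasible
        unused.remove(best)
        chosen.append(best)
        for x in sets[best]:
            if need.get(x, 0) > 0:
                need[x] -= 1
    return chosen
```

Each set is used at most once, matching the requirement that an element be covered by k different sets.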
Our results (very “roughly”)
n = number of elements of universe U
k = number of times each element must be covered
a = maximum size of any set
• Greedy would not do any better
– r(a,k)=Ω(log n) even if k is large, e.g., k=n
• But can design randomized algorithm based on LP+rounding
approach such that the expected approx. ratio is better:
E[r(a,k)] ≤ max{2+o(1), ln(a/k)} (as appears in conference proceedings)
↓ (further improvement, via comments from Feige)
E[r(a,k)] ≤ max{1+o(1), ln(a/k)}
More precise bounds on E[r(a,k)]
E[r(a,k)] ≤
  1+ln a, if k=1
  (1+e^(-(k-1)/5)) ln(a/(k-1)), if a/(k-1) ≥ e² ≈ 7.4 and k>1
  min{2+2e^(-(k-1)/5), 2+0.46 a/k}, if ¼ ≤ a/(k-1) ≤ e² and k>1
  1+2(a/k)^½, if a/(k-1) ≤ ¼ and k>1
[Figure: approximate plot (not drawn to scale) of E[r(a,k)] against a/k; horizontal axis marked at 0, ¼ and e², vertical axis marked at 1, 2 and 4; one curve is labelled ln(a/k).]
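To get a feel for how the bound on E[r(a,k)] behaves, the four cases can be evaluated numerically. This is a small sketch; the function name is hypothetical, and at the shared boundary values (a/(k-1) equal to ¼ or e²) the choice of branch is an assumption.

```python
import math

def expected_ratio_bound(a, k):
    """Evaluate the case-by-case upper bound on E[r(a,k)] quoted in the talk.
    a = maximum set size, k = coverage factor."""
    if k == 1:
        return 1 + math.log(a)
    ratio = a / (k - 1)
    if ratio >= math.e ** 2:
        return (1 + math.exp(-(k - 1) / 5)) * math.log(ratio)
    if ratio > 0.25:
        return min(2 + 2 * math.exp(-(k - 1) / 5), 2 + 0.46 * a / k)
    return 1 + 2 * math.sqrt(a / k)
```

For instance, with a = 1 and k = 101 the last case applies and the bound is about 1.2, illustrating how the ratio approaches 1 as k grows relative to a.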
Can E[r(a,k)] converge to 1 at a much faster rate?
Probably not... for example, the problem can be shown to be APX-hard for a/k ≥ 1
Can we prove matching lower bounds of the form
max { 1+o(1) , 1+ln(a/k) } ?
Do not know...
How about the “weighted” case?
• each set has arbitrary positive weight
• minimize sum of weights of selected sets
It seems that the multi-cover version may not be
much easier than the single-cover version:
– take single-cover instance
– add a few new elements and new “must-select” sets with
“almost-zero” weights that cover the original elements
k-1 times and all new elements k times
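One way to realize this padding argument is sketched below. The concrete construction (a single new element `z` and k padding sets) is an assumption chosen for illustration; the talk only describes the idea informally.

```python
def single_to_multicover(universe, sets, weights, k, eps=1e-9):
    """Pad a weighted single-cover instance into a weighted k-multicover one.
    Hypothetical construction: one new element z and k padding sets of weight eps.
    All k padding sets contain z, so covering z k times forces all of them;
    the first k-1 of them also contain every original element."""
    z = ("new", 0)                       # fresh element, assumed not in universe
    new_universe = set(universe) | {z}
    new_sets = [set(s) for s in sets]
    new_weights = list(weights)
    for t in range(k):
        padding = {z} | (set(universe) if t < k - 1 else set())
        new_sets.append(padding)
        new_weights.append(eps)          # "almost-zero" weight
    return new_universe, new_sets, new_weights
```

After padding, every original element is covered exactly k-1 times by the forced sets, so the remaining task is precisely the original single-cover instance, at essentially the same cost.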
Our randomized algorithm
Standard LP-relaxation for set multicover (SCk):
• selection variable xi for each set Si (1 ≤ i ≤ m)
• minimize Σ_{i=1..m} xi
subject to:
Σ_{i : u∈Si} xi ≥ k for every element u∈U
0 ≤ xi ≤ 1 for all i
Our randomized algorithm
• Solve the LP-relaxation
• Select a scaling factor ß carefully:
    ß = ln a, if k=1
    ß = ln(a/(k-1)), if a/(k-1) ≥ e² and k>1
    ß = 2, if ¼ ≤ a/(k-1) ≤ e² and k>1
    ß = 1+(a/k)^½, otherwise
• Deterministic rounding: select Si if ßxi ≥ 1
  C0 = { Si | ßxi ≥ 1 }
• Randomized rounding: select Si ∈ {S1,…,Sm}\C0 with prob. ßxi
  C1 = collection of such selected sets
• Greedy choice: if an element u∈U is covered less than k
times, pick sets from {S1,…,Sm}\(C0 ∪ C1) arbitrarily
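The three rounding steps can be sketched as follows, assuming a fractional LP solution `x` is already available (no LP solver is invoked here). The function names, the lowest-index order of the "arbitrary" greedy picks, and the branch taken at boundary values of a/(k-1) are assumptions for illustration.

```python
import math
import random

def scaling_factor(a, k):
    """Scaling factor for the rounding steps, following the case list above."""
    if k == 1:
        return math.log(a)
    ratio = a / (k - 1)
    if ratio >= math.e ** 2:
        return math.log(ratio)
    if ratio >= 0.25:
        return 2.0
    return 1 + math.sqrt(a / k)

def round_fractional_cover(universe, sets, x, k, rng=random):
    """Round a fractional solution x (one value per set) to a k-multicover.
    Returns (C0, C1, C2) as sets of indices into `sets`."""
    m = len(sets)
    a = max(len(s) for s in sets)            # maximum set size
    beta = scaling_factor(a, k)
    # Deterministic rounding: take every set whose scaled value reaches 1.
    c0 = {i for i in range(m) if beta * x[i] >= 1}
    # Randomized rounding: take each remaining set with probability beta * x[i].
    c1 = {i for i in range(m) if i not in c0 and rng.random() < beta * x[i]}
    # Greedy choice: top up any element still covered fewer than k times.
    picked = c0 | c1
    c2 = set()
    for u in universe:
        have = sum(1 for i in picked if u in sets[i])
        for i in range(m):
            if have >= k:
                break
            if i not in picked and u in sets[i]:
                picked.add(i)
                c2.add(i)
                have += 1
        if have < k:
            raise ValueError(f"element {u!r} lies in fewer than {k} sets")
    return c0, c1, c2
```

The returned collections are disjoint by construction, so the cost of the solution is |C0| + |C1| + |C2|, which is the quantity bounded in the analysis.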
Most non-trivial part of the analysis involved proving the
following bound for E[r(a,k)]:
E[r(a,k)] ≤ (1+e^(-(k-1)/5)) ln(a/(k-1)) if a/(k-1) ≥ e² and k>1
• Needed to do an amortized analysis of the interaction
between the deterministic and randomized rounding steps
with the greedy step.
• For tight analysis, the standard Chernoff bounds were not
always sufficient and hence needed to devise more
appropriate bounds for certain parameter ranges.
Proof of the simplest of the bounds
E[r(a,k)] ≤ 1+2(a/k)^½ if a/k ≤ ¼
Notational simplification:
– α = (k/a)^½ ≥ 2
– thus, ß = 1+(1/α)
– need to show that E[r(a,k)] ≤ 1+(2/α)
– (x1, x2, ..., xm) is the solution vector for the LP
– thus, OPT ≥ Σi xi
– Also, obviously, OPT ≥ (n k)/a = n α²
Focus on a single element j∈U
Remember the algorithm:
• Deterministic rounding: select Si if ßxi ≥ 1
C0 = { Si | ßxi ≥ 1 }
Let C0,j = those sets in C0 that contained j
• Randomized rounding: select Si ∈ {S1,…,Sm}\C0 with prob. ßxi
C1 = collection of such selected sets
Let C1,j = those sets in C1 that contained j
p = sum of prob. of those sets that contained j
= Σ ßxi over all Si ∉ C0 with j∈Si
• Greedy choice: if an element j∈U is covered less than k times, pick sets
from {S1,…,Sm}\(C0 ∪ C1) that contain j arbitrarily; let C2 be all such
sets selected
Let C2,j be those sets in C2 that contained j
What is E[ |C0| + |C1| ] ?
Obvious:
E[ |C0|+|C1| ] ≤ ß(Σi xi) ≤ (1+α^-1)·OPT
( no set is both in C0 and C1 )
What is E[ |C2,j| ] ?
Suppose that |C0,j|=k-f for some f
S1, S2, ...,Sk-f,Sk-f+1,....Sk-f+
C0,j
=f+, say
and xj ≤ 1 for any j imply
(Focus on a single element j∈U)
Goal is to
first determine E[ |C0| + |C1| ]
then determine
• E[ |C2,j| ]
• sum it up over all j to get E[ |C2| ]
finally determine E[ |C0| + |C1| + |C2| ]
What is E[ |C2,j| ] ? (contd.)
|C1,j| = f-|C2,j| and thus after some algebra
What is E[ |C2,j| ] ? (contd.)
One application
We used the randomized algorithm for robust string barcoding
Check the publications in the software webpage
http://dna.engr.uconn.edu/~software/barcode/
(joint project with Kishori Konwar, Ion Mandoiu and Alex
Shvartsman at Univ. of Connecticut)
Another (the original) motivation for looking at
set-multicover
Reverse engineering of biological networks
Biological Motivation
Biological problem → (via Differential Equations) → Linear Algebraic formulation → Set-multicover formulation → Randomized Algorithm → Selection of appropriate biological experiments
Linear algebraic formulation: C = A B
• C is n×m; only its zero structure C0 is known
• A is n×n and unknown
• B is n×m (columns are in general position); initially unknown, but columns can be queried
Example (n=3, m=5; columns of B are B0, B1, B2, B3, B4):
C =
   0  2  0  1  3
   4  1  2  0  0
   0  0  5  0  1
A =
  -1  1  3
   2 -1  4
   0  0 -1
B =
   4  3 37  1 10
   4  5 52  2 16
   0  0 -5  0 -1
C0 (zero structure of C) =
  =0 ≠0 =0 ≠0 =0
  ≠0 ≠0 ≠0 =0 =0
  =0 =0 ≠0 =0 ≠0
Query: what is B2? Answer: (37, 52, -5)
– Rough objective: obtain as much
information about A as possible while performing as few
queries as possible
– Obviously, the best we can hope for is to
identify A up to scaling
[Figure: the example C0, A and B again, with |J1| ≥ 2 = n-1: from the zero entries in the first row of C0 and the queried columns (37, 52, -5) and (10, 16, -1) of B, the first row of A can be recovered (up to scaling).]
– Suppose we query columns Bj for j∈J = { j1,…, jl }
– Let Ji={j | j∈J and cij=0}
– Suppose |Ji| ≥ n-1. Then, each Ai is uniquely
determined up to a scalar multiple (theoretically
the best possible)
– Thus, the combinatorial question is:
find J of minimum cardinality such that
|Ji| ≥ n-1 for all i
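For n = 3 the recovery step is easy to see concretely: if row Ai satisfies Ai·Bj = 0 for two queried columns in general position, then Ai is parallel to their cross product. The sketch below uses the example columns B0 = (4, 4, 0) and B2 = (37, 52, -5) from the slides and assumes that the first row of C vanishes in columns 0 and 2; the helper `cross` is illustrative and only applies to n = 3.

```python
def cross(u, v):
    """Cross product of two 3-vectors: a vector orthogonal to both."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

B0 = (4, 4, 0)
B2 = (37, 52, -5)

# Any row A1 with A1.B0 = 0 and A1.B2 = 0 is a scalar multiple of this vector.
direction = cross(B0, B2)   # (-20, 20, 60), i.e. 20 * (-1, 1, 3)
```

The recovered direction is a multiple of (-1, 1, 3), which agrees with the first row of the example matrix A up to scaling.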
Combinatorial Question
Input: sets Ji ⊆ {1,2,…,n} for 1 ≤ i ≤ m
Valid Solution: a subset Γ ⊆ {1,2,...,m} such that
∀ 1 ≤ i ≤ n : |{Jℓ : ℓ∈Γ and i∈Jℓ}| ≥ n-1
Goal: minimize |Γ|
This is the set-multicover problem with coverage
factor n-1
More generally, one can ask for lower coverage
factor, n-k for some k≥1, to allow fewer queries but
resulting in ambiguous determination of A
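On a tiny instance the combinatorial question can be answered by exhaustive search, which helps in checking what a heuristic or randomized algorithm returns. The sketch below is exponential-time and meant only as an illustration; the function name is hypothetical, and the zero pattern is the one shown for C0 in the slides' example (columns indexed 0..4).

```python
from itertools import combinations

def min_query_set(zero_rows, m, need):
    """zero_rows[i] = set of column indices j with c_ij = 0.
    Return a smallest set J of columns such that every row i has
    at least `need` of its zero columns inside J, or None if impossible."""
    for size in range(m + 1):                      # try smaller J first
        for J in combinations(range(m), size):
            chosen = set(J)
            if all(len(row & chosen) >= need for row in zero_rows):
                return chosen
    return None

# Zero structure of the 3x5 example: rows list the columns known to be zero.
zero_rows = [{0, 2, 4}, {3, 4}, {0, 1, 3}]
best = min_query_set(zero_rows, m=5, need=2)       # need = n-1 with n = 3
```

For this pattern the search returns the columns {0, 3, 4}: row 2 forces columns 3 and 4, and column 0 is the only single extra column that completes rows 1 and 3.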
Biological problem → (via Differential Equations) → Linear Algebraic formulation → Combinatorial formulation → Combinatorial Algorithms (randomized) → Selection of appropriate biological experiments
• Time evolution of state variables
(x1(t),x2(t),…,xn(t)) given by a set of differential
equations:
∂x/∂t = f(x,p) ≡
∂x1/∂t = f1(x1,x2,…,xn,p1,p2,…,pm)
⋮
∂xn/∂t = fn(x1,x2,…,xn,p1,p2,…,pm)
• p=(p1,p2,…,pm) represents concentration of certain
enzymes
• f(x̄,p̄)=0
p̄ is “wild type” (i.e. normal) condition of p
x̄ is corresponding steady-state condition
Goal
We are interested in obtaining information
about the sign of ∂fi/∂xj(x̄,p̄)
e.g., if ∂fi/∂xj > 0, then xj has a positive
(catalytic) effect on the formation of xi
Assumption
We do not know f, but do know that certain
parameters pj do not affect certain variables
xi
This gives zero structure of matrix C:
matrix C0=(c0ij) with c0ij=0 ⇒ ∂fi/∂pj=0
m experiments
• change one parameter, say pk (1 ≤ k ≤ m)
• for perturbed p ≈ p̄, measure steady state
vector x = ξ(p)
• estimate n “sensitivities” bij ≈ ∂ξi/∂pj(p̄), i=1,…,n, from the change in
steady state when p̄ is perturbed along ej,
where ej is the jth canonical basis vector
• consider matrix B = (bij)
In practice, a perturbation experiment involves:
• letting the system relax to steady state
• measuring expression profiles of variables xi
(e.g., using microarrays)
Biology to linear algebra (continued)
• Let A be the Jacobian matrix ∂f/∂x
• Let C be the negative of the Jacobian matrix
∂f/∂p
• From f(ξ(p),p)=0, taking derivative with
respect to p and using the chain rule, we get
C=AB.
This gives the linear algebraic formulation of
the problem.
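The chain-rule step can be written out as follows. This is a sketch of the standard derivation, writing ξ(p) for the steady state reached under parameters p and assuming that B is the sensitivity matrix ∂ξ/∂p evaluated at the wild-type condition p̄ (with x̄ = ξ(p̄)).

```latex
\begin{align*}
f(\xi(p),\,p) &= 0 \quad \text{for all } p \text{ near } \bar p \\
\frac{\partial f}{\partial x}(\bar x,\bar p)\,
\frac{\partial \xi}{\partial p}(\bar p)
 + \frac{\partial f}{\partial p}(\bar x,\bar p) &= 0 \\
\underbrace{\frac{\partial f}{\partial x}(\bar x,\bar p)}_{A}\;
\underbrace{\frac{\partial \xi}{\partial p}(\bar p)}_{B}
 &= \underbrace{-\frac{\partial f}{\partial p}(\bar x,\bar p)}_{C}
\end{align*}
```

Row i of this identity says that Ai·Bj = cij, so every known zero cij = 0 yields one linear constraint on the unknown row Ai.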
Online Set-multicover
Performance measure
Via competitive ratio:
ratio of the total cost of the online algorithm to
that of an optimal offline algorithm that knows
the entire input in advance
For randomized algorithm, we measure the expected
competitive ratio
Parameters of interest
(for performance measure)
• “frequency” m (unknown)
(maximum number of sets in which any presented element
belongs)
• maximum “set size” d (unknown)
(maximum number of presented elements a set contains)
• total number of elements in the universe n ( ≥ d) (unknown)
• coverage factor k (given)
Previous result
Alon, Awerbuch, Azar, Buchbinder, and Naor
(STOC 2003 and SODA 2004)
• considered k=1
• both deterministic and randomized algorithms
• competitive ratio O(log m log n), worst-case/expected
• almost matching lower bound of
for deterministic algorithms and “almost all” parameter
values
Our improved algorithm
Expected competitive ratio:
• O(log m log n) improved to O(log m log d), where d ≤ n
• small precise constants: log₂m ln d + lower order term
• ratio improves with larger k
c = largest weight / smallest weight
Even more precise smaller constants for
unweighted k=1 case
via improved analysis
Our lower bounds on competitive ratio
(for deterministic algorithms)
unweighted case
weighted case
for many values of parameters
Work concurrent to our conference publication
Alon, Azar and Gutner (SPAA 2005)
• different version of the online problem (weighted case)
– same element can be presented multiple times
– if the same element is presented k times, our goal is to
cover it by at least k different sets
• expected competitive ratio O(log m log n)
• easy to see that it applies to our version with same bounds
Conversely,
our algorithm and analysis can be easily adapted to provide
expected competitive ratio of
log₂m ln (d/....)
for the above version
Yet another version of online set-cover
Awerbuch, Azar, Fiat, Leighton (STOC 96)
• elements presented one at a time
• allowed to pick k sets at a given time for a specified k
• goal: maximize number of presented elements for which:
– at least one set containing the element was selected before
the element was presented
• provides efficient randomized approximation algorithms and
matching lower bounds
Our algorithmic approach
Randomized version of the so-called “winnowing” approach:
(deterministic) winnowing approach was first used long ago:
N. Littlestone, “Learning Quickly When Irrelevant Attributes Abound:
A New Linear-Threshold Algorithm”, Machine Learning, 2, pp. 285–318, 1988.
this approach was also used by Alon, Awerbuch, Azar,
Buchbinder and Naor in their STOC-2003 paper
Very very rough description of our approach
• every set starts with zero probability of selection
• start with an empty solution
• when the next element i is presented:
– if k selected sets already contain i, stop (nothing to do for i);
– “appropriately” increase probabilities of all sets
containing i (“promotion” step of winnowing)
– select sets containing i with the above probabilities
– if k sets are still not selected, then just select more sets
“greedily”:
• select the least-cost set not selected already, then
the next least-cost sets etc.
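The overall control flow of this rough description can be sketched as below. The promotion rule used here (doubling the probability and adding 1/(frequency × cost), capped at 1) is purely an assumption for illustration; the talk does not state the actual rule, and the function name is hypothetical.

```python
import random

def online_multicover(elements, sets, costs, k, rng=random):
    """Process `elements` one at a time; return indices of all selected sets.
    Assumes positive costs and that every presented element lies in >= k sets."""
    m = len(sets)
    prob = [0.0] * m                     # every set starts with zero probability
    selected = set()                     # start with an empty solution
    for i in elements:
        containing = [j for j in range(m) if i in sets[j]]
        if len(containing) < k:
            raise ValueError(f"element {i!r} lies in fewer than {k} sets")
        if sum(1 for j in containing if j in selected) >= k:
            continue                     # i is already covered k times
        # Promotion step (ASSUMED rule, not the authors'): raise probabilities,
        # less aggressively for costly sets and for high-frequency elements.
        for j in containing:
            prob[j] = min(1.0, 2 * prob[j] + 1.0 / (len(containing) * costs[j]))
        # Randomized selection with the updated probabilities.
        for j in containing:
            if j not in selected and rng.random() < prob[j]:
                selected.add(j)
        # Greedy top-up: cheapest unselected sets containing i until k are selected.
        missing = k - sum(1 for j in containing if j in selected)
        leftovers = sorted((j for j in containing if j not in selected),
                           key=lambda j: costs[j])
        selected.update(leftovers[:max(missing, 0)])
    return selected
```

Whatever the random choices, the greedy top-up guarantees that each presented element ends up covered by at least k selected sets; only the total cost depends on how well the promotion rule is tuned.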
Many desirable (and sometimes conflicting) goals
• increase in probability of each set should not be “too much”
– else, e.g., randomized step may select “too many” sets
• increase in probability of each set should not be “too little”
– else, e.g., optimal sets may be missed too many times,
greedy step may dominate too much
• “light” sets should be preferable over “heavy” sets unless
heavy sets are in an optimal solution
• increase in probability should be somehow inversely linked
to the frequency of i to eliminate selection of too many sets
in the randomized step
Slightly improved algorithm for unweighted case
(expected competitive ratio has better constants/asymptotic)
Modify the “promotion” step slightly
change
to
New expected competitive ratio:
Motivation for the online version:
Similar to before except that:
• we use fluorescent proteins instead of microarrays
– Fluorescent proteins can be used to know the rate at
which a certain gene transcribes in a cell under a set of
conditions.
• a priori matrix C is not known completely but to be
learnt by doing experiments
Thank you for your attention!