Several Geometric Tiling and Packing Problems With Applications

advertisement
Several Geometric Tiling and Packing Problems With
Applications


Bhaskar DasGupta
Department of Computer Science
University of Illinois at Chicago
Chicago, IL 60607-7053
Email: dasgupta@cs.uic.edu
Joint research works with various subsets of the following researchers:
Piotr Berman, Paul Bertone, Andrew Binkowski, Mark
Gerstein, Ming-Yang Kao, Jie Liang, Dhruv Mubayi,
S. Muthukrishnan, Robert Sloan, Michael Snyder, György
Turán & Yi Zhang
Supported by NSF Grants CCR-0296041, CCR-0206795 and CCR0208749 and a startup fund from UIC.
7/26/2016
University of Illinois at Chicago

Outline
Min-Max and Max-Min Tiling




General Packing





RTILE and DRTILE
Rectilinear Decomposition
Genome Tiling
d-RPACK
Protein substructure comparison
Non-overlapping local alignments of DNA
sequences
Inverse Protein Folding on 3D lattices
Possible Future Research Directions
7/26/2016
University of Illinois at Chicago
Two Basic Tiling Problems
Common General Setting
 Two dimensional n×n
numbers

array A of non-negative
Tile: a subarray A[p .. q][r .. s] of A
p
tile
r
q
s
Tiling: partition A into tiles such
that


tiles are disjoint

every A[i,j] is in some tile
7/26/2016
University of Illinois at Chicago
Common General Setting (continued)
 weight function w : tile → positive real such w is nondecreasing
w1
w2
w ≥ max {w1, w2}
e.g.: w ( tile ) = sum of array entries in that tile
w ( tile ) = maximum of array entries in that tile

weight bound W ≥ 0
7/26/2016
University of Illinois at Chicago
Min-Max Tiling
Produce a tiling of A with the minimum number of
tiles such that it satisfies the constraint:
weight of any tile ≤ W
Max-Min Tiling
Produce a tiling of A with the maximum number of
tiles such that it satisfies the constraint:
weight of any tile ≥ W
Note: Both problems can be generalized to d
dimensions:


given array is d-dimensional
each tile is a d-dimensional hyper-rectangle
7/26/2016
University of Illinois at Chicago
Why are these two problems fundamentally
different?

Local greedy splits may help in Min-Max tiling but
may not help in Max-Min tiling
>W

≤W
≤W
Combining two tiles may be helpful in Max-Min tiling,
but this is a more difficult operation to coordinate
since the tiles need to be aligned
7/26/2016
University of Illinois at Chicago



Applications will involve specific settings or
variations of the Max-Min or Min-Max tiling
Both Min-Max and Max-Min tiling are NP-complete;
Min-Max tiling cannot be approximated below a
factor of 5 4 in polynomial time
Dual Min-Max tiling:
 number of tiles is ≤ p
 minimize the maximum of weights of tiles
7/26/2016
University of Illinois at Chicago
RTILE problem


Dual Min-Max Tiling
weight of a tile is the sum of array elements in it
Example (p=3)
1
0
5
0
0
0
0
0
0
3
0
6
0
0
8
Input
n £ n array
A
An Optimal
Solution

In applications, A may be sparse, i.e.,
number of nonzero elements m ¿ n2
7/26/2016
University of Illinois at Chicago
Previous hardness results on RTILE

NP-hard
NP-hard to approximate within a factor of
even if A has entries only from
{0,1,2,3,4,5}

Application of RTILE
Equisum histogram of numerical attributes
etc.

(important special case: A is binary, i.e. A[i][j] is 0
or 1)
7/26/2016
University of Illinois at Chicago
Our Results on RTILE
Berman, DasGupta, Muthukrishnan & Ramaswami
(2001)
Problem
Time
RTILE
O(m+n)
RTILE, A binary
O(m+n)
*approximation
7/26/2016
Approximation Ratio*
2
ratio 
University of Illinois at Chicago
Approach for proving approximation bounds
Lower bound:
M = sum of all entries of A
p = (given) maximum number of tiles
y = largest entry in A
Then, any tiling must have at least one tile of weight
Approximation algorithm:
Show that all our tiles are of weight at most
2z
if A is binary
otherwise
7/26/2016
University of Illinois at Chicago
Technique used: (Greedy) Slice-and-dice
Slice: greedily partition A into a number of slices
(rectangles), not all of which necessarily satisfy
the problem constraints
Dice: (locally) adjust the slices to get good
approximation bounds
7/26/2016
University of Illinois at Chicago
How far can we go using such a simple lower
bound for RTILE?
Not too far, unfortunately!
0
1
0
p=3
1
2
1
0
1
0
M= 6
y = 2
sum of elements of A
maximum element of A
W=1+2+1=4 =2z
so, cannot approximate better than 2 using this bound
Similarly, if A is binary, cannot approximate better than
using this bound
7/26/2016
University of Illinois at Chicago
DRTILE Problem


Min-Max Tiling
Weight of a tile is the sum of elements in it
Example (W=5)
3
0
0
5
Input
n×n solution
array A
An
optimal

Again, in applications, A may be sparse
7/26/2016
University of Illinois at Chicago

Previous hardness result on DRTILE
NP-hard even if A has entries only from
{0,1,2,3,4,5}
Our Results on DRTILE
Berman, DasGupta, Muthukrishnan & Ramaswami
(2001)
Problem
Time
Approximation Ratio
DRTILE, A binary
O(m+n)
2
DRTILE
O(n5)
2
DRTILE, d-dimensional
O( d(m+n) )
2d-1
7/26/2016
University of Illinois at Chicago
Approach for proving approximation bounds for DRTILE
Need good lower bounds to compare with our solution


we need at least
rectangles where M = sum of entries
of A
use the notion of anti-rectangle (a.r.) pairs/sets
2
2
2
3
3
1
W = 12
a.r. pair since weight of
rectangle is > W
we need at least  rectangles
7/26/2016
University of Illinois at Chicago
Another approach to provide approximation
algorithms for DRTILE and other tiling problems
Binary Space Partitioning (BSP)
•Rectangles may be cut
into pieces by lines
•Finally, each partitioned
region contains at most
one piece of any
rectangle.
•n = number of rectangles
(=4)
•size of BSP = total
number of pieces of
objects at the end (=5)
7/26/2016
University of Illinois at Chicago
Bound on BSP size
Berman, DasGupta & Muthukrishnan (2001):
Let  denote the set of rectangles in a tiling of an array A.
There exists a BSP of  of size ≤ 2||-1.
Hierarchical binary partition (HBPC)
HBPC of array A is
 the entire array A or union of HBPs of a partition of A by
a horizontal or vertical line
 each rectangle in the HBP satisfies constraint C
 size of HBPC  the number of rectangles
size is 6
7/26/2016
University of Illinois at Chicago
HBPC of A (continued)
Constraint C for DRTILE:
weight (of the tile) ≤ W
Observation 1: for the type of constraints C applicable to
problems in this talk, HBPC of minimum size can be computed
in O(n5) time
Observation 2:
This gives the factor 2 approximation for DRTILE
7/26/2016
University of Illinois at Chicago
Rectilinear Decomposition Problem
Given a rectilinear polygon P (possibly with holes) of n
vertices, partition the interior of P into a minimum
number of rectangles
minimum number of rectangles is
5
Previous results: NP-hard if P is allowed to have point holes,
solvable in polynomial time otherwise
Berman, DasGupta and Muthukrishnan (2002): can approximate
within a factor of 2 in O(n5) time
Idea: Apply the BSP-based technique
7/26/2016
University of Illinois at Chicago
Max-Min Tiling
Produce a tiling of A with the maximum number of
tiles such that it satisfies the constraint:
weight of any tile ≥ W
Not difficult to see that the problem is NP-hard
Let p be the maximum number of tiles in an optimum
solution
Definition: (r,s)-approximation is a solution that
produces a tiling of A containing at least rp tiles each
of which has weight at least sW
7/26/2016
University of Illinois at Chicago
Berman, DasGupta and Muthukrishnan (2002):
Problem
Time
Max-Min
O(m+n)
O(n7)
O(n7)
Max-Min, A binary
O(m+n)
(r,s)-Approximation
(
(
, 1)
,
)
(½ , ¼)
(
,1)
Techniques used:
• greedy slicing followed by a complicated dicing
• BSP-based technique
7/26/2016
University of Illinois at Chicago
Genome Tiling Problems
Can be thought as a variation of Max-Min tiling on a
one-dimensional array with somewhat different
constraints
Input: a one-dimensional array c[0,n) of real numbers
-3 0
c0
5
c1 c2
………
………
and two size parameters `
-19
cn-1
and u
Define: block is a subarray B=c[i,j)
ci
7/26/2016
ci+1 ci+2 …… cj-1
University of Illinois at Chicago
Genome Tiling problems
Define: weight of a set of block is the sum of their
weights
…
B1
…
B2
…
Bk
…
Define: tile is a block of length between l and u
Goal: find a set of pairwise disjoint tiles of maximum
weight
7/26/2016
University of Illinois at Chicago

Variations/Restrictions of Genome Tiling
Compressed input: array entries are x or –x for
some fixed x. Then, we only need to store the
beginning and endings of maximal blocks of x’s.
c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
x x x -x -x x -x -x -x x x uncompressed
(n=11)
0
3
5
6
9
11 compressed
S0
S1
S2 S3
S4
S5
notation
(m=5)
usually m « n
7/26/2016
University of Illinois at Chicago


Variations/Restrictions of Genome Tiling
Limited Number of tiles: at most t tiles
Limited overlaps between tiles:
 two tiles share  p elements (p usually small)
 penalize for each shared element by subtracting
its weight
(l = 3, u = 4, p = 1)
 -125 -500 50 30 1 120 500 -200 -700 
score = (50+30+1) + (120+500+1) -1 = 701
7/26/2016
University of Illinois at Chicago
Variations/Restrictions of Genome Tiling
 d-dimensional for d > 1
Input:
d-dimensional array c[1,n1) [1,n2)  [1,nd)
 2d size parameters l1,u1,  , ld,ud ( i, li  ui)
Define:
 tile is a subarray B = c[i1,j1) [i2,j2)  [id,jd) with lk
≤ jk  ik ≤ uk
weight w(B) = sum of elements in B
Goal: find a set of pairwise disjoint tiles of maximum
weight

7/26/2016
University of Illinois at Chicago
Variations/Restrictions of Genome Tiling
(d-dimensional case)
Special cases have been looked at before, such as:
d=2
ARRAY-RPACK problem of
l1 = l2 = 0  Khanna, Muthukrishnan and
u1 = u2 = 
Skiena (ICALP 1997)
Unless otherwise mentioned, the default genome tiling
problem considered is:
• 1-dimensional
• has uncompressed input
• uses unlimited number of tiles (t=)
• no overlaps (p=0)
7/26/2016
University of Illinois at Chicago
Applications of genome tiling problems
Eukaryotic genome
non-coding (repetitive) sequence
(low complexity region)
Repeat sequences can be problematic in computation or
experiments:
• computational context: in homology search, often results
in spurious matches
•experimental context: in investigating binding of
complimentary DNAs, can generate false positive signals
and mask true positive signals
7/26/2016
University of Illinois at Chicago
Homology search application
Sequences can be screened by using special programs
such as RepeatMasker:
high-complexity component
• vast majority of high-complexity fragments may not be large
enough
(e.g., in Homo sapiens,  1Kbp)
• need larger (but not too large) contiguous subsequences for
homology search, gene prediction and many high-throughput
experimental applications
7/26/2016
University of Illinois at Chicago
Homology Search Applications
Tiles give reasonably sized high-complexity fragments in the
above problem:
 level of complexity of an element is expressed by its real
value
case of compressed input when the level of complexity is
binary
 typically, l  300, u  1500  2000
representing the average size of mammalian messenger RNA
transcripts
 searching sequence databases for homology matches can be
enhanced with non-zero overlap (p  100) for the case when
potential matches can be made at the boundaries
 number of tiles t is unbounded
7/26/2016
University of Illinois at Chicago
DNA Microarray Application
PCR-based or amplicon microarrays
 composed of large (typically 500  1500bp) subsequences of
genomic DNA
subsequences acquired via PCR (polymerase chain reaction)
Design amplicon microarray:
 select best set of tiles from target genomic sequence for
PCR amplification

maximize coverage of
microarray


maximize high-complexity
subsequence fragments
l  200
fragments below 200bp become difficult to recover when
amplified in high-throughput setting
7/26/2016
University of Illinois at Chicago
DNA Microarray Application (continued)



u  1500
balances two factors
 obtain maximal sequence coverage with limited number of
tiles
 produce small enough tiles to achieve sufficient array
resolution
no overlap (p = 0)
finite capacity of microarray  limited number of tiles
(for mamalian DNA where repeat content (and subsequent
sequence fragmentation is high, we expect the highcomplexity sequence nucleotides to cover about n/2
sequence elements)
7/26/2016
University of Illinois at Chicago
Berman, Bertone, DasGupta, Gerstein, Kao and Snyder (2002)
Version
Time
Space
Basic
O(n)
O(n)
overlap is from a s-subset‡ of
[0,],   l/2
O(sn)
O(n)
compressed input
O(m
)
O(m
Number of tiles ≤ t
‡s-subset
O(n)
if a subset of s elements
For our biological applications, p ≤ 100  l/2«n, m«n, and
l/(ul)6
7/26/2016
University of Illinois at Chicago
)
Berman, Bertone, DasGupta, Gerstein, Kao and Snyder (2002)
d-dimensional
Time
Space
Approximation Ratio
O(M)
(1(1/))d
d-dimensional, number of tiles ≤ t
Time
Space
O(tM+dM logM+dN (log N/log log N)
O(M)
Approximation Ratio
same as time
7/26/2016
University of Illinois at Chicago
Online Interval Maximum (OLIM) Problem
(a crucial component of our algorithms for genome tiling)
Input:
 a sequence a0,a1, ,an-1 of real values in increasing order
each ai is an argument or a test (possibly both)
 2 real numbers 0  l1  u1  l2  u2    l  u
 a function g : arguments  R (neglect the time to compute)
Output:
 for every test ak compute bk:
Online limitations:

examine a0,a1,  one at a time from left to right

if ak is a test, then compute bk before evaluating g(ak)
7/26/2016
University of Illinois at Chicago
Illustration of OLIM for  = 2
bk = maximum of the maximums
maximum

 g(ai) 
aku1 ≤ ai  akl1

maximum
 g(aj) 

ak
aku2 ≤ aj  akl2
Theorem: We can solve OLIM in O(n) time and O(n+) space.
(time/space is independent of li and ui’s)
Datar, Gionis, Indyk and Motwani (SODA-2002) considered a restricted
version of OLIM in the context of maintaining stream statistics in the
sliding window model and briefly mentions a solution for that
7/26/2016
University of Illinois at Chicago
General Packing Problems
Input:
 n d-dimensional hyper-rectangles R1,R2,  ,Rn with a
non-negative weight wi for each Ri
 an integer p satisfying 1 ≤ p ≤ n
 a relation R defined on pairs of rectangles
Goal:
 a subset S of p hyper-rectangles such that
is maximized

(note:
7/26/2016
is the negation of relation R)
University of Illinois at Chicago
Dual of General Packing Problem
Includes as a special case the following
problem:
find a minimum cardinality subset of
disjoint (non-intersecting) rectangles of
total weight at least W
It is NP-hard to find a feasible solution to the
above problem
Hence, we do not consider the dual version
further
7/26/2016
University of Illinois at Chicago

Ri
d-RPACK Problem
Rj if and only if they do not intersect
Illustration of 2-RPACK (p=2)
5
1
2
total weight = 12
7
1


Applications
database decision support
resource allocation problems etc.
7/26/2016
University of Illinois at Chicago

Berman, DasGupta, Muthukrishnan and Ramaswami (2001)
Assume that the given n d-dimensional hyper-rectangles
have their endpoints from the set {1,2,  ,N}
Time
Approximation Ratio
( 1+log2 N)d-1
  0 and c  1 are arbitrary constants
Key strategies behind these approximation algorithms:


divide and conquer
for the second approximation, solve several levels of the
recursion tree simultaneously
7/26/2016
University of Illinois at Chicago
Protein Substructure Similarity
a
b’
b
c’
c
a’
a matches to a’ with similarity 10
b matches to b’ with similarity 15
c matches to c’ with similarity 11
total similarity 36
Goal: match disjoint substructures to maximize total
similarity
7/26/2016
University of Illinois at Chicago
Few Complications

Many short vs. fewer long substructures
• Measure of similarity between substructures
Examples:
 rmsd (root-mean-square distance) between 3D substructures
7/26/2016
University of Illinois at Chicago
Common substructure between protein structures
(work in progress.......with Jie Liang and Andrew Binkowski)
Comparison of 2 4-helix bundles that differ by topological rearrangement,
ROP and cytochrome b56
(a) Topological cartoons of 1ROP and 256B. Helices are drawn as cylinders
and loops as lines. Residue numbers of structurally equivalent segments
are indicated on the cylinders.
(b) The alignment is non-sequential.
7/26/2016
University of Illinois at Chicago
Motivation:
discovering similar substructures from different proteins is essential for
recognizing remote evolutionary relationship at the level of protein fragments
Few interesting points:


it is not easy to characterize topological structures such as void, pocket, or
tunnel where ligand and other molecules bind.
Current computational tools do not perform very well on discovering similar
substructures.
For example:
(a) protein structures are typically represented by distance matrices or
contact maps, which record pairwise inter-distances between selected atoms
(typically Cα atoms) on the primary sequences
(b) finding common substructures becomes matching submatrices of the two
contact maps
(c) Heuristic algorithms have been developed and have proven to be useful.
But, they are time consuming (typically O(n6)), and cannot be used for more
demanding tasks such as identifying spatial functional motifs
7/26/2016
University of Illinois at Chicago




Our approach in work in progress
reduce the problem to a 2-RPACK problem
use combinatorial methods (such as the local-ratio
technique) to design approximation algorithms for these
problems
Our final goal
identification of the most discriminating geometric and
chemical features and their combinations for various
proteins
development of a robust method to compute the
similarity/dissimilarity of two shape distributions of these
features
7/26/2016
University of Illinois at Chicago
How substructure comparison can be viewed as a
2-RPACK problem?
similarity of matching two fragments
5
7/26/2016
University of Illinois at Chicago
Non-overlapping local alignment of DNA sequences
total similarity 10+15=25
7/26/2016
University of Illinois at Chicago
The problem
Input: pairs of fragments, one from each sequence
(or, equivalently a set of rectangles).
the weight of each pair (rectangle) is their
similarity
Output: a set of pairs (rectangles) such that
 no two rectangles overlap on the x-axis
(i.e., matched fragments of the first sequence are
disjoint)
 no two rectangles overlap on the y-axis
(i.e., matched fragments of the 2nd sequence are
disjoint)
 total similarity of selected fragment pairs is
maximized
7/26/2016
University of Illinois at Chicago
Further assumption
We can preprocess input data (rectangles or fragment pairs)
to ensure that
 for any two rectangles, the projection of one on the yaxis does not enclose that of another
not allowed
in the input
data

for any two rectangles, the projection of one on the
x-axis does not enclose that of another
7/26/2016
University of Illinois at Chicago
An illustration
Input:
A
G
15
2
G
C
1
C
10
T
A
A
G
C
A
C
C
An optimal solution of total similarity 25
7/26/2016
University of Illinois at Chicago
Previous results
(n = number of rectangles (fragment pairs))
Bafna, Narayanan and Ravi (WADS’95)


NP-complete
O(n2) time approximation algorithm with approximation ratio 3.25
 converts to a problem of finding maximum-weight independent set in a 5clawfree graph
 gives approximation algorithm for (d+1)-clawfree graphs with
approximation ratio of
Halldórsson (SODA’95)
– approximation algorithm with approximation ratio of about 2.5 when
all weights are one
• again uses clawfree graphs
Berman (SWAT’00)
–
O(n4) time algorithm with approximation ratio of about 2.5
• via clawfree graphs again
7/26/2016
University of Illinois at Chicago
Berman, DasGupta and Muthukrishnan (2002)






O(n log n) time approximation algorithm with approximation ratio 3
very simple to implement
 uses a 2-phase approach (or, equivalently, the local-ratio technique)
Extensions to d dimensions (d > 2)
Inputs are similarity measures of d fragments, one from each of given d
sequences
Motivation: multiple sequence comparison problems
Generalization of our above approach:
 O(n d log n) time approximation algorithm with approximation ratio
of 2d-1
current best (Bar-Yehuda, Halldórsson, Naor, Shachnai and Shapira,
SODA’02):

polynomial time algorithm with approximation ratio 2d
 uses repeated linear programming and continuous version of local-ratio
techniques
7/26/2016
University of Illinois at Chicago
7/26/2016
University of Illinois at Chicago
7/26/2016
University of Illinois at Chicago
7/26/2016
University of Illinois at Chicago
7/26/2016
University of Illinois at Chicago
7/26/2016
University of Illinois at Chicago
7/26/2016
University of Illinois at Chicago
Inverse Protein Folding on 3D Lattices
(Canonical Model)
3D Sequence
Contact graph
Our problem: given the contact graph
find a generating sequence
7/26/2016
University of Illinois at Chicago
Contact graph of a 3D sequence



subgraph of a 3D lattice
every vertex except two has degree at most 4
(these two vertices, if present, has degree 5)
can be realized as the contact graph of some 3D sequence
Canonical Model



Sequence vertices are of two types:
H (Hydrophobic) and P (polar)
H vertices tend to cluster together, so H-H contacts are
encouraged, other contacts are not encouraged
Of the n vertices of the sequence, at most n can be H
(otherwise, we can as well label all vertices as H)
7/26/2016
University of Illinois at Chicago
Inverse Protein Folding under Canonical Model on 3D
Lattices (IPFC3)
Input: A contact graph G=(V,E) and 0   < 1
Valid Solution: An induced subgraph G’=(V’,E’) with
|V’|  |V|
Goal: maximize |E’|
|V| = 12
=½
|V’| =  |V| = 6
|E’| = 7
7/26/2016
University of Illinois at Chicago
Theorem
(a) IPFC3 is NP-complete
(b) We can design a polynomial-time approximation
scheme (PTAS) for IPFC3, i.e.,
for every constant 0 <  < 1, we can design a
O(|V|2) time algorithm that returns a solution
with at least (1  ) times the optimum number
of edges
For (b), we use the shifted slice-and-dice technique
7/26/2016
University of Illinois at Chicago
Possible Future Research Directions

Other useful instances of the Max-Min and the Min-Max
tiling problems with applications to Bioinformatics?
Additional constraints may be necessary, such as:




certain combinations of tiles are forbidden or discouraged?
partial overlaps of tiles are allowed (possibly with a penalty?)
weight of a tile or set of tiles is more complicated (e.g., least-square
error of the best linear approximation of these points?)
More efficient solutions of special cases of d-RPACK, such
as:


tighter approximation of identification of common substructures
from multiple structures
local alignments with more twist, such as:


computing all best or near-best alignments and efficiently representing
them
identifying gene families?
7/26/2016
University of Illinois at Chicago
Finally, the end
7/26/2016
University of Illinois at Chicago
Related documents
Download