Several Geometric Tiling and Packing Problems With Applications Bhaskar DasGupta Department of Computer Science University of Illinois at Chicago Chicago, IL 60607-7053 Email: dasgupta@cs.uic.edu Joint research works with various subsets of the following researchers: Piotr Berman, Paul Bertone, Andrew Binkowski, Mark Gerstein, Ming-Yang Kao, Jie Liang, Dhruv Mubayi, S. Muthukrishnan, Robert Sloan, Michael Snyder, György Turán & Yi Zhang Supported by NSF Grants CCR-0296041, CCR-0206795 and CCR0208749 and a startup fund from UIC. 7/26/2016 University of Illinois at Chicago Outline Min-Max and Max-Min Tiling General Packing RTILE and DRTILE Rectilinear Decomposition Genome Tiling d-RPACK Protein substructure comparison Non-overlapping local alignments of DNA sequences Inverse Protein Folding on 3D lattices Possible Future Research Directions 7/26/2016 University of Illinois at Chicago Two Basic Tiling Problems Common General Setting Two dimensional n×n numbers array A of non-negative Tile: a subarray A[p .. q][r .. s] of A p tile r q s Tiling: partition A into tiles such that tiles are disjoint every A[i,j] is in some tile 7/26/2016 University of Illinois at Chicago Common General Setting (continued) weight function w : tile → positive real such w is nondecreasing w1 w2 w ≥ max {w1, w2} e.g.: w ( tile ) = sum of array entries in that tile w ( tile ) = maximum of array entries in that tile weight bound W ≥ 0 7/26/2016 University of Illinois at Chicago Min-Max Tiling Produce a tiling of A with the minimum number of tiles such that it satisfies the constraint: weight of any tile ≤ W Max-Min Tiling Produce a tiling of A with the maximum number of tiles such that it satisfies the constraint: weight of any tile ≥ W Note: Both problems can be generalized to d dimensions: given array is d-dimensional each tile is a d-dimensional hyper-rectangle 7/26/2016 University of Illinois at Chicago Why are these two problems fundamentally different? Local greedy splits may help in Min-Max tiling but may not help in Max-Min tiling >W ≤W ≤W Combining two tiles may be helpful in Max-Min tiling, but this is a more difficult operation to coordinate since the tiles need to be aligned 7/26/2016 University of Illinois at Chicago Applications will involve specific settings or variations of the Max-Min or Min-Max tiling Both Min-Max and Max-Min tiling are NP-complete; Min-Max tiling cannot be approximated below a factor of 5 4 in polynomial time Dual Min-Max tiling: number of tiles is ≤ p minimize the maximum of weights of tiles 7/26/2016 University of Illinois at Chicago RTILE problem Dual Min-Max Tiling weight of a tile is the sum of array elements in it Example (p=3) 1 0 5 0 0 0 0 0 0 3 0 6 0 0 8 Input n £ n array A An Optimal Solution In applications, A may be sparse, i.e., number of nonzero elements m ¿ n2 7/26/2016 University of Illinois at Chicago Previous hardness results on RTILE NP-hard NP-hard to approximate within a factor of even if A has entries only from {0,1,2,3,4,5} Application of RTILE Equisum histogram of numerical attributes etc. (important special case: A is binary, i.e. A[i][j] is 0 or 1) 7/26/2016 University of Illinois at Chicago Our Results on RTILE Berman, DasGupta, Muthukrishnan & Ramaswami (2001) Problem Time RTILE O(m+n) RTILE, A binary O(m+n) *approximation 7/26/2016 Approximation Ratio* 2 ratio University of Illinois at Chicago Approach for proving approximation bounds Lower bound: M = sum of all entries of A p = (given) maximum number of tiles y = largest entry in A Then, any tiling must have at least one tile of weight Approximation algorithm: Show that all our tiles are of weight at most 2z if A is binary otherwise 7/26/2016 University of Illinois at Chicago Technique used: (Greedy) Slice-and-dice Slice: greedily partition A into a number of slices (rectangles), not all of which necessarily satisfy the problem constraints Dice: (locally) adjust the slices to get good approximation bounds 7/26/2016 University of Illinois at Chicago How far can we go using such a simple lower bound for RTILE? Not too far, unfortunately! 0 1 0 p=3 1 2 1 0 1 0 M= 6 y = 2 sum of elements of A maximum element of A W=1+2+1=4 =2z so, cannot approximate better than 2 using this bound Similarly, if A is binary, cannot approximate better than using this bound 7/26/2016 University of Illinois at Chicago DRTILE Problem Min-Max Tiling Weight of a tile is the sum of elements in it Example (W=5) 3 0 0 5 Input n×n solution array A An optimal Again, in applications, A may be sparse 7/26/2016 University of Illinois at Chicago Previous hardness result on DRTILE NP-hard even if A has entries only from {0,1,2,3,4,5} Our Results on DRTILE Berman, DasGupta, Muthukrishnan & Ramaswami (2001) Problem Time Approximation Ratio DRTILE, A binary O(m+n) 2 DRTILE O(n5) 2 DRTILE, d-dimensional O( d(m+n) ) 2d-1 7/26/2016 University of Illinois at Chicago Approach for proving approximation bounds for DRTILE Need good lower bounds to compare with our solution we need at least rectangles where M = sum of entries of A use the notion of anti-rectangle (a.r.) pairs/sets 2 2 2 3 3 1 W = 12 a.r. pair since weight of rectangle is > W we need at least rectangles 7/26/2016 University of Illinois at Chicago Another approach to provide approximation algorithms for DRTILE and other tiling problems Binary Space Partitioning (BSP) •Rectangles may be cut into pieces by lines •Finally, each partitioned region contains at most one piece of any rectangle. •n = number of rectangles (=4) •size of BSP = total number of pieces of objects at the end (=5) 7/26/2016 University of Illinois at Chicago Bound on BSP size Berman, DasGupta & Muthukrishnan (2001): Let denote the set of rectangles in a tiling of an array A. There exists a BSP of of size ≤ 2||-1. Hierarchical binary partition (HBPC) HBPC of array A is the entire array A or union of HBPs of a partition of A by a horizontal or vertical line each rectangle in the HBP satisfies constraint C size of HBPC the number of rectangles size is 6 7/26/2016 University of Illinois at Chicago HBPC of A (continued) Constraint C for DRTILE: weight (of the tile) ≤ W Observation 1: for the type of constraints C applicable to problems in this talk, HBPC of minimum size can be computed in O(n5) time Observation 2: This gives the factor 2 approximation for DRTILE 7/26/2016 University of Illinois at Chicago Rectilinear Decomposition Problem Given a rectilinear polygon P (possibly with holes) of n vertices, partition the interior of P into a minimum number of rectangles minimum number of rectangles is 5 Previous results: NP-hard if P is allowed to have point holes, solvable in polynomial time otherwise Berman, DasGupta and Muthukrishnan (2002): can approximate within a factor of 2 in O(n5) time Idea: Apply the BSP-based technique 7/26/2016 University of Illinois at Chicago Max-Min Tiling Produce a tiling of A with the maximum number of tiles such that it satisfies the constraint: weight of any tile ≥ W Not difficult to see that the problem is NP-hard Let p be the maximum number of tiles in an optimum solution Definition: (r,s)-approximation is a solution that produces a tiling of A containing at least rp tiles each of which has weight at least sW 7/26/2016 University of Illinois at Chicago Berman, DasGupta and Muthukrishnan (2002): Problem Time Max-Min O(m+n) O(n7) O(n7) Max-Min, A binary O(m+n) (r,s)-Approximation ( ( , 1) , ) (½ , ¼) ( ,1) Techniques used: • greedy slicing followed by a complicated dicing • BSP-based technique 7/26/2016 University of Illinois at Chicago Genome Tiling Problems Can be thought as a variation of Max-Min tiling on a one-dimensional array with somewhat different constraints Input: a one-dimensional array c[0,n) of real numbers -3 0 c0 5 c1 c2 ……… ……… and two size parameters ` -19 cn-1 and u Define: block is a subarray B=c[i,j) ci 7/26/2016 ci+1 ci+2 …… cj-1 University of Illinois at Chicago Genome Tiling problems Define: weight of a set of block is the sum of their weights … B1 … B2 … Bk … Define: tile is a block of length between l and u Goal: find a set of pairwise disjoint tiles of maximum weight 7/26/2016 University of Illinois at Chicago Variations/Restrictions of Genome Tiling Compressed input: array entries are x or –x for some fixed x. Then, we only need to store the beginning and endings of maximal blocks of x’s. c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 x x x -x -x x -x -x -x x x uncompressed (n=11) 0 3 5 6 9 11 compressed S0 S1 S2 S3 S4 S5 notation (m=5) usually m « n 7/26/2016 University of Illinois at Chicago Variations/Restrictions of Genome Tiling Limited Number of tiles: at most t tiles Limited overlaps between tiles: two tiles share p elements (p usually small) penalize for each shared element by subtracting its weight (l = 3, u = 4, p = 1) -125 -500 50 30 1 120 500 -200 -700 score = (50+30+1) + (120+500+1) -1 = 701 7/26/2016 University of Illinois at Chicago Variations/Restrictions of Genome Tiling d-dimensional for d > 1 Input: d-dimensional array c[1,n1) [1,n2) [1,nd) 2d size parameters l1,u1, , ld,ud ( i, li ui) Define: tile is a subarray B = c[i1,j1) [i2,j2) [id,jd) with lk ≤ jk ik ≤ uk weight w(B) = sum of elements in B Goal: find a set of pairwise disjoint tiles of maximum weight 7/26/2016 University of Illinois at Chicago Variations/Restrictions of Genome Tiling (d-dimensional case) Special cases have been looked at before, such as: d=2 ARRAY-RPACK problem of l1 = l2 = 0 Khanna, Muthukrishnan and u1 = u2 = Skiena (ICALP 1997) Unless otherwise mentioned, the default genome tiling problem considered is: • 1-dimensional • has uncompressed input • uses unlimited number of tiles (t=) • no overlaps (p=0) 7/26/2016 University of Illinois at Chicago Applications of genome tiling problems Eukaryotic genome non-coding (repetitive) sequence (low complexity region) Repeat sequences can be problematic in computation or experiments: • computational context: in homology search, often results in spurious matches •experimental context: in investigating binding of complimentary DNAs, can generate false positive signals and mask true positive signals 7/26/2016 University of Illinois at Chicago Homology search application Sequences can be screened by using special programs such as RepeatMasker: high-complexity component • vast majority of high-complexity fragments may not be large enough (e.g., in Homo sapiens, 1Kbp) • need larger (but not too large) contiguous subsequences for homology search, gene prediction and many high-throughput experimental applications 7/26/2016 University of Illinois at Chicago Homology Search Applications Tiles give reasonably sized high-complexity fragments in the above problem: level of complexity of an element is expressed by its real value case of compressed input when the level of complexity is binary typically, l 300, u 1500 2000 representing the average size of mammalian messenger RNA transcripts searching sequence databases for homology matches can be enhanced with non-zero overlap (p 100) for the case when potential matches can be made at the boundaries number of tiles t is unbounded 7/26/2016 University of Illinois at Chicago DNA Microarray Application PCR-based or amplicon microarrays composed of large (typically 500 1500bp) subsequences of genomic DNA subsequences acquired via PCR (polymerase chain reaction) Design amplicon microarray: select best set of tiles from target genomic sequence for PCR amplification maximize coverage of microarray maximize high-complexity subsequence fragments l 200 fragments below 200bp become difficult to recover when amplified in high-throughput setting 7/26/2016 University of Illinois at Chicago DNA Microarray Application (continued) u 1500 balances two factors obtain maximal sequence coverage with limited number of tiles produce small enough tiles to achieve sufficient array resolution no overlap (p = 0) finite capacity of microarray limited number of tiles (for mamalian DNA where repeat content (and subsequent sequence fragmentation is high, we expect the highcomplexity sequence nucleotides to cover about n/2 sequence elements) 7/26/2016 University of Illinois at Chicago Berman, Bertone, DasGupta, Gerstein, Kao and Snyder (2002) Version Time Space Basic O(n) O(n) overlap is from a s-subset‡ of [0,], l/2 O(sn) O(n) compressed input O(m ) O(m Number of tiles ≤ t ‡s-subset O(n) if a subset of s elements For our biological applications, p ≤ 100 l/2«n, m«n, and l/(ul)6 7/26/2016 University of Illinois at Chicago ) Berman, Bertone, DasGupta, Gerstein, Kao and Snyder (2002) d-dimensional Time Space Approximation Ratio O(M) (1(1/))d d-dimensional, number of tiles ≤ t Time Space O(tM+dM logM+dN (log N/log log N) O(M) Approximation Ratio same as time 7/26/2016 University of Illinois at Chicago Online Interval Maximum (OLIM) Problem (a crucial component of our algorithms for genome tiling) Input: a sequence a0,a1, ,an-1 of real values in increasing order each ai is an argument or a test (possibly both) 2 real numbers 0 l1 u1 l2 u2 l u a function g : arguments R (neglect the time to compute) Output: for every test ak compute bk: Online limitations: examine a0,a1, one at a time from left to right if ak is a test, then compute bk before evaluating g(ak) 7/26/2016 University of Illinois at Chicago Illustration of OLIM for = 2 bk = maximum of the maximums maximum g(ai) aku1 ≤ ai akl1 maximum g(aj) ak aku2 ≤ aj akl2 Theorem: We can solve OLIM in O(n) time and O(n+) space. (time/space is independent of li and ui’s) Datar, Gionis, Indyk and Motwani (SODA-2002) considered a restricted version of OLIM in the context of maintaining stream statistics in the sliding window model and briefly mentions a solution for that 7/26/2016 University of Illinois at Chicago General Packing Problems Input: n d-dimensional hyper-rectangles R1,R2, ,Rn with a non-negative weight wi for each Ri an integer p satisfying 1 ≤ p ≤ n a relation R defined on pairs of rectangles Goal: a subset S of p hyper-rectangles such that is maximized (note: 7/26/2016 is the negation of relation R) University of Illinois at Chicago Dual of General Packing Problem Includes as a special case the following problem: find a minimum cardinality subset of disjoint (non-intersecting) rectangles of total weight at least W It is NP-hard to find a feasible solution to the above problem Hence, we do not consider the dual version further 7/26/2016 University of Illinois at Chicago Ri d-RPACK Problem Rj if and only if they do not intersect Illustration of 2-RPACK (p=2) 5 1 2 total weight = 12 7 1 Applications database decision support resource allocation problems etc. 7/26/2016 University of Illinois at Chicago Berman, DasGupta, Muthukrishnan and Ramaswami (2001) Assume that the given n d-dimensional hyper-rectangles have their endpoints from the set {1,2, ,N} Time Approximation Ratio ( 1+log2 N)d-1 0 and c 1 are arbitrary constants Key strategies behind these approximation algorithms: divide and conquer for the second approximation, solve several levels of the recursion tree simultaneously 7/26/2016 University of Illinois at Chicago Protein Substructure Similarity a b’ b c’ c a’ a matches to a’ with similarity 10 b matches to b’ with similarity 15 c matches to c’ with similarity 11 total similarity 36 Goal: match disjoint substructures to maximize total similarity 7/26/2016 University of Illinois at Chicago Few Complications Many short vs. fewer long substructures • Measure of similarity between substructures Examples: rmsd (root-mean-square distance) between 3D substructures 7/26/2016 University of Illinois at Chicago Common substructure between protein structures (work in progress.......with Jie Liang and Andrew Binkowski) Comparison of 2 4-helix bundles that differ by topological rearrangement, ROP and cytochrome b56 (a) Topological cartoons of 1ROP and 256B. Helices are drawn as cylinders and loops as lines. Residue numbers of structurally equivalent segments are indicated on the cylinders. (b) The alignment is non-sequential. 7/26/2016 University of Illinois at Chicago Motivation: discovering similar substructures from different proteins is essential for recognizing remote evolutionary relationship at the level of protein fragments Few interesting points: it is not easy to characterize topological structures such as void, pocket, or tunnel where ligand and other molecules bind. Current computational tools do not perform very well on discovering similar substructures. For example: (a) protein structures are typically represented by distance matrices or contact maps, which record pairwise inter-distances between selected atoms (typically Cα atoms) on the primary sequences (b) finding common substructures becomes matching submatrices of the two contact maps (c) Heuristic algorithms have been developed and have proven to be useful. But, they are time consuming (typically O(n6)), and cannot be used for more demanding tasks such as identifying spatial functional motifs 7/26/2016 University of Illinois at Chicago Our approach in work in progress reduce the problem to a 2-RPACK problem use combinatorial methods (such as the local-ratio technique) to design approximation algorithms for these problems Our final goal identification of the most discriminating geometric and chemical features and their combinations for various proteins development of a robust method to compute the similarity/dissimilarity of two shape distributions of these features 7/26/2016 University of Illinois at Chicago How substructure comparison can be viewed as a 2-RPACK problem? similarity of matching two fragments 5 7/26/2016 University of Illinois at Chicago Non-overlapping local alignment of DNA sequences total similarity 10+15=25 7/26/2016 University of Illinois at Chicago The problem Input: pairs of fragments, one from each sequence (or, equivalently a set of rectangles). the weight of each pair (rectangle) is their similarity Output: a set of pairs (rectangles) such that no two rectangles overlap on the x-axis (i.e., matched fragments of the first sequence are disjoint) no two rectangles overlap on the y-axis (i.e., matched fragments of the 2nd sequence are disjoint) total similarity of selected fragment pairs is maximized 7/26/2016 University of Illinois at Chicago Further assumption We can preprocess input data (rectangles or fragment pairs) to ensure that for any two rectangles, the projection of one on the yaxis does not enclose that of another not allowed in the input data for any two rectangles, the projection of one on the x-axis does not enclose that of another 7/26/2016 University of Illinois at Chicago An illustration Input: A G 15 2 G C 1 C 10 T A A G C A C C An optimal solution of total similarity 25 7/26/2016 University of Illinois at Chicago Previous results (n = number of rectangles (fragment pairs)) Bafna, Narayanan and Ravi (WADS’95) NP-complete O(n2) time approximation algorithm with approximation ratio 3.25 converts to a problem of finding maximum-weight independent set in a 5clawfree graph gives approximation algorithm for (d+1)-clawfree graphs with approximation ratio of Halldórsson (SODA’95) – approximation algorithm with approximation ratio of about 2.5 when all weights are one • again uses clawfree graphs Berman (SWAT’00) – O(n4) time algorithm with approximation ratio of about 2.5 • via clawfree graphs again 7/26/2016 University of Illinois at Chicago Berman, DasGupta and Muthukrishnan (2002) O(n log n) time approximation algorithm with approximation ratio 3 very simple to implement uses a 2-phase approach (or, equivalently, the local-ratio technique) Extensions to d dimensions (d > 2) Inputs are similarity measures of d fragments, one from each of given d sequences Motivation: multiple sequence comparison problems Generalization of our above approach: O(n d log n) time approximation algorithm with approximation ratio of 2d-1 current best (Bar-Yehuda, Halldórsson, Naor, Shachnai and Shapira, SODA’02): polynomial time algorithm with approximation ratio 2d uses repeated linear programming and continuous version of local-ratio techniques 7/26/2016 University of Illinois at Chicago 7/26/2016 University of Illinois at Chicago 7/26/2016 University of Illinois at Chicago 7/26/2016 University of Illinois at Chicago 7/26/2016 University of Illinois at Chicago 7/26/2016 University of Illinois at Chicago 7/26/2016 University of Illinois at Chicago Inverse Protein Folding on 3D Lattices (Canonical Model) 3D Sequence Contact graph Our problem: given the contact graph find a generating sequence 7/26/2016 University of Illinois at Chicago Contact graph of a 3D sequence subgraph of a 3D lattice every vertex except two has degree at most 4 (these two vertices, if present, has degree 5) can be realized as the contact graph of some 3D sequence Canonical Model Sequence vertices are of two types: H (Hydrophobic) and P (polar) H vertices tend to cluster together, so H-H contacts are encouraged, other contacts are not encouraged Of the n vertices of the sequence, at most n can be H (otherwise, we can as well label all vertices as H) 7/26/2016 University of Illinois at Chicago Inverse Protein Folding under Canonical Model on 3D Lattices (IPFC3) Input: A contact graph G=(V,E) and 0 < 1 Valid Solution: An induced subgraph G’=(V’,E’) with |V’| |V| Goal: maximize |E’| |V| = 12 =½ |V’| = |V| = 6 |E’| = 7 7/26/2016 University of Illinois at Chicago Theorem (a) IPFC3 is NP-complete (b) We can design a polynomial-time approximation scheme (PTAS) for IPFC3, i.e., for every constant 0 < < 1, we can design a O(|V|2) time algorithm that returns a solution with at least (1 ) times the optimum number of edges For (b), we use the shifted slice-and-dice technique 7/26/2016 University of Illinois at Chicago Possible Future Research Directions Other useful instances of the Max-Min and the Min-Max tiling problems with applications to Bioinformatics? Additional constraints may be necessary, such as: certain combinations of tiles are forbidden or discouraged? partial overlaps of tiles are allowed (possibly with a penalty?) weight of a tile or set of tiles is more complicated (e.g., least-square error of the best linear approximation of these points?) More efficient solutions of special cases of d-RPACK, such as: tighter approximation of identification of common substructures from multiple structures local alignments with more twist, such as: computing all best or near-best alignments and efficiently representing them identifying gene families? 7/26/2016 University of Illinois at Chicago Finally, the end 7/26/2016 University of Illinois at Chicago