Physical Mapping of DNA
Shanna Terry
March 2, 2004

Overview
• Background
• Types of Mapping
• Mathematical Models
• Enhanced Double Digest Problem

Background
Given a sequence of DNA, how do we figure out where on some larger chromosome the sequence lies?

Background II
Look for markers that match in both the chromosome and the shorter sequence.
– Markers: usually short, precisely defined sequences
[Figure: markers in the chromosome and the shorter sequence, labeled "match!"]

Background III
How do we create the original map?
• Generate fingerprints (markers) with:
  – Restriction site mapping
  – Hybridization

Background IV
But why? Can’t we just extend the sequence assembly techniques we’ve already learned?
NO! (with one exception)
Why not? A chromosome isn’t just 150k bp long. … Human chromosomes range in length from 51 million to 245 million base pairs.

Restriction Site Mapping
In this situation, the fingerprint is the set of lengths between restriction sites of given enzymes (recall from previous lectures).
• Make three copies of the target DNA: strings A, B, C.
• Apply one enzyme (α) to string A, another (β) to string B, and both (α and β) to string C.
• Line up the fragments in A and B so they match C: this is the double digest problem.
[Figure: fragment maps. A: 6, 14, 7, 5; B: 8, 3, 17, 4; C: 6, 2, 3, 9, 7, 1, 4]

Restriction Site Mapping II
A variant is the partial digest approach:
• Use only one enzyme, but allow it to act for different time periods, so that a different subset of restriction sites is cut each time.
[Figure: a target with fragments 6, 14, 7, 5]
Fragment lengths: 6, 20, 27, 32; 14, 21, 26; 7, 12; and 5

Restriction Site Mapping III
[Figure: the partial digest fragments built up from the pieces 6, 14, 7, 5, giving the fragment lengths listed above]

Hybridization Mapping
• Check whether specific small sequences (called probes) bind (hybridize) to fragments (clones).
• The fingerprint is the subset of probes that successfully hybridize to the clone.
• If some portion of one clone’s fingerprint matches another’s, the two clones are likely to come from overlapping regions of the target.

Hybridization Mapping II
Probes x, y, and z bound to clone A; x, w, and z bound to clone B… so the clones overlap in x and z.
[Figure: clones A and B with probes y, x, z, w]
Except… we don’t know that much. We only know which probes bind to which clones. Not their order, or even their relative lengths!

Restriction Site Models
• Back to the double digest problem: we’ve split the strings A, B, C into fragments with two enzymes.
• We have the multisets made up of the fragment lengths:
  – From the previous example: A = {5, 6, 7, 14}, B = {3, 4, 8, 17}, C = {1, 2, 3, 4, 6, 7, 9}
• Find permutations of A and B such that the subintervals formed by overlaying their cut sites correspond one-to-one with the fragments in C.
Not too bad, right?

Restriction Site Models II
BAD NEWS: The double digest problem is NP-complete. It is a generalization of the set-partition problem, already known to be NP-complete.
To give you an idea… the number of solutions can grow as (k-1)!, where k is the number of restriction sites.
BUT… we will see a heuristic later…

Interval Graph Models
• Model hybridization mapping in terms of interval graphs
  – Interval graph: a graph G derived from a set of intervals. For each interval there is a vertex in G. For each pair of intersecting intervals there is an edge in G.

Interval Graph Models II
[Figure: example intervals a, b, c, d, e and the corresponding interval graph on vertices a, b, c, d, e]

Interval Graph Models III
• To apply this to the hybridization mapping problem…
• We create graphs with vertices representing clones (fragments) and edges representing overlaps between clones.
• Two graphs: one for known overlaps, and one with both known and unknown overlap information (neither is necessarily an interval graph).

Interval Graph Models IV
• Now, given the two graphs, find the true interval graph: a subgraph of the known-and-unknown graph that still contains every known overlap.
[Figure: the known-overlaps graph plus the known/unknown-overlaps graph]
Hmm… Not too easy.
[Figure: the actual interval graph]

Interval Graph Models V
MORE BAD NEWS! This is NP-hard.
Maybe another model?
– There are two other possible models for hybridization mapping (described in the book).
But… those are NP-hard too!

Consecutive Ones Property
We’re sick of NP-hard problems. Give us something a little easier.
The Consecutive Ones Property model (C1P) can be solved in linear time!
Assumptions:
1. The probes are unique.
2. There are no errors. (!!)
3. All of the correspondences of clones and probes have been found. (!!!)

Consecutive Ones Property II
• Build an n x m matrix, where n = # of clones and m = # of probes. Entry (i, j) is 1 if probe j hybridized to clone i, and 0 otherwise.

  1 1 0 1
  0 1 0 1
  1 0 1 1

Above: probe 1 hybridized to clone 1, probe 2 hybridized to clone 1, probe 1 hybridized to clone 3, probe 4 hybridized to clone 3.

Consecutive Ones Property III
• Find a permutation of the columns (probes) such that all the 1s in each row (clone) are consecutive.

  Before:      After:
  1 0 0 1      0 1 1 0
  0 1 0 1      0 0 1 1
  1 0 1 0      1 1 0 0

Consecutive Ones Property IV
This algorithm can be run in linear time!
Unfortunately, the assumption that there are no errors isn’t realistic, because biology isn’t a mathematical model. Probes may fail to bind, and DNA may be replicated incorrectly.
And generalizations that allow for errors make the problem NP-hard again! We need a good heuristic…

Enhanced Double Digest Problem
The Enhanced Double Digest (EDD) problem is NP-hard in the general case, but if the lengths of the fragments in C (the string acted upon by both enzymes) are distinct, it can be solved in linear time!
Why do the fragments have to be distinct? … What if all the fragments are the same length?

Problem Formulation
• We have the multisets A and B: A = {6, 14, 7, 5}, B = {8, 3, 17, 4}
• Take the actual fragments corresponding to each member of either set (since the sets hold only lengths). Apply the other enzyme to each fragment (i.e. apply enzyme β to fragments from A, and vice versa) to create subfragments.
• ABi is the multiset of subfragments created by applying enzyme β to the i-th fragment of A; BAj is the multiset created by applying enzyme α to the j-th fragment of B.

Problem Formulation II
Example:
[Figure: fragment maps. A: 6, 14, 7, 5; B: 8, 3, 17, 4; C: 6, 2, 3, 9, 7, 1, 4. The B fragment 8 splits into {2, 6}; the A fragment 5 splits into {1, 4}]
A = {5, 6, 7, 14}   B = {3, 4, 8, 17}
AB1 = {1, 4}, AB2 = {6}, AB3 = {7}, AB4 = {2, 3, 9}
BA1 = {3}, BA2 = {4}, BA3 = {2, 6}, BA4 = {1, 7, 9}

Algorithm
• Given A, B, ABi and BAj for all i, j, construct an undirected graph that connects each element of A and B to the elements of C in its ABi or BAj. Note that all elements of C will be covered.
[Figure: three rows of nodes. A: 5, 6, 7, 14; C: 1, 2, 3, 4, 6, 7, 9; B: 3, 4, 8, 17]

Algorithm II
• Create a spanning tree:
• Start at a random node and follow all paths from the node, without repeating edges.
[Figure: the graph G. Longest path: 6A - 6C - 8B - 2C - 14A - 9C - 17B - 1C - 5A - 4C - 4B. Side branches: 14A - 3C - 3B and 17B - 7C - 7A]

Properties
• The graph G will always be connected, and every node in A and B will be adjacent only to nodes from C. Each node from C connects to exactly one node from A and one node from B.
• If the problem can be solved: G will be a spanning tree, and any subtree that “hangs” off the longest path will be a path of 2 nodes (a dangler).

Properties II
Danglers:
[Figure: the graph G with the danglers 3C - 3B (hanging off 14A) and 7C - 7A (hanging off 17B) highlighted]

Algorithm III
• If the graph G is not a spanning tree, or if some subtree hanging off the longest path is not a dangler, then there is no valid permutation.
• Otherwise, perform dangler-first search on the graph G…

Dangler-First Search
• Traverse G starting at one end of a path S with the largest number of edges, reading only the nodes from C. Whenever you reach a node with degree greater than 2 (it must have a dangler), read the nodes in C from the hanging danglers first, then continue to traverse S. The resulting sequence is πc.

Dangler-First Search II
πc = 6, 2, 3, 9, 7, 1, 4
[Figure: the graph G traversed from 6A to 4B]

Algorithm IV
• The elements in each ABi form a consecutive subsequence in πc. Likewise, the elements in each BAj also form a consecutive subsequence in πc.
  – This permutation is a valid permutation… meaning: you have the answer!

Solution
A = 6, 14, 7, 5
πc = 6, 2, 3, 9, 7, 1, 4
B = 8, 3, 17, 4
[Figure: the graph G alongside the resulting fragment maps for A, C, and B]

Enhanced Double Digest Problem II
This can be solved in O(n) time!
A generalization (assuming a constant number of duplicates) can also be solved in O(n) time.
The general enhanced double digest problem is still NP-hard. This can be shown by reduction from the Hamiltonian Path problem.
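
A minimal Python sketch of the dangler-first search on the running example may help make the steps concrete. It is an illustration under stated assumptions, not the algorithm as published. The function names (build_graph, farthest_path, dangler_first, order_fragments, overlay) are hypothetical. The longest path is found with two tree traversals, which is one possible way to do it. All lengths in C are assumed distinct, and every C length is assumed to appear in exactly one ABi and one BAj. The long B fragment is taken as 17 = 1 + 7 + 9, so that B sums to 32 like A and C.

```python
from collections import defaultdict
from itertools import accumulate


def build_graph(AB, BA):
    """Link each A/B fragment (by index) to the C lengths it splits into."""
    adj = defaultdict(list)
    for side, groups in (("A", AB), ("B", BA)):
        for idx, subs in enumerate(groups):
            for c in subs:
                adj[(side, idx)].append(("C", c))
                adj[("C", c)].append((side, idx))
    return adj


def farthest_path(adj, start):
    """Return (path from start to a farthest node, number of nodes reached)."""
    parent, depth, stack = {start: None}, {start: 0}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in parent:
                parent[v], depth[v] = u, depth[u] + 1
                stack.append(v)
    node, path = max(depth, key=depth.get), []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1], len(parent)


def dangler_first(AB, BA):
    """Return pi_c as a list of C lengths, or None if no valid permutation."""
    adj = build_graph(AB, BA)
    path, reached = farthest_path(adj, next(iter(adj)))
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    if reached != len(adj) or n_edges != len(adj) - 1:
        return None  # G is not a tree
    path, _ = farthest_path(adj, path[-1])  # a longest path S in the tree
    on_path, pi_c = set(path), []
    for u in path:
        if u[0] == "C":
            pi_c.append(u[1])
            continue
        for c in adj[u]:  # u is an A or B node: read its danglers first
            if c in on_path:
                continue
            tail = [w for w in adj[c] if w != u]
            if len(tail) != 1 or len(adj[tail[0]]) != 1:
                return None  # a hanging subtree that is not a dangler
            pi_c.append(c[1])
    return pi_c


def order_fragments(groups, pi_c):
    """Order fragment lengths by where their subfragments fall in pi_c."""
    pos = {c: k for k, c in enumerate(pi_c)}
    spans = []
    for subs in groups:
        ks = sorted(pos[c] for c in subs)
        if ks[-1] - ks[0] != len(ks) - 1:
            return None  # subfragments are not consecutive in pi_c
        spans.append((ks[0], sum(subs)))
    return [length for _, length in sorted(spans)]


def overlay(a_order, b_order):
    """Double digest check: merge both sets of cut sites, return C in order."""
    cuts = sorted(set(accumulate(a_order)) | set(accumulate(b_order)))
    return [hi - lo for lo, hi in zip([0] + cuts, cuts)]


AB = [[1, 4], [6], [7], [2, 3, 9]]  # AB1..AB4 from the example
BA = [[3], [4], [2, 6], [1, 7, 9]]  # BA1..BA4 from the example
pi_c = dangler_first(AB, BA)
print(pi_c)                       # [6, 2, 3, 9, 7, 1, 4]
print(order_fragments(AB, pi_c))  # [6, 14, 7, 5]
print(order_fragments(BA, pi_c))  # [8, 3, 17, 4]
```

The overlay helper is a plain double digest check: applied to the recovered orderings of A and B, it should reproduce πc.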