Physical Mapping of DNA

Shanna Terry
March 2, 2004
Overview
• Background
• Types of Mapping
• Mathematical Models
• Enhanced Double Digest Problem
Background
Given a sequence of DNA, how do we figure
out where on some larger chromosome the
sequence lies?
Background II
Look for markers that match in both the
chromosome and the shorter sequence.
– Markers: Usually short, precisely defined sequences
Background III
How do we create the original map?
• Generate fingerprints (markers) with:
– Restriction site mapping
– Hybridization
Background IV
But why? Can’t we just expand the sequence
assembly techniques we’ve already learned?
NO! (with one exception)
Why not?
A chromosome isn’t just 150k bps long.
… Human chromosomes range in length
from 51 million to 245 million base pairs.
Restriction Site Mapping
Here the fingerprint is the set of fragment lengths
between the restriction sites of given enzymes
(recall from previous lectures).
• Make three copies of the target DNA: strings A, B, and C.
• Apply one enzyme (α) to string A, another (β) to string B, and both
(α and β) to string C.
• Line up the fragments in A and B so they match C: this is the
double digest problem.
[Figure: double digest example. A fragments: 6, 14, 7, 5; B fragments: 8, 3, 15, 4; C fragments: 6, 2, 3, 9, 7, 1, 4.]
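The three-digest procedure can be sketched in a few lines of Python. This is only an illustration: the function names `cut_sites` and `double_digest` are made up here, and the fragment orders are the ones drawn on the slide. One caveat: the slides print the third B fragment as 15, which makes B total 30 while A totals 32. The sketch assumes 17, the value that lets both digests span the same 32 units and reproduces the C fragments shown.

```python
from itertools import accumulate

def cut_sites(fragments):
    """Cut positions (including the right end) for fragments listed left to right."""
    return list(accumulate(fragments))

def double_digest(a_order, b_order):
    """Fragment lengths obtained when both enzymes cut the same molecule."""
    if sum(a_order) != sum(b_order):
        raise ValueError("both digests must cover the same total length")
    sites = sorted(set(cut_sites(a_order)) | set(cut_sites(b_order)))
    return [right - left for left, right in zip([0] + sites, sites)]

a_order = [6, 14, 7, 5]   # enzyme α alone
b_order = [8, 3, 17, 4]   # enzyme β alone; ASSUMPTION: 17 where the slide prints 15
c_order = double_digest(a_order, b_order)
print(c_order)  # [6, 2, 3, 9, 7, 1, 4]
```

In the lab only the unordered lengths are observed; the orders above are exactly what the double digest problem asks us to recover.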
Restriction Site Mapping II
A variant is the partial digest approach:
• Use only one enzyme, but let it act for
different lengths of time, so that a different
subset of the restriction sites is cut each time.
[Figure: a map with consecutive fragments of length 6, 14, 7, 5.]
Fragment lengths: 6, 20, 27, 32; 14, 21, 26; 7, 12; and 5
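The list above is every distance between two cut sites (counting both ends of the molecule). A minimal sketch, assuming the site order 6, 14, 7, 5 from the figure; `partial_digest_lengths` is a name invented for this example:

```python
from itertools import accumulate, combinations

def partial_digest_lengths(fragments):
    """All pairwise distances between sites, for fragments listed left to right."""
    sites = [0] + list(accumulate(fragments))   # 0, 6, 20, 27, 32
    return sorted(right - left for left, right in combinations(sites, 2))

lengths = partial_digest_lengths([6, 14, 7, 5])
print(lengths)  # [5, 6, 7, 12, 14, 20, 21, 26, 27, 32]
```

The mapping problem runs the other way: given only this multiset of lengths, find the site positions.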
Restriction Site Mapping III
[Figure: the partial-digest fragments drawn one at a time against the map.]
Fragment lengths: 6, 20, 27, 32; 14, 21, 26; 7, 12; and 5
Hybridization Mapping
• Check whether specific small sequences
(called probes) bind (hybridize) to
fragments (clones)
• The fingerprint is the subset of probes that
successfully hybridize to the clone.
• If some portion of one clone’s fingerprint
matches another, they are likely to be from
overlapping regions of the target.
Hybridization Mapping II
Probes x, y, z, bound to clone A; x, w and z
bound to clone B… overlap in x and z.
[Figure: probes y, x, z, and w marked on clones A and B.]
Except… we don’t know that much. We only know which
probes bind to which clones. Not ordering or even relative
lengths!
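Since a fingerprint is just a set of probes, candidate overlaps can be found by intersecting fingerprints. A small sketch using the two clones from this slide; the dictionary layout and the name `shared_probes` are assumptions made for illustration:

```python
from itertools import combinations

# Fingerprints: which probes hybridized to which clone (from the slide).
fingerprints = {
    "A": {"x", "y", "z"},
    "B": {"x", "w", "z"},
}

def shared_probes(fingerprints):
    """Map each pair of clones to the probes they share, skipping empty overlaps."""
    shared = {}
    for c1, c2 in combinations(sorted(fingerprints), 2):
        common = fingerprints[c1] & fingerprints[c2]
        if common:
            shared[(c1, c2)] = common
    return shared

print(shared_probes(fingerprints))
```

Note that the result says nothing about the order of the probes or the lengths of the clones, which is exactly the limitation pointed out above.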
Restriction Site Models
• Back to the double digest problem: we’ve split the
strings A, B, C into fragments with two enzymes.
• We have the multisets made up of the fragment
lengths:
– From previous example:
A = {5, 6, 7, 14}
B = {3, 4, 8, 15}
C = {1, 2, 3, 4, 6, 7, 9}
• Find orderings of A and B such that the subintervals
obtained by overlaying the two sets of cut sites
correspond one-to-one with C. Not too bad, right?
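To make the search concrete, here is a brute-force sketch that tries every ordering of A and B and keeps those whose combined cut sites give exactly C. It is exponential and only meant for toy inputs; the function names are invented here. As an assumption, B uses 17 instead of the printed 15, because the printed lengths of B sum to 30 while A and C sum to 32.

```python
from itertools import accumulate, permutations

def double_digest(a_order, b_order):
    """Fragment lengths when the cut sites of both orderings are overlaid."""
    sites = sorted(set(accumulate(a_order)) | set(accumulate(b_order)))
    return [right - left for left, right in zip([0] + sites, sites)]

def solve_double_digest(a, b, c):
    """Return every (ordering of A, ordering of B) consistent with C."""
    target = sorted(c)
    solutions = []
    for pa in set(permutations(a)):
        for pb in set(permutations(b)):
            if sorted(double_digest(pa, pb)) == target:
                solutions.append((pa, pb))
    return solutions

A = [5, 6, 7, 14]
B = [3, 4, 8, 17]          # ASSUMPTION: 17 where the slide prints 15
C = [1, 2, 3, 4, 6, 7, 9]
solutions = solve_double_digest(A, B, C)
print(((6, 14, 7, 5), (8, 3, 17, 4)) in solutions)  # True
```

Every solution also appears mirrored (both orderings reversed), so the answer is never unique.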
Restriction Site Models II
BAD NEWS:
The double digest problem is NP-complete. It
is a generalization of the set-partition
problem, already known to be NP-complete.
To give you an idea…the number of solutions
is (k-1)! for k = number of restriction sites.
BUT… we will see a heuristic later…
Interval Graph Models
• Model hybridization mapping in terms of
interval graphs
– Interval graph: A graph G which is mapped
from a series of intervals. For each interval
there is a vertex in G. For each intersection of
intervals there is an edge in G.
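The definition translates directly into code. In this sketch the interval endpoints are hypothetical (they are not taken from the slides), intervals are treated as closed, and `interval_graph` is a name chosen for the example:

```python
from itertools import combinations

def interval_graph(intervals):
    """Edges of the interval graph: one edge per pair of intersecting intervals."""
    edges = set()
    for (u, (lo_u, hi_u)), (v, (lo_v, hi_v)) in combinations(sorted(intervals.items()), 2):
        if lo_u <= hi_v and lo_v <= hi_u:   # closed intervals intersect
            edges.add((u, v))
    return edges

# Hypothetical intervals, one per vertex.
intervals = {"a": (0, 4), "b": (3, 7), "c": (6, 9), "d": (8, 12), "e": (2, 10)}
edges = interval_graph(intervals)
print(sorted(edges))
```

Going from intervals to the graph is easy; the mapping problem is the reverse direction, where only partial overlap information is known.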
Interval Graph Models II
Ex:
[Figure: five intervals a, b, c, d, e and the corresponding interval graph on vertices a, b, c, d, e.]
Interval Graph Models III
• To apply this to the hybridization mapping
problem…
• We create graphs with vertices representing clones
(fragments), and edges representing overlaps
between clones.
• Two graphs: one for known overlaps and one with
known and unknown overlap information (neither
is necessarily an interval graph).
Interval Graph Models IV
• Now find the true interval graph (a subgraph of
the known and unknown graph) given the two
graphs.
[Figure: known overlaps + known/unknown overlaps → actual interval graph.]
Hmm… Not too easy.
Interval Graph Models V
MORE BAD NEWS!
This is NP-hard.
Maybe another model?
– There are two other possible models for
hybridization mapping (described in the book).
But….
Those are NP-hard too!
Consecutive Ones Property
We’re sick of NP hard problems. Give us something
a little easier.
The Consecutive Ones Property Model (C1P) can be
solved in linear time!
Assumptions:
1. The probes are unique.
2. There are no errors. (!!)
3. All of the correspondences of clones and probes have
been found. (!!!)
Consecutive Ones Property II
• Build a matrix (n x m), n = # of clones, m = # of
probes. Entry (i, j) is 1 if probe j hybridized to
clone i, and 0 otherwise.

1 1 0 1
0 1 0 1
1 0 1 1

Above: probe 1 hybridized to clone 1, probe 2 hybridized to clone 1,
probe 1 hybridized to clone 3, probe 4 hybridized to clone 3.
Consecutive Ones Property III
• Find a permutation of the columns (probes)
such that all the 1s in each row (clone) are
consecutive.
[Figure: example clone-by-probe binary matrices illustrating a column permutation.]
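For a tiny matrix the property can be checked by trying every column order. This is an exhaustive sketch, not the linear-time algorithm the slides refer to. The function names are made up, and the 3 x 4 matrix is an assumed reading of the matrix on the previous slide (rows are clones, columns are probes).

```python
from itertools import permutations

def row_is_consecutive(row):
    """True if all the 1s in the row sit next to each other."""
    ones = [j for j, bit in enumerate(row) if bit]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)

def find_c1p_order(matrix):
    """Return a column order giving consecutive 1s in every row, or None."""
    for order in permutations(range(len(matrix[0]))):
        if all(row_is_consecutive([row[j] for j in order]) for row in matrix):
            return order
    return None

matrix = [
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
]
order = find_c1p_order(matrix)
print(order)                                              # a valid probe order (0-based columns)
print(find_c1p_order([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))  # None
```

The second matrix has no valid order: each pair of its three columns would have to be adjacent, which is impossible on a line.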
Consecutive Ones Property IV
The C1P test can be run in linear time!
Unfortunately, the assumption that there are no
errors is unrealistic, because real experiments are
noisy: probes may fail to bind, and DNA
may be replicated incorrectly.
And generalizations make the problem NP-hard
again!
We need a good heuristic…
Enhanced Double Digest Problem
The Enhanced Double Digest (EDD)
problem is NP-hard in the general case, but
if the lengths of fragments in C (the string
acted upon by both types of enzymes) are
distinct, it can be solved in linear time!
Why do the fragments have to be distinct?
… What if all the fragments are the same length?
Problem Formulation
• We have the multisets A and B.
A = {6, 14, 7, 5}
B = {8, 3, 15, 4}
• Take the actual fragment corresponding to each
member of either set (the sets hold only
lengths). Apply the other enzyme to the fragment
(i.e. apply enzyme β to fragments from A, and
vice versa) to create subfragments.
• ABi is the multiset of subfragment lengths created by
applying enzyme β to the i-th fragment of A; BAj is the
multiset created by applying enzyme α to the j-th fragment of B.
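As an illustration, the subfragment sets can be simulated from a known map. The sketch below assumes the fragment orders from the earlier figure and, as an assumption, uses 17 for the B fragment the slides print as 15 (its subfragments 1, 7, 9 sum to 17). `subfragments` is a name invented here.

```python
from itertools import accumulate

def subfragments(own_order, other_order):
    """For each fragment of own_order, the pieces left after the other enzyme cuts it too."""
    if sum(own_order) != sum(other_order):
        raise ValueError("both digests must cover the same total length")
    own_sites = set(accumulate(own_order))
    all_sites = sorted(own_sites | set(accumulate(other_order)))
    result, pieces, left = [], [], 0
    for site in all_sites:
        pieces.append(site - left)
        left = site
        if site in own_sites:        # own enzyme cuts here, so the fragment ends
            result.append(pieces)
            pieces = []
    return result

a_order = [6, 14, 7, 5]
b_order = [8, 3, 17, 4]              # ASSUMPTION: 17 where the slide prints 15
print(subfragments(a_order, b_order))  # [[6], [2, 3, 9], [7], [1, 4]]
print(subfragments(b_order, a_order))  # [[6, 2], [3], [9, 7, 1], [4]]
```

In the real problem these sets come from the experiment as unordered multisets; the order inside each list here is a by-product of simulating from a known map.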
Problem Formulation II
Example:
A = {5, 6, 7, 14}
B = {3, 4, 8, 15}
AB1 = {1, 4}, AB2 = {6}, AB3 = {7}, AB4 = {2, 3, 9}
BA1 = {3}, BA2 = {4}, BA3 = {2, 6}, BA4 = {1, 7, 9}
[Figure: the double digest map from before, with the subfragment sets {2, 6} and {1, 4} called out.]
Algorithm
• Given A, B, ABi and BAj for all i, j, construct an
undirected graph that connects each element of A
and B to the elements of C in its corresponding
ABi or BAj. Note that all elements in C will be covered.
[Figure: graph joining the A nodes (5, 6, 7, 14) and the B nodes (3, 4, 8, 15) to the C nodes (1, 2, 3, 4, 6, 7, 9).]
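The construction can be sketched with a dictionary of adjacency sets. Nodes are labelled `("A", length)`, `("B", length)` and `("C", length)`, which assumes the lengths within each set are distinct, as they are in this example. The name `build_graph` and the node encoding are choices made for this sketch.

```python
def build_graph(a, ab, b, ba):
    """Adjacency sets for G: each A or B fragment is joined to its C subfragments."""
    graph = {}
    for side, lengths, groups in (("A", a, ab), ("B", b, ba)):
        for length, group in zip(lengths, groups):
            for piece in group:
                graph.setdefault((side, length), set()).add(("C", piece))
                graph.setdefault(("C", piece), set()).add((side, length))
    return graph

# Sets as printed on the slides; lengths serve only as node labels here.
a, ab = [5, 6, 7, 14], [[1, 4], [6], [7], [2, 3, 9]]
b, ba = [3, 4, 8, 15], [[3], [4], [2, 6], [1, 7, 9]]
graph = build_graph(a, ab, b, ba)
n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), n_edges)  # 15 14
```

With 15 nodes and 14 edges, the graph has the right edge count for a tree, provided it is connected.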
Algorithm II
• Create a spanning tree:
• Start at a random node and follow all paths from
it without repeating edges.
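[Figure: the resulting tree. Main path: 6A - 6C - 8B - 2C - 14A - 9C - 15B - 1C - 5A - 4C - 4B; branch 3C - 3B hangs off 14A and branch 7C - 7A hangs off 15B.]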
Properties
• The graph (G) will always be connected, and
every node in A and B will only be adjacent to
nodes from C. Each node from C connects to exactly
one node each from A and B.
• If the problem can be solved: G will be a
tree, and any subtree that “hangs” off the
longest path will be a path with two nodes
(a dangler).
Properties II
Danglers:
[Figure: the tree with its two danglers marked: 3C - 3B hanging off 14A, and 7C - 7A hanging off 15B.]
Algorithm III
• If the graph G is not a tree, or if some
subtree hanging off the longest path is not a
dangler, then there is no valid
permutation.
• Otherwise, perform a dangler-first search on the
graph G…
Dangler-First Search
• Traverse G starting at one end of a path S
with the largest number of edges, reading
only the nodes from C. On reaching
a node with degree greater than 2 (which must
have a dangler attached), read the C nodes of
its danglers first, then continue to
traverse S. The resulting sequence is πc.
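The traversal can be sketched as follows. The helper names, the tuple node labels, and the use of two breadth-first searches to find a longest path in the tree are choices made for this example, not details given in the slides. The edge list restates the ABi and BAj sets from the example, with node labels as printed there.

```python
from collections import deque

def bfs(graph, start):
    """Return the last node reached (a farthest one) and the BFS parent map."""
    parent, queue, last = {start: None}, deque([start]), start
    while queue:
        last = queue.popleft()
        for nxt in sorted(graph[last]):
            if nxt not in parent:
                parent[nxt] = last
                queue.append(nxt)
    return last, parent

def dangler_first(graph):
    """Read the C nodes along a longest path, taking danglers first."""
    end1, seen = bfs(graph, next(iter(graph)))
    n_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
    if len(seen) != len(graph) or n_edges != len(graph) - 1:
        raise ValueError("G is not a tree: no valid permutation")
    end2, parent = bfs(graph, end1)          # end1..end2 is a longest path
    path = [end2]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    on_path, pi_c = set(path), []
    for node in path:
        if node[0] == "C":
            pi_c.append(node[1])
            continue
        for c_node in sorted(graph[node] - on_path):   # branches hanging here
            tail = graph[c_node] - {node}
            if len(tail) != 1 or len(graph[next(iter(tail))]) != 1:
                raise ValueError("branch is not a dangler: no valid permutation")
            pi_c.append(c_node[1])
    return pi_c

EDGES = [(("A", 6), 6), (("A", 14), 2), (("A", 14), 3), (("A", 14), 9),
         (("A", 7), 7), (("A", 5), 1), (("A", 5), 4), (("B", 8), 6),
         (("B", 8), 2), (("B", 3), 3), (("B", 15), 9), (("B", 15), 7),
         (("B", 15), 1), (("B", 4), 4)]
graph = {}
for frag, piece in EDGES:
    graph.setdefault(frag, set()).add(("C", piece))
    graph.setdefault(("C", piece), set()).add(frag)

pi_c = dangler_first(graph)
print(pi_c)  # 6, 2, 3, 9, 7, 1, 4 or its mirror image, depending on the starting end
```

Starting from the other end of S yields the reversed sequence, which describes the same map read right to left.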
Dangler-First Search II
πc= 6, 2, 3, 9, 7, 1, 4.
[Figure: dangler-first traversal of the tree.]
Algorithm IV
• The elements in each ABi form a
consecutive subsequence in πc. Likewise,
the elements in each BAj also form a
consecutive subsequence in πc.
– This permutation is a valid permutation…
meaning: you have the answer!
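The consecutiveness claim is easy to check mechanically. A minimal sketch, assuming the elements of C are distinct (as this version of the problem requires); the function names are invented for the example:

```python
def is_consecutive_block(group, pi_c):
    """True if the members of group occupy adjacent positions in pi_c."""
    positions = sorted(pi_c.index(x) for x in group)
    return positions[-1] - positions[0] + 1 == len(positions)

def is_valid_permutation(pi_c, ab, ba):
    """True if every ABi and every BAj is a consecutive run in pi_c."""
    return all(is_consecutive_block(group, pi_c) for group in ab + ba)

ab = [[1, 4], [6], [7], [2, 3, 9]]
ba = [[3], [4], [2, 6], [1, 7, 9]]
print(is_valid_permutation([6, 2, 3, 9, 7, 1, 4], ab, ba))  # True
print(is_valid_permutation([6, 3, 2, 9, 7, 1, 4], ab, ba))  # False: 2 and 6 are split
```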
Solution
A= 6, 14, 7, 5
πc= 6, 2, 3, 9, 7, 1, 4
B= 8, 3, 15, 4
[Figure: the tree shown next to the resulting map, with the A, C, and B fragments aligned.]
Enhanced Double Digest Problem II
This can be solved in O(n) time!
A generalization (assuming a constant number of
duplicates) can also be solved in O(n) time.
The general enhanced double digest problem is still
NP-hard. This can be shown by reduction from
the Hamiltonian Path problem.