Fragment assembly of DNA

A typical approach to sequencing
long DNA molecules is to sample and
then sequence fragments from them.
Fragment assembly of DNA
• Biological background
• Models
• Algorithms
• Heuristics
® Pei-Jie Wu
Biological background
• Problem as puzzle
• We do not know which letter from the set
{A, C, G, T} is written on each card, but we
do know that cards in the same position of
opposite strands form a complementary pair.
• Our goal is to obtain the letters using certain
hints, which are (approximate) substrings of
the rows.
Biological background
• Target: The long sequence to reconstruct.
• Fragment vs. Subsequence
• Shotgun method: Based on fragment
overlap
• Fragment assembly: putting a collection of
fragments together to reconstruct the target
Biological background
--The ideal case
• Case: p.106
• Align the input set, ignoring spaces at the
extremities
• Overlaps: the end part of a fragment is
similar to the beginning of another
• Consensus sequence based on majority vote
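The majority vote can be sketched as follows. This is a minimal illustration, assuming the layout is already given as rows padded with spaces at the extremities; the function name `consensus` and the example rows are hypothetical and not taken from the slides.

```python
from collections import Counter

def consensus(layout):
    """Majority-vote consensus of a layout.

    `layout` is a list of strings, one per fragment, padded with spaces
    at the extremities. Spaces are ignored when voting.
    """
    width = max(len(row) for row in layout)
    result = []
    for col in range(width):
        votes = Counter(
            row[col] for row in layout
            if col < len(row) and row[col] != " "
        )
        # A column covered by no fragment is reported as "-".
        result.append(votes.most_common(1)[0][0] if votes else "-")
    return "".join(result)

# Illustrative error-free layout (an assumption for this sketch).
layout = [
    "  ACCGT  ",
    "    CGTGC",
    "TTAC     ",
    " TACCGT  ",
]
print(consensus(layout))
```

In the ideal case every column is unanimous, so the vote only matters once base call errors appear.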
Biological background
--Complications
• The main factors that add to the complexity
of the problem are:
– Errors
– Unknown orientation
– Repeated regions
– Lack of coverage
Biological background
--Complications
Errors
• Dealing with errors usually means algorithms that require
more time and space.
• The simplest errors are called base call errors and
comprise base substitutions, insertions and deletions in the
fragments.
• Base call errors occur in practice at rates varying from 1
to 5 errors every 100 characters.
• Figures 4.2, 4.3, 4.4
Biological background
--Complications
Errors
• Two other types of errors: chimeras and contamination
• Chimeras arise when two regular fragments from distinct
parts of the target molecule join end-to-end to form a
fragment that is not a contiguous part of the target
– Figure 4.5
– Solution: They must be recognized as such and removed from the
fragment set in a preprocessing stage.
• Contamination comes from host or vector DNA
– Solution: Most vectors are well known, so we can screen the data
before starting assembly.
Biological background
--Complications
Unknown orientation
• We generally do not know to which strand a particular
fragment belongs.
• We therefore treat the input fragments as approximate
substrings of the consensus sought, either as given or in
reverse complement.
• Figure 4.6
• Complexity: 2^n possible orientation combinations for n fragments
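A small sketch of why orientation is costly, assuming fragments are plain strings over {A, C, G, T}. The helper names `reverse_complement` and `orientations` are illustrative only.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]

def orientations(fragments):
    """Yield every way of orienting the fragments: 2**n lists for n fragments."""
    for flips in product((False, True), repeat=len(fragments)):
        yield [
            reverse_complement(f) if flip else f
            for f, flip in zip(fragments, flips)
        ]

print(reverse_complement("ACCGT"))
print(sum(1 for _ in orientations(["ACCGT", "CGTGC", "TTAC"])))
```

Enumerating all combinations is only feasible for tiny inputs, which is why practical assemblers resolve orientation while they detect overlaps.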
Biological background
--Complications
Repeated regions
• Repeats are sequences that appear two or more times in the
target molecule.
– Short repeats
– Longer repeats
• If the level of similarity between two copies of a repeat is
high enough, the differences can be mistaken for base call
errors
• Figure 4.7
Biological background
--Complications
Repeated regions
• Problems:
– If a fragment is totally contained in a repeat, we may have several
places to put it in the final alignment. When the copies are not
exactly equal, we may weaken the consensus by placing a
fragment in the wrong copy.
– Repeats can be positioned in such a way as to render assembly
inherently ambiguous. (Figures 4.8 and 4.9)
• Direct repeats: repeated copies in the same strand
• Inverted repeats: repeated regions in opposite strands
(Figure 4.10)
Biological background
--Complications
Lack of coverage
• Coverage: the coverage at position i of the target is the
number of fragments that cover this position.
• Contigs: the contiguously covered regions
• Figure 4.11
• Solutions:
– Sampling more fragments
– Directed sequencing or walking
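Coverage and contigs can be illustrated with a short sketch. It assumes the position of each fragment on the target is known, which is not the case in a real project; the half-open `(start, end)` interval representation and the function names are assumptions made for the example.

```python
def coverage(target_length, placements):
    """Number of fragments covering each position of the target.

    `placements` holds one half-open (start, end) interval per fragment.
    """
    cov = [0] * target_length
    for start, end in placements:
        for i in range(start, end):
            cov[i] += 1
    return cov

def contigs(cov):
    """Maximal runs of positions with coverage >= 1, as half-open intervals."""
    runs, start = [], None
    for i, c in enumerate(cov):
        if c > 0 and start is None:
            start = i                      # a covered region begins
        elif c == 0 and start is not None:
            runs.append((start, i))        # a gap ends the region
            start = None
    if start is not None:
        runs.append((start, len(cov)))
    return runs

cov = coverage(10, [(0, 4), (2, 6), (8, 10)])
print(cov)
print(contigs(cov))
```

Positions with coverage 0 are the gaps that more sampling or directed sequencing would have to close.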
Biological background
--Alternative methods for DNA sequencing
• Directed sequencing: a method that can be used to
cover small remaining gaps in a shotgun project.
• Problems:
– It is expensive to build special primers
– Sequential rather than parallel
• Sequencing by hybridization (SBH): assembling the
target molecule based on many hybridization
experiments with very short, fixed-length
sequences called probes.
Models
• Shortest common superstring (SCS)
• RECONSTRUCTION
• MULTICONTIG
– All three assume that the fragment collection is free of
contamination and chimeras.
Models
--Shortest common superstring
• Seeking the shortest superstring of a collection of
given strings
• PROBLEM: Shortest common superstring (SCS)
• INPUT: a collection F of strings.
• OUTPUT: a shortest possible string S such that
for every f ∈ F, S is a superstring of f.
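The SCS problem statement can be made concrete with a brute-force sketch. It tries every ordering of the fragments, so it is only usable on a handful of strings; the function names and the example collection are illustrative assumptions, not an algorithm from the slides.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0

def shortest_common_superstring(fragments):
    """Exhaustive search for a shortest common superstring."""
    unique = sorted(set(fragments))
    # A fragment contained in another one never changes the answer.
    frags = [f for f in unique
             if not any(f != g and f in g for g in unique)]
    if not frags:
        return ""
    best = None
    for order in permutations(frags):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            s += nxt[overlap(prev, nxt):]   # glue with maximal overlap
        if best is None or len(s) < len(best):
            best = s
    return best

F = ["ACCGT", "CGTGC", "TTAC", "TACCGT"]
S = shortest_common_superstring(F)
print(S, len(S))
```

The factorial running time of this search is one reason the later slides turn to the greedy heuristic.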
Models
--Shortest common superstring
• Example 4.1
• Example 4.2
– Figure 4.12
– Figure 4.13
• When the target has repeats, a shortest superstring
may contain only one copy of the repeat, which
absorbs all fragments totally contained in any
of the copies
Models
--Reconstruction
• Takes into account both errors and unknown
orientation
• Dynamic programming sequence comparison
algorithm
• Use distance rather than similarity
• Expression: p.116
Models
--Reconstruction
• PROBLEM: RECONSTRUCTION
• INPUT: a collection F of strings and an error
tolerance ε between 0 and 1.
• OUTPUT: (p.117)
• Find a string S as short as possible such that, for
every f ∈ F, either f or its reverse complement is an
approximate substring of S at error level ε
• Does not model repeats, lack of coverage, or the size
of the target
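The membership test behind RECONSTRUCTION can be sketched with the usual dynamic programming recurrence, letting the match start and end anywhere in S at no cost. The sketch assumes unit costs for substitutions, insertions and deletions and reads "approximate substring at error level ε" as a substring edit distance of at most ε·|f|; the exact definition is the one on p.116-117, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    return f.translate(COMPLEMENT)[::-1]

def substring_edit_distance(f, s):
    """Smallest edit distance between f and any substring of s."""
    # Row 0 is all zeros: the match may start anywhere in s for free.
    prev = [0] * (len(s) + 1)
    for i in range(1, len(f) + 1):
        cur = [i] + [0] * len(s)
        for j in range(1, len(s) + 1):
            cur[j] = min(
                prev[j] + 1,                           # f[i-1] unmatched
                cur[j - 1] + 1,                        # s[j-1] unmatched
                prev[j - 1] + (f[i - 1] != s[j - 1]),  # match or substitution
            )
        prev = cur
    # Taking the minimum of the last row lets the match end anywhere in s.
    return min(prev)

def is_approximate_substring(f, s, eps):
    """True if f or its reverse complement fits in s within eps * len(f) errors."""
    d = min(substring_edit_distance(f, s),
            substring_edit_distance(reverse_complement(f), s))
    return d <= eps * len(f)

S = "TTACCGTGC"
print(substring_edit_distance("ACGGT", S))
print(is_approximate_substring("ACGGT", S, 0.0))
```

Here `ACGGT` passes even at ε = 0 because its reverse complement `ACCGT` occurs exactly in S.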
Models
--Multicontig
• Takes into account the internal linkage of the
fragments in the layout
• Nonlink: an overlap that some fragment properly
contains on both sides
• Weakest link: the link of smallest size in the layout
• t-contig: a layout whose weakest link is at least as
large as t
• Example 4.4
• Definition: p.119
Algorithms
• Greedy algorithm
• Acyclic subgraphs
(no errors and known orientation)
Algorithms
--Representing overlaps
• The overlap multigraph OM(F) of a collection F is a
directed, weighted multigraph
• The set V of nodes of this structure is just F itself.
• A directed edge from a to a different fragment b
with weight t ≥ 0 exists if the suffix of a with t
characters is a prefix of b
• There may be many edges from a to b
• No self-loops
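A direct construction of the overlap multigraph, written as a sketch under the definition above. Edges are stored as `(a, b, t)` triples; that representation, the function name, and the four example fragments are assumptions for illustration.

```python
def overlap_multigraph(fragments):
    """Return all edges (a, b, t) of the overlap multigraph.

    An edge exists when the suffix of a with t characters equals the
    prefix of b with t characters, for any t >= 0, with a != b.
    """
    edges = []
    for a in fragments:
        for b in fragments:
            if a == b:
                continue                      # no self-loops
            for t in range(min(len(a), len(b)) + 1):
                if a[len(a) - t:] == b[:t]:   # t = 0 always matches
                    edges.append((a, b, t))
    return edges

F = ["TACGA", "ACCC", "CTAAAG", "GACA"]
edges = overlap_multigraph(F)
for a, b, t in edges:
    if t > 0:
        print(f"{a} -> {b} (weight {t})")
```

Because weight 0 is allowed, every ordered pair of distinct fragments is joined by at least one edge, so a Hamiltonian path always exists.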
Algorithms
--Paths originating superstrings
• Edge e = (f, g) in the path has a certain weight t,
which means that the last t bases of the tail f of e
coincide with the first t bases of its head g
• Figure 4.15
– Example in p.121
• Equation 4.3
• Hamiltonian path: a path that goes through every
vertex
• Equation 4.4
– Minimizing |S(P)| ⇔ maximizing w(P)
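How a path gives rise to a superstring can be sketched as below: each fragment after the first contributes only the characters that follow its overlap with the previous one. The sketch assumes the path is given as a list of fragments plus the list of edge weights between consecutive fragments; names and example data are illustrative.

```python
def superstring_of_path(fragments, weights):
    """Build S(P) for a path P visiting `fragments` in order.

    `weights[k]` is the weight of the edge from fragments[k] to
    fragments[k + 1].
    """
    if len(weights) != len(fragments) - 1:
        raise ValueError("need exactly one weight per edge")
    s = fragments[0]
    for prev, nxt, t in zip(fragments, fragments[1:], weights):
        if prev[len(prev) - t:] != nxt[:t]:
            raise ValueError(f"{prev} and {nxt} do not overlap by {t}")
        s += nxt[t:]            # append only the non-overlapping part
    return s

path = ["TACGA", "GACA", "ACCC", "CTAAAG"]
weights = [2, 1, 1]
S = superstring_of_path(path, weights)
total_length = sum(len(f) for f in path)
print(S, len(S), total_length - sum(weights))
```

For a Hamiltonian path the length of S(P) is the total fragment length minus w(P), and the total length is fixed, which is why minimizing |S(P)| and maximizing w(P) are the same task.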
Algorithms
--Shortest superstrings as paths
• A collection F is said to be substring-free if there
are no two distinct strings a and b in F such that a is
a substring of b.
• THEOREM 4.1
• COROLLARY 4.1
• LEMMA 4.1
• THEOREM 4.2
Algorithms
--The greedy algorithm
• Looking for shortest common superstrings is the
same as looking for Hamiltonian paths of
maximum weight in a directed multigraph.
• OM(F) → OG(F): keep only the heaviest edge between
each ordered pair of fragments
• A “greedy” attempt at computing the heaviest path:
the basic idea is to repeatedly add the heaviest
available edge
Algorithms
--The greedy algorithm
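• Three conditions we have to test before accepting
an edge in our Hamiltonian path:
– No accepted edge already leaves the tail of the edge
– No accepted edge already enters the head of the edge
– The edge does not close a cycle
• Edges are processed in nonincreasing order by weight
• The procedure ends when we have exactly n-1 edges or,
equivalently, when the accepted edges induce a
connected subgraph.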
• Figure 4.16
• Example 4.5
– Figure 4.17
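The greedy procedure can be sketched as follows, assuming an error-free, substring-free collection with known orientation. A union-find structure is used here for the cycle test; that is an implementation choice for this sketch rather than something stated on the slides, and all names are illustrative.

```python
def overlap(a, b):
    """Weight of the heaviest edge from a to b in the overlap graph."""
    for t in range(min(len(a), len(b)), 0, -1):
        if a[-t:] == b[:t]:
            return t
    return 0

def greedy_superstring(fragments):
    n = len(fragments)
    if n == 0:
        return ""
    # All edges of the overlap graph, heaviest first.
    edges = sorted(
        ((overlap(a, b), i, j)
         for i, a in enumerate(fragments)
         for j, b in enumerate(fragments) if i != j),
        key=lambda e: -e[0],
    )
    succ, has_pred = {}, set()
    parent = list(range(n))          # union-find over path components

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = 0
    for t, i, j in edges:
        if accepted == n - 1:
            break
        # Reject: tail already used, head already used, or cycle.
        if i in succ or j in has_pred or find(i) == find(j):
            continue
        succ[i] = (j, t)
        has_pred.add(j)
        parent[find(i)] = find(j)
        accepted += 1

    # Walk the path from the only node without a predecessor.
    cur = next(i for i in range(n) if i not in has_pred)
    s = fragments[cur]
    while cur in succ:
        cur, t = succ[cur]
        s += fragments[cur][t:]
    return s

F = ["TACGA", "ACCC", "CTAAAG", "GACA"]
print(greedy_superstring(F))
```

Greedy choices are not guaranteed to give the heaviest Hamiltonian path, so the result is a common superstring but not necessarily a shortest one.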
Algorithms
--Acyclic subgraphs
• Assembling error-free fragments of known
orientation, assuming that the fragments have been
obtained from a “good sampling” of the target
DNA.
• “Good sampling”: the fragments cover the entire target
molecule, and the collection as a whole exhibits
enough linkage to guarantee a safe assembly.
• Figure 4.18
Algorithms
--Acyclic subgraphs
• The presence of repeated regions, or repeated elements, in
the target string S is related to the existence of cycles in the
overlap graph.
• Cycles in an overlap graph are necessarily due to repeats in
S. The converse is not necessarily true; that is, we may
have repeats but still an acyclic overlap graph.
• THEOREM 4.5
• Algorithm: Topological sorting
• Example 4.6
– Figures 4.19, 4.20 and 4.21
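Topological sorting of an acyclic overlap graph can be sketched with Kahn's algorithm. The choice of Kahn's algorithm, the edge-list representation, and the example edges are assumptions for illustration; the slides only name topological sorting.

```python
from collections import deque

def topological_order(nodes, edges):
    """Order nodes so that every edge (a, b) goes from earlier to later.

    Raises ValueError if the graph has a cycle, which in an overlap
    graph points to a repeat in the target.
    """
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for a, b in edges:
        successors[a].append(b)
        indegree[b] += 1

    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in successors[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)

    if len(order) != len(nodes):
        raise ValueError("overlap graph has a cycle")
    return order

nodes = ["CGTGC", "TTAC", "TACCGT"]
edges = [("TTAC", "TACCGT"), ("TACCGT", "CGTGC"), ("TTAC", "CGTGC")]
print(topological_order(nodes, edges))
```

With a good sampling the order is a left-to-right placement of the fragments along the target.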
Heuristics
• None of the formalisms proposed for fragment
assembly is entirely adequate
• Fragment assembly can be viewed as a multiple
alignment problem with some additional features:
– Each fragment can participate with either the direct or
the reverse-complemented sequence.
– The sequences themselves are usually much shorter
than the alignment itself.
Heuristics
• Three criteria that follow from the second feature:
– Scoring
▪ Entropy is a quantity that is defined on a group of relative
frequencies; it is low when one of these frequencies stands
out from the others, and high when they are all more or less
equal
▪ The lower the entropy, the better
– Coverage
▪ A fragment covers a column i if it participates in this column
either with a character or with an internal space.
– Linkage
▪ The way individual fragments are linked in the layout is another
determinant of layout quality.
▪ Figure 4.22
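The entropy criterion can be illustrated on a single layout column. The sketch uses the standard Shannon entropy in bits over the relative frequencies of the characters in the column; whether the source uses this exact base and normalization is an assumption.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the character frequencies in a column."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

print(column_entropy("AAAA"))   # one frequency stands out
print(column_entropy("AAAC"))   # a single disagreement
print(column_entropy("ACGT"))   # all frequencies equal
```

A layout score could then add the entropies of all columns, with lower totals indicating better agreement among the fragments.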
Heuristics
--Assembly in practice
• Practical implementations often divide the whole
problem into three phases:
– Finding overlaps
– Building a layout
– Computing the consensus
Heuristics
--Assembly in practice
Finding overlaps
• The first step in any assembly problem is fragment
overlap detection.
• Determine reverse complements
• Consider fragments entirely contained in other
fragments
• Recall Section 3.2.3
– Figure 4.23
Heuristics
--Assembly in practice
Ordering fragments
• Finding a good ordering of fragments in a contig
• There is no algorithm that is both simple and general enough
• There are four issues to keep in mind when
building paths:
– Every path has a corresponding complement path
– It is not necessary to include contained fragments
– Cycles usually indicate the presence of repeats
– Unbalanced coverage may be related to repeats as well
(see Figure 4.13)
Heuristics
--Assembly in practice
Alignment and consensus
• Building a layout from a path in an overlap graph
• Two techniques related to alignment construction:
– The first one helps in building a good layout from a
path in the presence of errors.
▪ Example 4.7
▪ Implementation: Figure 4.24
– The second one focuses on locally improving an
already constructed layout
▪ Example 4.8 in Figure 4.25
▪ Implementation: sum-of-pairs scoring scheme
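A sum-of-pairs score for a layout can be sketched as below. The score values (`match`, `mismatch`, `gap`) and the conventions that a blank marks a column outside a fragment while `-` marks an internal space are assumptions chosen for the example, not values from the source.

```python
from itertools import combinations

def sum_of_pairs(layout, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of a layout given as equal-length rows.

    " " means the fragment does not cover the column (ignored);
    "-" is an internal space inside a fragment.
    """
    def pair_score(x, y):
        if x == "-" and y == "-":
            return 0
        if x == "-" or y == "-":
            return gap
        return match if x == y else mismatch

    total = 0
    for column in zip(*layout):
        covering = [c for c in column if c != " "]
        total += sum(pair_score(x, y)
                     for x, y in combinations(covering, 2))
    return total

layout = [
    "AC-T",
    "ACGT",
    " CGT",
]
print(sum_of_pairs(layout))
```

A local improvement step would try small changes to the layout, such as shifting an internal space, and keep those that raise this score.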