Fragment assembly of DNA

A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.

• Biological background
• Models
• Algorithms
• Heuristics

® Pei-Jie Wu

Biological background
• The problem as a puzzle
• We do not know which letter from the set {A, C, G, T} is written on each card, but we do know that cards in the same position on opposite strands form a complementary pair.
• Our goal is to obtain the letters using certain hints, which are (approximate) substrings of the rows.

Biological background
• Target: the long sequence to reconstruct.
• Fragment vs. subsequence
• Shotgun method: based on fragment overlap
• Fragment assembly: putting a collection of fragments together

Biological background --The ideal case
• Case: p. 106
• Align the input set, ignoring spaces at the extremities
• Overlaps: the end part of one fragment is similar to the beginning of another
• Consensus sequence based on majority vote

Biological background --Complications
• The main factors that add to the complexity of the problem are:
  – Errors
  – Unknown orientation
  – Repeated regions
  – Lack of coverage

Biological background --Complications: Errors
• Handling errors usually means algorithms that require more time and space.
• The simplest errors are called base call errors and comprise base substitutions, insertions, and deletions in the fragments.
• Base call errors occur in practice at rates varying from 1 to 5 errors every 100 characters.
• Figures 4.2, 4.3, 4.4

Biological background --Complications: Errors
• Two other types of errors: chimeras and contamination
• Chimeras arise when two regular fragments from distinct parts of the target molecule join end-to-end to form a fragment that is not a contiguous part of the target
  – Figure 4.5
  – Solution: chimeras must be recognized as such and removed from the fragment set in a preprocessing stage.
• Contamination comes from host or vector DNA
  – Solution: most vectors are well known, so we can screen the data before starting assembly.

Biological background --Complications: Unknown orientation
• We generally do not know which strand a particular fragment belongs to.
• The input fragments are treated as approximate substrings of the sought consensus, either as given or in reverse complement.
• Figure 4.6
• Complexity: 2^n orientation combinations for n fragments

Biological background --Complications: Repeated regions
• Repeats are sequences that appear two or more times in the target molecule.
  – Short repeats
  – Longer repeats
• If the level of similarity between two copies of a repeat is high enough, the differences can be mistaken for base call errors
• Figure 4.7

Biological background --Complications: Repeated regions
• Problems:
  – If a fragment is totally contained in a repeat, we may have several places to put it in the final alignment. When the copies are not exactly equal, we may weaken the consensus by placing a fragment in the wrong copy.
  – Repeats can be positioned in such a way as to render assembly inherently ambiguous (Figures 4.8 and 4.9).
• Direct repeats: repeated copies in the same strand.
• Inverted repeats: repeated regions in opposite strands (Figure 4.10)

Biological background --Complications: Lack of coverage
• Coverage: the coverage at position i of the target is the number of fragments that cover this position.
• Contigs: the contiguously covered regions
• Figure 4.11
• Solutions:
  – Sampling more fragments
  – Directed sequencing, or walking

Biological background --Alternative methods for DNA sequencing
• Directed sequencing: a method that can be used to cover small remaining gaps in a shotgun project.
• Problems:
  – It is expensive to build special primers
  – It is sequential rather than parallel
• Sequencing by hybridization (SBH): assembling the target molecule based on many hybridization experiments with very short, fixed-length sequences called probes.

Models
• Shortest common superstring (SCS)
• RECONSTRUCTION
• MULTICONTIG
  – All three assume that the fragment collection is free of contamination and chimeras.

Models --Shortest common superstring
• Seeks the shortest superstring of a collection of given strings
• PROBLEM: Shortest common superstring (SCS)
• INPUT: a collection F of strings.
• OUTPUT: a shortest possible string S such that for every f ∈ F, S is a superstring of f.

Models --Shortest common superstring
• Example 4.1
• Example 4.2
  – Figure 4.12
  – Figure 4.13
• A shortest superstring may contain only one copy of a repeat, which will absorb all fragments totally contained in any of the copies

Models --Reconstruction
• Takes into account both errors and unknown orientation
• Dynamic programming sequence comparison algorithm
• Uses distance rather than similarity
• Expression: p. 116

Models --Reconstruction
• PROBLEM: RECONSTRUCTION
• INPUT: a collection F of strings and an error tolerance ε between 0 and 1.
• OUTPUT: (p. 117)
• Find a string S as short as possible such that, for every f in F, either f or its reverse complement is an approximate substring of S at error level ε (the given error tolerance)
• Does not model repeats, lack of coverage, or the size of the target

Models --Multicontig
• Involves the internal linkage of the fragments in the layout
• Nonlink: an overlap such that some fragment properly contains it on both sides
• Weakest link: the smallest link in the layout
• t-contig: a layout whose weakest link is at least as large as t
• Example 4.4
• Definition: p. 119

Algorithms
• Greedy algorithm
• Acyclic subgraphs (no errors and known orientation)

Algorithms --Representing overlaps
• The overlap multigraph OM(F) of a collection F is a directed, weighted multigraph
• The set V of nodes of this structure is just F itself.
• A directed edge from a to a different fragment b with weight t ≥ 0 exists if the suffix of a with t characters is a prefix of b
• There may be many edges from a to b
• No self-loops

Algorithms --Paths originating superstrings
• An edge e = (f, g) in the path has a certain weight t, which means that the last t bases of the tail f of e are the same as the first t bases of its head g
• Figure 4.15
  – Example on p. 121
• Equation 4.3
• Hamiltonian path: a path that goes through every vertex exactly once
• Equation 4.4
  – Minimizing |S(P)| is equivalent to maximizing w(P)

Algorithms --Shortest superstrings as paths
• A collection F is said to be substring-free if there are no two distinct strings a and b in F such that a is a substring of b.
• THEOREM 4.1
• COROLLARY 4.1
• LEMMA 4.1
• THEOREM 4.2

Algorithms --The greedy algorithm
• Looking for shortest common superstrings is the same as looking for Hamiltonian paths of maximum weight in a directed multigraph.
• OM(F) → OG(F)
• A “greedy” attempt at computing the heaviest path.
  The basic idea is to repeatedly add the heaviest available edge.

Algorithms --The greedy algorithm
• Three conditions to test before accepting an edge (f, g) into the Hamiltonian path:
  – No accepted edge already leaves f
  – No accepted edge already enters g
  – The edge does not close a cycle
• Edges are processed in nonincreasing order by weight
• The procedure ends when we have exactly n-1 edges, or when the accepted edges induce a connected subgraph.
• Figure 4.16
• Example 4.5
  – Figure 4.17

Algorithms --Acyclic subgraphs
• Assembling fragments without errors and with known orientation, assuming that the fragments have been obtained from a “good sampling” of the target DNA.
• “Good sampling”: the fragments cover the entire target molecule, and the collection as a whole exhibits enough linkage to guarantee a safe assembly.
• Figure 4.18

Algorithms --Acyclic subgraphs
• The presence of repeated regions, or repeated elements, in the target string S is related to the existence of cycles in the overlap graph.
• Cycles in an overlap graph are necessarily due to repeats in S. The converse is not necessarily true; that is, we may have repeats but still an acyclic overlap graph.
• THEOREM 4.5
• Algorithm: topological sorting
• Example 4.6
  – Figures 4.19, 4.20 and 4.21

Heuristics
• None of the formalisms proposed for fragment assembly is entirely adequate
• Fragment assembly can be viewed as a multiple alignment problem with some additional features:
  – Each fragment can participate with either the direct or the reverse-complemented sequence.
  – The sequences themselves are usually much shorter than the alignment itself.
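
As an illustration of the greedy procedure from the Algorithms slides, here is a minimal Python sketch. It assumes error-free fragments of known orientation and a substring-free collection, and it keeps only the heaviest overlap between each ordered pair of fragments. The function names `overlap` and `greedy_superstring` are hypothetical, chosen for this sketch, and the result is a heuristic superstring, not necessarily a shortest one.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for t in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:t]):
            return t
    return 0


def greedy_superstring(fragments):
    """Greedy heaviest-edge heuristic (hypothetical helper for this sketch).

    Assumes a substring-free collection of error-free fragments.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    n = len(frags)
    if n == 0:
        return ""
    # One edge per ordered pair, weighted by the longest overlap,
    # processed in nonincreasing order by weight.
    edges = sorted(
        ((overlap(frags[i], frags[j]), i, j)
         for i, j in permutations(range(n), 2)),
        key=lambda e: -e[0],
    )
    succ = {}                # accepted edge leaving each node: node -> (next, weight)
    has_pred = set()         # nodes that already have an entering edge
    parent = list(range(n))  # union-find forest, used to reject cycles

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    accepted = 0
    for w, i, j in edges:
        if accepted == n - 1:
            break
        # The three tests: out-degree of i, in-degree of j, no cycle.
        if i in succ or j in has_pred or find(i) == find(j):
            continue
        succ[i] = (j, w)
        has_pred.add(j)
        parent[find(i)] = find(j)
        accepted += 1

    # Spell out the superstring along the accepted path.
    node = next(k for k in range(n) if k not in has_pred)
    result = frags[node]
    while node in succ:
        node, w = succ[node]
        result += frags[node][w:]
    return result


print(greedy_superstring(["ACT", "CTA", "AGT"]))  # prints ACTAGT
```

The union-find structure is one convenient way to test whether an edge would close a cycle. Ties between equal-weight edges are broken here by input order, which is an arbitrary choice of the sketch.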

Heuristics
• Three criteria, following from the second feature:
  – Scoring: entropy is a quantity defined on a group of relative frequencies. It is low when one of these frequencies stands out from the others, and high when they are all more or less equal. The lower the entropy, the better.
  – Coverage: a fragment covers a column i if it participates in this column either with a character or with an internal space.
  – Linkage: the way individual fragments are linked in the layout is another determinant of layout quality.
• Figure 4.22

Heuristics --Assembly in practice
• Practical implementations often divide the whole problem into three phases:
  – Finding overlaps
  – Building a layout
  – Computing the consensus

Heuristics --Assembly in practice: Finding overlaps
• The first step in any assembly problem is fragment overlap detection.
• Determine reverse complements
• Consider fragments entirely contained in other fragments
• Recall Section 3.2.3
  – Figure 4.23

Heuristics --Assembly in practice: Ordering fragments
• Finding a good ordering of fragments in a contig
• No algorithm is both simple and general enough
• There are four issues to keep in mind when building paths:
  – Every path has a corresponding complement path
  – It is not necessary to include contained fragments
  – Cycles usually indicate the presence of repeats
  – Unbalanced coverage may be related to repeats as well (see Figure 4.13)

Heuristics --Assembly in practice: Alignment and consensus
• Building a layout from a path in an overlap graph
• Two techniques related to alignment construction:
  – The first helps in building a good layout from a path in the presence of errors. Example 4.7; implementation: Figure 4.24
  – The second focuses on locally improving an already constructed layout. Example 4.8 in Figure 4.25; implementation: sum-of-pairs scoring scheme
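
The entropy criterion and the majority-vote consensus mentioned in these slides can be illustrated with a short Python sketch. The layout representation (a list of `(offset, aligned_fragment)` pairs with `-` as an internal space) and the names `columns`, `entropy`, and `consensus` are assumptions made for this sketch, not something the slides prescribe. The example layout is a small hypothetical one with a single base call error.

```python
from collections import Counter
from math import log2


def columns(layout):
    """Yield, for each column, the characters of the fragments covering it.

    layout: list of (offset, aligned_fragment) pairs; '-' marks an internal
    space. This representation is an assumption of the sketch.
    """
    width = max(off + len(seq) for off, seq in layout)
    for col in range(width):
        yield [seq[col - off] for off, seq in layout
               if off <= col < off + len(seq)]


def entropy(chars):
    """Shannon entropy (in bits) of the relative frequencies in one column."""
    total = len(chars)
    return -sum((c / total) * log2(c / total)
                for c in Counter(chars).values())


def consensus(layout):
    """Majority vote per column; uncovered or space-won columns are skipped."""
    out = []
    for chars in columns(layout):
        if not chars:
            continue  # no fragment covers this column (a coverage gap)
        winner = Counter(chars).most_common(1)[0][0]
        if winner != "-":
            out.append(winner)
    return "".join(out)


# Hypothetical layout; the last fragment has G instead of C at column 4.
layout = [(2, "ACCGT"), (4, "CGTGC"), (0, "TTAC"), (1, "TACGGT")]
print(consensus(layout))  # prints TTACCGTGC
```

A column where all fragments agree has entropy 0, and a column split evenly among four characters has entropy 2 bits, so summing `entropy` over the columns gives one possible layout score in which lower is better. Ties in the vote are resolved by whichever character appears first in the column, an arbitrary choice of the sketch.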