Methods: Adjustments to the Standard Min

Paul Medvedev’s brain + Michael Brudno’s brain = Brainchild!
Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads
Presented by Lucas Lochovsky

Outline
• Whole Genome Assembly
• Review of Related Work
• The Medvedev-Brudno Method
• Methods: Bidirected Overlap Graph
• Methods: Adjustments to the Standard Min-cost Biflow Problem
• Methods: Maximizing the Global Read-Count Likelihood
• Methods: Efficiently Solving a Min-cost Biflow (Linear)
• Methods: Show Me the Contigs
• Results
• Discussion

Who Wants to Assemble an Entire Genome?
Two approaches:
I) Clone-by-clone assembly (a.k.a. “hierarchical shotgun” approach)
• Break genome into 150 kb fragments
• Insert each fragment into a BAC and map onto chromosomes
• Shotgun sequence each BAC and assemble into a contig

Who Wants to Assemble an Entire Genome? (cont’d)
II) Whole Genome Shotgun (WGS) Assembly
• Break genome into shotgun-sized fragments and sequence
• Match the overlapping regions of contiguous sequences
• Demonstrated by Celera Genomics to be feasible for whole genome assembly
• Sequenced human genome at 1/10th the cost of the public Human Genome Project

Who Wants to Assemble an Entire Genome? (cont’d)
Next Generation Sequencing (NGS): potential third player?
• Improved speed and cost-effectiveness relative to the other methods…
• … but much shorter read length (25-200 bp)
• Only proven on resequencing projects, i.e. where a reference genome is already available
• Not yet proven on ab initio genome sequencing
• No advance knowledge available for ab initio (def’n: “from first principles”) genome sequencing

Review of Related Work
• Genome assembly intuition: identify the shortest common superstring of all reads
• Equivalent to finding the shortest genome that is possible with the given reads
• Problem: each repeat is only used once, “over-collapsing” the repeats
• Accommodating multiple uses of repeat regions makes sequence assembly amenable to graph-theoretic methods

Review of Related Work (cont’d)
EULER assembler (Pevzner, Tang and Waterman)
• Break reads into k-mers; represent each k-mer as an edge and each (k-1)-mer overlap as a vertex in a de Bruijn graph
• Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once
• Repeats are handled by using multiple edges for a single repeated k-mer

Review of Related Work (cont’d)
String graph (Myers)
• Represent reads as vertices, and read overlaps as edges
• Remove redundant edges
• Establish edge constraints
  - Unique? (flow is exactly one)
  - Required? (min. flow is 1)
  - Optional? (min. flow is 0)
• Find shortest walk

Review of Related Work (cont’d)
• Flow entering and exiting each vertex in these graphs must be balanced
• Gave rise to the idea of network flow methods for assembly
• EULER assembler: minimum cost circulation problem
• String graph: place constraints on copy-counts before solving the flow
• Network flow fails for the assembly problem when there are repeats longer than the span of a single read

Review of Related Work (cont’d)
Sequence assembly using NGS
• Several methods available now (e.g. SSAKE, VCAKE, SHARCGS)
• All of these assume that the length of the assembled genome must be minimized
• Results in over-collapsing of repeats
• Given the ubiquity of repeats in eukaryotic genomes, the authors considered this a poor assumption

The Medvedev-Brudno Method
• Change the goal of sequence assembly
  - Maximize the likelihood that the resultant genome was the source of the given reads
• Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats
• Maximizing the likelihood of observed read frequencies can be cast as a minimum cost bidirected flow (biflow) problem
• Allows the solution to be obtained with an off-the-shelf network flow solver
• Authors claim 99.99% accuracy

The Medvedev-Brudno Method (cont’d)
• Second important aspect is the use of matepair information for joining contigs
• Other systems look for all paths between mated reads
• The Medvedev-Brudno Method looks only for short paths between some pairs of reads
• Question: How do you decide the upper bound for these “short paths”?
  And how do you decide which pairs of reads to examine?

Methods: Bidirected Overlap Graph
• Bidirected graphs are like directed graphs, except each edge has an orientation on each of its ends
• Gives rise to three types of edges:
  - Edges where one arrow points out of a vertex, and one arrow points into a vertex
  - Edges with both arrows pointing out, and
  - Edges with both arrows pointing in (easiest one to do in PowerPoint!)
• For a walk in a bidirected graph, at each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex

Methods: Bidirected Overlap Graph (cont’d)
• In a bidirected overlap graph, each vertex is a double-stranded read
• Edges represent read overlaps
• Three possible ways that two double-stranded reads can overlap (corresponding to the three types of edges)
  - Suppose we have two ds reads r1 and r2
  - Each read can be oriented to the left or to the right
  - The three possible overlaps are:
    i) Both strands point in the same direction (both reads can point left, or both can point right; it’s the same overlap either way)
    ii) r1 points left and r2 points right
    iii) r1 points right and r2 points left

Methods: Bidirected Overlap Graph (cont’d)
• The original double-stranded genome corresponds to a walk along this graph that visits every vertex at least once (under the assumptions that the whole genome was covered by the reads, and that the reads are error-free)
• In this paper, the overlap graph is constructed by placing an edge between two reads if they overlap by at least a minimum number of characters $o_{min}$
• Question: How is $o_{min}$ determined?
• Then perform transitive edge reduction: remove an overlap edge between two reads when it is already implied by a pair of overlaps through an intermediate read

Methods: Adjustments to the Standard Min-cost Biflow Problem
Standard Min-cost Biflow Problem
• Set upper and lower flow bounds on each edge
• Flow function $f : E \to \mathbb{N}$ must obey the constraint $l_e \le f(e) \le u_e$ for each edge $e$
• For each vertex, the incoming flow is balanced with the outgoing flow
• Objective: find the flow that minimizes $\sum_{e} c_e f(e)$

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
Medvedev-Brudno Min-cost Biflow Problem
• Upper and lower flow bounds on vertices as well
• Accomplished by splitting every vertex v into two: v+ and v-
• v- serves as the “incoming” vertex, and inherits v’s incoming edges
• v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges
• Finally, add one edge between v- and v+ and assign it the upper and lower flow bounds for v

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
• Second variation: represent the cost $c_e$ as a convex function
• A function is convex if the set of points on or above its graph forms a convex set
• A convex set is a region where, for every pair of points within it, every point on the straight line segment connecting them also lies within it

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
• A region that is not convex has some concave portion that contradicts the above property of convex sets
• In the overlap graph, convex functions are modelled with piecewise-linear approximations, allowing the flow to be solved as a linear min-cost flow problem

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
• Supersource and supersink added to convert the flow problem into a circulation problem
• Each vertex has a lower bound of 1, since each read must appear in the finished genome at least once
• Edge bounds are set to 0 (lower bound) and infinity (upper bound)
• Put a prohibitively large cost on the edge leading from the supersource and the edge leading to the supersink to ensure that the assembly uses the smallest number of contigs possible
• Flow through each vertex represents the number of times it appears in the assembled genome

Methods: Maximizing the Global Read-Count Likelihood
• Start with the probability of a k-mer $i$ being sampled a certain number of times from a genome $G$
• Let $N(G)$ be the length of the genome assembly of $G$, and let $g_i$ be the number of times $i$ occurs in $G$
• Under the assumption of uniform sampling, the probability of sampling $i$ is $g_i / N(G)$
• Let $X_i$ be the random variable that represents the number of trials (out of $n$ in total) whose outcome is $i$
• Each $X_i$ has a binomial distribution. Their joint distribution is the following multinomial distribution:

$$P(X_1 = x_1, X_2 = x_2, \ldots, X_{4^k} = x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left( \frac{g_i}{N(G)} \right)^{x_i}$$

Methods: Maximizing the Global Read-Count Likelihood (cont’d)
• From this, derive the global read-count likelihood, the likelihood of the k-mer frequencies ($g_i$) given the sampling outcomes ($x_i$):

$$L(g_1, \ldots, g_{4^k} \mid x_1, \ldots, x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left( \frac{g_i}{N(G)} \right)^{x_i}$$

• Goal is to maximize $L$, or, equivalently, minimize the negative log of $L$
• To translate this problem into a convex min-cost biflow problem, we need a convex function $c_i$ for each k-mer s.t. $-\log L = \sum_i c_i(g_i)$
• Problem: the $X_i$ random variables are not independent…

Methods: Maximizing the Global Read-Count Likelihood (cont’d)
• … unless the number of trials approaches infinity
• The number of trials is usually large enough to warrant the approximation of the multinomial distribution as the product of the binomial distributions for each $X_i$
• In this binomial approximation, genome length $N(G)$ is constant, and independent of the sampling frequencies
• Therefore, use $N$ instead, which is the actual length of the genome $G$

Methods: Maximizing the Global Read-Count Likelihood (cont’d)
• New approximation of $L$:

$$L(g_1, \ldots, g_{4^k} \mid x_1, \ldots, x_{4^k}) \approx \prod_i P(X_i = x_i) = \prod_i \binom{n}{x_i} \left( \frac{g_i}{N} \right)^{x_i} \left( 1 - \frac{g_i}{N} \right)^{n - x_i}$$

• Now $-\log L = K + \sum_i c_i(g_i)$, where $K$ is a constant that does not depend on the $g_i$
• And $c_i(g_i) = -x_i \log(g_i) - (n - x_i) \log(N - g_i)$
• $c_i$ is used as the convex function for the vertices of the min-cost biflow graph described earlier

Methods: Efficiently Solving a Min-Cost Biflow (Linear)
• Problem: No existing efficient implementation of a min-cost biflow algorithm
• Solution: Roll your own!
• Introducing the Medvedev-Brudno Min-cost Biflow Solver!
• It’s fast! It’s 2-approximate! Get one today!
• Now available at half-price for one day only!*
*offer contingent on willingness of authors to go into business with this

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• Can solve directed network flow by reducing the problem to a linear program (LP)
• Use a $|V| \times |E|$ edge incidence matrix $I$ derived from the overlap graph
  - If cell $I_{m,n}$ has a value of 1, then edge n is an in-edge for vertex m
  - If the value is -1, n is an out-edge
  - 0 means n and m are not on speaking terms (edge n does not touch vertex m)
• Use the incidence matrix as the constraint matrix for the LP: the optimal LP solution corresponds to a minimum-cost flow

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• The incidence matrix of a directed graph is Totally Unimodular (TU)
• Translation from wowspeak: every square submatrix of a TU matrix has determinant 0, +1 or -1. Therefore, when the bounds are integers, every corner point of the LP’s feasible region consists of integers
• Makes it possible to produce an integral solution with LP, rather than resorting to Integer Programming, which is NP-hard

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• For a bidirected graph, it is possible for +2 or -2 to appear in the incidence matrix, since two in-edges/out-edges can enter a single vertex
• Incidence matrix is actually a binet matrix
• Optimal LP solution for binet matrices is guaranteed to be half-integral (i.e. every value is a multiple of 0.5)

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• The binet matrix can be reduced to a TU matrix through monotonization, i.e. doubling the number of columns and rows
• TU matrix corresponds to a directed graph
• Min-cost directed flow can be solved much faster than a general LP

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
Monotonization Procedure
• Every vertex v in the bidirected graph is replaced with two vertices v1 and v2 in the new graph
• Each of v’s in-edges is replaced with two edges, one of which points into v1, while the other points out of v2
• Likewise, each of v’s out-edges is replaced with two edges, one of which points out of v1, while the other points into v2
• Bounds and costs from the original graph are transferred to the new graph, and the solution of the new graph is transferred back to the original graph
• Problem can now be solved with off-the-shelf software (no more work for us!)

Methods: Show Me the Contigs
• Assuming the flow’s been solved, now it’s time to decompose it into a collection of walks, which translate into assembled contigs
• Graph is first simplified by removing all edges with a flow of zero
• Additional simplifications possible by removing vertices v where:
  - There is exactly one edge going into v and one edge leading out of v, and the flow on both edges is the same
  - There is also a loop with the same flow as the other two edges, and
  - v is a split or join vertex, where the flow on the in-edges is the same as that on the out-edges

Methods: Show Me the Contigs (cont’d)
• After at most 2|V| of these simplifications, the number of edges will be reduced by 105 fold, and only “conflict” vertices remain (those that didn’t match the previous criteria)
Conflict Node Resolution
• Look for edges at these vertices with opposite orientations supported by matepairs
• Use BFS to find all reads within a certain distance from the vertex
• Match those reads that are matepairs
• For those matepairs where one read is on the incoming side and the other is on the outgoing side, find the shortest path between them using Dijkstra’s algorithm

Methods: Show Me the Contigs (cont’d)
• Make note of the number of mates that fall within the expected insert distance
• Pairs of in/out edges that have a significant number of matepairs that fall within the insert distance are joined into a common edge
• The previous step is repeated until no more edges can be joined in this manner
• Graph simplification continues in iterative phases until somebody decides it’s time to stop

Results
• Unable to obtain real data for experimental validation
• Generated synthetic reads from the E. coli genome, which must be 4.6 megabases long, or else it takes up 4.6 MB of hard drive space
• Simulated matepairs’ distances were uniformly distributed within 10% of the expected insert size
• Reads were 25 bp long, and error-free
• Coverage rates tested were 50x, 75x, 100x, and 200x
• Minimum overlap length varied between 17 and 21
• Overall running time on one machine: ~1 hour
• Question: What kind of machine?

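The simulation setup on the Results slide can be sketched roughly as follows. This is only an illustration, not the authors' simulator: the function name `simulate_mated_reads`, the default `insert_size` of 500 bp, and the treatment of the genome as a linear string are all assumptions made here, and mate orientation (reverse complementing) is ignored for brevity.

```python
import random


def simulate_mated_reads(genome, read_len=25, coverage=50,
                         insert_size=500, tolerance=0.10, seed=0):
    """Sample error-free mated reads from a linear genome string.

    insert_size is a hypothetical value (the slides do not give one);
    each pair's outer span is drawn uniformly within +/- tolerance of it.
    Returns a list of (left_read, right_read, span) tuples.
    """
    rng = random.Random(seed)
    lo = int(insert_size * (1 - tolerance))
    hi = int(insert_size * (1 + tolerance))
    if lo < read_len or hi > len(genome):
        raise ValueError("insert size must fit between read length and genome length")

    # coverage = (number of reads * read length) / genome length
    n_reads = coverage * len(genome) // read_len
    n_pairs = n_reads // 2

    pairs = []
    for _ in range(n_pairs):
        span = rng.randint(lo, hi)                   # inclusive bounds
        start = rng.randint(0, len(genome) - span)   # leftmost base of the pair
        left = genome[start:start + read_len]
        right = genome[start + span - read_len:start + span]
        pairs.append((left, right, span))
    return pairs


if __name__ == "__main__":
    toy_rng = random.Random(1)
    toy_genome = "".join(toy_rng.choice("ACGT") for _ in range(2000))
    toy_pairs = simulate_mated_reads(toy_genome, coverage=50)
    print(len(toy_pairs), "mate pairs sampled from a", len(toy_genome), "bp toy genome")
```

Raising `coverage` to 75, 100 or 200 mirrors the coverage levels listed on the slide; a real E. coli run would also need to handle the circular chromosome.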
Read Count Results
• Compared vertex flow with read frequency in the original genome
• High degree of accuracy
• Error rate between $10^{-4}$ and $10^{-6}$
• Generally more tendency to overestimate read frequency
• Authors claim only slight improvements beyond 75x coverage, but 200x coverage is fantastically good

Assembly Results
• Take the edges of the graph produced after the conflict node resolution and generate the sequences they spell out
• Compute N50: the length of the shortest contig s.t. 50% of the genome lies in contigs at least that long
• Also compute N90: similar to N50, but the cutoff is 90%
• Finally, compute errors by aligning each contig to the reference genome and counting how many local alignments it takes to completely tile the contig (minus one, because it always takes at least one alignment to do it)

Assembly Results (cont’d)
• Length of contigs that contain 50% of the genome varied between 23-28 kb
• Length of contigs that contain 90% of the genome varied between 78 kb
• N50 error rate: ~1 error per 100-180 kb
• N90 error rate: ~1 error per 100-160 kb
• Greedy algorithm can be fooled by several strong edge matches
• Contig size is good relative to other whole genome assemblies involving small read sizes

Discussion
• Successfully demonstrated that ab initio genome assembly is feasible with 25 bp reads and matepair information
• Still needs improvements for sequencing unknown (bacterial) genomes
• In the future, matepair information could be further used to join contigs into supercontigs, and better help resolve conflict nodes
• Convex network flow can be more broadly applied within computational biology

Discussion (cont’d)
• First major assumption: Reads are error-free
• Can be overcome with higher coverage
• Second major assumption: Uniform sampling of all genomic regions
• Reality: certain portions of the genome are easier to sample than others
• More difficult to overcome
• Could be overcome by establishing the biases of the sequencing apparatus used

Is NGS The One?
The One that will herald the beginning of cost-effective whole genome assembly?
Maybe you should ask the Oracle…

That’s all folks!

Discussion questions
• What were the strengths/weaknesses of the Medvedev-Brudno Method? How would you improve it?
• How rigorous do you think the experimental validation was? How would you improve it?
• Would you agree with the conclusions the authors drew from their results? Why or why not?
• How many of you fell asleep during this presentation? What can I do to keep your attention in the future?
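Backup sketch: the per-k-mer cost from the read-count likelihood slides, c_i(g_i) = -x_i log(g_i) - (n - x_i) log(N - g_i), can be explored numerically. The snippet below is a minimal illustration under assumptions made here, not the authors' code: the function names are hypothetical, and the piecewise-linear view uses one segment per unit of flow, whereas the paper may use a coarser approximation. It shows that the marginal cost of each extra copy never decreases (which is what lets a convex cost be modelled with linear pieces) and that the cheapest copy count sits at x_i * N / n.

```python
import math


def kmer_cost(g, x, n, N):
    """c_i(g_i) = -x_i*log(g_i) - (n - x_i)*log(N - g_i), defined for 0 < g < N."""
    return -x * math.log(g) - (n - x) * math.log(N - g)


def marginal_costs(x, n, N, g_max):
    """Cost of each additional unit of flow: c(g + 1) - c(g) for g = 1 .. g_max - 1.

    For a convex cost these values are non-decreasing, so each one can be
    attached to its own unit-capacity piece in a linear min-cost flow.
    """
    return [kmer_cost(g + 1, x, n, N) - kmer_cost(g, x, n, N)
            for g in range(1, g_max)]


def best_copy_count(x, n, N, g_max):
    """Integer copy count in 1 .. g_max with the lowest cost (requires g_max < N)."""
    return min(range(1, g_max + 1), key=lambda g: kmer_cost(g, x, n, N))


if __name__ == "__main__":
    # Toy numbers (assumed): n = 1000 trials, genome length N = 100,
    # and a k-mer observed x = 30 times, so x * N / n = 3 copies.
    print(best_copy_count(x=30, n=1000, N=100, g_max=20))  # prints 3
```

Setting the derivative of c_i to zero gives g_i = x_i * N / n, so with these toy numbers the minimum falls exactly on the integer 3; in the full method the flow constraints of the overlap graph can pull individual k-mers away from their unconstrained optimum.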
