Paul Medvedev’s brain + Michael Brudno’s brain = Brainchild!
Ab Initio Whole Genome Shotgun Assembly With Mated Short Reads
Presented by Lucas Lochovsky

Outline
• Whole Genome Assembly
• Review of Related Work
• The Medvedev-Brudno Method
• Methods: Bidirected Overlap Graph
• Methods: Adjustments to the Standard Min-cost Biflow Problem
• Methods: Maximizing the Global Read-Count Likelihood
• Methods: Efficiently Solving a Min-cost Biflow (Linear)
• Methods: Show Me the Contigs
• Results
• Discussion

Who Wants to Assemble an Entire Genome?
Two approaches:
I) Clone-by-clone assembly (a.k.a. the “hierarchical shotgun” approach)
• Break the genome into 150 kb fragments
• Insert each fragment into a BAC and map it onto the chromosomes
• Shotgun sequence each BAC and assemble it into a contig

Who Wants to Assemble an Entire Genome? (cont’d)
II) Whole Genome Shotgun (WGS) Assembly
• Break the genome into shotgun-sized fragments and sequence them
• Match the overlapping regions of contiguous sequences
• Demonstrated by Celera Genomics to be feasible for whole genome assembly
• Sequenced the human genome at one-tenth the cost of the public Human Genome Project

Who Wants to Assemble an Entire Genome? (cont’d)
Next Generation Sequencing (NGS): potential third player?
• Improved speed and cost-effectiveness relative to the other methods…
• … but much shorter read length (25-200 bp)
• Only proven on resequencing projects, i.e. where a reference genome is already available
• Not yet proven on ab initio genome sequencing
• No advance knowledge is available for ab initio (def’n: “from first principles”) genome sequencing

Review of Related Work
• Genome assembly intuition: identify the shortest common superstring of all reads
• Equivalent to finding the shortest genome that is possible with the given reads
• Problem: each repeat is only used once, “over-collapsing” the repeats
• Accommodating multiple uses of repeat regions makes sequence assembly amenable to graph-theoretic methods

Review of Related Work (cont’d)
EULER assembler (Pevzner, Tang and Waterman)
• Represent reads as edges and overlaps as vertices in a de Bruijn graph
• Assembly can be efficiently solved as an Eulerian Path Problem: each edge must be visited exactly once
• Repeats are dealt with by using multiple edges for a single repeat read

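The Eulerian-path idea above can be sketched in a few lines. This is a toy illustration, not the EULER implementation: the function names (`de_bruijn_edges`, `eulerian_path`, `spell`) are made up for this example, the reads are assumed to be error-free k-mers from a single strand, and an Eulerian path is assumed to exist.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = []
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.append((kmer[:-1], kmer[1:]))
    return edges

def eulerian_path(edges):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for tail, head in edges:
        out_edges[tail].append(head)
        balance[tail] += 1
        balance[head] -= 1
    # Start at the vertex with one surplus outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the vertex
    return path[::-1]

def spell(path):
    """Concatenate the vertices of a path into the sequence it spells."""
    return path[0] + "".join(v[-1] for v in path[1:])

genome = "ATGGCGTGCA"
reads = [genome[i:i + 3] for i in range(len(genome) - 2)]
assembly = spell(eulerian_path(de_bruijn_edges(reads, 3)))
```

Note that several Eulerian paths can exist in the same graph, so the spelled sequence uses exactly the same k-mers as the source but is not guaranteed to be identical to it.
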
Review of Related Work (cont’d)
String graph (Myers)
• Represent reads as vertices, and read overlaps as edges
• Remove redundant edges
• Establish edge constraints
  - Unique? (flow is exactly one)
  - Required? (min. flow is 1)
  - Optional? (min. flow is 0)
• Find shortest walk

Review of Related Work (cont’d)
• Flow entering and exiting each vertex in these graphs must be balanced
• Gave rise to the idea of network flow methods for assembly
• EULER assembler: minimum cost circulation problem
• String graph: place constraints on copy-counts before solving the flow
• Network flow fails for the assembly problem when there are repeats longer than the span of a single read

Review of Related Work (cont’d)
Sequence assembly using NGS
• Several methods available now (e.g. SSAKE, VCAKE, SHARCGS, etc.)
• All of these assume that the length of the assembled genome must be minimized
• Results in over-collapsing of repeats
• Given the ubiquity of repeats in eukaryotic genomes, the authors considered this a poor assumption

The Medvedev-Brudno Method
• Change the goal of sequence assembly
  - Maximize the likelihood that the resultant genome was the source of the given reads
• Take advantage of the high coverage of NGS to statistically estimate the copy-count of each read: identify and quantify repeats
• Maximizing the likelihood of observed read frequencies can be cast as a minimum cost bidirected flow (biflow) problem
• Allows the solution to be obtained with an off-the-shelf network flow solver
• Authors claim 99.99% accuracy

The Medvedev-Brudno Method (cont’d)
• Second important aspect is the use of matepair information for joining contigs
• Other systems look for all paths between mated reads
• The Medvedev-Brudno Method looks only for short paths between some pairs of reads
• Question: How do you decide the upper bound for these “short paths”? And how do you decide which pairs of reads to examine?

Methods: Bidirected Overlap Graph
• Bidirected graphs are kind of like directed graphs, except each edge has an orientation on each of its ends
• Gives rise to three types of edges:
  - Edges where one arrow points out of a vertex, and one arrow points into a vertex
  - Edges with both arrows pointing out, and
  - Edges with both arrows pointing in (easiest one to do in PowerPoint!)
• For a walk in a bidirected graph, for each vertex on that walk, the orientation of the edge entering the vertex must be opposite that of the edge leaving the vertex

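The walk rule above is easy to check mechanically. In this sketch the representation is an assumption made for illustration: each edge is a tuple `(u, end_u, v, end_v)` whose ends are labelled "in" (arrow points into the vertex) or "out", and a walk is a list of `(edge_index, forward)` steps.

```python
def is_valid_walk(edges, walk):
    """Check the bidirected walk rule at every intermediate vertex.

    edges: list of (u, end_u, v, end_v); each end is "in" or "out".
    walk:  list of (edge_index, forward); forward=True traverses u -> v.
    """
    for (i, fwd_i), (j, fwd_j) in zip(walk, walk[1:]):
        u1, end_u1, v1, end_v1 = edges[i]
        u2, end_u2, v2, end_v2 = edges[j]
        # The end of edge i at which we arrive, and the end of edge j we leave by.
        arrive_vertex, arrive_end = (v1, end_v1) if fwd_i else (u1, end_u1)
        leave_vertex, leave_end = (u2, end_u2) if fwd_j else (v2, end_v2)
        # Consecutive edges must meet at one vertex with opposite orientations.
        if arrive_vertex != leave_vertex or arrive_end == leave_end:
            return False
    return True

edges = [
    ("a", "out", "b", "in"),   # ordinary directed-looking edge a -> b
    ("b", "out", "c", "out"),  # both arrows point out
    ("b", "in", "c", "in"),    # both arrows point in
]
ok = is_valid_walk(edges, [(0, True), (1, True)])   # enter b by "in", leave by "out"
bad = is_valid_walk(edges, [(0, True), (2, True)])  # enter b by "in", leave by "in"
```
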
Methods: Bidirected Overlap Graph (cont’d)
• In a bidirected overlap graph, each vertex is a double-stranded read
• Edges represent read overlaps
• Three possible ways that two double-stranded reads can overlap (corresponding to the three types of edges)
  - Suppose we have two ds reads r1 and r2
  - Each read can be oriented to the left or to the right
  - The three possible overlaps are:
    i) Both strands point in the same direction (both reads can point left, or both can point right; it’s the same overlap either way)
    ii) r1 points left and r2 points right
    iii) r1 points right and r2 points left

Methods: Bidirected Overlap Graph (cont’d)
• A walk along this graph that visits every vertex at least once produces the original double-stranded genome (under the assumptions that the whole genome was covered by the reads, and that the reads are error-free)
• In this paper, the overlap graph is constructed by placing an edge between two reads if they overlap by a minimum number of characters o_min
• Question: How is o_min determined?
• Then perform transitive edge reduction: remove overlaps covered by two shorter overlaps
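
As a rough illustration of the edge-construction step, here is a naive all-pairs overlap finder. It is a sketch only: the function names are invented for this example, it runs in quadratic time, it assumes `o_min >= 1`, and it does not perform the transitive edge reduction. Each overlap is reported twice, once from each strand's point of view.

```python
def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def suffix_prefix_overlap(a, b, o_min):
    """Length of the longest proper suffix of a equal to a prefix of b, or 0 if below o_min."""
    for length in range(min(len(a), len(b)) - 1, o_min - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def overlap_edges(reads, o_min):
    """List (i, strand_i, j, strand_j, length) for every overlap of at least o_min characters."""
    strands = [(r, reverse_complement(r)) for r in reads]
    edges = []
    for i in range(len(reads)):
        for j in range(len(reads)):
            if i == j:
                continue
            for si in (0, 1):          # 0 = read as given, 1 = reverse complement
                for sj in (0, 1):
                    n = suffix_prefix_overlap(strands[i][si], strands[j][sj], o_min)
                    if n:
                        edges.append((i, "+-"[si], j, "+-"[sj], n))
    return edges

edges = overlap_edges(["ACGTTG", "GTTGCA"], o_min=3)
```
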
Methods: Adjustments to the Standard Min-cost Biflow Problem
Standard Min-cost Biflow Problem
• Set upper and lower flow bounds on each edge
• Flow function $f : E \to \mathbb{N}$ must obey the constraint $l_e \le f(e) \le u_e$ for each edge $e$
• For each vertex, the incoming flow is balanced with the outgoing flow
• Objective: Find the flow that minimizes $\sum_{e \in E} c_e f(e)$

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
Medvedev-Brudno Min-cost Biflow Problem
• Upper and lower flow bounds on vertices as well
• Accomplished by splitting every vertex v into two: v+ and v-
• v- serves as the “incoming” vertex, and inherits v’s incoming edges
• v+ serves as the “outgoing” vertex, and inherits v’s outgoing edges
• Finally, add one edge between v- and v+ and assign it the upper and lower flow bounds for v

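The vertex-splitting trick can be sketched as follows. For simplicity this example works on an ordinary directed graph rather than a bidirected one, and the function name `split_vertices` and the tuple layout are assumptions made for illustration. Ordinary edges are given bounds of 0 and infinity here.

```python
def split_vertices(edges, vertex_bounds):
    """Turn vertex bounds into edge bounds by splitting each vertex v into v- and v+.

    edges:         list of (tail, head) directed edges.
    vertex_bounds: {vertex: (lower, upper)}.
    Returns a list of (tail, head, lower, upper) edges on the split graph.
    """
    new_edges = []
    for v, (lower, upper) in vertex_bounds.items():
        # The internal edge v- -> v+ carries the bounds that belonged to v.
        new_edges.append(((v, "-"), (v, "+"), lower, upper))
    for tail, head in edges:
        # Original edges leave from the "+" copy and arrive at the "-" copy.
        new_edges.append(((tail, "+"), (head, "-"), 0, float("inf")))
    return new_edges

split = split_vertices([("a", "b")], {"a": (1, 3), "b": (1, 2)})
```

All flow through v now has to cross the single internal edge, so bounding that edge bounds the vertex.
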
Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
• Second variation: represent the cost c_e as a convex function
• A function is convex if the set of points on or above its graph forms a convex set
• A convex set refers to an area where, for every pair of points within that area, every point on the straight line segment connecting those points also lies within that area

Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
• An area that is not convex would have some sort of concave portion that would contradict the above property of convex sets
• In the overlap graph, convex functions are modelled with piecewise-linear approximations, allowing the flow to be solved as a linear min-cost flow problem

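One common way to feed a convex cost to a linear solver, shown here as a sketch rather than as the authors' exact construction, is to replace an edge by parallel unit-capacity segments whose linear costs are the marginal costs of the convex function. The name `unit_segments` and the quadratic example cost are assumptions for illustration.

```python
def unit_segments(cost, lower, upper):
    """Marginal cost of each extra unit of flow between lower and upper.

    For a convex cost these are non-decreasing, so a min-cost solver that
    fills the cheapest segments first reproduces the convex cost exactly
    at integer flows.
    """
    return [cost(f + 1) - cost(f) for f in range(lower, upper)]

def cost(f):
    return (f - 3) ** 2  # a toy convex cost with its minimum at f = 3

segments = unit_segments(cost, 0, 6)          # [-5, -3, -1, 1, 3, 5]
best_flow = sum(1 for s in segments if s < 0) # use every segment that lowers the cost
```

Because the marginal costs come out sorted, the solver never wants to use a later segment before an earlier one, which is what makes the linear reformulation safe.
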
Methods: Adjustments to the Standard Min-cost Biflow Problem (cont’d)
• Supersource and supersink added to convert the flow problem into a circulation problem
• Each vertex has a lower bound of 1, since each read must appear in the finished genome at least once
• Edge bounds are set to 0 (lower bound) and infinity (upper bound)
• Put a prohibitively large cost on the edge leading from the supersource and the edge leading to the supersink to ensure that the assembly uses the smallest number of contigs possible
• Flow through each vertex represents the number of times it appears in the assembled genome

Methods: Maximizing the Global Read-Count Likelihood
• Start with the probability of a k-mer i being sampled a certain number of times from a genome G
• Let N(G) be the length of the genome assembly of G, and let g_i be the frequency of i in G
• Under the assumption of uniform sampling, the probability of sampling i is g_i/N(G)
• Let X_i be the random variable that represents the number of trials whose outcome is i
• Each random variable for every possible k-mer has a binomial distribution. Their joint distribution over n trials is the following multinomial distribution:
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_{4^k} = x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left(\frac{g_i}{N(G)}\right)^{x_i}$$

Methods: Maximizing the Global Read-Count Likelihood (cont’d)
• From this, derive the global read-count likelihood, the likelihood of k-mer distributions (g_i) given the sampling outcomes (x_i):
$$L(g_1, \ldots, g_{4^k} \mid x_1, \ldots, x_{4^k}) = \frac{n!}{\prod_i x_i!} \prod_i \left(\frac{g_i}{N(G)}\right)^{x_i}$$
• Goal is to maximize L, or, equivalently, minimize the negative log of L
• To translate this problem into a convex min-cost biflow problem, we need a convex function c_i for each k-mer i s.t. $-\log L = \sum_i c_i(g_i)$, up to an additive constant
• Problem: the X_i random variables are not independent…

Methods: Maximizing the Global Read-Count Likelihood (cont’d)
• … unless the number of trials approaches infinity
• The number of trials is usually large enough to warrant the approximation of the multinomial distribution as the product of the binomial distributions for each X_i
• In this binomial approximation, genome length N(G) is constant, and independent of the sampling frequencies
• Therefore, use N instead, which is the actual length of the genome G

Methods: Maximizing the Global Read-Count Likelihood (cont’d)
• New approximation of L:
$$L(g_1, \ldots, g_{4^k} \mid x_1, \ldots, x_{4^k}) \approx \prod_i P(X_i = x_i) = \prod_i \binom{n}{x_i} \left(\frac{g_i}{N}\right)^{x_i} \left(1 - \frac{g_i}{N}\right)^{n - x_i}$$
• Now $-\log L = K + \sum_i c_i(g_i)$, where K is a constant that does not depend on the g_i
• And $c_i(g_i) = -x_i \log(g_i) - (n - x_i) \log(N - g_i)$
• c_i is used as the convex function for the vertices of the min-cost biflow graph described earlier

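To get a feel for the per-k-mer cost c_i, the sketch below evaluates it for made-up numbers (the values of x, n and N are assumptions chosen for illustration, not figures from the paper). The cost is smallest where the copy-count g matches the observed sampling rate, i.e. near g = x·N/n.

```python
import math

def read_count_cost(g, x, n, N):
    """c_i(g_i) = -x_i * log(g_i) - (n - x_i) * log(N - g_i)."""
    return -x * math.log(g) - (n - x) * math.log(N - g)

# Hypothetical numbers: a genome of length N sampled n times,
# with one particular k-mer observed x times.
N, n, x = 1000, 5000, 10

# Try every candidate copy-count and keep the cheapest one.
best_g = min(range(1, 20), key=lambda g: read_count_cost(g, x, n, N))
expected_g = x * N / n  # copy-count suggested by the sampling rate
```

With these numbers each position is sampled about five times, so a k-mer seen ten times is most plausibly present in two copies, and that is where the cost bottoms out.
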
Methods: Efficiently Solving a Min-Cost Biflow (Linear)
• Problem: No existing efficient implementation of a min-cost biflow algorithm
• Solution: Roll your own! Introducing the Medvedev-Brudno Min-cost Biflow Solver!
• It’s fast!
• It’s 2-approximate!
• Get one today! Now available at half-price for one day only!*
*offer contingent on willingness of authors to go into business with this

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• Can solve directed network flow by reducing the problem to a linear program (LP)
• Use a |V| × |E| edge incidence matrix I derived from the overlap graph
  - If cell I_{m,n} has a value of 1, then edge n is an in-edge for vertex m
  - If the value is -1, n is an out-edge
  - 0 means n and m are not on speaking terms
• Use the incidence matrix as the constraint matrix for the LP: the optimal LP solution corresponds to a minimum-cost flow

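Here is a small sketch of the incidence matrix for a directed graph, using the sign convention from the slide (+1 for an in-edge, -1 for an out-edge). The helper names are invented for this example. Multiplying the matrix by a flow vector gives the net inflow at each vertex, which the LP constrains to zero.

```python
def incidence_matrix(vertices, edges):
    """Rows are vertices, columns are edges: +1 if the edge enters the vertex, -1 if it leaves."""
    row = {v: r for r, v in enumerate(vertices)}
    matrix = [[0] * len(edges) for _ in vertices]
    for col, (tail, head) in enumerate(edges):
        matrix[row[tail]][col] -= 1
        matrix[row[head]][col] += 1
    return matrix

def net_inflow(matrix, flow):
    """Matrix-vector product: incoming minus outgoing flow at every vertex."""
    return [sum(a * f for a, f in zip(matrix_row, flow)) for matrix_row in matrix]

vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]  # a simple cycle
I = incidence_matrix(vertices, edges)
balanced = net_inflow(I, [2, 2, 2])    # [0, 0, 0]: a valid circulation
unbalanced = net_inflow(I, [2, 1, 2])  # [0, 1, -1]: b gains a unit, c loses one
```
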
Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• The incidence matrix is Totally Unimodular (TU)
• Translation from wowspeak: every square submatrix of a TU matrix has a determinant of 0, +1 or -1. Therefore, when the flow bounds are integers, every corner point of the LP’s feasible region consists of integers
• Makes it possible to produce an integral solution with LP, rather than resort to Integer Programming, which is NP-hard

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• Possible for +2 or -2 to appear in the incidence matrix, since two in-edges/out-edges can enter a single vertex
• Incidence matrix is actually a binet matrix
• Optimal LP solution for binet matrices is guaranteed to be half-integral (i.e. the coefficients are multiples of 0.5)

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
• The binet matrix can be reduced to a TU matrix through monotonization, i.e. doubling the number of columns and rows
• TU matrix corresponds to a directed graph
• Min-cost directed flow can be solved much faster than LP

Methods: Efficiently Solving a Min-Cost Biflow (Linear) (cont’d)
Monotonization Procedure
• For every vertex v in the bidirected graph, replace it with two vertices v1 and v2 in the new graph
• Each of v’s in-edges is replaced with two edges, one of which points into v1, while the other points out of v2
• Likewise, each of v’s out-edges is replaced with two edges, one of which points out of v1, while the other points into v2
• Bounds and costs from the original graph are transferred to the new graph, and the solution of the new graph will be transferred to the original graph
• Problem can now be solved with off-the-shelf software (no more work for us!)

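The procedure above can be sketched as a small graph transformation. This is one reading of the slide, not the authors' code: the edge format `(u, end_u, v, end_v)` with ends labelled "in" or "out", and the names `stub` and `monotonize`, are assumptions for illustration. Bounds and costs are left out to keep the example short.

```python
def stub(vertex, end, role):
    """Pick the copy of `vertex` that can play `role` ("head" or "tail").

    An in-end offers a head at copy 1 and a tail at copy 2;
    an out-end offers a tail at copy 1 and a head at copy 2.
    """
    if end == "in":
        return (vertex, 1) if role == "head" else (vertex, 2)
    return (vertex, 1) if role == "tail" else (vertex, 2)

def monotonize(bidirected_edges):
    """Replace every bidirected edge by two mirror-image directed edges (tail, head)."""
    directed = []
    for u, end_u, v, end_v in bidirected_edges:
        directed.append((stub(u, end_u, "tail"), stub(v, end_v, "head")))
        directed.append((stub(v, end_v, "tail"), stub(u, end_u, "head")))
    return directed

directed = monotonize([
    ("u", "out", "v", "in"),   # becomes u1 -> v1 and v2 -> u2
    ("u", "in", "v", "in"),    # becomes u2 -> v1 and v2 -> u1
    ("u", "out", "v", "out"),  # becomes u1 -> v2 and v1 -> u2
])
```

Every vertex and every edge is doubled, so the result is an ordinary directed graph that a standard min-cost flow solver can handle.
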
Methods: Show Me the Contigs
• Assuming the flow’s been solved, now it’s time to decompose it into a collection of walks, which translate into assembled contigs
• Graph is first simplified by removing all edges with a flow of zero
• Additional simplifications possible by removing vertices v where:
  - There is exactly one edge going into v and one edge leading out of v, and the flow on both edges is the same
  - Vertices where there is also a loop with the same flow as the other two edges, and
  - Split and join vertices, where the flow on the in-edges is the same as those of the out-edges

Methods: Show Me the Contigs (cont’d)
• After at most 2|V| of these simplifications, the number of edges will be reduced by 105 fold, and only “conflict” vertices remain (those that didn’t match the previous criteria)
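
The first two simplifications can be sketched on a plain directed graph (an assumption made to keep the example short; the real graph is bidirected). The flow is stored as a dictionary from `(tail, head)` edges to flow values, and the function names are invented for this illustration.

```python
def drop_zero_flow(flow):
    """Remove every edge that carries no flow."""
    return {edge: f for edge, f in flow.items() if f > 0}

def pass_through_vertices(flow):
    """Vertices with exactly one in-edge and one out-edge carrying the same flow."""
    ins, outs = {}, {}
    for (tail, head), f in flow.items():
        outs.setdefault(tail, []).append(f)
        ins.setdefault(head, []).append(f)
    return sorted(
        v for v in ins
        if v in outs and len(ins[v]) == 1 and len(outs[v]) == 1 and ins[v] == outs[v]
    )

flow = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 2, ("a", "c"): 1, ("b", "d"): 0}
before = pass_through_vertices(flow)                 # b still has two out-edges
after = pass_through_vertices(drop_zero_flow(flow))  # b is now a simple pass-through
```

The example shows why the zero-flow edges go first: vertex b only becomes removable once its unused out-edge is gone.
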
Conflict Node Resolution
• Look for edges at these vertices with opposite orientations supported by matepairs
• Use BFS to find all reads within a certain distance from the vertex
• Match those reads that are matepairs
• For those matepairs where one read is on the incoming side and the other is on the outgoing side, find the shortest path between them using Dijkstra’s algorithm

Methods: Show Me the Contigs (cont’d)
• Make note of the number of mates that fall within the expected insert distance
• Pairs of in/out edges that have a significant number of matepairs that fall within the insert distance are joined into a common edge
• The previous step is repeated until no more edges can be joined in this manner
• Graph simplification continues in iterative phases until somebody decides it’s time to stop
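
The matepair check can be sketched with a textbook Dijkstra. Everything specific here is an assumption for illustration: the graph is a plain directed adjacency list with edge lengths in bp, the read names and distances are invented, and the 10% tolerance simply mirrors the spread used for the simulated matepairs rather than a threshold stated for this step.

```python
import heapq

def shortest_distance(graph, source, target):
    """Dijkstra over graph = {vertex: [(neighbour, length), ...]}; None if unreachable."""
    best = {source: 0}
    heap = [(0, source)]
    while heap:
        dist, v = heapq.heappop(heap)
        if v == target:
            return dist
        if dist > best.get(v, float("inf")):
            continue  # stale heap entry
        for w, length in graph.get(v, []):
            candidate = dist + length
            if candidate < best.get(w, float("inf")):
                best[w] = candidate
                heapq.heappush(heap, (candidate, w))
    return None

def count_supporting_mates(graph, mate_pairs, insert_size, tolerance=0.1):
    """Count matepairs whose shortest-path distance is within tolerance of the insert size."""
    count = 0
    for left, right in mate_pairs:
        dist = shortest_distance(graph, left, right)
        if dist is not None and abs(dist - insert_size) <= tolerance * insert_size:
            count += 1
    return count

graph = {"r1": [("v", 40)], "v": [("r2", 60), ("r3", 500)], "r2": [], "r3": []}
mates = [("r1", "r2"), ("r1", "r3"), ("r1", "r9")]
support = count_supporting_mates(graph, mates, insert_size=100)
```

Only the first pair supports joining the edges through v: the second is far too long and the third has no path at all.
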
Results
• Unable to obtain real data for experimental validation
• Generated synthetic reads from E. coli genome, which must be 4.6 megabases long, or else it takes up 4.6 MB of hard drive space
• Simulated matepairs’ distances were uniformly distributed within 10% of the expected insert size
• Reads were 25 bp long, and error-free
• Coverage rates involved 50x, 75x, 100x, and 200x
• Minimum overlap length varied between 17 and 21
• Overall running time on one machine: ~1 hour
• Question: What kind of machine?

Read Count Results
• Compared vertex flow with read frequency in the original genome
• High degree of accuracy
• Error rate between 10^-4 and 10^-6
• Generally more tendency to overestimate read frequency
• Authors claim only slight improvements beyond 75x coverage, but 200x coverage is fantastically good

Assembly Results
• Take the edges of the graph produced after the conflict node resolution and generate the sequence it spells out
• Compute N50: The length of the shortest contig s.t. 50% of the genome lies in longer contigs
• Also compute N90: Similar to N50, but the cutoff is 90%
• Finally, compute errors by aligning each contig to the reference genome and seeing how many local alignments it takes to completely tile the contig (minus one because it always takes at least one alignment to do it)
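
N50 and N90 are quick to compute once the contig lengths are known. The sketch below uses the common convention of walking the contigs from longest to shortest until the requested fraction is covered; the function name `n_stat`, the default of measuring against the summed contig length, and the example lengths are assumptions for illustration.

```python
def n_stat(contig_lengths, fraction, total=None):
    """Length of the contig at which the longest contigs first cover `fraction` of the total.

    total defaults to the sum of the contig lengths; pass the genome length
    instead to measure coverage against the reference.
    """
    total = sum(contig_lengths) if total is None else total
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= fraction * total:
            return length
    return 0  # the contigs never reach the requested fraction

lengths = [80, 70, 50, 40, 30, 20]  # made-up contig lengths, total 290
n50 = n_stat(lengths, 0.5)          # 80 + 70 = 150 covers half of 290
n90 = n_stat(lengths, 0.9)          # needs the five longest contigs
```
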
Assembly Results (cont’d)
• Length of contigs that contain 50% of the genome varied between 23 and 28 kb
• Length of contigs that contain 90% of the genome varied between 78 kb
• N50 error rate: ~1/100-180 kb
• N90 error rate: ~1/100-160 kb
• Greedy algorithm can be fooled by several strong edge matches
• Contig size is good relative to other whole genome assemblies involving small read sizes

Discussion
• Successfully demonstrated that ab initio genome assembly is feasible with 25 bp reads and matepair information
• Still needs improvements for sequencing unknown (bacterial) genomes
• In the future, matepair information could be further used to join contigs into supercontigs, and better help resolve conflict nodes
• Convex network flow can be more broadly applied within computational biology

• First major assumption: Reads are error-free
• Can be overcome with higher coverage
• Second major assumption: Uniform sampling
•
•
•
49
of all genomic regions
Reality: certain portions of the genome are
easier to sample than others
More difficult to overcome
Could be overcome by establishing the biases of
the sequencing apparatus used
Is NGS The One?
The One that will herald the beginning of cost-effective whole genome assembly?
Maybe you should ask the Oracle…

That’s all folks!
Discussion questions
• What were the strengths/weaknesses of the Medvedev-Brudno Method? How would you improve it?
• How rigorous do you think the experimental validation was? How would you improve it?
• Would you agree with the conclusions the authors drew from their results? Why or why not?
• How many of you fell asleep during this presentation? What can I do to keep your attention in the future?