two - Kent State University

advertisement
Quasispecies Assembly
Using Network Flows
Alex Zelikovsky
Georgia State University
Joint work with
Kelly Westbrooks
Irina Astrovskaya
David Campo
Yury Khudyakov
Piotr Berman
Ion Mandoiu
Georgia State University
Georgia State University
Centers for Disease Control
Centers for Disease Control
Pennsylvania State University
University of Connecticut
Outline

454 Sequencing of Virus Genome

Quasispecies Assembly

The Read Graph

Network Flow Formulations

Phasing Flow Problem

Maximum Likelihood

Simulation Results
HCV Quasispecies





HCV is a small, enveloped, positive sense single strand RNA virus that
is responsible for Hepatitis C infection.
Over the course of infection, the mutations made in replication are
passed down to descendants, eventually producing a family of related
variants of the ancestral genome referred as quasispecies.
Due to HCV's high mutation rate, in time the quasispecies in an infected
person can become very diverse.
A better understanding of HCV quasispecies diversity could potentially
lead to new treatments.
The ultimate objective of our work is to develop a method of
computationally inferring the quasispecies sequences in a HCV-infected
individual.
454 Sequencing




454 Sequencing is one promising technology that may prove useful for
quasispecies sequencing. It is a massively-parallel pyrosequencing
system developed the by biotechnology firm 454 Life Sciences for
genome sequencing.
The system fragments the source genetic material to be sequenced into
pieces called reads. Then, each read is sequenced and the original
genome is reconstructed via software.
Since this system was originally designed to sequence genetic material
from a single organism, the software assembles all of the reads to a
single genome.
In order to use 454 for quasispecies sequencing, new methods and
software are needed to correctly infer the sequences of the quasispecies
present in an infected person and their population frequencies directly
from 454 read data.
Problem Formulations

Given a collection of 454 reads taken from a
quasispecies population of unknown size and
distribution, reconstruct the quasispecies
population, i.e. the sequences and tehir
frequencies.
Original
Quasispecies
454 Reads
Reconstructed
Quasispecies
Parsimonious/Min Cost
Quasispecies Assembly



Given a set of reads,
Find the minimum number set of quasispecies covering all reads
Given a set of reads with costs on read overlaps,
Find the minimum number set of quasispecies covering all reads
The cost of the assembly should be inversely connected to the likelihood
that the assembly is the correct one.
Reconstruction 1
Cost: C1
Original
Quasispecies
454 Reads
If C1 < C2, then favor Reconstruction 1 over Reconstruction 2
Reconstruction 2
Cost: C2
Read Alignment



Before beginning assembly, first find the genome offset of the
read. We assume that the consensus sequence for the particular
strain of HCV that the quasispecies came from is available to us.
We simply align each read to the consensus sequence, choosing
the offset that yields the best alignment (i.e. lowest Hamming
distance).
Because HCV quasispecies don't contain repeats as long as a
454 read, the alignment is both fast and extremely accurate.
GUCUCAUCGGAACAGCAAAACACUUGCCCCGAACGCUAGCGGUUGGGGUACUAUUCAAUGGCUGUAG
AACAGCAAAACACUUGCUCCGAACGCUAGCGGUUGGGGAACUAU
The Read Graph




The data structure: a directed acyclic graph that contains
every possible quasispecies reconstruction.
An aligned read can be contained within another aligned
read.
Find the subset of reads that are not contained within any
other read

We call these reads “superreads”

“subreads” = everything else
The superreads are vertices in the read graph.
Edges of the Read Graph

Put an edge between read X and read Y if

X overlaps Y in the alignment

some suffix of X = some prefix of Y
UGGACUAGAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGU
GAUGUGGUGGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGA
GGGUGCUCUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCA
CUCCGGAAUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCA
AUACCUUGGUGGCGGGUUAGAGAGAAGAGAGCAAGUGUCAACGCCUA
Quasispecies in the Read Graph

Then, we add two extra vertices: a new vertex with outgoing
edges to all vertices with indegree 0 (the “source”) and a new
vertex with incoming edges from all vertices with outdegree 0 (the
“sink”)
The Read
Graph
Source
Sink
Any path from the source to the sink represents a potential sequence
in the quasispecies population!
Transitive Reduction

Edge u  w logically follows from edges u v and v w


The transitive reduction of a graph


Drop edge u w from consideration – no information, any quasispecies
sequence containing u and w will also have v
= smallest subgraph that maintains all reachability relationships
The graph is partially closed – the transitive reduction found in
O(δ|E|), where δ is the read degree
Estimating Read Frequencies




In general, superreads may be contained in several quasispecies
sequences.
Thus, each superread has associated with its frequency = the
sum over the quasispecies of the population frequencies of
quasispecies that contains the superread.
Although the true read frequencies are unknown to us, we may
estimate them by counting the number of subreads contained
within each superread.
By definition, the read frequency of the source and sink vertices
are 0.
Probability of a True Overlap


Given N reads over Q sequences, each read with L possible
starting positions, the probability that a position is bu for
some read u is N/(LQ).
Let (u, v) be an edge in transitive reduction.
The probability of bv-bu > Δ is proportional to exp(Δ N/(LQ)).
bu
GUGGGGGCAGCGGACGUAUGC
GACGUAUGCAGAACUCUAGGCA
Δ
bv

Probability of an edge from the source or to sink is 0.
Network Flow Through Vertices

Replacing the vertex for read r with


two vertices r_b and r_e
and the edge (r_b, r_e)
Networks Flows

Observe that the true quasispecies sequences in the read graph
can be represented as a flow:
1
1
2
1
1
1
1
3
2
2
2
2
1
1
2
1
1
1
Each vertex has a frequency proportional to the number and frequencies of the
quasispecies that contains it's associated superread. When we solve the flow, we
demand that each vertex has a inflow passing into it >= its frequency.
Min Cost Flow



We define the cost of a flow in the following manner:

The flow cost of an edge is that edge's cost multiplied by
the amount of flow that traverses the edge

The cost of a flow through the graph is the sum of all of
the edge flow costs.
Out of all possible flows that go through all of the vertices in
the graph, we seek to find the flow with the minimum cost.
After solving min cost flow for the graph, all of the edges that
have flow > 0 are assumed to participate in true
quasispecies. The remaining edges can be dropped from
the graph.
LP for Min Cost Flow


Although there are fast combinatorial algorithms for solving min
cost flow, we opted to solve the flow using a linear program.
For each edge e, create a real-valued, nonnegative variable fe to
represent the flow across that edge.
Source
The Read
Graph
Sink
Linear Program for Min Cost
Quasispecies Assembly

Objective:
Minimize the sum of cost(e) * fe over all edges e in the read
graph.
Subject to:
 For all vertices v:
 The sum of f over incoming edges to v equals the sum of
e
fe over outgoing edges from v.
 The sum of f over incoming edges to v is greater or equal
e
to the frequency of v.
Splitting Flow in Quasispecies
Five quasispecies share a common long segment [a,b] and differ on the
left and the right in value of a SNP. The resulted graph with network
flow have multiple feasible solutions.
l
a
b-a > the read length
A
A
C
C
T
Multi set L
A
C
T
b
f=2
r
T
T
A
C
T
Multi set R
T
f=3
f=2
f=1
f=1
f=1
A
C
Quasispecies Matching Problem
Given
two multisets of haplotypes
L on the segment [l, b] and
R on the segment [a, r]
such that all haplotypes are indistinguishable on a common
segment [a, b] (l < a < b < r), |L| = |R|,
Find
the matching between multisets L and R such that
concatenation of the matched haplotypes
correspond to the original quasispecies.
Decomposing the flow into
paths
General strategy:
➡Find a source-to-sink path with positive flow f
➡Subtract f from all of the edge flows in the path
How to find paths?
➡Pick the shortest path ⇒ “most likely quasispecies”
➡Pick the maximum bandwidth path ⇒ “most frequent
quasispecies”
1→3→5→6 is the shortest path
1→2→4→6 is the maximum bandwidth path
Finding Max Bandwidth Paths
A single iteration of the Bellman-Ford algorithm gives an efficient method for
finding the maximum bandwidth path from the source to the sink:





Initialize:

For each vertex I

W(i) ← 0

p(i) ← 0

For the source s

W(s) ← +infinity
Relax:

For each edge (i,j) in order
*** (i,j)< (k,l) if i<k or i=k & j<l ***

if W(j) < min { W(i), cap(i,j) }
*** cap(i,j) capacity of (i,j)

W(j) ← min { W(i), cap(i,j) }

p(j) ← I
Return path p(sink), p(p(sink)),..., source=0
Finding minimum cost paths is simple: just grow a shortest-path tree from the
source using costs for weights.
Maximum Likelihood
Choice
•
After path decomposition, we have a set of candidate quasispecies
sequences, but we don’t know what their frequencies are.
•
Given a set of candidate quasispecies and observed reads
•
Expectation Maximization: alternates between 2 steps until
convergence: Expectation (E) and Maximization (M)
E Step: Calculates the expected likelihood by including the current
estimate of the latent variables
M step: Computes maximum likelihood estimates of parameters by
maximizing the expected likelihood found in the previous E step
EM Implementation
Create a bipartite graph:
Left side: quasispecies
Right side: superreads
Put an edge if quasispecies contains read
Keep 3 sets of numbers:
For each qsp q, keep its estimated frequency fq
For each superread r, keep its frequency nr
For each (q, r) edge, keep pqr
E step: Compute pqr = nr · fq / Σ fq for every
edge
M step: Compute fq = Σ pqr / Σ nr for every qsp
Validation
Real quasispecies population, simulated
reads
Real data: 44 sequences from E1E2 region of
HCV
3 populations consisting of 10 sequences
each:
Uniformly distributed frequencies (the “uniform”
population)
Geometrically distributed frequencies (the
“geometric” population)
Highly skewed distribution (the “skewed”
population)
Instance Generation
Inputs: A quasispecies population Q, n = number of
reads desired n, read length mean μ and
variance σ2
S ←∅
While |S| < number of reads desired
Randomly select a quasispecies q of length lq according
to the population frequency distribution
Generate a read length lr using normal distribution (μ,
σ2)
Generate an offset o using uniform distribution on [0, lq l r]
Extract a substring of length lr starting at position o from
quasispecies q and add it to S
Quality Measures
Percentage of correctly predicted sequences:
Takes into account frequencies
Symmetric difference between multisets
“Switching error”
Generalization of “switching error” from the haplotype
phasing community
Average number of times each path corresponding to a
predicted quasispecies switches between paths in the
read graph corresponding to real quasispecies.
Initial Results
Number of Reads
% Correctly
Predicted
400
40000
8%
400
50000
4%
500
40000
29%
500
50000
38%
400
40000
21%
400
50000
11%
500
40000
50%
500
50000
66%
400
40000
46%
400
50000
10%
500
40000
82%
500
50000
84%
Read Length
Uniform
Geometric
Skewed
Results – Geometric Instances
% Correctly Predicted
Average Error
1.2
1.8
1.6
1
1.2
350
1
450
550
0.8
650
0.6
750
0.4
350
0.8
450
% Correctly Predicted
Average Error
1.4
0.6
550
650
0.4
750
0.2
0.2
0
0
10000
8000
14000
12000
18000
16000
22000
20000
26000
24000
30000
28000
34000
32000
38000
36000
Number of Reads
42000
40000
46000
44000
50000
48000
54000
52000
58000
56000
60000
10000
8000
14000
12000
18000
16000
22000
20000
26000
24000
30000
28000
34000
32000
38000
36000
Number of Reads
42000
40000
46000
44000
50000
48000
54000
52000
58000
56000
60000
Results – Uniform Instances
Average Error
% Correctly Predicted
1.2
2
1.8
1
1.6
350
1.2
450
1
550
0.8
650
750
0.6
0.4
350
0.8
450
% Correctly Predicted
Average Error
1.4
0.6
550
650
0.4
750
0.2
0.2
0
0
10000
8000
14000
12000
18000
16000
22000
20000
26000
24000
30000
28000
34000
32000
38000
36000
Number of Reads
42000
40000
46000
44000
50000
48000
54000
52000
58000
56000
60000
10000
8000
14000
12000
18000
16000
22000
20000
26000
24000
30000
28000
34000
32000
38000
36000
Number of Reads
42000
40000
46000
44000
50000
48000
54000
52000
58000
56000
60000
Download