ppt - University of Connecticut

Scaffolding Large Genomes Using
Integer Linear Programming
JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKY, ION MANDOIU*
University of Connecticut*
Georgia State University
De-novo Assembly Paradigm
[Figure: The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds]
Why Scaffolding?
• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!
[Figure: panels "No scaffold" (gene XYZ) and "Scaffold" (5’ UTR, gene XYZ, 3’ UTR)]
Why Scaffolding?
• Re-sequencing and gap filling
Biologist: There are holes in my genes!
[Figure: 5’ UTR, gene XYZ, 3’ UTR; label "Sanger Sequencing"]
Read Pairs
Paired Read Construction
[Figure: reads R1 and R2 taken from a 2kb fragment; same strand and orientation]
Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads
  • Use the non-unique reads later
• Both reads in a pair must map to different contigs
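The filtering rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the alignment tuple layout `(pair_id, mate, contig, pos, strand)` and the function name `informative_pairs` are assumptions made for the example.

```python
from collections import defaultdict

def informative_pairs(alignments):
    """Keep read pairs whose two mates each map uniquely, to different contigs.

    alignments: iterable of (pair_id, mate, contig, pos, strand) tuples,
    a hypothetical layout where mate is 1 or 2 and strand is "+" or "-".
    """
    hits = defaultdict(list)
    for pair_id, mate, contig, pos, strand in alignments:
        hits[(pair_id, mate)].append((contig, pos, strand))

    kept = {}
    for pid in sorted({pair_id for pair_id, _ in hits}):
        h1 = hits.get((pid, 1), [])
        h2 = hits.get((pid, 2), [])
        if len(h1) != 1 or len(h2) != 1:
            continue  # a mate is unmapped or maps to several places
        if h1[0][0] == h2[0][0]:
            continue  # both mates on one contig: no linkage information
        kept[pid] = (h1[0], h2[0])
    return kept

example_alignments = [
    ("p1", 1, "c1", 100, "+"), ("p1", 2, "c2", 50, "+"),    # informative
    ("p2", 1, "c1", 200, "+"), ("p2", 2, "c1", 2100, "+"),  # same contig
    ("p3", 1, "c1", 300, "+"), ("p3", 1, "c3", 10, "+"),    # R1 not unique
    ("p3", 2, "c2", 70, "-"),
]
kept_pairs = informative_pairs(example_alignments)
```

Pairs rejected here for non-unique mapping could be set aside rather than discarded, in line with the "use the non-unique reads later" bullet.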
Linkage Information
• Two contigs are adjacent if:
  • A read pair spans the contigs
• State (A, B, C, D)
  • Depends on orientation of the read
  • Order of contigs is arbitrary
Possible States
[Figure: reads R1 and R2 on contig i and contig j, drawn 5’ to 3’, for each of the states A, B, C, D]
• Each read pair can be “consistent” with one of the four states
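As a rough sketch of how read pairs might be tallied per state, the Python below counts supporting pairs for each contig pair. The mapping from strand combinations to the letters A to D is an assumption made for illustration; the slides do not spell it out. It is chosen only so that A and D are the states with matching strands and B and C the states with differing strands, which agrees with how the constraints later in the deck group A with D and B with C.

```python
from collections import Counter

# Assumed labelling (not given in the slides): strands of (R1, R2) -> state.
STATE_OF = {("+", "+"): "A", ("+", "-"): "B", ("-", "+"): "C", ("-", "-"): "D"}

def state_counts(pairs):
    """Count read pairs supporting each (contig_i, contig_j, state).

    pairs: iterable of ((contig1, strand1), (contig2, strand2)).
    Contig order is arbitrary, so each pair is put into sorted contig order
    first; the strands travel with their contigs.
    """
    counts = Counter()
    for (c1, s1), (c2, s2) in pairs:
        if c1 > c2:
            (c1, s1), (c2, s2) = (c2, s2), (c1, s1)
        counts[(c1, c2, STATE_OF[(s1, s2)])] += 1
    return counts

example_pairs = [
    (("c1", "+"), ("c2", "+")),
    (("c2", "+"), ("c1", "-")),  # reordered to (c1 "-", c2 "+")
    (("c1", "-"), ("c2", "+")),
]
example_counts = state_counts(example_pairs)
```

In the un-weighted objective these counts could serve directly as the per-state weights.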
The Scaffolding Problem
Given
• Contigs
• Paired reads
Find
• Orientation
• Ordering
• Relative Distance
Goal
• Recreate true scaffolds
Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted: $W_{ij}^A, W_{ij}^B, W_{ij}^C, W_{ij}^D$
  • Overlap with repeat
  • Deviation of expected distance
  • …
Graph Representation
• Using input we can define a scaffolding graph: $G = (V, E)$
  • $V$, set of all contigs
  • $E$, set of adjacent contigs $(i, j)$
• This is an undirected multi-graph
• Assume it is connected
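One possible in-memory form of the scaffolding graph is sketched below in plain Python. The names `build_scaffolding_graph` and `connected_components` are hypothetical. Parallel edges of the multi-graph are stored as a multiplicity per contig pair, and the component search shows how the "assume it is connected" condition could be met by handling each component on its own.

```python
from collections import defaultdict, deque

def build_scaffolding_graph(contigs, links):
    """Undirected multi-graph: adj[i][j] = number of parallel edges i-j.

    links: iterable of (contig_i, contig_j), one entry per spanning read pair.
    """
    adj = {c: defaultdict(int) for c in contigs}
    for i, j in links:
        adj[i][j] += 1
        adj[j][i] += 1
    return adj

def connected_components(adj):
    """Breadth-first search; returns a list of sorted vertex lists."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            v = queue.popleft()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components

example_graph = build_scaffolding_graph(
    ["c1", "c2", "c3", "c4"],
    [("c1", "c2"), ("c1", "c2"), ("c2", "c3")],
)
example_components = connected_components(example_graph)
```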
Integer Linear Program Formulation
Variables
Contig Orientation: $S_i \in \{0,1\}$
Pairwise Contig Consistency: $S_{ij} \in \{0,1\}$
Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
Objective
Maximize weight of consistent pairs
$$\max \sum_{(i,j) \in E} \left( W_{ij}^A A_{ij} + W_{ij}^B B_{ij} + W_{ij}^C C_{ij} + W_{ij}^D D_{ij} \right)$$
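Constraints
Pairwise Orientation (together these force $S_{ij} = S_i \oplus S_j$)
$S_{ij} \le S_j + S_i$
$S_{ij} \ge S_j - S_i$
$S_{ij} \le 2 - S_i - S_j$
$S_{ij} \ge S_i - S_j$
Mutual Exclusivity
$A_{ij} + D_{ij} \le 1 - S_{ij}$
$B_{ij} + C_{ij} \le S_{ij}$
Forbid 2 and 3 Cycles Explicitly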
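To make the formulation concrete, the sketch below does two things in plain Python. It checks by enumeration that the four pairwise-orientation inequalities admit only S_ij = S_i XOR S_j, and it solves a tiny instance by brute force instead of calling an ILP solver. The cycle-forbidding constraints are not modelled, weights are assumed non-negative, and the function names and example weights are made up for illustration.

```python
from itertools import product

def orientation_constraints_hold(s_i, s_j, s_ij):
    """The four pairwise-orientation inequalities from the slide."""
    return (s_ij <= s_j + s_i and s_ij >= s_j - s_i
            and s_ij <= 2 - s_i - s_j and s_ij >= s_i - s_j)

# For each (S_i, S_j), list the feasible values of S_ij.
feasible = {
    (s_i, s_j): [s for s in (0, 1) if orientation_constraints_hold(s_i, s_j, s)]
    for s_i, s_j in product((0, 1), repeat=2)
}

def solve_tiny(contigs, weights):
    """Brute-force the objective over all orientations (exponential time).

    weights: {(i, j): {"A": w, "B": w, "C": w, "D": w}} with w >= 0.
    """
    best_score, best_orient = -1, None
    for bits in product((0, 1), repeat=len(contigs)):
        s = dict(zip(contigs, bits))
        score = 0
        for (i, j), w in weights.items():
            if s[i] ^ s[j]:
                score += max(w["B"], w["C"])  # B + C <= S_ij = 1
            else:
                score += max(w["A"], w["D"])  # A + D <= 1 - S_ij = 1
        if score > best_score:
            best_score, best_orient = score, s
    return best_score, best_orient

example_weights = {
    ("c1", "c2"): {"A": 5, "B": 1, "C": 0, "D": 0},
    ("c2", "c3"): {"A": 0, "B": 4, "C": 0, "D": 0},
    ("c1", "c3"): {"A": 3, "B": 0, "C": 2, "D": 0},
}
tiny_score, tiny_orient = solve_tiny(["c1", "c2", "c3"], example_weights)
```

Brute force is only usable for a handful of contigs; it is shown here to make the meaning of the variables and constraints checkable by hand.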
Graph Decomposition: Articulation Points
[Figure: graph split at an articulation point; label "solve"]
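Articulation points can be found with the standard depth-first low-point method. The sketch below is a generic textbook version in Python, not code from the presented tool. It assumes the graph is given as a plain adjacency mapping, and because it is recursive it would need an iterative rewrite for graphs with hundreds of thousands of contigs.

```python
def articulation_points(adj):
    """Return the set of vertices whose removal disconnects the graph.

    adj: {vertex: iterable of neighbours}, undirected. Parallel edges can be
    collapsed beforehand since they do not change articulation points.
    """
    disc, low, points = {}, {}, set()

    def dfs(v, parent):
        disc[v] = low[v] = len(disc)  # discovery index
        children = 0
        for w in adj[v]:
            if w not in disc:
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                # No back edge from w's subtree climbs above v.
                if parent is not None and low[w] >= disc[v]:
                    points.add(v)
            elif w != parent:
                low[v] = min(low[v], disc[w])
        if parent is None and children > 1:
            points.add(v)  # DFS root with several subtrees

    for v in adj:
        if v not in disc:
            dfs(v, None)
    return points

# Two triangles sharing vertex "c".
two_triangles = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
    "d": ["c", "e"], "e": ["c", "d"],
}
cut_vertices = articulation_points(two_triangles)
```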
Graph Decomposition: 2-cuts
[Figure: graph split at a 2-cut; nodes labeled + and -]
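For small graphs, 2-cuts (pairs of vertices whose removal disconnects the graph) can be listed by direct search, as in the Python sketch below. This is a quadratic-time illustration with hypothetical function names; an SPQR-tree construction would find the same separation pairs far more efficiently.

```python
from itertools import combinations

def is_connected(adj, removed):
    """True if the graph stays connected after deleting `removed` vertices."""
    nodes = [v for v in adj if v not in removed]
    if not nodes:
        return True
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in removed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(nodes)

def two_cuts(adj):
    """All vertex pairs whose removal disconnects the graph."""
    return [pair for pair in combinations(sorted(adj), 2)
            if not is_connected(adj, set(pair))]

# A 4-cycle a-b-c-d-a: removing opposite corners separates the other two.
four_cycle = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
cycle_cuts = two_cuts(four_cycle)
```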
Non-Serial Dynamic Programming
• SPQR-tree to schedule decomposition
• Traverse tree using DFS
• NSDP utilizes solutions of previous stage in current stage
[Figure: panels "Largest Connected Component", "Largest Biconnected Component", "Largest Triconnected Component"]
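The idea of reusing an earlier stage's solutions can be illustrated on a single 2-cut. In the Python sketch below, one side of the cut is solved once for every orientation of the two cut contigs, and the resulting table is reused while solving the other side. The SPQR-tree scheduling is not implemented, each stage is solved by brute force rather than by an ILP, and all names and weights are invented for the example.

```python
from itertools import product

def W(A=0, B=0, C=0, D=0):
    """Hypothetical helper: state weights for one contig pair."""
    return {"A": A, "B": B, "C": C, "D": D}

def best_given(contigs, weights, fixed):
    """Best objective over orientations of `contigs`, some of them fixed."""
    free = [c for c in contigs if c not in fixed]
    def score(s):
        return sum(max(w["B"], w["C"]) if s[i] ^ s[j] else max(w["A"], w["D"])
                   for (i, j), w in weights.items())
    return max(score({**fixed, **dict(zip(free, bits))})
               for bits in product((0, 1), repeat=len(free)))

def solve_across_two_cut(part1, w1, part2, w2, cut):
    u, v = cut
    # Stage 1: tabulate part 1 for each orientation of the cut contigs.
    table = {(su, sv): best_given(part1, w1, {u: su, v: sv})
             for su, sv in product((0, 1), repeat=2)}
    # Stage 2: solve part 2 per cut orientation, reusing the stage-1 table.
    return max(table[(su, sv)] + best_given(part2, w2, {u: su, v: sv})
               for su, sv in product((0, 1), repeat=2))

w_part1 = {("u", "a"): W(A=5), ("a", "v"): W(B=4)}
w_part2 = {("u", "b"): W(A=3), ("b", "v"): W(A=3), ("u", "v"): W(D=2)}
staged = solve_across_two_cut(["u", "v", "a"], w_part1,
                              ["u", "v", "b"], w_part2, ("u", "v"))
direct = best_given(["u", "v", "a", "b"], {**w_part1, **w_part2}, {})
```

The staged result matches the direct solve because the two parts share only the cut contigs, so once those orientations are fixed the parts no longer interact.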
Post Processing ILP Solution
ILP Solution
[Figure: directed graph on contigs A, B, C, D, E, F]
• May have cycles
• Not a total ordering for each connected component
• Bipartite matching
[Figure: bipartite graph with an "outgoing" copy and an "incoming" copy of each of A, B, C, D, E, F]
• Objectives:
  • Max weight
  • Max cardinality
  • Max cardinality / Max weight
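The max-cardinality variant of the matching step can be sketched with the classic augmenting-path method. In the Python below, each contig has an outgoing copy and an incoming copy, and a matching lets every contig keep at most one successor and at most one predecessor. The edge set is invented for the example, and the weighted objectives are not covered.

```python
def max_cardinality_matching(succ):
    """Pick at most one successor and one predecessor per contig.

    succ: {u: [v, ...]} directed edges u -> v taken from the ILP solution.
    Returns {u: v}, the chosen successor of each matched contig.
    """
    match_in = {}  # incoming copy v -> outgoing copy u currently using it

    def try_assign(u, visited):
        for v in succ.get(u, []):
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be moved elsewhere.
            if v not in match_in or try_assign(match_in[v], visited):
                match_in[v] = u
                return True
        return False

    for u in succ:
        try_assign(u, set())
    return {u: v for v, u in match_in.items()}

# Hypothetical ILP output with branching at A and B.
example_succ = {"A": ["B", "C"], "B": ["C", "D"], "C": ["D"]}
chosen = max_cardinality_matching(example_succ)
```

A matching of this kind leaves a set of vertex-disjoint paths and possibly cycles, so any remaining cycle would still have to be broken before a scaffold is reported.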
Testing Framework
• Venter Genome
Read Type    Total Reads  Total Bases  Avg length  Coverage
Sanger       31,861,976   2.79E+10     875         9.930637
SOLiD pairs  4.85E+08     2.42E+10     50          8.623028
• 4x Assembly
# Reads     # Bases in reads
112,00,000  1.1E+10
# Contigs  # Bases in contigs  N50
422,837    2.26E+09            7704
Testing Metrics
• Computer Scientists
  • Finding Scaffold = Binary Classification Test
    • n contigs, try to predict n-1 adjacencies
    • TP, FP, TN, FN, Sensitivity, PPV
• Biologists (main focus)
  • N50 (length-weighted median scaffold size: scaffolds at least this long hold half of the bases; ignore gaps)
  • TP50
    • Break scaffold at incorrect edges, then find N50
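These metrics are simple enough to write down directly. The Python sketch below treats adjacencies as unordered contig pairs and ignores orientation and gap sizes, which is a simplification assumed for the example. The function names and the toy data are hypothetical.

```python
def sensitivity_ppv(predicted, true):
    """Sensitivity = TP / (TP + FN), PPV = TP / (TP + FP) over adjacencies."""
    pred = {frozenset(e) for e in predicted}
    tru = {frozenset(e) for e in true}
    tp = len(pred & tru)
    return tp / len(tru), tp / len(pred)

def n50(lengths):
    """Largest L such that pieces of length >= L cover half of all bases."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

def tp50(scaffolds, contig_len, true):
    """Break each scaffold at adjacencies not in `true`, then take the N50."""
    tru = {frozenset(e) for e in true}
    pieces = []
    for scaffold in scaffolds:  # each scaffold is a non-empty contig list
        piece = [scaffold[0]]
        for a, b in zip(scaffold, scaffold[1:]):
            if frozenset((a, b)) in tru:
                piece.append(b)
            else:
                pieces.append(piece)
                piece = [b]
        pieces.append(piece)
    return n50([sum(contig_len[c] for c in piece) for piece in pieces])

toy_len = {"c1": 100, "c2": 200, "c3": 300, "c4": 400}
toy_true = [("c1", "c2"), ("c2", "c3"), ("c3", "c4")]
toy_scaffolds = [["c1", "c2", "c4", "c3"]]
toy_pred = [("c1", "c2"), ("c2", "c4"), ("c4", "c3")]
toy_sens, toy_ppv = sensitivity_ppv(toy_pred, toy_true)
```

In the toy data the scaffold contains one wrong join (c2 to c4), so TP50 is computed on the two pieces left after breaking it there.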
Results
test case  method  bundle size  sensitivity  ppv     N50     TP50
10%        opera   2            81.13%       99.26%  27,567  27,327
10%        mip     2            59.01%       98.94%  19,988  19,755
10%        ilp     1            79.86%       98.58%  26,814  26,459
25%        opera   2            80.44%       98.27%  27,296  26,849
25%        mip     2            58.95%       97.56%  19,842  19,518
25%        ilp     1            79.30%       96.93%  26,684  26,079
100%       opera   3            pending      …       …       …
100%       mip     3            failed       n/a     n/a     n/a
100%       ilp     1            68.25%       89.90%  20,538  19,006
Conclusions
• Success
  • ILP solves scaffolding problem!
  • NSDP works.
• Improvements
  • Finalize large test cases (then publish?!)
  • Practical considerations (read style, multi-libraries, merge ctgs)
• Future Work
  • Where else can I apply NSDP?
  • Scaffold before assembly??
  • Structural Variation??
Questions?