Scaffolding Large Genomes Using Integer Linear Programming

JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKY, ION MANDOIU*
ACM-BCB 2012
University of Connecticut*
Georgia State University
De-novo Assembly Paradigm
the genome → shotgun sequencing → short reads → de novo assembly → short contigs → scaffolding → the scaffolds
Why Scaffolding?
 Annotation
 Comparative biology
 Re-sequencing and gap filling
 Structural variation!
[Figure: No scaffold: gene XYZ. Scaffold: 5’ UTR, gene XYZ, 3’ UTR.]
Biologist: There are holes in my genes!
[Figure: Sanger Sequencing; 5’ UTR, gene XYZ, 3’ UTR]
Fragmented Genomes
Massive Sequencing Projects
 I5k: 5000 insect and arthropod species
 G10k: 10,000 vertebrate species
Effects of Read Length
 Dog Genome: 7.5x Sanger, N50: 180Kb
 Chicken Genome: 6x Illumina, N50: 12Kb
 Human Genome: 100x Illumina, N50: 24Kb
The Scaffolding Problem
GIVEN
• CONTIGS, PAIRED READS
FIND
• ORIENTATION, ORDERING,
RELATIVE DISTANCE
GOAL
• RECREATE TRUE SCAFFOLDS
Paired Reads
Paired Read Construction
Paired Read Styles
 Mate Pair: same strand and orientation
 Paired End: different strand and orientation
[Figure: reads R1 and R2 of each pair style; labels 100b, 2kb, 10kb]
Linkage Information
 Two contigs are adjacent if:
 A read pair spans the contigs
 State (A, B, C, D)
 Depends on orientation of the read
 Order of contigs is arbitrary
Possible States (mate pair)
 Each read pair can be “consistent” with one of the four states
[Figure: contig i and contig j drawn 5’ to 3’ with reads R1 and R2, one panel for each of the states A, B, C, D]
Scaffolding Graph
Nodes
 Nodes are contigs
Edges
 Adjacent contigs have 4 edges (one for each state)
 Weighted by overlap with repetitive region
State A (contig i, contig j):
$$W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\#\text{bp in repeat region}}{\#\text{bp in read}} \right)$$
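The repeat-aware edge weight can be sketched in Python as follows. This is a minimal illustration, not the SILP implementation: it assumes each read pair supporting a state is summarized by two hypothetical numbers, the bases that fall in an annotated repeat and the total read bases, and it sums 1 - (repeat bp / read bp) over the supporting pairs.

```python
from typing import Iterable, Tuple


def edge_weight(read_pairs: Iterable[Tuple[int, int]]) -> float:
    """Sum, over supporting read pairs, of 1 - (repeat bp / read bp).

    Each item is (bp_in_repeat, bp_in_read). Both field names are
    illustrative assumptions, not taken from the original tool.
    """
    total = 0.0
    for bp_in_repeat, bp_in_read in read_pairs:
        if bp_in_read <= 0:
            raise ValueError("read length must be positive")
        total += 1.0 - bp_in_repeat / bp_in_read
    return total


# Three read pairs supporting state A between contig i and contig j:
# one outside repeats, one half inside a repeat, one fully inside a repeat.
pairs_A = [(0, 200), (100, 200), (200, 200)]
w_ij_A = edge_weight(pairs_A)
```

A pair that lies entirely in a repeat contributes nothing, so an edge supported only by repetitive reads gets weight zero.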
Integer Linear Program Formulation
Variables
Contig orientation: $S_i \in \{0,1\}$
Adjacent contig consistency: $S_{ij} \in \{0,1\}$
Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
Objective
Maximize weight of consistent pairs
$$z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$$
Constraints
Pairwise Orientation
$S_{ij} \le S_j + S_i$
$S_{ij} \ge S_j - S_i$
$S_{ij} \le 2 - S_i - S_j$
$S_{ij} \ge S_i - S_j$
Together these force $S_{ij} = S_i \oplus S_j$: $S_{ij} = 1$ exactly when contigs $i$ and $j$ have different orientations.
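The four pairwise orientation inequalities can be checked by enumeration. The sketch below (function and variable names are illustrative) lists, for every orientation pair, which values of S_ij satisfy all four inequalities; it shows that exactly one value survives each time and that it equals the exclusive-or of the two orientations.

```python
from itertools import product


def feasible_sij(s_i: int, s_j: int) -> list:
    """Return the values of S_ij in {0, 1} allowed by the four inequalities."""
    allowed = []
    for s_ij in (0, 1):
        if (s_ij <= s_j + s_i
                and s_ij >= s_j - s_i
                and s_ij <= 2 - s_i - s_j
                and s_ij >= s_i - s_j):
            allowed.append(s_ij)
    return allowed


# Map every (S_i, S_j) combination to the S_ij values it permits.
table = {(s_i, s_j): feasible_sij(s_i, s_j)
         for s_i, s_j in product((0, 1), repeat=2)}
```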
State Variables
$2A_{ij} \le (1 - S_i) + (1 - S_j)$
$2B_{ij} \le (1 - S_i) + S_j$
$2C_{ij} \le S_i + (1 - S_j)$
$2D_{ij} \le S_i + S_j$
Mutual Exclusivity
$A_{ij} + D_{ij} \le 1 - S_{ij}$
$B_{ij} + C_{ij} \le S_{ij}$
Forbid 2 Cycles
$B_{ij} + C_{ij} \le S_{ij}$
$A_{ij} + D_{ij} \le 1 - S_{ij}$
Forbid 3 Cycles
$A_{ij} + A_{jk} + A_{ki} \le 2$
$B_{ij} + C_{jk} + A_{ki} \le 2$
$C_{ij} + A_{jk} + B_{ki} \le 2$
$D_{ij} + C_{jk} + B_{ki} \le 2$
$A_{ij} + B_{jk} + C_{ki} \le 2$
$B_{ij} + D_{jk} + C_{ki} \le 2$
$C_{ij} + B_{jk} + D_{ki} \le 2$
$D_{ij} + D_{jk} + D_{ki} \le 2$
*larger cycles are broken at the end
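To see how the objective, the state constraints, and the 3-cycle constraints interact, the toy sketch below replaces the ILP solver with exhaustive enumeration. This is an assumption made for illustration only and is feasible for a handful of contigs at most. It assumes states A, B, C, D correspond to (S_i, S_j) = (0,0), (0,1), (1,0), (1,1), as the state constraints imply, leaves S_ij implicit as the exclusive-or of S_i and S_j, and does not model 2-cycles. All names and weights are made up.

```python
from itertools import product

# State admitted by each orientation pair (S_i, S_j), per the state constraints.
STATE_OF = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}
# State triples (ij, jk, ki) appearing in the eight 3-cycle constraints.
THREE_CYCLES = [("A", "A", "A"), ("B", "C", "A"), ("C", "A", "B"), ("D", "C", "B"),
                ("A", "B", "C"), ("B", "D", "C"), ("C", "B", "D"), ("D", "D", "D")]


def violates_three_cycle(edges, x):
    """True if some triple (i,j), (j,k), (k,i) has all three state variables set."""
    edge_set = set(edges)
    for (i, j) in edges:
        for (j2, k) in edges:
            if j2 != j or (k, i) not in edge_set:
                continue
            for a, b, c in THREE_CYCLES:
                total = (x.get(((i, j), a), 0) + x.get(((j, k), b), 0)
                         + x.get(((k, i), c), 0))
                if total > 2:
                    return True
    return False


def solve_tiny(nodes, weights):
    """Enumerate orientations and edge selections; return (z, S, chosen states).

    weights maps a directed pair (i, j) to {"A": w, "B": w, "C": w, "D": w}.
    """
    edges = list(weights)
    best = (float("-inf"), None, None)
    for bits in product((0, 1), repeat=len(nodes)):
        S = dict(zip(nodes, bits))
        # Given the orientations, only one state variable per edge may be 1.
        state = {e: STATE_OF[(S[e[0]], S[e[1]])] for e in edges}
        for picks in product((0, 1), repeat=len(edges)):
            x = {(e, state[e]): p for e, p in zip(edges, picks)}
            if violates_three_cycle(edges, x):
                continue
            z = sum(weights[e][state[e]] * p for e, p in zip(edges, picks))
            if z > best[0]:
                best = (z, S, {e: state[e] for e, p in zip(edges, picks) if p})
    return best


nodes = ["c1", "c2", "c3"]
weights = {
    ("c1", "c2"): {"A": 3, "B": 0, "C": 0, "D": 1},
    ("c2", "c3"): {"A": 2, "B": 0, "C": 0, "D": 0},
    ("c3", "c1"): {"A": 1, "B": 0, "C": 0, "D": 0},
}
z, S, chosen = solve_tiny(nodes, weights)
```

In this instance all three edges favor state A, but selecting all of them would close a 3-cycle, so the enumeration drops the lightest edge.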
Largest Connected Component
Graph Decomposition: Articulation Points
[Figure: graph with an articulation point marked; label "solve"]
MIP, Salmela 2011
Largest Biconnected Component
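Articulation points can be found with a single depth-first search using low-link values. The sketch below is a generic textbook version in plain Python, not code from SILP or MIP; the edge list is made up, and the recursion would need to be made iterative for graphs with many thousands of contigs.

```python
from collections import defaultdict


def articulation_points(edges):
    """Return the set of articulation points of an undirected simple graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)

    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier-discovered vertex.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if v's subtree cannot climb above u.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # The DFS root is a cut vertex only if it has several DFS children.
        if parent is None and children > 1:
            points.add(u)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points


# Two triangles of contigs joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("c", "d"), ("d", "e"), ("e", "f"), ("f", "d")]
cuts = articulation_points(edges)
```

Removing either `c` or `d` disconnects the two triangles, so each biconnected piece could be handled separately.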
Non-Serial Dynamic Programming
A technique that exploits the sparsity of the scaffolding graph by computing the
solution in stages, with each stage incorporating the results of previous stages.
~inspired by (Neumaier, 06)
[Figure: 2-cut; the four +/- orientation combinations of the two cut contigs, labeled $z_A$, $z_B$, $z_C$, $z_D$]
Objective Modification:
$$z_A + \frac{z_D + z_C - z_B - z_A}{2} S_i + \frac{z_D - z_C + z_B - z_A}{2} S_j + \frac{-z_D + z_C + z_B - z_A}{2} S_{ij}$$
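One way to read the objective modification: the added term must reproduce the four sub-solutions at the four orientation combinations of the cut contigs. Assuming z_A, z_B, z_C, z_D correspond to (S_i, S_j) = (0,0), (0,1), (1,0), (1,1), matching the state constraints, and S_ij is the exclusive-or of S_i and S_j, the three coefficients are fixed by three linear equations. The sketch below computes them and evaluates the term; the function names and numbers are illustrative.

```python
def cut_coefficients(z_A, z_B, z_C, z_D):
    """Coefficients of 1, S_i, S_j, S_ij that interpolate the four sub-solutions."""
    const = z_A
    c_i = (z_D + z_C - z_B - z_A) / 2
    c_j = (z_D - z_C + z_B - z_A) / 2
    c_ij = (-z_D + z_C + z_B - z_A) / 2
    return const, c_i, c_j, c_ij


def modified_term(coeffs, s_i, s_j):
    """Evaluate the added objective term for one orientation of the cut pair."""
    const, c_i, c_j, c_ij = coeffs
    s_ij = s_i ^ s_j  # pairwise orientation variable
    return const + c_i * s_i + c_j * s_j + c_ij * s_ij


# Made-up optimal values of the eliminated component for each combination.
coeffs = cut_coefficients(z_A=10.0, z_B=7.0, z_C=4.0, z_D=9.0)
values = {(s_i, s_j): modified_term(coeffs, s_i, s_j)
          for s_i in (0, 1) for s_j in (0, 1)}
```

Each orientation of the cut pair recovers the value of the corresponding sub-solution, which is what lets the component be eliminated and summarized in the remaining problem.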
SPQR-tree Based Implementation
• SPQR-tree efficiently finds 2-cuts (Tarjan, 73)
• DFS of SPQR-tree is used to schedule elimination order for NSDP
Post Processing ILP Solution
ILP Solution
 May have cycles
 Not a total ordering for each connected component
[Figure: ILP solution graph on contigs A to F, redrawn as a bipartite graph with an outgoing and an incoming copy of each contig]
 Bipartite matching
 Objectives:
 Max weight
 Max cardinality
 Max cardinality / Max weight
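The bipartite matching step can be sketched as follows. Each contig gets an outgoing and an incoming copy, and a directed edge u → v of the ILP solution links the outgoing copy of u to the incoming copy of v, so a matching leaves every contig with at most one successor and one predecessor. This sketch covers only the max cardinality objective, using the standard augmenting-path method (Kuhn's algorithm); it ignores weights, and the example graph is made up. A matching can still contain cycles, which would have to be broken afterwards.

```python
def max_cardinality_matching(successors):
    """Match outgoing copies to incoming copies.

    successors maps a contig to the list of contigs it points to.
    Returns a dict contig -> chosen successor.
    """
    match_in = {}  # incoming copy v -> outgoing copy u currently matched to it

    def try_assign(u, seen):
        for v in successors.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can be re-routed.
            if v not in match_in or try_assign(match_in[v], seen):
                match_in[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in match_in.items()}


# Hypothetical ILP solution that is not yet a total ordering:
# A and D each have two candidate successors.
solution = {
    "A": ["C", "B"],
    "B": ["C"],
    "C": ["D"],
    "D": ["E", "F"],
    "E": ["F"],
}
chain = max_cardinality_matching(solution)
```

Here A first claims C, then gives it up for B when B has no other option, which yields the single path A, B, C, D, E, F.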
GAGE Framework
Genome                     Size (Mb)   # reads
Staphylococcus aureus      2.9         3,494,070
Rhodobacter sphaeroides    4.6         2,050,868
Human Chr14                107         22,669,408
 Assembled using:
 ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA,
SOAPdenovo, Velvet
 Scaffolded using:
 SILP (our method), Opera, MIP, Bambus2
Testing Metrics
 TPN50
 Break scaffold at incorrect edges, then find N50
 N50: the size such that contigs of at least this size contain 50% of the total bases
 Binary Classification
 Given n contigs in a scaffold
 How many of n-1 adjacencies can you predict
 PPV
 Sensitivity
 MCC
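The metrics above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: a scaffold is a list of contig lengths with one correctness flag per join, piece length is the sum of contig lengths (gaps are ignored), and the true/false positive and negative counts are supplied by the caller, since how true negatives are counted for adjacencies is not specified here. All names and numbers are illustrative.

```python
import math


def n50(lengths):
    """Largest L such that pieces of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def break_at_errors(contig_lengths, join_is_correct):
    """Split one scaffold at incorrect joins; return the piece lengths."""
    pieces, current = [], contig_lengths[0]
    for length, ok in zip(contig_lengths[1:], join_is_correct):
        if ok:
            current += length
        else:
            pieces.append(current)
            current = length
    pieces.append(current)
    return pieces


def classification_metrics(tp, fp, fn, tn):
    """Return (PPV, sensitivity, MCC) from adjacency prediction counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# One scaffold of four contigs whose second join is wrong.
pieces = break_at_errors([100, 200, 50, 150], [True, False, True])
tpn50 = n50(pieces)
ppv, sensitivity, mcc = classification_metrics(tp=8, fp=2, fn=4, tn=6)
```

Breaking at the wrong join splits the scaffold into two pieces, and the N50 is then taken over those pieces instead of the original scaffold.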
Results
[Figure: Scaffolding TPN50 (bp), axis 0 to 450,000, by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
[Figure: PPV, axis 0% to 120%, by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
[Figure: Sensitivity, axis 0% to 80%, by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
[Figure: Matthews Correlation Coefficient (MCC), axis 0% to 90%, by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]
Conclusions
 Success
 ILP solves scaffolding problem!
 NSDP works
 Improvements
 Include SOAPdenovo, Allpaths-LG scaffolds in comparison
 Look at parameter effects
 Practical considerations (read style, multi-libraries, merge ctgs)
 Future Work
 Where else can I apply NSDP?
 Scaffold before assembly … promising
 Structural Variation??
Questions?