Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovsky, Ion Mandoiu*
ACM-BCB 2012
*University of Connecticut; Georgia State University

De Novo Assembly Paradigm
genome → (shotgun sequencing) → short reads → (de novo assembly) → short contigs → (scaffolding) → scaffolds

Why Scaffolding?
• Annotation
  [Figure: with no scaffold, gene XYZ alone; with a scaffold, 5' UTR, gene XYZ, 3' UTR]
• Comparative biology
• Re-sequencing and gap filling
  Biologist: "There are holes in my genes!" (closed by Sanger sequencing)
• Structural variation

Fragmented Genomes
Massive sequencing projects:
• I5k: 5,000 insect and arthropod species
• G10k: 10,000 vertebrate species
Effects of read length:
  Genome    Coverage and technology    N50
  Dog       7.5x Sanger                180 Kb
  Chicken   6x Illumina                12 Kb
  Human     100x Illumina              24 Kb

The Scaffolding Problem
• Given: contigs, paired reads
• Find: orientation, ordering, relative distance
• Goal: recreate true scaffolds

Paired Reads
Paired Read Construction
Paired Read Styles
• Mate pair: same strand and orientation [Figure labels: R2, R1, 2 kb]
• Paired end: different strand and orientation [Figure labels: R1, R2, 100 b, 100 b, 10 kb]

Linkage Information
Two contigs are adjacent if a read pair spans the contigs.
State (A, B, C, D):
• Depends on the orientation of the reads
• Order of contigs is arbitrary

Possible States (mate pair)
[Figure: reads R1 and R2 on contig i and contig j, 5' to 3', drawn for each of the states A, B, C, D]
Each read pair can be "consistent" with one of the four states.

Scaffolding Graph
Nodes
• Nodes are contigs
Edges
• Adjacent contigs have 4 edges (one for each state)
• Weighted by overlap with repetitive regions, e.g. for state A between contig i and contig j:
  $W_{ij}^{A} = \sum_{\text{read pairs}} \left( 1 - \frac{\#\text{bp in repeat region}}{\#\text{bp in read}} \right)$

Integer Linear Program Formulation
Variables
• Contig orientation: $S_i \in \{0,1\}$
• Adjacent contig consistency: $S_{ij} \in \{0,1\}$
• Contig pair state: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0,1\}$
Objective
• Maximize the weight of consistent pairs:
  $z = \max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$

Constraints
Pairwise orientation (these force $S_{ij} = 1$ exactly when $S_i \neq S_j$):
  $S_{ij} \le S_j + S_i$
  $S_{ij} \ge S_j - S_i$
  $S_{ij} \le 2 - S_i - S_j$
  $S_{ij} \ge S_i - S_j$
State variables:
  $2A_{ij} \le (1 - S_i) + (1 - S_j)$
  $2B_{ij} \le (1 - S_i) + S_j$
  $2C_{ij} \le S_i + (1 - S_j)$
  $2D_{ij} \le S_i + S_j$
Mutual exclusivity:
  $A_{ij} + D_{ij} \le 1 - S_{ij}$
  $B_{ij} + C_{ij} \le S_{ij}$
Forbid 2-cycles:
  $B_{ij} + C_{ij} \le S_{ij}$
  $A_{ij} + D_{ij} \le 1 - S_{ij}$
Forbid 3-cycles:
  $A_{ij} + A_{jk} + A_{ki} \le 2$
  $B_{ij} + C_{jk} + A_{ki} \le 2$
  $C_{ij} + A_{jk} + B_{ki} \le 2$
  $D_{ij} + C_{jk} + B_{ki} \le 2$
  $A_{ij} + B_{jk} + C_{ki} \le 2$
  $B_{ij} + D_{jk} + C_{ki} \le 2$
  $C_{ij} + B_{jk} + D_{ki} \le 2$
  $D_{ij} + D_{jk} + D_{ki} \le 2$
* Larger cycles are broken at the end.

Largest Connected Component
[Figure: largest connected component of the scaffolding graph]

Graph Decomposition: Articulation Points
[Figure: the graph is split at an articulation point and each part is solved separately]
MIP, Salmela 2011

Largest Biconnected Component
[Figure: largest biconnected component of the scaffolding graph]

Non-Serial Dynamic Programming
A technique that exploits the sparsity of the scaffolding graph by computing the solution in stages, each stage incorporating the results of previous stages. Inspired by (Neumaier, 06).
[Figure: a component attached to the rest of the graph through a 2-cut is solved once for each orientation assignment (+/-) of the two cut contigs, giving the values $z^A$, $z^B$, $z^C$, $z^D$]
Objective modification:
  $z^A + \frac{z^D + z^C - z^B - z^A}{2} S_i + \frac{z^D - z^C + z^B - z^A}{2} S_j + \frac{-z^D + z^C + z^B - z^A}{2} S_{ij}$
With $S_{ij} = 1$ exactly when $S_i \neq S_j$, this expression equals $z^A$, $z^B$, $z^C$, $z^D$ at $(S_i, S_j) = (0,0), (0,1), (1,0), (1,1)$ respectively.

SPQR-tree Based Implementation
• SPQR-tree efficiently finds 2-cuts (Tarjan, 73)
• DFS of the SPQR-tree is used to schedule the elimination order for NSDP

Post Processing ILP Solution
ILP solution [Figure: example graph on contigs A to F]
• May have cycles
• Not a total ordering for each connected component
Bipartite matching [Figure: outgoing copies of A to F matched against incoming copies of A to F]
Objectives:
• Max weight
• Max cardinality
• Max cardinality / max weight

GAGE Framework
  Genome                     Size (Mb)    # reads
  Staphylococcus aureus      2.9          3,494,070
  Rhodobacter sphaeroides    4.6          2,050,868
  Human Chr14                107          22,669,408
Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet
Scaffolded using: SILP (our method), Opera, MIP, Bambus2

Testing Metrics
TPN50
• Break each scaffold at incorrect edges, then find the N50
• N50: the contig size such that contigs of at least this size contain 50% of the assembled bases
Binary classification
• Given n contigs in a scaffold, how many of the n-1 adjacencies can you predict?
• PPV
• Sensitivity
• MCC

Results: Scaffolding TPN50
[Bar chart: TPN50 (bp, axis 0 to 450,000) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Results: PPV
[Bar chart: PPV (%) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Results: Sensitivity
[Bar chart: sensitivity (%) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Results: Matthews Correlation Coefficient
[Bar chart: MCC (%) by genome (staph, rhodo, chr14) for silp, opera, mip, bambus2]

Conclusions
Success
• ILP solves the scaffolding problem!
• NSDP works
Improvements
• Include SOAPdenovo and Allpaths-LG scaffolds in the comparison
• Look at parameter effects
• Practical considerations (read style, multiple libraries, merging contigs)
Future work
• Where else can I apply NSDP?
• Scaffold before assembly … promising
• Structural variation?

Questions?
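
The per-edge constraints on the Constraints slides (pairwise orientation, state variables, mutual exclusivity) can be sanity-checked by brute force on a single edge. The sketch below is only an illustration, not the SILP implementation: the function names `feasible` and `best_state` are hypothetical, and a real instance would go to an ILP solver rather than enumeration. It shows that the constraints tie $S_{ij}$ to the two orientations and allow at most one state variable per edge.

```python
from itertools import product


def feasible(s_i, s_j, s_ij, a, b, c, d):
    """Check one 0/1 assignment against the per-edge constraints from the slides."""
    pairwise = (
        s_ij <= s_j + s_i
        and s_ij >= s_j - s_i
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_i - s_j
    )
    state = (
        2 * a <= (1 - s_i) + (1 - s_j)
        and 2 * b <= (1 - s_i) + s_j
        and 2 * c <= s_i + (1 - s_j)
        and 2 * d <= s_i + s_j
    )
    exclusive = a + d <= 1 - s_ij and b + c <= s_ij
    return pairwise and state and exclusive


def best_state(weights):
    """Brute-force a single-edge instance: maximize W^A*A + W^B*B + W^C*C + W^D*D.

    `weights` is (W^A, W^B, W^C, W^D). Returns (value, (S_i, S_j), (A, B, C, D)).
    """
    best = None
    for s_i, s_j, s_ij, a, b, c, d in product((0, 1), repeat=7):
        if not feasible(s_i, s_j, s_ij, a, b, c, d):
            continue
        value = sum(w * x for w, x in zip(weights, (a, b, c, d)))
        if best is None or value > best[0]:
            best = (value, (s_i, s_j), (a, b, c, d))
    return best


if __name__ == "__main__":
    # Made-up weights for illustration only.
    print(best_state((1.0, 5.0, 2.0, 3.0)))
```

Enumeration is exponential in the number of variables, so this only serves to inspect the constraint logic on one edge; the 3-cycle constraints are not exercised here because they involve three contigs.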
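
The objective modification on the Non-Serial Dynamic Programming slides replaces an eliminated component by a term that is linear in $S_i$, $S_j$ and $S_{ij}$. A small check, assuming the state convention of the state-variable constraints (A, B, C, D correspond to $(S_i, S_j) = (0,0), (0,1), (1,0), (1,1)$) and coefficients chosen to interpolate the four values, is sketched below. The function name `modified_objective` and the sample values are hypothetical.

```python
from fractions import Fraction
from itertools import product


def modified_objective(z_a, z_b, z_c, z_d, s_i, s_j):
    """Evaluate the linear replacement term for one orientation of the cut pair."""
    s_ij = s_i ^ s_j  # S_ij is 1 exactly when the two orientations differ
    half = Fraction(1, 2)
    return (
        z_a
        + half * (z_d + z_c - z_b - z_a) * s_i
        + half * (z_d - z_c + z_b - z_a) * s_j
        + half * (-z_d + z_c + z_b - z_a) * s_ij
    )


if __name__ == "__main__":
    # Made-up component optima for the four orientation assignments.
    z = {"A": 7, "B": 3, "C": 11, "D": 5}
    for s_i, s_j in product((0, 1), repeat=2):
        print((s_i, s_j), modified_objective(z["A"], z["B"], z["C"], z["D"], s_i, s_j))
```

Because the term reproduces each of the four component optima exactly, the eliminated component can be dropped from the graph without changing the optimum of the remaining problem, which is the property the staged computation relies on.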
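
The testing metrics can be illustrated with a short sketch. It assumes that a scaffold is given as a list of contig lengths plus one correct/incorrect flag per adjacency, that gap sizes are ignored when measuring a piece, and that the adjacency counts (TP, FP, FN, TN) have already been obtained by comparison with a reference. All names here are hypothetical and the evaluation scripts used for the talk may differ.

```python
from math import sqrt


def n50(lengths):
    """Largest length L such that pieces of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tpn50(scaffolds):
    """Break each scaffold at incorrect adjacencies, then take the N50 of the pieces.

    `scaffolds` is a list of (contig_lengths, edge_ok) pairs, where edge_ok[k]
    says whether the adjacency between contig k and contig k+1 is correct.
    """
    pieces = []
    for contig_lengths, edge_ok in scaffolds:
        current = contig_lengths[0]
        for length, ok in zip(contig_lengths[1:], edge_ok):
            if ok:
                current += length
            else:
                pieces.append(current)
                current = length
        pieces.append(current)
    return n50(pieces)


def classification_metrics(tp, fp, fn, tn):
    """Return (PPV, sensitivity, MCC); a zero denominator yields 0.0."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


if __name__ == "__main__":
    # Made-up scaffolds: the second adjacency of the first scaffold is wrong.
    example = [([100, 200, 300], [True, False]), ([50, 50], [True])]
    print(tpn50(example))
    print(classification_metrics(tp=8, fp=2, fn=1, tn=9))
```

Breaking at incorrect edges penalizes scaffolders that gain length through wrong joins: in the example, the unbroken scaffolds have lengths 600 and 100, while the broken pieces have lengths 300, 300 and 100.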