Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
University of Connecticut*, Georgia State University

De-novo Assembly Paradigm
The Genome → Sequencing → The Reads → Assembly → The Contigs → Scaffolding → The Scaffolds

Why Scaffolding?
• Annotation
• Comparative biology
• Re-sequencing and gap filling
• Structural variation!
[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with no scaffold and with a scaffold. Biologist: "There are holes in my genes!" Label: Sanger Sequencing]

Read Pairs
Paired Read Construction
[Figure: reads R1 and R2 taken 2kb apart, same strand and orientation]
Informative Reads
• Align each read against the contigs
• Only accept uniquely mapped reads
• Use the non-unique reads later
• Both reads in a pair must map to different contigs

Linkage Information
Two contigs are adjacent if:
• A read pair spans the contigs
Possible States
• State (A, B, C, D)
• Depends on orientation of the read
• Order of contigs is arbitrary
• Each read pair can be "consistent" with one of the four states
[Figure: reads R1 and R2 placed 5’ to 3’ on contig i and contig j in each of the four states A, B, C, D]

The Scaffolding Problem
Given
• Contigs
• Paired reads
Find
• Orientation
• Ordering
• Relative Distance
Goal
• Recreate true scaffolds
Possible Objectives
• Un-weighted
  • Max number of consistent read pairs
• Weighted
  • Each state is weighted: $W_{ij}^{A}$, $W_{ij}^{B}$, $W_{ij}^{C}$, $W_{ij}^{D}$
  • Overlap with repeat
  • Deviation of expected distance
  • …

Graph Representation
Using the input we can define a scaffolding graph $G = (V, E)$:
• $V$, set of all contigs
• $E$, set of adjacent contigs $(i, j)$
This is an undirected multi-graph. Assume it is connected.

Integer Linear Program Formulation
Variables
• Contig Orientation: $S_i \in \{0, 1\}$
• Pairwise Contig Consistency: $S_{ij} \in \{0, 1\}$
• Contig Pair State: $A_{ij}, B_{ij}, C_{ij}, D_{ij} \in \{0, 1\}$
Objective
Maximize weight of consistent pairs:
$\max \sum_{(i,j) \in E} \left( W_{ij}^{A} A_{ij} + W_{ij}^{B} B_{ij} + W_{ij}^{C} C_{ij} + W_{ij}^{D} D_{ij} \right)$
Constraints
Pairwise Orientation
• $S_{ij} \le S_j + S_i$
• $S_{ij} \ge S_j - S_i$
• $S_{ij} \le 2 - S_i - S_j$
• $S_{ij} \ge S_i - S_j$
Mutual Exclusivity
• $A_{ij} + D_{ij} \le 1 - S_{ij}$
• $B_{ij} + C_{ij} \le S_{ij}$
Forbid 2 and 3 Cycles Explicitly

Graph Decomposition: Articulation Points
[Figure: graph split at an articulation point, with each part solved separately]

Graph Decomposition: 2-cuts
[Figure: graph split at a 2-cut, with +/- orientation labels on the cut contigs]

Non-Serial Dynamic Programming
• SPQR-tree to schedule decomposition
• Traverse tree using DFS
• NSDP utilizes solutions of previous stage in current stage

Largest Connected Component
[Figure]

Largest Biconnected Component
[Figure]

Largest Triconnected Component
[Figure]

Post Processing ILP Solution
ILP Solution
[Figure: directed graph on contigs A, B, C, D, E, F]
• May have cycles
• Not a total ordering for each connected component
Bipartite matching
[Figure: bipartite graph with outgoing copies of A to F on one side and incoming copies of A to F on the other]
Objectives:
• Max weight
• Max cardinality
• Max cardinality / Max weight

Testing Framework
Venter Genome

| Read Type | Total Reads | Total Bases | Avg length | Coverage |
|---|---|---|---|---|
| Sanger | 31,861,976 | 2.79E+10 | 875 | 9.930637 |
| SOLiD pairs | 4.85E+08 | 2.42E+10 | 50 | 8.623028 |

4x Assembly

| # Reads | # Bases in reads | # Contigs | # Bases in contigs | N50 |
|---|---|---|---|---|
| 112,00,000 | 1.1E+10 | 422,837 | 2.26E+09 | 7704 |

Testing Metrics
Computer Scientists
• Finding Scaffold = Binary Classification Test
• n contigs, try to predict n-1 adjacencies
• TP, FP, TN, FN, Sensitivity, PPV
Biologists (main focus)
• N50 (basically average scaffold size, ignore gaps)
• TP50: break scaffold at incorrect edges, then find N50

Results

| test case | method | bundle size | sensitivity | ppv | N50 | TP50 |
|---|---|---|---|---|---|---|
| 10% | opera | 2 | 81.13% | 99.26% | 27,567 | 27,327 |
| 10% | mip | 2 | 59.01% | 98.94% | 19,988 | 19,755 |
| 10% | ilp | 1 | 79.86% | 98.58% | 26,814 | 26,459 |
| 25% | opera | 2 | 80.44% | 98.27% | 27,296 | 26,849 |
| 25% | mip | 2 | 58.95% | 97.56% | 19,842 | 19,518 |
| 25% | ilp | 1 | 79.30% | 96.93% | 26,684 | 26,079 |
| 100% | opera | 3 | pending | … | … | … |
| 100% | mip | 3 | failed | n/a | n/a | n/a |
| 100% | ilp | 1 | 68.25% | 89.90% | 20,538 | 19,006 |

Conclusions
Success
• ILP solves scaffolding problem!
• NSDP works.
Improvements
• Finalize large test cases (then publish?!)
• Practical considerations (read style, multi-libraries, merge contigs)
Future Work
• Where else can I apply NSDP?
• Scaffold before assembly??
• Structural Variation??

Questions?
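
The pairwise orientation and mutual exclusivity constraints from the ILP formulation slide can be checked by enumeration. The sketch below is not the authors' solver: it brute-forces a three-contig toy instance instead of calling an ILP package, it leaves out the constraints that forbid 2 and 3 cycles, and it assumes non-negative weights. The names `feasible_sij`, `allowed_states`, `solve_toy` and `toy_weights` are hypothetical, and the weights are made up for illustration.

```python
from itertools import product


def feasible_sij(s_i, s_j):
    """Values of S_ij in {0, 1} that satisfy the four pairwise orientation constraints."""
    return [
        s_ij
        for s_ij in (0, 1)
        if s_ij <= s_i + s_j
        and s_ij >= s_j - s_i
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_i - s_j
    ]


def allowed_states(s_ij):
    """State variables that mutual exclusivity allows to be 1 for a given S_ij."""
    allowed = set()
    for a, b, c, d in product((0, 1), repeat=4):
        if a + d <= 1 - s_ij and b + c <= s_ij:
            allowed.update(n for n, v in zip("ABCD", (a, b, c, d)) if v)
    return allowed


def solve_toy(n_contigs, weights):
    """Enumerate every orientation vector S and keep the best objective value.

    weights maps an edge (i, j) to {"A": w, "B": w, "C": w, "D": w}.
    Assumes weights are non-negative, so choosing one state per edge is never worse.
    """
    best_score, best_orient, best_states = -1, None, None
    for orient in product((0, 1), repeat=n_contigs):
        score, states = 0, {}
        for (i, j), w in weights.items():
            (s_ij,) = feasible_sij(orient[i], orient[j])  # exactly one value is feasible
            state = max(sorted(allowed_states(s_ij)), key=w.get)
            states[(i, j)] = state
            score += w[state]
        if score > best_score:
            best_score, best_orient, best_states = score, orient, states
    return best_score, best_orient, best_states


# The four constraints leave exactly one feasible S_ij per (S_i, S_j): their XOR.
orientation_table = {
    (s_i, s_j): feasible_sij(s_i, s_j) for s_i, s_j in product((0, 1), repeat=2)
}

# Made-up weights for a three-contig example.
toy_weights = {
    (0, 1): {"A": 5, "B": 1, "C": 0, "D": 0},
    (1, 2): {"A": 0, "B": 4, "C": 2, "D": 1},
    (0, 2): {"A": 0, "B": 0, "C": 3, "D": 0},
}
best_score, best_orient, best_states = solve_toy(3, toy_weights)
print(orientation_table)
print(best_score, best_orient, best_states)
```

Enumeration is exponential in the number of contigs, so it is only useful for checking the constraint logic on tiny cases. It says nothing about how the formulation behaves at genome scale.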
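
The decomposition at articulation points relies on finding vertices whose removal disconnects the scaffolding graph. A minimal sketch using the standard DFS low-link method follows. This is an assumption about how such points might be located, not the method used in the talk. The function name and the example graph are hypothetical. Parallel edges of the multi-graph are collapsed into a plain adjacency list, since they do not affect which vertices are articulation points.

```python
def articulation_points(adj):
    """Return the articulation points of an undirected graph.

    adj maps each node to an iterable of its neighbours. Recursive DFS, so very
    large components would need an iterative version or a raised recursion limit.
    """
    disc, low, result = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # A non-root u is a cut vertex if v's subtree cannot reach above u.
                if parent is not None and low[v] >= disc[u]:
                    result.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])
        # The DFS root is a cut vertex only if it has more than one DFS child.
        if parent is None and children > 1:
            result.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return result


# Two triangles of contigs joined by the single edge c-d.
example_graph = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d", "f"],
    "f": ["d", "e"],
}
cut_vertices = articulation_points(example_graph)
print(sorted(cut_vertices))
```

In this example the graph splits at contigs c and d. Each resulting piece could then be solved on its own, which is the idea the articulation point slide illustrates.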
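
The testing metrics can be made concrete with a short sketch. It assumes the common definition of N50 (the length at which the scaffolds, sorted longest first, cover at least half of all bases) and the usual formulas sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). TP50 follows the slide's description: break each scaffold at incorrect edges, then take N50 of the pieces. The function names and the example scaffolds are hypothetical. Lengths ignore gaps, as the slide states.

```python
def n50(lengths):
    """Length L such that pieces of length >= L cover at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def scaffold_lengths(scaffolds):
    """Sum of contig lengths per scaffold, ignoring gaps."""
    return [sum(contig_lengths) for contig_lengths, _ in scaffolds]


def tp50(scaffolds):
    """Break every scaffold at its incorrect edges, then compute N50 of the pieces.

    scaffolds is a list of (contig_lengths, edge_is_correct) pairs, where
    edge_is_correct has one boolean per adjacency between consecutive contigs.
    """
    pieces = []
    for contig_lengths, edge_is_correct in scaffolds:
        current = contig_lengths[0]
        for length, ok in zip(contig_lengths[1:], edge_is_correct):
            if ok:
                current += length
            else:
                pieces.append(current)
                current = length
        pieces.append(current)
    return n50(pieces)


def sensitivity(tp, fn):
    """Fraction of true adjacencies that were predicted."""
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of predicted adjacencies that are true."""
    return tp / (tp + fp)


# Made-up example: one three-contig scaffold with a wrong second edge, one singleton.
example = [([10, 20, 30], [True, False]), ([40], [])]
example_n50 = n50(scaffold_lengths(example))
example_tp50 = tp50(example)
print(example_n50, example_tp50)
```

In the example the wrong edge lowers the statistic from an N50 of 60 to a TP50 of 30. The Results table shows the same pattern, with TP50 slightly below N50 in every completed row.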