Novel multi-platform next generation assembly methods for mammalian genomes

The Baylor College of Medicine, the Australian Government, and the University of Connecticut teamed up (under the KanGO consortium) as part of an international effort to generate a de novo genome sequence for a marsupial model species, the tammar wallaby (M. eugenii).
• The tammar wallaby is an important model organism: it is a metatherian mammal and harbors unique life history traits and genome features.
• Genome sequencing was done at several institutions, using several technologies, over several years.
• We developed a pipeline to integrate long and short read sequences and existing assemblies using well-known mapping and assembly tools such as Bowtie and Phrap.

Overview of the Data

Technology  Institution  Read Length  # Reads      # Bases
Sanger      BCM          Avg. 915 bp  9,924,136    9,088,748,105
454         AGRF, UCONN  Avg. 160 bp  1,530,592    275,951,386
Illumina    AGRF         100 bp       271,875,064  27,187,506,400
SOLiD       BCM          25 bp        710,427,490  18,471,114,740

“Paired” Read Overview
[Figure: a paired read, drawn as Read, Insert, Read]
• All reads used in reassembly were “paired”.
• Read orientation was not considered.
• The SOLiD reads had an insert size of 1.395 kb.
• The Illumina reads had insert sizes of 3 and 8 kb in roughly equal proportions.
• The 454 reads had insert sizes of 8, 12, 20, and 30 kb, with the majority split between 20 and 30 kb.

Local Assembly Pipeline
[Flowchart: Initial Assembly, Map Short Reads, Scaffold Assembly, Final Mapping, Quality Calling, Gap Reestimation, Finished Assembly]

Initial Assembly
Sanger reads were assembled using Atlas (BCM). Initial scaffolding was done using SOLiD mate pairs.

Map Short Reads
Short reads are mapped to scaffolds using Bowtie. Reads that map to multiple locations (non-unique) are not considered. There are two cases:
• Orphaned, when one mate is unmapped.
• Complete, when both mates are mapped.
Key: 454, Illumina, Sanger, unmapped read; dotted line = estimated distance

Pipeline continued

Scaffold Assembly
Feed contigs plus complete and orphaned pairs to Phrap and re-assemble.
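The two mapping cases used by the pipeline (orphaned and complete pairs) can be illustrated with a small sketch. This is not the poster's code: `classify_pair`, `PairStatus`, and the hit-count inputs are hypothetical names, and treating a non-unique mate the same as an unmapped one is an assumption based on the statement that multi-mapping reads are not considered.

```python
from enum import Enum


class PairStatus(Enum):
    COMPLETE = "complete"    # both mates usable
    ORPHANED = "orphaned"    # exactly one mate usable
    DISCARDED = "discarded"  # neither mate usable


def classify_pair(hits_mate1: int, hits_mate2: int) -> PairStatus:
    """Classify a read pair from the number of locations each mate aligns to.

    hits_* is a hypothetical per-mate alignment count (0 = unmapped).
    Assumption: a mate with more than one hit is ignored, i.e. handled
    like an unmapped mate.
    """
    usable1 = hits_mate1 == 1
    usable2 = hits_mate2 == 1
    if usable1 and usable2:
        return PairStatus.COMPLETE
    if usable1 or usable2:
        return PairStatus.ORPHANED
    return PairStatus.DISCARDED


if __name__ == "__main__":
    for hits in [(1, 1), (1, 0), (3, 1), (0, 0)]:
        print(hits, classify_pair(*hits).value)
```

In this sketch, complete and orphaned pairs are the ones that would be passed on to the scaffold assembly step together with the contigs.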
Final Mapping
Map all data to the final contigs.

Quality Calling
Map all reads and calculate a quality for each base.

Gap Reestimation
Re-estimate all gap distances in scaffolds using complete pairs mapped to different contigs.

Output: Contigs, Quality Scores, Scaffold (agp file).

Local Assembly Close-up
[Figure: screenshot with panels A, B, C]
• Screenshot taken from CodonCode Aligner, which uses Phrap to map reads.
• Red and blue denote orientation.
• A: Two contigs may be fused using short reads.
• B: A contig is extended with short reads at one end.
• C: Single-nucleotide and other small errors are corrected using the higher coverage of the short reads.
• We are confident in these changes because, even at < 10x coverage, each read is either uniquely mapped itself or has its approximate position supported by a uniquely mapped read.

Assembly Comparison and Validation

Category        Meug 1.0  Meug 1.1  Meug 1.2
Contigs (10^6)  1.21      1.17      1.101
N50 (10^3)      2.5       2.6       2.91
Bases (10^6)    2546      2536      2574
Scaffolds       616418    277711    277711
Gaps (10^6)     NA        539       614

RIKEN BACs  Total Reads  Total Bases  Recovered Reads  Recovered Bases
BAC         147312       113232882    34734            25662069
FOSMID      31250        18777544     4335             2263696
BAC         147312       113232882    33328            24555624
FOSMID      31250        18777544     4294             2241059
Labels: 1.1 (original), 1.2 (updated)

Gap Estimation Methods
• The gap between two contigs is estimated using an expectation-maximization algorithm. The steps are repeated until the estimated parameters no longer change.
• Step 1 (Maximization): compute the gap estimate (x), letting the mean insert length of the N pairs equal μ (the initial value is the library average).
• Step 2 (Sampling): given x and the lengths of the contigs, sample μ from completely mapped reads spanning gaps.

Gap Estimation Results
• When using libraries with different insert sizes and standard deviations, it is necessary to bundle the estimates.
• The following is an example of how two libraries are bundled:
• n_x = number of reads in library x, e_x = estimate from library x, s_x = standard deviation of library x.

Simulation study of EM algorithm accuracy.

ctg len \ gap  1000  2000  3000  4000  5000
10             -19   -7    2     -1    -15
100            66    87    92    88    74
200            161   188   192   188   173
500            446   492   492   487   469
800            733   795   794   784   764
1000           939   997   995   982   956
1200           1153  1196  1196  1177  1152
1500           1501  1493  1496  1469  1436

Conclusion
• This method is a viable way of improving existing draft genomes with short read technologies at limited (<10x depth) coverage.
• This method is robust and easily parallelized, so it is practical for large mammalian genomes.

Future Work
• Better results may be obtained through multiple iterations.
• Re-scaffolding of the contigs should be done between iterations.
• A contig-aware assembly algorithm could improve local assembly performance.
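As an illustration of the bundling step described under Gap Estimation Results, the sketch below combines per-library gap estimates with inverse-variance weights. The poster's own bundling formula is not reproduced in this text, so the weighting n_x / s_x^2 (from the variance of a mean, s_x^2 / n_x) is an assumption, as are the names `LibraryEstimate`, `naive_gap`, and `bundle`. The `naive_gap` helper is a simple mean-based estimate, not the EM algorithm, and all numbers are made up.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class LibraryEstimate:
    n: int    # n_x: number of pairs from library x spanning the gap
    e: float  # e_x: gap estimate from library x (bp)
    s: float  # s_x: insert-size standard deviation of library x (bp)


def naive_gap(mean_insert: float, covered: list[float]) -> float:
    """Naive per-library estimate (not the EM procedure).

    covered[i] is the number of insert bases of pair i that lie on the two
    flanking contigs; the remainder of the mean insert is taken as the gap.
    """
    return mean_insert - sum(covered) / len(covered)


def bundle(estimates: list[LibraryEstimate]) -> tuple[float, float]:
    """Combine library estimates using weights n_x / s_x**2 (assumed scheme).

    Returns the combined gap estimate and its standard error.
    """
    weights = [lib.n / lib.s ** 2 for lib in estimates]
    total = sum(weights)
    combined = sum(w * lib.e for w, lib in zip(weights, estimates)) / total
    return combined, sqrt(1.0 / total)


if __name__ == "__main__":
    # Hypothetical values for two libraries spanning the same gap.
    tight = LibraryEstimate(n=4, e=1000.0, s=100.0)
    loose = LibraryEstimate(n=16, e=1500.0, s=400.0)
    gap, stderr = bundle([tight, loose])
    print(f"bundled gap = {gap:.1f} bp, standard error = {stderr:.1f} bp")
```

With these made-up inputs the library with the smaller standard deviation dominates even though it contributes fewer pairs, which is the behavior one would want when mixing short-insert and long-insert libraries.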