Assembling Genomes from Next-Generation Sequencers

Steven Salzberg
Center for Bioinformatics and Computational Biology
University of Maryland Institute for Advanced Computer Studies
http://cbcb.umd.edu

Solexa sequencing

What can we do with next-gen sequencers?
1. Assembling genomes from very short reads (part 1)
2. Mapping millions of reads to the human genome (part 2)

Assemble a bacterial genome entirely from Solexa reads
• Target: a novel strain of Pseudomonas aeruginosa isolated from a frostbite patient
• Every read exactly 33 bp long
• 8,627,900 reads generated
  – approximately 41X coverage
  – just 1/4 of a single Solexa run

Assembly strategy
• Throw every trick in the book at it
• Use related genomes
  – 3 finished strains available
• Use de novo assemblers
• New gene-boosting assembly method

Pseudomonas aeruginosa
• A leading cause of hospital-acquired infections, especially of the lungs
• Leading cause of infections in cystic fibrosis patients
• Large (~6.5 Mbp) bacterial genome
  – high GC content: 66%

Comparative assembly
• AMOScmp assembles a genome using a related species
• Fast, accurate assembly
• http://amos.sourceforge.net

Comparative assembly using multiple genomes
[Figure labels: Target genome; Reference genome A; Comparative assembly A; Reference genome B; Comparative assembly B; Divergent regions X, Y, Z]
[Figure labels: Assembly A; Assembly B; Merge; Merged assembly]

AMOScmp assembly    Contigs    Contigs >200bp    Max contig
PA14 reference      2053       428               170,485
PAO1 reference      2797       865               75,626
Merged              1850       306               236,472

De novo assembly
• Several new methods available
• Short reads require long overlaps
  – e.g., 33 bp reads must overlap by 20 bp
  – end-trimming helps

De novo assembly strategies
• SSAKE (Warren et al., 2007)
  – Uses DNA prefix tree to find k-mer matches
• Edena (Hernandez et al., 2008)
  – overlap-layout algorithm
    adapted for short reads
• Velvet (Zerbino and Birney, 2008)
  – Uses de Bruijn graph algorithm plus error correction

De novo assembler performance
• All three programs run with default parameters on the same data set
• Input: 8.6 million reads
• Platform: 64-bit Opteron, 4 CPUs, 32 GB memory

Program    Version    CPU time    Wall clock
SSAKE      3.0        2:24:59     5:08:59
Edena      2.11       0:28:31     28:58
Velvet     0.5        0:08:48     10:36

De novo assemblies

Program    # Contigs    N50 (bp)    Sum (bp)      Max contig
SSAKE      185,030      87          14,287,079    5,490
Edena      11,180       837         6,175,460     11,300
Velvet     10,684       1,184       6,841,458     16,239

Program    # Contigs >200 bp    N50 (bp)    Sum (bp)     Singletons
SSAKE      12,532               549         6,090,567    3,164,495
Edena      8,316                902         5,759,209    3,955,865
Velvet     7,382                1,252       6,474,426    1,273,164

Gene-boosted assembly
[Figure labels: Contig 1; Contig 2; Gap-spanning gene; Gap-spanning gene sequence; Translated amino acid sequence; Translated, mapped reads]

Comparative assembly using multiple genomes

Assembly strategy    Contigs    Contigs >200bp    Max contig
Merged, AMOScmp      1850       306               236,472
Gene-boosted         120        120               512,638

Note: input to the gene-boosted assembly included the 306 contigs from the merged assembly

Final assembly
• 76 contigs in one large scaffold, 6.3 Mb
  – Largest contig: 512,638 bp
• Additional 436 small contigs spanning 417 kb
• 9% of the reads unused
• 5602 protein-coding genes
  – 5568 in PAO1
  – 5892 in PA14

Challenges of next-gen sequencing
1. Assembling genomes from very short reads (part 1)
2. Mapping millions of reads to the human genome (part 2)

Short read alignment
[Figure labels: Human source; Sequencing machine]
• Reads from new sequencing machines are short: 25-50 bp
• And you get MANY of them
• Need to map them back to the human reference

Bowtie
• Ultrafast short read alignment software
  – designed for 25-63bp reads
• Same sensitivity as Maq, but 35 times faster
• Shares formats with Maq
  – compatible with Maq’s SNP caller
• Open source:
  – http://cbcb.umd.edu/software
  – http://bowtie-bio.sourceforge.net

Bowtie overview
• For each read, finds a ‘good’ hit to the reference, allowing for mismatches
  – Prefers mismatches at lower-quality bases
  – Can behave like Maq or SOAP
  – Calls SNPs using Maq interface
• Uses Burrows-Wheeler index of the reference genome
  – Pre-built genomes available
  – Can download or build your own

Why Burrows-Wheeler?
• BWT very compact:
  – Approximately ½ byte per base
  – As large as the original text, plus a few “extras”
  – Can fit onto a standard computer with 2GB of memory
• Linear-time search algorithm
  – proportional to length of query for exact matches

Burrows-Wheeler Transform (BWT)

Text: acaacg$

Burrows-Wheeler Matrix (BWM):
$acaacg
aacg$ac
acaacg$
acg$aca
caacg$a
cg$acaa
g$acaac

BWT (last column of the BWM): gc$aaac

See the suffix array?
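
The transform on the slide above can be reproduced in a few lines. What follows is a minimal Python sketch for illustration only, not Bowtie's implementation: the function names (bwt_matrix, bwt, suffix_array, count_exact) are hypothetical, and the occurrence counts are recomputed by a naive scan, whereas a practical index would precompute them so that each query character costs constant time. It assumes the text ends in '$', which sorts before the letters a, c, g, t.

```python
def bwt_matrix(text):
    """Return the sorted rotations of text (the Burrows-Wheeler Matrix)."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def bwt(text):
    """The BWT is the last column of the Burrows-Wheeler Matrix."""
    return "".join(row[-1] for row in bwt_matrix(text))


def suffix_array(text):
    """Start positions of the suffixes of text, in sorted order.

    Row i of the matrix begins with the suffix starting at suffix_array[i].
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_exact(bwt_string, query):
    """Count exact occurrences of query using backward search.

    The loop runs once per query character, which is why exact matching
    is proportional to the query length (given fast occurrence counts).
    """
    # first[c]: index of the first matrix row that starts with character c
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first = {}
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_string[:i]; naive scan kept for clarity.
        return bwt_string[:i].count(ch)

    lo, hi = 0, len(bwt_string)  # half-open range of matching rows
    for ch in reversed(query):   # extend the match one character leftwards
        if ch not in first:
            return 0
        lo = first[ch] + occ(ch, lo)
        hi = first[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    text = "acaacg$"
    print("\n".join(bwt_matrix(text)))
    print(bwt(text))                       # gc$aaac, as on the slide
    print(suffix_array(text))              # [6, 2, 0, 3, 1, 4, 5]
    print(count_exact(bwt(text), "ac"))    # 2
```

Note that the search needs only the BWT string and the per-character counts, not the full matrix, which is where the compactness claimed on the previous slide comes from.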
Handling mismatches
Read: acctagattcagaggtcaccataggcacatgcag
• Allow mismatches in one part of the read
• Don’t backtrack to positions in the other region of the read
• Flip the read and index around, then match again:
  gacgtacacggataccactggagacttagatcca
  – The same two rules apply to the flipped read, so a mismatch can fall in either part of the original read

Handling mismatches
• Bowtie uses a more complex scheme to allow for more than 1 mismatch
  – Takes a 28bp “seed” region of the read, which is assumed to be of high quality
  – Divides the seed into two parts, similar to the 1-mismatch scheme
  – Allows backtracking in each part in separate phases to avoid excessive backtracking

Bowtie speed
Alignment of 8.84 million Solexa reads from the 1000 Genomes pilot
[Figure: millions of reads per CPU hour; bars annotated 268x and 54x]

Bowtie memory requirements (less is better)
Alignment of 8.84 million Solexa reads from the 1000 Genomes pilot
[Figure: peak memory usage (megabytes)]

Bowtie sensitivity
Alignment of 8.84 million Solexa reads from the 1000 Genomes pilot

Program    Percent reads aligned
Maq        74.7
SOAP       71.6
Bowtie     75.1

>90% of reads are aligned by all 3 programs

Bowtie index construction
Building index for NCBI human reference, build 36, on a 2.4 GHz Opteron with 32GB RAM
[Figure: x-axis is maximum allowed memory (GB)]

Bowtie index construction
• Can build index for a mammalian genome on a desktop workstation in < 1 day
• Pre-built indices at CBCB:
  H. sapiens         2.1 GB
  M. musculus        1.8 GB
  D. melanogaster    118 MB
  S. cerevisiae      12 MB
  others…

Acknowledgements
• Assembly with short reads: Dan Sommer, Daniela Puiu, Vincent Lee
• Short-read alignment (Bowtie): Ben Langmead, Cole Trapnell, Mihai Pop
• Funding: NIH R01-LM06845, R01-GM083873
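
Supplementary note on the N50 column in the de novo assembly tables: N50 is the contig length L such that contigs of length L or longer contain at least half of the total assembled bases. A minimal Python sketch of that definition follows; the function name n50 and the example contig lengths are hypothetical and are not data from the talk.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


if __name__ == "__main__":
    # Illustrative lengths only (hypothetical, not from the assemblies above).
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))  # 8
```

Because N50 depends on the total length, values are only directly comparable between assemblies of similar total size, which is one reason the tables also report the sum of contig lengths.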