MGM Workshop Assembly Tutorial Alicia Clum DOE Joint Genome Institute, Walnut Creek, CA May 14, 2012 Contents 1. Vocabulary introduction 2. Introduction to short-read genome sequencing and assembly 3. Practical experience of short read genome assembly 4. Improving genome assembly using 3rd generation sequencing Contents 1. Vocabulary introduction 2. Introduction to short-read genome sequencing and assembly 3. Practical experience of short read genome assembly 4. Improving genome assembly using 3rd generation sequencing Vocabulary • Fragment library: a short insert (270bp) library with overlapping ends. Aka std library • Long insert library: A 4-8kb library where only 100 bp on each end are sequenced. Aka CLIP, mate pair library • Contig: A contiguous sequence of DNA • Scaffold: One or more contigs linked together by unknown sequence • Captured gap: A gap within a scaffold. The order and orientation of the contigs spanning the gap is known A B C D E Contents 2. Introduction to short-read genome sequencing and assembly • • • Short read sequencing and assembly basics Short read assembly - De Bruijn graph example Short read assembly – Scaffolding Why sequence genomes using short reads? Mb/day Cost / Mb Read length 20,000 650bp 450bp $400 1 Sanger $15 1,000 150bp 454 Illumina HiSeq Traditional genome sequencing technology $0.1 Short-read $$! - We have to figure out how to sequence microbial genomes using only illumina data Short read genome sequencing Genomic DNA Random fragmentation 4-8 kb fragments 270 bp fragments molecular biology Paired-end short insert reads (10’s millions) Sequencing (Illumina) Paired-end long insert reads (10’s millions) How do we assemble this data back into a genome? Assembly outline Reads Contigs Assembly algorithms e.g. Allpaths, Velvet, Meraculous Scaffolds Assembly outline Reads Contigs Scaffolds ‘De Bruijn’ assembly Contents 2. Introduction to short-read genome sequencing and assembly • • • Short read sequencing and assembly basics Short read assembly - De Bruijn graph example Short read assembly – Scaffolding De Bruijn example “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “ Dickens, Charles. A Tale of Two Cities. 1859. London: Chapman Hall Example courtesy of J. Leipzig 2010 De Bruijn example itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness… Generate random ‘reads’ How do we assemble? fincreduli geoffoolis Itwasthebe Itwasthebe geofwisdom itwastheep epochofinc timesitwas stheepocho nessitwast wastheageo theepochof stheepocho hofincredu estoftimes eoffoolish lishnessit hofbeliefi pochofincr itwasthewo twastheage toftimesit domitwasth ochofbelie eepochofbe eepochofbe astheworst chofincred theageofwi iefitwasth ssitwasthe astheepoch efitwasthe wisdomitwa ageoffooli twasthewor ochofbelie sdomitwast sitwasthea eepochofbe ffoolishne eofwisdomi hebestofti stheageoff twastheepo eworstofti stoftimesi theepochof esitwasthe heepochofi theepochof sdomitwast astheworst rstoftimes worstoftim stheepocho geoffoolis ffoolishne timesitwas lishnessit stheageoff eworstofti orstoftime fwisdomitw wastheageo heageofwis incredulit ishnessitw twastheepo wasthewors astheepoch heworstoft ofbeliefit wastheageo heepochofi pochofincr heageofwis stheageofw fincreduli astheageof wisdomitwa wastheageo astheepoch olishnessi astheepoch itwastheep twastheage wisdomitwa fbeliefitw bestoftime epochofbel theepochof sthebestof lishnessit hofbeliefi Itwasthebe ishnessitw sitwasthew ageofwisdo twastheage esitwasthe twastheage shnessitwa fincreduli fbeliefitw theepochof mesitwasth domitwasth ochofbelie heageofwis oftimesitw stheepocho bestoftime twastheage foolishnes ftimesitwa thebestoft itwastheag theepochof itwasthewo ofbeliefit bestoftime mitwasthea imesitwast timesitwas orstoftime estoftimes twasthebes stoftimesi sdomitwast wisdomitwa theworstof astheworst sitwasthew theageoffo eepochofbe …etc. to 10’s of millions of reads Traditional all-vs-all assemblers fail due to immense computational resources (scales with number of reads2) A million (106 ) reads requires a trillion (1012) pairwise alignments De Bruijn solution: Represent the data as a graph (scales with genome size) De Bruijn example Step 1: Convert reads into “Kmers” Kmer: a substring of defined length Reads: theageofwi Kmers : the (k=3) hea sthebestof astheageof worstoftim imesitwast sth ast the eag sth heb geo eof ofw fwi sto tof esi sto eag est mes rst hea bes ime ors the ebe age wor sit tof age geo eof oft fti tim itw twa was ast …..etc for all reads in the dataset De Bruijn example Step 2: Build a De-Bruijn graph from the kmers ast ast sth sth the the the hea hea eag eag age age geo geo eof eof ofw heb ebe bes est sto sto tof tof wor ors rst oft fti fwi tim ime was mes esi twa itw sit …..etc for all ‘kmers’ in the dataset De Bruijn example Step 3: Simplify the graph as much as possible: A De Bruijn Graph De Bruijn assemblies ‘broken’ by repeats longer than kmer “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity,.... “ Drawback of De Bruijn approach Step 4: Dump graph into consensus (fasta) No single solution! Break graph to produce final assembly Kmer size is an important parameter in De Bruijn assembly The final assembly (k=3) wor times incredulity foolishness itwasthe age epoch be st of wisdom belief Repeat with a longer “kmer” length A better assembly (k=20) itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis… Why not always use longest ‘k’ possible? Sequencing errors: k=3 sth the ebe ent tof heb ben nto sthebentof k=10 sthebentof Mostly unaffected kmers 100% wrong kmer Contents 2. Introduction to short-read genome sequencing and assembly • • • Short read sequencing and assembly basics Short read assembly - De Bruijn graph example Short read assembly – Scaffolding Scaffolding Reads Contigs ‘De Bruijn’ assembly Align reads to DeBruijn contigs Join contigs using evidence from paired end data Scaffolds (An assembly) “Captured” gaps caused by repeats. Represented by “NNN” in assembly Contents 1. Vocabulary introduction 2. Introduction to short-read genome sequencing and assembly 3. Practical experience of short read genome assembly 4. Improving genome assembly using 3rd generation sequencing Real life assembly is messy! Assembly in theory Uniform coverage, no errors, no contamination Assembly in reality Contaminant reads (-> incorrect + inflated assembly) Chimeric reads (->mis-joins) Sequencing errors (-> fragmented assembly) * * Biased* *coverage (->gaps) * * * * Worse than predicted assemblies! Coverage (x) Fraction of normalized coverage Real life assembly is messy! Reference position (bp) Theoretical GC% of 100 base windows Genome properties can also make assembly difficult High repeat content Polyploidy r r r r a a’ r RESULT: misassemblies / collapsed assemblies Biased sequence composition RESULT: fragmented assembly Biased sequence abundance ACTGTCTAGTCAGCGCGC GCGCGCGCGCCCGCGCG CGCGGGCGGCGGCGCGG GCGGGCGCATGTAGTGAT C RESULT: incomplete / fragmented assembly RESULT: Incomplete / fragmented assembly How do we get a good assembly? Key features of the JGI microbe sequencing and assembly pipeline Genomic DNA Fragment Longer inserts Improved scaffolding 270bp Overlapping pairedend reads for short insert library Allows for long ‘kmer’ Sequence 2 x 150bp Extensive data QC: -Remove artifacts -Remove contaminants -Reorder libraries 4-8 kb 2 x 100 bp QC Assembler Best Assemblies Illumina V3 sequencing chemistry Reduced GC bias AllPaths LG assembler Most complete + most accurate assembler in our hands Internal error correction Benefits of ALLPATHS • Internal error correction of all data types • Simplifies the graph by removing kmers at low coverage caused by errors velvet assembly results for error corrected standard data 4000 number of contigs 3500 3000 2500 4084251 2000 4092796 1500 1000 500 0 allpaths-lg control-no error correction msr-ca error correction method • Given the correct input data, a good assembly can be produced with default options • Overlaps the fragment data to use a longer kmer size • Produces accurate and highly contiguous assemblies Simulated De Bruijn assembly for six ‘known’microbial genomes 100 90 80 70 60 50 40 30 20 10 0 97% Bacteria name: A. haemolyticum B. Murdochii C. Flavigena S. Smaragdinae H. Turkmenica C. Woesei Kmer = 150 A small fraction of the genome remains ‘unassemblable’ 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 % Genome in unique ‘kmers’ What fraction of a genome should we be able to assemble, that is, can be represented in unique kmers? 'Kmer' length (bp) Kmer =30, most of the genome CAN be assembled Matt Blow How fragmented are short read assemblies? Goal: genome in 1 fragment per replicon Short insert library only S. Smaragdinae H. Turkmenica (7 replicons) C. Woesei C. Flavigena B. Murdochii Assemblies from simulated data 0 50 100 150 Number of scaffolds Assemblies using only short insert sequencing libraries are acceptable (<100 scaffolds) How good are short read assemblies? Goal: genome in 1 fragment per replicon S. Smaragdinae Assemblies from real data H. Turkmenica Assemblies from simulated data (7 replicons) Short insert library only C. Woesei C. Flavigena B. Murdochii 0 50 100 150 Number of scaffolds Assemblies using only short insert sequencing libraries are acceptable (<100 scaffolds) How good are short read assemblies? Goal: genome in 1 fragment per replicon S. Smaragdinae Assemblies from real data H. Turkmenica Assemblies from simulated data (7 replicons) Short + long insert libraries C. Woesei C. Flavigena B. Murdochii 0 5 10 Number of scaffolds Assemblies using short and long insert sequencing libraries are very good (often a single fragment!) Comparison of assembly results to reference genome annotation Short insert library only Short + long insert libraries Average number of fragments Average % known genes identified 85 97.3% 3 97.4% Near complete and accurate assemblies are now possible using only short read data. But problems remain: Reads Contigs ‘De Bruijn’ assembly Align reads to DeBruijn contigs Join contigs using evidence from paired end data An assembly “Captured” gaps caused by repeats. Represented by “NNN” in assembly Contents 1. Vocabulary introduction 2. Introduction to short-read genome sequencing and assembly 3. Practical experience of short read genome assembly 4. Improving genome assembly using 3rd generation sequencing Pacific Biosciences Sequencer Long reads from “3rd generation” Pacific Biosciences sequencer hold promise for improving short-read based assemblies Late April 2012 14,000 12,000 10,000 8,000 6,000 4,000 2,000 Mean read length = 2743 bp Mean read length = 1080bp Maximum read length >4kb 1 2 3 4 5 Read length (Kb) Low coverage bias Reduced sensitivity to G+C rich regions compared to illumina chemistry 6 7 Fraction of normalized coverage Number of reads “Early” data High error rate Up to 15% error rate In-del errors GC% of 100 base windows PacBio pilot study For each of 20 Microbes: Genomic DNA + 2 x 150bp 270bp insert + 2 x 100bp 8kb insert (Illumina HiSeq, V3 chemistry, 2x150b) Assembly 1 Assembly 2 Assembler: AllPaths V39750 (PacBio enabled) “1000”bp (PacBio) 3rd Generation sequence data improves genome assembly Most improved genome: 53 / 71 (75%) gaps closed Number of ‘captured’ gaps in assembly 100 90 80 70 60 50 40 30 20 10 0 Assembled using: Illumina only Illumina + PacBio Least improved genomes (.. but started out in good shape) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Microbe (sorted by number of gaps closed) PacBio assembly improvement is greatest for genomes with high GC content 90 80 80 LEAST improved genomes average 56% G+C 75 70 70 65 60 60 50 55 40 50 30 45 20 40 10 35 0 Conclusion: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 rd 3 generation sequencing promising for better Microbe (sorted is by a number of gaps tool closed) assemblies, especially for G+C rich genomes. 30 Assembly G+C content Number of ‘captured’ gaps in assembly 100 MOST improved genomes average 66% G+C Summary • High quality genome sequencing using only short-reads is within reach • Short-read microbial genomes assemblies are minimally fragmented and contain the vast majority known genes • Third-generation sequencing may provide an inexpensive path to finished genomes END Assembly Q+A Alex Copeland (Assembly group lead) James Han (QC group lead) Alicia Clum (Analyst, Genome assembly) Metagenome Assembly Cow rumen Acid mine 1 10 100 Soil 1000 10000 Species per metagenome -All challenges of isolate genome assembly remain -Extra challenges from diversity and different abundance of constituent genomes - The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes Road map for Long Mate Pair (LMP) library Improvement & Development Fragmentation column Hydroshear Circularization Cre-Lox Covaris Transposon Hybridization Ligation Nick translation shearing Purification Electro-elution Clean up efficiency Paired tag generation RE enzyme AmplificationPCR cycle #, specificity, choice of enzyme --- 2-D Electrophoresis CLIP LFPE Illumina De Bruijn example The final assembly (k=3) wor times incredulity foolishness itwasthe age epoch be of st wisdom belief Repeat with a longer “kmer” length A better assembly (k=20) itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolis… Why not always use longest ‘k’ possible? Sequencing errors: k=3 sth the ebe ent tof heb ben nto sthebentof k=10 sthebentof Mostly unaffected kmers 100% wrong kmer Short read assemblies are improving over time 100 80 60 40 20 0 80 100 60 80 40 60 20 40 Longest contig (Kb) …resulting in better microbial genome assemblies 150 40 30 100 Average results from suboptimal “QC” assemblies 50 20 10 Q4 2009 (n = 64) Q1 2010 (n = 90) Q2 2010 (n = 31) Q3 2010 (n = 68) Q4 2010 (n = 94) High quality reads (%) Read number (millions) But illumina sequence quantity and quality is increasing… Genome GC (%) 10 8 6 4 2 0 Q1 2011 (n = 43) contig N50 (Kb) Genome size (Mb) Sequenced genome properties remain constant 100 90 80 70 60 50 40 30 20 10 0 97% Bacteria name: A. haemolyticum B. Murdochii C. Flavigena S. Smaragdinae H. Turkmenica C. Woesei Kmer = 150 A small fraction of the genome remains non-unique 0 10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 % Genome in unique ‘kmers’ Q1. What fraction of kmers are unique in a typical microbial genome? (6 known microbe genomes, 15 kmer lengths) 'Kmer' length (bp) Kmer = 30, most of the genome is in unique kmers Majority of genome contained in unique fragments Predicted number of fragments in the assembly Q2. How do non-unique kmers fragment the assembly? 250 200 150 100 50 0 A. haemolyticum B. Murdochii C. Flavigena C. Woesei H. Turkmenica Microbe Assemblies will be fragmented! How do we improve this? S. Smaragdinae Metagenome assembly is an ongoing challenge -All challenges of isolate genome assembly remain -Extra challenges from diversity and different abundance of constituent genomes - The same strategies as isolate assembly can be used, but many heuristics fail for metagenomes Useful Reviews • • Miller JR, Koren S, Sutton G. , Assembly algorithms for nextgeneration sequencing data. Genomics. 2010 Jun;95(6):315-27. Mihai Pop, Genome assembly reborn: recent computational challenges. Brief Bioinform (2009) 10 (4):354-366. Illumina data quality Syntrophorhabdus aromaticivorans Genome Properties PASS Read Quality Library Quality Run Quality Illumina data quality Opitutaceae bacterium TAV2 Genome Properties FAIL Read Quality Library Quality Run Quality Metagenomes are harder to assemble Velvet gaps Fibrobacter succinogenes & Ignisphaera aggregans velvet gap size distribution of aligned contig shreds 120 60 4085750 std (trimmed to Q20) 4085750 jumping (trimmed to Q20) 4085750 std trimmed + jumping trimmed 4086221 std 40 4086221 jumping 20 4086221 std + jumping 100 number of gaps 80 0 < 100 100-999 1000-1999 2000-2999 3000-3999 Gap size (bp) > 4000 Features of Assemblers Algorithm Feature Read features Removal of erroneous reads Base error correction Graph construction Graph reduction Base substitutions Homopolymer miscount Concentrated error in 3′ end Flow space Based on K-mer frequencies Based on K-mer freq and QV For multiple values of K By alignment to other reads Based on K-mer frequencies Based on Kmer freq and QV Based on alignments OLC Assemblers DBG Assemblers Euler, AllPaths, SOAP CABOG Euler Newbler Euler, Velvet, AllPaths AllPaths AllPaths CABOG Euler, SOAP AllPaths CABOG Reads as graph nodes K-mers as graph nodes Simple paths as graph nodes CABOG, Newbler, Edena Collapse simple paths Erosion of spurs Bubble smoothing Bubble detection Reads separate tangled paths Break at low coverage Break at high coverage High coverage indicates repeat CABOG, Newbler CABOG, Edena Edena Euler, Velvet, ABySS, SOAP AllPaths CABOG CABOG Euler, Velvet, SOAP Euler, Velvet, AllPaths, SOAP Euler, Velvet, SOAP AllPaths Euler, SOAP Velvet, SOAP Euler Velvet Graph partitions Partition by K-mers Partition by scaffolds ABySS AllPaths Mate pairs Constrain path searches Guide path selection Merge contigs or fill gaps Transitive link reduction Detect, avoid repeat contigs Create scaffolds Euler, Velvet, AllPaths Euler, Allpaths Velvet, ABySS, SOAP SOAP Velvet, SOAP Euler, Velvet, AllPaths, SOAP CABOG, Shorty CABOG CABOG CABOG, Shorty J.R.Miller et al. Genomics 95 (2010) Greedy overlap CORRECT INCORRECT (A) Overlap between two read (agreement within overlapping region need not be perfect); (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D; (C) Incorrect assembly produced by the greedy approach. (D) Disagreement between two reads (thin lines) that could extend a contig (thick line), indicating a potential repeat boundary. Contig extension must be terminated in order to avoid misassembly. Pop M Brief Bioinform 2009;10:354-366 Overlap-Layout-Consensus (OLC) Overlap graph of a genome containing a two-copy repeat (B). Comparing Overlap and de Bruijn Graphs Schatz et al., Genome Res. (2010) Iterative kmer evaluation: IDBA Y Peng, et al. IDBA - A Practical Iterative de Bruijn Graph De Novo Assembler (2010) Jeremy Leipzig (jerdemo.blogspot.com/2009/11/using-vmatch-to-combine-assemblies.html N50 The N50 size of a set of entities (e.g., contigs or scaffolds) represents the largest entity E such that at least half of the total size of the entities is contained in entities larger than E. For example, given a collection of contigs with sizes 7, 4, 3, 2, 2, 1, and 1 kb (total size = 20kbp), the N50 length is 4 because we can cover 10 kb with contigs bigger than 4kb. http://www.cbcb.umd.edu/research/castats.shtml) ( N50 length is the length ‘x’ such that 50% of the sequence is contained in contigs of length x or greater. (Waterston http://www.pnas.org/cgi/reprint/100/6/3022.pdf) Theoretical performance Assessing performance of a range of read lengths Repeat-induced gaps Cahill et al., PLoS ONE (2010) Supplement: Assembler flowcharts Phrap Bamidele-Abegunde T. 2010 http://library2.usask.ca CAP3 & PCAP Bamidele-Abegunde T. 2010 http://library2.usask.ca MIRA Bamidele-Abegunde T. 2010 http://library2.usask.ca Velvet Bamidele-Abegunde T. 2010 http://library2.usask.ca Supplement: Genome Improvement Typical Microbial project Sequencing Draft assembly Goals: FINISHING Completely restore genome Produce high quality consensus Annotation Public release Metagenomic assembly and Finishing The whole-genome shotgun sequencing approach was used for a number of microbial community projects, however useful quality control and assembly of these data require reassessing methods developed to handle relatively uniform sequences derived from isolate microbes. • Typically size of metagenomic sequencing project is very large • Different organisms have different coverage. Non-uniform sequence coverage results in significant under- and over-representation of certain community members • Low coverage for the majority of organisms in highly complex communities leads to poor (if any) assemblies • Chimerical contigs produced by co-assembly of sequencing reads originating from different species. • Genome rearrangements and the presence of mobile genetic elements (phages, transposons) in closely related organisms further complicate assembly. • No assemblers developed for metagenomic data sets QC: Annotation of poor quality sequence To avoid this: make sure you use high quality sequence choose proper assembler De Bruijn graph for binary code 0000110010111101 K=4 Follow the blue numbered edges to resolve the graph Compeau et al., Nature Biotechnology (2011) Hamiltonian vs Eulerian cycle Compeau et al., Nature Biotechnology (2011) Evaluating Assemblers • When evaluating assemblers we use reference microbial datasets and assess Query • Number of scaffolds and contigs • Performance across a range of size, GC content and repeat content • Accuracy of sequence produced • Gene content captured Reference Vocabulary Node: A point, read or kmer Edge: a line connecting two nodes Graph: a network of nodes connected by edges Node Edge Graph Compeau et al., Nature Biotechnology (2011) Vocabulary •Kmer: a substring of defined length. For the purposes of this talk a substring of the sequence read •de Bruijn graph: a graph representing overlaps between kmers Kmer (k=3) node edge Schatz et al., Genome Res. (2010)