Genome Assembly
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick

Outline
• Stakeholders
• Biology
• NGS Review
• Introduction to Genome Assembly
• Challenges
• Analysis pipeline/strategy
• Tool selection
• Summary (final pipeline)

Stakeholders
• CDC (Centers for Disease Control and Prevention)
• GaTech
• Immunocompromised individuals
• Consumers of seafood
• Prediction group (and subsequent groups)

Stakeholders / Biology / NGS Review / Introduction to Genome Assembly / Challenges / Analysis Pipeline-Strategy / Tool Selection / Summary

Biology…
[Image of V. vulnificus]

Vibrio vulnificus
• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Halophilic (salt-loving) organism abundant in estuarine ecosystems
• Major cause of seafood-related deaths

Vibrio vulnificus – genome architecture
• Bacterial genomes are coding-dense
  o Introns rare
• Contains plasmids (pYJ016)
• V. vulnificus ~5.2 Mbp genome (similar to E. coli, ~50%)
  o GC content: 45-47%

Vibrio navarrensis
• Gram-negative
  o Lipopolysaccharide membrane
• Motile, facultative anaerobe
• Moderately halophilic organism
• Some strains do not grow well in moderate to high salt concentrations

Vibrio navarrensis – genomic architecture

NGS – Review

Roche 454 sequencing workflow overview
• Sample input: genomic DNA, BACs, amplicons, cDNA
• Generation of small DNA fragments via shearing
• Ligation of A/B adaptors flanking single-stranded DNA fragments (One Fragment)
• Emulsification of beads and fragments in water-in-oil microreactors (One Bead)
• Clonal amplification of fragments bound to beads in microreactors
• Sequencing and base calling (One Read)
• 400,000 reads per run

GS FLX data analysis – flowgram generation
[Figure: flowgram for the sequence TTCTGCGAA. Labels: flow order (T, A, C, G); signal levels 1-mer, 2-mer, 3-mer, 4-mer]
Example of homopolymer errors from 454 sequencing data

Example of 454 sff file (text format)

Illumina sequencing overview
[Figure: workflow. Labels: 0.1 - 1.0 μg; cBot; GAIIx; user or core facility]

Example of Illumina *.fastq file
@C3PO_0001:2:1:17:1499#0/1
TGAATTCATTGACCATAACAATCATATGCATGATGCAAATTATAATATCATTTTTAGTGACGTCGTGAATCGTTT
+C3PO_0001:2:1:17:1499#0/1
abaaaaaaaaaaa`a`aa_aaaaaaaaaaaaaaaa_a aaa`aaaaa^aaaaa`a]^`a YZYZ^`NJDJ\_Z
@C3PO_0001:2:1:17:1291#0/1
TGTTTGAGCAAATGATTCATAATAATGTATTTCAATATTTTTAGGAATATCTCCCAATATTGCGCGTGCTGAATT
+C3PO_0001:2:1:17:1291#0/1
a`_`_\a_aaaa_a^Z^^a[a^aa]a_^_a_``aa `aa`X^X^^`aa_\_]VR`\a_]W\_`_a]a]][\RZV
@C3PO_0001:2:2:1452:1316#0/1
GTCCATCCGCAGCAGCGAATTTTTGACGTCCCCCCCCGAANGGANGNGANNNNGNNGNNNTNTNNAAANGNNNNN
+C3PO_0001:2:2:1452:1316#0/1
_U a\ `]_`ZP\\_Z^[]aa^a_]XNBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
…

Genome Assembly – input reads
• Species: V. navarrensis, V. vulnificus
• Strains: 2423-01, 2009V-1368, 08-2462, 06-2432, 2541-90, 08-2435, 2756-81, 08-2439, 07-2444

Introduction to genome assembly
• An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target.
• In addition to contigs, a set of unassembled or partially assembled reads is also given as an output.
• Reads
• Contigs: multiple sequence alignment of reads plus the consensus sequence
• Scaffolds: define the contig order and orientation
• Output (FASTA)

How do we check the quality of our assembly? METRICS!
• N50
• Minimum/maximum contig length
• No. of contigs
• No. of errors
• FRC (feature response curve)

Feature-by-Feature – evaluating de-novo assembly
• BREAKPOINT: Points in the assembly where leftover reads partially align;
• COMPRESSION: Area representing a possible repeat collapse;
• STRETCH: Area representing a possible repeat expansion;
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage;
• HIGH_NORMAL_CVG: Area composed of normal oriented reads but at high coverage;
• HIGH_LINKING_CVG: Area composed of reads with associated mates in another scaffold;
• HIGH_SPANNING_CVG: Area composed of reads with associated mates in another contig;
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->);
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere);
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage;
• HIGH_SNP: SNP with high coverage;
• KMER_COV: Problematic k-mer distribution.

Feature-by-Feature – evaluating de-novo assembly
• Most of the traditional metrics used to evaluate assemblies (N50, mean contig size, etc.) emphasize only size, while nothing (or almost nothing) is said about how correct the assemblies are.
• A typical such metric (especially in the NGS context) consists of aligning contigs back to an available reference. However, this naive technique simply counts the number of mis-assemblies without attempting to distinguish or categorize them any further.
• After running amosvalidate, each contig is assigned the number of features that correspond to doubtful sequences in the assembly.
• For a fixed feature threshold w, the contigs are sorted by size and, starting from the longest, only those contigs are tallied if their sum of features is ≤ w. For this set of contigs, the corresponding approximate genome coverage is computed, leading to a single point of the Feature-Response curve (FRC).

Assembly Challenges

Challenges (intrinsic and technical)
• Genome architecture
• Repeats
• Homopolymer runs
• Sequence complexity
• Chimeras?
• Contaminants
• Short reads
• Poisson distribution of coverage
• Sequencing errors
• Variable quality
• Sequence tags

Analysis pipeline / strategy
[Figure: pipeline flowchart. Labels recoverable from the diagram, grouping approximate:]
• Inputs: 454 raw reads; Illumina raw reads; published genomes from public databases
• Reference genomes: V. vulnificus CMCP6, V. vulnificus MO6-24/O, V. vulnificus YJ016
• Pre-processing, read stats, statistical analysis: FastQC, PRINSEQ, NGS QC, samstats
• 454 de novo: Newbler, CABOG, SUTTA
• Illumina de novo: ALLPATHS-LG, SOAPdenovo, Velvet, ABySS, Taipan, Bambus2, SUTTA
• Hybrid de novo: Ray, MIRA
• Parameter optimization and evaluation: GAGE, Hawkeye
• Contig merging (all possible combinations of the best 3 contigs * 3): Minimus, MAIA, MacVector, CLC wb
• Align Illumina reads against 454 contigs; unmapped reads
• Reference selection: align Illumina against the reference (bwa); compare mapping statistics; chosen ref.
• Illumina/(454?) reference-based assembly: AMOScmp
• Reference evaluation: DNA Diff
• Genome finishing: PAGIT, Mauve, GRASS; gap filling; nucleotide identity; built-in
• Outputs: contigs, scaffolds, draft/finished genome

Tool Selection – Assembly algorithm profile
Approaches covered: Greedy, Seed-and-extension, Graph based, Branch-and-Bound

Greedy
• Basic operation: given any read or contig, add one more read or contig until no more reads or contigs are available
• The contigs grow by "greedy extension", always incorporating the read that is found with the highest-scoring overlap
• Makes a locally optimal choice in the hope of finding a globally optimal one
• No foresight -> misassembly

Greedy (example)
[Figure: greedy assembly of overlapping text fragments drawn from "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, ..."]
• It was the best of times, it was the [worst/age]

Seed-and-extension
• Variation of the greedy assembler
• Common in aligners, thus some assemblers/aligners may incorporate this approach
• Particularly designed for short reads, based on a contig heuristic scheme
• Prefix-tree data structure
• A contig is elongated at either end contingent upon the existence of reads with a prefix of minimal length perfectly matching the end of the contig

Graph based – overlap-layout-consensus (OLC): pairwise consensus
• Overlap: find potentially overlapping reads
• Layout: lay out the reads based on matching alignment
• Consensus: derive the DNA sequence consensus by joining read sequences
..ACGATTACAATAGGTT..

Graph based – Hamiltonian approach
• Find an assembled sequence that explains the observed sequence = finding a path through a graph that visits every vertex once
[Figure: graph with three nodes labeled "Repeat"]

Graph based – de Bruijn graph
• Basic operation: k-mer approach
• Eulerian approach
[Figure: reads -> de Bruijn graph -> potential genomes]
• Reads: AAGA, ACTT, ACTC, ACTG, AGAG, CCGA, CGAC, CTCC, CTGG, CTTT, …
• Graph nodes: CCG, TCC, CGA, AAG, AGA, CTC, GAC, ACT, GGA, CTT, TTT, CTG, GGG, TGG
• Potential genomes: AAGACTCCGACTGGGACTTT, AAGACTGGGACTCCGACTTT

Branch-and-Bound
• Basic operation: relies on "consistent layouts"; it generates all possible consistent layouts, organizing them as paths in a "double tree" structure rooted at a randomly selected seed read
• Progressive evaluation of optimality criteria encoded by a set of score functions based on the set of overlaps along the layout

Tid-bits of advice
Advantages
• Greedy: guaranteed to find a solution
• Seed-and-extension: sensitivity
• OLC: suitable for low-coverage long reads
• de Bruijn: repeats are immediately recognized; suitable for high-coverage short reads
• Branch-and-Bound: algorithm allows for checks
Disadvantages
• Greedy: misassembly
• Seed-and-extension: can be very slow, memory usage intensive
• OLC: computation of overlaps is time intensive
• de Bruijn: easily confused by complex repeats; RAM intensive
• Branch-and-Bound: ambiguities delay pruning

Tools of Choice

454 platform assembly
Name (algorithm): supporting evidence
• Newbler 2.5 (OLC): Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
• CABOG (OLC): Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data
• SUTTA (Branch-and-Bound): Feature-by-Feature – Evaluating De Novo Sequence Assembly

Evaluation of 454 assemblers
[Table: genomes used for comparison]
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform (2012) 13(3): 269-280

Comparison of 454 assemblers using E. coli genome
[Figure: bar chart of reconstruction per assembler]
• The maximum value reached by the bars is the hypothetical reconstruction (HR), defined as the ratio between the assembled bases and the reference length.
• The white section represents the real reconstruction (RR), i.e. the portion of the genome correctly reconstructed by the assemblers.
• The difference between HR and RR, here called erroneous reconstruction (ER), is shown in black.
Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform (2012) 13(3): 269-280

Illumina platform assembly
Name (algorithm): supporting evidence
• ALLPATHS-LG (de Bruijn): GAGE: A critical evaluation of genome assemblies and assembly algorithms
• Velvet (de Bruijn): Comparative studies of de novo assembly tools for next-generation sequencing technologies
• Taipan (hybrid: greedy-based and graph): A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies
• SOAPdenovo (de Bruijn): Feature-by-Feature – Evaluating De Novo Sequence Assembly
• SUTTA (Branch-and-Bound): Feature-by-Feature – Evaluating De Novo Sequence Assembly

Evaluation of Illumina assemblers
[Table: genomes used for comparison]
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Comparison of Illumina assemblers
[Table: assembly statistics per assembler]
• The best value for each column is shown in bold. For all assemblies
• The Errors column contains the number of misjoins plus indel errors >5 bp for contigs, and the total number of misjoins for scaffolds.
• Corrected N50 values were computed after correcting contigs and scaffolds by breaking them at each error. See the evaluation section in the text for details on how errors were identified.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Comparison of Illumina assemblers
• A "chaff" contig is defined as a single contig <200 bp in length. In many cases, these contigs can be as small as the k-mer size used to build the de Bruijn graph (e.g., 36 bp) and are too short to support any further genomic analysis.
• A duplicated repeat is one that appears in more copies than necessary in the assembly, and a compressed repeat is one that occurs in fewer copies.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Comparison of Illumina assemblers
• "Misjoin" errors are perhaps the most harmful type, in that they represent a significant structural error. A misjoin occurs when an assembler incorrectly joins two distant loci of the genome, which most often occurs within a repeat sequence.
• We have tallied three types of misjoins: (1) inversions, where part of a contig or scaffold is reversed with respect to the true genome; (2) relocations, or rearrangements that move a contig or scaffold within a chromosome; and (3) translocations, or rearrangements between chromosomes.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Comparison of Illumina assemblers
[Figure: N50 versus error rate, panels (A) contigs and (B) scaffolds]
• Average contig (A) and scaffold (B) sizes, measured by N50 values, versus error rates, averaged over all three genomes for which the true assembly is known: S. aureus, R. sphaeroides, and human chromosome 14.
• Errors (vertical axis) are measured as the average distance between errors, in kilobases.
• In both plots, the best assemblers appear in the upper right.
GAGE: A critical evaluation of genome assemblies and assembly algorithms. Steven L. Salzberg, Adam M. Phillippy, Aleksey Zimin, et al. Genome Res. 2012 22: 557-567

Applicability of assemblers
[Table: genomes used for comparison]
A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. PLoS One. 2011 6: 1-12

Comparison of Illumina assemblers
[Figures: assembler comparison results]
A Practical Comparison of De novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. Wenyu Zhang, et al. PLoS One. 2011 6: 1-12

Hybrid Platform Assembly
Name (algorithm): supporting evidence
• Ray (SBH): Feature-by-Feature – Evaluating De Novo Sequence Assembly

Feature-by-Feature – evaluating de-novo assembly
• COMPRESSION: Area representing a possible repeat collapse;
• LOW_GOOD_CVG: Area composed of paired reads at the right distance and with the right orientation but at low coverage;
• HIGH_OUTIE_CVG: Area composed of incorrectly oriented mates (--> -->, <-- -->);
• HIGH_SINGLEMATE_CVG: Area composed of single reads (mate not present anywhere);
• HIGH_READ_COVERAGE: Region in assembly with unexpectedly high local read coverage;
• KMER_COV: Problematic k-mer distribution.
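A minimal sketch of how a single FRC point could be computed from per-contig feature counts such as those listed above. It assumes one reading of the FRC definition given earlier: contigs are taken from the longest down while the running total of features stays at or below the threshold w, and coverage is approximated as the tallied contig length divided by an estimated genome size. The names frc_point and frc_curve, the (length, features) tuples, and the toy numbers are hypothetical; they are not part of amosvalidate or of any published FRC tool.

```python
from typing import List, Tuple

# Hypothetical representation: (contig length in bp, number of features on it)
Contig = Tuple[int, int]


def frc_point(contigs: List[Contig], threshold: int, genome_size: int) -> float:
    """Approximate genome coverage reached within a feature budget.

    Contigs are visited from longest to shortest and tallied until adding
    the next one would push the running feature total above `threshold`.
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    covered_bases = 0
    features_so_far = 0
    for length, n_features in ordered:
        if features_so_far + n_features > threshold:
            break  # feature budget exhausted, stop tallying
        features_so_far += n_features
        covered_bases += length
    return covered_bases / genome_size


def frc_curve(contigs: List[Contig], thresholds: List[int],
              genome_size: int) -> List[Tuple[int, float]]:
    """One (threshold, coverage) point per feature threshold."""
    return [(w, frc_point(contigs, w, genome_size)) for w in thresholds]


if __name__ == "__main__":
    # Toy assembly: three contigs on an assumed 1,000 bp genome.
    toy_contigs = [(500, 2), (300, 1), (200, 0)]
    for w, coverage in frc_curve(toy_contigs, [0, 2, 3], genome_size=1000):
        print(f"w={w} coverage={coverage:.2f}")
```

Under this reading, an assembly whose curve rises faster reaches a given coverage with fewer suspicious features, which is the sense in which the FRC plots compare assemblers.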
Feature-by-Feature – evaluating de-novo assembly
Real Data - Long Reads
[Figure: FRC plots]

Feature-by-Feature – evaluating de-novo assembly
Real Data - Short Reads
[Figure: FRC plots]

Final Approach
[Figure: final pipeline flowchart. Labels recoverable from the diagram, grouping approximate:]
• Inputs: 454 raw reads; Illumina raw reads; published genomes from public databases
• Reference genomes: V. vulnificus CMCP6, V. vulnificus MO6-24/O, V. vulnificus YJ016
• Pre-processing, read stats, statistical analysis: FastQC, PRINSEQ, NGS QC, samstats
• 454 de novo: Newbler, CABOG, SUTTA
• Illumina de novo: ALLPATHS-LG, SOAPdenovo, Velvet, Taipan, SUTTA
• Hybrid de novo: Ray
• Parameter optimization and evaluation: GAGE, Hawkeye
• Contig merging (all possible combinations of the best 3 contigs * 3): Minimus, MAIA, MacVector, CLC wb
• Align Illumina reads against 454 contigs; unmapped reads
• Reference selection: align Illumina against the reference (bwa); compare mapping statistics; chosen ref.
• Illumina/(454?) reference-based assembly: AMOScmp
• Reference evaluation: MUMmer, DNA Diff
• Genome finishing: PAGIT, Mauve, GRASS; gap filling; nucleotide identity; built-in
• Outputs: contigs, scaffolds, draft/finished genome

References
1. Finotello, F., et al., Comparative analysis of algorithms for whole-genome assembly of pyrosequencing data. Brief Bioinform, 2012. 13(3): p. 269-80.
2. Vezzi, F., G. Narzisi, and B. Mishra, Feature-by-feature: evaluating de novo sequence assembly. PLoS One, 2012. 7(2): p. e31002.
3. Zhang, W., et al., A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies. PLoS One, 2011. 6(3): p. e17915.
4. Salzberg, S.L., et al., GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res, 2012. 22(3): p. 557-67.
5. Narzisi, G. and B. Mishra, Comparing de novo genome assembly: the long and short of it. PLoS One, 2011. 6(4): p. e19175.
6. Miller, J.R., S. Koren, and G. Sutton, Assembly algorithms for next-generation sequencing data. Genomics, 2010. 95(6): p. 315-27.
7. Li, Z., et al., Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genomics, 2012. 11(1): p. 25-37.
8. Lin, Y., et al., Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics, 2011. 27(15): p. 2031-7.
9. Zhang, J., et al., The impact of next-generation sequencing on genomics. J Genet Genomics, 2011. 38(3): p. 95-109.