Robert Arthur, Kevin Lee, Xing Liu, Pushkar Pande, Gena Tang, Racchit Thapliyal, Tianjun Ye

◦ Sequencing Methods
◦ Experimental comparison of De Bruijn graph and Overlap graph assemblers
◦ Preliminary Results
◦ Lab Exercise

Sequencing Methods

Sanger Sequencing
◦ Cycle sequencing reaction
◦ ddNTP-terminated, dye-labeled products
◦ High-resolution electrophoretic separation
◦ Parallelized in 96 or 384 capillaries
◦ Read lengths up to 1 kbp
◦ Raw accuracy up to 99.999%
◦ Costs 50¢ per kb

Second Gen. Sequencing
◦ Cyclic array methods: 454, Illumina, AB SOLiD, Polonator, HeliScope
◦ Platforms vary in biochemistry and array generation, yet are conceptually similar in workflow

Illumina

Illumina continued

AB SOLiD

454 Pyrosequencing
Create a DNA library
◦ Ligate adaptors to fragments
Emulsion PCR
◦ Agarose beads
◦ Oil, water, PCR reagents
◦ Results in 1 million copies of a fragment on each bead

More 454
Beads arrayed into a picotiter plate
◦ Immobilized via addition of enzyme-containing beads
◦ Each well contains exactly 1 bead
Bst polymerase, luciferase, apyrase, and ATP sulfurylase are used

Even more 454

Example of Output
[Figure: flowgram. Flow order T, A, C, G; signal levels labeled 1-mer to 4-mer; key sequence TCAG]
◦ Measures the presence or absence of each nucleotide at any given position

Videos (454 Workflow)

Videos (Pyrosequencing)
note: we did not choose the music

Comparison of 2nd Gen Platforms

Experimental comparison of De Bruijn graph and Overlap graph assemblers

De Bruijn Graph assemblers and Overlap Graph assemblers
De Bruijn Graph assemblers
◦ Velvet, ABySS, EULER
Overlap Graph assemblers
◦ Newbler, Edena, SSAKE, VCAKE

Synthetic Data used for Experiments
Wrote a C program to simulate reads from a reference genome with a specified read length, coverage, and base error rate
◦ Human chr 22, ~33.5M bases
◦ Streptococcus suis, NC_012925.1, ~2M bases
◦ Helicobacter acinonychis Sheeba, ~1.5M bases
Wrote another C program to measure the quality of assemblers
◦ N50 length
◦ No. of contigs
◦ Max contig length
◦ No. of misassembled contigs

Read Length
De Bruijn graph assemblers are only suitable for short-read data
K limitation
◦ Use a hash table or sorting to index K-mers
◦ Need a unique key (value) to represent each K-mer
◦ K = 16: 4^16 = 2^32 <-> 32-bit integer (unsigned int)
◦ K = 32: 4^32 = 2^64 <-> 64-bit integer (unsigned long long)
◦ K > 32? <-> multiple integers are needed to represent the hash table key

Simulated reads from Streptococcus suis
Read length 300, 50X coverage, error rate 0.1%
◦ Velvet default: K <= 31, so we use 31
◦ Then recompile Velvet and use K = 99

Assembler        # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet, K = 31   46515 (1716053 bp)            115 bp       5 (1346 bp)
Velvet, K = 99   441 (1974382 bp)              15328 bp     1 (34 bp)

Quality and Accuracy
Some of the literature states that “De Bruijn based approach prone to false positives” and “Overlap graph has better quality”

Simulated reads from Helicobacter acinonychis Sheeba
Read length 35, 50X coverage, error rate 0.1%

Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      336 (1525746 bp)              10.4 kbp     17 (156637 bp)
Edena       340 (1513259 bp)              9.8 kbp      0 (0 bp)

Simulated reads from Streptococcus suis
Read length 35, 50X coverage, error rate 0.1%

Assembler   # of contigs (total length)   N50 length   # of misassembled contigs (total length)
Velvet      1106 (1969617 bp)             5266 bp      12 (255594 bp)
Edena       1003 (1970342 bp)             6416 bp      0 (0 bp)

Runtime and Memory Usage
Overlap graph based assemblers are computationally expensive and use more memory
◦ All-to-all alignment of reads, O(n^2)
◦ Use more memory to store the overlap graph
Typically, the number of reads is much larger than the number of K-mers
◦ Especially for short-read data: with the same coverage and genome length, shorter reads mean more reads
◦ It is stated that De Bruijn graphs are more suitable for NGS data (shorter reads and high throughput)

Simulated reads from Streptococcus suis
802995 reads, read length 50, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz

Assembler   Time       Memory
Velvet      33 secs    ~220 MB
SSAKE       26 mins    ~900 MB
VCAKE       107 mins   ~1.1 GB

However!
Recent advances in pattern-matching algorithms and techniques enable the use of overlap graphs
◦ Suffix tree, suffix array, prefix array, compressed suffix array
Suffix array
◦ Able to find overlaps between reads in linear time
◦ Using a compressed suffix array can significantly reduce the memory requirements of overlap graph assemblers
Examples
◦ D. Hernandez, P. François, L. Farinelli, M. Osteras, and J. Schrenzel, De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Research 18:802-809, 2008.
◦ Jared T. Simpson and Richard Durbin, Efficient construction of an assembly string graph using the FM-index. Bioinformatics (2010) 26(12):i367-i373.
◦ Pasqual: Pushkar and I have developed a parallel sequence assembler based on the overlap graph in our research project

Simulated reads from Human chr22
6978908 reads, read length 50, 20X coverage, error rate 0.1%
Xeon E5530 2.4 GHz with 4 cores/8 threads

Assembler          Time       Memory
Velvet             292 mins   ~17 GB
Edena              37 mins    ~7 GB
Pasqual            43 mins    ~8 GB
Parallel Pasqual   9 mins     ~8 GB

Mixed Length Reads
H. influenzae
◦ Read lengths from 30 to 300
Velvet does not work
◦ K is fixed
◦ If we use a big K, reads shorter than K cannot be assembled
◦ If we use a small K, it is difficult to assemble the long reads
Overlap graph assemblers do not have this issue
◦ Newbler

Conclusion
Controversial
◦ The relation between De Bruijn graphs and overlap graphs is still unclear
We can still conclude from the experiments
◦ Regarding quality and accuracy, overlap graph assemblers are thought to be better than De Bruijn graph assemblers
◦ De Bruijn graph assemblers do not work for long reads
◦ De Bruijn graph assemblers do not work for mixed-length reads (K is fixed)
◦ Traditional overlap graph assemblers are slower and use more memory, but the latest ones ran faster and used less memory than Velvet in our chr22 experiment

Preliminary Results

Quality score and length distribution

Id       Mean length   Median length   Std dev
M19107   577.5849      569             83.9605
M19501   624.7172      621             78.4074
M21127   618.7576      616             81.5678
M21621   620.6305      621             83.978
M21639   573.384       564             66.5525
M21709   626.2459      624             78.2447

Velvet

Id       K    No. of contigs   N50   Max length   Total length   % reads used
M19107   19   217160           16    665          2905543        97.3535
M19107   29   176741           26    655          3315033        88.7319
M19501   19   618036           13    429          4716286        78.9177
M19501   29   537077           18    490          5725530        35.5981
M21127   19   319999           15    483          3498613        91.4239
M21127   29   259942           24    416          3998418        73.0187
M21621   19   218872           16    640          3052522        93.7490
M21621   29   157853           26    838          3256837        87.5425
M21639   19   770867           13    628          5818868        85.0236
M21639   29   680339           19    601          7348599        46.1671
M21709   19   291156           16    768          3425632        95.7695
M21709   29   207736           25    816          3637419        83.8704

$> velveth <output_dir> <k-mer length> -fasta -long <reads.fasta>
$> velvetg <output_dir>
Input: Fasta/Fastq
Output: Fasta

WGS assembler (Celera)
• >50 separate programs make up the Celera Assembler pipeline
• runCA script helps manage them all
Input: frg format
Output: Fasta

Id       No. of contigs   N50     Max length   Total length   % reads used
M19107   236              11881   32038        1766060        96.3570
M19501   214              1230    4519         278112         98.6032
M21127   345              8349    26765        1947955        97.9181
M21621   356              7791    30668        1892633        98.1710
M21639   326              2092    9912         610813         98.3939
M21709   520              4393    15002        1700040        98.5221

$> sffToCA -trim soft -libraryname ${Id}-trimsoft -output ${Id}-trimsoft ${Id}.sff
$> runCA -p ${Id} -d ${Id} ovlConcurrency=4 ${Id}-trimsoft.frg

Newbler

De Novo Assembly

Id       No. of contigs   N50      Max length   Total length
M19107   217              15659    38000        25112606
M19501   75               157459   343196       106836011
M21127   59               121256   316274       40693944
M21621   50               138437   339424       50432798
M21639   175              43023    182797       158028027
M21709   52               140128   319869       69503256

Reference Assembly (Haemophilus-influenzae-refseq.fasta)

Id       No. of contigs   N50     Max length   Total length
M19107   1260             2496    10409        1224223
M19501   988              3503    18724        1380153
M21127
M21621
M21639   1272             2701    13712        1416318
M21709   313              13836   70298        1607841

$> runAssembly <reads.sff>    // de novo assembly
Input: .sff
Output: Fasta

MIRA
MIRA stands for Mimicking Intelligent Read Assembly

Id       No. of contigs   N50      Max length   Total length   % reads used
M19107   208              18379    51687        1795134        95.7478
M19501   181              185484   321569       1901198        97.7347
M21127   89               81157    305626       1951240        97.4776
M21621   67               90877    253924       1887484        97.5015
M21639   175              90800    152373       2378888        98.1330
M21709   83               62871    197745       1840248        97.6776

Input: Fasta + qual + trace info
Output: Fasta, Ace

$> sff_extract -s ${Id}_in.454.fasta -q ${Id}_in.454.fasta.qual -x ${Id}_traceinfo_in.454.xml ${Id}.sff
$> mira --project=${Id} --job=denovo,genome,normal,454 -GE:not=4 >& ${Id}_assembly.log

Eagle view - M19107.ace

Eagle view - M19501.ace

Works Cited
◦ “Next-generation DNA sequencing”, Shendure et al., http://compgenomics2011.biology.gatech.edu/images/f/f9/ShendureNatureBiotechnology-2008.pdf
◦ “Next-generation DNA sequencing methods”, Mardis et al., http://compgenomics2011.biology.gatech.edu/images/5/59/MardisAnnuRevGenet-2008.pdf

Lab Exercise
◦ Download the Lab Exercise file from the Genome Assembly wiki page
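
Computing the assembly metrics (sketch)
The result tables in this deck report the number of contigs, total length, max contig length, and N50 for each assembler output. As a possible starting point for the lab exercise, here is a minimal Python sketch of how those numbers could be derived from a FASTA file of contigs. It is not the C program used for the experiments. The function names `fasta_lengths` and `assembly_stats` are hypothetical, and the N50 definition is an assumption: the largest contig length L such that contigs of length >= L cover at least half of the total assembly length. The slides do not say which variant was used (some tools normalize by genome size instead).

```python
def fasta_lengths(lines):
    """Yield the sequence length of each record in FASTA-formatted lines."""
    length = None  # None until the first '>' header is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length  # finish the previous record
            length = 0
        elif line and length is not None:
            length += len(line)  # sequence may span several lines
    if length is not None:
        yield length  # last record


def assembly_stats(contig_lengths):
    """Return (num_contigs, total_length, max_length, n50).

    Assumed N50 definition: walk contigs from longest to shortest and
    report the length at which the running sum first reaches half of
    the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        return 0, 0, 0, 0
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            n50 = length
            break
    return len(lengths), total, lengths[0], n50


# Toy example: total = 290, half = 145; 80 + 70 = 150 >= 145, so N50 = 70.
print(assembly_stats([80, 70, 50, 40, 30, 20]))  # (6, 290, 80, 70)
```

To run it on real assembler output, pass an open file handle, for example `assembly_stats(fasta_lengths(open(path)))`, where `path` is the contigs FASTA file written by the assembler. Counting misassembled contigs is not covered here because it requires aligning contigs back to the reference genome.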