Sequencing of complex plant genomes: big data … big deal?
Applications and Challenges of Oxford Nanopore Sequencing in the Life Science Industry
Raymond Hulzink, Ph.D.
Wageningen, April 14, 2016
Rhu@keygene.com

Genome assembly
The challenge
Long-read sequencing technologies have accelerated whole-genome (re-)sequencing approaches and reduced costs dramatically ...
... but de novo construction of highly accurate draft genome sequences in complex organisms is still challenging and costly ...
... therefore, high-quality ultra-long reads are needed
‘Repeats longer than read length cannot be resolved!’
The crop innovation company

Plant genomes
Size
Crops, e.g.
Melon 400 Mb
Tomato 800 Mb
Lettuce 1,200 Mb
Pepper 3,200 Mb
Barley 5,000 Mb
Other plants
Large bitter cress 54 Mb
Snake's head 124,852 Mb
Japanese canopy plant 149,000 Mb
Data source: http://data.kew.org/cvalues/CvalServlet?querytype=1
Image sources:
Alpsdake, via Wikimedia Commons - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Paris_japonica_Kinugasasou_in_Hakusan_2010_7_18.jpg
Walter Siegmund - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Cardamine_amara_eF.jpg
Michael Apel - Own work. Licensed under CC BY 3.0 via Wikimedia Commons http://commons.wikimedia.org/wiki/File:Fritillaria_meleagris_MichaD.jpg#/media/File:Fritillaria_meleagris_MichaD.jpg

Plant genomes
Complexity
• Repetitive DNA
º Medium
- Tandem repeats (rRNA, tRNA)
- Gene families (paralogs)
- Transposable elements (e.g. retrotransposons)
º High
- Tandemly arranged SSRs
- Centromeric tandem repeats
e.g. pepper genome: ~81% repetitive sequences
Qin et al. (2014) Whole-genome sequencing of cultivated and wild peppers provides insights into Capsicum domestication and specialization.
PNAS 111: 5135-5140
• Heterozygosity, polyploidy

MAP @ KeyGene
• Phase 1 (2014): setting up the system, testing software, and sequencing ONT reference DNA (λ genome)
• Phase 2 (2015): sequencing experimental DNA (plant BAC clones)
“I have just been looking at some QC metrics that the software has sent back to us and see that your flow cell is running hotter than I would have expected ...” (Oxford Nanopore)

BAC sequencing
Read alignment against reference
• Alignment with MarginAlign against PB references
• Despite a low number of 2D pass reads (<10%), BAC references were completely covered (8-20x depth)
• Sequencing error analysis showed a read accuracy of ~83%
[Figure: depth of coverage (# of reads) vs. map position on reference; MinION / FLO-MAP003]

BAC sequencing
De novo assembly
• De novo assembly with Celera Assembler after one or two rounds of error correction (NanoCorrect)
• Alignment against PB reference using MUMmer, with the dnadiff tool for estimation of per-base accuracy
• Successful de novo assembly for two BAC clones with 10-15 fold read depth
• High-quality assemblies with a small number of substitution errors and a moderate number of insertion/deletion errors
[Figure: nanopore assembly (bases) vs. PacBio reference (bases); panels: BAC H049 – Assemb 2, BAC H032 – Assemb 2]

Genome sequencing
Plant pathogen Rhizoctonia solani
• Soil-borne plant pathogenic fungus
• Causes a wide range of commercially significant plant diseases
• Estimated genome size ~50-55 Mb
o heterokaryotic (≥ 2 distinct nuclear genomes)
o 10% repetitive sequences
o duplicated genomic regions
• Draft genome sequences available from different subgroups
• High level of sequence differences between different subgroups (~21% shared core genes)
• Generate draft genome assembly of Rhizoctonia solani
o MinION MK1 sequencer with MAP006 chemistry and FLO-MAP103 flow cells

Genome sequencing
Extraction
of ultra-pure (u)HMW DNA
• DNA quality and integrity are essential for obtaining high-quality long reads
• Extraction of ultra-pure HMW and uHMW DNA from plants poses unique challenges that require specific expertise to deal with carbohydrates, phenolics, and other compounds abundant in plant tissues
• KeyGene has developed protocols for extraction, purification, analysis, and quantitation of DNA from a variety of (difficult) plant and pathogen species

R. solani sequencing
Extraction and sizing of fungal HMW DNA
           Nanodrop                      QubitBR    TapeStation
           [ng/uL]   260/280   260/230   [ng/uL]    [ng/uL]
Crude      1,372     1.87      0.92      53         -
Purified   210       2.22      2.18      190        162
Sized      -         -         -         265        257
~45%

R. solani sequencing
Library preparation MAP006 work flow
Fragment sizes shown on slide: ~12.5 K (9K hydropore S), ~18.8 K (10K hydropore L), >60 K
[Figure: bar chart of recovery (%) for Lib002, Lib003 and Lib004]

R. solani sequencing
Library and read size distribution
[Figure: size distributions for Lib002, Lib003 and Lib004; panels: library size, read length (MinKNOW), 2D read length (Metrichor), 2D pass read length; x-axis: sequence length 2D. Values shown on slide: 8.5 K, 19 K, 17.9 K, 11.3 K, 23 K, 21.3 K, 34 K, 56.6 K, 15.3 K]

R.
solani sequencing
2D pass read summary
Run                 Remarks            # 2D pass reads   Total length (Mb)   Max read length (Kb)   Median 2D quality score
53.5 ng (6 uL)      air bubble         2,900             26                  15.8                   9.4
53.5 ng (6 uL)      heat sink ~40°C    4,204             36                  29.0                   8.8
89.2 ng (10 uL)                        25,346            223                 25.9                   10.0
37.8 ng (6 uL)                         13,068            152                 34.2                   8.9
125.2 ng (20 uL)                       23,806            269                 43.7                   8.6
7.8 ng (6 uL)                          3,414             53                  61.4                   9.5
28.6 ng (22 uL)                        5,931             89                  80.4                   9.4
[Figure: library cumulative length distribution; % reads vs. read length (bases); libraries of 17.9 K, 21.3 K and 56.6 K]

Genome assembly
Miniasm and Canu assembly summary
• ~54 Mb draft genome sequence with Canu, consisting of 679 contigs with an N50 of ~170 K and a maximum contig length of more than 2 Mb
• Longer reads produce more contiguous assemblies

Genome assembly
Comparison between genome assemblies
Reference                        Platform     Sequence yield (Mb)   Sum contigs (Mb)   # scaffolds   Scaffold N50 (Kb)   # contigs   Contig N50 (Kb)
Zheng et al. Nat Commun 2013     GAII         5,604                 36.9               2,648         ~475                6,452       ~20.3
Cubeta et al. Genome Ann 2014    Sanger/FLX   -                     51.7               326           ~7,444              6,040       ~25.9
Hane et al. PLOS Gen 2014        HiSeq        -                     39.8               857           ~161                7,606       ~7.2
Wibberg et al. J Biotech 2015    FLX/MiSeq    2,200/2,000           42.8               879           -                   3,793       ~35.1
Wibberg et al. BMC Gen 2016      MiSeq        2,800                 52                 2,065         ~81.2               5,826       ~15.2
KeyGene Canu                     MK1          848.6                 54.1               -             -                   679         ~170
• With only 5 flow cells, about 15X coverage
• To be determined: detailed read coverage analysis to establish the level of genome duplication and the estimated heterokaryotic genome size

Genome alignment
Comparison between two public assemblies
• Alignment of public assemblies (MUMmer)
[Figure axis: Zheng et al. 2013 assembly (bases)]
• Comparative genome analysis reveals considerable genetic differences between isolates (i.e. genome size, gene number and composition)
• Level of similarity between R.
solani draft genomes but with an overall low level of co-linearity
[Figure axis: Cubeta et al. 2014 assembly (bases)]

Genome alignment
KeyGene assembly vs. Cubeta et al. 2014
• Considerable sequence diversity exists between the KeyGene strain and public Rhizoctonia strains
[Figure: Cubeta et al. 2014 assembly (bases) vs. KeyGene Canu assembly (bases)]

Conclusions
• Plant BAC DNA sequencing
o De novo assembly for two BAC clones with 10-15 fold read depth
o High-quality assemblies using a low number of 2D pass reads
• Rhizoctonia solani genome sequencing
o Large number of high-quality 2D pass reads in 24-hour runs
o Direct sequencing of HMW DNA positively affects read length
o Generated a ~54 Mb draft genome sequence with an estimated read depth of 10x
o Low level of co-linearity between nanopore assembly and published draft genomes
• Sequencing of complex plant genomes: big data … big deal!

What’s next ...?
• Rhizoctonia solani genome sequencing
o Improving the synthesis of long fragment libraries (yield, size)
o Sequencing additional flow cells
o Testing more tools and parameters
• Plant genome sequencing
o KeyGene joined the PromethION Early Access Programme (PEAP)
o Draft genome sequence of a melon variety using the PromethION
• Meet us at …

Acknowledgements
Erwin Datema
Alex Boshoven
Koen Cuelenaere
Lisanne Blommers
Alexander Wittenberg
Nathalie van Orsouw
Michiel van Eijk

The KeyBase®, KeyPoint® Mutation Breeding, WGP™, Sequence Based Genotyping and KeyGene® SNPSelect technologies are protected by patents and/or patent applications owned by Keygene N.V. KeyGene, KeyBase, KeyPoint and KeySeeQ are registered trademarks of Keygene N.V. in one or more territories in the world. All other product names, brand names or company names are used for identification purposes only, and may be (registered) trademarks of their respective owners.
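
Appendix: checking the assembly figures
The deck reports a sequence yield of 848.6 Mb, a ~54.1 Mb Canu assembly, "about 15X coverage" and a contig N50 of ~170 K. The sketch below shows how such figures are typically derived. It is a minimal illustration, not the pipeline used for the slides: the function names `n50` and `coverage` and the toy contig lengths are assumptions made for the example, and only the per-run yields and the 848.6 / 54.1 Mb values come from the slides.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def coverage(yield_mb, genome_mb):
    """Average read depth = sequenced bases / genome (or assembly) size."""
    return yield_mb / genome_mb


# Per-run total lengths (Mb) as listed in the 2D pass read summary.
run_yields_mb = [26, 36, 223, 152, 269, 53, 89]
total_yield_mb = sum(run_yields_mb)

# Yield and assembly size as listed in the assembly comparison table.
depth = coverage(848.6, 54.1)

# Hypothetical contig lengths (Kb), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]

print(f"summed run yield: {total_yield_mb} Mb")
print(f"estimated depth:  {depth:.1f}x")
print(f"toy N50:          {n50(toy_contigs)} Kb")
```

The summed per-run yields agree with the yield in the comparison table to within rounding, and the yield divided by the assembly size gives roughly the 15X quoted there. Because R. solani is heterokaryotic, depth relative to the true combined genome size may differ, which is what the planned read coverage analysis is meant to establish.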