Advancing metagenomics with Illumina sequencing technology
Anthony J. Cox
Computational Biology Group, Illumina Cambridge Ltd.
14th April 2014

© 2013 Illumina, Inc. All rights reserved. Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, NuPCR, SeqMonitor, Solexa, TruSeq, TruSight, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.

Contents
• Challenge: achieving a seamless end-to-end workflow for metagenomics
• Case study: Eagle Creek Reservoir
  – 16S workflow on MiSeq
  – Shotgun metagenomics on NextSeq
• Challenge: efficient storage and access for metagenomic data

Expanded sequencing portfolio
System: output | reads | read length
• HiSeq X Ten: 1800 Gb | 6B | 2x150
• HiSeq 2500: 1000 Gb | 4B | 2x125
• NextSeq: 120 Gb | 400M | 2x150
• MiSeq: 15 Gb | 25M | 2x300
[Figure axes: Increasing System Output; Decreasing Price Per Gb]

Integration: streamlined end-to-end solution
• Sample Prep: Suite of DNA, RNA & Targeted Solutions
• Sequencing: Industry’s leading NGS instruments
• Analysis: Storage, Processing, Analysis & Collaboration

Case study: Eagle Creek Reservoir, Indiana
• Assessing seasonal blooms of Cyanobacteria (blue-green algae) in drinking water that can impact water quality.
• Collaboration with Center for Earth and Environmental Science, IUPUI
• 49 reservoir samples collected in different months, at discrete depths.
• Study combines 16S analysis on MiSeq with shotgun metagenomics on NextSeq
By courtesy of: Nicolas Clercin (IUPUI), Rob Schmeider, Brian Steffy, Clotilde Teiling, Kameran Wong (Illumina)

MiSeq – continuous performance improvements
Delivering on promise of 15Gb+, 2x300 bp reads
Since launch:
• 10x increase in output
• 7x decrease in price per data point
Milestones (chart of output in Gb, 2Q11 to 4Q13):
• Output >1.5 Gb | Clusters ~7M | Read length 2 x 150 bp | Price / Gb $643
• Output >8 Gb | Clusters >15M | Read length 2 x 250 bp | Price / Gb $192
• Output 15 Gb | Clusters 25M | Read length 2 x 300 bp | Price / Gb $90
Chart annotations: Dual surface imaging; Faster chemistry; New v3 reagent kits (150 & 600-cycle)
*Prices reflect US List only

Workflow overview
16S rRNA sequencing was done on 27 of the samples.
Sample Prep
• Genomic DNA extraction
V3–V4 region Amplification
• Primer pair sequences for V3 and V4 region create a simple 460 bp amplicon.
Library Prep
• Nextera XT indexing kit for 96 samples in parallel
MiSeq & Primary Analysis
• 100,000 reads per sample if using all 96 indexes.
Secondary Analysis
• Comparative genomics
• Phylogenetic classification
Note: The Meta-G-Nome™ DNA Isolation Kit is used to isolate inhibitor-free, fosmid cloning-ready DNA from unculturable or difficult-to-culture microbial species present in environmental water, soil, or compost samples.

16S metagenomics on BaseSpace

Taxonomic classification
• Can run on-instrument using MiSeq Reporter or in the cloud with BaseSpace
• Both analysis pipelines use the same classification algorithm and taxonomic database.
  – The classification algorithm is a high performance implementation of the published RDP Naïve Bayesian Classifier (http://dx.doi.org/10.1128%2FAEM.00062-07)
  – The database is an Illumina-curated version of the GreenGenes Consortium 16S rRNA database. Redundant sequences and entries with missing or partial labels are removed.
• Provides fast, high-accuracy species-level taxonomic classifications
• Uses full length of Illumina paired-end reads
• Outputs: PDF reports, raw data (CSV), interactive visualizations

Examples of 16S workflow output
• PCA plot of normalized relative abundance of samples
• Clustering dendrogram

NextSeq innovations
Consumables
• Load-and-go flowcell
  – High or medium output
  – Ships dry
• All-in-one reagent tray
  – RFID-tagged, ships frozen
• All-in-one buffer tray
  – Ships at room temperature
Optics
• Solid state optics
  – Leverages advances in consumer products
  – No alignment needed
Chemistry
• 2-dye sequencing chemistry
  – Comparable quality to 4-dye
• Isothermal amplification
  – No chiller on instrument
• Optimized reagent consumption
Fluidics
• Eliminated fluidic tubes
  – Less dead volume, waste, contamination
• Automatic post-run wash protocols
  – Bleach step eliminates carry-over
• Simultaneous chemistry & imaging
  – Chemistry in one lane while imaging other pair

Shotgun metagenomics on NextSeq: workflow overview
Sample Extraction → Library Prep → NextSeq Sequencing → Analysis
• 11 samples sequenced in 1 NextSeq run
• 400 million 2×150bp read pairs generated in 29 hours
• 78.8% of bases exceeded Q30
• Analysis done with MG-RAST

Seasonal variation in composition at bottom of lake
• 23rd May: Actinobacteria = 76%
• 25th July: Actinobacteria = 33%
• 23rd October: Actinobacteria = 79%
Ongoing challenge: what should be our data analysis pipeline for shotgun metagenomic data, e.g. on BaseSpace?
• Several standalone apps for taxonomic classification
• Seem to be fewer options for functional classification

HiSeq 1 terabase run (R&D data)
• Yield: 1035 Gb
• Reads: 4.14B
• Read Length: 2 × 125 bp
• Throughput / day: 172.5 Gb
• Quality (%>Q30): 87.7%
• Run Time: 6 days, 2 x 125 cycles
Per run you can do up to:
− 10 genomes
− 150 exomes
− 80 WT RNA samples
*Assumes 100Gb, 30x genome; Nextera Rapid Capture Exome; 50M reads per RNA sample

Challenge: efficient storage and access for shotgun metagenomic data
Resequencing data (Human genome build ~160 Gbp, ~400 Gbyte FASTQ)
• FASTQ (gzipped): 150 Gbyte
• BAM (40 Q-scores): 120 Gbyte
• BAM (8 Q-scores): 82 Gbyte
• BAM (consensus compressed): 60 Gbyte
• CRAM (consensus compressed): 27 Gbyte
Relies heavily on known high-quality reference sequence
Resequencing data (Human genome build 145Gbp, ~160 Gbp, ~400 Gbyte FASTQ)
• FASTQ (gzipped, 8 Q-scores): 89 Gbyte
• BWT compression (now): 37 Gbyte
• BWT compression (likely achievable): 23 Gbyte
How the savings arise:
• 89 → 37 Gbyte: BWT/PPM for reads, simple binning of Q-scores (lossless)
• Sort reads for better compression: save 4 Gbyte (Cox et al., 2012)
• Discard uninformative Q-scores (reference free): save 10 Gbyte (Janin et al., 2012)

Trading compression for searchability
Resequencing data (Human genome build ~165 Gbp)
• FASTQ (gzipped): 152 Gbyte
• BWT (searchable): 105 Gbyte
  – Reads (BWT): 26 Gbyte
  – Q-scores (razip): 64 Gbyte
  – Read names (razip): 15 Gbyte
NB: 40 Q-scores; both FASTQ and BWT would be smaller for 8 Q-scores
For a query sequence q, returns:
• Full FASTQ record (sequence, Q-scores, read names) for all reads containing q
• … and full FASTQ record of their read pairs
• Pipe search output directly to your favourite tool, e.g. Velvet
Applications:
• “In silico pull-down”
• Assembling breakpoints
• Genotyping complex variants by tracking k-mers
Further info: beetl.github.io/BEETL/, Janin et al. (2014, submitted)

Thank you!
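Illustration for “Trading compression for searchability”: the slide describes querying a BWT of the reads for every read containing a sequence q. The sketch below shows the underlying idea on a toy scale, using a textbook FM-index backward search over a BWT built by naive suffix sorting. It is not BEETL's implementation, which builds the BWT very differently for large read sets. The function names (build_fm_index, count_occurrences) and the separator characters are assumptions made for this example, and it only counts matches rather than returning FASTQ records.

```python
def build_fm_index(reads):
    """Build a toy FM-index over a small collection of reads.

    Reads are joined with '$' and terminated with '#', which sorts before
    '$' and before any base, so it acts as a unique end-of-text sentinel.
    Naive suffix sorting is used, so this is only suitable for tiny inputs.
    """
    text = "$".join(reads) + "#"
    # Suffix array: start positions of all suffixes in lexicographic order.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps round for i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c]: number of characters in the text strictly smaller than c.
    alphabet = sorted(set(text))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return bwt, first, occ


def count_occurrences(query, index):
    """Count occurrences of query inside the reads by backward search.

    The query is assumed to contain only bases (no '$' or '#'), so a match
    can never span the boundary between two reads.
    """
    bwt, first, occ = index
    lo, hi = 0, len(bwt)  # current half-open interval of sorted suffixes
    for c in reversed(query):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGT", "TTTT"]
    index = build_fm_index(toy_reads)
    for q in ["GTAC", "TT", "AAAA"]:
        print(q, count_occurrences(q, index))
```

Each step of the backward search narrows an interval of sorted suffixes using only the BWT and two small count tables, which is why the compressed reads stay searchable without decompressing them first.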
Extra slides

Moleculo Technology Enables Synthetic Long Reads
Up to 10Kb from Illumina short reads
• Synthetic long reads 8 – 10kb
• Enables fully phased genomes
• Accurate de novo assembly of large, complex genomes
• Available: Illumina services 2H13; kit format early 2014
[Diagram: Step1 … Step2 … Step3 … Step4]

BaseSpace: Plug and Play Genomic Cloud Solution
All you need is an internet connection

How Is BaseSpace Being Used World Wide?
Users & Growth: Bioinformatics Cloud Computing Service
• October 2011: Illumina Begins Streaming MiSeq Data to the Cloud
• December 2011: Illumina Begins Data Sharing in the Cloud
• November 2012: Illumina Begins Streaming HiSeq Data to the Cloud
• December 2012: Over 20,000 Instrument Runs Streamed to BaseSpace
• May 2013: BaseSpace Commercial (Supported) Release
• April 2013: Over 40,000 Instrument Runs Streamed to BaseSpace
• July 2013: General Availability of BaseSpace to all HiSeq instruments
• September 2013: Over 60,000 Instrument Runs Streamed to BaseSpace, and Over 10,000 Apps Run