Advancing metagenomics with Illumina sequencing technology
Anthony J. Cox
Computational Biology Group
Illumina Cambridge Ltd.
14th April 2014
© 2013 Illumina, Inc. All rights reserved.
Illumina, IlluminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium,
iSelect, MiSeq, Nextera, NuPCR, SeqMonitor, Solexa, TruSeq, TruSight, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks
of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
Contents
Challenge: achieving a seamless end-to-end workflow for metagenomics
Case study: Eagle Creek Reservoir
– 16S workflow on MiSeq
– Shotgun metagenomics on NextSeq
Challenge: efficient storage and access for metagenomic data
Expanded sequencing portfolio
Increasing System Output, Decreasing Price Per Gb
Instrument | Output | Reads | Read length
MiSeq | 15Gb | 25M | 2x300
NextSeq | 120Gb | 400M | 2x150
HiSeq 2500 | 1000Gb | 4B | 2x125
HiSeq X Ten | 1800Gb | 6B | 2x150
Integration
Streamlined end-to-end solution
Sample Prep: Suite of DNA, RNA & Targeted Solutions
Sequencing: Industry’s leading NGS instruments
Analysis: Storage, Processing, Analysis & Collaboration
Case study: Eagle Creek reservoir, Indiana
Assessing seasonal blooms of Cyanobacteria (blue-green algae) in drinking
water that can impact water quality.
Collaboration with Center for Earth and Environmental Science, IUPUI
49 reservoir samples collected in different months, at discrete depths.
Study combines 16S analysis on MiSeq with shotgun metagenomics on NextSeq
By courtesy of:
Nicolas Clercin (IUPUI), Rob Schmeider, Brian Steffy, Clotilde Teiling, Kameran Wong (Illumina)
MiSeq – continuous performance improvements
Delivering on promise of 15Gb+, 2x300 bp reads
Since launch: 10x increase in output, 7x decrease in price per data point
[Chart: Output (Gb) by quarter, 2Q11 to 4Q13]
Output | Clusters | Read length | Price / Gb
>1.5 Gb | ~7M | 2 x 150 bp | $643
>8 Gb | >15M | 2 x 250 bp | $192
15 Gb | 25M | 2 x 300 bp | $90
Chart annotations: Dual surface imaging; Faster chemistry; New v3 reagent kits (150 & 600-cycle)
*Prices reflect US List only
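The two headline multipliers can be checked against the chart's own endpoints. The following is a minimal sketch, assuming the >1.5 Gb output belongs with the $643 price and the 15 Gb output with the $90 price; that pairing is inferred from the 7x claim and is not stated on the slide.

```python
# Endpoint figures read off the chart. The pairing of output with price
# is an assumption made for this check, not something the slide states.
launch = {"output_gb": 1.5, "price_per_gb": 643.0}
current = {"output_gb": 15.0, "price_per_gb": 90.0}

output_gain = current["output_gb"] / launch["output_gb"]
price_drop = launch["price_per_gb"] / current["price_per_gb"]

print(f"Output increase: {output_gain:.1f}x")
print(f"Price per Gb decrease: {price_drop:.1f}x")
```

Under that pairing the ratios come out at roughly 10x and 7x, in line with the figures quoted for the period since launch.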
Workflow overview
16S rRNA sequencing was done on 27 of the samples
Sample Prep
• Genomic DNA extraction
• The Meta-G-Nome™ DNA Isolation Kit is used to isolate inhibitor-free, fosmid cloning-ready DNA from unculturable or difficult-to-culture microbial species present in environmental water, soil, or compost samples.
V3–V4 region Amplification
• Primer pair sequences for V3 and V4 region create a simple 460 bp long amplicon.
Library Prep
• Nextera XT indexing kit for 96 samples in parallel
MiSeq & Primary Analysis
• 100,000 reads per sample if using all 96 indexes.
Secondary Analysis
• Comparative genomics
• Phylogenetic classification
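One reason the 2 x 300 bp read length matters for this workflow is that both reads of a pair must overlap for the 460 bp amplicon to be covered end to end. The sketch below works through that arithmetic for the three MiSeq read lengths quoted earlier. It is a simplification: it assumes reads start at the two ends of the amplicon and ignores primer or quality trimming, and the function name is illustrative.

```python
def paired_end_overlap(read_length: int, amplicon_length: int) -> int:
    """Bases covered by both reads when a pair is sequenced from opposite ends."""
    overlap = 2 * read_length - amplicon_length
    # Clamp: no overlap if the reads do not meet; at most the whole amplicon.
    return max(0, min(overlap, amplicon_length))


amplicon_bp = 460  # V3-V4 amplicon length quoted in the workflow
for read_length in (150, 250, 300):
    overlap = paired_end_overlap(read_length, amplicon_bp)
    print(f"2 x {read_length} bp: overlap = {overlap} bp")
```

Under these assumptions 2 x 150 bp reads leave a gap in the middle of the amplicon, 2 x 250 bp reads overlap by 40 bp, and 2 x 300 bp reads overlap by 140 bp.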
16S metagenomics on BaseSpace
Taxonomic classification
Can run on-instrument using MiSeq Reporter or in cloud with BaseSpace
Both analysis pipelines use the same classification algorithm and taxonomic database.
– The classification algorithm is a high performance implementation of the published RDP Naïve Bayesian Classifier (http://dx.doi.org/10.1128%2FAEM.00062-07)
– The database is an Illumina-curated version of the GreenGenes Consortium 16S rRNA database. Redundant sequences and entries with missing or partial labels are removed.
Provides fast, high-accuracy species-level taxonomic classifications
Uses full length of Illumina paired-end reads
Outputs: PDF reports, raw data (CSV), interactive visualizations
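To make the classification step concrete, here is a toy sketch of a naive Bayesian k-mer classifier in the spirit of the published RDP method. It is not the Illumina implementation. The class name, the training sequences and the taxon labels are invented for illustration, and k is set to 4 so the toy data works (the published classifier uses 8-base words and adds bootstrap confidence estimates, which are omitted here).

```python
import math
from collections import defaultdict


def kmers(seq, k):
    """Set of distinct k-length words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class NaiveBayesKmerClassifier:
    """Toy word-based naive Bayes classifier (illustrative only)."""

    def __init__(self, k=8):
        self.k = k
        self.n_total = 0                    # N: number of training sequences
        self.word_seqs = defaultdict(int)   # n(w): sequences containing word w
        self.taxon_seqs = defaultdict(int)  # M: training sequences per taxon
        self.taxon_word = defaultdict(lambda: defaultdict(int))  # m(w) per taxon

    def train(self, labelled):
        for taxon, seq in labelled:
            self.n_total += 1
            self.taxon_seqs[taxon] += 1
            for w in kmers(seq, self.k):
                self.word_seqs[w] += 1
                self.taxon_word[taxon][w] += 1

    def log_score(self, taxon, seq):
        m_total = self.taxon_seqs[taxon]
        score = 0.0
        for w in kmers(seq, self.k):
            # Word prior, then smoothed per-taxon word probability.
            prior = (self.word_seqs.get(w, 0) + 0.5) / (self.n_total + 1)
            in_taxon = self.taxon_word[taxon].get(w, 0)
            score += math.log((in_taxon + prior) / (m_total + 1))
        return score

    def classify(self, seq):
        return max(self.taxon_seqs, key=lambda t: self.log_score(t, seq))


# Invented training data: two short sequences per taxon.
training = [
    ("Cyanobacteria", "ACGTACGTGGCCAACGT"),
    ("Cyanobacteria", "ACGTACGTGGCCTACGT"),
    ("Actinobacteria", "TTGACCGGTTAACCGGA"),
    ("Actinobacteria", "TTGACCGGTTAACTGGA"),
]
clf = NaiveBayesKmerClassifier(k=4)
clf.train(training)
print(clf.classify("ACGTGGCCAAC"))
```

Because the score is a sum over the words in the query, longer reads contribute more evidence per classification, which is consistent with the point about using the full length of the paired-end reads.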
Examples of 16S workflow output
[Figure: PCA plot of normalized relative abundance of samples]
[Figure: Clustering dendrogram]
NextSeq innovations
Consumables:
Load-and-go flowcell
• High or medium output
• Ships dry
All-in-one reagent tray
• RFID-tagged, ships frozen
All-in-one buffer tray
• Ships at room temperature
Optics:
Solid state optics
• Leverages advances in consumer products
• No alignment needed
Chemistry:
2-dye sequencing chemistry
• Comparable quality to 4-dye
Isothermal amplification
• No chiller on instrument
Optimized reagent consumption
Fluidics:
Eliminated fluidic tubes
• Less dead volume, waste, contamination
Automatic post-run wash protocols
• Bleach step eliminates carry-over
Simultaneous chemistry & imaging
• Chemistry in one lane while imaging other pair
Shotgun metagenomics on NextSeq: workflow overview
Steps: Sample Extraction, Library Prep, NextSeq Sequencing, Analysis
• 11 samples sequenced in 1 NextSeq run
• 400 million 2×150bp read pairs generated in 29 hours
• 78.8% of bases exceeded Q30
• Analysis done with MG-RAST
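The run statistics above translate into yield and per-sample depth with simple arithmetic. The sketch below assumes reads are spread evenly across the 11 samples, which the slide does not state, and uses the standard Phred definition of a quality score to show what the Q30 threshold means.

```python
def phred_error_probability(q: float) -> float:
    """Probability that a base call is wrong, given its Phred quality score."""
    return 10 ** (-q / 10)


read_pairs = 400_000_000
read_length = 150
samples = 11
run_hours = 29

total_bases = read_pairs * 2 * read_length
per_sample_bases = total_bases / samples  # assumes even pooling (not stated)

print(f"Total yield: {total_bases / 1e9:.0f} Gb")
print(f"Mean yield per sample: {per_sample_bases / 1e9:.1f} Gb")
print(f"Throughput: {total_bases / run_hours / 1e9:.1f} Gb per hour")
print(f"Error probability at Q30: {phred_error_probability(30):.3f}")
```

The total comes to 120 Gb, or a little under 11 Gb per sample if pooling were even. Q30 corresponds to an expected error rate of 1 in 1,000 base calls.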
Seasonal variation in composition at bottom of lake
23rd May: Actinobacteria = 76%
25th July: Actinobacteria = 33%
23rd October: Actinobacteria = 79%
Ongoing challenge: what should be our data analysis pipeline for shotgun metagenomic data, e.g. on BaseSpace?
• Several standalone apps for taxonomic classification
• Seem to be fewer options for functional classification
HiSeq 1 terabase run (R&D data)
Yield: 1035 Gb
Reads: 4.14B
Read Length: 2 × 125 bp
Throughput / day: 172.5 Gb
Quality (%>Q30): 87.7%
Run Time: 6 days (2 x 125 Cycles)
Per run you can do up to*:
− 10 genomes
− 150 exomes
− 80 WT RNA samples
*Assumes 100Gb, 30x genome; Nextera Rapid Capture Exome; 50M reads per RNA sample
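The run metrics are mutually consistent, which a few lines of arithmetic make visible. This sketch assumes the "Reads" figure counts read pairs (clusters), since that is the interpretation under which the quoted yield is reproduced. The per-run sample counts use the footnote's assumptions.

```python
clusters = 4_140_000_000       # "Reads" on the slide, treated here as read pairs
bases_per_cluster = 2 * 125    # paired-end, 125 cycles each
run_days = 6

yield_gb = clusters * bases_per_cluster / 1e9
throughput_gb_per_day = yield_gb / run_days

# Footnote assumptions: 100 Gb per 30x genome, 50M reads per RNA sample.
genomes_per_run = int(yield_gb // 100)
rna_samples_per_run = clusters // 50_000_000
gb_per_exome = yield_gb / 150  # implied by 150 exomes per run

print(f"Yield: {yield_gb:.0f} Gb")
print(f"Throughput: {throughput_gb_per_day:.1f} Gb/day")
print(f"Genomes per run: {genomes_per_run}")
print(f"RNA samples per run: {rna_samples_per_run}")
print(f"Implied data per exome: {gb_per_exome:.1f} Gb")
```

The RNA figure comes out slightly above the slide's 80, so the slide appears to round down. The implied 6.9 Gb per exome is a back-calculation, not a number given in the source.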
Challenge: efficient storage and access for shotgun metagenomic data
Resequencing data (Human genome build ~160 Gbp, ~400 Gbyte FASTQ)
FASTQ (gzipped): 150 Gbyte
BAM (40 Q-scores): 120 Gbyte
BAM (8 Q-scores): 82 Gbyte
BAM (consensus compressed): 60 Gbyte
CRAM (consensus compressed): 27 Gbyte
Relies heavily on known high-quality reference sequence
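The step from 40 to 8 Q-scores is a binning of quality values: each score is replaced by a representative of its bin, so the quality string uses fewer distinct symbols and compresses better. The sketch below illustrates the idea. The bin boundaries and representative values are assumptions chosen for illustration and are not taken from the slide, as is the Phred+33 encoding of the example string.

```python
import bisect

# Illustrative 8-level scheme (assumed, not from the slide):
# lower bound of each bin and the value every score in that bin maps to.
BIN_LOWER_BOUNDS = [0, 2, 10, 20, 25, 30, 35, 40]
BIN_REPRESENTATIVES = [0, 6, 15, 22, 27, 33, 37, 40]


def bin_q(q: int) -> int:
    """Map a Phred score to the representative value of its bin."""
    idx = max(0, bisect.bisect_right(BIN_LOWER_BOUNDS, q) - 1)
    return BIN_REPRESENTATIVES[idx]


def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Apply binning to an ASCII-encoded quality string (Phred+33 assumed)."""
    return "".join(chr(bin_q(ord(c) - offset) + offset) for c in qual)


original = "IIIIHGF#5"  # hypothetical quality string
binned = bin_quality_string(original)
print(original, "->", binned)
print(len(set(original)), "distinct symbols ->", len(set(binned)))
```

Binning is lossy for the Q-scores themselves. The saving comes from the reduced alphabet and the longer runs of identical symbols that result.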
Resequencing data (Human genome build 145Gbp, ~160 Gbp, ~400 Gbyte FASTQ)
FASTQ (gzipped, 8 Q-scores): 89 Gbyte
BWT compression (now): 37 Gbyte
BWT compression (likely achievable): 23 Gbyte
• 89 to 37 Gbyte: BWT/PPM for reads, simple binning of Q-scores (lossless)
• Sort reads for better compression – save 4 Gbyte (Cox et al., 2012)
• Discard uninformative Q-scores (reference free) – save 10 Gbyte (Janin et al., 2012)
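The reference-free approach rests on the Burrows-Wheeler transform (BWT), which permutes the symbols of the reads so that symbols sharing the same following context end up adjacent, giving long runs that a downstream compressor can exploit. The sketch below is a naive illustration using sorted rotations. It is not how a production tool builds the BWT of a large read collection, and it implements neither the PPM stage nor the read reordering mentioned above. The reads are hypothetical.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (naive, for small inputs).

    The text is expected to end with a terminator such as '$'.
    """
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of identical symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


print(bwt("banana$"))  # annb$aa

# Hypothetical overlapping reads, concatenated with '$' terminators.
reads = ["ACGTACGTAC", "CGTACGTACG", "GTACGTACGT", "TACGTACGTA"]
text = "".join(r + "$" for r in reads)
transformed = bwt(text)
print("runs before:", count_runs(text), "runs after:", count_runs(transformed))
```

The transform is a permutation of the input, so nothing is lost. Compression comes from encoding the runs compactly.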
Trading compression for searchability
Resequencing data (Human genome build ~165 Gbp)
FASTQ (gzipped): 152 Gbyte
BWT (searchable): 105 Gbyte
– Reads (BWT): 26 Gbyte
– Q-scores (razip): 64 Gbyte
– Read names (razip): 15 Gbyte
NB: figures are for 40 Q-scores; both FASTQ and BWT would be smaller for 8 Q-scores
For a query sequence q, returns:
• Full FASTQ record (sequence, Q-scores, read names) for all reads containing q
• … and full FASTQ record of their read pairs
• Pipe search output directly to your favourite tool, e.g. Velvet
Applications:
• “In silico pull-down”
• Assembling breakpoints
• Genotyping complex variants by tracking k-mers
Further info: beetl.github.io/BEETL/, Janin et al. (2014, submitted)
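Searching a BWT for a query q is typically done with backward search, which narrows a range of sorted suffixes one query symbol at a time, starting from the last symbol. The sketch below is a toy illustration of that technique over a handful of hypothetical reads. It is not BEETL's API or implementation, the function names are invented, and it returns only the matching read sequences rather than full FASTQ records or mates.

```python
import bisect


def build_index(reads):
    """Build a toy searchable index over '$'-terminated reads (naive construction)."""
    text = "".join(r + "$" for r in reads)
    starts, pos = [], 0
    for r in reads:
        starts.append(pos)
        pos += len(r) + 1
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # suffix array
    bwt = [text[i - 1] for i in sa]                        # symbol before each suffix
    alphabet = sorted(set(text))
    c_table, total = {}, 0   # c_table[c]: symbols in text strictly smaller than c
    for c in alphabet:
        c_table[c] = total
        total += text.count(c)
    occ = {c: [0] for c in alphabet}  # occ[c][i]: occurrences of c in bwt[:i]
    for sym in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (sym == c))
    return {"sa": sa, "c": c_table, "occ": occ, "starts": starts, "n": len(text)}


def search(index, reads, q):
    """Return the reads containing q (q must not contain the '$' terminator)."""
    lo, hi = 0, index["n"]
    for c in reversed(q):              # backward search: last symbol first
        if c not in index["c"]:
            return []
        lo = index["c"][c] + index["occ"][c][lo]
        hi = index["c"][c] + index["occ"][c][hi]
        if lo >= hi:
            return []
    # Map each matching text position back to the read that contains it.
    hits = {bisect.bisect_right(index["starts"], index["sa"][i]) - 1
            for i in range(lo, hi)}
    return [reads[i] for i in sorted(hits)]


reads = ["ACGTAC", "TTGACG", "GGGCCC"]  # hypothetical reads
index = build_index(reads)
print(search(index, reads, "ACG"))
```

Each step of the loop costs two table lookups, so query time depends on the length of q and the number of hits, not on the size of the read collection. A real tool would store the occurrence counts in a sampled or compressed form.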
Thank you!
Extra slides
Moleculo Technology Enables Synthetic Long Reads
Up to 10Kb from Illumina short reads
• Synthetic long reads 8 – 10kb
• Enables fully phased genomes
• Accurate de novo assembly of large, complex genomes
Available:
• Illumina services 2H13
• Kit format early 2014
[Figure: workflow diagram, Step1 to Step4]
BaseSpace: Plug and Play Genomic Cloud Solution
All you need is an internet connection
How Is BaseSpace Being Used World Wide? Users & Growth
Bioinformatics Cloud Computing Service
October 2011: Illumina Begins Streaming MiSeq Data to the Cloud
December 2011: Illumina Begins Data Sharing in the Cloud
November 2012: Illumina Begins Streaming HiSeq Data to the Cloud
December 2012: Over 20,000 Instrument Runs Streamed to BaseSpace
May 2013: BaseSpace Commercial (Supported) Release
April 2013: Over 40,000 Instrument Runs Streamed to BaseSpace
July 2013: General Availability of BaseSpace to all HiSeq instruments
September 2013: Over 60,000 Instrument Runs Streamed to BaseSpace, and Over 10,000 Apps Run