Reference - Human Microbiome Journal Club

Next-Generation Sequencing of Microbial Genomes and Metagenomes
Christine King
Farncombe Metagenomics Facility
Human Microbiome Journal Club
July 13, 2012
Overview

Next-generation sequencing
- Applications
- Instruments
- Library prep and sequencing chemistry
- Sequence quality

Project overview
- Microbial genomes
- Microbial communities
DNA Sequencing

1st generation
- Sanger chain termination
- Capillary electrophoresis

2nd generation (NGS)
- High throughput, “massively parallel”
- Shorter reads
- Sequencing-by-synthesis

3rd generation
- Single molecule
- Nanopores

Applications

DNA sequencing
- De novo genomes
- Resequencing
  - Shotgun (e.g. mutant strains)
  - Amplicon (e.g. HLA, cancer)
  - Sequence capture (e.g. exome)
- Metagenome
  - Amplicon (e.g. 16S, COI, viral)
  - Shotgun
- ChIP

RNA sequencing
- Gene expression
- Gene annotation, splice variants
- Metatranscriptome
Instruments

Instrument     # of reads  Read length (bp)  Total output (Gb)  Cost per base  Run time  Technology
GS FLX         1M          450               0.5                $$$$           ++        emPCR, SBS, light detection
GS FLX+        1M          650               0.6                $$$$           ++        emPCR, SBS, light detection
GS Jr          100K        450               0.05               $$$$           ++        emPCR, SBS, light detection
GAIIx          640M        2x 150            90                 $$             +++       Bridge PCR, SBS, fluorophore
HiSeq 2000     6B          2x 100            600                $              +++       Bridge PCR, SBS, fluorophore
MiSeq          12M         2x 150            2                  $$             ++        Bridge PCR, SBS, fluorophore
PacBio RS      >10K        >1000             0.01               $$$$           +         Single-molecule seq, fluorophore
SOLiD 5500xl   1.4B        75 + 35           155                $              +++       emPCR, probe ligation, fluorophore
Ion PGM - 316  1M          >100              0.1                $$$            +         emPCR, SBS, pH change
Ion PGM - 318  6M          >100              1                  $$             +         emPCR, SBS, pH change
Which instrument(s) to use?

Read length vs number of reads

Cost per base, per sample, per project (multiplexing?)

Accuracy

Run time, wait time
Application      Length  # Reads  Accuracy  Instruments           Considerations
De novo (small)  +++     ++       ++        MiSeq, 454, Ion       Mix lengths
De novo (large)  +++     +++      ++        HiSeq, 454, SOLiD     Mix lengths, MP
Re-seq (small)   ++      ++       ++        MiSeq, Ion            Multiplex?
Re-seq (large)   ++      +++      ++        HiSeq, SOLiD          Enrichment?
RNA-seq (count)  +       +++      +         Illumina, SOLiD, Ion  Ref? Size? Rare?
Amplicons        +++     +        +++       454, MiSeq            Size? Multiplex?
Metagenomics     ++      +++      +++       Illumina, 454, SOLiD  Length vs depth
Library Preparation




Goal: fragments of DNA, each end flanked by adaptor sequences
Adaptors contain amplification- and sequencing-primer binding sites; platform- and chemistry-specific
Optional: sample-specific barcodes/indexes/MIDs/tags allow multiplexing during sequencing
Library QC: quantity, size
Library Preparation

Library types:
- Shotgun (DNA)
  - May begin with ChIP
  - May follow with sequence capture
- Mate pair (DNA)
- Amplicon (DNA)
- Total RNA
  - May enrich for mRNA (poly-A enrichment, rRNA depletion)
  - Convert to cDNA (then similar to DNA protocols)
- Small RNA
  - RNA ligations, convert to cDNA after
Library Preparation: Shotgun

Fragmentation
- Sonication
- Nebulization
- Enzymatic

End repair
- 3’ overhangs digested
- 5’ overhangs filled
- 5’ phosphate added
Library Preparation: Shotgun

Adapter ligation
- T-overhangs
- Forked structure controls orientation

Library amplification
- Few cycles
- Enrich for correctly-adapted fragments
- Required to complete adapter structure in some protocols

Size selection
- Gel excision, AMPure beads
- Limit insert size as needed, remove artifacts
Library Preparation: Amplicon

Amplify region of interest using PCR

Primers contain adapter sequences
Library Preparation: Mate Pair


Begin with large fragments (e.g. 3 kb, 20 kb)
Circularize and fragment again
- Illumina: direct ligation
- 454: Cre/Lox recombination
Enrich for fragments containing the junction
Proceed with shotgun library prep
Library Preparation: Mate Pair


Why? Paired sequences are a known distance apart; improves genome assembly
Note: 454 calls these “paired end libraries”, not to be confused with Illumina’s “paired end sequencing”!
Sequencing: Illumina

Cluster generation
- Library fragments hybridize to oligos on the flow cell
- New strand synthesized, original denatured, removed
- Free end binds to adjacent oligos (bridge formation)
- Complementary strand synthesized, denatured (both tethered to flow cell)
- Repeat to form clonal cluster
- Cleave one oligo, denature to leave ssDNA clusters
~800K clusters/mm^2
Sequencing: Illumina

Variety of workflows:
- Single- or paired end reads
- 0, 1, or 2 index reads
Sequencing: Illumina




At each cycle, all 4 fluorescently-labeled nucleotides pass over the flow cell
Each cluster incorporates one nt (terminator) per cycle
Fluor is imaged, then cleaved
De-block and repeat
Sequencing: Illumina

Other terminology:
- cBot – accessory instrument that performs cluster generation
- Lanes – divisions (8) of HiSeq and GAIIx flow cells
- PhiX – bacteriophage with small, balanced genome; PhiX library spiked in with samples for QC
- Phasing/pre-phasing – nt incorporation falls behind or jumps ahead on a portion of strands in the cluster and contributes to noise
- Chastity filter – measures signal purity (after intensity corrections); if the background signal is high, cluster will be discarded
- BaseSpace – cloud computing site for processing MiSeq data
File format: fastq
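
A fastq record is four lines: an `@` header, the bases, a `+` separator, and one quality character per base. The sketch below reads such records and decodes the quality string. It assumes the Phred+33 encoding and unwrapped (single-line) sequences; older Illumina pipelines used a different offset, so check your files. The function names are illustrative, not part of any vendor tool.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record fastq."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:], seq, qual


def decode_quality(qual, offset=33):
    """Convert quality characters to integer Q scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


example = StringIO("@read1\nACGT\n+\nII5!\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, decode_quality(qual))  # read1 ACGT [40, 40, 20, 0]
```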
Sequencing: 454


emPCR: clonal amplification of bead-bound library in microdroplets
Library input amounts critical!
- One molecule per bead
- Titration procedure
Sequencing: 454



Library capture: beads coated with complementary oligo
Amplification: droplet contains PCR reagents and the other oligo
Post-PCR: millions of identical fragments attached to the bead
Sequencing: 454

Bead recovery: physical and chemical disruption

Enrichment: capture successfully amplified beads using biotinylated primers + magnetic streptavidin beads
Sequencing: 454

Deposit bead layers onto PicoTiterPlate:
- Enzyme beads
- Enriched DNA beads
- More enzyme beads
- PPiase beads
Sequencing: 454

Pyrosequencing
- 4 nucleotides flow separately
- If nt incorporation → PPi → light
  - APS + PPi → ATP (sulfurylase)
  - Luciferin + ATP → light + oxyluciferin (luciferase)
- Amount of light proportional to # nt incorporated
- Rinse and repeat with next nt
Sequencing: 454

Camera captures light emitted from every well during every nucleotide flow
Sequencing: 454

Flowgram: representation of a sequence, based on the
pattern of light emitted from a single well
Sequencing: 454

Other terminology:
- Lib-L/Lib-A: adapter variants, “ligated” or “annealed”
- Titanium chemistry: ~450 bp reads on all instruments
- XL+ chemistry: ~700 bp reads on the FLX+ instrument
- Flow: one of the four nucleotides flows over the PTP
- Cycle: a set of four flows, in order
- Valley flow: if number of bases incorporated in a given read during that flow is uncertain, e.g. 1.5 units of light (background signal, homopolymers)
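
To make flows, cycles, and valley flows concrete, here is a minimal sketch that turns a list of per-flow light signals into bases. It is not the 454 base caller: the flow order `TACG`, the rounding rule, and the 0.3 to 0.7 "valley" window are all assumptions chosen for illustration, and the signal values are made up.

```python
FLOW_ORDER = "TACG"  # assumed nucleotide flow order; one pass through it is a cycle


def call_flowgram(signals, flow_order=FLOW_ORDER, valley=(0.3, 0.7)):
    """Return (sequence, valley_flow_indices) from per-flow light signals.

    Each signal is rounded to the nearest whole number of incorporated bases.
    Flows whose fractional part falls inside `valley` are reported as uncertain.
    """
    bases = []
    valley_flows = []
    for i, signal in enumerate(signals):
        signal = max(signal, 0.0)            # negative background treated as zero
        nt = flow_order[i % len(flow_order)]
        whole = int(signal)
        frac = signal - whole
        if valley[0] <= frac <= valley[1]:
            valley_flows.append(i)           # e.g. 1.5 units of light
        n_bases = whole + (1 if frac >= 0.5 else 0)
        bases.append(nt * n_bases)           # homopolymer of length n_bases
    return "".join(bases), valley_flows


# Two cycles (eight flows) of invented signal values.
signals = [1.02, 0.05, 2.10, 0.98, 0.03, 1.50, 0.00, 3.04]
sequence, uncertain = call_flowgram(signals)
print(sequence, uncertain)  # TCCGAAGGG [5]
```

The flow at index 5 reads 1.50, so the call of "AA" could just as well be "A"; this is why homopolymer length is the characteristic error mode of flow-based chemistries.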


File format: sff (standard flowgram format)
Sequencing: Ion Torrent



Procedures and chemistry similar to 454
Instead of PPi, measure H+ release (pH change) via semiconductor chip
No expensive camera or laser required, no modified nucleotides
Sequence Quality
Phred (Q) Score  Probability of Error (P)  Base Call Accuracy
10               1 in 10                   90%
20               1 in 100                  99%
30               1 in 1K                   99.9%
40               1 in 10K                  99.99%
50               1 in 100K                 99.999%



Error probabilities determined using training sets, platform-specific biases
Expressed as a quality value (QV or Q score) per base
Similar to PHRED scores:
- $Q = -10 \log_{10} P$
- $P = 10^{-Q/10}$
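
The two formulas are inverses of each other, which is easy to check numerically. The short sketch below converts in both directions and reproduces the Q10 to Q50 rows of the quality table; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P): quality score for an error probability."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):.3f}%")
```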

Project 1: Microbial Genome

Considerations:
- Reference genome?
- How much coverage do I want?
- How big is the genome?
- How much data do I need?
  - bp needed = genome size × coverage
- Which instrument/chemistry configuration to use?
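
As a worked example of "bp needed = genome size × coverage", the sketch below sizes a hypothetical project. The 5 Mb genome, 50X target, and 150 bp read length are assumptions for illustration; the ~2 Gb run output is the MiSeq figure from the instrument comparison earlier. It ignores reads lost to filtering, PhiX spike-in, and uneven pooling, so treat the result as an upper bound.

```python
import math


def bases_needed(genome_size_bp, coverage):
    """bp needed = genome size x coverage."""
    return genome_size_bp * coverage


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads of a given length that supply the required bases."""
    return math.ceil(bases_needed(genome_size_bp, coverage) / read_length_bp)


def genomes_per_run(run_output_bp, genome_size_bp, coverage):
    """How many barcoded genomes fit in one run at the target coverage."""
    return run_output_bp // bases_needed(genome_size_bp, coverage)


GENOME_BP = 5_000_000            # hypothetical 5 Mb bacterial genome
COVERAGE = 50                    # hypothetical target depth
MISEQ_OUTPUT_BP = 2_000_000_000  # ~2 Gb per run, per the instrument table

print(bases_needed(GENOME_BP, COVERAGE))                      # 250000000
print(reads_needed(GENOME_BP, COVERAGE, 150))                 # 1666667
print(genomes_per_run(MISEQ_OUTPUT_BP, GENOME_BP, COVERAGE))  # 8
```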

Coverage
- Depth: number of times a particular base is “covered” by a read (e.g. 25X)
- Breadth: % of genome with at least 1X coverage
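
Depth and breadth can both be computed from where reads land on the genome. The sketch below assumes reads have already been aligned and are given as 0-based, half-open (start, end) intervals; the toy 10 bp "genome" and the intervals are invented for illustration.

```python
def per_base_depth(genome_length, alignments):
    """Count how many aligned reads cover each position.

    alignments: iterable of (start, end) intervals, 0-based, end exclusive.
    """
    depth = [0] * genome_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    return depth


def mean_depth(depth):
    """Average depth across the genome (the "25X" style figure)."""
    return sum(depth) / len(depth)


def breadth(depth, min_depth=1):
    """Percent of positions covered by at least min_depth reads."""
    return 100 * sum(d >= min_depth for d in depth) / len(depth)


depth = per_base_depth(10, [(0, 4), (2, 8), (3, 6)])
print(depth)              # [1, 1, 2, 3, 2, 2, 1, 1, 0, 0]
print(mean_depth(depth))  # 1.3
print(breadth(depth))     # 80.0
```

A high mean depth does not guarantee full breadth: the last two positions above are never covered.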

Project 1: Microbial Genome

Sample preparation
- Isolate high quality (not degraded) and high purity (no RNA) gDNA
- Verify on a gel
- Quantify using dsDNA-specific dye

Library preparation
- Can do this yourself if you like
- ~$200 per sample for Nextera
  - Cheaper protocols
  - Cheaper in bulk
- Barcode compatibility
Project 1: Microbial Genome

Library QC
- Insert size confirmed on BioAnalyzer (within range, no artifacts)
- Pool barcoded libraries (normalize based on PicoGreen quantification)
- Absolute quantification of library pools using qPCR
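
Normalizing before pooling means converting each library's mass concentration and mean fragment size into molarity, then taking equal molar amounts. The sketch below uses the common approximation of 660 g/mol per base pair of dsDNA; the library names, concentrations, and sizes are invented, and a real workflow would follow the kit or facility protocol.

```python
def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Convert a dsDNA concentration (ng/uL) to nM for a given mean fragment size.

    Uses ~660 g/mol per bp, an approximation for double-stranded DNA.
    """
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_fragment_bp)


def equimolar_volumes_ul(libraries, fmol_each):
    """Volume of each library that contributes fmol_each femtomoles.

    libraries: dict of name -> (ng/uL, mean fragment size in bp).
    1 nM equals 1 fmol/uL, so volume = fmol / nM.
    """
    return {
        name: fmol_each / library_molarity_nM(conc, size)
        for name, (conc, size) in libraries.items()
    }


# Hypothetical PicoGreen readings and BioAnalyzer sizes.
libraries = {"lib_A": (6.6, 500), "lib_B": (13.2, 500), "lib_C": (6.6, 250)}
for name, volume in equimolar_volumes_ul(libraries, fmol_each=40).items():
    print(f"{name}: {volume:.2f} uL")
```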
Project 1: Microbial Genome

MiSeq sequencing
- Dilute and denature library pool (optimal concentration requires titration...)
- Spike in PhiX library as needed (e.g. 1%)
- Prepare and load reagents, flow cell
- Basic filtering and de-multiplexing performed automatically
- Download fastq files from BaseSpace
Project 1: Microbial Genome

Data processing
- Additional filtering
- Trim the ends
- Remove PCR duplicates
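
Two of these steps are simple enough to sketch. The trimming function below removes low-quality bases from the 3' end, and the duplicate filter keeps the first copy of each identical sequence. Both are simplifications: the Q20 cutoff is an assumed threshold, and dedicated tools usually identify PCR duplicates from mapping positions rather than exact sequence identity.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose quality is below min_q (assumed cutoff)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def drop_exact_duplicates(reads):
    """Keep the first occurrence of each sequence, preserving input order."""
    seen = set()
    unique = []
    for seq in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append(seq)
    return unique


print(trim_3prime("ACGTAC", [38, 37, 35, 30, 12, 8]))  # ('ACGT', [38, 37, 35, 30])
print(drop_exact_duplicates(["ACGT", "TTGA", "ACGT"]))  # ['ACGT', 'TTGA']
```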

Assembly: overlapping reads are joined to each other based on sequence similarity = contigs
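
The idea of building contigs from overlaps can be shown with a toy greedy merger: repeatedly find the pair of sequences with the longest exact suffix-to-prefix overlap and join them. This is a teaching sketch only. Real assemblers tolerate sequencing errors, handle both strands and repeats, and use far more efficient data structures; the reads and the 3 bp minimum overlap are invented.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no overlaps remain; return contigs."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap_length(a, b, min_overlap)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
```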
Project 1: Microbial Genome

What’s next?
- Polish the genome (hybrid assemblies, mate pair libraries)
- Annotate (ORFs, RNAseq)
- Compare
Project 2: Microbial Community

Shotgun metagenomics
- Unbiased survey of community content
- Random library fragments may provide very little taxonomic resolution (e.g. conserved, unknown)
- Identify genes, classify by function

Targeted metagenomics
- Limited survey of community content
- Targeted loci provide excellent taxonomic resolution, but may exclude certain taxa
- Identify OTUs, classify by taxonomy
Project 2: Microbial Community




16S rRNA
- Multi-copy gene (1.5 kb)
- Conserved and hypervariable regions
- Extensive databases from known species
Project 2: Microbial Community

Considerations:
- Biases in sampling methods, culturing, DNA isolation, PCR...replicate
- Available SOPs
- How many reads per sample?
- Read length matters!

Sample preparation:
- Isolate DNA
- PCR amplify, purify
- High-fidelity polymerase
- Barcoded primers
- No primer dimers!
- Normalize PCR products and pool
Project 2: Microbial Community

454 Sequencing
- emPCR titrations with different library input
- Bulk emPCR
- Sequence
- Basic filtering
- Collect sff files

Data processing
- De-multiplexing
- Additional filtering
- Trim the barcodes, primers
- Check for chimeras
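
De-multiplexing and trimming can be sketched together: each read begins with a sample barcode followed by the PCR primer, so matching that prefix both assigns the read to a sample and tells you how much to trim. The barcodes, primer stub, and reads below are placeholders, and the match is exact; production pipelines usually allow mismatches and degenerate primer bases.

```python
def demultiplex(reads, barcodes, primer):
    """Assign reads to samples by barcode and strip barcode + primer.

    reads:    iterable of read sequences
    barcodes: dict of barcode sequence -> sample name
    primer:   forward primer expected right after the barcode
    Returns (dict of sample -> trimmed reads, list of unassigned reads).
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcodes.items():
            prefix = barcode + primer
            if read.startswith(prefix):
                by_sample[sample].append(read[len(prefix):])
                break
        else:  # no barcode matched
            unassigned.append(read)
    return by_sample, unassigned


BARCODES = {"ACGT": "sampleA", "TGCA": "sampleB"}  # placeholder barcodes
PRIMER = "GTGCCAGC"                                # placeholder primer stub
reads = [
    "ACGTGTGCCAGCTTTTAAAA",
    "TGCAGTGCCAGCGGGGCCCC",
    "NNNNGTGCCAGCAAAA",
]
assigned, unassigned = demultiplex(reads, BARCODES, PRIMER)
print(assigned)    # {'sampleA': ['TTTTAAAA'], 'sampleB': ['GGGGCCCC']}
print(unassigned)  # ['NNNNGTGCCAGCAAAA']
```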
Project 2: Microbial Community

Clustering
- Sequences grouped by similarity = OTUs
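
One simple way to group sequences by similarity is greedy centroid clustering: each sequence joins the first existing cluster whose representative it matches above a threshold, otherwise it starts a new cluster. The sketch below makes strong assumptions that are not from the slides: reads are trimmed to equal length so identity is a plain position-by-position comparison, and the commonly used 97% cutoff is the default. Real OTU pickers align sequences and typically process them in abundance order.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(sequences, threshold=0.97):
    """Greedy clustering; returns a list of (centroid, members) pairs."""
    centroids = []
    members = []
    for seq in sequences:
        for idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                members[idx].append(seq)
                break
        else:  # no centroid is similar enough: start a new OTU
            centroids.append(seq)
            members.append([seq])
    return list(zip(centroids, members))


# Toy 10 bp reads, clustered at 90% so one mismatch is tolerated.
toy_reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
for centroid, group in greedy_otus(toy_reads, threshold=0.9):
    print(centroid, len(group))
```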
Project 2: Microbial Community

Taxonomic identification
- OTUs are classified by comparing to known 16S sequences
- Level of classification (e.g. family vs genus)?

Diversity
- Within sample
- Between samples
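
The slides do not name specific diversity measures. As one possible illustration, the sketch below computes the Shannon index for within-sample diversity and Bray-Curtis dissimilarity for between-sample comparison from OTU count vectors. Both are common choices, but they are assumptions here, and the counts are invented.

```python
import math


def shannon_index(counts):
    """Within-sample diversity: H = -sum(p_i * ln(p_i)) over observed OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(counts_a, counts_b):
    """Between-sample dissimilarity: 0 = identical counts, 1 = no shared OTUs."""
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    denominator = sum(counts_a) + sum(counts_b)
    return numerator / denominator


# Hypothetical OTU counts; positions refer to the same OTUs in both samples.
sample_1 = [50, 30, 20, 0]
sample_2 = [10, 10, 40, 40]
print(round(shannon_index(sample_1), 3))
print(round(shannon_index(sample_2), 3))
print(round(bray_curtis(sample_1, sample_2), 3))
```

Counts should be comparable between samples (for example, after subsampling to equal read depth) before measures like these are interpreted.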
