33. Accurate Estimation of Microbial Community Using Pyrotags

Accurate estimation of microbial
communities using 16S tags
Julien Tremblay, PhD
jtremblay@lbl.gov
16S rRNA as phylogenetic marker gene
70S ribosome subunits:
• 30S: 16S rRNA, 21 proteins
• 50S: 5S rRNA, 23S rRNA, 34 proteins
highly conserved between different species
of bacteria and archaea
Escherichia coli
16S rRNA
Primary and Secondary Structure
Falk Warnecke
16S rRNA in environmental microbiology
(Sanger clone libraries)
900-1100 bp length
Falk Warnecke
Next generation sequencing (NGS)
• 454: 0.5M 450 bp reads
• Illumina: 10-400M 150 bp reads/lane
[Figure: read length vs. throughput for 454 and Illumina, with relative cost marked as $$ and $]
Game plan to survey microbial diversity
[Figure: 16S rRNA gene with variable regions V1 to V9]
Generate amplicons of a
given variable region
from bacterial community
(many millions of sequences)
Reduce dataset by
dereplication/clustering
[Figure: example cluster sizes after dereplication/clustering: ×10, ×1, ×1,000, ×2,000, ×200, ×1,200, ×800, ×10,000]
Amplicon tags =
Deeper, cheaper, faster
Identification
(BLAST, RDP classifier)
Rare biosphere
Abundance
High sequencing depth of NGS  reveals “rare” OTUs
Rare biosphere
Rank
Sequencing error? Chimeras? Background noise?
Relatively small size of amplicons
Rare bias sphere?
Is rare biosphere an artifact of the NGS error?
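The rank/abundance view of the rare biosphere can be sketched in a few lines. This is a minimal illustration only: it reuses the example cluster sizes from the game-plan slide, and the 0.1% cutoff for calling an OTU "rare" is an assumption made for the example, not a value given in the talk.

```python
def rank_abundance(cluster_sizes):
    """Return (rank, size, relative abundance) tuples, most abundant first."""
    total = sum(cluster_sizes)
    ordered = sorted(cluster_sizes, reverse=True)
    return [(rank, size, size / total)
            for rank, size in enumerate(ordered, start=1)]


def rare_otus(cluster_sizes, threshold=0.001):
    """Sizes of clusters whose relative abundance is below `threshold`.

    The 0.1% default is a hypothetical cutoff chosen for illustration.
    """
    total = sum(cluster_sizes)
    return [size for size in cluster_sizes if size / total < threshold]


# Example cluster sizes taken from the game-plan slide.
sizes = [10, 1, 1000, 2000, 200, 1200, 800, 10000]

for rank, size, fraction in rank_abundance(sizes):
    print(f"{rank}\t{size}\t{fraction:.4%}")

print("rare:", rare_otus(sizes))
```

Whether the low-abundance tail is real or made of sequencing errors and chimeras is exactly the question the following slides address; the sketch only shows how the tail is identified.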
Control experiment: estimate rare biosphere
in a single strain of E.coli
V1 & V2
27F
342R
V8
1114F
1392R
It should not, if relatively stringent clustering parameters are
applied
Subject to controversy – Is rare always real?
Kunin et al., (2009), Environ. Microbiol.
Quince et al., (2009), Nat. Methods
PyroTagger (for 454 amplicons)
Unzip, validate
Remove low-quality reads
Redundancy removal
PyroClust & Uclust
Remove chimeras
Samples comparison,
post-processing
pyrotagger.jgi-psf.org
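As an illustration of the "remove low-quality reads" step, here is a minimal sketch of a quality filter. It is not PyroTagger's actual code: the function names, the rule (reject reads that are too short, contain N, or have too many low-quality bases), and all thresholds are assumptions chosen for the example.

```python
def low_quality_fraction(quals, q_cut):
    """Fraction of bases whose Phred score is below q_cut."""
    return sum(1 for q in quals if q < q_cut) / len(quals)


def keep_read(seq, quals, min_len=4, q_cut=20, max_low_frac=0.1):
    """Decide whether a read survives filtering.

    All thresholds are hypothetical defaults, not PyroTagger settings.
    """
    if not seq or len(seq) < min_len or len(seq) != len(quals):
        return False
    if "N" in seq.upper():  # ambiguous base call
        return False
    return low_quality_fraction(quals, q_cut) <= max_low_frac


def filter_reads(reads, **thresholds):
    """Keep (name, seq, quals) records that pass keep_read."""
    return [r for r in reads if keep_read(r[1], r[2], **thresholds)]


reads = [
    ("r1", "ACGTACGTAC", [30] * 10),           # clean read
    ("r2", "ACGTNCGTAC", [30] * 10),           # contains N
    ("r3", "ACGTACGTAC", [30] * 8 + [5, 5]),   # 20% low-quality bases
    ("r4", "ACG", [30] * 3),                   # too short
]
print([name for name, _, _ in filter_reads(reads)])
```

The balance between stringency and retained data is a tuning decision; stricter thresholds remove more error-derived clusters at the cost of discarding more reads.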
Classification and barcode separation
• Sequences of cluster (OTU) representatives
• BLAST vs. Greengenes and SILVA databases, dereplicated at 99.5%
• Distribution of microbial phyla in the dataset
[Figure: distribution of microbial phyla, axis from 0% to 100%]
[Table: example PyroTagger output, one row per cluster (Cluster1 to Cluster67). Columns: read counts per sample, followed by the best BLAST hit for each cluster representative: % identity, alignment length, mismatches, gaps, query start, query end, hit start, hit end, E-value, score, ID, full name, taxonomy. Alignment lengths are about 345 bp and identities range from 93.3% to 100%. Most hits are Firmicutes (Clostridia, Bacillales, Lactobacillales), with Bacteroidetes, Proteobacteria, and Actinobacteria also present. Other labels legible in the extraction: 10GP1, 1PS1, 1PS2, 2A?1, and the phyla Proteobacteria, Metazoa, Firmicutes, Bacteroidetes, Spirochaetes.]
• Also see the QIIME pipeline
Illumina tags (itags)
• Typical 454 run: 450,000 – 500,000 reads
• “Typical” Illumina run:
• GAIIx: 10,000,000 – 40,000,000 reads/lane
• HiSeq: ~350,000,000 reads/lane
• MiSeq (available soon): ~4,000,000 reads/lane
• Move 16S tag sequencing to the Illumina platform
• HiSeq = huge output compared to 454 (suitable for big projects with 1000+ indexes (barcodes)/libraries)
• MiSeq = moderately high throughput (more suitable?)
• Higher throughput calls for a more efficient clustering algorithm (SeqObs).
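To make the multiplexing argument concrete, the short sketch below divides the per-lane read counts quoted above across 1,000 barcoded libraries. It assumes reads are spread evenly over libraries, which real runs only approximate; the function name is hypothetical.

```python
def reads_per_library(reads_per_lane, n_libraries):
    """Average reads per barcoded library, assuming an even split."""
    if n_libraries <= 0:
        raise ValueError("n_libraries must be positive")
    return reads_per_lane // n_libraries


# Per-lane figures quoted on the slide (GAIIx uses the upper figure).
platforms = {
    "GAIIx": 40_000_000,
    "HiSeq": 350_000_000,
    "MiSeq": 4_000_000,
}

for name, reads in platforms.items():
    print(f"{name}: {reads_per_library(reads, 1000):,} reads per library")
```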
Illumina tags (itags)
[Figure: example amplicon ACGTGGTACTACGTGATAGTGTAT as covered by each platform; 454: ~200-220 bp, Illumina: ~252 bp]
• 454 = “1” read
• Illumina = “2” reads => have to be assembled
• Both reads need to be of good quality
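The assembly of the two Illumina reads can be sketched as a simple overlap merge. This is a minimal illustration, not the pipeline's actual assembler: it tries the longest overlap first, allows no mismatches by default, and ignores quality scores. The function names, `min_overlap`, and `max_mismatch_frac` are assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(fwd, rev, min_overlap=10, max_mismatch_frac=0.0):
    """Merge a read pair into one sequence, or return None.

    The reverse read is reverse-complemented, then the longest suffix of
    `fwd` that matches a prefix of it is used as the overlap. Bases in the
    overlap are taken from the forward read.
    """
    rc = reverse_complement(rev)
    for overlap in range(min(len(fwd), len(rc)), min_overlap - 1, -1):
        tail, head = fwd[-overlap:], rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * overlap:
            return fwd + rc[overlap:]
    return None


# Toy amplicon: the forward read covers the first 22 bases, the reverse
# read covers the last 23 (read from the opposite strand).
amplicon = "ACGTGGTACTACGTGATAGTGTATCCGATTAGC"
forward_read = amplicon[:22]
reverse_read = reverse_complement(amplicon[10:])

print(merge_pair(forward_read, reverse_read) == amplicon)
```

A pair with no acceptable overlap returns `None`, which is why both reads need to be of good quality: errors in either read's overlap region cause the pair to be discarded.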
itags clustering
Sort by alphabetical order → 100% identity
Reduces dataset by 80%
[Figure: clustering at 97% identity]
Edward Kirton, JGI
Number of reads >> number of clusters
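The two clustering stages can be sketched as follows. This is a simplified illustration and not the SeqObs algorithm: exact duplicates are collapsed by sorting, then unique sequences are assigned greedily, most abundant first, to the first centroid they match at 97% or better. Identity is computed without gaps over equal-length sequences, which is an assumption made to keep the example short.

```python
def dereplicate(reads):
    """Collapse exact duplicates; sorting puts identical reads side by side."""
    unique = []
    for seq in sorted(reads):
        if unique and unique[-1][0] == seq:
            unique[-1][1] += 1
        else:
            unique.append([seq, 1])
    return [(seq, count) for seq, count in unique]


def identity(a, b):
    """Ungapped identity; 0.0 if lengths differ (simplification)."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(unique, threshold=0.97):
    """Greedy centroid clustering of (sequence, count) pairs."""
    clusters = []  # each entry: [centroid sequence, total reads]
    for seq, count in sorted(unique, key=lambda item: (-item[1], item[0])):
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster[1] += count
                break
        else:
            clusters.append([seq, count])
    return [(centroid, total) for centroid, total in clusters]


base = "ACGT" * 25                 # 100 bp toy sequence
near = "T" + base[1:]              # 1 mismatch: 99% identity to base
far = "TTTTT" + base[5:]           # 4 mismatches: 96% identity to base
reads = [base] * 5 + [near] * 2 + [far] * 3

unique = dereplicate(reads)
print(len(reads), "reads ->", len(unique), "unique ->",
      len(greedy_cluster(unique)), "clusters")
```

Dereplicating first means the expensive pairwise comparisons run on unique sequences only, which is where most of the data reduction comes from.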
[Figure: Illumina rRNA amplicon sequencing. Number of sequences (0 to 30 million) remaining after each pipeline stage: RAW, BARCODE, OVERLAP, ASSEM, CLUSTERS. Clustering happens at the final stage. Credit: Edward Kirton, JGI]
Benefits of parallelization
[Figure: SeqObs data size vs. runtime. Processing time (0 to 25 min.) plotted against number of reads (0 to 200 million). Credit: Edward Kirton, JGI]
MiSeq validation
• Exploratory experiments using 11 wetland samples.
• Validate reproducibility between runs
MiSeq validation
• Beta diversity (UniFrac Distances)
Run 1
Run 2
itags
Validating SeqObs output by comparing with pyrotagger results
• 454 → PyroTagger (V8 region)
• Illumina GAIIx → SeqObs pipeline (V4, V5 and V9 regions)
• Illumina MiSeq → SeqObs pipeline (V4 region)
Samples: synthetic communities, termite gut, surface sediments, compost, sludge
Comparing 454 with Illumina
GAIIx vs 454 region
Comparing 454 with Illumina
• The choice of primer pair and variable region is likely to affect the results.
In silico PCR on 16S Greengenes database.
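In silico PCR can be sketched as a pattern search: find the forward primer, then the reverse complement of the reverse primer downstream, and report what lies between them. The version below is a minimal sketch that supports IUPAC degenerate bases but allows no mismatches; the primer and template sequences are toy values invented for the example, not the 27F/342R or 1114F/1392R primers.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement, including IUPAC degenerate codes."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def primer_regex(primer):
    """Translate a (possibly degenerate) primer into a regex fragment."""
    return "".join(IUPAC[base] for base in primer.upper())


def in_silico_pcr(template, fwd_primer, rev_primer):
    """Return the predicted amplicon (primers included), or None.

    Exact matching only: a single primer mismatch means no product.
    """
    pattern = re.compile(
        primer_regex(fwd_primer)
        + "[ACGT]*?"
        + primer_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(template.upper())
    return match.group(0) if match else None


# Toy example with hypothetical primers.
template = "GGGACGCTGCCCCCTTTGAAGGG"
print(in_silico_pcr(template, "ACGYTG", "TTCRAA"))
```

Running each candidate primer pair over a reference database and counting which sequences yield a product gives an estimate of how primer choice biases the taxa that can be detected.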
itags – confidence level
[Figure: E values by platform and read length]
• 454: 220 bp
• GAIIx: ~110 bp
• MiSeq 5’ reads: 150 bp
• MiSeq assembled reads: ~250 bp
Challenges
• Short size of amplicon
• What filtering parameters to use (stringency level)?
• Balance between filter stringency and keeping as much data as we can
• Whole new dimension for rare biosphere?
• Handling large numbers of samples (on the order of tens of thousands)
• Cost and handling of barcoded primers (will need lots of barcodes)
• Huge number of samples → statistical models…
Acknowledgments
• Susannah Tringe
• Edward Kirton
• Feng Chen
• Kanwar Singh
• Rob Knight lab (Univ. of Colorado)
Thanks!
16S rRNA
Dangl lab, UNC