Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD jtremblay@lbl.gov 16S rRNA as phylogenetic marker gene 21 proteins 16S rRNA 30S 70S Ribosome subunits 50S 34 proteins 5S rRNA 23S rRNA highly conserved between different species of bacteria and archaea Escherichia coli 16S rRNA Primary and Secondary Structure Falk Warnecke 16S rRNA in environmental microbiology (Sanger clone libraries) 900-1100 bp length Falk Warnecke Next generation sequencing (NGS) Illumina 0.5M 450bp reads 10-350M 150bp reads/lane $$ $ Read length 454 Throughput and cost Illumina tags (itags) 454 Illumina ~400 bp ACGTGGTACTACGTGATAGTGTAT ~250 bp • 454 = “1” read • Illumina = “2” reads => have to be assembled • Both reads need to be of good quality Game plan to survey microbial diversity V1 V2 V3 V4 V5 V6 V7 V8 V9 16S rRNA Generate amplicons of a given variable region from bacterial community (many millions of sequences) Reduce dataset by dereplication/clustering X 10 X1 X 1,000 X 2,000 X 200 X 1,200 X 800 X 10,000 Why amplicon tags ? Deeper, cheaper, faster Identification (BLAST, RDP classifier) Rare biosphere Abundance High sequencing depth of NGS reveals “rare” OTUs Lots of reads only present once in sample… Rare biosphere Rank Sequencing error? Chimeras? Background noise? Rare bias sphere? Is rare biosphere an artifact of the NGS error? Control experiment: estimate rare biosphere in a single strain of E.coli V1 & V2 27F 342R V8 1114F 1392R Should not be a sequencing artifact, if relatively stringent clustering parameters are applied Subject to controversy – Is rare always real? Kunin et al., (2009), Environ. Microbiol. Quince et al., (2009), Nat. Methods Illumina tags (itags) • Typical 454 run 450,000 – 500,000 reads • “Typical” Illumina run: • GAIIx 10,000,000 – 40,000,000 reads/lane • Hiseq ~ 350,000,000 reads/lane • Miseq (new) ~8,000,000 reads/lane • Move 16S tags sequencing to Illumina platform • HiSeq = huge output compared to 454 (suitable for big projects 1000+ indexes(barcodes)/libraries • MiSeq = moderatly high throughput (More suitable) • throughput more efficient clustering algorithm (SeqObs). 16S tags clustering Edward Kirton, JGI Number of reads >> number of clusters Illumina rRNA Amplicon Sequencing 30 30000000 25 20 15000000 15 10000000 10 5 5000000 00 RAW BARCODE OVERLAP ASSEM ~100,000 clusters 20000000 Clustering happens here! Number of Sequences Number of reads (millions) 25000000 CLUSTERS Edward Kirton, JGI Validation of 16S tags on MiSeq • Quality is superior in MiSeq 454 MiSeq reads 1 and 2 separately MiSeq reads 1 and 2 assembled MiSeq validation • Exploratory experiments using 11 wetlands samples. • Validate reproducibility between runs MiSeq validation • Beta diversity (UniFrac Distances) Run 1 Run 2 MiSeq validation • What are the gains using MiSeq assembled paired-end reads over 454 reads? Average bootstrap value for all clusters at every tax level. Average bootstrap value • 454 clusters shows higher confidence than MiSeq clusters Better quality in MiSeq reads, but lower read lengths Taxonomic level Comparing 454 with MiSeq • What are the gains using MiSeq assembled pairedend reads over 454? Clusters having > 0.50 bootstrap value For instance, ~310,000 reads made it to the class level MiSeq outperforms 454 in terms of read depth itags – rare OTUs MiSeq wetlands test samples Low abundant reads consistently shows low confidence in Classification. Low abundant reads = errors, artifacts? Low abundant reads are underrepresented in databases? Comparing 454 with illumina • Compare runs of 454 and MiSeq of same sample • Although challenge to compare V4 with V6-8 region. 515 806 926 ~291 bp 1392 ~466 bp Comparing 454 with illumina • Primer pair of variable region is likely to affect outcome of results. In silico PCR on 16S Greengenes database. PyroTagger (for 454 amplicons) Unzip, validate Remove low-quality reads Redundancy removal PyroClust & Uclust Remove chimeras Samples comparison, post-processing pyrotagger.jgi-psf.org Classification and barcode separation • Sequences of cluster (OTU) representatives 100% 90% 80% 70% 60% • Blast vs GreenGenes and Silva databases, dereplicated at 99.5% 50% 40% 30% 20% 10% 0% • Distribution of microbial phyla in the dataset C lus ter1 C lus ter2 C lus ter3 C lus ter4 C lus ter5 C lus ter7 C lus ter8 C lus ter9 C lus ter1 0 C lus ter1 3 C lus ter1 5 C lus ter1 7 C lus ter1 8 C lus ter1 9 C lus ter2 0 C lus ter2 1 C lus ter2 2 C lus ter2 3 C lus ter2 4 C lus ter2 9 C lus ter3 1 C lus ter3 3 C lus ter3 4 C lus ter3 5 C lus ter4 1 C lus ter4 4 C lus ter5 0 C lus ter5 3 C lus ter6 7 1 2 6732 1 1 3464 6303 1 4464 8836 4 4218 2628 1111 1 648 1737 1 2676 1 4706 828 1 1353 2303 1446 4062 2593 1098 1 150 1203 86 247 1079 625 772 353 347 4 354 490 3 267 322 2330 58 5052 530 55 88 12 128 467 663 1629 23 2 147 138 722 1354 1321 479 98 322 165 378 6 64 33 2 4 5 7 14532 1981 1 7 726 2750 7 266 2304 1 7769 8102 1358 115 3971 2885 104 153 29 118 388 378 2777 8319 3 5204 43 1065 12 4 28 1680 28 7 139 1 2470 19 1436 17 592 1 543 758 6 % identity A lignment L ength M is matc hesG aps 9 7 .7 345 8 9 3 7 2 9 8 .8 345 4 100 345 0 2 1 5 3 9 7 .7 345 8 9 9 .7 345 1 4 6 9 0 9 9 .1 345 1 9 9 .7 345 1 1 3 9 6 .5 345 10 2 8 3 7 9 3 .3 345 23 11 100 345 0 2 9 8 .8 345 4 100 345 0 9 6 .8 347 9 9 6 0 9 9 .4 345 2 3214 100 345 0 9 7 .7 345 7 9 9 .1 345 3 9 8 .3 345 6 100 345 0 1 100 345 0 9 7 .1 345 10 8 6 9 6 .8 345 11 100 345 0 1 9 6 .8 345 11 9 6 .8 345 11 9 8 .8 346 3 3 100 345 0 100 345 0 9 7 .7 345 8 0 0 0 0 0 1 0 2 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 10GP1 Proteobacteria 1PS1 Metazoa 1PS2 Firmicutes Bacteroidetes 2A?1 Spirochaetes Q uery Start Q uery E nd H it Start H it E nd E - value Sc ore I D Full N ame T axonomy 1 345 1336 992 1 .0 0 E - 1 7 6 6 2 0 among geographic 2 1ally 0 3 1regions 9Bac teria c entralFirmic T ibet utes geothermal C los tridia s pring mat C los c lone tridiales D T MC4los 2 tridiac eae 1 345 1247 903 0 6 5 2 M ic robial c ons ortia 1 3 4fermentor 8 0 0Bac teria methanogenic Firmic utes bioreac C tor los tridia c lone E BR C los -0 2tridiales E - 0 4 3 6C oproc oc c us 1 345 1345 1001 0 6 8 4 Bac teroides s p. s tr. 7 3265833Bac c teria Bac teroidetesBac teroidetesBac (c las teroidales s) Bac teroidac eae Bac teroides 1 345 1338 994 1 .0 0 E - 1 7 6 6 2 0 G eobac illus s p. D3169654 1 9Bac teria Firmic utes Bac illales Bac illac eae G eobac illus 1 345 1382 1038 0 6 7 6 E lec tric igen E nric 2 hment 2 6 5 8 1Bac M FC teria full-s c ale Bac anaerobic teroidetesbioreac Bac teroidales tor s ludge Bac teroidac treatingeae brewery P revotellac waseae te c lone 3 1 f0 6 1 345 1273 931 0 6 5 4 T hermoanaerobac 35 terium 6 9 2 2Bac s acteria c harolytic Firmic um s tr. utes B6 A C los tridia T hermoanaerobac T hermoanaerobac terales T hermoanaerobac terales Familyterium I I I . I nc ertae Sedis 1 345 1151 807 0 6 7 6 P ortugues e dry s1moked 0 0 9 7 9Bac s austeria ages (c houric os ) type Ribatejano is olate s tr. T e1 6 R 1 345 1365 1023 3 .0 0 E - 1 6 2 5 7 3 C los tridium s terc orarium 3 1 2 7 8Bac s tr. teria D SM 8 5Firmic 3 2 T utes C los tridia C los tridiales C los tridiac eae C los tridium 1 345 1288 944 8 .0 0 E - 1 4 1 5 0 2 pac ked- bed reac2tor 0 4c2lone 4 5Bac C FBteria 4 Firmic utes C los tridia 1 345 1388 1044 0 6 8 4 Bac illus c irc ulans 3 4s5tr. 4 1X3 3Bac teria Firmic utes Bac illales Bac illac eae Bac illus 1 345 1277 933 0 6 5 2 mes ophilic anaerobic 1 0 7 4BSA 6 6Bacdiges teriater c lone Firmic BSA utes 1 B-0C5los tridia P eptos treptoc oc c ac eae 1 345 1326 982 0 6 8 4 Bac illus s p. s tr. SL 3 3167175 9Bac teria Firmic utes Bac illales Bac illac eae Bac illus 1 345 1359 1013 8 .0 0 E - 1 6 9 5 9 5 A c tinobac ulum s p. 83 P0 1 1s1Bac tr. P teria 2 P _1 9 A c tinobac teria A c tinobac teridae A c tinomyc etales A c tinomyc ineae A c tinomyc etac A ceae tinobac ulum 1 345 1245 901 0 6 6 8 G uguan hot s pring 1 0is 3 olate 8 8 2Bac s tr. teria K1 L 1 Firmic utes C los tridia T hermoanaerobac teriales 1 345 1351 1007 0 6 8 4 C los tridium c ellulos 1 6i0 5 9Bac teria Firmic utes C los tridia C los tridiales C los tridiac eae C los tridium 1 345 1362 1019 3 .0 0 E - 1 7 4 6 1 3 C los tridiac eae bac 2 1terium 7 0 6 0Bac SNteria 021 Firmic utes C los tridia C los tridiales C los tridiac eae 1 345 1261 917 0 6 6 0 C los tridiac eae s tr. 284 07 Wc 9Bac teria Firmic utes C los tridia C los tridiales C los tridiac eae 1 345 1375 1031 0 6 3 6 intes tinal that ac1tivate 4 2 2 1 dietary 6Bac teria lignan s ec ois olaric ires inol digluc os ide human fec es is olate E D - M t6 1 /P Y G - s 6 anaerobic s tr. E D - M t6 1 /P Y G - s 6 1 345 1356 1012 0 6 8 4 Klebs iella pneumoniae 3 5 8 7 6s3Bac tr. FI teria U M S1 P roteobac teria G ammaproteobac E nterobac teria teriales E nterobac teriac Klebs eaeiella 1 345 1378 1034 0 6 8 4 P s eudomonas indic 3 8a2 1 7Bac teria P roteobac teria G ammaproteobac P s eudomonadales teria P s eudomonadac P s eudomonas eae 1 345 1274 930 8 .0 0 E - 1 7 2 6 0 5 C los tridium s p. s1tr. 0 2I 3 M2SN 8Bac U 4 teria 0011 Firmic utes C los tridia C los tridiales C los tridiac eae C los tridium 1 345 1347 1003 2 .0 0 E - 1 6 9 5 9 7 Symbiobac terium3 4 s p. 4 0s9tr. 8Bac KAteria 13 Firmic utes C los tridia C los tridiales C los tridiales Symbiobac Family XV Iterium I I . I nc ertae Sedis 1 345 1376 1032 0 6 8 4 human fec al c lone 2 0SJ 4 4T5U8_G Bac_0 teria 5 _2 6 P roteobac teria G ammaproteobac Betaproteobac teria SJteria T U _B_0 2 _4 5 1 345 1366 1022 3 .0 0 E - 1 5 9 5 6 3 on -A rc tic penins1ula 5 3 2Svalbard 9 1Bac teria N orwayFirmic determined utes genes C los tridia and rumen C losis tridiales olates reindeer RF3 0 fed pelleted RF6 c onc entrates (RF- 8 0 ) c lone A F 1 1 1 345 1379 1035 2 .0 0 E - 1 6 9 5 9 7 human fec al c lone 2 0SJ 4 1T1U7_C Bac_0 teria 3 _7 2 Bac teroidetesBac teroidalesBac teroidac eae 1 345 1348 1003 0 6 4 6 C los tridium jejuens 104 e 4s4tr. 7Bac H Yteria - 3 5 -1 2 T Firmic utes C los tridia C los tridiales C los tridiac eae C los tridium 1 345 1382 1038 0 6 8 4 E nteroc oc c us c as 31 s eliflavus 2 3 5 9Bac teria s tr. eS8 5Firmic 2 utes L ac tobac illales E nteroc oc c acEeae nteroc oc c us 1 345 1334 990 0 6 8 4 C los tridium s p. s3tr. 5 7BG 5 8-C 5Bac 6 6teria Firmic utes C los tridia C los tridiales C los tridiac eae C los tridium 1 345 1372 1028 1 .0 0 E - 1 7 6 6 2 0 mes ophilic anaerobic 2 2 2 7diges 3 3Bacter teria c lone GFirmic 3 5 _Dutes 8 _H _B_E C los 1 1tridia C los tridiales • Also see the Qiime pipeline Challenges • Short size of amplicon • What filtering parameters to use (stringency level)? • balance between stringency filter and keeping as much data as we can • Whole new dimension for rare biosphere? • Handling large numbers of sample (tens of thousand magnitude) • Sequencing run is fast, but library preparation time is long. • Cost of barcoded primers (will need lots of barcodes), handling. • Huge ammount of samples statistics models… Acknowledgments • • • • • • Susannah Tringe Edward Kirton Feng Chen Kanwar Singh Alison Fern And many others! Thanks! 16S rRNA Dangl lab, UNC