Supplementary Online Text
Materials and methods
Sample collection and processing. Two milliliters of saliva were collected from
each human-host individual into a tube containing an equal volume of lysis buffer.
Samples were stored at -80°C before high-salt DNA extraction (Quinque et al 2006).
Thirty microliters of proteinase K (20 mg/mL, Sigma, USA) and 150 μL of 10% SDS
were added to 2 mL of the saliva extraction buffer mixture, which was then incubated
overnight at 53 °C in a shaking water bath. After addition of 400 μL 5 M NaCl and 10
min incubation on ice, the mixture was equally distributed into two 2-mL centrifuge
tubes and centrifuged for 10 min at 13 000 rpm in an Eppendorf 5415D centrifuge.
The supernatant from each tube was transferred to a new tube, where 800 μL
isopropanol was added. The tubes were then incubated for 10 min at room
temperature and centrifuged for 15 min at 13 000 rpm. The supernatants were
discarded and then the DNA pellets were washed once with 500 μL 70% ethanol,
dried and dissolved in 30 μL double-distilled water.
Concentrations of the resulting total DNA were measured with a NanoVue (GE,
USA). DNA purity was determined by A260/A280. DNA integrity was verified by
agarose gel electrophoresis after ethidium bromide staining under ultraviolet light.
About 100 ng DNA was used for generating 16S rRNA gene PCR amplicons and 10 µg
DNA was required for shotgun library preparation and sequencing on Solexa GA-IIx.
DNA samples were stored at -20°C before further processing.
PCR amplification of microbial 16S rRNA genes. The V4/V5 hypervariable
region of bacterial 16S rRNA genes (Escherichia coli positions 515-907) was chosen
for PCR amplification due to the demonstrated higher accuracy of taxonomy
assignment (Claesson et al 2010, Liu et al 2008). Based on the sequences of this
region, paired PCR primers F515 (GTGCCAGCMGCCGCGG) and R907
(CCGTCAATTCMTTTRAGTTT) were selected (Zhou et al 2011). Both primers
were searched against the Ribosomal Database Project database (Cole et al 2009) and found to cover over
98% of all 16S rRNA gene sequences in the database. A sample tagging approach was
used for multiplexing samples in the sequencing reaction on a 454 PicoTiterPlate
(Binladen et al 2007, Hamady et al 2008). A unique 6-mer tag for each of the DNA
samples was added to the 5’-end of both primers; the tagged primers were synthesized
by Sangon (Shanghai, China) and used for generating PCR amplicons. The PCR
amplification mix contained 1.25 units of DNA polymerase (Pyrobest, Takara, Japan),
5 µl Pfu reaction buffer, 4 µl dNTPs (Takara, Japan), and 0.2 µM of each primer in a
total volume of 50 µl. Gel-purified genomic DNA (100 ng) was then added to each
amplification mix. Cycling conditions were an initial denaturation at 94°C for 3 min, 30
cycles of 95°C for 30 s, 58°C for 60 s and 72°C for 60 s, followed by a final 2-min
extension at 72°C. Typically, multiple (two to three) 50 µl reactions were needed for
each sample. DNA-amplicons were gel-purified and the concentrations were
determined using Agilent BioAnalyzer 2100 (Invitrogen, USA) and NanoVue
Spectrophotometer (GE, USA). The amplicons were pooled together in equimolar
ratios into a single tube before sequencing.
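The equimolar pooling step can be illustrated with a short calculation. The sketch below is not taken from the study: the function names, the example concentrations and the target amount are hypothetical, and it assumes the common approximation of 660 g/mol per base pair of double-stranded DNA.

```python
# Hypothetical helper for equimolar pooling; names and values are illustrative only.
AVG_BP_MASS = 660.0  # assumed average molar mass (g/mol) of one dsDNA base pair


def nanomolar(conc_ng_per_ul, length_bp):
    """Convert a dsDNA concentration in ng/µl to nmol/l for a given amplicon length."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)


def pooling_volumes(samples, fmol_each):
    """Return the volume (µl) of each amplicon that contributes `fmol_each` femtomoles.

    `samples` maps a sample name to (concentration in ng/µl, amplicon length in bp).
    """
    volumes = {}
    for name, (conc, length) in samples.items():
        nm = nanomolar(conc, length)  # 1 nM equals 1 fmol/µl
        volumes[name] = fmol_each / nm
    return volumes


if __name__ == "__main__":
    example = {"sample_A": (66.0, 400), "sample_B": (33.0, 400)}
    for name, vol in pooling_volumes(example, fmol_each=500).items():
        print(f"{name}: {vol:.2f} µl")
```

A sample at half the concentration of another contributes twice the volume, so that each amplicon is present in the pool at the same molar amount.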
Assessment of potential biases associated with barcodes. To validate
our barcoding methodology for the 16S rRNA gene PCR amplicons, a pilot dataset
was first generated that included 12 randomly chosen samples. For nine of the
samples, each was PCR-amplified twice, each time using a distinct barcode; the
remaining three samples were each amplified once using a unique barcode. The
resulting mixed pool of barcoded PCR amplicons was sequenced on 454 GS-Titanium.
Analysis based on MOTHUR version 1.10.0 (Schloss et al 2009) showed that
amplicons from the same sample that were amplified with different barcodes always
clustered together (Figure S2). The result thus validated the feasibility, reliability and
reproducibility of our barcoding methodology.
Data analysis. Both 16S rRNA gene amplicon reads from 454 GS-Titanium and
metagenomic shotgun reads from Solexa GA-IIx were processed via their respective
computational pipelines customized for human oral microbiome analysis (Turnbaugh
et al 2009, Xie et al 2010). All sequences were deposited into GenBank SRA
(Sequence Read Archive) under accession number SRP005094.
The Solexa GA-IIx reads were subjected to quality filtering using
FASTX-Toolkit (V0.0.10; http://hannonlab.cshl.edu/), which trimmed low-quality
ends: trailing nucleotides with quality scores below 20 were removed. Contaminating
human-host sequences were removed after mapping the reads to human genome
sequences using Novoalign (V2.06; http://www.novocraft.com); read pairs in which
both mates had mapping quality values exceeding 100 were labeled as contaminants.
Fragments of 16S rRNA genes were retrieved from the shotgun metagenomic
sequences and used for taxonomic assignment via PHYLOSHOP (Shah et al 2011).
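The 3'-end trimming rule applied to the Solexa reads (removal of trailing nucleotides with quality below 20) can be sketched as follows. This is only an illustration of the rule as stated, not the FASTX-Toolkit implementation; the function name is hypothetical and quality values are assumed to be already decoded to integer Phred scores.

```python
from typing import List, Tuple


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove trailing bases whose quality is below `min_q`.

    Scanning starts at the 3' end and stops at the first base that meets the
    threshold, so low-quality bases inside the read are left untouched.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    trimmed, _ = trim_3prime("ACGTAC", [30, 30, 25, 20, 19, 5])
    print(trimmed)
```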
Furthermore, the HMP oral reference genomes were downloaded from the HMP
DACC website (http://www.hmpdacc-resources.org/) and concatenated into one large
reference sequence. Alignment of Illumina reads (“short reads”) to the human oral
reference genomes was conducted using Bowtie (Langmead et al 2009) to assess the
abundance of these sequenced references in the sequenced microbiomes. Non-human
short reads were assembled using the SOAP de novo assembler (Li et al 2009). MetaGene
(Noguchi et al 2006) was used to predict ORFs from the contigs. The predicted ORFs
were aligned to each other using an in-house Perl script to construct a non-redundant
protein-coding gene set. The protein sequences were aligned against the reference
genomes using TBLASTN (e < 1e-5) to evaluate the genome coverage of the reference
genomes. MG-RAST annotation was performed using an e-value cut-off of 1e-5 to
identify abundant genes from the assembled reads (Glass et al 2010, Overbeek et al
2005).
The 16S rRNA gene-amplicon 454-read analysis pipeline consists of the
following components.
Quality trimming of reads. Relatively stringent quality-based trimming of 16S
rRNA gene 454-reads (Kunin et al 2010) was chosen to minimize effects of random
sequencing errors. Reads generated in three quarters of one run from 454
GS-Titanium were first removed if they were <150 bp, had a quality score <25, had
an ambiguous base-call (N), or did not contain the primer sequence (maximal edit
distance was 2). The remaining reads were then sorted by their tag sequences. The above
trimming steps were completed with the RDP Initial Process tool (http://pyro.cme.msu.edu).
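The filtering criteria above can be made concrete with a small sketch. It is not the RDP Initial Process code: all function names are hypothetical, the quality criterion is assumed to refer to the mean quality of a read, and each read is assumed to start with the 6-mer tag followed directly by the primer.

```python
from collections import defaultdict

# Degenerate IUPAC codes occurring in the two primers: M = A or C, R = A or G.
IUPAC = {"M": "AC", "R": "AG"}


def base_matches(primer_base, read_base):
    """True if the read base is compatible with the (possibly degenerate) primer base."""
    return primer_base == read_base or read_base in IUPAC.get(primer_base, "")


def edit_distance(primer, region):
    """Levenshtein distance between primer and read region, honouring IUPAC codes."""
    prev = list(range(len(region) + 1))
    for i, p in enumerate(primer, 1):
        cur = [i]
        for j, r in enumerate(region, 1):
            cost = 0 if base_matches(p, r) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def keep_read(seq, mean_quality, primer, tag_len=6, min_len=150, min_q=25, max_edits=2):
    """Apply the length, quality, ambiguity and primer criteria to one read."""
    if len(seq) < min_len or mean_quality < min_q or "N" in seq:
        return False
    region = seq[tag_len:tag_len + len(primer)]
    return edit_distance(primer, region) <= max_edits


def sort_by_tag(reads, primer, tag_len=6):
    """Group the reads that pass the filter by their leading tag sequence."""
    bins = defaultdict(list)
    for seq, mean_quality in reads:
        if keep_read(seq, mean_quality, primer, tag_len):
            bins[seq[:tag_len]].append(seq)
    return dict(bins)
```

Comparing the primer against a fixed-length window is a simplification; a production tool would also need to handle reads from the reverse primer and indels that shift the primer boundary.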
Assessment of microbial diversity based on Operational Taxonomic Units (OTUs).
To assign phylotypes to the tagged sequences, the trimmed reads were first clustered
using UCLUST (http://www.drive5.com/uclust/). An in-house Perl script was then
used to convert the UCLUST output into a format that MOTHUR (Schloss et al
2009) (http://www.mothur.org/) recognized so that alpha-diversity analysis could be
conducted. Reads were thus assigned to Operational Taxonomic Units (OTUs). The
species richness and diversity estimators (ACE, Chao1, Shannon Index, Simpson
Index) were calculated. Relative abundance of 97%-identity (i.e. species-level) OTUs
was compared between the healthy group and the caries-active group. Rarefaction
curves were generated and compared between the two groups based on a given
number of sequences (1 000 or 5 000) randomly selected from each dataset.
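For orientation, three of the estimators named above and a single rarefaction draw can be written out in a few lines. This is a sketch rather than the MOTHUR implementation: the function names are hypothetical, Chao1 is given in its bias-corrected form, the Simpson index is assumed to be the form sum n(n-1) / (N(N-1)), and ACE is omitted for brevity.

```python
import math
import random


def chao1(counts):
    """Bias-corrected Chao1 richness from per-OTU read counts."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over OTUs with non-zero counts."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def simpson(counts):
    """Simpson index D = sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    n = sum(counts)
    return sum(c * (c - 1) for c in counts) / (n * (n - 1))


def rarefied_richness(read_otus, depth, rng):
    """Number of OTUs observed in a random subsample of `depth` reads.

    `read_otus` lists the OTU label of every read; sampling is without replacement.
    """
    return len(set(rng.sample(read_otus, depth)))


if __name__ == "__main__":
    counts = [10, 5, 2, 2, 1, 1, 1]
    print(chao1(counts), round(shannon(counts), 3), round(simpson(counts), 3))
    reads = [f"otu{i}" for i, c in enumerate(counts) for _ in range(c)]
    print(rarefied_richness(reads, 10, random.Random(0)))
```

Repeating `rarefied_richness` over a range of depths and averaging over many draws gives the points of a rarefaction curve.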
To test whether there was a “core” microbiome shared across the saliva samples,
an in-house Perl script further processed the intermediate file generated by MOTHUR
to iterate the shared-OTU calculations. The script randomly picked two samples,
calculated the number of OTUs they shared, then added a third randomly chosen
sample, and so on. The calculation was performed with 100 iterations. The average
value and standard deviation of the number of shared OTUs were then derived and plotted.
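The shared-OTU iteration can be sketched as below. The original analysis used an in-house Perl script that is not reproduced here; this Python version only mirrors the described procedure, and its function names, the input layout (one set of OTU identifiers per sample) and the fixed random seed are assumptions.

```python
import random
from statistics import mean, stdev


def shared_otu_curve(sample_otus, rng):
    """Shared-OTU counts as samples are added one at a time in random order.

    The first value is the number of OTUs shared by two samples, the second by
    three samples, and so on.
    """
    order = list(sample_otus)
    rng.shuffle(order)
    shared = set(sample_otus[order[0]])  # copy, so the input is not modified
    curve = []
    for name in order[1:]:
        shared &= sample_otus[name]
        curve.append(len(shared))
    return curve


def core_summary(sample_otus, iterations=100, seed=0):
    """Mean and standard deviation of shared OTUs for each number of samples."""
    rng = random.Random(seed)
    curves = [shared_otu_curve(sample_otus, rng) for _ in range(iterations)]
    return [(mean(col), stdev(col)) for col in zip(*curves)]


if __name__ == "__main__":
    toy = {"s1": {1, 2, 3, 4}, "s2": {2, 3, 4, 5}, "s3": {3, 4, 6}}
    for n_samples, (avg, sd) in enumerate(core_summary(toy), start=2):
        print(n_samples, round(avg, 2), round(sd, 2))
```

The last point of the curve does not depend on the sampling order, since it is the intersection over all samples; the spread at intermediate points reflects how much the shared set varies with the choice of samples.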
Taxonomy assignment. We used both RDP Classifier Version 2.1 (Cole et al
2009) and BLAST for the taxonomy assignments, using both Human Oral
Microbiome Database 16S rRNA gene oral sequences (HOMD Version 10.1;
http://www.homd.org/) and RDP Taxonomy (http://rdp.cme.msu.edu/) as reference
databases. For the RDP Classifier, the confidence score threshold was set to 0.8, so
that sequences with a bootstrap value below 0.8 were assigned as unclassified. The
source code of RDP Classifier (http://sourceforge.net/projects/rdp-classifier/) was
customized for this study. Training of RDP Classifier required a FASTA file with
taxonomy annotation and a file with hierarchy information. Due to the inconsistency
in nomenclatures between NCBI and HOMD 16S rRNA gene databases, we generated
our own four-digit taxon-ID for the hierarchy file. The first digit denoted the
taxonomy level (domain:0, phylum:1, class:2, order:3, family:4, genus:5) and the last
3 digits denoted the serial number for each of the taxon-names.
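One possible way to generate such four-digit taxon-IDs is sketched below. The class name is hypothetical, and two details are assumptions that the text does not specify: serial numbers are counted separately within each level, and they start at 001.

```python
# Level digits as listed in the text.
LEVELS = {"domain": 0, "phylum": 1, "class": 2, "order": 3, "family": 4, "genus": 5}


class TaxonIdAllocator:
    """Assign a four-digit ID (level digit + three-digit serial) to each taxon name."""

    def __init__(self):
        self._next_serial = {level: 1 for level in LEVELS}  # assumed to start at 001
        self._ids = {}

    def get(self, level, name):
        """Return the ID for (level, name), allocating a new serial on first use."""
        key = (level, name)
        if key not in self._ids:
            serial = self._next_serial[level]
            if serial > 999:
                raise ValueError(f"more than 999 taxa at level {level!r}")
            self._ids[key] = f"{LEVELS[level]}{serial:03d}"
            self._next_serial[level] = serial + 1
        return self._ids[key]


if __name__ == "__main__":
    ids = TaxonIdAllocator()
    print(ids.get("domain", "Bacteria"))
    print(ids.get("phylum", "Firmicutes"))
    print(ids.get("genus", "Streptococcus"))
```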
The relative abundances of bacterial taxa at each of the levels of Phylum,
Class, Order, Family and Genus were calculated. The means between the healthy
group and the caries-active group and those between female and male subjects were
reported and compared.
Comparing microbial community structures. For every microbiome,
representative sequences were chosen from each OTU by selecting the longest
sequence based on UCLUST. Each sequence was then assigned to its closest relative
in a phylogeny of the Greengenes core set (DeSantis et al 2006) using BLAST’s
megablast protocol. The resulting sample ID mapping file and category mapping file
were used as inputs to unweighted Fast UniFrac (Hamady et al 2010). A distance
(a measurement of the similarity between the structures of two microbiotas) was thus
computed for each pair of microbiomes, both within a single group and across the two
groups. Principal Coordinates Analysis (PCoA; Lozupone and Knight 2005) was
performed based on the resulting matrices of pairwise distances among all of the
microbiomes.
Statistical analyses. Data were typically presented as mean±s.e.m.
Mann-Whitney and Student’s t-tests were used to identify statistically significant
differences using R (Version 2.11.1).
In cross-validating the results from the 16S rRNA gene-amplicon-based and the
whole-genome-based sequencing, relative abundances of the top 20 genera from 454
reads for each of the two samples were compared to the taxonomic distribution from
the corresponding Solexa reads. Correlation between the community diversity and
structure as determined by the two approaches was assessed by calculating the
Spearman correlation coefficient (r).
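As a reminder of what the Spearman coefficient measures, it is the Pearson correlation of the rank-transformed values. The sketch below is not the R code used in the study; the function names and the example abundances are hypothetical, and ties receive average ranks.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation between two abundance profiles."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    amplicon = [0.30, 0.22, 0.15, 0.10, 0.05]  # illustrative genus abundances
    shotgun = [0.28, 0.25, 0.12, 0.11, 0.02]
    print(round(spearman(amplicon, shotgun), 3))
```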
All statistical tests were two-sided with a significance level of p = 0.1. Statistical
tests were performed based on host dental health-state and host gender, respectively.
When appropriate, p values were corrected using Holm’s adjustment for multiple
comparisons.
We use asterisks to denote statistical significance (NS: not significant; *: p <0.1;
**: p <0.05; ***: p <0.01).
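Holm's step-down adjustment and the asterisk notation can be summarized in code. This is a minimal sketch with hypothetical function names, not the R code used for the analysis; the adjustment follows the usual definition in which the i-th smallest of m p values is multiplied by (m - i + 1), capped at 1 and forced to be non-decreasing.

```python
def holm_adjust(pvalues):
    """Return Holm-adjusted p values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def stars(p):
    """Map a p value to the notation NS, *, ** or *** used in the figures."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return "NS"


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03]
    for p_raw, p_adj in zip(raw, holm_adjust(raw)):
        print(p_raw, round(p_adj, 3), stars(p_adj))
```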

References

Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, Nielsen R et al (2007).
The use of coded PCR primers enables high-throughput sequencing of multiple
homolog amplification products by 454 parallel sequencing. PLoS One 2: e197.

Claesson MJ, Wang Q, O'Sullivan O, Greene-Diniz R, Cole JR, Ross RP et al (2010).
Comparison of two next-generation sequencing technologies for resolving highly
complex microbiota composition using tandem variable 16S rRNA gene regions.
Nucleic Acids Res.

Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ et al (2009). The Ribosomal
Database Project: improved alignments and new tools for rRNA analysis. Nucleic
Acids Res 37: D141-145.

DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K et al (2006).
Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible
with ARB. Appl Environ Microbiol 72: 5069-5072.

Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F (2010). Using the
metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold
Spring Harb Protoc 2010: pdb prot5368.

Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008). Error-correcting
barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods
5: 235-237.

Hamady M, Lozupone C, Knight R (2010). Fast UniFrac: facilitating high-throughput
phylogenetic analyses of microbial communities including analysis of pyrosequencing
and PhyloChip data. ISME J 4: 17-27.

Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010). Wrinkles in the rare
biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates.
Environ Microbiol 12: 118-123.

Langmead B, Trapnell C, Pop M, Salzberg SL (2009). Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biol 10: R25.

Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K et al (2009). SOAP2: an
improved ultrafast tool for short read alignment. Bioinformatics 25: 1966-1967.

Liu Z, DeSantis TZ, Andersen GL, Knight R (2008). Accurate taxonomy assignments
from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids
Res 36: e120.

Lozupone C, Knight R (2005). UniFrac: a new phylogenetic method for comparing
microbial communities. Appl Environ Microbiol 71: 8228-8235.

Noguchi H, Park J, Takagi T (2006). MetaGene: prokaryotic gene finding from
environmental genome shotgun sequences. Nucleic Acids Res 34: 5623-5630.

Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M et al
(2005). The subsystems approach to genome annotation and its use in the project to
annotate 1000 genomes. Nucleic Acids Res 33: 5691-5702.

Quinque D, Kittler R, Kayser M, Stoneking M, Nasidze I (2006). Evaluation of saliva
as a source of human DNA for population and association studies. Anal Biochem 353:
272-277.

Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB et al (2009).
Introducing mothur: open-source, platform-independent, community-supported
software for describing and comparing microbial communities. Appl Environ
Microbiol 75: 7537-7541.

Shah N, Tang H, Doak TG, Ye Y (2011). Comparing bacterial communities inferred
from 16S rRNA gene sequences and shotgun metagenomics. Pac Symp Biocomput:
165-176.

Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE et al
(2009). A core gut microbiome in obese and lean twins. Nature 457: 480-484.

Xie G, Chain PS, Lo CC, Liu KL, Gans J, Merritt J et al (2010). Community and gene
composition of a human dental plaque microbiota obtained by metagenomic
sequencing. Mol Oral Microbiol 25: 391-405.

Zhou J, Wu L, Deng Y, Zhi X, Jiang YH, Tu Q et al (2011). Reproducibility and
quantitation of amplicon sequencing-based detection. ISME J.