Supplementary Online Text

Materials and methods

Sample collection and processing

Two milliliters of saliva were collected from each human host into a tube containing an equal volume of lysis buffer. Samples were stored at -80 °C before high-salt DNA extraction (Quinque et al 2006). Thirty microliters of proteinase K (20 mg/mL, Sigma, USA) and 150 μL of 10% SDS were added to 2 mL of the saliva/extraction-buffer mixture, which was then incubated overnight at 53 °C in a shaking water bath. After addition of 400 μL of 5 M NaCl and a 10-min incubation on ice, the mixture was distributed equally into two 2-mL centrifuge tubes and centrifuged for 10 min at 13 000 rpm in an Eppendorf 5415D centrifuge. The supernatant from each tube was transferred to a new tube, to which 800 μL of isopropanol was added. The tubes were then incubated for 10 min at room temperature and centrifuged for 15 min at 13 000 rpm. The supernatants were discarded, and the DNA pellets were washed once with 500 μL of 70% ethanol, dried and dissolved in 30 μL of double-distilled water. Concentrations of the resulting total DNA were measured with a NanoVue spectrophotometer (GE, USA). DNA purity was determined by the A260/A280 ratio. DNA integrity was verified by agarose gel electrophoresis, with ethidium bromide staining and visualization under ultraviolet light.

About 100 ng of DNA was used for generating 16S rRNA gene PCR amplicons, and 10 μg of DNA was required for shotgun library preparation and sequencing on the Solexa GA-IIx. DNA samples were stored at -20 °C before further processing.

PCR amplification of microbial 16S rRNA genes

The V4/V5 hypervariable region of bacterial 16S rRNA genes (Escherichia coli positions 515-907) was chosen for PCR amplification because of its demonstrated higher accuracy of taxonomy assignment (Claesson et al 2010, Liu et al 2008). Based on the sequences of this region, the paired PCR primers F515 (GTGCCAGCMGCCGCGG) and R907 (CCGTCAATTCMTTTRAGTTT) were selected (Zhou et al 2011).
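Both primers contain degenerate positions (M in F515; M and R in R907), so each primer name stands for a small pool of concrete oligonucleotides. The following is a minimal illustrative sketch, not part of the original pipeline, that expands the two primer sequences given above using the standard IUPAC ambiguity codes. The function name `expand_primer` is hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_primer(primer):
    """Return every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# Primer sequences as given in the text.
F515 = "GTGCCAGCMGCCGCGG"
R907 = "CCGTCAATTCMTTTRAGTTT"

for name, primer in (("F515", F515), ("R907", R907)):
    variants = expand_primer(primer)
    print(name, len(variants), variants)
```

F515 expands to two sequences (M = A or C) and R907 to four (M = A or C; R = A or G).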
Both primers were searched against the Ribosomal Database (Cole et al 2009) and found to cover over 98% of all 16S rRNA gene sequences in the database. A sample-tagging approach was used to multiplex samples in the sequencing reaction on the 454 MicroTiterPlate (Binladen et al 2007, Hamady et al 2008). A unique 6-mer tag for each DNA sample was added to the 5'-end of both primers; the tagged primers were synthesized by Sangon (Shanghai, China) and used for generating PCR amplicons. The PCR amplification mix contained 1.25 units of DNA polymerase (Pyrobest, Takara, Japan), 5 µl Pfu reaction buffer, 4 µl dNTPs (Takara, Japan), and 0.2 µM of each primer in a total volume of 50 µl. Gel-purified genomic DNA (100 ng) was then added to each amplification mix. Cycling conditions were an initial denaturation at 94 °C for 3 min; 30 cycles of 95 °C for 30 s, 58 °C for 60 s and 72 °C for 60 s; and a final 2-min extension at 72 °C. Typically, multiple (two to three) 50-µl reactions were needed for each sample. The DNA amplicons were gel-purified, and their concentrations were determined using an Agilent BioAnalyzer 2100 (Invitrogen, USA) and a NanoVue spectrophotometer (GE, USA). The amplicons were pooled in equimolar ratios into a single tube before sequencing.

Assessment of potential biases associated with barcodes

To validate our barcoding methodology for the 16S rRNA gene PCR amplicons, a pilot dataset was first generated from 12 randomly chosen samples. Nine of the samples were each PCR-amplified twice, each time with a distinct barcode; the remaining three samples were each amplified once with a unique barcode. The resulting mixed pool of barcoded PCR amplicons was sequenced on the 454 GS-Titanium. Analysis with MOTHUR version 1.10.0 (Schloss et al 2009) showed that amplicons from the same sample that were amplified with different barcodes always clustered together (Figure S2).
The result thus validated the feasibility, reliability and reproducibility of our barcoding methodology.

Data analysis

Both the 16S rRNA gene amplicon reads from the 454 GS-Titanium and the metagenomic shotgun reads from the Solexa GA-IIx were processed via their respective computational pipelines, customized for human oral microbiome analysis (Turnbaugh et al 2009, Xie et al 2010). All sequences were deposited in the GenBank Sequence Read Archive (SRA) under accession number SRP005094.

The Solexa GA-IIx reads were quality-filtered using FASTX-Toolkit (V0.0.10; http://hannonlab.cshl.edu/), which trimmed low-quality ends: trailing nucleotides with a quality score below 20 were removed. Contaminating human-host sequences were removed after mapping the reads to the human genome using Novoalign (V2.06; http://www.novocraft.com); read pairs in which both mapping quality values exceeded 100 were labeled as contaminants. Fragments of 16S rRNA genes were retrieved from the shotgun metagenomic sequences and used for taxonomic assignment via PHYLOSHOP (Shah et al 2011).

Furthermore, the HMP oral reference genomes were downloaded from the HMP DACC website (http://www.hmpdacc-resources.org/) and concatenated into one large reference sequence. Illumina reads ("short reads") were aligned to the human oral reference genomes using Bowtie (Langmead et al 2009) to assess the abundance of these sequenced references in the sequenced microbiomes. Non-human short reads were assembled using the SOAP (Li et al 2009) de novo assembler. MetaGene (Noguchi et al 2006) was used to predict ORFs from the contigs. The predicted ORFs were aligned to each other using an in-house Perl script to construct a non-redundant protein-coding gene set. The protein sequences were aligned against the reference genomes using TBLASTN (e < 1e-5) to evaluate the genome coverage of the reference genomes.
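The 3'-end quality trimming applied to the Solexa reads above (removing trailing nucleotides with quality below 20) can be sketched as follows. This is an illustrative sketch of the rule as stated in the text, not the FASTX-Toolkit implementation; the function name `trim_3prime` and the use of already-decoded integer quality scores are assumptions made for the example.

```python
def trim_3prime(seq, quals, min_q=20):
    """Drop trailing bases whose quality score is below min_q.

    seq   : read sequence (str)
    quals : per-base quality scores, already decoded to integers
    Only the 3' end is trimmed; low-quality bases inside the read are kept.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Toy read: the last two bases fall below the threshold of 20.
read = "ACGTACGT"
scores = [30, 32, 35, 18, 28, 25, 12, 8]
trimmed_seq, trimmed_scores = trim_3prime(read, scores)
print(trimmed_seq, trimmed_scores)
```

In this toy read the internal base with score 18 is retained, because trimming stops at the first base from the 3' end that meets the threshold.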
MG-RAST annotation was performed using an e-value cut-off of 1e-5 to identify abundant genes from the assembled reads (Glass et al 2010, Overbeek et al 2005).

The 16S rRNA gene-amplicon 454-read analysis pipeline consisted of the following components.

Quality trimming of reads. Relatively stringent quality-based trimming of 16S rRNA gene 454 reads (Kunin et al 2010) was chosen to minimize the effects of random sequencing errors. Reads generated in three quarters of one 454 GS-Titanium run were first removed if they were <150 bp, had a quality score <25, had an ambiguous base call (N), or did not contain the primer sequence (maximal edit distance of 2). The remaining reads were then sorted by tag sequence. These trimming steps were completed with the RDP Initial Process tool (http://pyro.cme.msu.edu).

Assessment of microbial diversity based on Operational Taxonomic Units (OTUs). To assign phylotypes to the tagged sequences, the trimmed reads were first clustered using UCLUST (http://www.drive5.com/uclust/). An in-house Perl script was then used to convert the UCLUST output into a format that MOTHUR (Schloss et al 2009; http://www.mothur.org/) recognized, so that the alpha-diversity analysis could be conducted. Reads were thus assigned to OTUs. The species richness and diversity estimators (ACE, Chao1, Shannon Index, Simpson Index) were calculated. The relative abundance of 97%-identity (i.e. species-level) OTUs was compared between the healthy group and the caries-active group. Rarefaction curves were generated and compared between the two groups based on a given number of sequences (1 000 or 5 000) randomly selected from each dataset.

To test whether there was a "core" microbiome shared across the saliva samples, an in-house Perl script further processed the intermediate file generated by MOTHUR to iterate the shared-OTU calculations.
The script randomly picked two samples, counted the OTUs they shared, then added a randomly chosen third sample, and so on. The calculation was performed with 100 iterations. The mean and standard deviation of the number of shared OTUs were then derived and plotted.

Taxonomy assignment. We used both RDP Classifier Version 2.1 (Cole et al 2009) and BLAST for the taxonomy assignments, with both the Human Oral Microbiome Database 16S rRNA gene oral sequences (HOMD Version 10.1; http://www.homd.org/) and RDP Taxonomy (http://rdp.cme.msu.edu/) as reference databases. For the RDP Classifier, the confidence score threshold was set to 0.8, meaning that sequences with a bootstrap value below 0.8 were assigned as unclassified. The source code of the RDP Classifier (http://sourceforge.net/projects/rdp-classifier/) was customized for this study. Training the RDP Classifier required a FASTA file with taxonomy annotation and a file with hierarchy information. Because of inconsistencies in nomenclature between the NCBI and HOMD 16S rRNA gene databases, we generated our own four-digit taxon ID for the hierarchy file. The first digit denoted the taxonomy level (domain: 0, phylum: 1, class: 2, order: 3, family: 4, genus: 5) and the last three digits denoted the serial number of each taxon name.

The relative abundances of bacterial taxa at each of the levels of phylum, class, order, family and genus were calculated. The means for the healthy group and the caries-active group, and those for female and male subjects, were reported and compared.

Comparing microbial community structures. For every microbiome, representative sequences were chosen from each OTU by selecting the longest sequence based on UCLUST. Each sequence was then assigned to its closest relative in a phylogeny of the Greengenes core set (DeSantis et al 2006) using BLAST's megablast protocol.
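The shared-OTU ("core" microbiome) iteration described above for the in-house Perl script can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the original script: the input is assumed to be a mapping from sample name to the set of OTU identifiers observed in that sample, and the names `shared_otu_curve`, `sample_otus` and `seed` are hypothetical. Each iteration draws the samples in random order and records how many OTUs remain shared as each further sample is added.

```python
import random
from statistics import mean, stdev

def shared_otu_curve(sample_otus, iterations=100, seed=0):
    """Mean and standard deviation of shared-OTU counts versus sample number.

    sample_otus : dict mapping sample name -> set of OTU identifiers
    iterations  : number of random sample orderings (at least 2 for stdev)
    Returns {k: (mean, sd)} for k = 2 .. number of samples.
    """
    rng = random.Random(seed)
    names = list(sample_otus)
    counts = {k: [] for k in range(2, len(names) + 1)}
    for _ in range(iterations):
        order = rng.sample(names, len(names))   # random order without replacement
        shared = set(sample_otus[order[0]])     # copy, so inputs are not modified
        for k, name in enumerate(order[1:], start=2):
            shared &= sample_otus[name]         # OTUs present in all k samples so far
            counts[k].append(len(shared))
    return {k: (mean(v), stdev(v)) for k, v in counts.items()}

# Toy data: three samples with partially overlapping OTU sets.
toy = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {3, 4, 6}}
curve = shared_otu_curve(toy)
for k, (avg, sd) in curve.items():
    print(k, avg, sd)
```

Plotting the mean with the standard deviation as error bars against the number of samples gives the kind of shared-OTU curve the text describes; a curve that levels off above zero would be consistent with a shared core.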
The resulting sample ID mapping file and category mapping file were used as inputs to unweighted Fast UniFrac (Hamady et al 2010). A distance (a measurement of the similarity between the structures of two microbiotas) was thus computed for each pair of microbiomes, both within a single group and across the two groups. Principal Coordinates Analysis (PCoA; Lozupone and Knight 2005) was performed on the resulting matrices of pairwise distances among all of the microbiomes.

Statistical analyses. Data were typically presented as mean ± s.e.m. Mann-Whitney and Student's t-tests were used to identify statistically significant differences using R (Version 2.11.1).

To cross-validate the results from the 16S rRNA gene-amplicon-based and the whole-genome-based sequencing, the relative abundances of the top 20 genera from the 454 reads for each of the two samples were compared with the taxonomic distribution from the corresponding Solexa reads. The correlation between the community diversity and structure determined by the two approaches was assessed by calculating the Spearman correlation coefficient (r).

All statistical tests were two-sided with a significance level of p = 0.1. Statistical tests were performed based on host dental health state and host gender, respectively. When appropriate, p values were corrected using Holm's adjustment for multiple comparisons.

We use asterisks to denote statistical significance (NS: not significant; *: p < 0.1; **: p < 0.05; ***: p < 0.01).

References

Binladen J, Gilbert MT, Bollback JP, Panitz F, Bendixen C, Nielsen R et al (2007). The use of coded PCR primers enables high-throughput sequencing of multiple homolog amplification products by 454 parallel sequencing. PLoS One 2: e197.
Claesson MJ, Wang Q, O'Sullivan O, Greene-Diniz R, Cole JR, Ross RP et al (2010).
Comparison of two next-generation sequencing technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions. Nucleic Acids Res.
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ et al (2009). The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141-145.
DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K et al (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72: 5069-5072.
Glass EM, Wilkening J, Wilke A, Antonopoulos D, Meyer F (2010). Using the metagenomics RAST server (MG-RAST) for analyzing shotgun metagenomes. Cold Spring Harb Protoc 2010: pdb prot5368.
Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008). Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nat Methods 5: 235-237.
Hamady M, Lozupone C, Knight R (2010). Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J 4: 17-27.
Kunin V, Engelbrektson A, Ochman H, Hugenholtz P (2010). Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol 12: 118-123.
Langmead B, Trapnell C, Pop M, Salzberg SL (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10: R25.
Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K et al (2009). SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25: 1966-1967.
Liu Z, DeSantis TZ, Andersen GL, Knight R (2008). Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Res 36: e120.
Lozupone C, Knight R (2005). UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71: 8228-8235.
Noguchi H, Park J, Takagi T (2006). MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res 34: 5623-5630.
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M et al (2005). The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33: 5691-5702.
Quinque D, Kittler R, Kayser M, Stoneking M, Nasidze I (2006). Evaluation of saliva as a source of human DNA for population and association studies. Anal Biochem 353: 272-277.
Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB et al (2009). Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75: 7537-7541.
Shah N, Tang H, Doak TG, Ye Y (2011). Comparing bacterial communities inferred from 16S rRNA gene sequences and shotgun metagenomics. Pac Symp Biocomput: 165-176.
Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE et al (2009). A core gut microbiome in obese and lean twins. Nature 457: 480-484.
Xie G, Chain PS, Lo CC, Liu KL, Gans J, Merritt J et al (2010). Community and gene composition of a human dental plaque microbiota obtained by metagenomic sequencing. Mol Oral Microbiol 25: 391-405.
Zhou J, Wu L, Deng Y, Zhi X, Jiang YH, Tu Q et al (2011). Reproducibility and quantitation of amplicon sequencing-based detection. ISME J.