Minimally invasive sampling method identifies differences in taxonomic richness of airway microbiomes in young infants associated with mode of delivery

Meghan H. Shilts¹, Christian Rosas-Salazar², Andrey Tovchigrechko³, Emma K. Larkin⁴, Manolito Torralba³, Asmik Akopov¹, Rebecca Halpin¹, R. Stokes Peebles⁴, Martin L. Moore⁵, Larry J. Anderson⁵, Karen E. Nelson³, Tina V. Hartert⁴, Suman R. Das¹*

¹ Virology Group, J. Craig Venter Institute, Rockville, MD, USA
² Department of Pediatrics, Vanderbilt University School of Medicine, Nashville, TN, USA
³ Genomic Medicine Group, J. Craig Venter Institute, Rockville, MD, USA
⁴ Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
⁵ Department of Pediatrics, Emory University School of Medicine, Atlanta, GA, USA

* Corresponding author
Suman R. Das, PhD
Infectious Diseases Group
J. Craig Venter Institute
9704 Medical Center Dr.
Rockville, MD 20850, USA
Email: sdas@jcvi.org
Phone: 301-795-7328
Fax: 301-795-7070

Key Words: microbiome, 16S rRNA, next-generation sequencing, upper respiratory tract

Materials and Methods

DNA Sampling and Extraction

Dry filter papers (Leukosorb™, Pall Corporation, Port Washington, NY) were handled with sterile gloves and inserted inside both of the infant’s nares for a minimum of 30 seconds and up to two minutes. After sampling, the nasal filters were placed into sterile tubes and stored at -80 °C until further processing. To extract microbial genomic DNA, filters were resuspended in 700 μl of lysis buffer (20 mM Tris-HCl, 2 mM EDTA, 1.2% Triton X-100) and incubated at 75 °C for 10 minutes. After samples were cooled to room temperature, 60 μl of 200 mg/ml lysozyme (Sigma Aldrich, St. Louis, MO) was added to each tube, and samples were incubated at 37 °C overnight. Following overnight incubation, 110 μl of 10% SDS and 42 μl of 20 mg/ml proteinase K (QIAGEN, Valencia, CA) were added to each sample, and the tubes were incubated at 55 °C for 30 minutes.
After incubation, an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) was added to each sample. Samples were vortexed and then centrifuged at maximum speed for 20 minutes. The aqueous phase was removed and subjected to a second phenol-chloroform extraction. After centrifugation, a 1/10th volume of 3 M sodium acetate pH 5.2 and an equal volume of chloroform:isoamyl alcohol (24:1) were added to the aqueous phase. Samples were vortexed and then centrifuged at maximum speed for 15 minutes. The aqueous phase was removed, and an equal volume of isopropanol was added. Samples were subsequently incubated at -80 °C for 30 minutes. After precipitation, samples were centrifuged at 4 °C at maximum speed for 10 minutes. The supernatant was removed, and the pellet was washed with 80% ethanol. Samples were again centrifuged at 4 °C at maximum speed for 10 minutes. The supernatant was removed by decanting, and the tubes were left open at 37 °C for 10 minutes to ensure complete ethanol evaporation. The pellet was resuspended in 50 μl TE. For each sample, DNA was quantified using SYBR Green (Life Technologies, Grand Island, NY) on a Synergy HT plate reader (BioTek, Winooski, VT).

16S rRNA Gene Profiling by 454 Pyrosequencing

Approximately 100 ng of DNA from each sample was amplified using primers 27F (5’-CCTATCCCCTGTGTGCCTTGGCAGTCTCAGAGAGTTTGATYMTGGCTCAG-3’) and 534R (5’-CCATCTCATCCCTGCGTGTCTCCGACTCAG-NNNNNNNNNNATTACCGCGGCTGCTGG-3’), which target the V1-V3 region of the 16S rRNA gene [1]. Each reverse primer contained a unique 10-base-pair barcode, represented with “N” in the 534R primer sequence above, to allow multiplex pyrosequencing. Amplicons were generated with Platinum Taq polymerase (Life Technologies, Grand Island, NY) using the following cycling conditions: 95 °C for 5 minutes; 35 cycles of 95 °C for 30 seconds, 55 °C for 30 seconds, 72 °C for 30 seconds; and a final extension step at 72 °C for 7 minutes.
The presence of amplicons was verified by visualization on a 1% agarose gel. Amplicons were cleaned using the QIAquick PCR Purification Kit (QIAGEN, Valencia, CA), as per the manufacturer’s instructions, with an additional drying step after the ethanol wash to ensure complete ethanol removal. Purified amplicons were quantified using SYBR Green (Life Technologies, Grand Island, NY) on a Synergy HT plate reader (BioTek, Winooski, VT), normalized, and pooled. The template was subjected to emulsion PCR, and pyrosequencing was performed at the J. Craig Venter Institute on a 454 sequencer using FLX-Titanium chemistry (Roche, Branford, CT).

Data Analysis

All analyses were conducted using our open source package MGSAT [2]. The tests used in this analysis are described in detail below.

Ranking of features according to their differential abundance with regard to metadata variables and hypothesis testing for differential abundance.

The GeneSelector [3] R package was used when there were two groups of observations. Briefly, the same ranking method (package function RankingWilcoxon) was applied to multiple random subsamples of the full set of observations (400 replicates, sampling 50% of observations without replacement). RankingWilcoxon ranks features in each replicate according to the test statistic with regard to the group difference. Consensus ranking between replicates was then found with a Monte Carlo procedure (function AggregateMC), and the features were reported in the order of that consensus. The consensus ranking is expected to be more stable with regard to sampling error than a ranking obtained just once for the entire dataset. We used a variant of the RankingWilcoxon method that applied the Wilcoxon rank-sum test to feature abundance values for independent observations. The abundance counts were normalized to simple proportions within each observation.
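The subsample-and-aggregate scheme described above can be illustrated with a minimal Python sketch. It is not the GeneSelector implementation: the function names are hypothetical, the per-replicate score is the absolute deviation of the rank sum from its null expectation (a simplification of the Wilcoxon rank-sum statistic that ignores tie corrections), and the Monte Carlo consensus of AggregateMC is replaced here by a plain mean of per-replicate ranks.

```python
import random
from statistics import mean


def to_proportions(counts):
    """Normalize each observation (row of raw counts) to simple proportions."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_score(values, labels):
    """Absolute deviation of the group-1 rank sum from its null expectation."""
    ranks = average_ranks(values)
    n1 = sum(labels)
    w = sum(r for r, g in zip(ranks, labels) if g == 1)
    return abs(w - n1 * (len(values) + 1) / 2)


def consensus_ranking(counts, labels, n_replicates=400, fraction=0.5, seed=0):
    """Order features by mean rank over random subsamples (labels are 0/1)."""
    if len(set(labels)) < 2:
        raise ValueError("both groups must be present")
    rng = random.Random(seed)
    props = to_proportions(counts)
    n_obs, n_feat = len(props), len(props[0])
    size = max(2, int(round(fraction * n_obs)))
    per_feature_ranks = [[] for _ in range(n_feat)]
    done = 0
    while done < n_replicates:
        idx = rng.sample(range(n_obs), size)  # sampling without replacement
        sub_labels = [labels[i] for i in idx]
        if len(set(sub_labels)) < 2:
            continue  # redraw subsamples that contain only one group
        scores = [
            rank_sum_score([props[i][f] for i in idx], sub_labels)
            for f in range(n_feat)
        ]
        # Rank 1 goes to the feature with the largest score in this replicate.
        replicate_ranks = average_ranks([-s for s in scores])
        for f in range(n_feat):
            per_feature_ranks[f].append(replicate_ranks[f])
        done += 1
    mean_rank = [mean(r) for r in per_feature_ranks]
    return sorted(range(n_feat), key=lambda f: mean_rank[f])
```

Calling `consensus_ranking(counts, labels)` with a samples-by-features count matrix and a 0/1 group label per sample returns feature indices ordered from most to least consistently differential across subsamples.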
For each feature, we also reported, from the same test done on the full dataset, the p-value computed using the test implementation from the R package exactRankTests [4]; the q-value computed with the Benjamini & Hochberg [5] False Discovery Rate (FDR) method in the R function p.adjust; and several types of effect size measurements.

Stabsel is a stability selection approach implemented in the R package stabs [6]. This feature selection method implements the stability selection procedure described in [7] with the improved error bounds described in [8]. Elastic net (from the R package glmnet [9]) was used as the base feature selection method that was wrapped by the stability protocol. For groupings with two factor levels, a binomial family model was built with the grouping as a response and the matrix of the abundance values as predictors. For modeling microbiome changes versus age, the age was used as a response in a Gaussian family model. The mixing parameter α of glmnet was selected based on a 15-fold cross-validation minimizing deviance on the full dataset. The predictors were first normalized to simple proportions within each multivariate observation, transformed with the inverse hyperbolic sine, $\log\left(x + \sqrt{x^2 + 1}\right)$, and then standardized to zero means and unit variances. With its multivariate base feature selection method, this protocol can potentially detect those correlated groups of biologically relevant features that will be missed by the univariate methods. The ranking of taxa and their probability of being selected into the model were reported, as well as the probability cutoff corresponding to the per-family error rate (PFER) that is controlled by this method. Our PFER cutoff was set to 0.05, and the target number of features selected by the base classifier was set to $\sqrt{0.8 \times p}$, where $p$ is the total number of features [7].
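The predictor preprocessing and the target feature count described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MGSAT code: the function names are hypothetical, and the standardization uses the population standard deviation, which the text does not specify.

```python
import math
from statistics import mean, pstdev


def ihs(x):
    """Inverse hyperbolic sine, log(x + sqrt(x^2 + 1)); same as math.asinh."""
    return math.log(x + math.sqrt(x * x + 1.0))


def preprocess(counts):
    """Proportions per observation, IHS transform, then per-feature z-scores."""
    transformed = []
    for row in counts:
        total = sum(row)
        if total == 0:
            raise ValueError("observation with zero total count")
        transformed.append([ihs(c / total) for c in row])
    columns = []
    for col in zip(*transformed):
        m, s = mean(col), pstdev(col)  # assumption: population SD
        columns.append([(v - m) / s if s > 0 else 0.0 for v in col])
    # Transpose back to observations-by-features layout.
    return [list(row) for row in zip(*columns)]


def target_feature_count(p):
    """Target number of features for the base selector, sqrt(0.8 * p)."""
    return math.sqrt(0.8 * p)
```

For example, with p = 20 features the target evaluates to sqrt(16) = 4 features per subsample fit.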
In our experience with omics datasets, the PFER control in this method is fairly conservative, and we typically look at the ranking of features as opposed to only concentrating on features that pass the PFER cutoff.

DESeq2 [10] is a method for the differential analysis of count data that uses shrinkage estimation for dispersions and fold changes to improve the stability and interpretability of estimates. The DESeq2 test uses a negative binomial model rather than simple proportion-based normalization or rarefaction to control for different sequencing depths, which may both increase power and lower the false positive detection rate [11].

Comparing overall dissimilarity between taxonomic abundance profiles.

We applied the PermANOVA (permutation-based analysis of variance) [12] test of statistical significance (as implemented in the Adonis function of the R vegan package) [13] on the association between the abundance profile dissimilarities and the metadata variables. We used the Bray-Curtis dissimilarity index [14] and 4000 permutations. The counts were normalized to simple proportions within each observation.

Diversity and richness analysis.

For genus-level and OTU count matrices, we performed the following richness and diversity analyses using the R vegan [13] package. Counts were rarefied to the lowest library size, and then abundance-based and incidence-based alpha diversity indices and richness estimates were computed. This was repeated multiple times (n = 400), and the results were averaged. Incidence-based estimates were computed on pools of observations split by the relevant metadata attribute, and in each repetition, observations were also stratified to balance the number of observations at each level of the metadata attribute. Inverted Simpson and Shannon diversity indices were converted into corresponding Hill numbers [15]. Linear models were fit to test for associations between abundance-based richness and diversity estimates and metadata attributes.
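The repeated-rarefaction averaging of abundance-based estimates described above can be sketched in Python as follows. This is a simplified illustration, not the vegan-based analysis: the function names are hypothetical, only observed richness and the Shannon and inverse Simpson indices are shown, and the Hill numbers of orders 0, 1, and 2 are taken as observed richness, the exponential of the Shannon index, and the inverse Simpson index, respectively.

```python
import math
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one sample's counts."""
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied


def hill_numbers(counts):
    """Hill numbers of order 0, 1, and 2 for one vector of taxon counts."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    richness = len(p)                               # order 0
    shannon = -sum(pi * math.log(pi) for pi in p)
    inverse_simpson = 1.0 / sum(pi * pi for pi in p)
    return richness, math.exp(shannon), inverse_simpson  # orders 0, 1, 2


def averaged_hill(counts, depth, n_iter=400, seed=0):
    """Average the Hill numbers over repeated rarefactions to a common depth."""
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        values = hill_numbers(rarefy(counts, depth, rng))
        sums = [s + v for s, v in zip(sums, values)]
    return tuple(s / n_iter for s in sums)
```

In this sketch, `depth` would be the lowest library size across samples, so that every sample is compared at the same sequencing effort.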
A beta diversity dissimilarity matrix (Sørensen index, which equals the Bray-Curtis index on incidence data) was computed by averaging over multiple rarefactions. The function betadisper from vegan was used to test for the homogeneity of group variances. Adonis was used to test for the association of beta diversity with the metadata attributes.

Independent filtering.

For diversity and richness estimates, full count matrices as produced by the mothur annotation were used [16]. After completing that step and before proceeding to the differential abundance analysis, in order to remove the likely non-informative features and to reduce the associated penalty from the multiple testing correction applied after univariate tests, we used unbiased metadata-independent filtering at each level of the taxonomy by eliminating all features that were detected with a mean proportional abundance of less than 0.0005. The absolute counts from the removed features were aggregated into a category “other,” which was taken into account when computing simple proportions during data normalization, but were otherwise discarded.

The R package ggplot2 [17] was used to generate plots of taxonomic abundance profiles. Profiles were normalized to proportions.

Table S1. The number of reads, averaged Good’s coverage score, and the estimated number of OTUs per sample are listed below.
| Sample ID | Raw Reads | Reads after Filtering | Average Good's Coverage Scoreᵃ | Estimated OTUs (S.obs)ᵇ |
|---|---|---|---|---|
| RSVSP_TH_00001 | 26839 | 17673 | 1.00 | 25.58 |
| RSVSP_TH_00002 | 24369 | 16193 | 0.99 | 28.18 |
| RSVSP_TH_00003 | 24299 | 2617 | 0.99 | 39.00 |
| RSVSP_TH_00004 | 23111 | 14921 | 1.00 | 26.87 |
| RSVSP_TH_00005 | 30646 | 18982 | 1.00 | 20.67 |
| RSVSP_TH_00006 | 33554 | 18334 | 0.99 | 46.09 |
| RSVSP_TH_00007 | 24526 | 15228 | 0.99 | 68.35 |
| RSVSP_TH_00008 | 34828 | 20955 | 0.99 | 62.17 |
| RSVSP_TH_00009 | 27841 | 16563 | 0.98 | 115.95 |
| RSVSP_TH_00010 | 29137 | 13505 | 0.99 | 83.27 |
| RSVSP_TH_00011 | 21840 | 14140 | 0.99 | 97.63 |
| RSVSP_TH_00012 | 24827 | 16128 | 0.99 | 53.83 |
| RSVSP_TH_00013 | 19366 | 12522 | 0.99 | 88.68 |
| RSVSP_TH_00014 | 4940 | 3189 | 0.98 | 135.20 |
| RSVSP_TH_00015 | 21129 | 12647 | 0.98 | 89.56 |
| RSVSP_TH_00017 | 25918 | 14557 | 0.99 | 49.38 |
| RSVSP_TH_00018 | 33234 | 21098 | 0.99 | 36.44 |
| RSVSP_TH_00019 | 22065 | 7700 | 0.99 | 41.99 |
| RSVSP_TH_00020 | 20790 | 13665 | 0.98 | 124.86 |
| RSVSP_TH_00041 | 29702 | 18131 | 0.99 | 34.14 |
| RSVSP_TH_00045 | 27755 | 10027 | 0.99 | 38.78 |
| RSVSP_TH_00046 | 21922 | 15826 | 0.99 | 78.31 |
| RSVSP_TH_00047 | 24919 | 15443 | 0.99 | 51.96 |
| RSVSP_TH_00048 | 25747 | 18187 | 0.99 | 66.58 |
| RSVSP_TH_00049 | 16540 | 10218 | 0.99 | 86.68 |
| RSVSP_TH_00050 | 27867 | 19213 | 0.99 | 45.91 |
| RSVSP_TH_00051 | 20374 | 15283 | 0.99 | 72.17 |
| RSVSP_TH_00053 | 20979 | 16180 | 0.99 | 62.34 |
| RSVSP_TH_00054 | 19380 | 9829 | 0.99 | 38.30 |
| RSVSP_TH_00057 | 25062 | 5658 | 0.99 | 69.18 |
| RSVSP_TH_00058 | 29892 | 19821 | 0.99 | 36.39 |
| RSVSP_TH_00059 | 21851 | 15161 | 0.99 | 49.03 |
| RSVSP_TH_00060 | 24605 | 17959 | 0.99 | 66.60 |

ᵃ Average calculated over 1,000 iterations, after rarefying to the minimum sequence count.
ᵇ The number of OTUs per sample was calculated by rarefying to the minimum sequence count, and averaging the estimate over 400 iterations.

References:

1. Jeraldo P, Chia N, Goldenfeld N (2011) On the suitability of short reads of 16S rRNA for phylogeny-based analyses in environmental surveys. Environ Microbiol 13 (11):3000-3009. doi:10.1111/j.1462-2920.2011.02577.x
2. Tovchigrechko A (2015) MGSAT - Statistical analysis of microbiome and proteome abundance matrices with automated report generation.
3. Boulesteix AL, Slawski M (2009) Stability and aggregation of ranked gene lists. Brief Bioinform 10 (5):556-568. doi:10.1093/bib/bbp034
4. Hothorn T, Hornik K (2013) exactRankTests: Exact Distributions for Rank and Permutation Tests. http://cran.r-project.org/package=exactRankTests.
5. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57 (1):289-300. doi:10.2307/2346101
6. Hofner B, Hothorn T (2014) stabs: Stability Selection with Error Control. http://CRAN.R-project.org/package=stabs.
7. Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Series B Stat Methodol 72 (4):417-473. doi:10.1111/j.1467-9868.2010.00740.x
8. Shah RD, Samworth RJ (2013) Variable selection with error control: another look at stability selection. J R Stat Soc Series B Stat Methodol 75 (1):55-80. doi:10.1111/j.1467-9868.2011.01034.x
9. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33 (1):1-22
10. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15 (12):550. doi:10.1186/s13059-014-0550-8
11. McMurdie PJ, Holmes S (2014) Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol 10 (4):e1003531. doi:10.1371/journal.pcbi.1003531
12. Anderson MJ (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecol 26 (1):32-46. doi:10.1111/j.1442-9993.2001.01070.pp.x
13. Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O'Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2014) vegan: Community Ecology Package. http://CRAN.R-project.org/package=vegan.
14. Bray JR, Curtis JT (1957) An ordination of the upland forest communities of southern Wisconsin. Ecol Monogr 27 (4):325-349
15. Hill MO (1973) Diversity and evenness: a unifying notation and its consequences. Ecology 54 (2):427-432. doi:10.2307/1934352
16. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75 (23):7537-7541. doi:10.1128/AEM.01541-09
17. Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer, New York