Minimally invasive sampling method identifies differences in taxonomic richness of
airway microbiomes in young infants associated with mode of delivery
Meghan H. Shilts1, Christian Rosas-Salazar2, Andrey Tovchigrechko3, Emma K. Larkin4,
Manolito Torralba3, Asmik Akopov1, Rebecca Halpin1, R. Stokes Peebles4, Martin L.
Moore5, Larry J. Anderson5, Karen E. Nelson3, Tina V. Hartert4, Suman R Das1*
1 Virology Group, J. Craig Venter Institute, Rockville, MD, USA
2 Department of Pediatrics, Vanderbilt University School of Medicine, Nashville, TN, USA
3 Genomic Medicine Group, J. Craig Venter Institute, Rockville, MD, USA
4 Department of Medicine, Vanderbilt University School of Medicine, Nashville, TN, USA
5 Department of Pediatrics, Emory University School of Medicine, Atlanta, GA, USA
* Corresponding author
Suman R. Das, PhD
Infectious Diseases Group
J. Craig Venter Institute
9704 Medical Center Dr.
Rockville, MD 20850, USA
Email: sdas@jcvi.org
Phone: 301-795-7328
Fax: 301-795-7070
Key Words: microbiome, 16S rRNA, next-generation sequencing, upper respiratory tract
Materials and Methods
DNA Sampling and Extraction
Using sterile gloves, dry filter papers (Leukosorb™, Pall Corporation, Port
Washington, NY) were inserted inside both of the infant’s nares for a minimum of 30
seconds, and up to two minutes. After sampling, the nasal filters were placed into sterile
tubes and stored at -80 °C until further processing. To extract microbial genomic DNA,
filters were resuspended in 700 μl of lysis buffer (20 mM Tris-HCl, 2 mM EDTA, 1.2%
Triton X-100) and incubated at 75 °C for 10 minutes. After samples were cooled to room
temperature, 60 μl of 200 mg/ml lysozyme (Sigma Aldrich, St. Louis, MO) was added to
each tube and samples were incubated at 37 °C overnight. Following overnight incubation,
110 μl of 10% SDS and 42 μl of 20 mg/ml proteinase K (QIAGEN, Valencia, CA) were
added to each sample, and the tubes were incubated at 55 °C for 30 minutes. After
incubation, an equal volume of phenol:chloroform:isoamyl alcohol (25:24:1) was added to
each sample. Samples were vortexed and then centrifuged at maximum speed for 20
minutes. The aqueous phase was removed and subjected to a second phenol-chloroform
extraction. After centrifugation, a 1/10th volume of 3 M sodium acetate pH 5.2 and an
equal volume of chloroform:isoamyl alcohol (24:1) were added to the aqueous phase.
Samples were vortexed and then centrifuged at maximum speed for 15 minutes. The
aqueous phase was removed, and an equal volume of isopropanol was added. Samples
were subsequently incubated at -80 °C for 30 minutes. After precipitation, samples were
centrifuged at 4 °C at maximum speed for 10 minutes. The supernatant was removed, and
the pellet was washed with 80% ethanol. Samples were again centrifuged at 4 °C at
maximum speed for 10 minutes. The supernatant was removed by decanting, and the
tubes were left open at 37 °C for 10 minutes to ensure complete ethanol evaporation. The
pellet was resuspended in 50 μl TE. For each sample, DNA was quantified using SYBR
Green (Life Technologies, Grand Island, NY) on a Synergy HT plate reader (BioTek,
Winooski, VT).
16S rRNA Gene Profiling by 454 Pyrosequencing
Approximately 100 ng of DNA from each sample was amplified using primers
27F (5’-CCTATCCCCTGTGTGCCTTGGCAGTCTCAGAGAGTTTGATYMTGGCTCAG-3’)
and 534R (5’-CCATCTCATCCCTGCGTGTCTCCGACTCAG-NNNNNNNNNNATTACCGCGGCTGCTGG-3’),
which target the V1–V3 region of the 16S rRNA gene
[1]. Each reverse primer contained a unique 10 base pair barcode, represented with “N”
in the 534R primer sequence above, to allow multiplex pyrosequencing. Amplicons were
generated with Platinum Taq polymerase (Life Technologies, Grand Island, NY)
using the following cycling conditions: 95 °C for 5 minutes; 35 cycles of 95 °C for 30
seconds, 55 °C for 30 seconds, 72 °C for 30 seconds; and a final extension step at 72 °C
for 7 minutes. The presence of amplicons was verified by visualization on a 1% agarose
gel. Amplicons were cleaned using the QIAquick PCR Purification Kit (QIAGEN,
Valencia, CA), as per the manufacturer’s instructions, with an additional drying step after
the ethanol wash to ensure complete ethanol removal. Purified amplicons were quantified
using SYBR Green (Life Technologies, Grand Island, NY) on a Synergy HT plate reader
(BioTek, Winooski, VT), normalized, and pooled. The template was subjected to
emulsion PCR, and pyrosequencing was performed at the J. Craig Venter Institute on a
454 sequencer using FLX-Titanium chemistry (Roche, Branford, CT).
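As an illustration of how the 10-base barcode in the 534R primer allows multiplexed reads to be assigned back to samples, the following is a minimal sketch and not the pipeline used in this study. It assumes that each read has already been trimmed so that it begins with the barcode, that matching is exact, and that the barcode-to-sample map (here hypothetical) is supplied by the user.

```python
# Minimal demultiplexing sketch (illustrative only; not the study's pipeline).
# Assumptions: reads start at the barcode, barcodes and primer match exactly,
# and the barcode-to-sample mapping is hypothetical.

REVERSE_PRIMER = "ATTACCGCGGCTGCTGG"  # gene-specific part of 534R from the text
BARCODE_LENGTH = 10                   # barcode length stated in the text


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample, stripping the barcode and the reverse primer."""
    assigned = {}
    for read in reads:
        barcode = read[:BARCODE_LENGTH]
        remainder = read[BARCODE_LENGTH:]
        sample = barcode_to_sample.get(barcode)
        # Skip reads with an unknown barcode or without the expected primer.
        if sample is None or not remainder.startswith(REVERSE_PRIMER):
            continue
        insert = remainder[len(REVERSE_PRIMER):]
        assigned.setdefault(sample, []).append(insert)
    return assigned
```

Production tools typically tolerate a small number of barcode or primer mismatches; exact matching is used here only to keep the sketch short.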
Data Analysis
All analysis was conducted using our open source package MGSAT [2]. The tests
used in this analysis are described in detail below.
Ranking of features according to their differential abundance with regard to
metadata variables and hypothesis testing for differential abundance.
The GeneSelector [3] R package was used when there were two groups of
observations. Briefly, the same ranking method (package function RankingWilcoxon) was
applied to multiple random subsamples of the full set of observations (400 replicates,
sampling 50% of observations without replacement). RankingWilcoxon ranks features in
each replicate according to the test statistic with regard to the group difference.
Consensus ranking between replicates was then found with a Monte Carlo procedure
(function AggregateMC), and the features were reported in the order of that consensus.
The consensus ranking is expected to be more stable with regard to sampling error as
compared to ranking obtained just once for the entire dataset. We used a variant of the
RankingWilcoxon method that applied the Wilcoxon rank-sum test to feature abundance
values for independent observations. The abundance counts were normalized to simple
proportions within each observation. For each feature, we also reported, from the same
test done on the full dataset, the p-value computed using the test implementation from R
package exactRankTests [4]; the q-value computed with the Benjamini & Hochberg [5]
False Discovery Rate (FDR) method in the R function p.adjust; and several types of
effect size measurements.
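To make the supporting steps of this procedure concrete, the sketch below shows normalization to simple proportions, drawing one 50% subsample without replacement, and the Benjamini-Hochberg adjustment. It is an illustrative re-implementation in Python rather than the GeneSelector or p.adjust code used in the analysis; the function names are our own, and the Wilcoxon test and consensus aggregation are not reproduced.

```python
import random

N_REPLICATES = 400  # number of random subsamples stated in the text


def to_proportions(counts):
    """Normalize one observation's counts to simple proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def half_subsample(n_observations, rng):
    """Indices of a 50% subsample drawn without replacement."""
    return sorted(rng.sample(range(n_observations), n_observations // 2))


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    qvalues = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

In the full procedure, the ranking function would be applied to each of the `N_REPLICATES` subsamples and the resulting rankings aggregated into a consensus.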
Stabsel is a stability selection approach implemented in the R package stabs [6].
This feature selection method implements a stability selection procedure described in [7]
with the improved error bounds described in [8]. Elastic net (from R package glmnet [9])
was used as the base feature selection method that was wrapped by the stability protocol.
For groupings with two factor levels, a binomial family model was built with the
grouping as a response and the matrix of the abundance values as predictors. For
modeling microbiome changes versus age, the age was used as a response in a Gaussian
family model. The mixing parameter α of glmnet was selected based on a 15-fold
cross-validation minimizing deviance on the full dataset. The predictors were first normalized
to simple proportions within each multivariate observation, transformed with the inverse
hyperbolic sine, $\log(x + \sqrt{x^2 + 1})$, and then standardized to zero means and unit
variances. With its multivariate base feature selection method, this protocol can
potentially detect those correlated groups of biologically relevant features that will be
missed by the univariate methods. The ranking of taxa and their probability of being
selected into the model were reported, as well as the probability cutoff corresponding to
the per-family error rate (PFER) that is controlled by this method. Our PFER cutoff was
set to 0.05, and the target number of features selected by the base classifier was set to
$\sqrt{0.8 \times p}$, where $p$ is the total number of features [7]. In our experience with omics
datasets, the PFER control in this method is fairly conservative, and we typically look at
the ranking of features as opposed to only concentrating on features that pass the PFER
cutoff.
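The predictor preprocessing described above (inverse hyperbolic sine transform followed by standardization) and the target number of selected features can be sketched as follows. This is a minimal Python illustration, not the stabs or glmnet code; the use of the sample standard deviation (n - 1 denominator) and leaving the target count unrounded are our assumptions.

```python
import math


def inverse_hyperbolic_sine(x):
    """log(x + sqrt(x^2 + 1)); defined at zero, unlike a plain logarithm."""
    return math.log(x + math.sqrt(x * x + 1.0))


def standardize(column):
    """Scale one feature to zero mean and unit variance.

    Assumption: sample variance (n - 1 denominator); the column must not
    be constant, otherwise the standard deviation is zero.
    """
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / (n - 1)
    sd = math.sqrt(variance)
    return [(v - mean) / sd for v in column]


def target_feature_count(p):
    """Target number of features for the base selector, sqrt(0.8 * p)."""
    return math.sqrt(0.8 * p)
```

The inverse hyperbolic sine behaves like a logarithm for large values but maps zero to zero, which is convenient for proportion data containing many zeros.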
DESeq2 [10] is a method for the differential analysis of count data that uses
shrinkage estimation for dispersions and fold changes to improve the stability and
interpretability of estimates. The DESeq2 test uses a negative binomial model rather than
simple proportion-based normalization or rarefaction to control for different sequencing
depths, which may both increase power and lower the false positive detection rate [11].
Comparing overall dissimilarity between taxonomic abundance profiles.
We applied the PermANOVA (permutation-based analysis of variance) [12] test
of statistical significance (as implemented in the Adonis function of the R vegan package)
[13] on the association between the abundance profile dissimilarities and the metadata
variables. We used the Bray-Curtis dissimilarity index [14] and 4000 permutations. The
counts were normalized to simple proportions within each observation.
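For reference, the Bray-Curtis dissimilarity between two abundance profiles normalized to proportions can be computed as in the sketch below. This is an illustrative Python version and not the vegan implementation; the permutation test itself is not reproduced here.

```python
def to_proportions(counts):
    """Normalize one observation's counts to simple proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


# Example with two hypothetical samples over three taxa.
sample_a = to_proportions([10, 0, 30])
sample_b = to_proportions([5, 5, 10])
dissimilarity = bray_curtis(sample_a, sample_b)
```

The index is 0 for identical profiles and 1 for profiles that share no taxa.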
Diversity and richness analysis.
For genus-level and OTU count matrices, we performed the following richness
and diversity analyses using the R vegan [13] package. Counts were rarefied to the lowest
library size, and then abundance-based and incidence-based alpha diversity indices and
richness estimates were computed. This was repeated multiple times (n = 400), and the
results were averaged. Incidence-based estimates were computed on pools of
observations split by the relevant metadata attribute, and in each repetition, observations
were also stratified to balance the number of observations at each level of the metadata
attribute. Inverted Simpson and Shannon diversity indices were converted into
corresponding Hill numbers [15]. Linear models were fit to test for associations between
abundance-based richness and diversity estimates and metadata attributes.
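The rarefy-and-average scheme and the conversion of diversity indices to Hill numbers can be illustrated with the following minimal Python sketch. It is not the vegan code used in the analysis; function names and the seed handling are our own, and the incidence-based estimators and stratification are omitted.

```python
import math
import random


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement from one count vector."""
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    rarefied = [0] * len(counts)
    for taxon in rng.sample(pool, depth):
        rarefied[taxon] += 1
    return rarefied


def hill_shannon(counts):
    """Hill number of order 1: exponential of the Shannon index."""
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return math.exp(entropy)


def hill_simpson(counts):
    """Hill number of order 2: the inverse Simpson index."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def mean_rarefied_richness(counts, depth, n_repeats=400, seed=0):
    """Observed richness averaged over repeated rarefactions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_repeats):
        total += sum(1 for c in rarefy(counts, depth, rng) if c > 0)
    return total / n_repeats
```

For a perfectly even community, both Hill numbers equal the number of taxa, which makes them easier to interpret than the raw indices.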
A beta diversity dissimilarity matrix (Sørensen index, which equals the Bray-Curtis index on
incidence data) was computed by averaging over multiple rarefactions. The function
betadisper from vegan was used to test for the homogeneity of group variances. Adonis
was used to test for the association of beta diversity with the metadata attributes.
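The equivalence between the Sørensen index and the Bray-Curtis index on incidence (presence/absence) data can be checked numerically. The sketch below is an illustration in Python with hypothetical counts; it does not reproduce the averaging over rarefactions or the betadisper test.

```python
def to_incidence(counts):
    """Convert counts to presence/absence (1/0)."""
    return [1 if c > 0 else 0 for c in counts]


def sorensen_dissimilarity(u, v):
    """(b + c) / (2a + b + c), with a shared taxa and b, c unique taxa."""
    shared = sum(1 for x, y in zip(u, v) if x > 0 and y > 0)
    only_u = sum(1 for x, y in zip(u, v) if x > 0 and y == 0)
    only_v = sum(1 for x, y in zip(u, v) if x == 0 and y > 0)
    return (only_u + only_v) / (2 * shared + only_u + only_v)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / sum(a + b for a, b in zip(u, v))


# Two hypothetical samples over four taxa.
counts_a = [3, 0, 2, 1]
counts_b = [0, 5, 1, 0]
sorensen = sorensen_dissimilarity(counts_a, counts_b)
bray_on_incidence = bray_curtis(to_incidence(counts_a), to_incidence(counts_b))
```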
Independent filtering.
For diversity and richness estimates, full count matrices as produced by the
mothur annotation were used [16]. After completing that step and before proceeding to
the differential abundance analysis, in order to remove the likely non-informative features
and to reduce the associated penalty from the multiple testing correction applied after
univariate tests, we used unbiased metadata-independent filtering at each level of the
taxonomy by eliminating all features that were detected with a mean proportional
abundance of less than 0.0005. The absolute counts from the removed features were
aggregated into a category “other,” which was taken into account when computing
simple proportions during data normalization but was otherwise discarded.
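The filtering step can be sketched as follows. This is a minimal Python illustration under the stated threshold of 0.0005, not the MGSAT code; the function and variable names are our own, and count matrices are represented as lists of per-sample rows.

```python
def filter_low_abundance(count_rows, feature_names, min_mean_proportion=0.0005):
    """Drop features whose mean proportional abundance is below the cutoff.

    Counts of dropped features are summed into an "other" column so that
    per-sample totals, and therefore proportions, are unchanged.
    """
    n_samples = len(count_rows)
    n_features = len(feature_names)

    # Mean of per-sample proportions for each feature.
    mean_proportion = [0.0] * n_features
    for row in count_rows:
        total = sum(row)
        for j, count in enumerate(row):
            mean_proportion[j] += count / total / n_samples

    keep = [j for j in range(n_features)
            if mean_proportion[j] >= min_mean_proportion]
    kept_names = [feature_names[j] for j in keep] + ["other"]

    filtered_rows = []
    for row in count_rows:
        kept_counts = [row[j] for j in keep]
        filtered_rows.append(kept_counts + [sum(row) - sum(kept_counts)])
    return kept_names, filtered_rows
```

Because the filter uses only overall abundance and no metadata, it does not bias the subsequent group comparisons while reducing the number of tests to correct for.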
The R package ggplot2 [17] was used to generate plots of taxonomic abundance
profiles. Profiles were normalized to proportions.
Table S1. The number of reads, averaged Good’s coverage score, and estimated
number of OTUs for each sample are listed below.
Sample ID       Raw Reads  Reads after Filtering  Average Good’s Coverage Score (a)  Estimated OTUs (S.obs) (b)
RSVSP_TH_00001  26839      17673                  1.00                               25.58
RSVSP_TH_00002  24369      16193                  0.99                               28.18
RSVSP_TH_00003  24299      2617                   0.99                               39.00
RSVSP_TH_00004  23111      14921                  1.00                               26.87
RSVSP_TH_00005  30646      18982                  1.00                               20.67
RSVSP_TH_00006  33554      18334                  0.99                               46.09
RSVSP_TH_00007  24526      15228                  0.99                               68.35
RSVSP_TH_00008  34828      20955                  0.99                               62.17
RSVSP_TH_00009  27841      16563                  0.98                               115.95
RSVSP_TH_00010  29137      13505                  0.99                               83.27
RSVSP_TH_00011  21840      14140                  0.99                               97.63
RSVSP_TH_00012  24827      16128                  0.99                               53.83
RSVSP_TH_00013  19366      12522                  0.99                               88.68
RSVSP_TH_00014  4940       3189                   0.98                               135.20
RSVSP_TH_00015  21129      12647                  0.98                               89.56
RSVSP_TH_00017  25918      14557                  0.99                               49.38
RSVSP_TH_00018  33234      21098                  0.99                               36.44
RSVSP_TH_00019  22065      7700                   0.99                               41.99
RSVSP_TH_00020  20790      13665                  0.98                               124.86
RSVSP_TH_00041  29702      18131                  0.99                               34.14
RSVSP_TH_00045  27755      10027                  0.99                               38.78
RSVSP_TH_00046  21922      15826                  0.99                               78.31
RSVSP_TH_00047  24919      15443                  0.99                               51.96
RSVSP_TH_00048  25747      18187                  0.99                               66.58
RSVSP_TH_00049  16540      10218                  0.99                               86.68
RSVSP_TH_00050  27867      19213                  0.99                               45.91
RSVSP_TH_00051  20374      15283                  0.99                               72.17
RSVSP_TH_00053  20979      16180                  0.99                               62.34
RSVSP_TH_00054  19380      9829                   0.99                               38.30
RSVSP_TH_00057  25062      5658                   0.99                               69.18
RSVSP_TH_00058  29892      19821                  0.99                               36.39
RSVSP_TH_00059  21851      15161                  0.99                               49.03
RSVSP_TH_00060  24605      17959                  0.99                               66.60
(a) Average calculated over 1,000 iterations, after rarefying to the minimum sequence
count.
(b) The number of OTUs per sample was calculated by rarefying to the minimum sequence
count and averaging the estimate over 400 iterations.
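For readers unfamiliar with the coverage score reported in Table S1, Good’s coverage is one minus the fraction of reads that belong to OTUs observed exactly once. The sketch below is a minimal Python illustration for a single (already rarefied) count vector; the averaging over iterations is omitted, and the function name is our own.

```python
def goods_coverage(otu_counts):
    """Good's coverage: 1 - (number of singleton OTUs / total reads)."""
    total_reads = sum(otu_counts)
    singletons = sum(1 for c in otu_counts if c == 1)
    return 1.0 - singletons / total_reads
```

A score close to 1 indicates that few reads come from OTUs seen only once, suggesting that additional sequencing would reveal few new OTUs.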
References:
1. Jeraldo P, Chia N, Goldenfeld N (2011) On the suitability of short reads of 16S rRNA
for phylogeny-based analyses in environmental surveys. Environ Microbiol 13 (11):3000-3009.
doi:10.1111/j.1462-2920.2011.02577.x
2. Tovchigrechko A (2015) MGSAT - Statistical analysis of microbiome and proteome
abundance matrices with automated report generation.
3. Boulesteix AL, Slawski M (2009) Stability and aggregation of ranked gene lists. Brief
Bioinform 10 (5):556-568. doi:10.1093/bib/bbp034
4. Hothorn T, Hornik K (2013) exactRankTests: Exact Distributions for Rank and
Permutation Tests. http://cran.r-project.org/package=exactRankTests.
5. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57 (1):289-300.
doi:10.2307/2346101
6. Hofner B, Hothorn T (2014) stabs: Stability Selection with Error Control.
http://CRAN.R-project.org/package=stabs.
7. Meinshausen N, Bühlmann P (2010) Stability selection. J R Stat Soc Series B Stat
Methodol 72 (4):417-473. doi:10.1111/j.1467-9868.2010.00740.x
8. Shah RD, Samworth RJ (2013) Variable selection with error control: another look at
stability selection. J R Stat Soc Series B Stat Methodol 75 (1):55-80.
doi:10.1111/j.1467-9868.2011.01034.x
9. Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear
models via coordinate descent. J Stat Softw 33 (1):1-22
10. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and
dispersion for RNA-seq data with DESeq2. Genome Biol 15 (12):550.
doi:10.1186/s13059-014-0550-8
11. McMurdie PJ, Holmes S (2014) Waste not, want not: why rarefying microbiome data
is inadmissible. PLoS Comput Biol 10 (4):e1003531.
doi:10.1371/journal.pcbi.1003531
12. Anderson MJ (2001) A new method for non-parametric multivariate analysis of
variance. Austral Ecol 26 (1):32-46. doi:10.1111/j.1442-9993.2001.01070.pp.x
13. Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O'Hara RB, Simpson
GL, Solymos P, Stevens MHH, Wagner H (2014) vegan: Community Ecology Package.
http://CRAN.R-project.org/package=vegan.
14. Bray JR, Curtis JT (1957) An ordination of the upland forest communities of southern
Wisconsin. Ecol Monogr 27 (4):325-349
15. Hill MO (1973) Diversity and evenness: a unifying notation and its consequences.
Ecology 54 (2):427-432. doi:10.2307/1934352
16. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski
RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn
DJ, Weber CF (2009) Introducing mothur: open-source, platform-independent,
community-supported software for describing and comparing microbial communities.
Appl Environ Microbiol 75 (23):7537-7541. doi:10.1128/AEM.01541-09
17. Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer, New York