Supplementary Online Material
Local Specificity Index and permutation tests
The local specificity index is defined as the product of prevalence and share of total count. The index is
asymmetric by construction, as it focuses on “positive” specificity over “negative” specificity. It places a
higher importance on systematic and exclusive presence than on systematic and exclusive absence: high
specificity to one habitat automatically entails low specificity to other habitats, but the reverse is not
true. A species uniformly distributed across many habitats has low local specificity to each of them. The
focus on positive specificity arises from the limitations of diversity surveys: presence is easier to assess
confidently than absence. The quantitative nature of the index makes it robust to limited contamination:
even if a species is present in all samples due to contamination, it will be picked as specific by the index if
its overall distribution is concentrated on a given habitat.
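As a minimal sketch of the definition above, the following Python function computes a local specificity for every species and habitat as prevalence times share of total count. The function name, the input layout (one list of counts per sample) and the exact form of the two normalizations (relative abundances within samples, then habitat means weighted equally) are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict


def local_specificity(counts, habitats):
    """Return {habitat: [specificity of each species]}.

    counts   -- one list of species counts per sample (hypothetical layout)
    habitats -- habitat label of each sample, in the same order
    """
    n_species = len(counts[0])

    # Assumed first normalization: counts -> relative abundances per sample.
    rel = []
    for sample in counts:
        total = sum(sample)
        rel.append([c / total if total else 0.0 for c in sample])

    # Group sample indices by habitat.
    groups = defaultdict(list)
    for idx, label in enumerate(habitats):
        groups[label].append(idx)

    mean_ab = {}
    prevalence = {}
    for label, idx in groups.items():
        # Mean relative abundance of each species in the habitat.
        mean_ab[label] = [
            sum(rel[s][i] for s in idx) / len(idx) for i in range(n_species)
        ]
        # Prevalence: fraction of the habitat's samples containing the species.
        prevalence[label] = [
            sum(counts[s][i] > 0 for s in idx) / len(idx) for i in range(n_species)
        ]

    spec = {}
    for label in groups:
        row = []
        for i in range(n_species):
            # Assumed second normalization: each habitat weighs the same in the total.
            total = sum(mean_ab[g][i] for g in groups)
            share = mean_ab[label][i] / total if total else 0.0
            row.append(prevalence[label][i] * share)
        spec[label] = row
    return spec


# Species 0 is spread evenly over both habitats, species 1 is confined to A.
counts = [[10, 80, 10], [10, 80, 10], [10, 0, 90], [10, 0, 90]]
habitats = ["A", "A", "B", "B"]
result = local_specificity(counts, habitats)
print(result)
```

In this toy input the evenly spread species gets the same intermediate value in both habitats, whereas the species confined to habitat A reaches the maximum in A and zero in B, which is the asymmetry described above.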
Our randomization scheme reassigns samples to habitats and is similar to non-parametric association
tests. It preserves common and rare species within a sample and therefore detects specific species
conditionally upon the abundance distribution. Other randomization schemes exist (Hardy 2008) but
typically involve randomizing species abundance within samples, instead of or in addition to samples
within habitats. Such randomization schemes do not preserve rare and abundant species and implicitly
assume ecological equivalence of all species. When assessing the specificity of abundant species, they
also lead to less conservative tests. Consider the extreme case depicted in Table S1.
        Habitat A                        Habitat B
        Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
OTU1    100        100        100        0          0          0
OTU2    0          0          0          100        100        100
Table S1: Relative abundances (%) from a fictional dataset with 2 OTUs, 2 habitats, 3 samples per habitat and perfect association
between OTUs and habitats.
After randomization of both species abundances and habitats, the maximum local specificity of OTU1
across habitats A and B is centered on 0.45 with a standard deviation of 0.20. In particular, only 3 % of the
randomized count tables have a perfect association between OTU1 and either habitat. The
corresponding values in our randomization scheme are 0.5, 0.16 and 10 %. As a result, the specificity is
significant under complete randomization (p = 0.03) but not under habitat switching (p = 0.10). By
preserving rare and abundant species, our randomization scheme creates more situations where high
abundance samples cluster in a single habitat and lead to a high value of the maximum local specificity
across habitats by chance alone, whereas complete randomizations smooth out those situations. In other
words, we test whether OTU1 is specific to habitat A, knowing that it is abundant in some samples.
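The habitat-switching figures for Table S1 can be checked by exact enumeration rather than by random draws. The sketch below assumes that the index is prevalence times share of total abundance, with each habitat's mean relative abundance weighted equally, and lists every way of labelling 3 of the 6 samples as habitat A; the helper name `max_specificity` is hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

# Relative abundances (%) of OTU1 in the six samples of Table S1.
otu1 = [100, 100, 100, 0, 0, 0]


def max_specificity(values, in_a):
    """Maximum over habitats A and B of prevalence x share of total abundance."""
    groups = [
        [v for k, v in enumerate(values) if k in in_a],
        [v for k, v in enumerate(values) if k not in in_a],
    ]
    means = [sum(g) / len(g) for g in groups]
    prevalences = [sum(v > 0 for v in g) / len(g) for g in groups]
    total = sum(means)
    return max(p * m / total for p, m in zip(prevalences, means))


# Habitat switching: every way of assigning 3 of the 6 samples to habitat A.
null = [max_specificity(otu1, set(c)) for c in combinations(range(6), 3)]

# Observed labelling: samples 1 to 3 belong to habitat A.
observed = max_specificity(otu1, {0, 1, 2})

# Right-tail p-value against the null distribution of the maximum.
p_value = sum(x >= observed for x in null) / len(null)

print(len(null), round(mean(null), 2), round(pstdev(null), 2), p_value)
```

Under these assumptions the enumeration covers 20 labellings, with a mean of 0.5, a standard deviation of about 0.17 and a p-value of 0.10 (2 labellings out of 20 give a perfect association), in line with the habitat-switching values quoted above.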
The decision to compare the local specificity of species i to habitat h to the null distribution of its
maximum over habitats, instead of its own null distribution, shifts the null distribution towards higher
values and increases right-tail p-values (i.e. the probability of being more specific than expected by
chance). Comparing to the maximum corrects for the multiple testing problem, where the specificity of a
species is assessed for several habitats at once, and ensures that any species is specific to at most one
habitat.
Implicit assumptions behind normalization
Our index uses a double normalization of species counts. The first normalization gives the same weight
to each sample within a habitat by changing counts to relative abundances. This normalization has
various flavors in metagenomics studies and reflects the sampling component of sequencing, as species
are sequenced in proportion to their relative abundances. This normalization, however, implicitly assumes
species neutrality: a sample site spans only one habitat, shared by all species in that habitat, and one
species’ gain is another’s loss. If non-competing species at a given location occupy exclusive habitats, one
part of the bacterial community may grow whereas the rest remains unchanged. If the carrying
capacities (proxy for contribution) of the habitats differ among samples, relative abundances are
distorted, even though samples share a common structure. This is illustrated in Table S2. Samples 1 and
2 correspond to the same site, made up of habitats A and B, at two different time points. The carrying
capacity of habitat A is divided by 3 between the two time points. Although each habitat composition
remains unchanged, the global site composition changes with time.
                                                          Habitat A (capacity = KA)  Habitat B (capacity = KB)
                                                          OTU A1   OTU A2   OTU A3   OTU B1   OTU B2   OTU B3
Sample 1 (KA = 90, KB = 30), absolute counts              30       30       30       10       10       10
Sample 1, relative abundances (%)                         25       25       25       8.3      8.3      8.3
Sample 2 (KA = 30, KB = 30), absolute counts              10       10       10       10       10       10
Sample 2, relative abundances (%)                         16.7     16.7     16.7     16.7     16.7     16.7
Common structure: within-habitat relative abundances (%)  33       33       33       33       33       33
Table S2: Relative abundances from a fictional dataset with 2 samples from the same site at two time points. The site is
made of two identically structured habitats (last line) but the carrying capacity of habitat A changes across time, leading to
distorted relative abundances.
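The arithmetic behind Table S2 can be reproduced in a few lines. This sketch only recomputes the percentages from the absolute counts given in the table; the helper name `percentages` and the dictionary layout are illustrative choices.

```python
def percentages(counts):
    """Convert absolute counts to percentages of their total, rounded to 0.1."""
    total = sum(counts)
    return [round(100 * c / total, 1) for c in counts]


# Absolute counts from Table S2, split by habitat.
sample1 = {"A": [30, 30, 30], "B": [10, 10, 10]}  # KA = 90, KB = 30
sample2 = {"A": [10, 10, 10], "B": [10, 10, 10]}  # KA = 30, KB = 30

for name, sample in (("Sample 1", sample1), ("Sample 2", sample2)):
    # Site-level relative abundances mix both habitats and depend on KA and KB.
    site = percentages(sample["A"] + sample["B"])
    # Within-habitat relative abundances ignore the carrying capacities.
    within = {habitat: percentages(c) for habitat, c in sample.items()}
    print(name, "site:", site, "within habitat:", within)
```

The site-level percentages differ between the two samples, while the within-habitat percentages are identical, which is the distortion the table illustrates.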
The second normalization weights all habitats equally when computing the total count of a species. This
implicitly assumes that all habitats have comparable bacterial loads. If not, a species can be very specific
to a habitat while being found mostly in another. This apparent paradox, illustrated in Table S3, stems
from the base-rate fallacy and is unavoidable in many fields, including screening tests and biomarker
discovery. This is not a problem of unknown library sizes, as in differential abundance or differential
expression studies, but rather of unknown bacterial load and can probably not be solved by
normalization methods alone. It also means that local specificity is a measure of relative specificity
rather than an absolute one.
             Total counts      Normalized counts
             OTU1     OTU2     OTU1     OTU2
Habitat A    80       20       80       20
Habitat B    80       920      8        92
Table S3: Raw and normalized counts from a fictional dataset with two habitats and two OTUs. OTU1 is specific to habitat A
(0.91 = 80/88) but found in equal counts (80) in habitats A and B thanks to the 10 times larger bacterial load of habitat B.
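The contrast in Table S3 between raw and normalized shares can be recomputed as follows. The sketch assumes that the second normalization rescales every habitat to the same total (100 here, as in the table); variable names are illustrative.

```python
# Total counts from Table S3: habitat -> (OTU1, OTU2).
totals = {"A": (80, 20), "B": (80, 920)}

# Assumed second normalization: rescale each habitat to the same total (100).
normalized = {h: [100 * c / sum(row) for c in row] for h, row in totals.items()}

# Share of OTU1 found in habitat A, before and after normalization.
raw_share = totals["A"][0] / sum(row[0] for row in totals.values())
normalized_share = normalized["A"][0] / sum(row[0] for row in normalized.values())

print(round(raw_share, 2), round(normalized_share, 2))
```

With raw counts, half of OTU1 is found in each habitat; after normalization, habitat A holds 80/88 of it, which is the value of 0.91 given in the caption.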
From a pragmatic point of view, a priori information on both the habitat composition of a sampling site
and the bacterial load of a habitat is usually unavailable. The previous assumptions are therefore
necessary to study specificity from count data.
New skin habitats
The skin group from the human microbiota study is a patchwork of body sampling sites (18 sites in the
Costello et al. (2009) dataset) and is unlikely to form a coherent habitat. This is reflected in the weak
abundance – specificity relationship and the poor fit of the Neutral Community Model (NCM; Fig. S7). We
decided to subdivide the skin group into smaller groups, more likely to form environmentally coherent
habitats. The groups were formed using hierarchical clustering of unweighted UniFrac distances and
cutting the tree at the highest level that respects the symmetry of the human body, imposing symmetric
sites (left/right) to belong to the same habitat. As expected, the 6 new habitats mainly group spatially
close sites (hand palm and index finger), in addition to the enforced symmetry (left/right back of knee),
with only one peculiar habitat: armpit and soles of feet. Kembel et al. (2012) noted that armpits and soles
of feet are both dark and moist sampling sites; these could be the relevant environmental variables here.
Repeating the analysis with the new skin habitats does not change the results for the previous non-skin
habitats. It does, however, strengthen the abundance – specificity relationship in the new skin habitats
(Fig. S3), although the relationship remains weaker than in other habitats. This may be evidence that in
the presence of strong migration and less differentiated environmental conditions, as expected between
skin sampling sites, the relative contribution of neutral processes and niche differentiation shifts
towards neutral processes, as suggested in Gravel et al. (2006).
We stress that the new habitats were built only from presence/absence data obtained by aggregating all
samples from a body site. In particular, they do not explicitly incorporate any information about
abundance or local prevalence of species. We therefore argue that the stronger abundance – specificity
relationship is genuine and not merely an artifact of the clustering step, although the latter could
reinforce the former.
SOM References
1. Hardy, O.J. (2008). Testing the spatial phylogenetic structure of local communities: statistical
performances of different null models and test statistics on a locally neutral community. J. Ecol., 96, 914–926.
2. Kembel, S.W., Wu, M., Eisen, J.A. & Green, J.L. (2012). Incorporating 16S gene copy number information
improves estimates of microbial diversity and abundance. PLoS Comput Biol, 8, e1002743.
Supplementary Figures
Figure S1: Positive relationship between local abundance and specificity in zooplankton samples after correcting counts for copy
number variation, using either the average copy number (top) in a phylum (or class for Proteobacteria) or a random copy number
(bottom) in the observed range of that phylum (or class for Proteobacteria). The sampling distribution was a truncated
Gaussian with mean and variance set to the observed mean and variance in the phylum (or class). The sequence similarity threshold
is 97 %.
Figure S2: Positive relationship between local abundance and specificity in zooplankton samples using sequence similarity
thresholds ranging from 91 to 99 %. The abundance – specificity relationship is generally stronger for higher thresholds, except
perhaps 99 %, but can already be observed at the 91 % threshold (roughly corresponding to the family level; Youngblut et al.
2013). As expected, higher thresholds reduce the abundance of the most abundant species.
Figure S3: Positive relationship between local abundance and specificity in the human microbiome (Costello et al. 2009) using
refined habitats. Construction of the new skin habitats is detailed in the SOM. Details of this figure are the same as for Figure 1.
Figure S4: The abundance – specificity relationship in the human microbiome (Costello et al. 2009) is stable over time. Curves
corresponding to different sampling days (labelled Day 1 and Day 2) are essentially identical for all habitats.
Figure S5: Abundance of species in the human microbiome (Costello et al. 2009) is stable over time. The variations observed at
low abundances, visible as the distance from the diagonal, are likely due to limited sequencing depth. As relative
abundances increase, they are better estimated and more consistent across the two datasets.
Figure S6: Positive relationship between local abundance and specificity in diverse environmental habitats from the Global
Patterns study (Caporaso et al. 2011) using moderate subsampling (5,000 reads). This confirms that the abundance – specificity
relationship is not simply an artefact of rare species being left out when using moderate subsampling.
Figure S7: Fit of the Neutral Community Model (Sloan et al. 2006) to the Skin group of the human microbiome (Costello et al.
2009). The model was fitted using equation (14) of Sloan et al. (2006). The NCM correctly captures the occupancy – abundance curve
at low abundances but breaks down at high abundances, suggesting that the Skin group is partitioned into several
environmentally coherent habitats.