Supplementary Online Material

Local Specificity Index and permutation tests

The local specificity index is defined as the product of prevalence and share of total count. The index is asymmetric by construction, as it focuses on “positive” specificity over “negative” specificity. It places more importance on systematic and exclusive presence than on systematic and exclusive absence: high specificity to one habitat automatically entails low specificity to the other habitats, but the reverse is not true. A species uniformly distributed across many habitats has low local specificity to each of them. The focus on positive specificity arises from the limitations of diversity surveys: presence is easier to assess confidently than absence. The quantitative nature of the index makes it robust to limited contamination: even if a species is present in all samples due to contamination, it will be picked as specific by the index if its overall distribution is concentrated on a given habitat.

Our randomization scheme reassigns samples to habitats and is similar to non-parametric association tests. It preserves common and rare species within a sample and therefore detects specific species conditionally upon the abundance distribution. Other randomization schemes exist (Hardy 2008) but typically involve randomizing species abundances within samples, instead of or in addition to samples within habitats. Such randomization schemes do not preserve rare and abundant species and implicitly assume ecological equivalence of all species. When assessing the specificity of abundant species, they also lead to less conservative tests. Consider the extreme case depicted in Table S1.
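
As an illustration, the index and the habitat-switching test can be sketched in a few lines of Python. This sketch is not the code used for the analyses. The function names (local_specificity, max_specificity, habitat_switching_null, right_tail_p) are hypothetical. The share of total count is assumed to be computed from per-habitat mean relative abundances, with all habitats weighted equally. The null distribution is obtained by exhaustive enumeration of label permutations, which is practical only for a handful of samples, rather than by random sampling. The sketch is applied to OTU1 in the extreme case of Table S1, assuming three samples per habitat.

```python
from itertools import permutations
from statistics import mean, pstdev


def local_specificity(abund, labels, habitat):
    """Prevalence x share of one species in `habitat` (assumed definition)."""
    habitats = sorted(set(labels))
    # Mean relative abundance per habitat: samples are weighted equally
    # within a habitat, and habitats are weighted equally in the total.
    mean_ab = {h: mean(a for a, l in zip(abund, labels) if l == h)
               for h in habitats}
    total = sum(mean_ab.values())
    if total == 0:
        return 0.0  # species absent from every habitat
    in_h = [a for a, l in zip(abund, labels) if l == habitat]
    prevalence = sum(a > 0 for a in in_h) / len(in_h)
    share = mean_ab[habitat] / total
    return prevalence * share


def max_specificity(abund, labels):
    """Largest local specificity of the species across habitats."""
    return max(local_specificity(abund, labels, h) for h in set(labels))


def habitat_switching_null(abund, labels):
    """Maximum specificity under every permutation of the habitat labels.

    Habitat sizes are preserved and abundances stay attached to their
    sample. Equivalent assignments are repeated the same number of times,
    which leaves the distribution unchanged.
    """
    return [max_specificity(abund, perm) for perm in permutations(labels)]


def right_tail_p(observed, null):
    """Fraction of null values at least as large as the observed value."""
    return sum(v >= observed - 1e-12 for v in null) / len(null)


# OTU1 in the extreme case of Table S1 (three samples per habitat assumed).
otu1 = [100, 100, 100, 0, 0, 0]
labels = ["A", "A", "A", "B", "B", "B"]

observed = local_specificity(otu1, labels, "A")
null = habitat_switching_null(otu1, labels)
print(observed, mean(null), pstdev(null), right_tail_p(observed, null))
```

Under these assumptions, the enumeration gives a null maximum specificity with mean 0.5 and standard deviation of about 0.17, and a right-tail p-value of 0.10 for OTU1 in habitat A.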

        Habitat A                        Habitat B
        Sample 1   Sample 2   Sample 3   Sample 4   Sample 5   Sample 6
OTU1    100        100        100        0          0          0
OTU2    0          0          0          100        100        100

Table S1: Relative abundances (%) from a fictional dataset with 2 OTUs, 2 habitats, 3 samples per habitat and perfect association between OTUs and habitats.

After randomization of both species abundances and habitats, the maximum local specificity of OTU1 across habitats A and B is centered on 0.45, with a standard deviation of 0.20. In particular, only 3 % of the randomized count tables have a perfect association between OTU1 and either habitat. The corresponding values under our randomization scheme are 0.5, 0.16 and 10 %. As a result, the specificity is significant under complete randomization (p = 0.03) but not under habitat switching (p = 0.10). By preserving rare and abundant species, our randomization scheme creates more situations where high-abundance samples cluster in a single habitat and lead, by chance only, to a high maximum over habitats h of the local specificity of species i, whereas complete randomizations smooth out those situations. In other words, we test whether OTU1 is specific to habitat A, knowing that it is abundant in some samples.

The decision to compare the local specificity of species i to habitat h to the null distribution of its maximum over habitats, instead of its own null distribution, shifts the null distribution towards higher values and increases right-tail p-values (i.e. the probability of being more specific than expected by chance). Comparing to the maximum corrects for the multiple testing problem, where the specificity of a species is assessed for several habitats at once, and ensures that any species is specific to at most one habitat.

Implicit assumptions behind normalization

Our index uses a double normalization of species counts. The first normalization gives the same weight to each sample within a habitat by changing counts to relative abundances.
This normalization comes in various flavors in metagenomics studies and reflects the sampling component of sequencing, as species are sequenced in proportion to their relative abundances. This normalization however implicitly assumes species neutrality: a sample site spans only one habitat, shared by all species in that habitat, and one species’ gain is another’s loss. If non-competing species at a given location occupy exclusive habitats, one part of the bacterial community may grow whereas the rest remains unchanged. If the carrying capacities (proxy for contribution) of the habitats differ among samples, relative abundances are distorted, even though samples share a common structure. This is illustrated in Table S2. Samples 1 and 2 correspond to the same site, made up of habitats A and B, at two different time points. The carrying capacity of habitat A is divided by 3 between the two time points. Although the composition of each habitat remains unchanged, the global site composition changes with time.

                                                          Habitat A (capacity = KA)    Habitat B (capacity = KB)
                                                          OTU A1   OTU A2   OTU A3     OTU B1   OTU B2   OTU B3
Sample 1 (KA = 90, KB = 30), absolute counts              30       30       30         10       10       10
Sample 1, relative abundances (%)                         25       25       25         8.3      8.3      8.3
Sample 2 (KA = 30, KB = 30), absolute counts              10       10       10         10       10       10
Sample 2, relative abundances (%)                         16.7     16.7     16.7       16.7     16.7     16.7
Common structure: within-habitat relative abundances (%)  33       33       33         33       33       33

Table S2: Absolute counts and relative abundances from a fictional dataset with 2 samples from the same site at two time points. The site is made of two identically structured habitats (last row) but the carrying capacity of habitat A changes across time, leading to distorted relative abundances.

The second normalization weights all habitats equally when computing the total count of a species. This implicitly assumes that all habitats have comparable bacterial loads.
If not, a species can be very specific to a habitat while being found mostly in another. This apparent paradox, illustrated in Table S3, stems from the base-rate fallacy and is unavoidable in many fields, including screening tests and biomarker discovery. This is not a problem of unknown library sizes, as in differential abundance or differential expression studies, but rather of unknown bacterial load, and it can probably not be solved by normalization methods alone. It also means that local specificity is a measure of relative rather than absolute specificity.

             Total counts       Normalized counts
             OTU1     OTU2      OTU1     OTU2
Habitat A    80       20        80       20
Habitat B    80       920       8        92

Table S3: Raw and normalized counts from a fictional dataset with two habitats and two OTUs. OTU1 is specific to habitat A (0.91 = 80/88) but found in equal counts (80) in habitats A and B thanks to the 10 times larger bacterial load of habitat B.

From a pragmatic point of view, a priori information on both the habitat composition of a sampling site and the bacterial load of a habitat is usually unavailable. The previous assumptions are therefore necessary to study specificity from count data.

New skin habitats

The skin group from the human microbiota study is a patchwork of body sampling sites (18 sites in the Costello et al. (2009) dataset) and is unlikely to form a coherent habitat. This is reflected in the weak abundance – specificity relationship and the poor fit of the NCM (Fig. S7). We decided to subdivide the skin group into smaller groups, more likely to form environmentally coherent habitats. The groups were formed using hierarchical clustering of unweighted UniFrac distances and cutting the tree at the highest level that respects the symmetry of the human body, forcing symmetric sites (left/right) to belong to the same habitat.
As expected, the 6 new habitats mainly group spatially close sites (palm and index finger), in addition to the enforced symmetry (left/right back of knee), with only one peculiar habitat: armpit and soles of feet. Kembel et al. (2012) noted that armpits and soles of feet are both dark and moist sampling sites; these could be the relevant environmental variables here. Repeating the analysis with the new skin habitats does not change the results for the previous non-skin habitats. It does however strengthen the abundance – specificity relationship in the new skin habitats (Fig. S3), although the relationship remains weaker than in other habitats. This may be evidence that in the presence of strong migration and less differentiated environmental conditions, as expected between skin sampling sites, the balance between neutral processes and niche differentiation shifts towards neutral processes, as suggested in Gravel et al. (2006).

We stress that the new habitats were built only from presence/absence data obtained by aggregating all samples from a body site. In particular, they do not explicitly incorporate any information about abundance or local prevalence of species. We therefore argue that the stronger abundance – specificity relationship is genuine and not merely an artifact of the clustering step, although the latter could reinforce the former.

SOM References

1. Hardy, O.J. (2008). Testing the spatial phylogenetic structure of local communities: statistical performances of different null models and test statistics on a locally neutral community. J. Ecol., 96, 914–926.
2. Kembel, S.W., Wu, M., Eisen, J.A. & Green, J.L. (2012). Incorporating 16S gene copy number information improves estimates of microbial diversity and abundance. PLoS Comput Biol, 8, e1002743.

Supplementary Figures

Figure S1: Positive relationship between local abundance and specificity in zooplankton samples after correcting counts for copy number variation, using either the average copy number in a phylum (or class for Proteobacteria) (top) or a random copy number in the observed range of that phylum (or class for Proteobacteria) (bottom). The sampling distribution was a truncated Gaussian with mean and variance set to the observed mean and variance in the phylum (or class). The sequence similarity threshold is 97 %.

Figure S2: Positive relationship between local abundance and specificity in zooplankton samples using sequence similarity thresholds ranging from 91 to 99 %. The abundance – specificity relationship is generally stronger for higher thresholds, except perhaps 99 %, but can already be observed at the 91 % threshold (roughly corresponding to the family level; Youngblut et al. 2013). As expected, higher thresholds reduce the abundance of the most abundant species.

Figure S3: Positive relationship between local abundance and specificity in the human microbiome (Costello et al. 2009) using refined habitats. Construction of the new skin habitats is detailed in the SOM. Details of this figure are the same as for Figure 1.

Figure S4: The abundance – specificity relationship in the human microbiome (Costello et al. 2009) is stable over time. Curves corresponding to different sampling days (labelled Day 1 and Day 2) are essentially identical for all habitats.

Figure S5: Abundance of species in the human microbiome (Costello et al. 2009) is stable over time. The variations observed at low abundances, visible as distance from the diagonal, are likely due to limited sequencing depth. As relative abundances increase, they are better estimated and more consistent across the two datasets.

Figure S6: Positive relationship between local abundance and specificity in diverse environmental habitats from the Global Patterns study (Caporaso et al. 2011) using moderate subsampling (5,000 reads). This confirms that the abundance – specificity relationship is not simply an artifact of rare species being left out when using moderate subsampling.

Figure S7: Fit of the Neutral Community Model (Sloan et al. 2006) to the Skin group of the human microbiome (Costello et al. 2009). The model was fitted using equation (14) of Sloan et al. (2006). The NCM correctly captures the occupancy – abundance curve at low abundances but breaks down at high abundances, suggesting that the Skin group is partitioned into several environmentally coherent habitats.