Microbial adaptation, differentiation, and community structure by Jonathan Friedman Submitted to the Program in Computational and Systems Biology in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2013 © Massachusetts Institute of Technology 2013. All rights reserved. Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Program in Computational and Systems Biology February 28, 2013 Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eric J. Alm Associate Professor Thesis Supervisor Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel H. Rothman Professor Thesis Supervisor Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christopher B. Burge Chair, Computational and Systems Biology Ph.D. Graduate Committee Microbial adaptation, differentiation, and community structure by Jonathan Friedman Submitted to the Program in Computational and Systems Biology on February 28, 2013, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Microbes play a central role in diverse processes ranging from global elemental cycles to human digestion. Understanding these complex processes requires a firm understanding of the interplay between microbes and their environment. In this thesis, we utilize sequencing data to study how individual species adapt to different niches, and how species assemble to form communities. First, we study the potential temperature and salinity range of 16 marine Vibrio strains. We find that salinity tolerance is at odds with the strains’ natural habitats, and provide evidence that this incongruence may be explained by a molecular coupling between salinity and temperature tolerance. Next, we investigate the genetic basis of bacterial ecological differentiation by analyzing the genomes of two closely related, yet ecologically distinct populations of Vibrio splendidus. We find that most loci recombine freely across habitats, and that ecological differentiation is likely driven by a small number of habitat-specific alleles. We further present a model for bacterial sympatric speciation. Our simulations demonstrate that a small number of adaptive loci facilitates speciation, due to the opposing roles horizontal gene transfer (HGT) plays throughout the speciation process: HGT initially promotes speciation by bringing together multiple adaptive alleles, but later hinders it by mixing alleles across habitats. Finally, we introduce two tools for analyzing genomic survey data: SparCC, which infers correlations between taxa from relative abundance data; and StrainFinder, which extracts strain-level information from metagenomic data. Employing these tools, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body, and show that 16S-defined groups are rarely composed of a single dominant strain. Thesis Supervisor: Eric J. Alm Title: Associate Professor Thesis Supervisor: Daniel H. Rothman Title: Professor 2 Acknowledgments Numerous people have provided guidance, support, and friendship during my thesis work. The following does but little to convey the extent of my gratefulness. My deepest gratitude is given to my thesis advisors, Eric Alm and Daniel Rothman, for showing me how to practice science, and how to enjoy the process. Whatever scientific taste I possess is owed to them. Thanks to past and present members of the Alm and Rothman labs for providing a stimulating and vibrant scientific environment. In particular, Lawrence David, Jesse Shapiro, Alex Petroff and Olivier Devauchelle have been a constant source of both inspiration and humility. Martin Polz and his group have provided a much needed tether to the natural world. It is also my pleasure to acknowledge members of my thesis advisory committee and my collaborators for their contributions to this thesis. Thanks to Tracy, Michelle, Christoph, Alon, Alex, Olivier, Lawrence, Jesse, Otto and Francisco for their continued friendship and support. To Arne and Celine, for demonstrating that there is life after kids. To Amit and Tamar, for six years of friendship, philosophy, and indian food. To Tova and Moti, for their unconditional support. To my siblings, Dana, Yael, Moran, Uzi, and Neta, who put up with it all, and remind me where home is. To my grandfather, in whose footsteps I am trying to follow. To my parents for both nature and nurture, and for fostering my inquisitiveness, even when it led me to the origin of flags and the salaries of international referees. Above all, to Liraz, Ella and Sammy, for being my girls. 3 Contents I Introduction II 5 Contributions 10 1 Shape and evolution of the fundamental niche in marine Vibrio 12 2 Population Genomics of Early Events in the Ecological Differentiation of Bacteria 33 3 Sympatric speciation: when is it possible in bacteria? 44 4 Inferring Correlation Networks from Genomic Survey Data 66 5 Inferring Strain Diversity from Metagenomic Data 93 III Conclusion 107 IV Bibliography 111 4 Part I Introduction 5 Microbes play a central role in diverse processes ranging from global elemental cycles [1] to human digestion. Gleaning insights regarding such complex natural systems depends critically on data availability. A primary source of data for microbial ecology is the rapidly increasing amounts of fully sequenced bacterial genomes, and genomic surveys [2]. These data provide a unique opportunity for investigating ecological processes and the forces that drive them, as genomes bear the markings of their evolutionary history, and genomic surveys record the taxonomic and functional composition of entire microbial communities. Extracting meaningful insights from biological sequence data requires overcoming several challenges. Sequence data typically provide a comprehensive characterization of individual genomes or a broad characterization of whole communities, each presenting its own challenges. Individual genomes, though providing a record of their evolutionary history, do so in a non-trivial way, integrating the effects of multiple selective pressures whose relative importance changes over time and across environments. Moreover, bacterial genome are often comprised of sequence blocks that have evolved under distinct sets of selective pressures, due to the fact that bacteria engage in a non-hereditary exchange of genetic material between potentially distantly related individuals, a phenomenon known as horizontal gene transfer (HGT) [3]. Despite the fact that genomic surveys provide broad coverage, they give incomplete representations of the surveyed communities: surveys based on specific marker genes provide limited phylogenetic resolution [4] and do not capture flexible or mobile elements [5]; and metagenomic surveys provide broad coverage of the genomic makeup of a community, but offer limited information regarding the linkage between the observed elements [6]. The works presented in this thesis attempt to overcome the aforementioned challenges and leverage both full genomes and genomic surveys to elucidate the interplay between microbes and their environment, on scales ranging from individual strains to entire communities. First, Chapter 1, is concerned with the potential ecological niche species can occupy in the absence of inter-species competition, a topic prevalent in the theoretical literature, but surprisingly scant in the experimental one. In this 6 chapter, we present a novel experimental apparatus to jointly measure the potential temperature and salinity range of individual strains, and develop a conceptual framework for linking these ranges to the coupling between underlying molecular stress response pathways. Applying these tools to 16 marine Vibrio strains, we find that only a subset of possible stress coupling mechanisms are observed, though we could not detect a selective signature for this finding. We further show that salinity tolerance is at odds with the strains’ natural habitats, and provide evidence that this inconsistency may be explained by coupling between salinity and temperature tolerance at the molecular level. The theory and analysis included in this work are my own, whereas the experimental work was conducted by my co-authors. Chapters 2 and 3 are focused on the genetic basis of adaptation and ecological differentiation in bacteria. One popular model for the evolutionary dynamics of ecological adaptation is the stable ecotype model [7]. In this model individuals that acquire novel adaptive alleles proceed to take over their ecological niche in a selective sweep that purges all the genetic diversity within the niche. However, akin to the effect of recombination in sexual eukaryotes, HGT may break the linkage between the adaptive alleles and the genetic background in which they arise, and thus preserve some genome-wide diversity [7]. In previous work, I have used stochastic simulations to show that the amount of genome-wide diversity following a selective sweep at a particular adaptive loci depends critically on the recombination rate and the strength of selection [8]. In light of these ideas, in Chapter 2, we analyze the genomes of two closely related, yet ecologically distinct populations of the marine bacteria Vibrio splendidus. We find that most loci recombine freely across habitats, and ecological differentiation is likely driven by a small number of habitat-specific alleles. Moreover, we show that there is reduced gene flow between habitats, and propose that the separation of gene pools between the populations may gradually increase over time. Such a separation of gene pools across habitats is consistent with the stable ecotype model, which may apply to later stages of ecological differentiation in recombining bacteria. My contributions to this work include the flexible genome analysis that demonstrated the reduced gene 7 flow across habitats, and the identification of several habitat specific genes that may contribute to habitat adaptation. The results of a simulation study designed to better understand the role HGT plays in ecological differentiation and sympatric speciation are presented in Chapter 3. These simulations extend our previous work to include two distinct, yet genetically coupled niches, rather than a single niche. Our simulations demonstrate that HGT has opposing effects during early and late stages of the differentiation process. Initially, HGT promotes adaptation by bringing together multiple adaptive alleles into a single genome. In contrast, in the absence of genetic isolation, HGT later on mixes different niche-adapted genomes and creates unfit intermediates. We show that, without timevarying recombination rates, this tradeoff is resolved when only a small number of loci are required in order to adapt to a new niche. The final two chapters present algorithms used to extract information regarding community structure from genomic surveys, whose popularity has been rapidly increasing over recent years. Chapter 4 presents SparCC, an algorithm capable of inferring correlations between genes or species from relative abundance data, which is a first step on the road to identifying ecological interactions. Analysis of both simulated data and datasets collected from 18 sites on the human body demonstrates that, as genomic survey data give only relative abundances, traditional correlation analysis may yield erroneous results. We further show that the accuracy of standard correlations depends critically on the diversity of the community studied and on the density of the true correlation network, and that spurious correlations prevail in several real-world datasets. In contrast, such artifact are absent from the rich interaction networks inferred by SparCC. In Chapter 5, we introduce StrainFinder, an algorithm which uses metagenomic data to identify strains at a higher phylogenetic resolution than that offered by 16S profiling. Such high resolution data are crucial for detecting migration between environments, and provide a unique opportunity for gleaning insights into the assembly of natural communities. StrainFinder leverages the fact that alleles found in a common set of strain should have similar proportions in a metagenomic dataset, a fact that 8 is formalized in a probabilistic framework that also accounts for the sampling noise, and facilitates the inference of strain abundances. When applied to metagenomic data sampled from human cheeck and feces, we find that operational taxonomic units (OTUs) identified in corresponding 16S surveys are rarely composed of a single dominant strain. Moreover, we find that strain diversity differs consistently between OTUs across individuals, and that these differences cannot be accounted for by the sampling body site. 9 Part II Contributions 10 Chapter 1 Shape and evolution of the fundamental niche in marine Vibrio Arne C. Materna*, Jonathan Friedman*, Claudia Bauer, Christina David, Sara Chen, Ivy B. Huang, April Gillens, Sean A. Clarke, Martin F. Polz and Eric J. Alm *These authors contributed equally to this work. This chapter is presented as it originally appeared in ISME Journal 320, 1081 (2008). Corresponding Supplementary Material is appended. Chapter 1 Shape and evolution of the fundamental niche in marine Vibrio Hutchinson’s fundamental niche, defined by the physical and biological environments in which an organism can thrive in the absence of interspecies interactions, is an important theoretical concept in ecology. However, little is known about the overlap between the fundamental niche and the set of conditions species inhabit in nature, and about natural variation in fundamental niche shape and its change as species adapt to their environment. Here, we develop a custom-made dual gradient apparatus to map a cross-section of the fundamental niche for several marine bacterial species within the genus Vibrio based on their temperature and salinity tolerance, and compare tolerance limits to the environment where these species commonly occur. We interpret these niche shapes in light of a conceptual model comprising five basic niche shapes. We find that the fundamental niche encompasses a much wider set of conditions than those strains typically inhabit, especially for salinity. Moreover, though the conditions strains typically inhabit agree well with the strains temperature tolerance, they are negatively correlated with the strains salinity 12 tolerance. Such relationships can arise when the physiological response to different stressors is coupled, and we present evidence for such a coupling between temperature and salinity tolerance. Finally, comparison with well documented ecological range in V. vulnificus suggests that biotic interactions limit the occurrence of this species at low temperature - high salinity conditions. Our findings highlight the complex interplay between the ecological, physiological and evolutionary determinants of niche morphology, and caution against making inferences based on a single ecological factor. 13 Introduction In his often-cited 1957 essay, G. E. Hutchinson introduced the concept of the fundamental niche as the combination of all environmental conditions (biotic and abiotic) that permit a species to sustain its population in the absence of interference from other species [9]. Since that time, the niche concept has been one of the most studied and one of the most debated concepts in theoretical ecology [10–13]. Despite its central importance, few empirical studies have been undertaken to map the fundamental niche space associated with non-consumable environmental factors (NEFs), such as temperature, salinity and pH [14, 15]. Rather, studies have primarily focused on the effects of multiple consumable environmental resources (e.g., nutrients) on inter-species competition [11, 12, 16–18]. Hutchinson considered whether niche shapes, i.e., the two-dimensional cross-section of a pair of environmental factors, could prove informative regarding the physiological response of an organism toward those factors. He deduced that independently acting factors should map out a rectangular fundamental niche, because the maximum tolerance to either factor is independent of the other, and supposed that interactions between factors could lead to more complex shapes. In this way, a species niche shape reflects the organism’s physiological response, e.g., whether the stress response pathways for each factor are independent. In a broader evolutionary context, the population-wide variation in niche shape and the plasticity of these shapes with respect to mutation will have profound effects on how species adapt to new or changing environments. For example, if niche shapes are highly constrained within species, then large changes in environmental factors can lead to significant changes in community composition with many species being displaced. On the other hand, if variation is high within populations, community composition could be robust to large changes in environmental parameters. In this study, we combine theory and experiment to address the following fundamental questions: (i) What basic classes of two-dimensional niche shapes are anticipated based on considerations of organismal physiology? (ii) Which niche shapes are 14 observed? (iii) How does the fundamental niche relate to the environment species typically inhabit? (iv) How do niche shapes vary within and between populations? Although our scientific questions and theoretical development apply generally to any model system and any environmental factors, we focused our experimental efforts on the salinity-temperature niche of 17 environmental isolates of the bacterial genus Vibrio, representing 11 well-characterized and genetically diverse species, isolated from a range of different environments (Table 1.1) [19–21]. Bacteria constitute a convenient model system since their growth can be assayed in high throughput over a range of tightly controlled environmental conditions. The Vibrio genus is an excellent model system because it is found in a notable range of environmental temperatures and salinities [22], which have been shown to be the strongest environmental determinants of microbial community composition [23, 24]. Moreover, there is evidence that temperature and salinity are prominent factors in Vibrio ecology, population dynamics, physiological stress response, and evolution [25–29]. Materials and methods Bacterial strains and isolates The following bacterial strains were used in this study: V. parahaemolyticus EB101 (ATCC 17802), V. fischeri MJ11 (ATCC 7744), V. cholerae N16961 (ATCC 39315), V. metecus strains OP6B# and OP3H#, Listonella (formerly Vibrio) anguillarum NCMB 6 (ATCC 19264), Vibrio sp. (DAT7222), Vibrio sp. MED222 , and the V. splendidus strains 12B1*, 12F1*, 12A10*, 14B11*, 12E10*, Vibrio sp. F12*, V. crassostreae 1C5*, V. cyclitrophicus strains 1-273 and Z-264. Strains marked with * and # where isolated from seawater collected at the Plum Island (PIE) Long-term Ecological Research (LTER) site (http://ecosystems.mbl.edu/pie/data/est/EST.htm) [30] and Oyster Pond, Woods Hole, MA [31], respectively. Vibrio sp. DAT7222 was isolated from aquaculture tanks used to raise marine mud-crab larvae in Australia, Darwin, Northern Territory and was kindly provided by Yan Boucher [32] Vibrio sp. MED222 15 16 c b 17.27 16.20 16.27 20.38 15.05 15.77 19.72 21.03 15.98 19.70 16.26 17.33 17.49 23.33 16.92 17.09 19.59 15.88 18.50 TM in (°C) 21.95 21.68 20.49 21.71 25.50 23.79 25.33 26.71 22.02 24.36 26.34 25.04 25.45 29.61 27.51 26.99 30.94 33.05 28.32 TOpt (°C) Plum Island Long-term Ecological Research. Biological replicates. Estimated value. 29.16 26.62 27.08 28.05 29.79 25.61 30.15 30.99 33.15 30.00 33.22 33.16 32.48 40.35 33.96 31.55 40.24 36.16 41.02 V. splendidus 12A10 V. splendidus 14B11 V. splendidus 12B1-1b V. splendidus 12B1-2b V. splendidus 12F1 V. splendidus 12E10 V. sp. MED222 V. cyclitrophicus 1273 V. cyclitrophicus Z-264 V. sp. F12 V. crassostrea 1C5 V. fischeri MJ11-1b V. fischeri MJ11-2b V. parahaemolyticus V. sp. DAT7222 Listonella anguillarum V. metecus OP6B V. metecus OP3H V. cholerae N16961 a TM ax (°C) Experiment 7.21 4.94 6.59 6.34 4.29 1.82 4.82 4.28 11.13 5.64 6.88 8.12 7.03 10.74 6.45 4.56 9.30 3.11 12.70 RangeHot (°C) 1.08 0.99 1.07 1.1 1.13 1.03 1.21 1.25 1.35 1.17 1.23 1.38 1.3 1.55 1.27 1.17 1.19 1.32 1.28 SM ax (M NaCl) 1 0.90 0.90 0.83 0.77 0.84 0.83 0.87 0.73 0.90 0.72 0.77 0.78 0.68 0.88 0.75 0.64 0.69 0.66 MHot 16 16 16 16 16 22 16 16 16 16 NA NA 20–30c NA 25 25 NA PIE-LTER PIE-LTER NW Mediterranean Sea, 1 m depth PIE-LTER PIE-LTER PIE-LTER PIE-LTER North sea Dried sardines (shirasu), Japan Aquafarm, Darwin, NT, Australia Ulcerous lesion in cod Oyster pond, MA, USA Oyster pond, MA, USA Stool of cholera patient, Bangladesh Tisolation (°C) PIE-LTERa PIE-LTER PIE-LTER Isolation environment NA 0.52c NA 0.07 0.07 NA 0.52–0.58 0.52–0.58 NA NA NA 0.52–0.58 0.52–0.58 NA 0.52–0.58 0.52–0.58 0.52–0.58 Sisolation (M NaCl) Table 1.1: Measured niche shape parameters and strain isolation conditions was kindly provided by Jarone Pinhassi, University of Kalmar, Sweden. Note that strains obtained from the ATCC may have undergone periods of replication in constant laboratory conditions. Cultures and growth conditions To ensure clonality of cultures, cells were streaked out to single colonies on solid medium (all analyzed strains lack a swarming motility phenotype on solid medium); subsequently a single colony was picked for inoculating liquid medium. This isolation step was generally applied before and after storage of a strain at -80C. To prepare cultures for -80C storage, autoclaved glycerol was added to a final concentration of 25% v/v. Cultures were maintained in tryptic soy broth (TSB) supplemented with 2% w/v additional sodium chloride (NaCl). For preparation of solid TSB medium, Bacto Agar (BD, Franklin Lakes, NJ, USA) was added to a concentration of 1.5% w/v. Cultures were grown at 22C with gentle shaking (190 rpm). For all experiments on 2D gradients, salinity-adjusted medium (LB-S) was prepared following the protocol for lysogeny broth (LB) medium with the NaCl concentration adjusted as required. We relied solely on NaCl for adjusting the salinity of the growth medium since Stanley and coworkers [33] have shown that NaCl is the most effective salt with respect to restoring normal temperature tolerance in bacteria when replenishing dilute seawater. Other deviations from standard LB medium were that the pH of LB-S medium was adjusted to 8.0 using 1N NaOH and the concentration of agar in solid LB-S was raised from 1.2% to 1.5% wv. Preparation of 2-dimensional (2D) salinity/temperature gradients Orthogonal salinity and temperature gradients were established in disposable, nonvented 241 x 241 x 20 mm bio-assay polystyrene dishes (Nunc, Thermo Fisher Scientific, Roskilde, Denmark). Salinity gradients were established by diffusion between two adjacent medium layers [14] using the following procedure: The dish was tilted 17 by raising one side by 5 mm and a wedge-shaped high salinity base layer was poured, using 170 ml solid LB-S medium containing 2.2 M NaCl. After solidification the plate was leveled and the top layer was poured using 170 ml solid LB-S medium without additional NaCl (0 M, Fig. S1). The gradient was equilibrated for 43 h to ensure diffusion-driven formation of the salinity gradient along one axis of the bio-assay dish. Prior to inoculation, any residual water (from condensation during cooling of the solid medium) was removed from the surface of the gradient followed by drying for 10 min under sterile laminar air flow. Stable and reproducible temperature gradients were established using a custom made device (supporting text), into which the bio-assay dish was installed after establishment of a salinity gradient (Fig. S1). Inoculation of 2D salinity/temperature gradients After colony purification on solid TSB medium, cultures were grown overnight in liquid LB-S containing 0.5 M NaCl. After adjusting OD600 to 0.07 using a NanoDrop 1000 (Thermo Scientific, Wilmington, DE, USA) spectrophotometer (1 mm optical path length), the cultures were further diluted in LB-S 0.5 M NaCl to a density of 6 ∗ 107 cells/ml. Cells were then spotted onto the surface of the salinity gradient in a regularly spaced grid of 24 x 24 spots using a 96-well replicator, rather than spread across the surface to avoid cell density effects (e.g., colonies that extend into regions that otherwise do not support growth from individual cells). Assuming that each pin of the 96-well replicator transfers a volume of 2-20ul, the number of cells inoculated onto each spot is estimated to be on the order of 105 − 106 . The distance of 0.9 mm between neighboring culture spots corresponds to the pins on a 96-well plate replicator. Data acquisition from 2D salinity/temperature gradients Gradient plates were incubated for 24h +/-2h after inoculation. Zero Net Growth Isoclines (ZNGIs) were determined from the most extreme temperature/salinity val- 18 ues that developed visible colonies within this time frame (Fig. 1.2A), and were used as proxies for niche boundaries. The salinity of all culture spots along the ZNGI and the temperature gradient were measured, and a digital image of the experiment was taken. Since temperature gradients proved stable along the salinity axis (Fig. S3B, Fig. S5D), the temperature of culture spots along the ZNGI could be inferred from the temperature gradient, rather than measured for each spot individually. In each experiment, temperature was measured in 1 cm intervals along the temperature axis, using a handheld IR thermometer, with four replicate measurements taken at each point. These measurements were fit to a cubic function, which was then used to infer the temperature at the ZNGI culture spots. By contrast, the salinity gradients were more variable along the temperature axis (Fig. S4D, Fig. S5B), necessitating the direct measurement of salinity at each of the ZNGI culture spots. At each spot, a small core sample was taken, and separated into liquid and solid components by centrifugation for 30 min at 20,800 x g. The liquid phase was diluted in deionized H2O, and ppt salinity measured using a refractometer. A standard curve was used to convert the measured salinities into mol/l NaCl (Fig. S2). Two duplicate measurements were performed on each core, and their mean was taken as the spots salinity. Phylogenetic analysis and independent contrasts A phylogenetic tree of the studied strains plus E. coli K12 (outgroup for rooting) was constructed based on partial sequences from 8 housekeeping genes: hsp60, recA, gyrB, sodA, adk, mdh, pgi, and chiA. Sequences for strains Vibrio sp. MED222, V. parahaemolyticus, and V. fischeri MJ11, and Listonella anguillarum were obtained from complete genomes available on GenBank. For all other strains, partial sequences were obtained during previous studies using multilocus sequencing [31, 34, and references therein]. Sequences were aligned using MUSCLE [35], and manually trimmed before concatenating all alignments. A phylogenetic tree was inferred using PhyML [36] with default settings. We note that this tree reflects the strains evolutionary history, rather than the evolutionary history of particular genes that determine the niche shape. 19 Felsensteins independent contrasts method [37, 38] was implemented in the Python programming language, and used to compute correlations between shape parameters and the statistical significance of these correlations. This method is utilized since standard Pearson correlations do not take into account the phylogenetic relationships between strains, which may lead to spurious correlations. The method of independent contrasts circumvents this problem by constructing new, independent variables, which are differences between the parameters of different strains. This method is based on the assumptions that trait evolution is consistent with a neutral Brownian motion model, and that dependencies arise from shared evolutionary ancestry. The validity of these assumptions was assessed using two different methods, as suggested by Garland et al. [39]. First, under the model assumptions normalized contrasts are predicted to be independent of the square root of the sum of their branch lengths. This expectation was validated by performing a linear ordinary least-squares fit between these variables. Contrasts that were clear outliers with respect to any of the niche parameters were excluded from further analysis (e.g. Fig. S8). These contrasts corresponded to nodes with extremely short branch length, and are known to be very sensitive to measurement errors [40]. Second, the model prediction that contrasts be normally distributed was assessed using the Shapiro-Wilk test for normality (Table S1). Theory Theory of niche shapes and corresponding classes of stress interactions To avoid ambiguity, we formally define the fundamental niche and related terms, as used in this paper. Following Hutchinson, the fundamental niche represents the combination of environmental conditions that permit a species to maintain its populations in the absence of other species. Each combination of permissive conditions can be represented as a point in a high-dimensional space in which each coordinate represents the magnitude of a single environmental factor. The collection of all such points is 20 termed the niche space, and the niche boundary is defined as the curve encapsulating all permissive conditions in the niche space. Equivalently, for a given species, let g(E) be the species net growth rate at environmental conditions E. The fundamental niche corresponds to the set of conditions for which g(E)0, and the niche shape to the set of conditions for which g(E) = 0. Because it is not possible to characterize the full (multi-dimensional) extent of the niche space for an organism, we focus on a crosssection in which two non-consumable environmental factors (NEFs), temperature and salinity, are systematically varied. Changing other environmental factors, such as pH or type of media, can affect the shape of the temperature-salinity fundamental niche since a different cross-section of the full niche space might be measured. Five basic classes of niche shapes are illustrated in Fig. 1.1. These shapes capture the effect of two NEFs (with levels E1 and E2) on a single population, with all other parameters held constant. We assume there exists a unique optimal condition E1opt , E2opt that supports the largest sustainable population, and defined the origin to be this condition. The axes thus correspond to deviations from optimal conditions, or stress levels. The curves represent niche boundaries, and contained between a curve and the axes is the area corresponding to the fundamental niche. It is important to note that different quadrants represent different types of stress (e.g., heat stress vs. cold stress), which may have entirely different response pathways and couplings. Thus, the niche shape may vary between quadrants. Fig. 1.1 represents only one quadrant of the niche shape, in which both NEF are at, or above their optimal level. We focus on basic niche shapes that can be interpreted in terms of the physiology underlying the organisms stress response, rather than developing complex models that are able to capture the finer details of niche shapes. To do so, we considered shapes corresponding to zero, linear, and non-linear interactions between stresses, with the restrictions that the type of stress interaction is independent of the stress levels (i.e., the niche boundary does not cross the additivity line), and that the sign of the curvature of the boundary is constant. An analogous approach is widely used in multicomponent drug therapeutics [41, 42], where it is known as isobolographic analysis [43]. 21 These assumptions result in five basic shapes: concave, diagonal, convex, rectangular, and super-convex (Fig. 1.1 and Table 1.2). The simplest case is that of non-interacting stressors, which lead to a rectangular niche shape since the maximal tolerance to either stressor is independent of the level of the other stressor. One way stresses can interact is through both creating demand for a common, limiting cellular resource, such as ATP. That is, some amount of the common cellular resource is dedicated to alleviate the effects of either stress. If the stress level maps to limiting resource consumption in a linear fashion, the fundamental niche boundary will be diagonal. On the other hand, if the resource consumption increases faster or slower than linear with stress levels, niche shapes can be concave or convex, respectively. Other shapes can arise when the concentration or activity of some cellular components (e.g., proteins or pathways) affects both stresses simultaneously. These components may not compete for cellular resources, but instead change the physicochemical properties of the cell. For example, the permeability of the cell membrane is altered by the concentration of porins, and can influence the effect of multiple stresses. Super-convex shapes can arise if the effects of one stress alleviate the effects of another, and concave shapes can result if the optimal activity of these cellular components is different. These basic niche shapes are a simplification and an idealization, which is not expected to be precisely reproduced in observed niche shapes. Rather, they enable the identification of the dominant features of stress response physiology. For instance, though stress response is typically multifaceted, if a nearly diagonal shape is observed (e.g., Fig. 1.2C), it is likely that the main response pathways of the two stresses compete for the same cellular resource. 22 10 9 8 Super-convex (Anagonistic) 7 S2 S2 6 Rectangular (Noninteracting) 5 Convex (Antagonistic) 4 Diagnonal (Additive) Concave (Synergistic) 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 S1 S1 Figure 1.1: Five classes of fundamental niche morphology and underlying stress interactions. Axes measure the deviations from the optimal conditions (the origin), i.e., the stress due to the environmental factors. The curves represent niche boundaries, with the area contained between a curve and the axes corresponding to the fundamental niche area. Each curve is labeled according to its shape, with the corresponding type of stress interaction indicated in parentheses. Note that only the positive quadrant (both NEFs above their optimal levels) is depicted. Results Measurement of 2-dimensional (2D) salinity-temperature gradients We designed an inexpensive platform to create 2D orthogonal gradients of temperature and salinity on solid media, which allowed estimation of niche shapes realized in natural populations and their comparison to our simple theoretical model. We focused on the high temperature/high salinity quadrant, because our experimental apparatus did not allow us to examine sufficiently low temperatures for many strains. In addition, we found that upon removal of the temperature stress, new colonies developed at the cold temperature region, whereas no new colonies form at the high tempera- 23 B 0.4 M temperature gradient 1.7 M A temperature [˚C] 1.4 1.4 SMax ATot salinity (NaCl) gradient 1.3 salinity [M NaCl] 1.2 1.2 SMin, cold AFN 1.1 Sal [M NaCl] 1.3 1.1 1 1.0 0.9 0.9 0.8 0.8 0.7 SMin, hot TExt, cold TOpt TExt, hot 0.6 0.5 ˚C 40.0 ˚C 0.00 M 1.0 ˚C 0.10 M 0.05 M 0.6 1.5 ˚C 0.7 14 14 16 16 18 18 20 22 22 24 24 26 26 28 28 30 30 32 32 34 34 T [˚C] 15.0 ˚C C 1.25 1.25 V. metecus OP6B 1.40 1.4 1.10 1.1 V. fischeri MJ11 1.30 1.3 1.05 1.05 1.10 1.1 1.20 1.2 1.001 1.05 1.05 1.001 0.95 0.95 0.90 0.9 0.85 0.85 M = AFN/ATot M = 0.64 1.10 1.1 1.001 M = AFN/ATot 0.90 0.9 = 0.77 [˚C] TT [˚C] 0.90 0.9 0.80 0.8 M = AFN/ATot M = 0.90 0.75 0.75 0.7 0.70 20 20 22 24 24 26 26 28 28 30 30 32 32 34 36 38 38 40 42 42 V. splendidus 12B1 0.95 0.95 0.85 0.85 M 0.80 0.8 0.80 0.8 0.75 0.75 Sal [M NaCl] 1.15 1.15 Sal [M NaCl] Sal NACl] Sal [M [M NaCl] 1.20 1.2 18 18 20 20 22 22 24 24 26 [˚C] TT [˚C] 28 28 30 32 32 17 18 19 19 20 21 22 23 23 24 25 26 26 27 27 28 28 [˚C] TT [˚C] Figure 1.2: Zero Net Growth Isoclines (ZNGI) on 2 dimensional salinitytemperature gradients. (A) Example of a 2-dimensional solid-medium salinitytemperature gradient established in 24x24 cm square culture dishes. White spots are bacterial colonies. Dark spots are locations along the estimated ZNGI where salinity was measured directly by destructive coring. The outer plot along the x-axis shows the temperature values (bottom plot, solid lines) and their measurement error (upper plot, diamonds) along the test gradient, when a low (blue) or high (red) temperature range setup was employed. Corresponding plots for salinity are shown along the y-axis (for details see Materials & Methods, supporting text). (B) Example of a measured ZNGI (open circles) and niche shape parameters. Parameters are shown for both the high temperature and low temperature quadrants, though our analysis focuses on the high temperature quadrant (see main text). TExt,hot (TM ax ) - highest permissive temperature. TExt,cold - lowest permissive temperature. TOpt - optimal temperature (as defined in main text). SM ax - highest permissive salinity. SM in - lowest permissive salinity. AF N - area of the fundamental niche (shaded). AT ot - area of the fundamental niche if the stresses acted independently. (C) Examples of the three classes of niche shapes observed: diagonal, convex and rectangular (from left to right). Shaded areas correspond to the niche area in the high-temperature quadrant. 24 ture region within an additional 24 hours, consistent with previous studies showing resuscitation following a temperature increase [44]. Therefore, at low temperatures there was ambiguity between no growth and slow growth phenotypes, while the effect of higher temperatures was cell death, making analysis more straightforward. Our motivation to focus on the high salinity region was to achieve greater resolution and to be able to use the same rich medium for all strains. Assaying lower salinities would require using strain-specific media, which may differentially affect niche morphology, convoluting niche comparisons. Niche morphology was quantified using the following parameters: highest permissive temperature (TM ax ), highest permissive salinity (SM ax ), optimal temperature (TOpt ), and niche morphology index (M ). The niche morphology index, which summarizes how interactions between stressors affect niche boundaries, was defined to objectively classify the different niche shapes. M is defined as the area spanned by the niche divided by the area of the rectangular niche that would be observed had the two stressors been acting independently (Fig. 1.2). Table 1.2 illustrates how M values correspond to different classes of niche shape: concave - M < 0.5; diagonal M = 0.5; convex - 0.5 < M < 1; rectangular - M = 1; super-convex - M > 1. We note that while SOpt was outside the measured salinity range the inferred M values are representative of those of the full high temperature/high salinity quadrant, as long as the class of niche shape is consistent in this quadrant (note that no such switches were observed in the measured range of any of the strains). In general, TOpt can be a function of salinity and requires detailed growth curves at each position to measure directly. We avoided this added complexity by approximating TOpt using two independent methods: (i) we took TOpt as the temperature that allows growth in the highest salinity (SM ax ), and (ii) we defined TOpt as the midpoint of the range of temperatures in which the largest colonies formed. We note that the first approach is not applicable to super-convex shapes. However, the second method, which is capable of detecting super-convex shapes, yielded nearly identical results, with no super-convex shapes. Therefore, we report results based on the first, more straightforward approach. 25 Observed niche shapes M values fell in a narrow range between 0.64-1.00, and the observed niche shapes were all convex, ranging from nearly diagonal, to nearly rectangular (Table 1.1, Fig. 1.3). No concave or super-convex shapes were observed. Niche shape parameters for all strains are summarized in Table 1.1, and niche shapes of all strains are depicted in Fig. S6. We note that classes of niche shapes in the low temperature/high salinity quadrant are consistent with those of the high temperature/high salinity quadrant (Fig. S7), and were excluded solely because we have less confidence in their accuracy, as detailed in the previous section. Maximal permissible temperatures, ranging from 25C to 41C, were consistent with realistic environmental conditions, whereas permissible salinities, ranging from 0.99 M NaCl to 1.55 M NaCl, far exceeded that typical of marine environments of 0.6 M NaCl [45]. Moreover, though the conditions strains typically inhabit agrees well with the strains temperature tolerance, they are negatively correlated with the strains salinity tolerance (Fig. 1.3). This is exemplified by the fact that V. cholerae strains from warm estuarine and freshwater environments display high temperature and salt tolerance, whereas V. splendidus strains from a colder marine environment have a markedly lower temperature and salt tolerance. Additionally, we observe that the various niche parameters do not vary independently. Salinity and temperature tolerance are positively correlated with each other, and negatively correlated with M values (Fig. 1.3). Thus, strains that display niche shapes consistent with independent responses to both stressors generally have lower overall tolerance to those stressors. By contrast, strains with high tolerances to a single stress tend to also have a high tolerance to the other stressor, and have a niche shape consistent with the existence of coupling between the stress responses. Although these trends appear highly significant, spurious correlations between independently evolving factors can arise because of the phylogenetic structure of the data. To control for phylogenetic effects, we used Felsensteins method of independent contrasts [37, 38] to compute correlations between niche shape parameters and eval- 26 Smax Mhot Legend: 0.1 0.1 Tmax TMax SMax MHot Smax Tmax Mhot 27 0.1 0.1 Legend: Tmax 0.1 Legend: Tmax Figure 1.3: Evolution of the fundamental niche. (A) ZNGIs of all strains at the high temperature/high salinity quadrant, colored by M values. ZNGIs tend to be located along the diagonal from the bottom-left corner to the top-right corner, indicating a correlation between temperature and salinity tolerance. Additionally, as strains become more tolerant their niche shape varies from rectangular to diagonal (decreasing M value). (B) Maximal permissive temperature (TM ax ) vs. maximal permissive salinity (SM ax ) and M -value. Each point represents the values measured for a single strain. Note that the correlations were calculated using Felsensteins method of independent contrasts (see main text). (C) MLSA phylogenetic tree of all strains. Corresponding values of TM ax , SM ax and M are given by the color bars to the right. V.spl1C5 V.spl13C1 V.splZ264 V.splI273 V.spMED222 V.spl12E10 V.spl12F1 V.spl12B1 V.spl14B11 V.spl12A10 TMax S Mhot 38 Mhot MHot Smax Mhot 36 MMax Smax Hot 34 S Tmax MMax Smax Hot 32 0.1 0.1 30 0.1 0.1 SMax [M NaCl] 1.4 1.3 1.2 V.chN16961 V.chN16961 V.chOP4A V. cholera N16961 V.psOP3H V.psOP3H OP3H metecusN16961 V. cholera V.chN16961 V.chN16961 V.chOP4A V. V.psOP6B V.psOP6B OP6B metecus V. V.psOP3H V.psOP3H V.chN16961 OP3H metecus V. V.chN16961 V. cholera N16961 L.anguill L.anguill anguillarum L. V.psOP6B V.psOP6B V.psOP3H OP6B metecus V. V.psOP3H OP3H metecus V. V.harveyi V.harveyi L.anguill DAT7222 sp. V. L.anguill V.psOP6B anguillarum L. V.psOP6B OP6B V. metecus V.parahaem V.parahaem parahaemolyticus V. V.harveyi V.harveyi L.anguill DAT7222 sp. V. L.anguill anguillarum L. V.fischeri V.fischeri MJ11 fischeri V. V.parahaem V.parahaem V.harveyi V. V.harveyi sp. DAT7222 V. parahaemolyticus V.spl1C5 V.spl1C5 1C5 crassostrea V. V.fischeri V.fischeri V.parahaem MJ11 fischeri V. V.parahaem parahaemolyticus V. sp. V.spl13C1 V.spl13C1 F12 V. V.spl1C5 V.spl1C5 1C5 crassostrea V. V.fischeri V.fischeri MJ11 fischeri V. V.splZ264 V.splZ264 Z-264 cyclitrophicus V. sp. V.spl13C1 V.spl13C1 V.spl1C5 F12 V. V.spl1C5 1C5 crassostreaV.splI273 V.splI273 1-273 cyclitrophicus V. cyclitrophicus V.splZ264 V.splZ264 Z-264 V. V.spl13C1 V.spl13C1 V. sp. F12 V.spMED222 V.spMED222 MED222 sp. V. V.splI273 V.splI273 1-273 cyclitrophicus V. V.splZ264 V.splZ264 Z-264 V. cyclitrophicus V.spl12E10 V.spl12E10 splendidus V. sp. V.spMED222 V.spMED222 MED22212E01 V. V.splI273 V.splI273 1-273 V. cyclitrophicus V.spl12F1 V.spl12F1 12F01 splendidus 12E01 V. splendidus V.spl12E10 V.spl12E10 V. V.spMED222 V.spMED222 MED222 sp. V. V.spl12B1 V.spl12B1 12B01 splendidus V. V.spl12F1 V.spl12F1 12F01 splendidus V. V.spl12E10 V.spl12E10 12E01 splendidus V. V.spl14B11 V.spl14B11 14B11 splendidus 12B01 V. splendidus V.spl12B1 V.spl12B1 V.spl12F1 V. V.spl12F1 12F01 V.spl12A10 V.spl12A10 12A10 splendidus V. V.spl14B11 V.spl14B11 V. V.spl12B1 V.spl12B1 12B01 splendidus 14B11 V. splendidus V.spl12A10 V.spl12A10 V.spl14B11 V. splendidus 12A10 V.spl14B11 14B11 V.spl12A10 V.spl12A10 V. splendidus 12A10 V.chOP4A SMax MHot 0.8 1.1 1.0 40 MHot 28 28 28 Max TMax [˚C] 30 32 34 36 30 T32 34 36 Max [˚C] 30 T32 34 36 [˚C] V.spl12A10 V.chN16961 V.chOP4A V.psOP3H V.chN16961 V.chOP4A V.psOP6B V.psOP3H V.chN16961 L.anguill V.psOP6B V.psOP3H V.harveyi L.anguill V.psOP6B V.parahaem V.harveyi L.anguill V.fischeri V.parahaem V.harveyi V.spl1C5 V.fischeri V.parahaem V.spl13C1 V.spl1C5 V.fischeri V.splZ264 V.spl13C1 V.spl1C5 V.splI273 V.splZ264 V.spl13C1 V.spMED222 V.splI273 V.splZ264 V.spl12E10 V.spMED222 V.splI273 V.spl12F1 V.spl12E10 V.spMED222 V.spl12B1 V.spl12F1 V.spl12E10 V.spl14B11 V.spl12B1 V.spl12F1 V.spl12A10 V.spl14B11 V.spl12B1 V.spl12A10 V.spl14B11 V.chOP4A had a correlation of -0.56 (p-value 4.6e-2). Legend: TMax Tmax Legend: T SMax Max Tmax 0.1 0.1 28 Mhot 26 Mhot V.chOP4A T M S ax M M ax Ho t 0.9 Legend: 0.1 0.1 V.chN16961 V.chN16961 V. cholera N16961 V.psOP3H V.psOP3H V. metecus OP3H V.psOP6B V.psOP6B V. metecus OP6B L.anguill L. anguillarum L.anguill V.harveyi V.harveyi V. sp. DAT7222 V.parahaem V.parahaem V. parahaemolyticus V.fischeri V.fischeri V. fischeri MJ11 V.spl1C5 V. crassostrea V.spl1C5 1C5 V.spl13C1 V.spl13C1 V. sp. F12 V.splZ264 V.splZ264 V. cyclitrophicus Z-264 V.splI273 V.splI273 V. cyclitrophicus 1-273 V.spMED222 V. sp. MED222V.spMED222 V.spl12E10 V.spl12E10 V. splendidus 12E01 V.spl12F1 V.spl12F1 V. splendidus 12F01 V.spl12B1 V.spl12B1 V. splendidus 12B01 V.spl14B11 V.spl14B11 V. splendidus 14B11 V.spl12A10 V.spl12A10 V. splendidus 12A10 0.1 26 26 26 1.5 Smax Mhot 40 40 40 SMax S MMax Hot SMax M Hot MHot 1.6 Smax 40 0.68 0.72 0.76 0.80 0.84 0.88 0.92 0.96 0.68 0.72 0.76 0.80 0.84 0.88 0.92 0.96 0.68 0.72 0.76 0.80 0.84 0.88 0.92 0.96 0.68 0.72 0.76 0.80 0.84 0.88 0.92 0.96 0.1 B Tmax Smax 35 C C C 30 35 35 35 0.6 Legend: 30 Temp [˚C] 30 [˚C] Temp 30 [˚C] Temp Temp [˚C] 0.8 MHot 25 25 25 1.4 MHot MHot 0.1 20 20 20 1.6 1.6 1.5 1.6 1.5 1.5 1.4 1.4 1.3 1.4 1.3 1.2 1.3 1.2 1.1 1.2 1.1 1.1 1.0 1.0 0.9 1.0 0.9 0.9 1.6 0.4 0.4 MHot 0.6 0.6 0.4 25 0.8 0.8 0.6 C 20 1.0 1.0 0.8 B B B 1.0 1.2 1.2 1.0 Legend: 1.4 1.4 1.2 0.4 [M NaCl] [M NaCl] SalSal [MSal NaCl] Sal [M NaCl] 1.2 SMax [M NaCl] S[M [M NaCl] SMax NaCl] Max A T T T M M M S aSx aSx ax M M M M aMx aMx ax Ho Ho Ho t t t A 1.6 A 1.6 A 1.4 1.6 38 38 38 40 40 40 uate their significance. Even after accounting for the phylogeny, the observed trends remained statistically significant: TM ax and SM ax had a correlation of 0.86 (p-value 7.2e-4); TM ax and Mhot had a correlation of -0.62 (p-value 2.7e-2); SM ax and Mhot 1.0 0.95 0.9 0.85 0.75 0.7 0.65 0.6 TMax [˚C] V.chOP4A V.chN16961 V.psOP3H V.psOP6B L.anguill V.harveyi V.parahaem V.fischeri 1.0 1.0 0.95 1.0 0.95 0.9 0.95 0.9 0.85 0.9 0.85 0.8 0.85 0.8 0.75 0.8 0.75 0.7 0.75 0.7 0.65 0.7 0.65 0.6 0.65 0.6 0.6 MHot MHot MHot Evolution of the fundamental niche Comparison of niche shape across 17 strains provides a clear answer to the question of whether the majority of variation in niche shape occurs within or between species. Inspection of Figure 3C shows that niche shapes tend to be more similar between closely related strains, with very limited or no qualitative niche morphology changes occurring within species. For example, the cluster of V. splendidus isolates shows low tolerance to high temperatures and salinities, and almost rectangular niche shapes (high M ), while the V. cholerae/metecus isolates display high temperature and salinity tolerances and nearly diagonal niche shapes (low M ). The average difference in niche parameters measured between these groups exceeds the within-population variation by a factor of 2.5 (for SM ax ) to 5 (for TM ax ). Discussion The fundamental niche captures the range of environmental conditions in which a species can persist in isolation. However, since natural communities typically comprise numerous interacting species, the extent to which the fundamental niche is relevant to the ecology of natural communities is unclear and warrants further investigation. In this study, we show that all studied strains could tolerate salinities that are unlikely to be ecologically relevant, indicating that, in this dimension, their fundamental niche is far broader than their ecological range. In addition, Randa et al. [27] found that V. vulnificus usually occurs in low salinity environments, but, as the temperature increases, it can expand its range into higher salinity environments. In light of our results, the fact that V. vulnificus is rare at saline, low temperature environments is likely a consequence of biotic interactions, such as competitive exclusion or predation, rather than an inherent inability to thrive at these conditions. Thus, V. vulnificus realized ecological range is a limited section of its fundamental niche. Temperature and salinity tolerances are positively correlated across all strains in the study despite the fact that these factors are negatively correlated in some of the environments the strains were isolated from (e.g., V. metecus strains from a warm, 28 low salinity environment in contrast to V. splendidus strains from a colder, saline marine environment.). These results indicate that salinity and temperature tolerance mechanisms may be coupled at the molecular level, consistent with the findings of previous studies demonstrating that pre-exposure to osmotic stress increased the tolerance to heat stress in Vibrio vulnicus [46]. Such coupling may result from temperature and salinity responses using common molecular pathways such as the SOS response pathway, the heat-shock response pathway, and the general stress response pathway [47, 48]. However, Rosche et al. have demonstarted that the cross-protection between osmolarity and heat in Vibrio vulnicus is independent of rpoS, the master regulator of the general stress response mechanism, and were unable to elucidate the origin of this cross-protection [46]. An alternative potential mechanism for coupling between stress responses is partitioning of cellular resources between different stress response pathways. In the simplest (linear) model, a cells tolerance would be a function of the sum of the two stress levels, leading to diagonal niche shapes (M ∼ 0.5). Remarkably, we observe an inverse correlation between stress tolerance and M values that remains significant even after controlling for phylogeny. Together with the unexpected positive correlation between the two stress tolerances, this correlation suggests a model in which the temperature and salinity response pathways are coupled, and moreover, that they compete for common cellular resources (e.g., ATP). Temperature adaptation may explain the incongruence between salinity tolerance and environmental salinity. Since maximal temperatures are consistent with conditions at environments strains typically occur in, and strains can tolerate a much wider range of salinities than is ecologically relevant, it is possible that heat tolerance is under much stronger selection than salinity tolerance. Under this hypothesis, heat tolerance is set by the environmental conditions, which in turn affects salinity tolerance, which is mechanistically coupled to it. Thus, like heat tolerance, salinity tolerance is predicted to be correlated with environmental temperature, rather than environmental salinity. This is indeed the case in our data. The potential for coupling between stressors cautions against making inferences 29 based on individual stress tolerance phenotypes, as the results could be at odds with ecology. In fact, we cannot rule out coupling along other unmeasured dimensions (e.g., nutrient levels, pH), nor can we rule out that these factors (rather than temperature) may be driving salinity and/or temperature tolerance levels. Nonetheless, these results indicate that meaningful work linking physiological, ecological, and evolutionary aspects of niche theory will require an appreciation of the nuances of a multi-dimensional and potentially highly coupled niche space. 30 Concave Super-Convex Synergistic Antagonistic Additive Diagonal Convex Type of interaction Non-interacting Niche shape Rectangular The combined stress incurred by a combination of the two factors is more severe than the stress incurred by only one factor at the equivalent level. The combined stress incurred by a combination of two factors is less severe than the stress incurred by only one factor at the equivalent level. Growth rate is determined by a linear combination of the stresses. Description Growth rate is determined by the more severe of the two stresses and is completely unaffected by the other stress. As noted by Hutchinson the corresponding niche shape is rectangular (G.E. Hutchinson, 1957). Multiple forms. compatible functional Concentrations of relevant cellular components under either stress alone are disparate. Concentrations of relevant cellular components under either stress alone are similar. For each unit of either stress a common cellular resource is dedicated to alleviate stress effects. The cellular resource may be ATP, a mineral, specific proteins (e.g. chaperons), etc. g(S1, S2) = f (a1 ∗ S1 + a2 ∗ S2), where f (a1 ∗ S1 + a2 ∗ S2) is any arbitrary function of its argument, and a1, a2 are the amount of common resource required to cope with stresses S1 and S2, correspondingly. Multiple compatible functional forms. Mechanism/example The cellular mechanisms related to the two stresses, and/or the limiting cellular resources used to alleviate their influence are distinct. Functional form g(S1, S2) = min{f 1(S1), f 2(S2)}. Note that f 1(S1) and f 2(S2) can be any two arbitrary functions. M < 0.5 M >1 0.5 < M < 1 M = 0.5 M M =1 Table 1.2: Basic classes of fundamental niche shapes, and corresponding stress interactions 31 Chapter 2 Population Genomics of Early Events in the Ecological Differentiation of Bacteria B. Jesse Shapiro, Jonathan Friedman, Otto X. Cordero, Sarah P. Preheim, Sonia C. Timberlake, Gitta Szab, Martin F. Polz, Eric J. Alm This chapter is presented as it originally appeared in Science 336, 48 (2012). Chapter 2 Population Genomics of Early Events in the Ecological Differentiation of Bacteria Genetic exchange is common among bacteria, but its effect on population diversity during ecological differentiation remains controversial. A fundamental question is whether advantageous mutations lead to selection of clonal genomes or, as in sexual eukaryotes, sweep through populations on their own. Here we show that in two recently diverged populations of ocean bacteria, ecological differentiation has occurred akin to a sexual mechanism: a few genome regions have swept through subpopulations in a habitat specific manner, accompanied by gradual separation of gene pools as evidenced by increased habitat-specificity of the most recent recombinations. These findings reconcile previous, seemingly contradictory empirical observations of the genetic structure of bacterial populations, and point to a more unified process of differentiation in bacteria and sexual eukaryotes than previously imagined. 33 How adaptive mutations spread through bacterial populations and trigger ecological differentiation has remained controversial. While it is agreed that the key factor is the balance between recombination and positive selection, theory and observations remain seemingly at odds. On the one hand, evidence for genes spreading through populations independently via recombination (gene-specific sweeps) is found in observations of environment-specific genes [49] and alleles [50], and reduced diversity at single loci amidst high genomewide polymorphism [51, 52]. On the other hand, mathematical modeling suggests that empirically observed rates of homologous recombination should not be high enough to unlink a gene, which is under even moderate selection, from the rest of the genome [8, 53]. Importantly, this recombination/selection balance, expressed most saliently by the ecotype theory, leads to a prediction that is actually observed but at odds with gene-specific sweeps: bacterial diversity is organized into ecologically differentiated clusters [21, 34, 54]. The proposed mechanism involves cycles of neutral diversification punctuated by genomewide selective sweeps [53]. While the observations of environment-specific genes and locusspecific reduced diversity conflict with the ecotype model of selected clonal genomes, they do not explain why its prediction of coincident genetic and ecological clusters hold true, nor provide insights into the early genomic events accompanying adaptation. How to reconcile these different empirical observations, so seemingly at odds with each other, therefore remains an open question. Here, we test whether recombination is strong enough relative to selection to allow gene-specific rather than genomewide selective sweeps in natural microbial populations and explore the effect on population-level diversity. Using whole-genome sequences from two recently diverged Vibrio populations with clearly delineated habitat associations, we show that genome regions rather than whole genomes sweep through populations, triggering gradual, genomewide differentiation. Our proposed evolutionary scenario is based on three lines of evidence. First, most of the genetic divergence between ecological populations is restricted to a few genomic loci with low diversity within one or both of the populations, suggesting recent sweeps of confined regions of the genome. Second, we show that only one of the two chromosomes compris34 ing the genome has swept through part of one population. Third, the most recent recombination events tend to be population specific but older events are not, reinforcing the notion that these populations are on independent evolutionary trajectories, which may ultimately lead to the formation of genotypic clusters with different ecology. Although such clusters have been interpreted as evidence for the ecotype model, our results suggest that they can arise even in populations that do not experience genomewide selective sweeps. In a previous study, we noticed an instance of very recent ecological differentiation among two populations of Vibrio cyclitrophicus by their divergence in fast-evolving protein-coding genes and differential occurrence in the large (L) and small (S) size fractions of filtered seawater, suggesting association with different zoo- and phytoplankton or suspended organic particle types [21]. This population structure was reproduced across independent samples taken in 2006 and 2009. We sequenced whole genomes from both populations (13 L and 7 S isolates, all obtained in 2006). As in other Vibrionaceae, these genomes consist of two chromosomes, each with a flexible and core component, defined as blocks of DNA not universally present in all isolates or shared by all, respectively. To estimate the extent and patterns of recombination among the isolates, we subdivided the core genome into blocks of DNA on the basis of their supporting different phylogenetic relationships among the 20 isolates (see Materials and Methods). Overall, the ecological populations described here are among the most closely related (identical 16S and ¿99% average amino acid identity) studied with genomewide sequence data, making them an ideal test case for observing the early events involved in ecological differentiation. Genes, not genomes, sweep populations Our first line of evidence favoring gene-specific rather than genomewide selective sweeps is that most of the differentiation between populations is restricted to just a few small patches of the core genome. Ecological differentiation (separate phylogenetic grouping of S vs. L) is supported by 725 ecoSNPs, which cluster in a few discrete patches of the genome (11 in total, 3 of which contain ¿80% of ecoSNPs), whereas 35 the rest of the genome is dominated by 28,744 SNPs rejecting the ecological partition (Fig. 3.1; S1, S2). Any signal of clonal ancestry has been thoroughly obscured by homologous recombination (affecting equally genes of all functions, therefore likely not driven by selection; fig. S3, Materials and Methods), such that no single bifurcating tree relating the 20 strains adequately describes the evolution of more than 1% of the core genome (Fig. 3.1C). Such a pattern could have been produced either by an ancient genomewide selective sweep in one or both populations, followed by recombination between populations eroding the clonal frame down to a few regions, or by recent gene-specific selective sweeps centered on these few regions. The latter explanation is favored because most major ecoSNP clusters (three out of the four peaks in Fig. 3.1B) have significantly lower within-habitat diversity (in one or both habitats) than the chromosome-wide average. The exception is the highly diverse RTX/RpoS locus, which may be under diversifying selection both within and between habitats. The low within-habitat diversity in the other three regions, which account for the majority of ecoSNPs, suggests they arrived recently by recombination (likely from a distantly related population, see Materials and Methods) and swept through a population before accumulating much polymorphism. Our second line of evidence shows that genomic fragments can sweep through populations in an ecology-specific manner without purging genomewide variation. In particular, a large fraction of chromosome II has swept through a subset of the S population, without impacting the diversity of chromosome I. As evidence for this, each chromosome has a distinct core phylogeny, with five of the seven S strains grouping together on chromosome II, but not chromosome I (Fig. 3.1). This 5-S clade (grouping together strains 1F97, 1F111, 1F273, FF274 and FF160; blue branch in Fig. 3.1A and blue points in Fig. 3.1B) is supported by 796 SNPs: 790 on chromosome II and six on chromosome I an over 200-fold imbalance after normalizing by the 1.45X more SNPs/site on chromosome II. Chromosome II also strongly supports one phylogeny within the 5-S strains; SNPs inconsistent with this phylogeny are restricted almost entirely to chromosome I (fig. S4, S5). The degree of support for the 5-S group on chromosome II suggest that a variant of this chromosome swept through these five 36 S strains, independently of chromosome I. The sweep likely occurred recently, before the clear phylogenetic signal within the 5-S strains was disrupted by recombination. This signature of a long stretch of DNA (in this case, a chromosome) largely uninterrupted by recombination is a hallmark of recent positive selection in sexual eukaryotes [55], suggesting a selective sweep of chromosome II independently of the rest of the genome (chromosome I). The mobilization of genomic fragments on the size scale of chromosomes may also explain the hybrid genomes observed in novel pathogenic variants of Vibrio vulnificus [56]. Emergent habitat-specific recombination Our third line of evidence shows how, despite the lack of genomewide selective sweeps, tight genotypic clusters may eventually emerge as a result of preferential recombination within, rather than between, habitats. This is evident from quantification of recent recombination in the core genome, using three very recently diverged pairs of sister strains, 1F175-1F53, 1F111-1F273 and ZF30-ZF207, that group together at nearly all SNPs in the genome (Fig. 3.1A). The grouping of such young sister pairs should only be broken by the most recent recombination events identifiable in our sample, involving one of the sister strains as a donor or acceptor. We quantified such events by counting core genome blocks inconsistent with phylogenetic pairing of sister strains (see Materials and Methods). Out of 93 such blocks (Fig. 3.2A), 76 resulted from one sister strain pairing with another strain from the same habitat. This is significantly more within-habitat recombination than expected under a model with random recombination across habitats (p ¡ 1e-5; (Materials and Methods)). The excess within-habitat recombination was detectable in both S (p = 0.03) and L (p < 1e−5) populations considered separately, and is robust to variation in our assumptions about the relative S:L population sizes (see Materials and Methods). In contrast, the pairing of more anciently diverged S strains, FF160-FF274, is more often broken up by recombination with L (222 blocks) than S strains (8 blocks)(p < 1e−5), perhaps owing to the higher abundance of L strains in the past (e.g., if the ancestral, undifferentiated population was L-associated). This suggests that the trend toward 37 A m p ep e o e e o e L ECO p B h omo ome h omo ome m G m Xm n ECO p n ECO p # NP 5S p X m m # NP upp # SNPs m po po ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ●●● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ●●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ●●● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ●● ●●● ●● ●● ●●● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ●● ●●● ●● ● ●● ● ● ● ● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ●●●● ● ● ● ● ● ●●●● ● ● ● ●●● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● m C ● ● M genome position (Mbp) % m % % % % % % Figure 2.1: Phylogeny follows ecology at just a few habitat-specific loci. (A) Maximum-likelihood (ML) V. cyc itrophicus phylogenies rooted by V. sp endidus 12B01, based on core genome nucleotide sequence for chromosome I (left) and II (right). Scale is substitutions per site; all nodes have 100% bootstrap support unless indicated. (B) Genome regions with uninterrupted support for (black bars) or against (gray bars; note different scale) the ecological split of strains into distinct habitats (S/L). Bar height indicates the number of in- formative SNPs in each region. ECO-sup regions 1 to 11 are described in table S2; ML trees for four major regions are shown, rooted with 12B01; polyL/polyS indicates regions with significantly higher (up arrows) or lower (down arrows) nucleotide diversity and density of segregating polymorphic sites within the L (red) or S (green) habitat, relative to the chromosome-wide average. Tracks below x axis are as follows. ECO: locations of ECOsupporting (black points) and -rejecting (gray) SNPs. 5-S: SNPs supporting (blue points) or rejecting (gray) the 5-S branch. Breaks: number of inferred recombination breakpoints per kb. (C) Tree topologies accounting for most genome length. Top four ranked unrooted topologies are shown for chromosome I, top 2 for chromosome II, and the percentage of the core genome accounted for (see Materials & Methods). the habitat-specific gene flow we identified has emerged relatively recently. The preference for within-habitat recombination is also apparent in the flexible genome. This component of the genome changes so rapidly that even the two most closely-related genomes in our study (1F175-1F53), differing by only 66 substitutions in 3.54 Mb of core genome, each contain about 4,500 bp of unique DNA (fig. S6). 38 A CORE GENOME B recombination of one sister strain with S strains FLEXIBLE GENOME S strains 0.1 with L strains FF75 ZF99 ZF99 1F289 ZF28 ZF28 ZF30 ZF65 ZF65 within L 8 ZF207ZF207 4 2 ZF30 ZF30 6 5 9 11 992 8 6 2 sisters 6 ZF207 8 similarity in flexible genome content between ZF264ZF264 ZF99 3 ZF270 FF75 FF75 ZF65 ZF255ZF255 ZF205 ZF14 ZF14 ZF28 613 ZF170ZF170 within S ZF270ZF270 1 1F53 1F53 ZF170 between ZF264 1F1751F175 FF274 552 580 sisters FF160 FF160FF160 998 FF274FF274 within S 1F97 1F97 1F2731F273 1F1111F111 1.00 ZF255 1F2891F289 0.001 L strains ZF14 ZF205ZF205 4 sisters 3 1F175 1000 1F53 between 1 2 1F97 1 1 1 1F111 991 1000 1 1F273 0.1 ZF14 ZF255 Figure 2.2: Recent recombination is more common within than between habitats. (A) Genomewide ML phylogeny based on 3.54 Mb of aligned core genome, with sister strains highlighted in red or green. All nodes have 79 to 100 bootstrap support. Bar graphs show events (number of core genome blocks) that split up sisters by recombination between (gray bars) or within habitats (S, green; L, red). (B) Relative amount of shared flexible genomic blocks between strains. The neighbor-joining (NJ) tree (left) is a consensus across 1000 bootstrap resamplings of the flexible blocks. Only nodes with support > 500 are shown. Scale bar: Bray-Curtis distance used to construct the NJ tree (see Materials & Methods). FF75 1F289 ZF30 ZF207 ZF99 ZF270 ZF65 ZF205 ZF28 ZF170 ZF264 FF274 FF160 1F175 1F53 1F97 1F111 1F273 The flexible genome tree also has a different topology to that of the core (Fig. 3.2), suggesting that the flexible genome is shaped largely by horizontal transfer (integrasemediated and illegitimate recombination), with limited clonal descent. The separate grouping of S and L strains (Fig. 3.2B; 99.8% bootstrap support) when clustered by the proportion of shared flexible DNA (Fig. 3.2B) indicates preferential recombination occurs within habitats. Compared with a model of random recombination among habitats, there is significantly more habitat-specific sharing of flexible blocks than expected by chance (p < 5.5x10 − 58; Materials and Methods; Table S1). Interestingly, all seven S strains not just the 5-S strains hypothesized to have undergone a selective sweep on chromosome II share significant amounts of flexible DNA on this chromosome (fig. S7). Therefore flexible genome turnover is sufficiently rapid that flexible DNA does not hitchhike with selective sweeps for very long. Rather, high turnover, with a clear bias toward within-habitat sharing of DNA, maintains distinct but dynamic and habitat-specific gene pools. 39 Functions of ecologically differentiated genes The revelation that there is a suite of habitat-specific genes and alleles has shed light on the selective pressures associated with specialization to different microhabitats in the ocean (Table S1, S2; Materials and Methods). The RTX locus and syp operon exhibit both allelic variation (core) and gene content variation (flexible). Several syp genes, present in all L but absent from S genomes, and their upstream regulator sypG, present in different allelic variants between habitats, are involved in biofilm formation and host colonization [57]. RTX proteins are important virulence factors in pathogens [58] and may play a role in interactions with different hosts. The stressresponse sigma factor RpoS, in the core genome nearby the RTX locus, has been shown to mediate a tradeoff between stress tolerance and nutritional specialization in environmental Escherichia coli isolates [59]. Finally, MSHA biosynthesis genes, many of which are unique to L flexible genomes, promote adherence to chitin [60] and zooplankton exoskeletons [61]. Together, these suggest that ecological specialization, possibly through differential host association, can be achieved by fine-tuning genes in a few key functional pathways. A model for ecological differentiation in bacteria Our observations can be generalized with a model predicting independent evolutionary trajectories for nascent populations triggered by gene-specific sweeps (Fig. 3.3). The mosaic genomes we observed, with different genome blocks supporting different phylogenies suggest a frequently recombining, ecologically uniform ancestral population (Fig. 3.3B, early time points). The recent acquisition of habitat-specific flexible genes and core alleles likely initiated specialization to different hosts or habitats leading to decreased gene flow between populations. The populations we studied are in a very early stage of ecological specialization, with little genetic divergence between them. However, if the trend towards greater within-population recombination can be extrapolated into the future (as might indeed be expected given that recombination drops loglinearly with sequence divergence [62–66]), they will eventually form 40 distinct genetic clusters, potentially indistinguishable from those predicted by (and often taken as evidence for) the ecotype model (Fig. 3.3A). Genetic isolation by preferential recombination has been suggested previously [67], and this trend might be enhanced if homologous recombination between populations is reduced in the vicinity of acquired habitat-specific genes [68]. Thus, a mechanism of gene-centered sweeps may eventually lead to a pattern characteristic of genomewide sweeps. In this way, our study of the very early stages of ecological specialization has provided a simple resolution to seemingly conflicting empirical observations. Outlook Our findings of ecological differentiation driven by gene-specific rather than genomewide selective sweeps, followed by gradual emergence of barriers to gene flow, leave open three major questions for future investigation: what mechanisms (aside from unrealistically high recombination rates) are responsible for preventing genomewide selective sweeps (e.g., negative frequency dependent selection by viruses and protozoa), how often and by what mechanism are entire chromosomes mobilized, and what are the barriers to gene flow between sympatric ecological populations (e.g., reduced encounter rates or some form of assortative mating)? No matter how marked the decline in gene flow between ecological populations, they will always remain open to uptake of DNA from other populations, thus remaining fundamentally different from biological species of sexual eukaryotes [50]. Yet strikingly, the process of ecological differentiation we have inferred for these ocean bacteria is similar to models of sympatric speciation by habitat-specific allelic sweeps in sexual eukaryotes [69, 70]. Despite differences in how adaptive alleles are acquired, our results suggest that how they spread within populations may follow a more uniform process in both prokaryotes and eukaryotes than previously imagined. 41 A neutral marker genes sampled 0.1 1 0.1 sample t2 (A) B sample t1 ancestral gene flow population dynamics (C) restricted gene flow between habitats? (B) time (t) Time habitat - specific alleles arrive Figure 2.3: Ecological differentiation in recombining microbial populations. (A) Example genealogy of neutral marker genes sampled from the population(s) at different times. (B) Underlying model of ecological differentiation. Thin gray or black arrows represent recombination within or between ecologically associated populations. Thick colored arrows represent acquisition of adaptive alleles for red or green habitats. 42 Chapter 3 Sympatric speciation: when is it possible in bacteria? Jonathan Friedman, Eric J. Alm, B. Jesse Shapiro This chapter is presented as it originally appeared in PLoS ONE 336, 48 (2012). Chapter 3 Sympatric speciation: when is it possible in bacteria? According to theory, sympatric speciation in sexual eukaryotes is favored when relatively few loci in the genome are sufficient for reproductive isolation and adaptation to different niches. Here we show a similar result for clonally reproducing bacteria, but which comes about for different reasons. In simulated microbial populations, there is an evolutionary tradeoff between early and late stages of niche adaptation, which is resolved when relatively few loci are required for adaptation. At early stages, recombination accelerates adaptation to new niches (ecological speciation) by combining multiple adaptive alleles into a single genome. Later on, without assortative mating or other barriers to gene flow, recombination generates unfit intermediate genotypes and homogenizes incipient species. The solution to this tradeoff may be simply to reduce the number of loci required for speciation, or to reduce recombination between species over time. Both solutions appear to be relevant in natural microbial populations, allowing them to diverge into ecological species under similar constraints as sexual eukaryotes, despite differences in their life histories. 44 Author Summary It has been theorized for some time that sexual animal populations diverge more readily into new species when a small number of genes are sufficient to initiate and complete the speciation process. Some of these genes may promote sexual isolation between species (important in ’sympatric’ speciation, where there are no physical barriers to mating) and others may confer adaptation to different ecological niches. Bacteria, like sexual organisms, adapt to diverse ecological niches. Yet they reproduce clonally, exchanging genes occasionally by recombination, potentially with distant relatives, such that sexual isolation between species is never complete. Despite these differences from sexual organisms, we show that bacteria are also more likely to form new species when just a few genes are involved in niche adaptation. The reason for this is an evolutionary tradeoff: frequent genetic exchange (recombination) is required to generate optimal combinations of multiple adaptive alleles, but without mechanisms for sexual isolation these optimally adapted species tend to merge back together rather than diverging into separate species. This tradeoff imposes a simple constraint on the speciation process in bacteria, and may explain certain aspects of bacterial biology such as the clustering of genes into operons (effectively reducing the number of genes involved in speciation) and variable recombination rates over time. 45 Introduction Microbes have adapted to nearly every ecological niche imaginable on earth, yet the evolutionary mechanisms of the specialization process and their constraints remain poorly understood. In our recent study of ecological differentiation between two marine Vibrio populations [71], we were surprised to observe relatively few regions of the genome underlying differentiation: up to 11 (but as few as 3) in the ’core’ genome, with different alleles fixed between large (L) and small (S) particle associated strains, and up to 19 in the ’flexible’ genome (horizontally transferred tracts of DNA exclusively present in genomes from one habitat but absent in the other). Habitat specific genes and alleles likely arrived in S and L populations by recombination with other more distantly related population, suggesting that recombination rather than mutation is the dominant source of genetic variation. Aside from these few habitat-specific regions, most of the genome showed a history of rampant recombination within and between populations, consistent with a relatively large influence of recombination on Vibrio genomes [72]. At face value, this observation seemed to be satisfactorily explained by modeling work predicting that speciation by habitat shift should not involve many loci [73–77]. However, these were models of sexual eukaryotes, which unlike bacteria necessarily undergo homologous recombination every generation and form species by assortative mating and sexual isolation. In their 1986 paper, ”Sympatric speciation: when is it possible?”, Kondrashov and Mina [78] conclude that for sexual populations ”sympatric speciation is possible when essential differences (including isolating mechanisms) between the formed species depend on up to 10 loci”. Here, we reformulate these classic models in order to ask the same question for bacteria: when, and with how many loci involved, is sympatric ecological differentiation likely to occur? While many different speciation scenarios are possible for bacteria, here we limit ourselves to one scenario that we hypothesize to be most relevant for Vibrio and other frequently recombining bacteria, where populations differentiate in sympatry, with relatively high rates of recombination among all individuals. For simplicity, we take 46 sympatry to mean that all individuals recombine with equal probability, regardless of their niche or species affiliation. This is, intentionally, a rather extreme version of sympatry, in which there is no opportunity for establishment of new species in physical isolation. Although extreme, such a model is consistent with the suspected transience of marine Vibrio microhabitats: vibrios disperse rapidly among invertebrate hosts [79], and particle habitats may have short half-lives. Rapid turnover of microhabitats would allow very little opportunity for recombination within habitats (environmental assortative mating, as termed by J. Mallet, personal communication). If turnover is less rapid, a model of mosaic sympatry [80] may be more appropriate for bacteria with preferences for different microhabitats (e.g. large zooplankton or copepods vs. small organic particles), but with frequent mixing among microhabitats. We do not aim to describe bacteria that speciate due to strong barriers to recombination (allopatry, e.g. [81]), or in which recombination is weak relative to mutation, selection and drift [82–84]; such ’effectively clonal’ populations are already well described by various versions of the stable ecotype model, involving periodic selection and genomewide selective sweeps [82, 85]. Sympatric speciation is is increasingly being observed in plants and animals [77, 86]. In all domains of life, the likelihood of sympatric speciation depends on the balance between the homogenizing force of recombination (inhibiting speciation) and disruptive selection, favoring speciation by adaptation to different niches [87]. This balance essentially a restatement of Haldane’s isolation theory [88] can be applied to individual genes, such that some parts of the genome (i.e. those under disruptive selection) can be strongly differentiated, while others (neutral loci) are not. Here we are interested in the early stages of speciation, in which populations become differentiated at selected loci, but not necessarily at neutral loci [89]. This amounts to an ecological species concept, in which species are defined as genotypes optimally adapted to different niches through acquisition of selectively favored alleles at one or more loci [90]. Speciation is therefore driven by selection on adaptive loci, and does not require differentiation in neutral loci, or any form of assortative mating. For the purposes of this study, we define speciation in this ecological sense: directional 47 selection favors different alleles in different niches. As a result, incipient species are defined as having the optimal combination of alleles in their respective niches. Following Kondrashov [73, 74, 78], sympatric speciation can be conceptually divided into early stages, where recombination helps generate optimally adapted genotypes, and late stages, where recombination homogenizes incipient species unless barriers to gene flow emerge. At early stages, the rate of adaptation to new niches may be increased dramatically in recombining (as opposed to strictly clonal) populations by bringing multiple adaptive alleles into a common genomic background [91], and unlinking them from deleterious mutations [92]. When fitness in a new niche is controlled by adaptive mutations that can fix in any order, recombining populations are favored by selection [93], but when fitness depends on the order of fixation, recombining populations adapt more slowly than clonal populations [94]. Sex may also come with a cost (e.g. reduced rate of cell division), in which case switching between sexual and clonal phenotypes provides an optimal strategy [95]. Although recombination speeds up adaptation within a single population (at least when fitness is additive), it can be a powerful hindrance to the later stages of sympatric speciation. In the absence of barriers to gene flow between emerging species, recombination may homogenize their gene pools, effectively preventing speciation. Under the biological species concept [96], assortative mating is generally required for sympatric speciation, and this is borne out in modeling work [75, 97]. In these and earlier models [74, 78], sympatric speciation was delayed by increasing the number of loci contributing to assortative mating, such that speciation is very unlikely when more than ∼ 10 loci are involved. In support of this theoretical prediction, recent surveys of genomic variation in sympatric pairs of sexual eukaryotic species have identified just a few (< 10) genomic islands of high divergence between mosquito species [69], although divergence was subsequently shown to encompass larger stretches of the genome than initially thought [98–100]. These islands could contain genes responsible for ecological specialization, assortative mating, or both. Sequence divergence in islands may locally inhibit recombination between incipient species, extending the islands into ’conti48 nents’ of divergence in a process termed ’divergence hitchhiking’ [101]. Similarly, in stickleback fish, a small number of alleles (∼ 1 − 5) involved in pelvic morphology [102] and armor plate patterning [103] are sufficient for significant ecological differentiation. Although these alleles may have contributed to early divergence of marine and freshwater stickleback species 2 million year ago, many more adaptive changes (in ∼ 90 − 215 loci) have occurred since that time [104]. With or without assortative mating, it seems plausible that speciation will proceed more readily when fewer loci are required to confer ecologically different phenotypes to each nascent species. This concept is explained eloquently by Kondrashov [74]: “In a panmictic equilibrium population the phenotype variance [and the variance in relative fitness, if this phenotype is related to fitness] decreases with the growth of the number of loci controlling any character. It leads to the weakening of the effects of selection [...]”. Under an additive fitness model where each locus contributes equally to niche adaptation, as more loci are involved in adaptation, the fitness landscape becomes smoother and the competitive advantage of the optimal genotypes becomes relatively weaker, such that intermediate genotypes are maintained and speciation is impeded. Regardless of what type of fitness model is used, as more loci are necessary to achieve an optimal genotype (ecological species), more recombination will be needed to collect all the optimal alleles into a single genome. However, high recombination rates come at a cost: once an optimal genotype is generated for a new niche, there is nothing to prevent it from recombining with individuals from the ancestral niche, thus delaying or preventing speciation. We investigate the importance of this apparent tradeoff in a microbial context, where recombination can occur, but not necessarily every generation. We strove for the simplest possible model that could account for our empirical findings that few adaptive loci could drive ecological differentiation in highly recombining bacterial populations [71]. Like Kondrashov’s models, ours is a deliberate abstraction aimed at capturing major qualitative features of speciation. We follow the conceptual framework of Kondrashov [78], changing details as necessary to suit bacterial populations. Unlike models of sexual eukaryotes, we use a model with49 out assortative mating, which we assume to be negligible among very closely-related bacteria in the very early stages of speciation. Although forms of assortative mating cannot be ruled out, studies of highly recombining bacteria such as Streptococcus pneumoniae have failed to find evidence for assortativeness [105]. However, because recombination drops loglinearly with sequence divergence [52, 62, 63, 65], barriers to gene flow should eventually emerge between more distantly-related bacteria, and even very closely related sympatric bacterial populations show signs of emerging barriers to gene flow [71, 106]. We chose not to model assortativeness or any barriers to gene flow for two reasons. First, such models have already been explored [75, 78, 97]. Second, we are interested in the early stages of speciation, which are necessary for, yet do not guarantee the establishment of long-lived species. These early stages are driven by directional selection, and may or may not be followed by a reduction of gene flow between incipient species. Model Our sympatric speciation simulation (symsim, implemented in MatLab and available at http://almlab.mit.edu/star/symsim/) models a microbial population growing in an environment composed of two distinct niches, one ancestral (niche 0) and one derived (niche 1). Niches are deliberately abstract, but could consist of different carbon sources, optimal growth temperature, host or particle association, etc. Genotypes consist of L unlinked adaptive loci, each with two allelic states, 0 or 1, conferring adaptation to niche 0 or 1, respectively, and one representative background locus with two allelic states both neutral to fitness. We considered two models of fitness: one additive and the second a ’step’ function [78]. In both models, fitness is a function of fi,j , the fraction of adaptive loci in genotype i that contain alleles adapted to niche j. Optimally adapted genotypes (defined here as ecological species) have fi,j = 1. In the additive model, microbes compete and reproduce exclusively in the niche to which their genotype is best adapted, (Figure 3.1A). Fitness is an additive function of fi,j . For example, with L = 5, the genotype 50 11100 would compete in niche 1 with f11100,1 = 3/5. Genotypes always reproduce in the niche for which fi,j > 0.5; if fi,j = 0.5, a niche is chosen at random. This constitutes a strong tradeoff in niche adaptation: each strain can only have one niche. As a result, until a strain acquires at least L/2 niche-1 adaptive alleles, it will compete in niche 0, with any niche-1 alleles incurring a fitness disadvantage. The step fitness model relaxes this strong tradeoff: strains with fi,j < 1 may shuttle between niches, acquiring their full complement of resources from a combination of both niches (Figure 3.1B). However, the disruptive selection necessary for speciation is maintained because all intermediate genotypes must pay a fitness cost for the time and energy spent switching between niches. This cost (in the form of a selective coefficient, s) is uniform for all fi,j < 1. Thus, all intermediates pay an identical fitness cost relative to the specialized, optimally niche-adapted genotypes. Both fitness models, while simple, are reasonable starting points for describing natural bacterial populations with different metabolic or microhabitat specializations. In natural marine Vibrio populations, for example, ecological differentiation is likely driven by genes involved in host or particle attachment (mshA and syp genes) and transcriptional regulators (sypG, rpoS) [71]. Additive fitness effects would result if there were an incremental advantage to encoding and expressing each additional adaptive gene or allele. Alternatively, intermediate genotypes could gather resources from two different hosts or particle types, but with a penalty for time spent switching between the two, as in the step model. Similar fitness landscapes can be imagined for populations that rely on different metabolic or successional strategies to achieve ecological differentiation [52]. Symsim is a discrete generations model with varying population size, required for investigating the colonization of a new, initially empty niche. Each generation consists of resource competition, population growth and recombination, as follows: I. Competition At each generation, each niche has an amount of resource R for which individuals compete. In each niche, the resource is partitioned among individuals according 51 Figure 3.1: Schematic of the symsimmodel. (A) Additive fitness. The steps of the simulation are (i) growth/selection according to relative additive fitness within each niche, (ii) small probability of recombination (r) by gene conversion of homologous loci (diagonal lines) in a sympatric, mixed pool of genotypes from both niches, and (iii) individuals return to the niche to which their genotype is best adapted (e.g. in this 3-locus example, genotypes 000 and 010 go to niche 0, while 111 and 011 go to niche 1). Steps (i), (ii) and (iii) are iterated for a set number of generations or until any of the derived alleles go extinct. (B) Step fitness. The steps of the simulations are the same as for additive fitness, except that individuals can grow and be selected in both niches. Optimally adapted genotypes (111 and 000) compete in just one niche. Intermediates compete in both niches, but pay a fitness cost s for niche switching. In the example shown, an individual of genotype 010 obtains 2/3 of its resources in niche 0 (and adds a count of 2/3 of an individual to the population size of niche 0), and 1/3 from niche 1 (and adds a count of 1/3 of an individual to niche 1). to their competitive fitness. In the additive model: s wi,j = 1 + fi,j . 52 (3.1) And in the step model: wi,j = 1 fi,j = 1 1 − s otherwise, (3.2) where s is a tunable parameter controlling the strength of selection. We ran simulations using s = 0.1 and s = 0.01, representing the range of selective coefficients measured for single fixed beneficial mutations in experimentally evolved E. coli populations [107]. Instead of selection on a single locus, in our model s stands for disruptive selection (Equation 3.1) or negative selection (Equation 5.2) on multiple loci. The average amount of resource allocated to each individual of genotype i in niche j scales with their relative fitness: wi.j , i wi,j Ni,j,t Ri,j = R P (3.3) where Ni,j,t is the number of cells with genotype i in niche j at time t. Under additive fitness, genotypes only obtain resources in the niche to which they are best adapted (for which fi,j ≥ 0.5, as described above). Under step fitness, each individual can obtain resources from both niches, and also counts toward the number of cells in in both niches, proportionally to its genotype. For example, genotypes with fi,j = 0 obtains resources only from niche 0, fi,j = 1 only from niche 1, and fi,j = 0.5 obtains half its resources from each niche and contributes half a count to each population size. II. Clonal Reproduction At each generation, the average expected number of offspring of each individual of genotype i in niche j is: Ei = 1 + bi − a, (3.4) where d is the death rate and bi , the birth rate, is a saturating Michaelis-Menton 53 function of the resources allocated to it: bi = bmax Ri,j , R0 + Ri,j (3.5) where bmax is the maximal birth rate per cell and R0 is the half-saturation constant of the birth function. Therefore, the number of individuals (N ) of genotype i in the next generation is drawn from a Poisson distribution: Ni,j,t+1 = Poisson(Ei Ni,j,t ). (3.6) The carrying capacity (K) of each niche (the steady-state population size) is related to the amount of resource available and the birth and death parameters by: R K= R0 bmax −1 . d (3.7) In all simulations, we set R = 1, bmax = 10, d = 0.1, and K = 106 in each niche. We investigated two different selective coefficients: s = 0.1 and s = 0.01. III. Recombination After population growth and selection, all microbes enter a common sympatric pool and recombine with one another at random, independently of their genotype (Figure 3.1), with rate r per individual per locus. The total number of recombination events per generation is therefore rN (B + L), where N is the total number of individuals in both niches, L is the number of adaptive loci, and B is the number of neutral background loci (set at B = 1 in all simulations). In each event, a donor/acceptor pair of microbes is chosen at random, and one of the B + L genomic loci is chosen to be recombined, also at random. Recombination between homologous loci is non-reciprocal, and occurs by gene conversion resulting in allelic replacement of the acceptor by the donor allele. Cycles of growth and recombination continue for a set number of generations, or until any of the adaptive alleles go extinct (meaning that speciation fails to occur). Mutation is not included in our model because we are particularly 54 interested in the case where recombination rates are much higher than mutation rates, and where adaptive alleles are highly divergent, containing multiple nucleotide differences. We also point out that symsim is model as one of homologous recombination by gene conversion, although extending it to include non-homologous recombination (horizontal gene transfer) would be straightforward. Result We aimed to answer two questions: (1) How long does it take the derived (niche-1 optimized) species to appear, and what is the likelihood that it appears at all? (2) Given that the derived species does appear, what is its equilibrium frequency relative to intermediate (sub-optimal) genotypes? In other words, what is the completeness of the ecological speciation process? Recombination expedites the appearance of new ecological species To address the first question, we initiated the symsim model with niche-1 empty (the novel niche having just appeared) and niche-0 (the ancestral niche) occupied by 95% optimal genotypes. The remaining 5% of the population contained a niche-1 allele at a single locus (distributed uniformly across the L loci), with all other loci containing niche-0 alleles. We varied the model of selection (additive or step), the selection coefficient (s), the number of loci involved in niche adaptation (L) and the recombination rate (r), and allowed the simulation to proceed until the niche-1 optimal genotype (defined as the derived species) was generated by recombination, or until any of the niche-1-adapted alleles went extinct, rendering the niche-1 optimal genotype unattainable. We performed 100 replicate simulations for each parameter combination and obtained qualitatively similar results for both additive (Figure 3.2) and step (Figure 3.3) models of selection, with both strong and weak selection. We 55 will therefore focus first on the results of the additive model with weak selection (s = 0.01) because they are also representative of the qualitative features of the other models. How does the number of loci contributing to adaptation affect the likelihood that the derived species is generated by recombination? With only two adaptive loci, the derived optimal genotype is always generated by recombination (Figure 3.2A), even when the recombination rate is relatively low (r = 10−6 , resulting in only ∼ 2 recombination events per locus per generation in the pooled population), but not when it approaches zero expected recombination events per generation (r = 10−7 ). As the number of adaptive loci is increased to L = 3, the derived species still almost always appears, but takes longer to do so. With L = 5, a higher recombination rate (r = 10−4 ) is required for even a 48% chance of appearance before extinction, and with L = 7 appearance is only likely at very high recombination rates (r = 10−2 ). The exact values of L and r required for appearance of niche-1 optimal genotypes depend on the selection model, strength of selection and population size (which influences the likelihood of stochastic extinctions, not investigated here). With more than 2 adaptive loci, the appearance of derived optimal genotypes is less likely under the step fitness model (Figure 3.3), likely due to strong and uniform selection against intermediate genotypes. To a lesser extent, increasing the strength of selection also hinders slightly the appearance of derived optimal genotypes (compare panels A and D in Figures 2 and 3); the effect is only slight because fitness is relative to other competitors in the population (Equation 3.3). In general, the simulations all support a major qualitative conclusion: higher recombination rates are necessary for ecological speciation when more adaptive loci are involved. For many adaptive loci and low recombination rates (area below the red line in Figures 2 and 3), ecological speciation is very unlikely to be initiated Recombination hinders the later stages of speciation To investigate any potential tradeoffs between the early and late stages of speciation (appearance of the new species, and disappearance of sub-optimal intermediate geno56 types, respectively), we fast-forwarded the simulation to a point in time when the derived species (niche-1 optimal genotype) had reached 1% of the pooled population (having thus escaped extinction by drift), the ancestral (niche-0) species constituted 95%, and the intermediate genotypes were distributed uniformly to make up the remaining 4%. From these starting conditions, and we ran symsim for 100,000 generations for each combination of L and r. In each replicate simulation, we recorded the maximum mean frequency of optimal genotypes (averaged over both niches) observed any time after 25,000 generations. Based on visual inspection, genotype frequencies always reached an equilibrium by this point, so the maximum frequency of optimal genotypes proved to be a reasonable measure of the ’completeness’ of speciation. We found that the high recombination rates necessary to generate the derived species tended to hinder the completion of speciation later on. For example, under additive fitness and s = 0.01, with L = 5 adaptive loci, a minimum of r ≈ 10−4 is required for the derived species to appear and survive (Figure 3.2A,B), yet this amount of recombination also generates many intermediate genotypes, resulting in a low equilibrium frequency (0.21) of optimally adapted species (Figure 3.2C, 4B). The incompleteness of speciation at high recombination rates is less pronounced with fewer adaptive loci, for example when L = 2 (Figure 3.2C, 4A). When the recombination rate is kept low (r = 10−6 ), speciation proceeds essentially to completion, with nearly the entire populations of both niches occupied by optimally adapted genotypes (Figure 3.2C, 4C,D). Of course, such low recombination rates would be unlikely to produce optimally adapted genotypes in the first place (falling below the red line in Figures 2), resulting in a serious impediment to sympatric speciation as increasing numbers of adaptive loci are involved. The same tradeoff between early and late stages of speciation is also observed in the step fitness model. Although speciation is generally more complete under step than additive fitness (for the same values of r and L), the combinations of r and L most likely to maintain complete speciation are unlikely to have generated optimally adapted derived genotypes at earlier stages (Figure 3.3). Thus, even under a step fitness model, in which the variance in fitness does not decrease with L (and there is 57 no weakening of selection with L, as in the additive fitness model), we still observe a tradeoff between early and late stages of speciation. Figure 3.2: Results of symsimmodel under additive fitness. (A, B, C) Weak selection. (D, E, F) Strong selection. (A, D) Probability of appearance (p) of the derived (niche 1) optimal genotype in 100 replicate simulations for each combination of the number of loci involved in niche adaptation, L and the recombination rate, r. High probabilities (p = 1) are shown in white, low probabilities in black, and intermediate probabilities in grey scale. The space under the red line indicates extinction of the niche-1 optimal genotype in all 100 replicates (p,0.01). (B, E) Time to appearance of the niche 1 optimal genotype (mean over 100 replicate simulations). The red line is the same as in (A); n.d. refers to appearance time not determined, or effectively infinite, because extinction of niche-1 alleles occurred before the optimal genotype could appear. Shorter times are shown in white, effectively infinite times in black, and intermediate times in grey scale. (C, F) Completeness of speciation. The mean fraction of the pooled populations (niche 0 and 1) occupied by optimally-adapted genotypes is based on 10 replicate simulations for every combination of L and r. Complete speciation (optimal genotype fraction near 1) shown in white, incomplete in black, and intermediate completeness in grey scale. Magenta letters in C refer to the same simulations depicted in panels (A, B, C, D) of Figure 3.4. 58 A practical note on our ability to recognize adaptive loci Sympatric species of bacteria are difficult to recognize in the wild because the niches to which they adapt are often difficult to observe, and their adaptive genomic changes indistinguishable from neutral changes. In practice, sympatric ecological species can be identified when different alleles have been fixed between populations by directional selection [8, 108]. In a perfectly clonal scenario (r = 0), selectively neutral background alleles would also become differentially fixed between species, making them hard to distinguish from the adaptive alleles driving speciation. Our simulations included a representative neutral background locus, and we asked under what circumstances it could be mistaken for an adaptive locus. Selection acts to fix a different allele in each niche at adaptive loci, but not background loci. However, when recombination rates are low, background alleles may hitchhike with adaptive alleles, resulting in the fixation of background alleles between habitats. Given sufficient recombination, background alleles will eventually become randomly distributed across niches (’mixed’), making them unlikely to be confused for adaptive loci. We defined a population as mixed when background alleles were randomly distributed across niches (e.g. background allele 1 at frequency 0.5 in both niches). We also defined t(mix) as the time, in generations, when mixing is achieved in a simulation initiated as described in section (ii) above, with the derived species at an initial frequency of 1%, and background alleles in perfect association with niches (allele 0 in niche 0 only; allele 1 in niche 1 only). As expected, t(mix) was highly dependent on the recombination rate. Under additive fitness and s = 0.01, as the recombination rate increased, less time was required to achieve mixing: the mean t(mix) across replicate simulations was always over 100,000 generations with r = 10−5 , 7,999 generations (s.d. = 2420) with r = 10−4 , and only 62 generations (s.d. = 2.2) with r = 10−2 . This suggests that at high recombination rates, adaptive loci should be easily discernible from neutral loci very soon after the initiation of speciation, but at low to moderate recombination rates neutral loci may hitchhike for thousands of generations, making them easy to confuse with adaptive loci using many standard population genetic tests for positive 59 selection Shapiro:2009hp. Figure 3.3: Results of symsim model under step fitness. (A, B, C) Weak selection. (D, E, F) Strong selection. See Figure 3.2 legend. Discussion Our simulations identified a major tradeoff between early- and late-stage recombination, predicting that the initiation of sympatric speciation is much more likely when the number of loci required to adapt to a new niche is small. The same qualitative result is obtained using different selection models and selection coefficients. As the number of loci increases, the tradeoff becomes more restrictive: high recombination rates are required to generate multi-locus optimal genotypes at early stages, but such high recombination rates eventually homogenize gene pools and prevent the maintenance of ecologically adapted species at late stages. Have we described a realistic model of bacterial sympatric speciation? It is cer60 Figure 3.4: Higher recombination rates maintain intermediate genotypes and reduce the completeness of speciation. Panels A, B, C and D show dynamics of a single simulation under different combinations of L (number of adaptive loci) and r (recombination rate), corresponding to magenta letters in Figure 3.2C. The y-axis shows the frequency of optimal genotypes in a given niche (ancestral or derived). tainly not a model of biological species because biological speciation is impossible without barriers to recombination, which are not included in our model. It is more a model of early niche invasion and evolution of optimal genotypes. We argue that this is an important early step toward speciation, and one that is increasingly observed in diverging natural microbial sympatric populations, which mix by recombination at all but a few adaptive loci [52, 71]. Because the fitness landscapes of these diverging populations are unknown, we have used two very different fitness models, one additive and one ’step’, which can be considered an extreme version of epistasis. Other fitness models are also plausible, so long as they include some degree of disruptive selection 61 [77], but a thorough investigation of other models is beyond the scope of this study. Despite being a deliberate abstraction of real fitness landscapes, our model describes one simple (but not exclusive) mechanism that generates data consistent with the observation that relatively few adaptive loci tend to drive early stages of sympatric speciation, in microbes just as in many sexual eukaryotes [69, 103, 109, 110]. Another limitation of our model is that extinction occurs when any of the adaptive alleles disappears. In real bacterial populations, alleles may be replenished by recombination from other populations, giving speciation another chance to occur. Therefore, our model probably overestimates the likelihood of extinction preventing speciation. Due to this and other limitations and assumptions, we do not claim that there is a single ’magic number’ of adaptive loci that provides the easiest path to sympatric ecological speciation. The exact number will always depend on the recombination rate, selection strength, population size and niche complexity, but smaller numbers will always tend to facilitate speciation. Given the tradeoff predicted by our simple model, how do pairs of well-adapted sympatric bacterial species emerge? Perhaps they do not, or only rarely. Ecological species may be transient, and rapidly merge back in to a homogenous parent population. In cases where a new species does evolve, the effective number of adaptive loci could be kept low by combining suites of genes into operons, allowing complex phenotypes to be acquired in a single recombination event. Our model could thus provide an example of how linking coadapted genes together into operons, in addition to providing a selective advantage for the genes themselves as in the selfish operon hypothesis [111] could also provide an advantage in facilitating ecological speciation. Despite such ’strategies’ to keep L low, it is possible that many opportunities for speciation are missed because new niches are too complex to be exploited by genotypes with just a few adaptive alleles. However, there is mounting evidence from natural microbial communities that surprisingly few adaptive genes or alleles may be sufficient for adaptation to fairly complex niches, including different hosts or nutrient utilization strategies [49, 52, 71, 112]. In other cases, large continents of divergence have been observed, containing many genes [106]. Such continents may not fit the model 62 described here, or may come about at later time points in the speciation process, perhaps once gene flow has become more restricted. What if many ( 10) adaptive loci are required to exploit a new niche? Perhaps speciation could still be achieved if recombination rates varied over time, such that the stress experienced in a new habitat would trigger periods of hyper-recombination, promoting the rapid generation of genotypes with the optimal combination of adaptive alleles. In order for new species to be maintained on the long term (rather than being eroded by recombination with the ancestral species), recombination would have to revert to a low rate. This could occur if barriers to gene exchange between niches, or forms of assortative mating, emerged over time. It is known that sex (recombination) can slow the rate of adaptation within a population when the fitness landscape is such that alleles must be acquired in a particular order to be adaptive [94]. Here we have shown another intuitive, yet to our knowledge unrecognized, disadvantage of sex that arises even in simple, additive or step fitness landscapes without assortative mating. Recombination, although advantageous in the early stages of speciation, erodes ecological specialization later on, resulting in less fit populations. As a result, recombining microbial populations, like sexual eukaryotic populations, are predicted to form new species using only a few loci. Although this limitation is common to microbes and sexual eukaryotes, it comes about for different reasons. In simulated populations of sexual eukaryotes, recombination is uniformly high (occurring before every mating), and some form of assortative mating is usually required for speciation (e.g. [87]). In the earlier models of Kondrashov [73–75, 78] and Felsenstein [87], it was shown how the reduced efficacy of selection with increasing number of loci (involved in adaptation, assortative mating, or both) is important in preventing the initiation of sympatric speciation, and also how recombination later erodes ecological specialization. Here, we have presented an updated model, specifically for microbial populations with clonal reproduction, variable recombination rates, and without assortative mating. In microbes, although the efficacy of selection is still important, the recombination rate is the key factor: for ecological traits controlled by many adaptive loci, high recombination rates are necessary to generate a new species, 63 but this also produces many intermediate genotypes and reduces the completeness of speciation at later stages. A simple, although not exclusive, solution to this tradeoff is to require adaptation at just a few loci to initiate and maintain ecological speciation. The generality of this solution will be put to the test as more data become available from population genomic studies of ecologically diverse, recombining wild microbial populations. 64 Chapter 4 Inferring Correlation Networks from Genomic Survey Data Jonathan Friedman, Eric J. Alm This chapter is presented as it originally appeared in PLoS Computational Biology 320, 1081 (2008). Corresponding Supplementary Material is appended. Chapter 4 Inferring Correlation Networks from Genomic Survey Data High-throughput sequencing based techniques, such as 16S rRNA gene profiling, have the potential to elucidate the complex inner workings of natural microbial communities - be they from the world’s oceans or the human gut. A key step in exploring such data is the identification of dependencies between members of these communities, which is commonly achieved by correlation analysis. However, it has been known since the days of Karl Pearson that the analysis of the type of data generated by such techniques (referred to as compositional data) can produce unreliable results since the observed data take the form of relative fractions of genes or species, rather than their absolute abundances. Using simulated and real data from the Human Microbiome Project, we show that such compositional effects can be widespread and severe: in some real data sets many of the correlations among taxa can be artifactual, and true correlations may even appear with opposite sign. Additionally, we show that community diversity is the key factor that modulates the acuteness of such compositional effects, and develop a new approach, called SparCC (available at https://bitbucket.org/yonatanf/sparcc), which is capable of estimating correlation values from compositional data. To illustrate a potential 66 application of SparCC, we infer a rich ecological network connecting hundreds of interacting species across 18 sites on the human body. Using the SparCC network as a reference, we estimated that the standard approach yields 3 spurious species-species interactions for each true interaction and misses 60% of the true interactions in the human microbiome data, and, as predicted, most of the erroneous links are found in the samples with the lowest diversity. 67 Author Summary Genomic survey data, such as those obtained from 16S rRNA gene sequencing, are sub ject to underap- preciated mathematical di?culties that can undermine standard data analysis techniques. We show that these e?ects can lead to erroneous correlations among taxa within the human microbiome despite the statistical signicance of the associations. To overcome these di?culties, we developed SparCC; a novel procedure, tailored to the properties of genomic survey data, that allow inference of correlations between genes or species. We use SparCC to elucidate networks of interaction among microbial species living in or on the human body. 68 Introduction The study of natural communities using high throughput genomic surveys, such as 16S rRNA gene profiling, has become routine [2], yet the development of appropriate, well validated analysis methods is still ongoing. The first challenge is obtaining reliable and informative counts from 16S rRNA gene sequences by filtering spurious reads and grouping the remaining reads in a meaningful way [113], [114], [115]. Once such counts have been obtained, analysis techniques which are appropriate for discrete survey data need to be applied [116], [117] [118]. A common goal of genomic surveys is to identify correlations between taxa within ecological communities. Correlation analysis provides a well trodden path to achieving this goal, but we show that it is not valid when applied to genomic survey data (GSD), and may produce misleading results. The challanges associated with GSD stem from the fact that they are a relative, rather than absolute, measure of abundances of community components. The counts comprising these data (e.g., 16S rRNA gene reads) are set by the amount of genetic material extracted from the community or the sequencing depth, and analysis typically begins by normalizing the observed counts by the total number of counts. The resulting fractions fall into a class of data termed closed or compositional, and poses its particular geometrical and statistical properties [119], [118]. Specifically, standard methods for computing correlations from GSD are theoretically invalid. Correlation estimates are biased by the fact that, since they must sum to 1, fractions are not independent and tend to have a negative correlation regardless of the true correlation between the underlying absolute abundances (termed the basis abundances) [120]. Thus, correlations estimates often reflect the compositional nature of the data, and are not indicative of the underlying biological processes [121]. In fact, in 1897 Karl Pearson warned against “attempts to interpret correlations between ratios whose numerators and denominators contain common parts” [122], and since that time it has been shown that many other standard analysis techniques are invalid when applied to such compositional data, and that their interpretation is unreliable and often misleading [121], [123], [124]. Nonetheless, these 69 methods remain the primary tools used in studies of microbial ecology. Although approaches to compositional data analysis have been developed (e.g. [124], [125]), the basic task of inferring dependencies between components remains an outstanding challenge. A widely used method is Aitchison’s test for complete subcompositional independence [126], which tests whether any dependencies are present, but does not indicate which components are correlated, nor the magnitude of the correlation. Filzmoser and Hron [127] recently developed a method for inferring correlations in compositional data after an appropriate mathematical transformation, but their method does not provide a mapping relating the correlations of the transformed variables to those of the underlying genes or species. In this paper, we first use simulations and real-world data from the Human Microbiome Project (HMP) to demonstrate that GSD can be severely biased by “compositional” effects, and then identify the factors the modulate their severity. Finally, we present a novel method, called SparCC, and show that it can infer correlations with high accuracy even in the most challenging data sets. Results Standard correlation inference techniques perform poorly on GSD To what extent do compositional artifacts affect real-world GSD? We applied standard statistical methods to 16S rRNA gene survey data from the Human Microbiome Project (HMP) [128], which measure the compositions of microbial communities found in different body sites of ∼ 200 individuals. The composition of each community is described in terms of operational taxonomic units (OTUs). Because only relative abundances for each OTU are available, these data qualify as compositional and are thus subject to potential biases as described above. Networks inferred from Standard Pearson correlation display distinct patterns within different body sites, suggestive of biological structure (Fig. 4.1, left column. 70 See Fig. S1 for all 18 HMP body sites). Specifically, a prominent feature of the midvagina, retroauricular crease, and buccal mucosa networks is the presence of an OTU that is negatively correlated with multiple other OTUs. Despite the temptation to attribute biological significance to these observations, correlation networks inferred from randomly shuffled data with similar taxon abundances, but lacking any correlations between OTUs (see Materials and Methods), reproduce this feature (Fig. 4.1, middle column) indicating that it may arise from the closure (normalization) process. The mechanism behind these spurious correlations is straightforward. The pattern observed in the mid-vagina network results from the dominance of OTU 3, a Lactobacillus. This OTU has a median abundance of 97%, so fluctuations in its relative abundance have a strong effect on the abundance of the rest of the community simply due to the requirement that the relative abundances of all OTUs sum to 100%: when the abundance of Lactobacillus varies, all other OTUs’ relative abundances vary in unison in the opposite direction creating artificial negative correlations with Lactobacillus, and artificial positive correlations with each other. Diversity and correlation density control the severity of compositional effects Compositional effects are severe in some datasets, but mild in others. We found that diversity of the samples in the dataset (often referred to as alpha diversity), is a good predictor of the strength of compositional effects, which diminish with increased diversity. Intuitively, the fewer OTUs comprise the community, the worse the compositional effects are, with the extreme case of a community composed of only two OTUs, which will always appear to be perfectly negatively correlated. Moreover, compositional effects can be significant even in communities comprised of multiple OTUs, if only a few OTUs dominate the community. This notion of diversity can be quantified using the Shannon effective number of OTUs,(nef f ) [129], which quantifies both the number of OTUs and the dominance in a community. nef f ranges from 1, when the community is completely dominated by a single OTU, to the number of 71 OTUs in the community (richness), when all OTUs are equally abundant. Simulated networks of varying nef f (see Material and Methods) with known correlations illustrate the effect of diversity on compositional artifacts. True correlations (Fig. 4.2A-C) are only recovered when the community is diverse (Fig. 4.2F). In networks of similar diversity to the HMP samples, inferred connections are often dominated by negative correlations to the dominant OTU, which leads to positive correlations among the remaining OTUs (Fig. 4.2D,E). This effect is so strong that it eliminates the negative correlation between OTU 4 and OTUs 3 and 5, and positive correlation between OTUs 1 and 2 (Fig. 4.2E). Worse yet, as diversity decreases further, the negative correlation between OTU 4 and OTUs 3 and 5 is turned into an apparently positive one (Fig. 4.2D). It is important to note that these compositional effects are not limited to Pearson correlation, and are also present in non-parametric correlations, such as Spearman correlations (Fig. S2). If the underlying network has true positive correlations, then compositional effects can be even more pronounced than expected based on the community diversity. This happens because strong correlations between components lowers the effective diversity of the sample (i.e., two OTUs that are perfectly correlated behave as a single OTU). This effect can confound naive efforts to correct for compositional effects by comparing observed correlations against shuffled networks. When the data are shuffled, as in the middle column of Fig. 4.1, few spurious connections may arise relative to the structure observed for the unshuffled data (as observed for the buccal mucosa samples), creating false confidence in the observed network. Thus, randomization is not sufficent to establish significance of observed correlations, nor is it possible to identify correlations by comparing against (or “subtracting out”) a randomized network. SparCC: a novel procedure for inferring correlations from GSD Here, we describe a new technique for inferring correlations from compositional data called SparCC (Sparse Correlations for Compositional data). SparCC estimates the linear Pearson correlations between the log-transformed components. Since these correlations cannot be computed exactly (as described below), SparCC utilizes an 72 approximation which is based on the assumptions that: (i) the number of different components (e.g., OTUs or genes) is large, and (ii) the true correlation network is ’sparse’ (i.e., most components are not strongly correlated with each other). Later, we show that SparCC is surprisingly robust to violations of the sparsity assumption. SparCC does not rely on any particular distribution of the basis variables, i.e. the true abundances in the community can follow any distribution, and the choice of the log-normal distribution in subsequent examples is motivated solely by ease of implementation and empirical fit. For clarity, we present the method in the context of 16S rRNA gene data, where the components are OTUs and the basis variables are their true abundances in a community, but SparCC can be applied to any compositional data for which its approximation is valid. Like most compositional data analysis techniques, SparCC is based on the logratio transformation: yij = log xi = log xi − log xj , xj (4.1) where xi is the fraction of OTU i. This transformation carries several advantages: First, the new variables yij contain information regarding the true abundances of OTUs, as the ratio of fractions is equal to the ratio of the true abundances. Second, unlike the fractions themselves, the ratio of the fractions of two OTUs is independent of which other OTUs are included in the analysis, a property termed subcompositional coherence. Third, this transformation is mathematically convenient, as the new variables yij are no longer limited to the simplex, but are free to assume any real value. Taking the logarithm removed the positivity constraint, and induces (anti) symmetry in the treatment of the variables. To describe the dependencies in a compositional dataset, Aitchison suggested using the quantity xi tij ≡ Var log = Var [yij ] , xj (4.2) where the variance is taken across all samples [123]. When OTUs are perfectly cor73 related, their ratio is constant, therefore tij = 0, whereas the ratio of uncorrelated OTUs varies and the corresponding tij is large. Though tij contains information regarding the dependence between the OTUs, it is hard to interpret as it lacks a scale. That is, it is unclear what constitutes a large or small value of tij (does a value of 0.1 indicate strong dependence, weak dependence, or no dependence?). This can be further appreciated by relating tij to our quantity of interest, the correlation between the true abundances of the OTUs. The relation is given by tij = ωi2 + ωj2 − 2ρij ωi ωj , (4.3) where ωi2 and ωj2 are the variances of the log-transformed basis abundances of OTUs i and j, and ρij is the correlation between them. It is now evident that tij can only be interpreted in relation to the basis abundance’s variances: tij < ωi2 + ωj2 indicates a positive correlation, and tij > ωi2 + ωj2 indicates a negative correlation. Ideally, we would like to solve the set of eqs. 4.3 for all OTU pairs and simultaneously infer both the basis variances and correlations. However, because there are more unknown variables than equations, this is not generally possible. Nonetheless, it is possible to obtain a good approximation of the variances if, on average, OTUs are uncorrelated. Once we obtain estimates of the basis variances, these can be plugged into eqs. 4.3 to infer the correlations between each OTU pair, which, unlike the average correlations, needn’t be small. More accurate estimation can be achieved by iterating the above procedure. At each iteration the strongest correlated OTU pair identified in the previous iteration is excluded from the basis variance estimation. This reinforces sparsity among the remaining pairs and yields better variance and correlation estimates. OTU fractions need to be estimated from the observed counts to apply SparCC. Normalizing each OTU by the total counts in the sample (the maximum-likelihood estimate) is unreliable for rare OTU because it overestimates the number of zero fractions [130]. This can give rise to artifacts that are driven by variations in the sequencing depth. These artifacts have motivated some authors to downsample their 74 data such that all samples have the same total counts, however downsampling does nothing to alleviate compositional effects, and requires discarding a substantial portion of the available data. Therefore, we employed a Bayesian approach to estimate component fractions (see Materials and Methods), which allows the assessment of the robustness of downstream analysis and the assignment of confidence values. SparCC is highly accurate on simulated data We used the previously described simulated datasets to demonstrate the accuracy of SparCC at inferring correlations, even in highly problematic compositional data dominated by a single OTU (Fig. 4.2G-I). A more systematic evaluation of SparCC was performed by creating multiple simulated datasets of varying diversity and density. We measure density as the average Pearson correlation between OTUs, such that denser datasets have more strongly correlated OTUs, challenging the sparsity assumption used by SparCC. For each combination of density and diversity, multiple true correlation networks were assigned, and corresponding data was sampled. Networks inferred by SparCC or standard correlations were evaluated using the rootmean-square error (RMSE) (Fig. 4.3). Standard techniques only gave reasonable estimates for very diverse, sparse networks (Pearson RMSE ∼ 0.02), whereas for networks with diversity comparable to those observed in the HMP set, the Pearson RMSE was unacceptably high, reaching ∼ 0.5 for communities with diversity similar to the mid-vagina. Spearman correlations performed only marginally better (Fig. S3A). By contrast, the performance of SparCC was independent of diversity, and gave improved results for all parameter values, even for dense networks in which the sparsity assumption is violated. In fact, the worst accuracy achieved by SparCC (∼ 0.02, for unrealistically dense networks), was comparable to the best accuracy achieved using standard correlations on highly diverse samples. Moreover, though stronger correlation can be estimated more reliably, using standard methods, attention needs to be restricted to exceptionally strong correlations before the accuracy improves significantly, and the resulting accuracy is at best comparable to SparCC’s accuracy (Fig. S5). 75 SparCC identifies phylogenetically structured correlations in HMP data We used SparCC to infer the taxon-taxon interaction networks from the HMP data sets (Fig. 4.1, right column, Fig. 4.4), and from their corresponding shuffled datasets (in which all OTUs are uncorrelated). In contrast to the naive approach shown in Fig. 4.1, SparCC found no significant correlations in the shuffled dataset (Dataset S1). For the real data, however, numerous correlations are found, which differed significantly from the standard Pearson correlations. SparCC inference indicated that on average ∼ 3/4 of the correlated OTU pairs identified using Pearson were false, and that ∼ 2/3 of the correlated OTU pairs were missed using Pearson (see Table S1 for breakdown by body site.). Of particular note, we observe a positive correlation between OTU 3 and OTU 148, both belonging to the Lactobacillus genus, which was absent from the Pearson network, likely because of the bias of the highly abundant OTU 3 toward making negative correlations. Intriguingly, using SparCC we observe a higher likelihood of positive correlations between phylogenetically related taxa (Table S2), a finding that on its surface seems to support a role for neutral community dynamics as related organisms are likely to inhabit similar niches, but do not seem to dominate by competitive exclusion (although more complicated scenarios are certainly possible). We anticipate that techniques such as SparCC will play a major role in analyzing these data to address this and other basic ecological questions. Discussion In this study we have focused on an outstanding challenge of compositional data analysis – inference of correlations. We have demonstrated that compositional effects are pronounced in 16S rRNA gene surveys of the human microbiome, and, motivated by the properties of this data, have developed a novel procedure for estimating correlations. We found that diversity of species and density of interactions are the two key 76 factors that influence the severity of compositional effects on correlation estimates, with low diversity, high density data being the most challenging to infer correlation from using standard methods. SparCC does not rely on high diversity, rather it only requires sparsity of correlations, but in practice is robust even when the sparsity assumption is strongly violated (%30 of all component pairs are strongly correlated). Therefore, we recommend that SparCC be used on any GSD that has low diversity: as a rule of thumb we recommend an effective number of components of at least 50 for standard techniques (with the potential caveat that if strong positive correlations are present among many OTUs, the effective diversity may be much lower than estimated). We emphasize that simply having many components is not sufficient to avoid compositional effects. For example, 16S rRNA gene surveys from the HMP include hundreds to thousands of distinct OTUs, yet have have a relatively low effective number of species, with a small number of species dominating most samples. An important subclass of GSD are genome-wide surveys conducted using techniques such as DNA microarrays, RNA-seq and ChIP-seq. These genome-wide data are also subject to compositional effects, however, as these data tend to have high diversity, they are likely to be much less severe or negligible. For example, the average effective number of genes in microarray experiments available through the M 3D database [131] was 2200 for S. cerevisiae and 1800 for E. coli. This may explain why to date comparatively less attention has been paid to compositional effects in the biological sciences than in other disciplines. The preponderance of zero values are another area of concern with GSD. These zeros can represent either components that are truly absent from the community, or rare components that, by chance, were not present in the sample drawn from the community. Without additional knowledge, these options are indistinguishable, and, depending on goal of the analysis, the researcher must decide how to interpret them, and choose analysis methods accordingly. We emphasize that the treatment of zero values is a challenge that is in no way unique to compositional data, but is merely highlighted by the log-ratio transformations employed to analyze these data [132]. In this study, we eliminate zero fractions by adding small pseudocounts, as 77 detailed in the Materials and Methods. Complementary approaches, where zeros are treated differently than non-zero values, are substantially more challenging, and are the subject of ongoing research [133]. Though the method presented in this paper allows detection of correlation within communities, many challenges still remain. First, SparCC relies on having reliable component counts, which as noted in the introduction, is not trivial. Second, the correlations estimated by SparCC measure the linear relationship between log transformed abundances. Compositional methods for inferring more general dependencies between components, equivalent to rank correlations and mutual information for non-compositional data, have not yet been developed. Third, relating the patterns detected within a community to external factors (e.g. relating the composition of a human gut microbial community to human health status), and detecting temporal patterns within and between communities requires non-standard, compositional approaches. While some such methods exist [124], [123], [134] they are rarely employed in the context of GSD, and are not tailored for its particular properties. Finally, GSD is often associated with phylogenetic information (relatedness of species or genes), which ideally would be included in the analysis (e.g. the weighted UniFrac distance, which attempts to capture differences in both abundance and phylogenetic composition of communities.). We believe that developing systematic, statistically-sound methods for such analyses of compositional GSD is a necessary step on the road to understanding the structure of biological communities, the processes by which they evolve, and the forces that shape them, and thus represents an important direction for future research. Materials and Methods HMP 16S rRNA gene data HMP OTU counts and their taxonomic classification were obtained from the HMPOC dataset, build 1.0, available at http://hmpdacc.org/ [135]. The dataset corresponding 78 to high-quality reads from the v3-5 region was used. Only samples from the May 1st production study were included in the analysis. Additionally, if multiple samples were obtained from the same body site of an individual, only the first sample collected was included in the analysis. For each body site, the data was further filtered by removing samples for which less than 500 reads were collected and OTUs that were, on average, represented by less than 2 reads per sample . Shuffled HMP datasets Shuffled datasets are created by assigning each OTU in each sample a number of counts that is randomly sampled from the OTU’s observed counts across all samples, with replacement. This procedure ensures that the resulting marginal distributions of counts of each OTU alone are the same as in the real data, and that there are no correlations between the OTUs in the simulated data. Simulated basis datasets for basis correlations estimation Simulated communities were generated by sampling the joint abundances of 50 OTUs from a log-normal distribution with a given mean and covariance matrix. The mean abundances were equal for all OTUs except OTU 1, whose abundance was set such that the community will have a given effective number of OTUs (nef f ), on average. The variance was set to .01 for all OTUs, and random covariance matrices were generated by assigning each OTU pair a probability p of being perfectly correlated, with positive or negative correlations being equally probable. The resulting random symmetric matrix was then converted to the nearest positive-definite matrix to ensure it is a valid covariance matrix. 500 individuals were randomly sampled from each of these communities to give counts data similar to the one contained in GSD. For each combination of the parameters nef f and p, 50 such random communities were simulated, and the correlation inference accuracy was quantified using the root- 79 mean-squared error averaged over all OTU pairs, given by: RM SE = X 1 |ρ̂ij − ρij | D(D − 1) i>j (4.4) The final inference error is given by averaging the inference error of all 50 runs. Effective number of species The entropy effective number of species of a community, is defined as nef f = eH , where H = − P i (4.5) xi logxi is the entropy of the community [129]. Sample entropies were computed according to the method descried by Chao and Shen [136], as implemented in the R ’entropy’ package [137]. For each body-site, the effective number of species reported in the main text is the average of the effective number of species of all samples corresponding to that body-site. Estimation of component fractions We adopt a bayesian framework for estimating the true fractions from the observed counts. Assuming unbiased sampling in the sequencing procedure, and a uniform prior, the posterior joint fractions distribution is the Dirichlet distribution [138]: p(x | N ) = Dir(N + 1), (4.6) where x and N are vectors of the components’ true fractions and observed counts, respectively. Unlike Maximum-Likelihood estimation, the bayesian approach results in the full joint distribution of fractions, rather than their point estimates. Point estimates of fraction values, if desired, can be given by the the mean of the posterior distribution: N +1 x̂M AP = PD . i=1 (Ni + 1) 80 (4.7) , which is equivalent to adding a pseudocount of 1 to all count values, and normalize by the total number of counts in each sample. However, we prefer setting the estimator of true fractions to be a random sample from this posterior distribution. This randomness avoids the detection of spurious correlations between rare components, which arrises since the fractions resulting from adding a fixed value pseudocount mirror the sampling depth. Additionally, repeating downstream analysis using many such randomly drawn estimators allows the quantification of the effects of sampling noise on the analysis (one can attempt to model the noise analytically, but this often challenging in practice). It is important to note that in SparCC, like in any method employing log transformations, some pre-processing is required to eliminate zero values. As described above, SparCC employes a variation of the well-known pseudocounts method which assigns a small fraction to OTUs that were not detected in a sample. This approach implicitly assumes that all components are in fact present in the sample, and that all zero value result from finite detection resolution [130]. For very rare OTUs who are only present at a few samples, this may not be a reasonable assumption. Even if this assumption holds, typically there is not enough information to reliably estimate correlations involving such components, and such components should not be included in the correlation analysis. Basic SparCC As noted in the main text, the quantity xi tij ≡ Var log , xj 81 (4.8) contains information about the dependence between components i and j, and can be related to the basis correlations. The relation is obtained wi xi = Var log = Var [log wi − log wj ] tij ≡ Var log xj wj = Var [log wi ] + Var [log wj ] − 2Cov [log wi , log wj ] (4.9) ≡ ωi2 + ωj2 − 2ρij ωi ωj , where ωi2 and ωj2 are the variances of the log-transformed basis variables i and j, and ρij is the correlation between them [121]. Our aim is to exploit relation 5.1 to infer the unobserved covariance matrix of the log transformed basis variables Ω, from Aitchison’s variation matrix T, whose elements are tij . Unfortunately, this is impossible for the most general case, since the basis variances are unknown a priori, and the system of equations for all pairs of components is underdetermined, as it involves D(D − 1)/2 equations and D(D + 1)/2 variables (D variances and D(D − 1)/2 correlations). In fact, even the D variance variables alone, with all correlations set to zero, allow solving eq. 5.1 for up to three components. Therefore, at least four components are required to detect deviations from complete independence between all components (this is related to the fact that Aitchison’s test for complete subcompositional independence is only effective when at least four components are analyzed [139]). Since an exact solution cannot be found, we SparCC utilizes an approximation, which is valid when there are many components which are only sparsely correlated. Eq. 5.1 can be rearranged to give the following expression for the correlation: ωi2 + ωj2 − tij ρij = , 2ωi ωj (4.10) which, given the basis variances can be solved to give the basis correlations. Therefore, we employ the following approximation procedure to estimate the basis variances: 82 First, define the variation of component i as ti ≡ D X tij = dωi2 + j=1 X ωj2 − 2 j6=i " X ρij ωi ωj j6=i X ωj2 ωi2 j6=i 1 X ωj 1 −2 ρij = 1+ d d j6=i ωi " # 2 ω ω j j ≡ dωi2 1 + h ii − 2hρij ii , ωi ωi dωi2 # (4.11) where d ≡ D − 1, and h·ii represents averaging over all pairs involving component i. Next, assume that the correlation terms in eq. 4.11 are small, i.e. 2 ωj ωj 1+h ii 2hρij ii , ωi ωi (4.12) and neglect them, yielding the approximate set of equations: ti ' dωi2 + X ωj2 , i = 1, 2, . . . D. (4.13) j6=i Finally, solve eq. 4.13 to obtain the approximated basis variances to be plugged into eq. 4.10, yielding values of the basis correlations. To elucidate the nature of this approximation, consider the case where all the basis variables have the same variance ω. The assumption made in eq. 4.12 simplifies to: 1 hρij ii , (4.14) i.e., we assume that the average correlations are small, rather than requiring that any particular correlation be small. Using the above approximation, the basic inference procedure is the following: 1. Estimate the component fractions in all the samples as outlined above, to obtain the fractions matrix X. 2. Compute the variation matrix T. 83 3. Compute the component variations {ti }. 4. Solve eqs. 4.13 to get an approximate value for all basis variances {ωi }. 5. Plug the estimated log-basis variances into eqs. 5.1 to obtain the basis correlations {ρij }. Iterative SparCC The basic inference procedure can be improved upon by employing the following iterative refinement scheme (Fig. 4.5): 1. Estimate correlations using the basic procedure described above. 2. Identify the most strongly correlated pair of components that was not previously excluded. If the magnitude of this strongest correlation exceeds a given threshold, add this pair to the set of excluded pairs. Otherwise, terminate the estimation procedure. 3. Identify components that form only excluded pairs and completely exclude them from the analysis. Since the assumptions of our method are not met by such components, it is unable to infer their correlations. If all components but three are excluded, terminate the estimation procedure, as the sparsity assumption is violated for the whole system. 4. If any components were excluded, re-estimate the fractions of the remaining components. Note that the new fractions are relative to the new subset of components. (n) 5. Calculate the component variations ti , excluding all strongly correlated pairs. (n) That is, if ci is the set of indices of components identified to be strongly correlated with component i at the previous, nth , iteration, then (n+1) ti = X (n) j ∈c / i 84 tij . (4.15) 6. Use the newly computed component variations to compute the basis correlations, as in steps 4 and 5 of the basic inference procedure. 7. Repeat steps 2 through 6 for a given number of iterations, or until no new strongly correlated pairs are identified. Note that the iterative procedure can result in correlations whose magnitude is greater than 1, indicating that too many pairs were excluded. Setting a higher exclusion threshold, or a lower iteration number will remedy this fallacy, though the resulting approximation is likely to be of poor accuracy. Basis correlation can also be inferred using transformed variables (see Text S1). However, the iterative exclusion detailed above improves the quality of the approximation, making SparCC superior to these alternatives (Fig. S1) To account for the sampling noise, the inference procedure is repeated multiple times, each time with fraction values drawn randomly from their posterior distribution, generating a distribution of each pairwise correlation. The median value of each pairwise correlation distribution is taken as its estimated value. In this work, a threshold of 0.1 and a maximal number of 20 iterations were chosen, and the iterative procedure was repeated 100 times. Comparison of HMP networks inferred using Pearson and SparCC For each body site, pairwise correlations between all OTUs were inferred using both Pearson and SparCC as described above. Interaction networks were subsequently build by connecting all OTU pairs that had a correlation magnitude greater than a given threshold. Results reported in the main text were obtained using a threshold value of 0.3. Comparison between corresponding Pearson and SparCC networks was done by treating the SparCC network as the true one, and computing the number of true-positives (TP), false-positives (FP), true-negatives (TN) and false-negatives (FN) detected in the Pearson network. The above quantities were calculated as following: 85 TP = number of edges that have the same sign in both networks, TN = number of edges are missing from both networks, 1 FP = number of edges that appear only in the Pearson network + number of edges 2 that have different signs, 1 FN = number of edges that appear only in the SparCC network + number of edges 2 that have different signs. Assessing statistical significance The statistical significance of the inferred correlations can be assessed using a bootstrap procedure. First, a large number of simulated datasets, where all components are uncorrelated, are generated as described in Material and Methods. Next, correlations are inferred from each simulated dataset using SparCC with the same parameter setting as is used for the original data. Finally, for each component pair, pseudo pvalues are assigned to be proportion of simulated data sets for which a correlation value at least as extreme as the one computed for the original data was obtained. Computer implementation All analysis and procedures were implemented in Python, utilizing the Numpy [140] and Networkx [141] modules. Plotting was done using the Matplotlib [142] module. 86 Pearson Shuffled Pearson 3 SparCC 3 3 Mid-vagina neff = 1.7 148 148 148 Retroauricular Crease neff = 2.8 Buccal Mucosa neff = 7.1 Gut neff = 11.9 Supragingival Plaque neff = 18.4 Figure 4.1: Similar correlation networks are observed for real world vs. randomly shuffled bacterial abundance data. Correlation networks based on 16S rRNA gene survey data collected as part of the Human Microbiome Project (HMP), inferred using Pearson correlations (left column), and SparCC (right column). Additionally, Pearson correlation networks were inferred from shuffled HMP data (middle column), where all OTUs are independent. The Pearson networks inferred from shuffled data show patterns similar to the ones seen in the Pearson networks of the real data, especially for low diversity body sites. This indicates that the observed Pearson network structure may be due to biases inherent in compositional data rather than a real biological signal. In contrast, no significant correlation were inferred from the shuffled data using SparCC (data not shown). Nodes represent OTUs, with size reflecting the OTU’s average fraction in the community. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn. See Fig. S1 for all 18 HMP body sites. 87 <neff> A Basis 3 D 2 5 4 1 10 B 15 5 6 3 2 3 G 2 1 4 E 1 4 Pearson 5 6 3 2 3 2 1 4 H 1 4 SparCC 5 6 3 2 1 4 20 25 C 5 6 3 2 30 4 F 1 5 6 5 6 3 2 I 1 4 5 6 5 6 3 2 1 4 5 6 Figure 4.2: Pearson correlations inference quality deteriorates with decreasing diversity. Basis data was simulated with a known correlation structure. OTU counts were generated by randomly drawing from the basis , and were subsequently subject to both correlation inference procedures. (A-C) True basis correlation network. (D-F) Networks inferred using standard procedure. (G-I) Networks inferred using SparCC. The average community diversities, as given by the Shannon entropy effective number of components nef f , used in the simulations and observed in the HMP data are indicated on left indicates. As in Fig. 4.1 , nodes represent OTUs, with size reflecting the OTU’s average fraction in the community. Nodes represent OTUs, with size reflecting the OTU’s average fraction in the community. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn. 88 Midvagina Pearson A SparCC B 2D Midvagina 2G Gut Gut 2E 2H 2F 2I Figure 4.3: SparCC outperforms standard inference. Root-mean-square error (RMSE) of both Pearson (A) and SparCC (B) inferred correlations, as a function of the density of the underlying correlation network, as given by the probability that any pair of components be strongly correlated p, and community diversity, as given by the Shannon entropy effective number of components nef f . SparCC errors are smaller than Pearson errors for all parameter values. For the maximal diversity plotted, 50 effective OTU, the inference error obtained using Pearson correlations is greatly decreased. Therefore, it is likely that Pearson correlations perform well on gene expression data, where the effective number of genes is typically in the hundreds or thousands. For each combination of density and diversity, multiple basis correlation networks were randomly generated, and corresponding data was sampled and used for correlation estimation. Dots labeled mid-vagina and gut indicate the average diversity observed in the mid-vagina and gut communities, and the density of their estimated correlation networks. Dots labeled 4.2D-I indicate the diversity and density used to generate the communities analyzed in Fig. 4.2. 89 Mid-vagina neff = 1.7 Retroauricular Crease neff = 2.8 Actinobacteria Bacteroidetes Cyanobacteria Firmicutes Buccal Mucosa neff = 7.1 Fusobacteria Proteobacteria Spirochaetes Tenericutes Verrucomicrobia Other Gut neff = 11.9 Supragingival Plaque neff = 18.4 Figure 4.4: HMP correlation networks inferred using SparCC. Networks inferred using SparCC from the same data as in Fig. 4.1 (see Fig. S2 for SparCC networks of all HMP body sites). No correlations with magnitude greater than the 0.3 cutoff were inferred from the shuffled data (not shown). Nodes represent OTUs, with size reflecting the OTU’s average fraction in the community, and color corresponding to the phylum to which the OTU belongs. Edges between nodes represent correlations between the nodes they connect, with edge width and shade indicating the correlation magnitude, and green and red colors indicating positive and negative correlations, respectively. For clarity, only edges corresponding to correlations whose magnitude is greater than 0.3 are drawn, and unconnected nodes are omitted. See Fig. S6 for all 18 HMP body sites. 90 abort counts matrix no yes fractions matrix >3 components remain? variation matrix yes no component variations new excluded components? basis variances basis correlations yes max iterations? no new excluded pairs? yes no end Figure 4.5: Flow chart of iterative basis correlation inference procedure. 91 Chapter 5 Inferring Strain Diversity from Metagenomic Data Jonathan Friedman, Katherine Huang, Dirk Gevers, and Eric J. Alm Chapter 5 Inferring Strain Diversity from Metagenomic Data 16S rRNA gene profiling enables the taxonomic characterization of entire natural microbial communities. However, this technique offers only limited phylogenetic resolution: multiple related strains are typically lumped together to form a single “Operational Taxonomical Unit” (OTU). Strainlevel community composition information, which is currently largely unexplored, is crucial for detecting migration between environments, and, potentially, paths of disease transmission. Moreover, strain-level resolution provides a unique opportunity for testing existing community-assembly theories, and gleaning insight into the dynamics of natural communities and the forces that shape them, as closely related strains tend to be mostly functionally similar, with significant differences occurring in ecologically-relevant traits. In this work we present a novel procedure, termed StrainFinder, which is capable of inferring within-OTU diversity from metagenomic shotgun sequencing data, and is applicable to the myriad of published metagenomic datasets. Using simulated data, we quantify StrainFinder’s accuracy and show that sequencing depth and strain diversity are the key parameters affecting the inference accuracy. When sufficient data is available, StrainFinder’s estimates are highly accurate. 93 For diverse assemblages, or when little sequence data is available, less abundant strains remain unresolved, and StrainFinder provides a lower bound on the true strain diversity. To illustrate a potential application of StrainFinder, we have inferred the strain composition of 6 OTUs sampled from human cheeck and feces, and show that these OTUs are rarely composed of a single dominant strain. 94 Introduction 16S rRNA gene profiling is commonly employed to achieve a taxonomic characterization of natural microbial communities [2]. Nonetheless, as this technique is focused on partial sequences of a single marker gene, it inevitably provides limited phylogenetic resolution [4]. Typically, multiple related strains are indistinguishable based on their 16S sequences and thus are all lumped together to form a single “Operational Taxonomical Unit” (OTU) [143]. Obtaining higher resolution data regarding the individual strains that make up these aggregate OTUs would facilitate investigation pertaining to both practical and theoretical aspects of microbial communities. Some of the basic questions in ecology revolve around the speciation process, coexistence of similar species, and the assembly of communities [144, 145]. Data at the strain level may provide insight into such questions, as closely related strains tend to be mostly functionally similar, with significant differences occurring in ecologically-relevant traits [146, 147]. Moreover, individually resolved strains may be used to detect migration between environments, and elucidate the role of the migration process in shaping natural communities. One potential application of detecting migration events is the identification of the paths by which pathogens are transmitted between humans and their environment and within human communities. To date, strain-level phylogenetic resolution has been achieved mainly by targeted sequencing of genetic markers specific to the taxonomic group of interest, or by whole-genome sequencing of multiple isolates of the group of interest (e.g. [30, 148]). Notable exceptions are recent works by Morowitz et al. [149] and Schloissnig et al. [150] in which strain-level information has been inferred from metagenomic shotgun sequencing data. Morowitz et al. have leveraged their ability to detect correlated SNPs in a time-series, along with manual intervention to obtain high quality assemblies of closely related strains that colonize the infant gut. Schloissnig et al. broadly analyzed the within-OTU genetic diversity in the human gut, without any assembly. In this work, we develop a procedure, called StrainFinder, capable of inferring 95 within-OTU genetic diversity and abundance distribution. Importantly, our procedure does not require time-series data, nor manual intervention. Thus, it can be straightforwardly applied to the myriad of published metagenomic datasets. Next, we use simulated data to investigate the impact of the amount of available sequence data, the within-OTU genetic diversity, and abundance distribution on the inference accuracy. Finally, we illustrate a potential application of StrainFinderby applying it to 6 OTUs from 2 sites on the human body. Results Metagenomic shotgun sequencing has the potential to offer higher phylogenetic resolution than 16S profiling, as it does not focus on a single genetic marker [6]. The main caveat is that this technique provides short reads (∼100 bp), randomly sampled from all the genomes in the community. The randomness of the sampling results in a variable sequencing depth within a genome, and renders each strain’s average sequencing depth proportional to the strain’s abundance in the community, making it challenging to achieve good coverage of rare strains. Moreover, short reads provide little genetic linkage information, typically rendering genome assembly unfeasible. The following section outlines a procedure for extracting strain level information from metagenomic shotgun data despite the aforementioned challenges. A prerequisite to a within-OTU-level analysis is the identification of reads sampled from strains comprising the OTU of interest. As reads are too short to enable assembly, the desired reads are recruited by aligning all individual reads against an available reference genome of a strain belonging to the focal OTU (Fig. 5.1, Materials and Methods). This step results in the identification of polymorphic genomic positions, and the proportions of the alleles found in these positions. This approach has been employed by Schloissnig et al. [150] to study the overall genomic variation within an OTU. However, utilizing such data to identify specific strains and their relative abundances has remained elusive, as most polymorphic positions are located too far apart for a single read to span multiple polymorphic sites. 96 Strain Propor6on Genome Fragment Strain 1 0.3 A A C G G A Strain 2 0.7 A A T G G T Strain 1 0.1 C A T G G A Strain 2 0.3 A A C G G A Strain 3 0.6 A A T G G T Figure 5.1: Allele frequencies reflect combinations of strain proportions. Strain proportions can be inferred from the allele proportions, which can be readily inferred when deep sequencing data is available (middle column). With lower sequencing depth, sampling noise obscures the true allele proportions (right column). Data were simulated as detailed in Materials and Methods, with the denoted strain proportions and sequencing depths. For clarity, only allele proportions > 0.05 from polymorphic sites are shown. Progress can be made by noting that alleles found in the same set of genomes should be present in the community at identical proportions. To illustrate the utility of this fact, consider first a simple case in which an OTU is comprised of 2 strains whose relative abundances are {0.7, 0.3} (Fig. 5.1, top row). In this scenario, in addition to monomorphic genomic positions, there is only one type of polymorphic position, in which the 2 strains have different base pairs. Therefore, all alleles found at a 0.3 proportion at different polymorphic positions can be assigned to the same genome, and there is no need to assemble large regions spanning all polymorphic positions. When more than 2 strains are present, more types of polymorphic positions exist, corresponding to different combinations of strains sharing the same allele. For example, if there are 3 strains whose relative abundances are {0.6, 0.3, 0.1}, there may be alleles found at proportions of 0.9, 0.7, 0.6, 0.4, 0.3, and 0.1 (Fig. 5.1, bottom row). In this case, the strains’ relative abundances can be determined by finding the minimal set of strain abundances whose combinations correspond to the allele proportions. Once strain abundances have been inferred alleles can be readily 97 Ppoly = 0.01 Ppoly = 0.005 Inferred Proportions Inferred Proportions Inferred Proportions Ppoly = 0.05 1 Nreads = 50 Nreads = 100 nef f Nreads = 500 7 0.5 0 5 1 0.5 0 3 1 0.5 0 0 0.5 True Proportions 1 0 0.5 True Proportions 1 0 0.5 True Proportions 1 1 Figure 5.2: Accuracy of strain proportion inference. StrainFinder’s strain proportion estimates are highly accurate for abundant strains. For rare strains, inference accuracy decreases in more diverse communities, though reliable estimates are obtained in lower diversity communities. Each dot indicates the true and inferred proportions of a single strain, and colors indicate the diversity of the community to which the strain belongs. Strain proportions were inferred from simulated data with varying number of strains, proportion of polymorphic positions (rows), and sequencing depth (columns), and correspond to the community diversities shown in Fig. 5.3. assigned. The simple procedure outlined above requires precise values of allele proportions, which are typically not available due to the sampling noise (Fig. 5.1, right column). Since metagenomic shotgun sequencing consists of sampling random DNA fragments from an entire community, it is hard to achieve good coverage of a specific OTU, and coverage varies along genomes. This coverage-dependent sampling noise obscures the true allele proportions, except when extremely high coverage is obtained, thus limiting the practical utility of the aforementioned procedure. In the presence of significant sampling noise, a probabilistic framework is required in order to take advantage of the insight that alleles found in a common set of genomes have equal proportions. In the next section we present such a probabilistic approach, which builds upon the same intuition presented above. 98 StrainFinder: an algorithm for inferring correlations from GSD StrainFinder formalizes the fact that allele proportions reflect combinations of strain proportions, and incorporates the sampling procedure to form a likelihood function that can be maximized to find the most likely set of strains. To derive the likelihood function, we first express the allele proportions Π in terms of the strains’ relative abundances p = {p1 , p2 , ..., pn }: Πi (j) = n X pk Iki (j), (5.1) k=1 where Πi (j) is the proportion of allele j at position i, and Iki (j) captures the strains genomes: Iki (j) = 1 strain k has allele j in position i 0 otherwise. (5.2) Next, sequencing errors are accounted for by modifying the allele proportions to form the “effective allele proportions”: 0 Πi (j) = Πi (j) [1 − ] + [1 − Πi (j)] , 3 (5.3) where is the probability that any sequencing error occurs. Note that the above represents a simple error model in which all sequencing errors are equally likely, but more detailed alternatives may be modeled in an analogous manner. Finally, the sampling procedure is assumed to be unbiased, thus the observed counts at each position follow a multinomial distribution. Therefore, the log-likelihood function of the observed counts is given by: ll(p, , I; C) ∝ L X X i j∈{A,C,T,G} 99 0 cij log Πij , (5.4) where cij is the observed counts of allele j at position i, and L is the number of positions. Optimization of this likelihood function yields maximum-likelihood estimates of the error rate and proportions of a predefined number of strains. To infer the optimal number of strains, the above likelihood function is maximized with a sequentially increasing number of strains until no significant improvement in the optimized likelihood is obtained (see Materials and Methods). StrainFinder is highly accurate on simulated data Ppoly = 0.05 Inferred nef f 7 Nreads = 50 Nreads = 100 Nreads = 500 5 3 1 Ppoly = 0.01 Inferred nef f 7 n=1 n=2 n=3 n=4 n=5 n=6 n=7 5 3 1 Ppoly = 0.005 Inferred nef f 7 5 3 1 1 3 True nef f 5 7 1 3 True nef f 5 7 1 3 True nef f 5 7 Figure 5.3: Accuracy of strain diversity inference. StrainFinder’s strain diversity estimates are highly accurate up to a diversity threshold that depends of the amount of available data. For very diverse communities, StrainFinder is unable to resolve all strains and thus the inferred diversity provides a low bound on the true diversity. Strain diversities where inferred from simulated data with varying number of strains (colors), proportion of polymorphic positions (rows), and sequencing depth (columns). Each dot indicates the true and inferred diversities of a single community, as quantified by the Shannon entropy effective number of strains. We have used simulated datasets in order to evaluate StrainFinder’s accuracy and identify data properties that affect the inference accuracy. Simulated datasets were generated by creating a set of random genomes with a given proportion of polymorphic positions, assigning random relative abundances to genomes, and sampling reads from the resulting community at the desired coverage level. Multiple datasets 100 were generated for varying number of strains, proportion of polymorphic positions, and sequencing depth. StrainFinder was applied to each of the simulated datasets, and inferred parameters were compared to the true values used to generate the data. StrainFinder’s strain proportion estimates are highly accurate for abundant strains, with strain diversity and the sequencing depth being the key factors affecting the inference accuracy (Fig. 5.2). For rare strains, inference accuracy decreases in more diverse communities, though reliable estimates are obtained in lower diversity communities. For diverse communities, or in the presence of very rare strains, StrainFinder may be unable to resolve all strains (Fig. 5.2). Consequentially, the inferred number of strains may be unreliable (Fig. S1). However, the Shannon entropy effective number of strains, which accounts for both the number of strains and their proportions, can be inferred much more reliably (Fig. 5.3). StrainFinder’s estimates of this diversity measure are highly accurate up to a diversity threshold that depends of the sequencing depth. Beyond this threshold StrainFinder underestimates the strain diversity, due to the presence of strains with similar proportions that cannot be resolved. Thus the inferred diversity provides a low bound on the true diversity. StrainFinder identifies strain-level patterns in the Human Microbiome OTU number 53 196 1025 585 809 1638 Assigned taxonomy Body site Alistipes Stool Bacteroides vulgatus Stool Ruminococcus torques Stool Gemella haemolysans cheek Neisseria sicca cheek Streptococcus mitis B6 cheek # Samples 123 137 11 44 6 101 Median Nreads 125 211 72 60 61 97 Median Npoly 1568 118 883 82 284 419 Table 5.1: HMP reference genomes used for strain analysis. We used StrainFinder to infer the strain composition of 6 of the most abundant OTUs found in the cheeck and feces (Table 5.1). We find that these OTUs are 101 typically comprised of several strains, rather than a single dominant strain (Fig. 5.4). In addition, there is consistent variation in strain diversity between OTUs across individuals, which appears to be independent of body site (Fig. 5.4). Discussion While the method presented in this paper enables extracting strain-level information from metagenomic shotgun sequencing data, several challenges remain. Since, with current read lengths, reads spanning multiple polymorphic positions are rare and little linkage information is available, StrainFinder does not incorporate this information. However, linkage information may greatly improve the inference accuracy when longer reads are available, or in cases where a high density of polymorphic positions exists. Like all maximum-likelihood methods, StrainFinder yields point estimates of parameter values, whereas a Bayesian approach would yield much richer posterior distribution of model parameters. Such an approach would require choosing an ap- Stool Saliva Figure 5.4: Strain diversity in the human microbiome. OTUs identified in the Human Microbiome Project (HMP) are typically composed of several strains. Consistent differences in strain diversity exist between OTUs, however no relation between body-site and strain diversity is detected. Thick lines, boxes and wishkers represent mdeians, interquartile ranges (IQR), and most extreme data within 1.5 × IQR, correspondingly. 102 propriate process for the generating the number of strains, and prior distributions for model parameters. Additionally, implementing Bayesian approaches often requires computationally-expensive simulations, which may render such approaches unfeasible for ever-expanding metagenomic datasets that comprise of hundreds of samples and OTUs. StrainFinder analyzes each sample individually, and thus can be applied to the myriad of published metagenomic datasets. However, a better inference may be possible when several samples containing the same strains are available. For an individual sample StrainFinder is based the fact that alleles found in a common set of strains have similar proportions within a sample. It is possible to extend this approach by noting that the proportions of alleles found in a common set of strains should be correlated across samples in which these strains are present. Indeed, Morowitz et al. [149] have used the later insight to study the microbial colonization of the infant gut looking at metagenomic time-series data. We believe that StrainFinder sets the path for the development of an automated, probabilistic tool capable of inferring the distribution of individual strains across samples and over time, and, with the rapid increase in the amount of available metagenomic data, would become an integral part of the toolbox used by both microbial ecologists and epidemiologists. Materials and Methods Simulated datasets Each dataset was generated in 3 steps: 1. n genomes of 10000 positions each are constructed. A fraction ppoly of the positions are polymorphic, and the rest are monomorphic. At polymorphic positions, each genome is assigned a randomly selected allele (alleles are uniformly distributed). 2. Strain proportions are sampled from an n-dimensional Dirichlet distribution with concentration 1. This distribution is uniform over all n-strain community 103 compositions. 3. “Observed” allele counts are simulating for each position by drawing from the appropriate multinomial distribution: 0 Csim ∼ Multi(Nreads , Π ), (5.5) 0 where Π is the effective probability of observing each allele, as defined in the Results section, and Nreads is the sequencing depth. A sequencing error rate = 0.005 was used to generate all datasets. All combinations of ppoly ∈ {0.005, 0.01, 0.05}, Nreads ∈ {50, 100, 500}, n ∈ {1 − 7} were used. Multiple datasets were generated for each parameter combination. Since communities composed of more strains have more possible compositions, the number of datasets generated was set to 7n. Effective number of species The Shannon entropy effective number of strains in an OTU is defined as nef f = eH , where H = − P k (5.6) pk log pk is the entropy of strain proportions [129]. The effective number of species reported in the main text is computed using the either the known or maximum-likelihood strain proportions for the corresponding OTU. Accounting for sequencing errors In this subsection we derive a rather general framework for incorporating sequencing errors to the StrainFinder framework. First, we define the error matrix E, whose elements eij give the probability of observing allele i when the true allele is j. The relation between the vector of true allele frequencies Πand the expected allele 104 0 frequencies Π is given by: 0 Π = EΠ. (5.7) In this manuscript, we employed the parsimonious assumption that all errors are equally likely, and that that total error rate is . This assumption can be equivalently stated as: eij = |A| − 1 1 − i 6= j (5.8) otherwise, where |A| is the number of possible alleles. With 4 alleles, plugging equation 5.8 in the general relation given in 5.7 recovers the simple expression given in Eq. 5.3 of the Results section. Optimization procedure Since only polymorphic positions are informative regarding the strain composition, we perform the optimization using only positions for which the null assumption of monomorphism can be rejected. More precisely, the null assumption is that the most abundant allele in a position is the sole true allele in that position, and that other observed alleles resulted from sequencing errors. This assumption can be tested using a one-sided binomial test with a specified error rate. In this work, we use and error rate of 0.01 and a p-value threshold of 0.01. The log-likelihood function given in Eq. 5.4 is maximized by searching through values of the sequencing error rate and strain proportions, and, at each iteration, setting the genomes to the value that maximizes the log-likelihood function given the iteration’s sequencing error rate and strain proportions. As different genomic positions are conditionally independent given the strain proportions, this approach enables optimizing each genomic position separately and greatly reduces the search space. With a limited number of strains, it is feasible to exhaustively evaluate the likelihood of all possible genomes, and this is the approach implemented in StrainFinder. 105 The optimal number of strain is assigned to be the one that minimizes Akaike’s Information Criterion (AIC) [151]. Searching over the strain number is done by starting with a single strain and sequentially adding strains until the AIC increases. The number of model parameters used in calculating the AIC is given by: m = |{z} 1 + error rate n − 1} | {z strain proportions + |{z} nL . (5.9) genomes HMP data processing Metagenomic reads from the stool and cheeck samples collected as part of the HMP were aligned to reference genomes of 3 of the most abundant OTUs in each of these sites (Table 5.1), as identified in the HMP [128]. To minimize the confounding effect of horizontally-transfered flexible genes, only genes that mapped to one of the phylogenetic marker genes identified in the AMPHORA pipeline [152] were included in the analysis. Additionally, reads and base-pairs with a quality score [153] < 20 were excluded. Computer implementation StrainFinder was implemented in the Python programming language, utilizing the Numpy (1.6.2) [154] and openopt (0.38) [155] packages. Optimization is conducted using the ’ralg’ solver implemented in openopt. Simulations and analysis were also implemented in Python, and plotting was done using the Matplotlib (1.0.1) [156] package. 106 Part III Conclusion 107 In this thesis, I have described algorithms and analysis that promote our understanding of microbial adaptation, differentiation, and community structure. Nonetheless, we are still far from a unified understanding of the complex interplay between microbial ecology and evolution [144]. This last section discusses possible extensions of the methods presented in this work, and future research directions. In Chapter 1, we studied the fundamental temperature-salinity niche species can occupy in isolation. Surprising, not only did we not detect a trade-off between temperature tolerance and salinity tolerance, the two were in fact positively correlated. A trade-off could very well exist in some unmeasured parameter, like growth rates or low temperature/salinity range, but this finding highlights the need for a better understanding of the process by which niches evolve, and the forces that drives them. Notably, the only niche shapes observed in this study are the ones corresponding to additive, antagonistic, or no interactions between temperature and salinity stresses. This is in contrast to interactions between antibiotic combinations, which are often synergistic [42]. These findings may be indicative of selection favoring certain interactions, however the niches of many more organisms, across different environmental conditions need to be mapped before general conclusion can be drawn. In Chapter 2, we provided evidence indicating that only a handful of loci are responsible for ecological differentiation of two closely related Vibrio splendidus populations which are found in distinct habitats. Rampant recombination partially explains the breakage of linkage between the adaptive loci and the rest of the genome. However, recombination is not likely to be the sole cause, as this would require unrealistically high rates of recombination [8]. Studying similar occasions of recent differentiation in species that do not recombine frequently may provide clues as to the other mechanisms that hinder selective sweeps, and is necessary to assess their prevalence. In addition, We have detected restricted gene flow between the two ecological populations. A reduced cross-habitat encounter rate may account for this observation, yet the exact mechanism remains to be determined. Finally, we proposed that gene flow may diminish further with time, resulting in two populations that are mostly genetically isolated. Further monitoring the gene flow across the habitats will reveal the 108 dynamics and extent of genetic isolation following a differentiation in a recombining populations, and will provide an opportunity to elucidate the mechanisms that increase genetic isolation. Our simulation results presented in Chapter 3 indicate that time-varying recombination rates may be advantageous in some scenarios. For mutations, such transient periods of elevated evolutionary rates have been both modeled and observed in bacteria [157, 158]. For recombination, we have considered one relatively simple scenario, however the general conditions under which periods of ‘hyper-recombination’ would emerge need to be elucidated. Specifically, the distribution of fitness effects of alleles acquired by HGT, and the ability to affect that distribution via mate selection are likely to play a key role in determining the optimal recombination rate. The algorithm SparCC, described in Chapter 4, allows detection of inter-taxa correlation from 16S survey data. In human-associated communities, we have detected a tendency for more closely related taxa to be more strongly correlated, despite the fact that for such taxa resource competition is expected to result in negative correlations. A potential explanation is that more similar taxa are more likely to share a niche, and that positive correlations are induced by fluctuations in niche size. If true, this hypothesis predicts that abundance should be correlated with functional similarity, a prediction that could be tested using datasets for which both 16S profiling and fully sequenced genomes are available, such as the Human Microbiome Project. A key limitation of SparCC is that it specifically infers linear correlations. However, realworld ecological interactions are unlikely to manifest as simple linear correlations, and their detection will require the development of analogous methods for detecting more complex, non-linear dependencies, and for the analysis of time-series data. Lastly, in Chapter 5, we introduced the algorithm StrainFinder, which can infer within-OTU strain diversity. Since strains comprising a single OTU are likely to be (mostly) functionally equivalent, their dynamics are expected to resemble the predictions of neutral theory [159], and deviations from this expectation can be used as indicators for the presence of additional forces. For example, a prolonged persistence of individual-specific strains is consistent with the action of immune-system selection. 109 In addition, in the previous paragraph we proposed that fluctuations in shared-niche size are responsible for the elevated correlations between related OTUs. This hypothesis can be tested using StrainFinder, as it infers relative strain proportions within an OTU, and thus eliminates niche-size effects. Extensions that will increase StrainFinder’s accuracy, and broaden its applicability include the incorporation of linkage information regarding alleles found on the same read and the joint analysis of several samples expected to contain shared strains, such as samples comprising a metagenomic time-series. 110 Part IV Bibliography 111 Bibliography [1] Falkowski P, Fenchel T, Delong E (2008) The microbial engines that drive earth’s biogeochemical cycles. Science 320: 1034– 9. [2] Medini D, Serruto D, Parkhill J, Relman D, Donati C, et al. (2008) Microbiology in the post-genomic era. Nature Reviews Microbiology 6: 419–430. [3] Gogarten J, Townsend J (2005) Horizontal gene transfer, genome innovation and evolution. Nature Reviews Microbiology 3: 679. [4] Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. (2009) The ribosomal database project: improved alignments and new tools for rRNA analysis. Nucleic acids research 37: D141D145. [5] Tettelin H, Riley D, Cattuto C, Medini D (2008) Comparative genomics: the bacterial pan-genome. Current opinion in microbiology 11: 472477. [6] Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525552. [7] Wiedenbeck J, Cohan F (2011) Origins of bacterial diversity through horizontal genetic transfer and adaptation to new ecological niches. FEMS Microbiology Reviews 35: 957976. [8] Shapiro BJ, David LA, Friedman J, Alm EJ (2009) Looking for Darwin’s footprints in the microbial world. Trends in Microbiology 17: 196–204. [9] Hutchinson G (1957) Concluding remarks. Cold Spring Harbor Symp Quant Bio 22: 415–427. 112 [10] Whittaker R, Levin S, Root R (1973) Niche, habitat, and ecotope. The American Naturalist 107: 321–338. [11] Leibold MA (1995) The niche concept revisited: mechanistic models and community context. Ecology 76: 13711382. [12] Chase JM, Leibold MA (2003) Ecological niches: linking classical and contemporary approaches. Chicago: University of Chicago Press. [13] Holt RD (2009) Bringing the hutchinsonian niche into the 21st century: Ecological and evolutionary perspectives. Proceedings of the National Academy of Sciences of the United States of America 106: 19659–19665. [14] Panagou EZ, Skandamis PN, Nychas GJE (2005) Use of gradient plates to study combined effects of temperature, pH, and NaCl concentration on growth of Monascus ruber van tieghem, an ascomycetes fungus isolated from green table olives. Applied and Environmental Microbiology 71: 392–399. [15] Hooper HL, Connon R, Callaghan A, Fryer G, Yarwood-Buchanan S, et al. (2008) The ecological niche of Daphnia magna characterized using population growth rate. Ecology 89: 1015–1022. [16] May R, MacArthur R (1972) Niche overlap as a function of environmental variability. Proceedings of the National Academy of Sciences 69: 1109–1113. [17] Tilman D (1982) Resource competition and community structure, volume 17 of Monographs in population biology. Princeton: Princeton University Press. PMID: 7162524. [18] Naeem S, Colwell R (1991) Ecological consequences of heterogeneity of consumable resources. In: Kolasa J, Pickett S, editors, Ecological Heterogeneity, New York: Springer-Verlag. pp. 224–255. 13359282131221785228related:jKqyXVG0ZbkJ. [19] Thompson F, Iida T, Swings J (2004) Biodiversity of vibrios. MICROBIOLOGY AND MOLECULAR BIOLOGY REVIEWS 68: 403. [20] Thompson J, Randa M, Marcelino L, Tomita-Mitchell A, Lim E, et al. (2004) Diversity and dynamics of a north atlantic coastal Vibrio community. Applied and environmental microbiology 70: 4103–4110. [21] Hunt D, David L, Gevers D, Preheim S, Alm E, et al. (2008) Resource partitioning and sympatric differentiation among closely related bacterioplankton. Science 320: 1081–1085. [22] Urakawa H, Rivera I (2006) Aquatic environment. In: Thompson F, Austin B, Swings J, editors, The biology of vibrios, Washington, D.C.: ASM Press. p. 175189. [23] Lozupone C, Hamady M, Kelley S, Knight R (2007) Quantitative and qualitative beta diversity measures lead to different insights into factors that structure microbial communities. Appl Environ Microbiol : AEM.01996–06v1. [24] Herlemann DP, Labrenz M, Jurgens K, Bertilsson S, Waniek JJ, et al. (2011) Transitions in bacterial communities along the 2000km salinity gradient of the baltic sea. ISME J 5: 1571–1579. [25] Kaspar CW, Tamplin ML (1993) Effects of temperature and salinity on the survival of Vibrio vulnificus in seawater and shellfish. Applied and environmental microbiology 59: 2425. [26] McCarthy SA (1996) Effects of temperature and salinity on survival of toxigenic Vibrio cholerae o1 in seawater. Microbial ecology 31: 167175. 113 [27] Randa M, Polz MF, Lim E (2004) Effects of temperature and salinity on Vibrio vulnificus population dynamics as assessed by quantitative PCR. Applied and environmental microbiology 70: 5469–5476. [28] Garrity GM, Brenner DJ, Krieg NR (2005) Bergey’s manual of systematic bacteriology, volume 2. New York: Springer. [29] Soto W, Gutierrez J, Remmenga MD, Nishiguchi MK (2009) Salinity and temperature effects on physiological responses of Vibrio fischeri from diverse ecological niches. Microbial ecology 57: 140150. [30] Thompson JR, Pacocha S, Pharino C, Klepac-Ceraj V, Hunt DE, et al. (2005) Genotypic diversity within a natural coastal bacterioplankton population. Science 307: 1311 –1313. [31] Boucher Y, Cordero OX, Takemura A, Hunt DE, Schliep K, et al. (2011) Local mobile gene pools rapidly cross species boundaries to create endemicity within global Vibrio cholerae populations. mBio 2: e00335–10. [32] Boucher Y, Nesb CL, Joss MJ, Robinson A, Mabbutt BC, et al. (2006) Recovery and evolutionary analysis of complete integron gene cassette arrays from Vibrio. BMC Evolutionary Biology 6: 3. [33] Stanley SO, Morita RY (1968) Salinity effect on the maximal growth temperature of some bacteria isolated from marine environments. Journal of Bacteriology 95: 169. [34] Preheim SP, Timberlake S, Polz MF (2011) Merging taxonomy with ecological population prediction in a case study of Vibrionaceae. Applied and Environmental Microbiology 77: 7195–7206. [35] Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5: 113. [36] Guindon S, Lethiec F, Duroux P, Gascuel O (2005) PHYML online–a web server for fast maximum likelihood-based phylogenetic inference. Nucleic Acids Research 33: W557–W559. [37] Felsenstein J (1985) Phylogenies and the comparative method. The American Naturalist 125: 1–15. [38] Garland T, Bennett AF, Rezende EL (2005) Phylogenetic approaches in comparative physiology. Journal of Experimental Biology 208: 3015–3035. [39] Garland T, Harvey P, Ives A (1992) Procedures for the analysis of comparative data using phylogenetically independent contrasts. Systematic Biology 41: 18 –32. [40] Ives A, Midford P, Garland T (2007) Within-species variation and measurement error in phylogenetic comparative methods. Systematic Biology 56: 252–270. [41] Keith CT, Borisy AA, Stockwell BR (2005) Multicomponent therapeutics for networked systems. Nature Reviews Drug Discovery 4: 71–78. [42] Yeh P, Tschumi A, Kishony R (2006) Functional classification of drugs by properties of their pairwise interactions. Nature Genetics 38: 489–494. [43] Gessner P (1995) Isobolographic analysis of interactions: an update on applications and utility. Toxicology 105: 161–179. [44] McDougald D, Kjelleberg S (2006) Adaptive responses to vibrios. In: Thompson F, Austin B, Swings J, editors, The biology of vibrios, Washington, D.C.: ASM Press. pp. 133–155. [45] Emerson S, Hedges J (2008) Chemical oceanography and the marine carbon cycle. Cambridge: Cambridge University Press. [46] Rosche TM, Smith DJ, Parker EE, Oliver JD (2005) RpoS involvement and requirement for exogenous nutrient for osmotically induced cross protection in Vibrio vulnificus. FEMS Microbiology Ecology 53: 455–462. [47] Hengge-Aronis R (2002) Signal transduction and regulatory mechanisms involved in control of the sigma(s) (RpoS) subunit of RNA polymerase. Microbiology and Molecular Biology Reviews: MMBR 66: 373– 395. 114 [48] Diez-Gonzalez F, Kuruc J (2009) Molecular mechanisms of microbial survival in foods. In: Jaykus LA, Wang HH, Schlesinger LS, editors, Food-borne microbes: shaping the host ecosystem, ASM Press. pp. 135–159. [49] Coleman ML, Chisholm SW (2010) Ecosystem-specific selection pressures revealed through comparative population genomics. Proceedings of the National Academy of Sciences of the United States of America 107: 18634–18639. [50] Papke RT, Zhaxybayeva O, Feil EJ, Sommerfeld K, Muise D, et al. (2007) Searching for species in haloarchaea. Proceedings of the National Academy of Sciences of the United States of America 104: 14092– 14097. [51] Guttman DS, Dykhuizen DE (1994) Detecting selective sweeps in naturally occurring Escherichia coli. Genetics 138: 993– 1003. [52] Denef VJ, Kalnejais LH, Mueller RS, Wilmes P, Baker BJ, et al. (2010) Proteogenomic basis for ecological divergence of closely related bacteria in natural acidophilic microbial communities. Proceedings of the National Academy of Sciences of the United States of America 107: 2383– 2390. [53] Cohan FM, Perry EB (2007) A systematics for discovering the fundamental units of bacterial diversity. Curr Biol 17: R373–86. [54] Koeppel A, Perry EB, Sikorski J, Krizanc D, Warner A, et al. (2008) Identifying the fundamental units of bacterial diversity: a paradigm shift to incorporate ecology into bacterial systematics. Proceedings of the National Academy of Sciences of the United States of America 105: 2504–2509. [55] Sabeti PC, Schaffner SF, Fry B, Lohmueller J, Varilly P, et al. (2006) Positive natural selection in the human lineage. Science 312: 1614–1620. [56] Bisharat N, Cohen DI, Harding RM, Falush D, Crook DW, et al. (2005) Hybrid Vibrio vulnificus. Emerging infectious diseases 11: 30–35. [57] Visick KL (2009) An intricate network of regulators controls biofilm formation and colonization by Vibrio fischeri. Mol Microbiol 74: 782–789. [67] Dykhuizen DE, Green L (1991) Recombination in Escherichia coli and the definition of biological species. Journal of Bacteriology 173: 7257–7268. [58] Satchell KJF (2011) Structure and Function of MARTX Toxins and Other Large Repetitive RTX Proteins. Annu Rev Microbiol 65: 71–90. [68] Lawrence JG (2002) Gene Transfer in Bacteria: Speciation without Species? Theoretical population biology 61: 449–460. [59] King T, Ishihama A, Kori A, Ferenci T (2004) A regulatory trade-off as a source of strain variation in the species Escherichia coli. Journal of Bacteriology 186: 5614– 5620. [60] Meibom KL, Li XB, Nielsen AT, Wu CY, Roseman S, et al. (2004) The Vibrio cholerae chitin utilization program. Proceedings of the National Academy of Sciences of the United States of America 101: 2524–2529. [61] Chiavelli DA, Marsh JW, Taylor RK (2001) The mannose-sensitive hemagglutinin of Vibrio cholerae promotes adherence to zooplankton. Applied and Environmental Microbiology 67: 3220–3225. [62] Majewski J (2001) Sexual isolation in bacteria. FEMS microbiology letters 199: 161– 169. [63] Falush D, Torpdahl M, Didelot X, Conrad DF, Wilson DJ, et al. (2006) Mismatch induced speciation in Salmonella: model and data. Philosophical transactions of the Royal Society of London Series B, Biological sciences 361: 2045–2053. [64] Fraser C, Hanage WP, Spratt BG (2007) Recombination and the nature of bacterial speciation. Science 315: 476–480. [65] Eppley JM, Tyson GW, Getz WM, Banfield JF (2007) Genetic exchange across a species boundary in the archaeal genus ferroplasma. Genetics 177: 407–416. [66] Denef VJ, Mueller RS, Banfield JF (2010) AMD biofilms: using model communities to study microbial evolution and ecological complexity in nature. The ISME journal 4: 599–610. 115 [69] Turner T, Hahn M, Nuzhdin S (2005) Genomic islands of speciation in Anopheles gambiae. PLoS Biology 3: 1572–1578. [70] Neafsey DE, Lawniczak MKN, Park DJ, Redmond SN, Coulibaly MB, et al. (2010) SNP genotyping defines complex gene-flow boundaries among african malaria vector mosquitoes. Science 330: 514–517. [71] Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, et al. (2012) Population Genomics of Early Events in the Ecological Differentiation of Bacteria. Science 336: 48–51. [72] Vos M, Didelot X (2009) A comparison of homologous recombination rates in bacteria and archaea. The ISME journal 3: 199– 208. [73] Kondrashov AS (1983) Multilocus model of sympatric specation I & II. One and Two characters. Theoretical population biology 24: 121–144. [74] Kondrashov AS (1986) Multilocus Model of Sympatric Speciation III. Computer Simulations. Theoretical population biology 29: 1–15. [75] Kondrashov AS, Kondrashov FA (1999) Interactions among quantitative traits in the course of sympatric speciation. Nature; Nature 400: 351–354. [76] Bush GL (1994) Sympatric speciation in animals: new wine in old bottles. TRENDS INECOLOGY AND EVOLUTION 9: 285– 288. [77] Via S (2001) Sympatric speciation in animals: the ugly duckling grows up. TRENDS INECOLOGY AND EVOLUTION 16: 381–390. [78] Kondrashov AS, Mina MV (1986) Sympatric speciation: when is it possible? Biological Journal of the Linnean Society 27: 201–223. [90] Schluter D (2009) Evidence for ecological speciation and its alternative. Science 323: 737–741. [91] Cooper TF (2007) Recombination speeds adaptation by reducing competition between beneficial mutations in populations of Escherichia coli. PLoS Biology 5: e225. [79] Preheim SP, Boucher Y, Wildschutte H, David LA, Veneziano D, et al. (2011) Metapopulation structure of Vibrionaceae among coastal marine invertebrates. Environmental Microbiology 13: 265–275. [92] Kondrashov AS (1982) Selection against harmful mutations in large sexual and asexual populations. Genetical Research 40: 325–332. [80] Mallet J, Meyer A, Nosil P, Feder JL (2009) Space, sympatry and speciation. Journal of Evolutionary Biology 22: 2332–2341. [93] Levin BR, Cornejo OE (2009) The Population and Evolutionary Dynamics of Homologous Gene Recombination in Bacterial Populations. PLoS Genetics 5: e1000601. [81] Whitaker RJ, Grogan DW, Taylor JW (2003) Geographic barriers isolate endemic populations of hyperthermophilic archaea. Science 301: 976–978. [94] Kondrashov FA, Kondrashov AS (2001) Multidimensional epistasis and the disadvantage of sex. Proceedings of the National Academy of Sciences of the United States of America 98: 12089–12092. [82] Cohan FM, Perry EB (2007) A systematics for discovering the fundamental units of bacterial diversity. Curr Biol 17: R373–86. [83] Fraser C, Hanage WP, Spratt BG (2007) Recombination and the nature of bacterial speciation. Science 315: 476–480. [95] Wylie CS, Trout AD, Kessler DA, Levine H (2010) Optimal strategy for competence differentiation in bacteria. PLoS Genetics 6. [84] Fraser C, Alm EJ, Polz MF, Spratt BG, Hanage WP (2009) The bacterial species challenge: making sense of genetic and ecological diversity. Science 323: 741–746. [96] Mayr E (1942) Systematics and the origin of species, from the viewpoint of a zoologist. Harvard University Press. [85] Wiedenbeck J, Cohan FM (2011) Origins of bacterial diversity through horizontal genetic transfer and adaptation to new ecological niches. FEMS Microbiology Reviews 35: 957–976. [97] Dieckmann U, Doebeli M (1999) On the origin of species by sympatric speciation. Nature 400: 354–357. [98] White BJ, Cheng C, Simard F, Costantini C, Besansky NJ (2010) Genetic association of physically unlinked islands of genomic divergence in incipient species of Anopheles gambiae. Molecular Ecology 19: 925–939. [86] McKinnon J, Rundle H (2002) Speciation in nature: the threespine stickleback model systems. TRENDS INECOLOGY AND EVOLUTION 17: 480–488. [99] Neafsey DE, Lawniczak MKN, Park DJ, Redmond SN, Coulibaly MB, et al. (2010) SNP genotyping defines complex gene-flow boundaries among african malaria vector mosquitoes. Science 330: 514–517. [87] Felsenstein J (1981) Skepticism towards Santa Rosalia, or why are there so few kinds of animals? Evolution 35: 124–138. [88] Haldane J (1932) The Causes of Evolution. London: Longmans, Green & Co. [89] Mallet J (2006) What does Drosophila genetics tell us about speciation? TRENDS INECOLOGY AND EVOLUTION 21: 386–393. [100] Lawniczak MKN, Emrich SJ, Holloway AK, Regier AP, Olson M, et al. (2010) Widespread Divergence Between Incipient Anopheles gambiae Species Revealed by Whole Genome Sequences. Science 330: 512–514. 116 [101] Via S (2012) Divergence hitchhiking and [110] Dasmahapatra KK, Walters JR, Briscoe the spread of genomic isolation during ecoAD, Davey JW, Whibley A, et al. (2012) logical speciation-with-gene-flow. PhiloButterfly genome reveals promiscuous exsophical transactions of the Royal Society change of mimicry adaptations among of London Series B, Biological sciences 367: species. Nature . 451–460. [111] Lawrence J (1999) Selfish operons: the evolutionary impact of gene clustering in [102] Shapiro MD, Marks ME, Peichel CL, prokaryotes and eukaryotes. Current opinBlackman BK, Nereng KS, et al. (2004) ion in genetics & development 9: 642–648. Genetic and developmental basis of evolutionary pelvic reduction in threespine [112] Mandel MJ, Wollenberg MS, Stabb EV, sticklebacks. Nature 428: 717–723. Visick KL, Ruby EG (2009) A single regulatory gene is sufficient to alter bacterial [103] Colosimo PF, Hosemann KE, Balabhadra host range. Nature 457: 215–218. S, Villarreal G, Dickson M, et al. (2005) Widespread parallel evolution in sticklebacks by repeated fixation of Ectodysplasin [113] Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Ironing out the wrinkles in the alleles. Science; Science 307: 1928–1933. rare biosphere through improved otu clustering. Environmental Microbiology 12: [104] Jones FC, Grabherr MG, Chan YF, Rus1889–1898. sell P, Mauceli E, et al. (2012) The genomic basis of adaptive evolution in threespine [114] Haas BJ, Gevers D, Earl AM, Feldgarden sticklebacks. Nature 484: 55–61. M, Ward DV, et al. (2011) Chimeric 16s rrna sequence formation and detection in [105] Cornejo OE, McGee L, Rozen DE (2010) sanger and 454-pyrosequenced pcr ampliPolymorphic Competence Peptides Do Not cons. Genome Research 21: 494–504. Restrict Recombination in Streptococcus pneumoniae. Molecular Biology and Evo[115] Degnan P, Ochman H (2011) Illuminalution 27: 694–702. based analysis of microbial community diversity. ISME J 6: 183–194. [106] Cadillo-Quiroz H, Didelot X, Held NL, Herrera A, Darling A, et al. (2012) Patterns of Gene Flow Define Species of [116] Bent SJ, Forney LJ (2008) The tragedy of the uncommon: understanding limitations Thermophilic Archaea. PLoS Biology 10: in the analysis of microbial diversity. ISME e1001265. J 2: 689. [107] Rozen DE, de Visser JAGM, Gerrish PJ (2002) Fitness effects of fixed beneficial [117] Kuczynski J, Liu Z, Lozupone C, Mcdonald D, Fierer N, et al. (2010) Microbial commutations in microbial populations. Curr munity resemblance methods differ in their Biol 12: 1040–1045. ability to detect biologically relevant patterns. Nat Meth 7: 813. [108] Vos M (2011) A species concept for bacteria based on adaptive divergence. Trends [118] Lovell D, uller WM, Taylor J, Zwart A, Microbiol 19: 1–7. Helliwell C (2010) Caution! compositions! can constraints on omics data lead analyses [109] Nadeau NJ, Whibley A, Jones RT, Davey astray? CSIRO : 1–44. JW, Dasmahapatra KK, et al. (2012) Genomic islands of divergence in hybridizing Heliconius butterflies identified by large- [119] Jackson D (1997) Compositional data in community ecology: the paradigm or peril scale targeted sequencing. Philosophical of proportions? Ecology 78: 929–940. transactions of the Royal Society of London Series B, Biological sciences 367: 343– [120] Buccianti A, Mateu-Figueras G, 353. Pawlowsky-Glahn V, editors (2006) Compositional data analysis in the geosciences : from theory to practice. Number 24 117 in Special Publication. London, Geological Society, 224 pp. UK: [132] Martı́n-Fernández J, Barceló-Vidal C, Pawlowsky-Glahn V (2003) Dealing with zeros and missing values in compositional [121] Aitchison J (2003) The statistical analysis data sets using nonparametric imputation. of compositional data. Caldwell, New JerMath Geol 35: 253–278. sey, USA: Blackburn Press, 416 pp. [133] Aitchison J, Egozcue JJ (2005) Composi[122] Pearson K (1897) On a form of spurious tional data analysis: Where are we and correlation which may arise when indices where should we be heading? Math Geol are used in the measurement of organs. 37: 829–850. Proceedings of the Royal Society of Lon[134] Barcelo-Vidal C, Aguilar L, Martindon 60: 489–502. Fernandez JA (2011) Compositional [123] Aitchison J (2003) A concise guide to comVARIMA time series. In: Pawlowskypositional data analysis. In: 2nd CompoGlahn V, Buccianti A, editors, Compositional Data Analysis Workshop, Girona, sitional Data Analysis, Chichester, West Italy. Accessed 01/01/10. Sussex, UK: Wiley. pp. 87-103. [124] Pawlowsky-Glahn V, Buccianti A, editors [135] Schloss PD, Gevers D, Westcott SL (2011) (2011) Compositional data analysis : theReducing the effects of PCR amplification ory and applications. Chichester, West and sequencing artifacts on 16S rRNASussex, UK: Wiley, 400 pp. Based studies. PLoS ONE 6: e27310. [125] Aitchison J (1992) On criteria for measures [136] Chao A, Shen T (2003) Nonparametric esof compositional difference. Math Geol 24: timation of shannon’s index of diversity 365–379. when there are unseen species in sample. Environmental and Ecological Statis[126] Aitchison J (1981) A new approach to null tics 10: 429–443. correlations of proportions. Math Geol 13: 175–189. [137] Hausser J, Strimmer K (2009) Entropy inference and the james-stein estimator, with [127] Filzmoser P, Hron K (2009) Correlation application to nonlinear gene association analysis for compositional data. Mathenetworks. J Mach Learn Res 10: 1469– matical Geosciences 41: 905–919. 1484. [128] Consortium THMP (2012) A framework [138] Gelman A, Carlin J, Stern H, Rubin D for human microbiome research. Nature (2003) Bayesian data analysis. London, 486: 215–221. UK: Chapman and Hall/CRC press, 696 pp. [129] Jost L (2006) Entropy and diversity. Oikos 113: 363–375. [139] Woronow A, Butler J (1986) Complete subcompositional independence testing of [130] Agresti A, Hitchcock DB (2005) Bayesian closed arrays. Computers & Geosciences inference for categorical data analysis. Sta12: 267–279. tistical Methods and Applications 14: 297– 330. [140] Oliphant T (2007) Python for scientific computing. Computing in Science & En[131] Faith JJ, Driscoll ME, Fusaro VA, Cosgineering 9: 10–20. grove EJ, Hayete B, et al. (2008) Many microbe microarrays database: Uniformly [141] Hagberg AA, Schult DA, Swart PJ (2008) normalized affymetrix compendia with Exploring network structure, dynamics, structured experimental metadata. Nucleic and function using networkx. In: VaroAcids Res 36: D866–D870. quaux G, Vaught T, Millman J, editors, Proceedings of the 7th Python in Science Conference. Pasadena, CA USA, pp. 11 15. 118 [142] Hunter J (2007) Matplotlib: A 2d graph- [152] Wu M, Eisen JA (2008) A simple, fast, ics environment. Computing in Science & and accurate method of phylogenomic inEngineering 9: 90–95. ference. Genome Biology 9: R151. [143] Schloss PD, Westcott SL (2011) Assess- [153] Li H, Ruan J, Durbin R (2008) Maping and improving methods used in opping short DNA sequencing reads and callerational taxonomic unit-based approaches ing variants using mapping quality scores. for 16S rRNA gene sequence analysis. ApGenome research 18: 1851–1858. plied and Environmental Microbiology 77: [154] Oliphant T (2007) Python for scientific 3219–3226. computing. Computing in Science & En[144] May R (1999) Unanswered questions in gineering . ecology. Philosophical Transactions of the OpenOpt: Free Royal Society of London Series B: Biologi- [155] Kroshko D (2007–). scientific-engineering software for mathecal Sciences 354: 1951–1959. matical modeling and optimization. [145] Network TMCS (2012) What do we need to know about speciation? Trends in Ecology [156] Hunter J (2007) Matplotlib: A 2d graphics environment. Computing in Science & & Evolution 27: 27–39. Engineering . [146] Eisen JA (1998) Phylogenomics: improving functional predictions for uncharac- [157] Desai MM, Fisher DS (2011) The balance between mutators and nonmutators terized genes by evolutionary analysis. in asexual populations. Genetics 188: 997– Genome research 8: 163167. 1014. [147] Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, et al. (2012) Pop- [158] Galhardo RS, Hastings PJ, Rosenberg SM (2008). Mutation as a stress response and ulation genomics of early events in the ecothe regulation of evolvability. logical differentiation of bacteria. science 336: 4851. [159] Hubbell SP (2008) The unified neutral theory of biodiversity and biogeography [148] Tettelin H, Masignani V, Cieslewicz M, Do(MPB-32), volume 32. Princeton Univernati C, Medini D, et al. (2005) Genome sity Press. analysis of multiple pathogenic isolates of streptococcus agalactiae: implications for the microbial ”pan-genome”. Proceedings of the National Academy of Sciences of the United States of America 102: 13950–5. [149] Morowitz MJ, Denef VJ, Costello EK, Thomas BC, Poroyko V, et al. (2011) Strain-resolved community genomic analysis of gut microbial colonization in a premature infant. Proceedings of the National Academy of Sciences 108: 1128–1133. [150] Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, et al. (2013) Genomic variation landscape of the human gut microbiome. Nature 493: 45–50. [151] Johnson J, Omland K (2004) Model selection in ecology and evolution. Trends in Ecology \& Evolution 19: 101–108. 119